
Inference worse with onnxruntime-gpu than native pytorch for seq2seq model #404

Description

@Matthieu-Tinycoaching

System Info

Optimum: 1.4.1.dev0
torch: 1.12.1+cu116
onnx: 1.12.0
onnxruntime-gpu: 1.12.1
python: 3.8.13
CUDA: 11.6
cudnn: 8.4.1
GPU: RTX 3090

Who can help?

@JingyaHuang @echarlaix

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I compared GPU inference of a native torch Helsinki-NLP/opus-mt-fr-en model against the corresponding ONNX model optimized with the Optimum library. To do so, I defined a FastAPI microservice based on the two classes below, for torch and optimized ONNX on GPU respectively:

from pathlib import Path
from typing import Optional

from fastapi import Depends, FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, MarianMTModel, MarianTokenizer

from optimum.onnxruntime import ORTModelForSeq2SeqLM


# Request/response schemas, reconstructed from their usage below (not shown in the original snippet)
class PredictionInput(BaseModel):
    text: str

class PredictionOutput(BaseModel):
    translated_text: str


class Seq2SeqModel:
    tokenizer: Optional[MarianTokenizer]
    model: Optional[MarianMTModel]

    def load_model(self):
        """Loads the model"""
        # model_id="Helsinki-NLP/opus-mt-fr-en"
        model_path = Path("./app/artifacts/HF")
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to("cuda")
        self.tokenizer = tokenizer
        self.model = model

    async def predict(self, input: PredictionInput) -> PredictionOutput:
        """Runs a prediction"""
        if not self.tokenizer or not self.model:
            raise RuntimeError("Model is not loaded")
        tokens = self.tokenizer(input.text, return_tensors="pt").to("cuda")
        translated = self.model.generate(**tokens, num_beams=beam_size)
        return PredictionOutput(translated_text=self.tokenizer.decode(translated[0], skip_special_tokens=True))

class OnnxOptimizedSeq2SeqModel:
    tokenizer: Optional[MarianTokenizer]
    model: Optional[ORTModelForSeq2SeqLM]

    def load_model(self):
        """Loads the model"""
        # model_id="Helsinki-NLP/opus-mt-fr-en"
        onnx_path = Path("./app/artifacts/OL_1")
        tokenizer = AutoTokenizer.from_pretrained(onnx_path)
        optimized_model = ORTModelForSeq2SeqLM.from_pretrained(
            onnx_path,
            encoder_file_name="encoder_model_optimized.onnx",
            decoder_file_name="decoder_model_optimized.onnx",
            decoder_file_with_past_name="decoder_with_past_model_optimized.onnx",
            provider="CUDAExecutionProvider"
        )
        self.tokenizer = tokenizer
        self.model = optimized_model
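
    # The original snippet omits predict() for this class, although the
    # /prediction_onnx_optimized endpoint depends on it; the version below is an
    # assumed mirror of Seq2SeqModel.predict. Tokens are kept on CPU here, since
    # ORTModelForSeq2SeqLM converts the inputs internally.
    async def predict(self, input: PredictionInput) -> PredictionOutput:
        """Runs a prediction with the ONNX Runtime model"""
        if not self.tokenizer or not self.model:
            raise RuntimeError("Model is not loaded")
        tokens = self.tokenizer(input.text, return_tensors="pt")
        translated = self.model.generate(**tokens, num_beams=beam_size)
        return PredictionOutput(translated_text=self.tokenizer.decode(translated[0], skip_special_tokens=True))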

app = FastAPI()
seq2seq_model = Seq2SeqModel()
onnx_optimized_seq2seq_model = OnnxOptimizedSeq2SeqModel()
beam_size = 3

@app.on_event("startup")
async def startup():
    seq2seq_model.load_model()
    onnx_optimized_seq2seq_model.load_model()

@app.post("/prediction")
async def prediction(
    output: PredictionOutput = Depends(seq2seq_model.predict),
) -> PredictionOutput:
    return output

@app.post("/prediction_onnx_optimized")
async def prediction(
    output: PredictionOutput = Depends(onnx_optimized_seq2seq_model.predict),
) -> PredictionOutput:
    return output
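
For completeness, here is a minimal standalone timing sketch, outside FastAPI and the load-testing tool, to compare generate() latency directly. It assumes the same local paths and optimized file names as above; the test sentence and number of runs are arbitrary:

import time
from pathlib import Path

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

SENTENCE = "Bonjour, comment allez-vous aujourd'hui ?"
N_RUNS = 20
beam_size = 3

def mean_latency_ms(generate_fn, tokens):
    generate_fn(**tokens, num_beams=beam_size)  # warm-up run
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(N_RUNS):
        generate_fn(**tokens, num_beams=beam_size)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / N_RUNS * 1000

tokenizer = AutoTokenizer.from_pretrained(Path("./app/artifacts/HF"))

# Native torch model on GPU
torch_model = AutoModelForSeq2SeqLM.from_pretrained(Path("./app/artifacts/HF")).to("cuda")
torch_tokens = tokenizer(SENTENCE, return_tensors="pt").to("cuda")
with torch.inference_mode():
    print(f"torch (cuda): {mean_latency_ms(torch_model.generate, torch_tokens):.1f} ms")

# Optimized ONNX model with CUDAExecutionProvider
ort_model = ORTModelForSeq2SeqLM.from_pretrained(
    Path("./app/artifacts/OL_1"),
    encoder_file_name="encoder_model_optimized.onnx",
    decoder_file_name="decoder_model_optimized.onnx",
    decoder_file_with_past_name="decoder_with_past_model_optimized.onnx",
    provider="CUDAExecutionProvider",
)
ort_tokens = tokenizer(SENTENCE, return_tensors="pt")  # kept on CPU; ORTModel handles conversion
print(f"onnxruntime (CUDA): {mean_latency_ms(ort_model.generate, ort_tokens):.1f} ms")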

Expected behavior

When load testing the model on my local computer, I was surprised by two things:

  1. The GPU performance of the optimized ONNX model is worse than that of the native torch model (maybe linked to Inference performance drop 22X on GPU hardware with optimum[onnxruntime-gpu] (compared with transformer) #365 and Optimize ONNX model based on encoder-decoder #396?):

(screenshots: GPU_optimized_onnxruntime, GPU_torch)

  2. When running this FastAPI service in a Docker image, I got the following warning:

2022-09-28 08:20:21.214094612 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:566 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.

Does this mean the CUDAExecutionProvider is not actually being used, even though I set it explicitly here?

        optimized_model = ORTModelForSeq2SeqLM.from_pretrained(
            onnx_path,
            encoder_file_name="encoder_model_optimized.onnx",
            decoder_file_name="decoder_model_optimized.onnx",
            decoder_file_with_past_name="decoder_with_past_model_optimized.onnx",
            provider="CUDAExecutionProvider"
        )

What could have caused that? I saw at https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html that CUDA 11.6 is not mentioned; could that be the reason?
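
As a sanity check (assuming the file layout above), the snippet below asks onnxruntime directly which providers the installed build offers and which providers a session actually ends up with:

import onnxruntime as ort

# Providers the installed onnxruntime-gpu build can offer at all
print(ort.get_available_providers())

# Providers a session actually gets after creation; if CUDAExecutionProvider
# cannot be created, the session silently falls back to CPUExecutionProvider
sess = ort.InferenceSession(
    "./app/artifacts/OL_1/encoder_model_optimized.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())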


Labels

bug (Something isn't working), inference (Related to Inference), onnxruntime (Related to ONNX Runtime)
