Using Triton Inference Server

뿅삥 2023. 9. 24. 23:01

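What is Triton Inference Server?

Triton Inference Server is open-source software for deploying deep learning models and serving inference. It simplifies model deployment and management, and clients can request inference from a variety of platforms and languages.

Triton Inference Server supports many frameworks, including PyTorch, ONNX, and TensorRT. This post walks through a simple example of serving a model with Triton's Python backend.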

1. Preparing the Docker image

Create a Dockerfile. Docker must already be installed on the server.

FROM nvcr.io/nvidia/tritonserver:23.08-py3

Put the line above in the Dockerfile. This uses version 23.08 of the Triton Docker image.

Once the Dockerfile is written, build the Triton server image by running the following command in a terminal.

sudo docker build -t triton_test .

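2. Preparing the model repository

Prepare the model inference code and the config file according to the layout Triton expects.

<model-name> : a folder named after your model. ex) my_test

<version> : a folder inside <model-name>, named with an integer version number such as 1, 2, 3. ex) 1

<model-definition-file> : the model inference code, placed inside the <version> folder.

config.pbtxt : the model configuration, placed directly inside the <model-name> folder.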
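
The layout described above can be created with a few lines of Python. This is only a minimal sketch: the helper name make_model_skeleton and the empty placeholder files are assumptions for illustration, and the file name model.py is the one used later in this post for the Python backend.

```python
from pathlib import Path


def make_model_skeleton(root, model_name, version=1):
    """Create <root>/<model_name>/config.pbtxt and
    <root>/<model_name>/<version>/model.py as empty files.

    Hypothetical helper for illustration only.
    """
    model_dir = Path(root) / model_name
    version_dir = model_dir / str(version)
    # Creates the model folder and the version folder in one call.
    version_dir.mkdir(parents=True, exist_ok=True)

    config_path = model_dir / "config.pbtxt"
    model_path = version_dir / "model.py"
    config_path.touch()
    model_path.touch()
    return config_path, model_path
```

Calling make_model_skeleton("repository", "my_test") would give repository/my_test/config.pbtxt and repository/my_test/1/model.py, ready to be filled in with the contents shown in the next section.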

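3. Model inference code

The official GitHub repository has many example models. Use them as a reference when writing the inference code for your own model.

Write your ML inference logic in a function like model_inference below, and call it from execute so that it runs whenever a request arrives.

Contents of model.py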

import json

import triton_python_backend_utils as pb_utils

## User-defined function
def model_inference(x):
    y = x+1
    return y
    
## https://github.com/triton-inference-server/python_backend/blob/main/examples/pytorch/model.py
class TritonPythonModel:
    def initialize(self, args):
        self.model_config = model_config = json.loads(args['model_config'])

        # Get OUTPUT0 configuration
        output0_config = pb_utils.get_output_config_by_name(
            model_config, "OUTPUT__0")

        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config['data_type'])

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model, must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse
        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest
        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """

        output0_dtype = self.output0_dtype

        responses = []

        # Every Python backend must iterate over everyone of the requests
        # and create a pb_utils.InferenceResponse for each of them.
        for request in requests:
            # Get INPUT__0
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT__0").as_numpy()
            model_input = in_0

            # model pred
            model_output = model_inference(model_input)
            out_0 = model_output

            # Create output tensors. You need pb_utils.Tensor
            # objects to create pb_utils.InferenceResponse.
            out_tensor_0 = pb_utils.Tensor("OUTPUT__0",
                                           out_0.astype(output0_dtype))

            # Create InferenceResponse. You can set an error here in case
            # there was a problem with handling this inference request.
            # Below is an example of how you can set errors in inference
            # response:
            #
            # pb_utils.InferenceResponse(
            #    output_tensors=..., error=pb_utils.TritonError("An error occurred"))
            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0])
            responses.append(inference_response)

        # You should return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def finalize(self):
        print('Cleaning up...')

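The config file should likewise be written to fit the model you are serving, following the official documentation.

It specifies the data type and shape of each input and output, the batch size, and whether inference runs on CPU or GPU.

Contents of config.pbtxt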

name: "my_test"
backend: "python"

input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT32
    dims: [ -1, 1 ]
  }
]

output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_INT32
    dims: [ -1, 1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
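
One detail that is easy to trip over: config.pbtxt writes data types with a TYPE_ prefix (TYPE_INT32), while the client code later in this post passes the name without it ('INT32'). The sketch below shows that relationship. The helper name config_dtype_to_client_dtype is made up for illustration and is not part of any Triton library.

```python
def config_dtype_to_client_dtype(config_dtype):
    """Map a config.pbtxt data_type (e.g. "TYPE_INT32") to the
    datatype string used in client requests (e.g. "INT32").

    Hypothetical helper for illustration only.
    """
    if config_dtype == "TYPE_STRING":
        # Strings are the one case where more than the prefix changes.
        return "BYTES"
    if not config_dtype.startswith("TYPE_"):
        raise ValueError(f"unexpected data_type: {config_dtype!r}")
    return config_dtype[len("TYPE_"):]
```

For the config above, config_dtype_to_client_dtype("TYPE_INT32") gives the "INT32" string that the client passes to InferInput.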

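With both files in place, the "my_test" model is laid out as follows: config.pbtxt sits directly under the my_test folder, and model.py sits under my_test/1.

4. Running the server

Start the server by running the following in a terminal.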

sudo docker run \
    --name triton_my_test \
    -v /home/myworkspace/repository:/models \
    triton_test:latest tritonserver \
    --model-repository=/models \
    --log-verbose=1

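Pass the path of the repository created above to -v. Here /home/myworkspace/repository on the host is mounted at /models in the container. Note that this command publishes no ports, so add -p options (for example -p 8001:8001 for gRPC) if the client will connect through the host's address.

If the server starts correctly, the log ends with messages reporting that the gRPC service (port 8001), the HTTP service (port 8000), and the metrics service (port 8002) have started.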
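
Besides reading the log, you can ask the server directly whether it is ready. Triton's HTTP endpoint answers GET /v2/health/ready with status 200 once the server can accept requests. The following is a minimal sketch using only the standard library; the helper name is_server_ready is made up here, and it assumes the HTTP port (8000 by default) is reachable from where you run it.

```python
import urllib.error
import urllib.request


def is_server_ready(base_url, timeout=2.0):
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200.

    Hypothetical helper for illustration only.
    """
    url = f"{base_url}/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False
```

A call such as is_server_ready("http://localhost:8000") would return False while the container is still starting or if the port is not published.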

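5. Client code

The model server is now ready to return inference results when it receives a request. Write the client code to fit your model, following the official documentation.

Creating the gRPC client object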

import tritonclient.grpc as grpcclient
my_triton_server_url = "xxx.xxx.xxx.xxx:8001"
client = grpcclient.InferenceServerClient(url=my_triton_server_url)

Creating dummy input data

import numpy as np

input_data = np.array([[4019]],dtype='int32')

inputs = []
# grpcclient.InferInput(name, shape, datatype)
inputs.append(grpcclient.InferInput("INPUT__0", input_data.shape, 'INT32'))
inputs[0].set_data_from_numpy(input_data)

This builds an array holding the single value 4019. The name, shape, and type must match what is written in config.pbtxt.
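
To make "the shape must match" concrete: in the config above, dims is [ -1, 1 ], where -1 means that axis can be any size. The sketch below checks an array shape against such a dims list before sending a request. The helper name shape_matches_dims is made up for illustration, and it assumes, as in this post's config, that max_batch_size is not set, so dims describes the full shape.

```python
def shape_matches_dims(shape, dims):
    """Return True if an array shape is compatible with config.pbtxt dims.

    A dims entry of -1 accepts any size on that axis.
    Hypothetical helper for illustration only.
    """
    if len(shape) != len(dims):
        return False
    return all(d == -1 or d == s for s, d in zip(shape, dims))
```

With this, shape_matches_dims(input_data.shape, [-1, 1]) is True for the (1, 1) array above, while an array of shape (1, 2) would be rejected.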

Specifying the output

outputs = []
# grpcclient.InferRequestedOutput(name, class_count=0)
outputs.append(grpcclient.InferRequestedOutput("OUTPUT__0"))

The name of the requested output must also match the one in config.pbtxt.

Now send the input data to Triton Inference Server for inference.

result = client.infer(model_name="my_test",
                      inputs=inputs,
                      model_version="1",
                      outputs=outputs)

print(result.as_numpy("OUTPUT__0"))
# prints: [[4020]]

The model's output can be read from result.