CN113342538A - Inference engine design method for improving GPU (graphics processing unit) computation throughput by separating script and model


Info

Publication number: CN113342538A
Application number: CN202110894802.1A
Authority: CN (China)
Prior art keywords: gpu, processing, cpu, script, model
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 唐伟鹏, 吴小炎, 吴名朝
Applicant / Assignee: Whale Cloud Technology Co Ltd
Priority date / Filing date: 2021-08-05
Publication date: 2021-09-03

Classifications

    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU] (G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F9/00 Arrangements for program control; G06F9/06 using stored programs; G06F9/46 Multiprogramming arrangements)
    • G06F9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/5016 Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals, the resource being the memory
    • G06F9/5083 Techniques for rebalancing the load in a distributed system


Abstract

The invention discloses an inference engine design method that improves GPU computational throughput by separating the script from the model, comprising the following steps: splitting and abstracting the CPU-processing and GPU-processing logic into modules; serializing the data transmitted between CPU processing and GPU processing; containerized inter-process communication; containerized multi-instance deployment of the modules; reverse proxy and load balancing; and adjusting the number of instances in real time through elastic scaling. Beneficial effects: by abstracting and decoupling CPU processing and GPU processing, reuse is increased and the serialization problem is solved; CPU preprocessing and GPU neural-network computation become truly independent distributed computations; and containerized instances can be configured in different ratios for different models according to the actual environment, for example 20 CPU containers paired with 4 GPU containers, so that GPU resources are fully utilized and GPU throughput is improved.

Description

Inference engine design method for improving GPU (graphics processing unit) computation throughput by separating script and model
Technical Field
The invention relates to the technical field of GPUs (graphics processing units), in particular to a method for designing an inference engine for improving GPU computing throughput by separating scripts and models.
Background
Nowadays, a single AI capability usually contains several algorithms that depend on and exchange data with one another; for example, the output of algorithm A is used as the input of algorithm B. Moreover, each algorithm typically involves a number of pre- and post-processing steps on the data, such as picture size normalization.
At present, because the script and the model are not separated, CPU computation and GPU computation are bound together into a single whole; since the computing capacity of the CPU is far lower than that of the GPU, GPU throughput becomes very low.
For example, when an enterprise undergoes digital transformation it inevitably faces AI scenarios and needs AI applications and AI capabilities. Putting a real AI capability into production requires calling that capability, and the call is usually exposed as an API through an AI capability open platform. Such platforms host video- and image-type capabilities, which demand hardware-accelerated computing resources; this demand is usually met with GPUs, which are expensive computing resources.
Because data preprocessing between algorithms is usually CPU-based while the visual matrix neural networks rely on GPU computation, the two must communicate. When they are not separated, CPU computation and GPU computation run serially, and under such a mechanism GPU throughput drops. The problem is particularly severe when one AI capability contains multiple interdependent algorithms; GPU throughput is then typically below 40%.
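As a rough illustration of why coupled execution caps throughput, a few lines of arithmetic (the 60 ms / 20 ms per-request timings below are assumed for illustration and are not taken from the patent):
import math

# Illustrative only: assumed per-request timings, not measurements from the patent.
cpu_ms, gpu_ms = 60.0, 20.0                            # CPU preprocessing vs GPU inference time

# Coupled (serial) execution: the GPU sits idle while the CPU preprocesses.
serial_gpu_utilization = gpu_ms / (cpu_ms + gpu_ms)    # 20 / 80 = 0.25, i.e. ~25% busy

# Decoupled execution: enough CPU instances can keep one GPU instance saturated.
cpu_instances_per_gpu = math.ceil(cpu_ms / gpu_ms)     # 3 CPU containers feed 1 GPU container

print(f"serial GPU utilization ~{serial_gpu_utilization:.0%}, "
      f"decoupled needs ~{cpu_instances_per_gpu} CPU instances per GPU")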
In the prior art, the model and the script are not separated. As shown in Fig. 2, while the script performs operations such as data preprocessing, the GPU logic sits idle; since CPU processing usually takes longer than GPU processing, this causes great waste. As another example, a face comparison algorithm generally includes face detection, face verification, face alignment and face recognition, and its actual flow is shown in Fig. 3.
Therefore, to address this phenomenon, the script (CPU preprocessing) and the model (GPU neural-network computation) can be separated and decoupled from each other, so that CPU preprocessing and GPU neural-network computation truly become independent distributed computations, the mutually dependent serial behaviour is eliminated, and GPU throughput is improved.
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
In view of the problems in the related art, the invention provides an inference engine design method that improves GPU computing throughput by separating the script and the model, so as to overcome the technical problems in the related prior art.
Therefore, the invention adopts the following specific technical scheme:
the inference engine design method for improving GPU computational throughput by separating scripts and models comprises the following steps:
carrying out module splitting and abstraction on the logics of CPU processing and GPU processing;
serializing data transmitted between CPU processing and GPU processing;
containerized process communication;
modular containerized multi-instance deployments;
reverse proxy and load balancing;
the number of instances is adjusted in real time using elastic expansion.
Further, the module splitting and abstracting of the logic of the CPU processing and the GPU processing includes the following steps:
aiming at the logics of the CPU processing and the GPU processing, module splitting and abstraction are carried out according to a splitting principle, and the coupled logic is separated and decoupled;
performing second abstraction on the CPU processing;
the GPU processing is abstracted a second time.
Further, the splitting principle is a modularization, atom and multiplexing principle.
Furthermore, the containerized process communication method is to use a remote process call to transmit serialized data to realize data communication between different processes.
Further, the module containerized multi-instance deployment includes the steps of:
a container platform is adopted, and the script and the model are separately deployed based on containerization multi-copy deployment;
deploying the number of instances in a corresponding proportion according to the difference proportion of the computing forces between the CPU and the GPU script logic mutual calling module;
the controller automatically maintains the corresponding instance.
Further, the reverse proxy and load balancing comprises the steps of:
carrying out reverse proxy on multiple instances of the same module and distributing a request;
the reverse proxy server uses different load balancing strategies to dynamically distribute the requests according to the performance of the instance nodes, so as to achieve the optimal performance of the service nodes.
Further, the real-time adjustment of the number of instances by using elastic expansion and contraction comprises the following steps:
monitoring the indexes of the examples by utilizing the elastic telescopic function of the container platform;
setting specific comparison threshold values and jitter duration;
and the number of copies is adjusted in real time, so that the blockage caused by insufficient instances and the resource waste caused by idle instances are avoided.
Further, the index of the example includes the number of calls, the CPU usage and the memory usage.
The invention has the beneficial effects that: through abstracting and decoupling CPU processing and GPU processing, multiplexing is increased, the problem of serialization is solved, independent distributed computing can be achieved through CPU preprocessing and GPU neural network computing, containerization examples with different proportions are configured according to actual environments and different models, all modules can be multiplexed and used in parallel, for example, 20 CPU containers are matched with 4 GPU containers, GPU resources are fully utilized, GPU throughput is improved, and efficiency and convenience are improved through data transmission by using a gPC.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow diagram of an inference engine design method that separates the script and the model to improve GPU computational throughput, according to an embodiment of the present invention;
FIG. 2 is a logic flow diagram of an existing algorithm capability in which the model and the script are not separated;
FIG. 3 is a flow diagram of a face comparison algorithm, as an existing algorithm capability in which the model and the script are not separated;
FIG. 4 is a flowchart of the overall system call process in the inference engine design method for improving GPU computational throughput by separating the script and the model, according to an embodiment of the present invention;
FIG. 5 is a structural diagram of the module splitting and abstraction of the CPU-processing and GPU-processing logic in the inference engine design method for improving GPU computational throughput by separating the script and the model, according to an embodiment of the present invention;
FIG. 6 is a gRPC communication architecture diagram in the inference engine design method for improving GPU computational throughput by separating the script and the model, according to an embodiment of the present invention;
FIG. 7 is a diagram of the multi-container replica deployment structure in the inference engine design method for improving GPU computational throughput by separating the script and the model, according to an embodiment of the present invention.
Detailed Description
For further explanation of the various embodiments, reference is made to the accompanying drawings, which form a part of the disclosure. The drawings illustrate embodiments and, together with the description, serve to explain their principles of operation and to enable others of ordinary skill in the art to understand the embodiments and the advantages of the invention. The figures are not to scale, and like reference numerals generally refer to like elements.
According to an embodiment of the invention, an inference engine design method for improving GPU computing throughput by separating the script and the model is provided.
The invention is further explained below with reference to the drawings and the detailed description. As shown in Fig. 1, in an embodiment of the invention, the inference engine design method for improving GPU computation throughput by separating the script and the model comprises the following steps:
s1, carrying out module splitting and abstraction on the logic of CPU processing and GPU processing;
specifically, as shown in fig. 5, the module splitting and abstracting the logic of the CPU processing and the GPU processing includes the following steps:
s11, carrying out module splitting and abstraction according to a splitting principle aiming at the logics of the CPU processing and the GPU processing, and separating and decoupling the coupled logic;
s12, performing second abstraction on the CPU processing;
s13, abstract the GPU processing a second time (re-abstract is needed because many operations in the CPU processing logic are duplicated and multiple different inference capabilities may use the same operations).
The splitting principle is a modularization, atom and multiplexing principle.
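As a minimal sketch of what such a split can look like (the class and function names below are hypothetical and not taken from the patent code), the CPU-side preprocessing and the GPU-side model call are placed behind separate, reusable interfaces:
from abc import ABC, abstractmethod

import cv2
import numpy as np

class CpuPreprocessor(ABC):
    """CPU-bound logic (decoding, resizing, normalization), reusable across capabilities."""
    @abstractmethod
    def run(self, raw: bytes) -> np.ndarray: ...

class GpuModelService(ABC):
    """GPU-bound logic: one neural-network forward pass, exposed as a remote service."""
    @abstractmethod
    def call_model(self, model_type: str, data: np.ndarray): ...

class ResizeNormalize(CpuPreprocessor):
    """Example atomic CPU module: the picture size normalization mentioned above."""
    def run(self, raw: bytes) -> np.ndarray:
        img = cv2.imdecode(np.frombuffer(raw, np.uint8), cv2.IMREAD_COLOR)
        img = cv2.resize(img, (224, 224))          # size normalization on the CPU
        return img.astype(np.float32) / 255.0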
S2, serializing the data transmitted between CPU processing and GPU processing;
Once CPU processing and GPU processing are separated, data must be transferred between different processes and even different machines, so the data have to be serialized. Considering that the input and output parameters of a model are mostly multi-dimensional numpy matrices with a relatively large data volume, a tool such as pickle, which has advantages over plain files in both storage format and read speed (it is more efficient), can be used for the conversion.
S3, containerized process communication;
Specifically, as shown in Fig. 6, containerized process communication uses remote procedure calls (gRPC) to transfer the serialized data, realizing data communication between different processes. That is, because CPU processing and GPU processing are separated, different processes and even different machines have to communicate with each other. gRPC is a high-performance, open-source, general-purpose RPC framework designed for mobile and HTTP/2. Because it is built on the HTTP/2 standard, it brings features such as bidirectional streaming, flow control, header compression and multiplexing of many requests over a single TCP connection. These characteristics make it perform better on mobile devices and save power and space.
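A hedged client-side sketch of such a call is shown below. The proto service and the generated modules inference_pb2 / inference_pb2_grpc are assumed names used for illustration only; the patent does not publish its actual interface definition.
# Assumed to be generated by protoc from a .proto roughly like:
#   service ModelService { rpc CallModel (ModelRequest) returns (ModelReply); }
#   message ModelRequest { string model_type = 1; bytes payload = 2; }
#   message ModelReply   { bytes payload = 1; }
import pickle

import grpc
import numpy as np

import inference_pb2          # hypothetical generated module
import inference_pb2_grpc     # hypothetical generated module

def call_gpu_model(target: str, model_type: str, matrix: np.ndarray):
    """Send a pickled numpy matrix to a GPU model container and return its result."""
    with grpc.insecure_channel(target) as channel:
        stub = inference_pb2_grpc.ModelServiceStub(channel)
        request = inference_pb2.ModelRequest(
            model_type=model_type,
            payload=pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL))
        reply = stub.CallModel(request, timeout=30)
        return pickle.loads(reply.payload)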
S4, containerized multi-instance deployment of the modules;
Specifically, as shown in Fig. 7, the containerized multi-instance deployment of the modules comprises the following steps:
S41, because the computing power of CPUs and GPUs is unequal, the inference loads differ, and the same piece of logic is often reused, a container platform is adopted, deployment is based on containerized multi-copy deployment, and the script and the model are deployed separately;
S42, deploying instance counts in a ratio that matches the difference in computing power between the mutually calling CPU-script and GPU modules;
S43, letting the controller automatically maintain the corresponding instances (for example, automatically restarting an instance when it fails).
S5, reverse proxy and load balancing;
Specifically, the reverse proxy and load balancing comprise the following steps:
S51, reverse-proxying the multiple instances of the same module and distributing requests (a call to a module no longer has to target a specific instance; it only needs to reach the corresponding reverse proxy server, which simplifies the client-side calling logic);
S52, having the reverse proxy server dynamically distribute requests according to the performance of the instance nodes, using different load-balancing strategies, to achieve the optimal performance of the service nodes.
In an actual environment this requirement is met by the Service of the container platform: when calling the same module, the client does not need to know each specific instance; it only needs to request the corresponding Service, which then automatically dispatches the request to a back-end instance according to the configured load-balancing strategy (round robin, specified weights, hashing, and so on).
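The three strategies named above can be sketched in a few lines (illustrative only; the backend addresses and weights are assumed values, and in practice the container platform's Service or the reverse proxy implements this):
import hashlib
import itertools
import random

backends = ["worker-0:50051", "worker-1:50051", "worker-2:50051"]
weights  = [3, 1, 1]                         # assumed relative node capacities

_rr = itertools.cycle(backends)

def round_robin() -> str:
    return next(_rr)                         # polling: each instance in turn

def weighted() -> str:
    return random.choices(backends, weights=weights, k=1)[0]   # favour stronger nodes

def by_hash(key: str) -> str:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return backends[h % len(backends)]       # the same key always hits the same instance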
S6, adjusting the number of instances in real time through elastic scaling.
Specifically, adjusting the number of instances in real time through elastic scaling comprises the following steps:
S61, monitoring the instance metrics using the elastic scaling function of the container platform;
S62, setting specific comparison thresholds and an anti-jitter duration;
S63, adjusting the number of replicas in real time, avoiding both blocking caused by too few instances and resource waste caused by idle instances.
The instance metrics include the number of calls, CPU usage and memory usage.
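A simplified sketch of the scaling decision in steps S61 to S63 (the threshold, anti-jitter window and replica bounds are assumed values; a real container platform's autoscaler performs this automatically):
from typing import Optional

TARGET_CPU = 0.70            # comparison threshold on average CPU usage
JITTER_SECONDS = 120         # metric must stay out of band this long before acting
MIN_REPLICAS, MAX_REPLICAS = 2, 20

def decide(replicas: int, cpu_usage: float,
           breach_since: Optional[float], now: float):
    """Return (new_replica_count, breach_since) for one evaluation tick."""
    over, under = cpu_usage > TARGET_CPU, cpu_usage < TARGET_CPU * 0.5
    if not (over or under):
        return replicas, None                # inside the band: keep the replica count
    if breach_since is None:
        return replicas, now                 # start the anti-jitter timer
    if now - breach_since < JITTER_SECONDS:
        return replicas, breach_since        # ignore transient spikes
    step = 1 if over else -1
    return max(MIN_REPLICAS, min(MAX_REPLICAS, replicas + step)), None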
In summary, combining the above six steps realizes the separation and independent operation of CPU processing and GPU processing, and thereby improves GPU throughput. As shown in Fig. 4, the overall call flow of the system is as follows:
First, the inference service receives the service call;
Second, CPU processing, for example decoding the base64-encoded image and performing picture size normalization;
Third, GPU processing (RNN, CNN, PNN): the processed image matrix is passed through the engine API to the GPU neural-network computation stage, and the inference result is output; this stage contains several rounds of neural-network computation, such as image detection, image alignment and image object recognition;
Fourth, CPU post-processing: the result of each single neural-network computation in the GPU inference process is packaged into a standardized form;
Fifth, result integration: the multiple standardized results from the CPU post-processing are merged and output to the caller as a whole.
The following are concrete conventions and an example:
1) Script structure
(1) Entry function
Take demo_face_test as an example. A face_test package exists in the code directory; this package is the entry point for calling the whole model capability, and the script combines several algorithms to produce the final result.
Its entry .py file contains a method predict:
# Imports needed by this entry function; getLogger, parse_image_file, face_detect_api,
# check_face, FaceStatus, face_align, face_feature_api and matrix_distance are helpers
# provided by the capability's own packages.
import json
import threading
import time
import traceback

import numpy as np

def predict(params=None):
    start = time.time()
    time.sleep(10)
    logger = getLogger('video')
    logger.info("%s predict start at %s" % (threading.currentThread().getName(), start))
    ret = {'data': {}, 'code': 0, 'message': 'success'}
    # time.sleep(1)
    # return json.dumps(ret)
    try:
        # read picture
        if params and 'image' in params:
            image = params['image']
            img = parse_image_file(image, 'cv')
        else:
            raise RuntimeError('`image` not in params.')
        # img = cv2.imread(get_model_own(__file__, 'image.jpg'))
        rand = np.random.randint(-5, 6, size=img.shape)
        img_new = img + rand
        img_new = img_new.astype(np.uint8)
        # face detection (GPU)
        start2 = time.time()
        logger.info("%s face_detect_api start at %s" % (threading.currentThread().getName(), start2))
        face_out = face_detect_api(img_new)
        end2 = time.time()
        logger.info("%s face_detect_api finish at %s, cost totally %s ms." % (
            threading.currentThread().getName(), end2, round((end2 - start2) * 1000)))
        logger.info('%s shape of face_detect_api: %s ' % (threading.currentThread().getName(), len(face_out)))
        ret['data'] = {'len': len(face_out)}
        # ret['data'] = '1'
        # face verification (GPU)
        box_list = []
        landmarks_list = []
        start3 = time.time()
        logger.info("%s check_face start at %s" % (threading.currentThread().getName(), start3))
        for i in range(len(face_out)):
            x_min = face_out[i]['ext_boxes'][0]
            x_max = face_out[i]['ext_boxes'][2]
            y_min = face_out[i]['ext_boxes'][1]
            y_max = face_out[i]['ext_boxes'][3]
            if check_face('video', img, face_out[i]) == FaceStatus.SUCCESS:
                box_list.append([x_min, y_min, x_max, y_max])
                landmarks_list.append(face_out[i]['landmarks'])
        end3 = time.time()
        logger.info("%s check_face finish at %s, cost totally %s ms." % (
            threading.currentThread().getName(), end3, round((end3 - start3) * 1000)))
        # face alignment
        aligned_list = []
        for i, j in zip(box_list, landmarks_list):
            aligned_list.append(face_align(img, [i], [j]))
        aligneds = np.concatenate(aligned_list)
        # face recognition (GPU)
        start4 = time.time()
        logger.info("%s face_api2 start at %s" % (threading.currentThread().getName(), start4))
        face_feature_output = face_feature_api(aligneds)
        end4 = time.time()
        logger.info("%s face_api2 finish at %s, cost totally %s ms." % (
            threading.currentThread().getName(), end4, round((end4 - start4) * 1000)))
        # gather results
        face_feature_output = np.array([item['feature_value'] for item in face_feature_output])
        ret['data']['num'] = len(face_feature_output)
        time1 = time.time()
        logger.info("%s matrix_distance start at %s" % (threading.currentThread().getName(), time1))
        if len(face_feature_output) > 0:
            face_num = 10000
            faces_new = face_feature_output.repeat(face_num // face_feature_output.shape[0], 0)
            matches = matrix_distance(face_feature_output, faces_new) < 0.5
            logger.info('num of matches:%s' % np.sum(matches))
        time2 = time.time()
        logger.info("%s matrix_distance finish at %s, cost totally %s ms." % (
            threading.currentThread().getName(), time2, round((time2 - time1) * 1000)))
    except Exception as e:
        traceback.print_exc()
        ret['code'] = -1
        ret['message'] = str(e)
    return json.dumps(ret)
The data returned by this method is a json string.
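A hypothetical call of this entry function might look as follows (whether parse_image_file expects base64-encoded content is assumed here, based on the base64 decoding mentioned in the call flow above):
import base64
import json

with open("face.jpg", "rb") as f:
    params = {"image": base64.b64encode(f.read()).decode()}

result = json.loads(predict(params))
print(result["code"], result["message"], result["data"])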
(2) Algorithm structure
Take one of its algorithm implementation packages, face_detection, as an example.
Its model_service module (the entrypoint configured in the model configuration below) is where the model is loaded and an inference instance is provided. It exposes a get_instance() method that returns the unique instance of the inference object; this is the function through which the gRPC service obtains the model-call instance:
def get_instance(gpu_device_id):
    return MtcnnModelCaller.instance(gpu_device_id)
Taking MtcnnModelCaller.instance(gpu_device_id), shown in the figure, as an example, the function is implemented as follows:
@classmethod
def instance(cls, *args, **kwargs):
    if not hasattr(MtcnnModelCaller, "_instance"):
        with MtcnnModelCaller._instance_lock:
            gpu_device_id = args[0]
            MtcnnModelCaller.gpu_device_id = gpu_device_id
            MtcnnModelCaller._instance = MtcnnModelCaller()
    return MtcnnModelCaller._instance
The class MtcnnModelCaller provides one method, call_model(params):
def call_model(self, params):
    if params['type'] == 'pnet':
        return self.call_PNet(params['data'])
    if params['type'] == 'rnet':
        return self.call_RNet(params['data'])
    if params['type'] == 'onet':
        return self.call_ONet(params['data'])
    if params['type'] == 'lnet':
        return self.call_LNet(params['data'])
    return None
This is the entry method exposed to the gRPC service: when an RPC request arrives, the service selects the appropriate model instance according to the parameters and calls its call_model(self, params) method.
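For illustration, a server-side counterpart to the client sketch given earlier could wire get_instance() into the gRPC service as follows. inference_pb2 / inference_pb2_grpc are the same assumed generated modules, and the import path of model_service mirrors the entrypoint configured below; the patent's actual service wiring is not shown verbatim.
import pickle
from concurrent import futures

import grpc

import inference_pb2            # hypothetical generated module
import inference_pb2_grpc       # hypothetical generated module
from script.face_detection import model_service   # provides get_instance(gpu_device_id)

class ModelServiceServicer(inference_pb2_grpc.ModelServiceServicer):
    def __init__(self, gpu_device_id: int = 0):
        # singleton model holder on this GPU container
        self._caller = model_service.get_instance(gpu_device_id)

    def CallModel(self, request, context):
        params = {"type": request.model_type, "data": pickle.loads(request.payload)}
        result = self._caller.call_model(params)
        return inference_pb2.ModelReply(payload=pickle.dumps(result))

def serve(port: int = 50051):
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    inference_pb2_grpc.add_ModelServiceServicer_to_server(ModelServiceServicer(), server)
    server.add_insecure_port(f"[::]:{port}")
    server.start()
    server.wait_for_termination()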
(3) Model configuration
The YAML configuration under the model/face_test package is as follows:
model_info:
  - model_service: 'face_detection'
    entrypoint: script.face_detection.model_service
  - model_service: 'face_feature_extract'
    entrypoint: script.face_feature_extract.model_service
  - model_service: 'face_structure_angle'
    entrypoint: script.face_structure_angle.model_service
The model_service of every GPU model required by the face_test capability is configured in this file. For example, the face_detection model_service from the algorithm-structure example above has an entry in this configuration, under the field name model_service.
(4) Model invocation
The script/model_separate_common/model_controller.py script provides a remote model-calling instance. To use it, first acquire the instance; taking face_detection in face_test as an example, the model_service name 'face_detection' is passed in when the instance is obtained. Then call the call_model method directly, passing in the parameters for the model call.
model_caller = ModelCaller.instance('face_detection')
data = {
    'sliced_index': sliced_index,
    'img': img,
    'scales': scales
}
total_boxes = model_caller.call_model('pnet', data)
In this example the model type 'pnet' is also passed in, because several models are used inside face_detection and a type is needed to determine which model to call.
In the mtcnn_detector of this example, the data is received, processed and transformed first, then the models are called repeatedly, and finally the result is returned.
After the script and the model are separated and refactored in this way, GPU utilization is greatly improved.
In summary, with the technical scheme of the invention, CPU processing and GPU processing are abstracted and decoupled, reuse is increased, the serialization problem is solved, and CPU preprocessing and GPU neural-network computation become truly independent distributed computations; containerized instances are configured in different ratios for different models according to the actual environment, for example 20 CPU containers paired with 4 GPU containers, so that GPU resources are fully utilized and GPU throughput is improved.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (8)

1. An inference engine design method for improving GPU computational throughput by separating the script and the model, characterized by comprising the following steps:
splitting and abstracting the CPU-processing and GPU-processing logic into modules;
serializing the data transmitted between CPU processing and GPU processing;
containerized inter-process communication;
containerized multi-instance deployment of the modules;
reverse proxy and load balancing;
adjusting the number of instances in real time through elastic scaling.
2. The inference engine design method for improving GPU computational throughput by separating the script and the model according to claim 1, characterized in that splitting and abstracting the CPU-processing and GPU-processing logic into modules comprises the following steps:
splitting and abstracting the CPU-processing and GPU-processing logic into modules according to the splitting principle, separating and decoupling the coupled logic;
abstracting the CPU processing a second time;
abstracting the GPU processing a second time.
3. The inference engine design method for improving GPU computational throughput by separating the script and the model according to claim 2, characterized in that the splitting principle consists of modularization, atomization and reuse.
4. The inference engine design method for improving GPU computational throughput by separating the script and the model according to claim 1, characterized in that the containerized process communication uses remote procedure calls to transfer the serialized data, realizing data communication between different processes.
5. The inference engine design method for improving GPU computational throughput by separating the script and the model according to claim 1, characterized in that the containerized multi-instance deployment of the modules comprises the following steps:
adopting a container platform and deploying the script and the model separately, based on containerized multi-copy deployment;
deploying instance counts in a ratio that matches the difference in computing power between the mutually calling CPU-script and GPU modules;
letting the controller automatically maintain the corresponding instances.
6. The inference engine design method for improving GPU computational throughput by separating the script and the model according to claim 1, characterized in that the reverse proxy and load balancing comprise the following steps:
reverse-proxying the multiple instances of the same module and distributing requests;
having the reverse proxy server dynamically distribute requests according to the performance of the instance nodes, using different load-balancing strategies, to achieve the optimal performance of the service nodes.
7. The inference engine design method for improving GPU computational throughput by separating the script and the model according to claim 1, characterized in that adjusting the number of instances in real time through elastic scaling comprises the following steps:
monitoring the instance metrics using the elastic scaling function of the container platform;
setting specific comparison thresholds and an anti-jitter duration;
adjusting the number of replicas in real time, avoiding both blocking caused by too few instances and resource waste caused by idle instances.
8. The method of claim 7, wherein the instance metrics include the number of calls, CPU usage and memory usage.
Priority, Publications and Family

Application number: CN202110894802.1A
Priority date / Filing date: 2021-08-05
Publication number: CN113342538A (status: Pending)
Publication date: 2021-09-03
Family ID: 77480751
Country: CN (China)
Title: Inference engine design method for improving GPU (graphics processing unit) computation throughput by separating script and model



Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
RJ01  Rejection of invention patent application after publication (application publication date: 20210903)