CN113610209A - Neural network model reasoning acceleration method for monitoring video stream scene - Google Patents
- Publication number
- CN113610209A (application CN202110911956.7A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- module
- data
- video stream
- network model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
Abstract
The invention discloses a neural network model inference acceleration method for surveillance video stream scenes. Taking the neural network layer as the unit, the original neural network model is split along the direction of inference computation into several modules, with the output of each module serving as the input of the next. The split modules are processed in parallel using multi-process technology, one module per process: input data is computed independently and in parallel across the modules, and data is passed between module processes through message queues. The invention belongs to the technical field of neural network models and specifically provides a neural network model inference acceleration method for surveillance video stream scenes.
Description
Technical Field
The invention relates to the technical field of neural network models, and in particular to a neural network model inference acceleration method for surveillance video stream scenes.
Background
With the continued development of neural networks, they have achieved excellent performance in computer vision thanks to their strong fitting and learning capability, exceeding human-level performance on multiple tasks. Meanwhile, the arrival of the Internet-of-Things era has popularized network cameras, which provide data externally as video streams with high data density; the real-time requirements on neural network models that process such streams are therefore high. Model accuracy is improved mainly by optimizing the internal structure and increasing the number of network layers. However, inference latency also grows greatly as the network deepens, which is unacceptable for time-sensitive applications.
To address this problem, existing work is optimized mainly along two lines: improving the operating efficiency of neural network operators, and model compression. On the operator side, Intel and several research institutions provide their own computation libraries, MKL and OpenBLAS respectively, for the linear algebra widely involved in neural network computation, improving efficiency by reducing the complexity of operations such as matrix multiplication, inversion, and singular value computation.
In terms of model compression, methods fall roughly into three categories: model quantization, pruning, and distillation. Model quantization compresses the 32-bit floating-point variables commonly used in neural network computation into 16-bit floating-point or integer types according to certain rules; but because 32-bit floating-point matrix kernels are already highly optimized, quantization mainly shrinks model volume and cannot effectively increase run speed. Model pruning aims to reduce computation time by sparsifying the original network structure, but it is limited by the efficiency of sparse matrix operations, and at present this scheme cannot reduce model run time. Model distillation transfers the knowledge of a trained teacher model (large model) to a student model (small model), reducing the computation and run time of the neural network model; however, the accuracy and speed of the distilled model depend on the distillation algorithm, and different teacher models may require different distillation algorithms, so distillation as a compression method cannot be widely generalized.
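As background for the distillation approach just described, the following is a minimal sketch of a soft-target distillation loss in plain Python. The temperature value, the logit values, and the helper names are illustrative assumptions, not part of the patent or of any specific distillation algorithm:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T yields softer target distributions."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# hypothetical logits for one input
teacher = [8.0, 2.0, 1.0]   # trained large model
student = [6.0, 3.0, 2.0]   # small model being trained to imitate it
loss = kd_loss(student, teacher)
print(loss > 0)             # True: the student has not matched the teacher yet
```

Minimizing this loss pushes the student's output distribution toward the teacher's, which is the knowledge-transfer step the text refers to.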
Therefore, existing neural network inference acceleration methods still have great limitations and cannot meet the real-time detection requirements of surveillance video stream scenarios; hence the invention provides a neural network model inference acceleration method oriented to such scenes.
Disclosure of Invention
To overcome these defects, the invention provides a neural network model inference acceleration method oriented to surveillance video stream scenes, which starts from the feedforward and layer-cascade characteristics of neural networks and uses multi-process technology to increase the running speed of neural network inference computation.
The invention provides the following technical scheme: a neural network model inference acceleration method for surveillance video stream scenes, comprising the following steps:
Step 1: taking the neural network layer as the unit, split the original neural network model along the direction of inference computation into several modules, the output of each module serving as the input of the next; process the split modules in parallel using multi-process technology, one module per process;
Step 2: read the video stream of the surveillance camera and decode it into frame images at the frame rate, which serve as the model input data;
Step 3: the modules of the neural network compute on input data independently and in parallel in multi-process fashion; data is transferred between module processes through message queues: the preceding module (Module_pre) pushes its output into a message queue (Queue), and the following module (Module_next) pops from that queue as its own input, thereby realizing the full feedforward computation of the neural network model.
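The three steps above amount to a stage-per-worker pipeline connected by queues. For portability the sketch below uses threads and `queue.Queue`; the patent itself uses processes, for which `multiprocessing.Process` and `multiprocessing.Queue` are the direct analogues with the same wiring. The toy modules and frame values are illustrative assumptions:

```python
import queue
import threading

def stage(fn, q_in, q_out):
    """One pipeline stage: take from the inbound queue, compute, push downstream."""
    while True:
        item = q_in.get()
        if item is None:          # sentinel: propagate shutdown and stop
            q_out.put(None)
            break
        q_out.put(fn(item))

# toy stand-ins for the split network modules
modules = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

queues = [queue.Queue() for _ in range(len(modules) + 1)]
workers = [threading.Thread(target=stage, args=(m, queues[i], queues[i + 1]))
           for i, m in enumerate(modules)]
for w in workers:
    w.start()

for frame in range(5):            # stand-in for decoded video frames
    queues[0].put(frame)
queues[0].put(None)               # end of stream

results = []
while (out := queues[-1].get()) is not None:
    results.append(out)
for w in workers:
    w.join()
print(results)                    # [-1, 1, 3, 5, 7], i.e. ((x + 1) * 2) - 3
```

Because each queue is FIFO and each stage has a single worker, frame order is preserved end to end, matching the feedforward full-process computation described in Step 3.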
Further, when the frame image data is a single datum, it is input into the multi-process neural network model according to Steps 1 and 2. The overall run time from network input to output is the sum of the run times of all modules (Tmemory-sum) plus the time for queue data to pass in and out between processes (Tqueue-sum), i.e., Tmemory-sum + Tqueue-sum, and this total exceeds the original run time (Told) of the undivided network.
Further, when the frame image data comprises multiple data, they are input sequentially into the multi-process neural network model according to Steps 1 and 2. Although each datum still takes longer to finish than on the undivided network, the minimum interval between successive results drops from the original Told to the maximum, over modules, of the single-module run time (Tmemory) plus the enqueue/dequeue time of the corresponding queue (Tqueue), i.e., (Tmemory + Tqueue)|max. Whenever Tmemory + Tqueue is less than Told, the inter-datum interval shrinks; a smaller interval means the frame rate of the surveillance video stream can be raised, improving the real-time performance of neural network processing of the stream data.
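The latency/throughput trade-off stated above can be checked with a few lines of arithmetic. All timing values below are hypothetical assumptions chosen only to illustrate the inequalities:

```python
# Hypothetical per-module run times (seconds) and queue-hop overhead.
t_module = [0.04, 0.06, 0.05, 0.03]   # Tmemory for each of four split modules
t_queue = 0.005                        # Tqueue: enqueue + dequeue cost per hop

t_old = sum(t_module)                                  # run time of the undivided model
latency = sum(t_module) + len(t_module) * t_queue      # one frame's end-to-end time, pipelined
interval = max(t + t_queue for t in t_module)          # steady-state gap between outputs

print(latency > t_old, interval < t_old)               # True True
```

A single frame is slower through the pipeline (latency exceeds Told), but once the pipeline is full, one result emerges every (Tmemory + Tqueue)|max seconds, which is what allows a higher stream frame rate.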
Further, if the run times of the processes are inconsistent, the dequeue and enqueue rates of the message queues will also be inconsistent, and a frame sitting in a queue may be overwritten by the next frame before the neural network has processed it. For surveillance video stream scenes, video frames are quantized from an analog signal, not every frame needs to be processed in practice, and the effective video frame rate follows the operating speed of the neural network; moreover, this frame-missing situation is essentially the same as frames being dropped because inference is too slow.
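The overwrite behaviour described above can be emulated with a bounded, non-blocking queue that discards the oldest unprocessed frame when full. This is a single-threaded sketch; the helper name, queue size, and frame values are illustrative assumptions:

```python
import queue

def put_latest(q, frame):
    """Non-blocking put: if the queue is full, drop the oldest waiting frame first,
    mirroring a slow stage whose unprocessed frame is overwritten by the next one."""
    try:
        q.put_nowait(frame)
    except queue.Full:
        try:
            q.get_nowait()        # discard the stale frame
        except queue.Empty:
            pass
        q.put_nowait(frame)

q = queue.Queue(maxsize=2)
for frame in range(5):            # producer outruns the consumer
    put_latest(q, frame)
kept = [q.get_nowait() for _ in range(2)]
print(kept)                       # [3, 4]: only the freshest frames survive
```

In a real multi-process deployment the drop-then-put sequence would need to tolerate a concurrent consumer; here it only illustrates why missing a frame is equivalent to the stream's effective frame rate tracking the slowest stage.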
The invention, so structured, has the following beneficial effects:
(1) compared with model compression methods, it accelerates model inference without reducing model accuracy;
(2) it combines multi-process technology with the cascade and feedforward characteristics of neural networks, accelerating model inference in a fundamental way;
(3) compared with starting several copies of the neural network model simultaneously, the technical scheme of the invention occupies fewer hardware resources.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
fig. 1 is a flow chart of an embodiment of a neural network model inference acceleration method for a surveillance video stream scene according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments; all other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To make the technical solution of the present invention clearer, the neural network model vgg16 is taken as an example below, and each implementation step of the invention is described in detail with reference to the accompanying drawings.
Step 1: divide the vgg16 model with the layer as the unit, storing every four layers as an independent module, named neural network module 1 (module1), neural network module 2 (module2), neural network module 3 (module3), and neural network module 4 (module4);
Step 2: pair each module with a message queue: four message queues are used in total, namely message queue 1 (queue1) between the reading process and module1, and message queues 2 through 4 (queue2, queue3, queue4) between consecutive modules; set the queues to the non-blocking type and configure their buffer sizes;
Step 3: start process 1, which reads the video stream of the surveillance camera, decodes frame images at the frame rate as model input data, and sends the data to message queue 1 (queue1);
Step 4: start process 2, in which neural network module 1 (module1) takes data from queue1, computes, and sends the result to queue2; once the result is sent, the next frame is taken from queue1 and input into module1;
Step 5: start process 3, in which neural network module 2 (module2) takes data from queue2, computes, and sends the result to queue3; once the result is sent, the next frame is taken from queue2 and input into module2;
Step 6: start process 4, in which neural network module 3 (module3) takes data from queue3, computes, and sends the result to queue4; once the result is sent, the next datum is taken from queue3 and input into module3;
Step 7: start process 5, in which neural network module 4 (module4) takes data from queue4, computes, and outputs the result, which is the final result of the neural network model; once the result is sent, the next datum is taken from queue4 and input into module4.
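Step 1's grouping of every four layers into a module can be sketched in a framework-agnostic way. The 16 "layers" below are toy stand-ins for vgg16's layers, and the helper name and toy computation are assumptions for illustration only:

```python
from typing import Callable, List

def split_every(layers: List[Callable], k: int) -> List[Callable]:
    """Group every k consecutive layers into one composed module."""
    modules = []
    for i in range(0, len(layers), k):
        group = layers[i:i + k]
        def module(x, _group=group):
            for layer in _group:
                x = layer(x)        # layers run in feedforward order inside the module
            return x
        modules.append(module)
    return modules

# toy stand-in for vgg16: 16 "layers", each adding 1 to its input
layers = [(lambda x: x + 1) for _ in range(16)]
modules = split_every(layers, 4)    # -> module1 .. module4, four layers each

y = 0
for m in modules:                   # output of module i is the input of module i+1
    y = m(y)
print(len(modules), y)              # 4 16
```

With a real framework, the same idea applies: e.g. slicing a sequential container of layers into four sub-networks, each loaded by its own process.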
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (4)
1. A neural network model inference acceleration method for surveillance video stream scenes, characterized by comprising the following steps:
Step 1: taking the neural network layer as the unit, split the original neural network model along the direction of inference computation into several modules, the output of each module serving as the input of the next; process the split modules in parallel using multi-process technology, one module per process;
Step 2: read the video stream of the surveillance camera and decode it into frame images at the frame rate, which serve as the model input data;
Step 3: the modules of the neural network compute on input data independently and in parallel in multi-process fashion; data is transferred between module processes through message queues: the preceding module pushes its output into a message queue, and the following module takes data from the message queue as its own input, thereby realizing the full feedforward computation of the neural network model.
2. The neural network model inference acceleration method for surveillance video stream scenes according to claim 1, wherein, when the frame image data is a single datum, the frame image is input into the multi-process neural network model according to Steps 1, 2, and 3; the overall run time from network input to output is the sum of the run times of all modules plus the time required for queue data to pass between processes, and this sum exceeds the original run time of the undivided network.
3. The neural network model inference acceleration method for surveillance video stream scenes according to claim 1, wherein, when the frame image data comprises multiple data, the frame images are input into the multi-process neural network model according to Steps 1, 2, and 3, and the minimum interval between successive results is reduced to the maximum, over modules, of the single-module run time plus the enqueue/dequeue time of the corresponding queue.
4. The method according to claim 1, wherein, if the computation times of the processes are inconsistent, the dequeue and enqueue rates of the message queues are also inconsistent, and this inconsistency causes a frame in the queue to be overwritten by the next frame before being processed by the neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110911956.7A CN113610209A (en) | 2021-08-10 | 2021-08-10 | Neural network model reasoning acceleration method for monitoring video stream scene |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113610209A true CN113610209A (en) | 2021-11-05 |
Family
ID=78340094
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110911956.7A Pending CN113610209A (en) | 2021-08-10 | 2021-08-10 | Neural network model reasoning acceleration method for monitoring video stream scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113610209A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019134987A1 (en) * | 2018-01-05 | 2019-07-11 | Deepmind Technologies Limited | Parallel video processing systems |
CN110796242A (en) * | 2019-11-01 | 2020-02-14 | 广东三维家信息科技有限公司 | Neural network model reasoning method and device, electronic equipment and readable medium |
CN111475313A (en) * | 2020-03-04 | 2020-07-31 | 江苏理工学院 | Message queue construction method and device suitable for convolutional neural network forward propagation |
CN112434804A (en) * | 2020-10-23 | 2021-03-02 | 东南数字经济发展研究院 | Compression algorithm for deep transform cascade neural network model |
US20210073170A1 (en) * | 2019-09-09 | 2021-03-11 | Shanghai Denglin Technologies Co., Ltd. | Configurable heterogeneous ai processor |
CN113128688A (en) * | 2021-04-14 | 2021-07-16 | 北京航空航天大学 | General AI parallel reasoning acceleration structure and reasoning equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |