CN112001351A - Method, system, computer device and storage medium for processing multiple video streams - Google Patents

Method, system, computer device and storage medium for processing multiple video streams

Info

Publication number
CN112001351A
CN112001351A (application CN202010906411.2A)
Authority
CN
China
Prior art keywords
tensor
video
processing
processed
queue
Prior art date
Legal status
Pending
Application number
CN202010906411.2A
Other languages
Chinese (zh)
Inventor
郁强
方思勰
Current Assignee
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202010906411.2A
Publication of CN112001351A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G06N20/20 - Ensemble learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • G06N5/046 - Forward inferencing; Production systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 - Television systems
    • H04N7/18 - Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to a method, system, computer device and storage medium for processing multiple video streams. The method comprises: acquiring video streams to be processed; converting the video streams to be processed into video frames to be processed; converting the video frames to be processed within a preset time period into a video frame Tensor and pushing it into a preprocessing queue, wherein the dimensions of the video frame Tensor are the number of video streams to be processed and the length, width and RGB channels of the video frames to be processed; monitoring the preprocessing queue, acquiring the video frame Tensors in it, and putting them into a machine learning engine for inference; and acquiring the inference result for the video frame Tensors in the preprocessing queue, wherein the inference result comprises a coordinate Tensor, a classification Tensor, a confidence Tensor and a quantity Tensor. Unlike the prior art, the method can process dynamic video streams in batches, and uses a thread pool together with a multi-GPU reference-counting method to process multiple video stream addresses stably and efficiently, thereby improving detection concurrency.

Description

Method, system, computer device and storage medium for processing multiple video streams
Technical Field
The present application relates to the field of object detection technologies, and in particular, to a method, a system, a computer device, and a storage medium for processing multiple video streams.
Background
With the rapid development of deep learning in the field of artificial intelligence, more and more areas of computer vision face enormous opportunities and challenges. Object detection is a popular technique in computer vision and digital image processing, and is also a basic algorithm in the field of general identity recognition. It plays an important role in subsequent tasks such as face recognition, gait recognition, crowd counting and instance segmentation, and is widely applied in intelligent video surveillance.
In the related art, object detection algorithms mainly use fixed-batch detection and generally adopt single-video-stream inference, i.e. each video frame in a single video stream is detected as a basic unit. For example, if the fixed batch size is set to 1, one video frame is sent to the inference engine for inference each time; this approach suffers from low detection speed and poor concurrency performance. If the batch size is instead set to 16, the samples must be padded whenever fewer than 16 are available, which wastes detection resources.
At present, no effective solution has been proposed for the low processing speed and poor concurrency performance of object detection algorithms in the related art.
Disclosure of Invention
The embodiments of the application provide a method, system, computer device and storage medium for processing multiple video streams, so as to at least solve the low detection speed and poor concurrency performance of related-art object detection methods that process each video frame in a single video stream as a basic unit.
In a first aspect, an embodiment of the present application provides a method for processing multiple video streams, the method comprising: acquiring video streams to be processed; converting the video streams to be processed into video frames to be processed; converting the video frames to be processed within a preset time period into a video frame Tensor and pushing it into a preprocessing queue, wherein the dimensions of the video frame Tensor are the number of video streams to be processed and the length, width and RGB channels of the video frames to be processed; monitoring the preprocessing queue, acquiring the video frame Tensors in it, and putting them into a machine learning engine for inference; and acquiring the inference result for the video frame Tensors in the preprocessing queue, wherein the inference result comprises a coordinate Tensor, a classification Tensor, a confidence Tensor and a quantity Tensor.
In some of these embodiments, obtaining the video stream to be processed comprises: receiving a video stream request, wherein the video stream request comprises a video stream address, a video stream duration, a video stream ID, a request time and an asynchronous callback address; converting the format of the video stream request into a data structure of a video stream processing unit, and registering the video stream processing unit on a stream executor; and acquiring the registered video stream information as the video stream to be processed through the video stream registration list.
In some of these embodiments, the method further comprises: sending the inference result into a post-processing queue; and monitoring the post-processing queue, acquiring the inference result in it, and performing post-processing to obtain a target result.
In some of these embodiments, the machine learning engine comprises: the model preprocessing module is used for preprocessing the video frame Tensor to obtain a preprocessed Tensor; the trained model network module is used for reasoning the preprocessed Tensor to obtain a characteristic result Tensor; and the model post-processing module is used for decoding the characteristic result Tensor to obtain an inference result.
In some of these embodiments, preprocessing the video frame Tensor comprises: resizing and normalizing the video frame Tensor.
In some of these embodiments, the trained model network module is obtained by: acquiring existing video frame Tensors; acquiring the feature information of the existing video frame Tensors; and inputting the existing video frame Tensors into the machine learning model, using their feature information as supervision, to train the model and obtain a fully trained machine learning model.
In some embodiments, monitoring the post-processing queue, obtaining the inference result in it, and performing post-processing to obtain a target result comprises: reading the inference result from the post-processing queue, converting it into a post-processing data structure, and post-processing that data structure to obtain the target result, wherein the post-processing at least comprises target detection deduplication, target tracking and similarity comparison; and pushing the target result to the asynchronous callback address.
In some embodiments, monitoring the preprocessing queue, obtaining the video frame Tensors in it, and putting them into the machine learning engine for inference further comprises: putting the inference process into a single thread, using a thread pool to accelerate inference on the CPU, and using multiple GPUs to accelerate inference on the GPU, wherein multi-GPU accelerated inference counts the running tasks of each GPU with a reference-counting method: each request sends a video frame Tensor to the GPU with the fewest tasks for inference, each request increments that GPU's count by one, and completion of the request decrements it by one.
In a second aspect, an embodiment of the present application provides a multiple video stream processing system, comprising: a preprocessing module for acquiring video streams to be processed, converting them into video frames to be processed, and converting the video frames to be processed within a preset time period into a video frame Tensor pushed into a preprocessing queue, wherein the dimensions of the video frame Tensor are the number of video streams to be processed and the length, width and RGB channels of the video frames; and an inference module for monitoring the preprocessing queue, acquiring the video frame Tensors in it, putting them into the machine learning engine for inference, and acquiring the inference result for the video frame Tensors, which comprises a coordinate Tensor, a classification Tensor, a confidence Tensor and a quantity Tensor.
In some of these embodiments, the system further comprises: a post-processing module for sending the inference result into a post-processing queue, monitoring the post-processing queue, acquiring the inference result in it, and performing post-processing to obtain a target result.
In a third aspect, an embodiment of the present application provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the multiple video stream processing method according to the first aspect.
In a fourth aspect, the present application provides a storage medium on which a computer program is stored, which, when executed by a processor, implements the multiple video stream processing method described in the first aspect.
Compared with the related art, the multiple video stream processing method, system, computer device and storage medium provided by the embodiments of the application solve the inefficiency of processing dynamic video streams that arises from single-video-stream inference and reading video frames in a fixed batch size. Video frames are read from multiple video streams at fixed intervals and converted into a video frame Tensor, so that all frames acquired within a fixed time interval may be converted into one video frame Tensor; the inference result comprises the four dimensions of coordinates, classification, confidence and quantity; and dynamic batches of video frame Tensors are sent to the GPU with the fewest executing tasks for inference. Compared with the prior art, in which a fixed batch of video frames is put into the GPU for inference each time, the method achieves efficient processing of multiple video streams, high detection concurrency and high detection speed.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow diagram of a method of processing multiple video streams according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of multiple video stream processing according to a preferred embodiment of the present application;
FIG. 3 is a block diagram of a multiple video stream processing system according to an embodiment of the present application;
FIG. 4 is a diagram of a hardware configuration of an electronic device according to an embodiment of the present application;
FIG. 5 is a flow chart of a video stream pre-processing method of the present application;
FIG. 6 is a flow chart of a method of dynamic batch processing in the present application;
FIG. 7 is a flow chart of the video stream processing system of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
The present embodiment provides a multiple video stream processing method, and fig. 1 is a flowchart of the multiple video stream processing method according to the embodiment of the present application, and as shown in fig. 1, the flowchart includes a preprocessing step, an inference step, and a post-processing step, and specifically, the method includes:
step S101, obtaining a video stream to be processed.
In some of these embodiments, obtaining the video stream to be processed comprises: receiving a video stream request, wherein the video stream request comprises a video stream address, a video stream duration, a video stream ID, a request time, and an asynchronous callback address; converting the format of the video stream request into a data structure of a video stream processing unit, and registering the video stream processing unit on a stream executor; and acquiring the registered video stream information as the video stream to be processed through the video stream registration list.
In this embodiment, a video stream request is received, format conversion is performed, and the video stream is registered in the stream executor, so that the video stream can be registered and deleted at any time, and management of the video stream is facilitated.
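The registration flow above can be sketched in a few lines. This is a minimal illustrative sketch, not the patent's actual code: the field names (`address`, `duration`, `stream_id`, `request_time`, `callback`) and the `StreamExecutor` class are assumptions modeled on the request fields listed in the text.

```python
# Sketch of the video stream registration described above (illustrative names).
from dataclasses import dataclass
import time

@dataclass
class VideoStreamUnit:
    address: str          # video stream address (e.g. an RTSP URL)
    duration: float       # requested video stream duration in seconds
    stream_id: str
    request_time: float
    callback: str         # asynchronous callback address
    error: str = ""       # errors in the flow are recorded here

class StreamExecutor:
    """Holds the registration list; streams can be registered and deleted at any time."""
    def __init__(self):
        self.registry = {}

    def register(self, request: dict) -> VideoStreamUnit:
        # Convert the raw request format into the video stream processing unit.
        unit = VideoStreamUnit(
            address=request["address"],
            duration=request["duration"],
            stream_id=request["stream_id"],
            request_time=request.get("request_time", time.time()),
            callback=request["callback"],
        )
        self.registry[unit.stream_id] = unit
        return unit

    def deregister(self, stream_id: str):
        self.registry.pop(stream_id, None)

    def pending_streams(self):
        # The registration list yields the video streams to be processed.
        return list(self.registry.values())
```

Because registration and deletion only touch the registry dictionary, streams can come and go between detection intervals, which is what makes the batch size dynamic in the later steps.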
Step S102, converting the video stream to be processed into video frames to be processed.
In this embodiment, a plurality of video streams are acquired, and a video frame to be processed is obtained by decomposing the plurality of video streams.
Step S103, converting the video frames to be processed in the preset time period into video frames Tensor and pushing the video frames Tensor into a preprocessing queue, wherein the video frames Tensor comprises the number of the video streams to be processed, the length, the width and the RGB of the video frames to be processed.
In this embodiment, by periodically decomposing the plurality of video streams to obtain video frames to be processed and converting them into a video frame Tensor, all video frames to be processed collected within a fixed time interval can be converted into one video frame Tensor. Compared with the prior-art method of adopting single-video-stream inference and reading video frames in a fixed batch size, this step processes the video streams with a dynamic batch size.
In this embodiment, the video frame Tensor is pushed into the preprocessing queue to form a task queue, and the tasks in the queue are executed in order. The video frame Tensor contains the simultaneous video frames to be processed of the multiple video streams, and no matter how many video frames to be processed it contains, it can be pushed into the preprocessing queue for subsequent processing.
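Step S103 can be sketched as follows: frames captured in one time window are stacked into a single (N, H, W, 3) array and pushed into the preprocessing queue. This is a minimal sketch using NumPy in place of any specific deep-learning framework; the function name `push_batch` is an assumption.

```python
# Frames from however many streams arrived this interval become one dynamic batch.
import queue
import numpy as np

preprocess_queue: "queue.Queue[np.ndarray]" = queue.Queue()

def push_batch(frames):
    """Stack the frames of this interval; N (the batch size) is dynamic."""
    if not frames:
        return None
    batch = np.stack(frames, axis=0)  # shape: (num_streams, height, width, 3)
    preprocess_queue.put(batch)
    return batch.shape

# Three streams each yield one 512x512 RGB frame in the same window:
frames = [np.zeros((512, 512, 3), dtype=np.uint8) for _ in range(3)]
shape = push_batch(frames)  # -> (3, 512, 512, 3)
```

The first dimension of the resulting Tensor is the number of video streams, and the remaining dimensions are the height, width and RGB channels of the frames, matching the dimensions named in the text.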
In steps S101 to S103, the preprocessing procedure provided by this solution is specifically as follows: a video stream request is sent to the server through a protocol such as HTTP or GRPC; after receiving the request, the server converts it into the data structure of a video stream processing unit, registers the unit in the stream executor, and immediately returns a registration-success result. The registered video stream processing units are obtained through the registration list as the video streams to be processed; at fixed intervals, each video stream to be processed is acquired and decomposed into video frames to be processed, which are converted into a video frame Tensor and pushed into the preprocessing queue. The time interval at which the video frames to be processed are acquired is time_block.
In order to use server resources efficiently, the preprocessing flow can be put into a single thread, with a thread pool hosting the reading of the video frames to be processed from the video streams to be processed; the maximum number of threads in the pool is preprocessing_threads.
This concludes the pre-processing flow.
If an error occurs in the flow, the error information is recorded in the video stream processing unit.
Illustratively, in the initial state no video stream is registered and the stream executor does nothing. A video stream to be processed is registered at 3 s; assuming the stream executor sets the detection time interval time_block to 5 s, at 5 s the executor reads only that stream's video frame to be processed and pushes it into the preprocessing queue. At 7 s another video stream to be processed is registered, so at 10 s two video frames to be processed are read, converted into a video frame Tensor whose dimensions are the number of video streams to be processed and the length, width and RGB channels of the video frames, and pushed into the preprocessing queue for subsequent processing. More specifically, when 3 video streams to be processed are connected at 20:00 and need to be processed into video frames with a height of 512 and a width of 512, the 3 streams are decomposed into video frames to be processed simultaneously; if this is exactly the time point at which the video frames need to be converted into a video frame Tensor, the 3 video frames are processed into a (3, 512, 512, 3) video frame Tensor and pushed into the preprocessing queue; otherwise, the reading of video frames to be processed continues.
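The timed behaviour of this example can be sketched as a collection step that runs every time_block seconds and reads one frame per registered stream, so the batch size tracks the registration list. The `read_frame` callable stands in for actual video decoding and is an assumption of this sketch.

```python
# Every time_block seconds the executor emits one dynamic batch.
import numpy as np

TIME_BLOCK = 5  # seconds between batches, as in the example above

def collect_batch(registered_streams, read_frame):
    """Read one frame per registered stream; batch size tracks registrations."""
    frames = [read_frame(s) for s in registered_streams]
    if not frames:
        return None          # no stream registered yet: nothing to do
    return np.stack(frames)  # (num_registered_streams, 512, 512, 3)

# At t=5s only one stream is registered; by t=10s a second one has been added.
fake_read = lambda s: np.zeros((512, 512, 3), dtype=np.uint8)
b1 = collect_batch(["stream-1"], fake_read)               # (1, 512, 512, 3)
b2 = collect_batch(["stream-1", "stream-2"], fake_read)   # (2, 512, 512, 3)
```

Note how the same code path produces batches of 1, 2 or 3 frames without any padding, which is the resource saving claimed over fixed-batch detection.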
And step S104, monitoring the preprocessing queue, acquiring the video frames Tensor in the preprocessing queue, and putting the video frames Tensor into a machine learning engine for reasoning.
In this embodiment, converting the video frames to be processed into a video frame Tensor is equivalent to obtaining one dynamic batch of video frames each time: whether the Tensor holds 4 or 8 video frames, all of them can be inferred simultaneously, making full use of matrix operations. The prior art adopts fixed-batch-size detection: if the batch size is set to 1, detection is slow; if it is set to 16, the samples must be padded whenever fewer than 16 are available, wasting detection resources. In this solution, putting the video frame Tensor into the machine learning engine enables detection with a dynamic batch size, i.e. the parallel computing advantage of CUDA can be exploited without fixing the batch size in advance, improving the efficiency of object detection on dynamic video streams.
In some of these embodiments, the machine learning engine comprises: the model preprocessing module is used for preprocessing the video frame Tensor to obtain a preprocessed Tensor; the trained model network module is used for reasoning the preprocessed Tensor to obtain a characteristic result Tensor; and the model post-processing module is used for decoding the characteristic result Tensor to obtain an inference result.
In this embodiment, the preprocessing queue is monitored, and each video frame Tensor is preprocessed and then put into the machine learning engine for inference to obtain the result Tensors corresponding to it. The trained model network module is obtained as follows: acquire existing video frame Tensors; acquire the feature information of the existing video frame Tensors; input the existing video frame Tensors into the machine learning model and use their feature information as supervision to train the model, obtaining a fully trained machine learning model.
In some of these embodiments, preprocessing the video frame Tensor comprises: resizing and normalizing the video frame Tensor.
In this embodiment, since each image of a video frame to be processed is converted into a matrix for computation, the dimensions of the matrix must be fixed, so the size adjustment is performed. Specifically, the video frame Tensor obtained from the preprocessing queue is passed to an XLA-compiled model preprocessing module, which resizes and standardizes the video frame Tensor so that its different features have the same scale.
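The resize-and-normalize pass can be sketched as below. To keep the sketch dependency-free, nearest-neighbour resizing is implemented in plain NumPy; a real system would use cv2.resize or a framework op, and XLA compilation is omitted here.

```python
# Minimal resize + normalize preprocessing for a batch of frames.
import numpy as np

def preprocess(batch: np.ndarray, size=(512, 512)) -> np.ndarray:
    """Resize every frame to a fixed size, then scale pixels to [0, 1]."""
    n, h, w, c = batch.shape
    ys = np.arange(size[0]) * h // size[0]   # nearest-neighbour row indices
    xs = np.arange(size[1]) * w // size[1]   # nearest-neighbour column indices
    resized = batch[:, ys][:, :, xs]         # (n, size[0], size[1], c)
    return resized.astype(np.float32) / 255.0  # standardize to a common scale

batch = np.full((2, 256, 256, 3), 255, dtype=np.uint8)
out = preprocess(batch)  # shape (2, 512, 512, 3), values all 1.0
```

Fixing the output dimensions here is what lets frames from differently sized camera streams share one matrix, which the paragraph above identifies as the reason for the resize.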
And step S105, acquiring an inference result of the video frames Tensor in the preprocessing queue, wherein the inference result comprises a coordinate Tensor, a classification Tensor, a confidence level Tensor and a quantity Tensor of the video frames Tensor in the preprocessing queue, and acquiring the information of the detected target according to four dimensions of the Tensor.
In steps S104 to S105, the inference process provided by this solution specifically reads the video frame Tensor from the preprocessing queue and uses the machine learning engine to infer, on the GPU, all the video frames to be processed contained in one video frame Tensor, obtaining the four result dimensions of the video frame Tensor: coordinates, classification, confidence and quantity.
In order to use server resources most efficiently, the inference flow is placed into a single thread, while the preprocessing and post-processing flows are handled by the CPU. For the CPU, a thread pool accelerates inference, with the maximum number of threads set to session_threads; for the GPU, multiple GPUs accelerate inference. Multi-GPU accelerated inference counts the running tasks of each GPU with a reference-counting method: each request sends a video frame Tensor to the GPU with the fewest tasks for inference, each request increments that GPU's count by one, and completion of the request decrements it by one.
This concludes the inference flow.
If an error occurs in the flow, the error information is recorded in the video stream processing unit.
For example, assuming the stream executor sets the detection time interval time_block to 5 s, the preprocessing queue receives 2 Tensors, at 5 s and at 10 s, in sequence; after inference, the coordinate Tensor, classification Tensor, confidence Tensor and quantity Tensor are pushed into the post-processing queue. More specifically, when one video frame Tensor contains 8 video frames, inference is carried out on all of them simultaneously, and the result includes the coordinates, classifications, confidences and detection counts of the 8 video frames.
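The four result Tensors for a batch of N frames can be sketched as below. The shapes (a fixed maximum of 10 detections per frame, boxes as x1,y1,x2,y2) are illustrative assumptions; the patent only specifies that the result has coordinate, classification, confidence and quantity dimensions.

```python
# Fabricated inference result for N=8 frames, plus a helper that uses the
# per-frame count to slice out only the valid detections.
import numpy as np

N, MAX_DET = 8, 10
result = {
    "coordinates": np.zeros((N, MAX_DET, 4), dtype=np.float32),  # x1,y1,x2,y2
    "classes":     np.zeros((N, MAX_DET), dtype=np.int64),
    "confidences": np.zeros((N, MAX_DET), dtype=np.float32),
    "counts":      np.full((N,), 3, dtype=np.int64),  # 3 valid detections/frame
}

def valid_detections(result, frame_idx):
    """Use the quantity Tensor to keep only the valid detections of one frame."""
    k = int(result["counts"][frame_idx])
    return result["coordinates"][frame_idx, :k], result["confidences"][frame_idx, :k]

boxes, scores = valid_detections(result, 0)  # 3 boxes and 3 scores for frame 0
```

The quantity Tensor is what lets a fixed-width result array represent a variable number of detections per frame without ambiguity.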
Through steps S101 to S105, video frames to be processed are read from the multiple video streams at regular intervals and converted into video frame Tensors, so that all video frames to be processed collected within a fixed time interval can be converted into one video frame Tensor; the inference result comprises the four dimensions of coordinates, classification, confidence and quantity; and dynamic batches of video frame Tensors are sent to the GPU with the fewest executing tasks for inference, achieving efficient processing of multiple video streams, high detection concurrency and high detection speed. The invention differs from the prior art in that it can process dynamic video streams in batches and, by using a thread pool and a multi-GPU counting method, can process multiple video stream addresses stably and efficiently, improving detection concurrency.
Specifically, the related art generally adopts single-video-stream inference and fixed-batch-size detection, i.e. each video frame in a single video stream is detected as one basic unit; for example, with a fixed batch size of 1, one video frame is sent to the inference engine at a time, and detection is slow. If the batch size is instead set to 16, samples must be padded whenever fewer than 16 are available, wasting detection resources. In this solution, putting the video frame Tensor into the machine learning engine enables detection with a dynamic batch size, i.e. the parallel computing advantage of CUDA can be exploited without fixing the batch size in advance, improving the efficiency of object detection on dynamic video streams.
In some embodiments, the method further includes step S106: sending the inference result into a post-processing queue; and monitoring the post-processing queue, acquiring the inference result in it, and performing post-processing to obtain a target result.
In some embodiments, monitoring the post-processing queue, obtaining the inference result in it, and performing post-processing to obtain a target result comprises: reading the inference result from the post-processing queue, converting it into a post-processing data structure, and post-processing that data structure to obtain the target result, wherein the post-processing at least comprises target detection deduplication, target tracking and similarity comparison; and pushing the target result to the asynchronous callback address.
In this embodiment, the obtained inference result is converted into a format that can be post-processed, and the post-processing program processes it to obtain the target result. Post-processing may include target detection deduplication, target tracking, similarity comparison, and so on, and the target result is pushed to the asynchronous callback address recorded for the video stream to be processed.
In step S106, the post-processing flow provided by this embodiment monitors the post-processing queue, converts the coordinate Tensor, classification Tensor, confidence Tensor, and quantity Tensor into a post-processing data structure, and submits that structure to the post-processing program for processing.
In order to make the most efficient use of server resources, the post-processing flow runs in a single thread, and a thread pool is used to process the inference results in parallel, where the maximum number of threads in the pool is postprocess_threads. After processing is finished, the target result is pushed to the asynchronous callback address recorded in the video stream processing unit via the HTTP and GRPC protocols.
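As a hedged sketch of this flow, the fragment below runs a single monitoring loop that converts each inference result (coordinate, classification, confidence, and quantity tensors) into a plain data structure and hands it to a pool of postprocess_threads worker threads. The callback push is stubbed as a callable rather than a real HTTP/GRPC client, and all names other than postprocess_threads are illustrative.

```python
# Single-thread monitor over the post-processing queue, with a worker pool
# of `postprocess_threads` threads for the actual post-processing work.
import queue
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class PostprocessItem:
    boxes: list    # coordinate Tensor
    classes: list  # classification Tensor
    scores: list   # confidence Tensor
    counts: list   # quantity Tensor (detections per frame)

def run_postprocess(post_queue, push_callback, postprocess_threads=4):
    futures = []
    with ThreadPoolExecutor(max_workers=postprocess_threads) as pool:
        while True:
            item = post_queue.get()
            if item is None:          # sentinel ends the flow
                break
            data = PostprocessItem(*item)
            # deduplication / tracking / similarity comparison would run
            # here before the result is pushed to the callback address
            futures.append(pool.submit(push_callback, data))
    return [f.result() for f in futures]
```

In a real deployment `push_callback` would POST the target result to the asynchronous callback address recorded in the video stream processing unit.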
The process flow then ends. If an error occurs during the flow, the error information is recorded in the video stream processing unit.
Illustratively, the post-processing program reads the inference results from the post-processing queue in sequence for processing, and pushes the target result to the callback address http://xxx.
Fig. 3 is a block diagram of a multiple video stream processing system according to an embodiment of the present application. As shown in Fig. 3, the system includes: a preprocessing module, used for acquiring a video stream to be processed, converting the video stream to be processed into video frames to be processed, and converting the video frames to be processed within a preset time period into a video frame Tensor pushed into a preprocessing queue, wherein the video frame Tensor comprises the number of video streams to be processed and the length, width, and RGB of the video frames to be processed; and an inference module, used for monitoring the preprocessing queue, acquiring the video frames Tensor in the preprocessing queue, and putting them into the machine learning engine for inference, and for acquiring an inference result of the video frames Tensor in the preprocessing queue, wherein the inference result comprises a coordinate Tensor, a classification Tensor, a confidence Tensor, and a quantity Tensor of the video frames Tensor in the preprocessing queue.
In some of these embodiments, the system further comprises: the post-processing module is used for sending the reasoning result into a post-processing queue; and monitoring the post-processing queue, acquiring the reasoning result in the post-processing queue, and performing post-processing to obtain a target result.
In some of these embodiments, the pre-processing module is configured to receive a video stream request, wherein the video stream request includes a video stream address, a video stream duration, a video stream ID, a request time, and an asynchronous callback address; converting the format of the video stream request into a data structure of a video stream processing unit, and registering the video stream processing unit on a stream executor; and acquiring the registered video stream information as the video stream to be processed through the video stream registration list.
In some embodiments, the machine learning engine of the inference module includes: a model preprocessing module for preprocessing the video frame Tensor to obtain a preprocessed Tensor; a trained model network module for performing inference on the preprocessed Tensor to obtain a feature result Tensor; and a model post-processing module for decoding the feature result Tensor to obtain the inference result.
In some of these embodiments, the inference module is configured to resize and normalize the video frame Tensor.
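A minimal pure-Python sketch of this resize-and-normalize step is shown below, using nearest-neighbour resizing on a single 2-D channel; the real pipeline would operate on the full video frame Tensor, typically on the GPU. All function names are illustrative.

```python
# Nearest-neighbour resize followed by normalization of pixel values
# from the [0, 255] range into [0, 1].
def resize_nearest(frame, out_h, out_w):
    in_h, in_w = len(frame), len(frame[0])
    return [[frame[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]

def normalize(frame, scale=255.0):
    return [[px / scale for px in row] for row in frame]

def preprocess(frame, out_h, out_w):
    """Resize the frame to (out_h, out_w), then scale pixels to [0, 1]."""
    return normalize(resize_nearest(frame, out_h, out_w))
```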
In some embodiments, the post-processing module is configured to read the inference result from the post-processing queue and convert the inference result into a post-processing data structure, and perform post-processing on the post-processing data structure to obtain a target result, where the post-processing at least includes target detection deduplication, target tracking, and similarity comparison; and pushing the target result to the asynchronous callback address.
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented by hardware, the modules may be located in the same processor, or may be distributed among different processors in any combination.
The present embodiment also provides an electronic device comprising a memory 304 and a processor 302, wherein the memory 304 stores a computer program, and the processor 302 is configured to execute the computer program to perform the steps of any of the above method embodiments.
Specifically, the processor 302 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 304 may include mass storage for data or instructions. By way of example and not limitation, the memory 304 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 304 may include removable or non-removable (or fixed) media, where appropriate, and may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 304 is non-volatile memory. In particular embodiments, the memory 304 includes read-only memory (ROM) and random access memory (RAM). The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically rewritable ROM (EAROM), flash memory, or a combination of two or more of these, where appropriate. The RAM may be static random-access memory (SRAM) or dynamic random-access memory (DRAM), where the DRAM may be fast page mode DRAM (FPMDRAM), extended data output DRAM (EDODRAM), synchronous DRAM (SDRAM), or the like.
Memory 304 may be used to store or cache various data files for processing and/or communication purposes, as well as possibly computer program instructions for execution by processor 302.
The processor 302 may implement any one of the multiple video stream detection methods of the above embodiments by reading and executing computer program instructions stored in the memory 304.
Optionally, the electronic apparatus may further include a transmission device 306 and an input/output device 308, where the transmission device 306 is connected to the processor 302, and the input/output device 308 is connected to the processor 302.
Alternatively, in this embodiment, the processor 302 may be configured to execute the following steps by a computer program:
s101, obtaining a video stream to be processed.
S102, converting the video stream to be processed into video frames to be processed.
S103, converting the video frames to be processed in the preset time period into video frames Tensor and pushing the video frames Tensor into a preprocessing queue, wherein the video frames Tensor comprises the number of the video streams to be processed, the length, the width and the RGB of the video frames to be processed.
S104, monitoring the preprocessing queue, acquiring the video frames Tensor in the preprocessing queue, and putting the video frames Tensor in a machine learning engine for reasoning.
S105, acquiring an inference result of the video frames Tensor in the preprocessing queue, wherein the inference result comprises a coordinate Tensor, a classification Tensor, a confidence Tensor and a quantity Tensor of the video frames Tensor in the preprocessing queue.
S106, sending the inference result into a post-processing queue; and monitoring the post-processing queue, acquiring the reasoning result in the post-processing queue, and performing post-processing to obtain a target result.
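Steps S101 to S106 above can be sketched as a small two-queue pipeline. The decoder, inference engine, and post-processor are stand-ins passed as callables with illustrative names; in the described system they would be a video decoder and a CUDA-backed machine learning engine.

```python
# Two-queue pipeline mirroring steps S101-S106: frames are batched into a
# pre-processing queue, inference results flow into a post-processing queue.
import queue

def run_pipeline(streams, infer, postprocess):
    pre_q, post_q = queue.Queue(), queue.Queue()
    # S101-S103: decode the streams into frames and push the batch
    frames = [frame for stream in streams for frame in stream]
    pre_q.put(frames)
    # S104-S105: monitor the pre-processing queue and run inference
    while not pre_q.empty():
        post_q.put(infer(pre_q.get()))   # boxes/classes/scores/counts
    # S106: monitor the post-processing queue and post-process
    targets = []
    while not post_q.empty():
        targets.append(postprocess(post_q.get()))
    return targets
```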
It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.
In addition, in combination with the method for detecting multiple video streams in the above embodiments, an embodiment of the present application may provide a storage medium for implementation. The storage medium stores a computer program; when executed by a processor, the computer program implements any one of the methods of multiple video stream detection in the above embodiments.
It should be understood by those skilled in the art that various features of the above embodiments can be combined arbitrarily, and for the sake of brevity, all possible combinations of the features in the above embodiments are not described, but should be considered as within the scope of the present disclosure as long as there is no contradiction between the combinations of the features.
The above examples merely illustrate several embodiments of the present application, and although their description is specific and detailed, it is not to be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (12)

1. A method for processing multiple video streams, the method comprising:
acquiring a video stream to be processed;
converting a video stream to be processed into a video frame to be processed;
converting a video frame to be processed in a preset time period into a video frame Tensor and pushing the video frame Tensor into a preprocessing queue, wherein the video frame Tensor comprises the number of video streams to be processed, the length, the width and the RGB of the video frame to be processed;
monitoring the preprocessing queue, acquiring video frames Tensor in the preprocessing queue, and putting the video frames Tensor into a machine learning engine for reasoning;
and acquiring an inference result of the video frames Tensor in the preprocessing queue, wherein the inference result comprises a coordinate Tensor, a classification Tensor, a confidence Tensor and a quantity Tensor of the video frames Tensor in the preprocessing queue.
2. The method of claim 1, wherein obtaining the video stream to be processed comprises:
receiving a video stream request, wherein the video stream request comprises a video stream address, a video stream duration, a video stream ID, a request time and an asynchronous callback address;
converting the format of the video stream request into a data structure of a video stream processing unit, and registering the video stream processing unit on a stream executor;
and acquiring the registered video stream information as the video stream to be processed through the video stream registration list.
3. The method of processing multiple video streams of claim 1, further comprising:
sending the inference result into a post-processing queue;
and monitoring the post-processing queue, acquiring the reasoning result in the post-processing queue, and performing post-processing to obtain a target result.
4. The method of processing multiple video streams of claim 1, wherein the machine learning engine comprises:
the model preprocessing module is used for preprocessing the video frame Tensor to obtain a preprocessed Tensor;
the trained model network module is used for reasoning the preprocessed Tensor to obtain a characteristic result Tensor;
and the model post-processing module is used for decoding the characteristic result Tensor to obtain an inference result.
5. The method of claim 4, wherein preprocessing the video frame Tensor comprises:
the video frame Tensor is resized and normalized.
6. The method of processing multiple video streams of claim 4, wherein the trained model network module comprises:
acquiring an existing video frame Tensor;
acquiring feature information of the existing video frame Tensor;
inputting the existing video frame Tensor into the machine learning model, and taking the characteristic information of the existing video frame Tensor as supervision to train the machine learning model to obtain the machine learning model with complete training.
7. The method of claim 3, wherein monitoring the post-processing queue, obtaining the inference result in the post-processing queue for post-processing to obtain the target result comprises:
reading the inference result from the post-processing queue, converting the inference result into a post-processing data structure, and performing post-processing on the post-processing data structure to obtain a target result, wherein the post-processing at least comprises target detection duplication removal, target tracking and similarity comparison;
and pushing the target result to the asynchronous callback address.
8. The method of claim 1, wherein monitoring the preprocessing queue, acquiring the video frames Tensor in the preprocessing queue and putting them into a machine learning engine for inference further comprises:
the inference process is put into a single thread; the CPU uses a thread pool to accelerate inference, and multiple GPUs are used to accelerate inference, wherein multi-GPU accelerated inference comprises counting the running tasks of the multiple GPUs by a reference-counting method: each request sends a video frame Tensor to the GPU with the fewest tasks for inference and increments that GPU's count by one, and completion of the request decrements the count by one.
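The reference-counting dispatch described in claim 8 can be sketched as follows: each request picks the GPU whose running-task count is smallest, increments that count, and decrements it when the request finishes. The class and method names are illustrative, and a lock is added for thread safety, which the claim does not specify.

```python
# Least-loaded GPU selection by reference counting: acquire() picks the
# GPU with the fewest running tasks and bumps its count; release()
# decrements the count when the request completes.
import threading

class GpuDispatcher:
    def __init__(self, num_gpus):
        self.counts = [0] * num_gpus
        self.lock = threading.Lock()

    def acquire(self):
        with self.lock:
            gpu = self.counts.index(min(self.counts))  # least-loaded GPU
            self.counts[gpu] += 1
            return gpu

    def release(self, gpu):
        with self.lock:
            self.counts[gpu] -= 1
```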
9. A multiple video stream processing system, comprising:
the preprocessing module is used for acquiring a video stream to be processed; converting a video stream to be processed into a video frame to be processed; converting a video frame to be processed in a preset time period into a video frame Tensor and pushing the video frame Tensor into a preprocessing queue, wherein the video frame Tensor comprises the number of video streams to be processed, the length, the width and the RGB of the video frame to be processed;
the inference module is used for monitoring the preprocessing queue, acquiring the video frames Tensor in the preprocessing queue and putting the video frames Tensor into the machine learning engine for inference; and acquiring an inference result of the video frames Tensor in the preprocessing queue, wherein the inference result comprises a coordinate Tensor, a classification Tensor, a confidence Tensor and a quantity Tensor of the video frames Tensor in the preprocessing queue.
10. The multiple video stream processing system of claim 9, wherein the system further comprises:
the post-processing module is used for sending the reasoning result into a post-processing queue; and monitoring the post-processing queue, acquiring the reasoning result in the post-processing queue, and performing post-processing to obtain a target result.
11. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is configured to execute the computer program to perform the method for processing multiple video streams of any one of claims 1 to 8.
12. A storage medium having stored thereon a computer program, wherein the computer program is arranged to perform the method for processing multiple video streams of any one of claims 1 to 8 when executed.
CN202010906411.2A 2020-09-01 2020-09-01 Method, system, computer device and storage medium for processing multiple video streams Pending CN112001351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010906411.2A CN112001351A (en) 2020-09-01 2020-09-01 Method, system, computer device and storage medium for processing multiple video streams


Publications (1)

Publication Number Publication Date
CN112001351A true CN112001351A (en) 2020-11-27

Family

ID=73465890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010906411.2A Pending CN112001351A (en) 2020-09-01 2020-09-01 Method, system, computer device and storage medium for processing multiple video streams

Country Status (1)

Country Link
CN (1) CN112001351A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135575A (en) * 2017-12-29 2019-08-16 英特尔公司 Communication optimization for distributed machines study
US10460175B1 (en) * 2017-06-15 2019-10-29 Amazon Technologies, Inc. Deep learning processing of video
CN111124671A (en) * 2019-12-10 2020-05-08 广州小鹏汽车科技有限公司 Batch inference dynamic waiting method, server, and computer-readable storage medium
CN111353949A (en) * 2018-12-21 2020-06-30 英特尔公司 Apparatus and method for efficient distributed denoising of graphics frames


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CDKNIGHT_HAPPY: "DeepStream: Video Analytics for Next-Generation Smart Cities", 《HTTPS://BLOG.CSDN.NET/CDKNIGHT_HAPPY/ARTICLE/DETAILS/88316554》 *
HUDONGLOOP: "PyTorch (IV): Processing Video Data", 《HTTPS://BLOG.CSDN.NET/U011276025/ARTICLE/DETAILS/76098185》 *
ROLAND: "YOLOv3 PyTorch Code Walkthrough", 《HTTPS://ZHUANLAN.ZHIHU.COM/P/114473882》 *
Cloud+ Community - Tencent Cloud: "Understanding Batch Processing, Micro-Batch Processing and Stream Processing", 《HTTPS://CLOUD.TENCENT.COM/DEVELOPER/NEWS/193127》 *
MA Yuanjie: "Materials Management Information System", 30 April 1995 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112995532A (en) * 2021-02-03 2021-06-18 上海哔哩哔哩科技有限公司 Video processing method and device
CN112906803A (en) * 2021-03-01 2021-06-04 重庆紫光华山智安科技有限公司 Model integration method, device, server and computer readable storage medium
CN112906803B (en) * 2021-03-01 2022-11-01 重庆紫光华山智安科技有限公司 Model integration method, device, server and computer readable storage medium
CN115086778A (en) * 2021-03-15 2022-09-20 中移(上海)信息通信科技有限公司 AI processing method and device for video stream
CN115086778B (en) * 2021-03-15 2024-05-24 中移(上海)信息通信科技有限公司 AI processing method and device for video stream
CN113422935A (en) * 2021-07-06 2021-09-21 城云科技(中国)有限公司 Video stream processing method, device and system
CN115866288A (en) * 2021-09-03 2023-03-28 中移(成都)信息通信科技有限公司 Video stream processing method and device, electronic equipment and storage medium
CN115866288B (en) * 2021-09-03 2024-08-27 中移(成都)信息通信科技有限公司 Video stream processing method, device, electronic equipment and storage medium
CN115185667A (en) * 2022-09-13 2022-10-14 天津市天河计算机技术有限公司 Visual application acceleration method and device, electronic equipment and storage medium
CN115185667B (en) * 2022-09-13 2022-12-20 天津市天河计算机技术有限公司 Visual application acceleration method and device, electronic equipment and storage medium
CN117032999A (en) * 2023-10-09 2023-11-10 之江实验室 CPU-GPU cooperative scheduling method and device based on asynchronous running
CN117032999B (en) * 2023-10-09 2024-01-30 之江实验室 CPU-GPU cooperative scheduling method and device based on asynchronous running

Similar Documents

Publication Publication Date Title
CN112001351A (en) Method, system, computer device and storage medium for processing multiple video streams
CN109583325B (en) Face sample picture labeling method and device, computer equipment and storage medium
CN111444878B (en) Video classification method, device and computer readable storage medium
CN108563739B (en) Weather data acquisition method and device, computer device and readable storage medium
US11210522B2 (en) Sample extraction method and device targeting video classification problem
WO2024001123A1 (en) Image recognition method and apparatus based on neural network model, and terminal device
CN110719332B (en) Data transmission method, device, system, computer equipment and storage medium
CN108985451B (en) Data processing method and device based on AI chip
US20220398835A1 (en) Target detection system suitable for embedded device
WO2021104124A1 (en) Method, apparatus and system for determining confinement pen information, and storage medium
WO2019232723A1 (en) Systems and methods for cleaning data
CN112036564B (en) Picture identification method, device, equipment and storage medium
WO2023124278A1 (en) Image processing model training method and apparatus, and image classification method and apparatus
CN107633058B (en) Deep learning-based data dynamic filtering system and method
CN110995652A (en) Big data platform unknown threat detection method based on deep migration learning
WO2024159914A1 (en) Weak supervision time sequence boundary positioning method and apparatus, electronic device, and storage medium
CN112699842A (en) Pet identification method, device, equipment and computer readable storage medium
CN117095460A (en) Self-supervision group behavior recognition method and system based on long-short time relation predictive coding
CN116484881A (en) Training method and device for dialogue generation model, storage medium and computer equipment
CN114550288B (en) Action recognition method and device based on event data
CN114186637A (en) Traffic identification method, traffic identification device, server and storage medium
CN112819173A (en) Sample generation method and device and computer readable storage medium
CN113705291A (en) Training method, device and equipment of video processing network and readable storage medium
CN111917600A (en) Spark performance optimization-based network traffic classification device and classification method
CN111125425A (en) Method, system and device for reading and writing video data and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201127