CN110245267B - Multi-user video stream deep learning sharing calculation multiplexing method - Google Patents

Multi-user video stream deep learning sharing calculation multiplexing method

Info

Publication number
CN110245267B
CN110245267B
Authority
CN
China
Prior art keywords
multiplexing
precision
deep learning
frame
data
Prior art date
Legal status
Active
Application number
CN201910413748.7A
Other languages
Chinese (zh)
Other versions
CN110245267A (en)
Inventor
汤善江
刘言杰
于策
孙超
肖健
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201910413748.7A
Publication of CN110245267A
Application granted
Publication of CN110245267B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval of video data
    • G06F 16/73: Querying
    • G06F 16/735: Filtering based on additional data, e.g. user or group profiles
    • G06F 16/74: Browsing; Visualisation therefor
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783: Retrieval using metadata automatically derived from the content
    • G06F 16/7837: Retrieval using metadata derived from objects detected or recognised in the video content
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to computer and video processing. To provide video query services through technologies such as target recognition and target detection in a multi-user scenario, the invention discloses a multi-user video stream deep learning sharing calculation multiplexing method. First, when a request carrying a detection or recognition operation arrives, requests are merged according to their relevance in the spatial dimension. Then, according to relevance in the time dimension, the system checks whether suitable data are available for multiplexing and calls a deep learning model with configured parameters for analysis. For the non-reusable part, the most suitable parameter configuration for the analysis process is first found according to the speed-precision trade-off; a difference detector and a deep learning model then perform video analysis under that configuration; finally the analysis result is output and stored in a data warehouse, and a promotion module raises the precision of the original results in the database so that high-precision query requests can also be served by multiplexing. The invention is mainly applied to video processing scenarios.

Description

Multi-user video stream deep learning sharing calculation multiplexing method
Technical Field
The invention relates to computer and video processing, and in particular to a multi-user video stream deep learning sharing calculation multiplexing method.
Background
Currently, deep learning has become an important engine driving the deployment and popularization of artificial intelligence applications. This is especially true in computer vision, where the rapid development of deep learning has brought profound change to the field, most representatively through the continuous advance of image analysis technologies such as target detection and target recognition. Target detection can classify objects, for example identifying whether an image contains a dog, a person, or a table, but not the name of the person; target recognition can then identify a person's specific identity. For a single image, applying a target detection model can quickly identify all targets in it, which greatly changes the video analysis process.
In the conventional video analysis process, query services are provided to users mainly through manual labeling. With the development of deep learning, video content can instead be analyzed automatically through technologies such as target recognition and target detection, providing richer and higher-quality query services. For example, when a user edits a video, all clips in which a given person appears can be found automatically by a target recognition algorithm, which is very convenient for the user. Replacing traditional manually labeled video query with deep-learning-based video stream analysis has therefore become a trend. However, because deep learning models have characteristics such as high resource requirements and long training times, multiple users typically share a computing platform, which improves the utilization of the deep learning system's computing resources and reduces enterprise cost.
Disadvantages of the prior art
The existing video analysis technology based on deep learning has two disadvantages: first, the query results are limited in scope; second, the locality of multi-user queries on a shared platform is not exploited. The limited query results are especially evident in systems such as NoScope and Chameleon. In general, a target detection model can identify thousands of different object categories, but at a relatively slow rate. To address the problem that deep learning models are too slow on video data, NoScope proposes training a specialized model with fewer network layers, for example to recognize only vehicles, which greatly improves recognition speed. This approach, however, gives up the generality of the deep learning model and fails to meet users' needs when they query for multiple targets.
On the other hand, systems such as NoScope, Chameleon, and VideoStorm do not address the locality of multi-user queries on a shared platform. This locality appears in several forms:
Locality in the time dimension. First, queries issued in different time periods repeatedly touch the same data: when a new query arrives, some earlier user has often already queried the same content. This is particularly apparent for popular videos. Second, temporal locality also appears as similarity between the contents of successive frames. Because of the nature of video data, successive frames are extremely similar in order to maintain visual continuity. In addition, some surveillance videos, such as camera surveillance of a park, show little change in content during certain periods (e.g., at night). When objects in such frames are identified by a deep learning model, the results for these successive frames will be essentially the same, so the frames cause a large amount of repeated computation.
Locality in the spatial dimension. This is mainly manifested as multiple inference requests, arriving at the same time, querying the same piece of data. Such queries tend to be identical or similar in nature, so their computations can often be trimmed and merged to avoid unnecessary redundant work.
Locality in the logic of data results. Results from different models can be multiplexed logically. For example, target detection and target recognition often have a logically reusable relationship with each other: for a given video frame, when target detection reports that no person is present, the target recognition algorithm does not need to run at all for that frame, avoiding the cost of recognizing people. Conversely, under certain conditions the result of target recognition can be multiplexed directly into target detection: if the recognition algorithm has identified a specific person, a person is necessarily present, so person detection for that frame can be skipped.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a multi-user video stream deep learning sharing calculation multiplexing method that provides video query services through technologies such as target recognition and target detection in a multi-user scenario. It optimizes the locality of multi-user video queries in a shared computing environment so that multiple users can share deep learning inference results, improving running speed and alleviating the problem that the deep learning model is too slow when processing video data. At the same time, it addresses the speed-precision trade-off introduced by sharing. Specifically: first, when a request carrying a detection or recognition operation arrives, requests are merged according to their relevance in the spatial dimension, and overlapping parts are cut away. Then, according to relevance in the time dimension, the request searches the object detection database or object recognition database to check whether suitable data are available for multiplexing. When no reusable data exist, data in the object recognition database or object detection database are applied, according to the relevance between logics, to filter the detection or recognition operation and reduce invalid computation, and then a deep learning model with configured parameters is called for analysis. For the non-reusable part, the most suitable parameter configuration for the analysis process is first found according to the speed-precision trade-off; a difference detector and a deep learning model then perform video analysis under that configuration; finally the analysis results are output and stored in a data warehouse, and a promotion module raises the precision of the original results in the database so that high-precision query requests can also be served by multiplexing.
The relevant parameters such as resolution, selection of a deep learning model and frame skip rate need to be reasonably configured.
Multiplexing the time dimension comprises the following specific steps:
1) Detecting the similarity between the continuous frames by adopting a difference detection mode, acquiring a histogram of the continuous frames by using a difference detector, calculating the distance of the histogram, further calculating the similarity, and judging whether the data can be multiplexed according to the similarity;
2) When a new request arrives, it is split into two parts: a reusable part and a non-reusable part. The data for the reusable part are already stored in the database, so the result can be obtained directly through a much faster database query; the non-reusable part is still processed by the deep learning model, and the resulting data are fed back into the database to facilitate multiplexing by subsequent queries.
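Step 1) above can be sketched in a few lines; this is a minimal illustration rather than the patent's implementation, assuming frames arrive as flat lists of 8-bit grayscale pixel values and using an L1 histogram distance with a hypothetical multiplexing threshold:

```python
def histogram(frame, bins=16):
    """Normalized histogram of pixel values in 0..255 (frame: flat sequence)."""
    hist = [0] * bins
    for v in frame:
        hist[min(int(v) * bins // 256, bins - 1)] += 1
    total = len(frame)
    return [h / total for h in hist]

def similarity(frame_a, frame_b, bins=16):
    """Similarity in [0, 1]: 1 minus half the L1 distance between histograms."""
    ha, hb = histogram(frame_a, bins), histogram(frame_b, bins)
    return 1.0 - sum(abs(a - b) for a, b in zip(ha, hb)) / 2.0

def can_multiplex(frame_a, frame_b, threshold=0.9):
    """Multiplex the reference frame's result when the frames are similar enough."""
    return similarity(frame_a, frame_b) >= threshold
```

Identical frames yield similarity 1.0 and are multiplexed; frames with completely different brightness distributions yield 0.0 and fall through to the deep learning model.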
Multiplexing of spatial dimensions: and combining the requests in a request cutting and combining mode, cutting out overlapping parts in the requests, and reducing repeated requests for multiple times.
Multiplexing of logical dimensions: and (3) establishing association between object detection and object identification data, and finding out data with association between different models so as to filter each other.
Multiplexing of the logic dimension comprises the following specific steps:
the video contains m frames of images, l classes of objects can be detected, and k persons can be identified; data are stored in two matrices D (of size l×m, the object matrix) and R (of size k×m, the person matrix). When a new query arrives, the system first checks whether corresponding data exist in D and R, and uses them directly if so. D_ij = 1 indicates that object class i is present in the j-th frame image, and 0 indicates it is absent. Row 1 of D corresponds to the "person" class; R_ij = 1 indicates that person i appears in the j-th frame, and R_*j denotes the whole j-th column of R;
Theorem one: if D_1j is 0, then for 0 < j <= m, R_*j is also 0;
Theorem two: if R_ij is 1, then for 0 < j <= m, D_1j is also 1;
Theorem one states that when no person is detected in the j-th frame, the target recognition model will also find no person in that frame; theorem two states that when person i is recognized in the j-th frame, the target detection matrix is automatically updated to record a person in that frame. Therefore, during logic multiplexing the database data are dynamically updated according to theorems one and two, establishing the association between target detection and target recognition.
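The D and R matrices and the two theorems can be sketched as follows; this is a minimal illustration in which the `None`-for-unanalysed convention and the function names are assumptions, not the patent's implementation:

```python
# D: l x m object matrix (row 0 is the "person" class); R: k x m person matrix.
# Entries: 1 = present, 0 = absent, None = not yet analysed.

def make_store(l, k, m):
    return ([[None] * m for _ in range(l)],   # D
            [[None] * m for _ in range(k)])   # R

def record_detection(D, R, obj, frame, present):
    D[obj][frame] = 1 if present else 0
    if obj == 0 and not present:          # Theorem one: no person detected
        for row in R:                     # => no specific person in this frame
            row[frame] = 0

def record_recognition(D, R, person, frame, present):
    R[person][frame] = 1 if present else 0
    if present:                           # Theorem two: person i recognised
        D[0][frame] = 1                   # => the "person" class is present

def lookup(M, i, frame):
    """Return a cached result (1/0) or None if the model must still run."""
    return M[i][frame]
```

A detection of "no person" thus filters all later person-recognition queries for that frame, and a successful recognition back-fills the detection matrix, exactly the mutual filtering the theorems describe.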
The method comprises the following specific steps of balancing the precision and the speed of query:
fitting the precision and speed relations of different models under different parameters to obtain the corresponding relation between the speed and the precision of each model, and selecting the optimal parameters for video analysis when the query arrives so as to meet the user requirements;
Step two: according to the Markov chain rule, the precision of the next frame is predicted from the precision of the previous frame and the degree of difference between the two frames, and the value of the adjustment parameter is continually corrected through verification experiments so that accurate precision evaluation can be performed. Specifically, δ_diff denotes the degree of difference between the two frames, A(f_{i-1}) the precision of frame i-1, and k the adjustment parameter; then according to the Markov chain rule, the precision of frame i is:
A(f_i) = k * δ_diff * A(f_{i-1}).
finally, the result multiplexing between reasoning requests is conditional multiplexing;
And step three, retaining the results of the deep learning model and re-detecting part of the difference detector's results with a high-precision model, while fully multiplexing the difference detector's computed similarity between frames and re-evaluating the precision between the successive frames, thereby improving detection precision overall.
Conditional multiplexing specifically involves increasing the frequency with which the deep learning model is invoked.
The invention has the characteristics and beneficial effects that:
the invention builds various deep learning models, thereby realizing the service of providing video query through technologies such as target recognition target detection and the like in a multi-user scene. And the method optimizes the limitation of multi-user video query in a shared computing environment, so that multiple users can share the video query according to the deep learning reasoning result, the running speed is improved, and the problem that the speed of the deep learning model is too low when processing video data is solved. Meanwhile, the balance problem of speed and precision caused by sharing is solved.
Description of the drawings:
fig. 1: the request results of the same type of data are multiplexed.
Fig. 2: and (5) merging the queries.
Fig. 3: multiplexing of logical dimensions.
Fig. 4: the multiple reasoning request results multiplex the overall architecture.
Detailed Description
The invention relates to the fields of computer vision and high-performance computing, and provides a multi-user video stream deep learning sharing calculation multiplexing method. The method provides video query services through technologies such as target recognition and target detection in a multi-user scenario. It optimizes the locality of multi-user video queries in a shared computing environment so that multiple users can share deep learning inference results, improving running speed and alleviating the problem that the deep learning model is too slow when processing video data. At the same time, it addresses the speed-precision trade-off introduced by sharing.
To address the limitations of the prior art, a new video stream analysis method is proposed that supports data sharing in a multi-user shared environment and thereby improves analysis speed. In a shared computing environment, the content of an inference request submitted by a user includes several aspects: the request type (target detection, target recognition, target tracking, etc.), the request data and range, the required precision, and so on. To satisfy users' diverse inference requests, multiple deep learning models must be built in the shared computing environment, with an appropriate model selected dynamically when a request arrives. For the problem of multiplexing results across multiple inference requests, the research focus is to find the relationships among the diverse request contents, construct association relationships among the requests for multiplexing so as to shorten running time, and solve the precision problems that multiplexing introduces, so that user requirements are met.
The method mainly comprises two large modules, namely, establishing the relevance between requests and solving the problem of balance of precision and speed caused by multiplexing.
Constructing associations between requests
Step one: multiplexing of the time dimension. For video data, relevance in the temporal dimension appears in two ways: first, in video analysis there is a high degree of similarity in content between successive frames; second, for hot data, a large number of queries issued at different times retrieve and analyze the same content. Both aspects create many multiplexing opportunities. For the similarity of successive frames: because a deep learning inference request takes a long time, for two frames with high similarity the result of the former frame (called the reference frame) can be multiplexed for the latter frame, avoiding another pass of the deep learning model and improving speed. For this purpose, we detect the similarity between successive frames by difference detection: the difference detector obtains the histograms of successive frames, computes the histogram distance, and from it the similarity. The similarity then determines whether the data can be multiplexed.
The repetition of queries over the same data in different time periods also causes a great deal of redundant computation. Because queries are not correlated with one another, each query independently analyzes the video data, wasting considerable resources. To address this, request results are stored, and when a new query arrives the repeated portion is found and multiplexed directly. As shown in fig. 1, when a new request arrives it is first split into two parts: a reusable part and a non-reusable part. The data of the reusable part already exist in the database, so results can be obtained directly through a much faster database query. The non-reusable part is still processed by the deep learning model to obtain its results, which are then fed back into the database to facilitate multiplexing by subsequent queries. Assuming a video contains m frames, l object classes can be detected, and k persons can be identified, two matrices D (l×m) and R (k×m) are used to store the data. When a new query arrives, the system first checks whether corresponding data exist in D and R and, if so, uses them directly, avoiding repeated computation. D_ij = 1 indicates that object class i is present in the j-th frame image, and 0 indicates that it is absent.
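The split of a request into reusable and non-reusable parts described above can be sketched as follows; the dict-based cache and the `model` callable are illustrative stand-ins for the result database and the deep learning model, not the patent's implementation:

```python
def split_request(frames, cache):
    """Partition requested frames into those answerable from the database
    (reusable) and those that must go through the deep learning model."""
    reusable = {f: cache[f] for f in frames if f in cache}
    pending = [f for f in frames if f not in cache]
    return reusable, pending

def serve_request(frames, cache, model):
    reusable, pending = split_request(frames, cache)
    fresh = {f: model(f) for f in pending}   # deep learning path (slow)
    cache.update(fresh)                      # feed results back for later queries
    return {**reusable, **fresh}
```

Only the frames missing from the cache pay the inference cost, and their results are written back so that subsequent overlapping queries become pure database lookups.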
Step two: multiplexing of the spatial dimension. In a shared computing environment there are many concurrent query operations, and when these concurrent requests target the same piece of data they cause a large amount of repeated computation. The invention merges requests through request cutting and merging, removing the overlapping part and thereby reducing repeated requests. As shown in fig. 2, two simultaneously arriving queries Q1 and Q2 overlap on part of the data. Through query merging, the two requests are combined into one request Q1', avoiding repeated computation of the overlapping part. The same applies when more than two queries arrive at once: the queries are merged so that overlap among the requests is found to the greatest possible extent, reducing duplicate detection.
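Treating each query's data range as an inclusive frame interval, the cutting-and-merging of fig. 2 reduces to classic interval merging; a minimal sketch (the tuple representation of a request is an assumption for illustration):

```python
def merge_queries(queries):
    """Merge overlapping or adjacent [start, end] frame ranges so that each
    frame is analysed at most once (Q1 + Q2 -> Q1' in fig. 2)."""
    merged = []
    for start, end in sorted(queries):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)  # overlap: extend range
        else:
            merged.append([start, end])              # disjoint: new request
    return [tuple(r) for r in merged]
```

After merging, the deep learning pipeline runs once per merged range, and each original query reads its slice of the shared result.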
Step three: multiplexing of the logic dimension. A video analysis system offers various analysis modes, including target detection, target recognition, and target tracking over the data in a video, which requires multiple deep learning models providing different capabilities. Multiplexing opportunities, however, exist between these different models. Most intuitively, if no person is detected in some frames, the target recognition model's person identification can be skipped for those frames entirely; conversely, if target recognition has already identified a person, target detection can likewise skip detecting people in those frames. As shown in fig. 3, we associate target detection data with target recognition data, find data that are related across different models, and use them to filter each other. For example, when the target recognition model recognizes a person, the target recognition database is updated and the data are fed back to the target detection database, where the detection result is marked as "person". When a new inference request asks whether a person is present, it can be filtered directly against this result. We use D_1j to record whether any person is present in the j-th frame, R_ij to record whether person i is present in the j-th frame, and R_*j to denote all persons in the j-th frame. From these properties we summarize the following two theorems.
Theorem one: if D_1j is 0, then for 0 < j <= m, R_*j is also 0.
Theorem two: if R_ij is 1, then for 0 < j <= m, D_1j is also 1.
Theorem one states that when the j-th frame contains no detected person, the target recognition model will also find no person in that frame. Theorem two states that when person i is recognized in the j-th frame, the target detection matrix is automatically updated to record a person in the j-th frame. Therefore, during logic multiplexing the database data can be dynamically updated according to theorems one and two, establishing the association between target detection and target recognition.
Balancing the accuracy and speed of a query
In current deep learning models, although analysis precision keeps improving, the resource consumption that comes with higher precision also grows, requiring longer running times. In general, a high-precision model runs more slowly, and a low-precision model runs faster. In addition, the picture resolution used in the video analysis pipeline and the number of frames skipped by the difference detector also affect precision and speed. This makes parameter configuration during analysis challenging, because in practice different inference requests have different precision and speed requirements. Some requests, such as detecting a kidnapping, need high precision to guarantee correctness, and some speed can be sacrificed for them; other requests, such as traffic-light detection, need high timeliness while precision only has to reach an adequate level. How to reasonably select the model, resolution, frame-skip count, and other parameters during video analysis is therefore the first problem.
Fitting the precision and speed relations of different models under different parameters to obtain the corresponding relation between the speeds and the precision of each model. When the inquiry arrives, the optimal parameters can be selected for video analysis so as to meet the requirements of users.
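Step one can be sketched as an offline profile plus a query-time lookup that picks the fastest configuration satisfying the required precision; the model names, resolutions, and profile numbers below are invented for illustration and are not from the patent:

```python
# Hypothetical offline profile: (model, resolution, skip step) -> (precision, fps).
PROFILES = {
    ("resnet152", 1080, 1): (0.95, 10.0),
    ("resnet50",  720,  2): (0.90, 40.0),
    ("tiny",      480,  5): (0.80, 120.0),
}

def choose_config(required_precision):
    """Fastest profiled configuration whose precision meets the requirement."""
    feasible = [(fps, cfg) for cfg, (prec, fps) in PROFILES.items()
                if prec >= required_precision]
    if not feasible:
        raise ValueError("no configuration meets the precision requirement")
    return max(feasible)[1]   # max over fps
```

A real system would fit a continuous speed-precision curve per model rather than a small table, but the query-time selection logic is the same.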
On the other hand, the difference detector introduces uncertainty into the precision estimate. The difference detector measures the similarity between the current frame and the previous frame (the reference frame), and when similarity is high it directly multiplexes the analysis result of the previous frame, avoiding a costly inference. Direct multiplexing, however, creates a deviation between the multiplexed result and the true result and lowers detection precision. The difference detector therefore needs an effective precision estimate so that the overall effective precision can be computed and the user's precision requirement met accurately. Since the difference detector compares two consecutive frames, the precision of the next frame is closely related to the precision of the previous frame and the degree of difference between the two; this property of successive frames conforms to the Markov chain rule.
Step two: according to the Markov chain rule, the precision of the next frame is predicted from the precision of the previous frame and the degree of difference between the two frames, together with a corresponding adjustment parameter. The value of the adjustment parameter is continually corrected through a large number of verification experiments so that accurate precision evaluation can be performed. We use δ_diff to denote the degree of difference between the two frames, A(f_{i-1}) the precision of frame i-1, and k the adjustment parameter; then according to the Markov chain rule, the precision of frame i is:
A(f_i) = k * δ_diff * A(f_{i-1}).
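The recurrence above can be applied frame by frame across a run handled by the difference detector; a minimal sketch, in which the default k and the reading of δ_diff as a value near 1 for near-identical frames are assumptions:

```python
def predict_precision(a_prev, delta_diff, k=0.98):
    """Precision of frame i from frame i-1: A(f_i) = k * delta_diff * A(f_{i-1})."""
    return k * delta_diff * a_prev

def effective_precision(a0, deltas, k=0.98):
    """Propagate precision across a run of frames skipped by the difference
    detector, starting from a model-detected frame with precision a0."""
    a, history = a0, [a0]
    for d in deltas:
        a = predict_precision(a, d, k)
        history.append(a)
    return history
```

The propagated values let the system decide when the estimated precision of a skipped run has decayed below a query's requirement and a fresh model inference is needed.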
Finally, result multiplexing between inference requests is conditional multiplexing, not unconditional wholesale reuse, and the effect on precision is especially important when multiplexing. For example, when a request demanding high-precision results arrives, an existing low-precision result obviously cannot be multiplexed into it, so there is a problem that data of different precisions cannot be reused directly. The most intuitive remedy is to discard the low-precision result and re-detect with a high-precision model, but that is clearly very expensive. Note instead that the multiplexed data are produced jointly by the deep learning model and the difference detector: for example, the deep learning model detects the first frame and the difference detector handles the following four frames, and the precision of results produced by the deep learning model is higher than that of results produced by the difference detector. Consider the same piece of video data and suppose query Q1 requires 90% precision; according to this requirement the precision-speed balancing module selects a frame-skip step of 5 for detection, i.e., every fifth frame is identified by the deep learning model and the skipped frames obtain their degree of difference from the difference detector. Some time later a new query Q2 arrives requiring 95% precision. Clearly Q1's results cannot be used directly for Q2; suppose a frame-skip step of 2 would satisfy the new requirement. We first multiplex Q1's data, i.e., the frames it already ran through the model, and then, among each run of frames skipped with the difference detector, additionally detect the 3rd frame with the deep learning model, which reaches the effect of a frame-skip step of 2. In this way the effect of conditional multiplexing is achieved.
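The worked example (Q1 at frame-skip step 5, Q2 needing roughly step 2) can be sketched as reusing Q1's model-detected frames and re-detecting the middle frame of each skipped run; this scheduling is an illustrative reading of the text, not the patent's exact rule:

```python
def model_frames(n_frames, step):
    """Frames a query sends to the deep learning model at a given skip step."""
    return list(range(0, n_frames, step))

def redetect_frames(n_frames, step):
    """Extra frames the higher-precision query runs through the model: the
    middle frame of each run the first query skipped (step 5 -> 3, 8, ...)."""
    return [f + (step + 1) // 2 for f in range(0, n_frames - step, step)]
```

Together, Q1's frames {0, 5, 10, ...} plus the re-detected frames {3, 8, ...} leave gaps of at most two or three frames, approximating the step-2 schedule Q2 needs without re-running the model on frames Q1 already covered.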
Step three: the results of the deep learning model are retained, and part of the difference detector's results (e.g., a particular frame among the latter four) are re-detected with a high-precision model. The difference detector's computed similarity between frames is fully multiplexed, and the precision between successive frames is re-evaluated, improving detection precision overall.
Combining the parts above, the overall architecture of the data multiplexing model for multiple inference requests is shown in fig. 4. The system serves diversified video analysis requests, and the deep learning models these requests ultimately use fall into two main types: a target recognition model and a target detection model. When a request with a detection operation (or recognition operation) arrives, the system first merges requests according to their relevance in the spatial dimension and cuts away the overlapping part. Then, in the multiplexing module of the figure, the request searches the object detection database (or object recognition database) according to relevance in the time dimension, checking whether suitable data are available for multiplexing. In addition, based on the relevance between logics, data in the object recognition database (or object detection database) are applied to filter the detection operation (or recognition operation), reducing invalid computation. When no reusable data exist, a deep learning model with configured parameters is called for analysis. Before the deep learning model is invoked, the speed-precision balancing part configures the relevant parameters according to the user's precision and speed requirements, achieving the highest possible speed while meeting the precision requirement. In the inference part of the figure, relevant parameters such as resolution, choice of deep learning model, and frame-skip rate must be configured reasonably; the difference detector and the deep learning model are then called for video analysis, and the analysis results are output and stored in the data warehouse. The precision promotion module continuously raises the precision of results in the database so that high-precision query requests can also reuse them.

Claims (3)

1. A multi-user video stream deep learning sharing calculation multiplexing method, characterized in that: first, when a request carrying a detection operation or a recognition operation arrives, the requests are merged according to their relevance in the spatial dimension and the overlapping parts of the requests are cropped out; then, according to the relevance in the temporal dimension, the object detection database or the object recognition database is searched to query whether data are available for multiplexing; when no reusable data exist, according to the relevance between logics, the data in the object recognition database or the object detection database are applied to filter the detection operation or the recognition operation so as to reduce invalid computation, and then a deep learning model with configured parameters is invoked for analysis; for the non-reusable part, the parameter configuration of the analysis process is first determined according to the speed-precision trade-off, then the difference detector and the deep learning model are invoked to perform video analysis under that configuration, and finally the analysis results are output and stored in the data warehouse, while a promotion module raises the precision of the original results in the database so that they can be multiplexed by high-precision query requests; the relevant parameters, including resolution, choice of deep learning model and frame-skip rate, must be configured reasonably; the balancing of query precision and speed comprises the following specific steps:
step one, the precision-speed relations of different models under different parameters are fitted to obtain the correspondence between speed and precision for each model, and when a query arrives, the optimal parameters are selected for video analysis so as to meet the user's requirements;
step two, according to the Markov chain rule, the precision of the next frame is predicted from the precision of the previous frame and the difference between the two frames, and the value of the adjustment parameter is continuously corrected through verification experiments so that an accurate precision evaluation can be performed; specifically, δdiff denotes the difference between the two frames, A(fi−1) denotes the precision of the (i−1)-th frame, and k is the adjustment parameter; then, according to the Markov chain rule, the precision of the i-th frame is:
A(fi) = k * δdiff * A(fi−1);
finally, result multiplexing among the inference requests is conditional multiplexing, that is, the frequency with which the deep learning model is invoked is increased; step three, the results of the deep learning model are retained, part of the results of the difference detector are re-detected with a high-precision model, the difference detector's computation of the similarity between two frames is fully multiplexed, and the precision of the consecutive frames is re-evaluated, thereby improving detection precision as a whole;
wherein multiplexing of the logical dimension: an association is established between object detection data and object recognition data, and data that are associated across different models are found so that the models can filter each other; the multiplexing of the logical dimension comprises the following specific steps:
a video contains m frames of images, with l classes of detectable objects and k recognizable persons; the data are stored in two matrices D and R, where D, of size l×m, is the object matrix and R, of size k×m, is the person matrix; when a new query arrives, D and R are queried for corresponding data, and if such data exist they are used directly; Dij = 1 indicates that the i-th class of object is present in the j-th frame, and 0 that it is absent; Rij = 1 indicates that the i-th person is present in the j-th frame, and 0 that the person is absent; D1j records whether any person is present in the j-th frame; R*j denotes all persons in the j-th frame;
theorem one: if D1j is 0, then for 0 < j <= m, R*j is also 0;
theorem two: if Rij is 1, then for 0 < j <= m, D1j is also 1;
theorem one states that when no person is detected in the j-th frame, the object recognition model will likewise recognize no person in that frame; theorem two states that when person i is recognized in the j-th frame, the object detection matrix automatically records a person in the j-th frame; thus, during logical multiplexing the database data are dynamically updated according to theorems one and two, establishing the association between object detection and object recognition.
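The two theorems above can be sketched directly on the D and R matrices. This is a minimal illustration, assuming (as the claim does) that row 0 of D is the "person" class; the function names are hypothetical.

```python
# Logical-dimension multiplexing sketch on the D (object) and R (person) matrices.
# D[0][j] is the "person" class for frame j; entries are 1 (present) or 0 (absent).

def recognition_needed(D, j):
    """Theorem one: if no person is detected in frame j, recognition can be skipped."""
    return D[0][j] == 1

def record_recognition(D, R, i, j):
    """Theorem two: recognizing person i in frame j implies a person is detected there."""
    R[i][j] = 1
    D[0][j] = 1   # dynamically update the detection matrix
```

In this way each model's database filters queries to the other, which is the mutual filtering described in the claim.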
2. The multi-user video stream deep learning sharing calculation multiplexing method of claim 1, wherein the multiplexing of the temporal dimension comprises the following specific steps:
1) the similarity between consecutive frames is detected by means of difference detection: a difference detector obtains the histograms of the consecutive frames, computes the histogram distance and from it the similarity, and judges from the similarity whether the data can be multiplexed;
2) when a new request arrives, it is split into a reusable part and a non-reusable part; the data of the reusable part are already stored in the database, so the result can be obtained directly through a database query, whose running time is much shorter; the non-reusable part is still processed by the deep learning model to obtain the request result, and that result is fed back into the database to facilitate multiplexing by subsequent queries.
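A minimal difference-detector sketch for step 1) might look as follows. The claim does not fix the histogram type or distance metric, so the grayscale histogram, normalized L1 distance, and the 0.9 threshold here are all assumptions for illustration.

```python
# Difference-detector sketch: histogram distance between consecutive frames.

def histogram(pixels, bins=4, max_val=256):
    """Grayscale histogram of a flat list of pixel values in [0, max_val)."""
    h = [0] * bins
    for p in pixels:
        h[p * bins // max_val] += 1
    return h

def similarity(frame_a, frame_b, bins=4):
    """1 - normalized L1 histogram distance; 1.0 means identical histograms."""
    ha, hb = histogram(frame_a, bins), histogram(frame_b, bins)
    dist = sum(abs(a - b) for a, b in zip(ha, hb))
    return 1.0 - dist / (2 * len(frame_a))

def can_multiplex(frame_a, frame_b, threshold=0.9):
    """Reuse the cached result when consecutive frames are similar enough."""
    return similarity(frame_a, frame_b) >= threshold
```

When `can_multiplex` returns true, the stored result of the earlier frame is served instead of re-running the deep learning model.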
3. The multi-user video stream deep learning sharing calculation multiplexing method of claim 1, wherein the multiplexing of the spatial dimension: the requests are combined by means of request cropping and merging, the overlapping parts of the requests are cut out, and repeated requests are thereby reduced.
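Spatial-dimension multiplexing can be sketched as merging request regions that overlap so the shared area is analyzed only once. The rectangle representation `(x1, y1, x2, y2)` and the greedy bounding-box union used here are assumptions, not the patent's exact cropping procedure.

```python
# Spatial multiplexing sketch: merge overlapping request regions.

def overlaps(a, b):
    """Axis-aligned rectangles (x1, y1, x2, y2) intersect with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def merge_requests(regions):
    """Greedily merge each overlapping pair into its bounding-box union."""
    merged = []
    for r in regions:
        for i, m in enumerate(merged):
            if overlaps(r, m):
                merged[i] = (min(r[0], m[0]), min(r[1], m[1]),
                             max(r[2], m[2]), max(r[3], m[3]))
                break
        else:
            merged.append(r)
    return merged
```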
CN201910413748.7A 2019-05-17 2019-05-17 Multi-user video stream deep learning sharing calculation multiplexing method Active CN110245267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910413748.7A CN110245267B (en) 2019-05-17 2019-05-17 Multi-user video stream deep learning sharing calculation multiplexing method

Publications (2)

Publication Number Publication Date
CN110245267A CN110245267A (en) 2019-09-17
CN110245267B true CN110245267B (en) 2023-08-11

Family

ID=67884171

Country Status (1)

Country Link
CN (1) CN110245267B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553477A (en) * 2020-04-30 2020-08-18 深圳市商汤科技有限公司 Image processing method, device and storage medium
US11488117B2 (en) * 2020-08-27 2022-11-01 Mitchell International, Inc. Systems and methods for managing associations between damaged parts and non-reusable parts in a collision repair estimate

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009150425A2 (en) * 2008-06-10 2009-12-17 Half Minute Media Ltd Automatic detection of repeating video sequences
CN102918531A (en) * 2010-05-28 2013-02-06 Oracle International Corporation Systems and methods for providing multilingual support for data used with a business intelligence server
US9176987B1 (en) * 2014-08-26 2015-11-03 TCL Research America Inc. Automatic face annotation method and system
CN107480178A (en) * 2017-07-01 2017-12-15 广州深域信息科技有限公司 A kind of pedestrian's recognition methods again compared based on image and video cross-module state
CN108235001A (en) * 2018-01-29 2018-06-29 上海海洋大学 A kind of deep-sea video quality objective assessment model based on space-time characteristic
CN108549846A (en) * 2018-03-26 2018-09-18 北京航空航天大学 A kind of movement character combined and head and shoulder structure pedestrian detection and statistical method
CN108846384A (en) * 2018-07-09 2018-11-20 北京邮电大学 Merge the multitask coordinated recognition methods and system of video-aware

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Modeling Theory and Methods for Data Warehouse Fact Tables; Wan Nanyang; Computer Engineering; 2002-11-20 (No. 11); full text *

Similar Documents

Publication Publication Date Title
US20200219268A1 (en) Target tracking methods and apparatuses, electronic devices, and storage media
US6246790B1 (en) Image indexing using color correlograms
US11776257B2 (en) Systems and methods for enhancing real-time image recognition
US20150161485A1 (en) Learning Semantic Image Similarity
CN111400607B (en) Search content output method and device, computer equipment and readable storage medium
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
US20240153240A1 (en) Image processing method, apparatus, computing device, and medium
CN110245267B (en) Multi-user video stream deep learning sharing calculation multiplexing method
US9213919B2 (en) Category histogram image representation
CN113361645B (en) Target detection model construction method and system based on meta learning and knowledge memory
CN114153795B (en) Method and device for intelligently calling electronic archive, electronic equipment and storage medium
CN109711443A (en) Floor plan recognition methods, device, equipment and storage medium neural network based
CN107358141A (en) The method and device of data identification
CN113569740B (en) Video recognition model training method and device, and video recognition method and device
CN113051984A (en) Video copy detection method and apparatus, storage medium, and electronic apparatus
Ye et al. Person tracking and reidentification for multicamera indoor video surveillance systems
CN115457082A (en) Pedestrian multi-target tracking algorithm based on multi-feature fusion enhancement
Ma et al. Traffic sign detection based on improved YOLOv3 in foggy environment
CN111538550A (en) Webpage information screening method based on image detection algorithm
Li et al. Study on semantic image segmentation based on convolutional neural network
CN116129150A (en) Unmanned aerial vehicle target tracking method integrating global tracker and local tracker
JP2003022442A (en) Method and device for object detection and position measurement, execution program for the same method, and its recording medium
Guo et al. ANMS: attention-based non-maximum suppression
CN114445875A (en) Deep learning-based identity recognition and face comparison system and training method
CN114692650A (en) Social platform interaction information processing method and system for big data recommendation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant