CN111131835B - Video processing method and system - Google Patents

Video processing method and system

Info

Publication number
CN111131835B
CN111131835B (application CN201911410027.7A)
Authority
CN
China
Prior art keywords
frame
frames
grouping
gof
change
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911410027.7A
Other languages
Chinese (zh)
Other versions
CN111131835A (en)
Inventor
张德宇 (Zhang Deyu)
罗云臻 (Luo Yunzhen)
张尧学 (Zhang Yaoxue)
贾富程 (Jia Fucheng)
段思婧 (Duan Sijing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN201911410027.7A
Publication of CN111131835A
Application granted
Publication of CN111131835B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/13 Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/132 Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N 19/625 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]

Abstract

The invention discloses a video processing method and system. The method comprises: S1, grouping frames of a video to be processed to obtain frame groupings, and dividing the frames within each frame grouping into a base frame and change frames; S2, taking minimum delay as the optimization objective, determining the processing entity for the base frame and the change frames within the frame grouping and dispatching them to the determined processing entities, the processing entities comprising a local end and a server end; and S3, recognizing the base frame and the change frames by means of the processing entities to obtain a recognition result. The method effectively reduces recognition delay while ensuring recognition accuracy.

Description

Video processing method and system
Technical Field
The present invention relates to the field of video processing technology, and in particular to a video processing method and system, more particularly to a method and system for video processing and dynamic action recognition.
Background
With the development of photography and videography on mobile devices, recording daily life with short videos has become a clear trend. According to the short-video industry report of iiMedia Research, the amount of video uploaded from mobile devices to video platforms in China alone is already enormous; for example, the number of short videos on Douyin ("TikTok") and Xigua Video has exceeded two billion. Both everyday experience and experimental verification show that such videos contain a great deal of information, such as abnormal events, human-human interactions, and human-object interactions.
Deep learning is one of the most effective ways to identify and extract the information contained in a video; specifically, a convolutional neural network (CNN) is typically used to process the video frames. However, deep learning generally involves a large amount of computation and therefore incurs a large computation delay. As shown in reference 1 (L. N. Huynh, Y. Lee, and R. K. Balan, "DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications," in Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), 2017), even with the support of a mobile GPU, typical CNN processing of a single video frame takes about 600 milliseconds.
Therefore, much recent attention has been paid to improving the efficiency of deep learning tasks on mobile devices. For example, reference 2 (M. Xu, M. Zhu, Y. Liu, F. X. Lin, and X. Liu, "DeepCache: Principled Cache for Mobile Deep Vision," in Proceedings of the 24th Annual International Conference on Mobile Computing and Networking (MobiCom), 2018) proposed DeepCache, which uses the content of an input frame as the cache key and the inference result as the cache value. By exploiting the information redundancy between consecutive frames in a video, DeepCache can reuse cached inference results across frames, significantly reducing execution time and energy consumption. Reference 1 likewise improves the efficiency of mobile deep learning, for example by decomposing the CNN model and offloading convolution layers to the GPU of the mobile device to increase computation speed.
Some of these existing mobile deep learning approaches perform deep learning on single video frames in isolation, that is, without deeply considering the relationship between frames, so the motion information contained in a short video cannot be recognized efficiently; the others focus on static information such as object detection. Recognizing dynamic information therefore still requires intensive study.
Disclosure of Invention
The technical problem to be solved by the invention is: in view of the above problems in the prior art, the invention provides a video processing method and system that effectively reduce recognition delay while ensuring recognition accuracy.
In order to solve the above technical problem, the technical solution provided by the invention is: a video processing method, comprising:
S1, grouping frames of a video to be processed to obtain frame groupings (Groups of Frames, GoF), and dividing the frames within each frame grouping into a base frame and change frames;
S2, taking minimum delay as the optimization objective, determining the processing entity for the base frame and the change frames within the frame grouping, and dispatching them to the determined processing entities; the processing entities comprise a local end and a server end;
and S3, recognizing the base frame and the change frames by means of the processing entities to obtain a recognition result.
Further, the first frame within the frame grouping is a base frame, and the remaining frames are change frames.
Further, the data information recorded by a change frame comprises the change amount between the change frame and the previous frame;
the change amount comprises a motion vector and a residual.
Further, step S1 is preceded by a prediction step S0, which specifically comprises: predicting the sampling rate and the number of change frames of the frame groupings through an intelligent algorithm model according to a preset accuracy.
Further, taking minimum delay as the optimization objective in step S2 specifically comprises optimizing the objective function shown below:
(LOP)  min_{S_GoF, n_P, O_I(t), O_P(t)}  max( Q_m(T), Q_s(T) )
s.t.  f(S_GoF, n_P) ≥ Λ,
      0 < S_GoF ≤ 1,
      0 < n_P ≤ 11,
      O_I(t), O_P(t) ∈ {0, 1}.
In the above formula, LOP is short for Latency Optimization Problem; f(S_GoF, n_P) is the accuracy achievable at sampling rate S_GoF and change-frame number n_P; Λ is the preset accuracy; O_I(t) is the allocation decision for the base frame when the t-th frame grouping arrives; O_P(t) is the allocation decision for the change frames when the t-th frame grouping arrives; S_GoF is the sampling rate; n_P is the number of change frames within a frame grouping; Q_m(T) is the time required for local processing under the allocation; Q_s(T) is the time required for server-side processing under the allocation; and T is the maximum duration limit of the video.
Further, taking minimum delay as the optimization objective in step S2 may alternatively comprise optimizing the objective function shown below:
(mod-LOP)  min_{O_I(t), O_P(t)}  max( Q_m(t), Q_s(t) )
s.t.  O_I(t), O_P(t) ∈ {0, 1}.
In the above formula, mod-LOP is short for the simplified (modified) Latency Optimization Problem; O_I(t) is the allocation decision for the base frame when the t-th frame grouping arrives; O_P(t) is the allocation decision for the change frames when the t-th frame grouping arrives; Q_m(t) is the time required for local processing under the allocation; and Q_s(t) is the time required for server-side processing under the allocation.
A video processing system comprises a frame grouping module, an allocation module, and a result processing module;
the frame grouping module is configured to group frames of a video to be processed to obtain frame groupings, and to divide the frames within each frame grouping into a base frame and change frames;
the allocation module is configured to determine, taking minimum delay as the optimization objective, the processing entity for the base frame and the change frames within the frame grouping, and to dispatch them to the determined processing entities; the processing entities comprise a local end and a server end;
the result processing module is configured to acquire the results of the processing entities recognizing the base frame and the change frames, so as to obtain a recognition result.
Further, the first frame within the frame grouping is the base frame, and the remaining frames are change frames; the data information recorded by a change frame comprises the change amount between the change frame and the previous frame; and the change amount comprises a motion vector and a residual.
Further, the system comprises a prediction module configured to predict the sampling rate and the number of change frames of the frame groupings through an intelligent algorithm model according to a preset accuracy.
Further, the allocation module takes minimum delay as the optimization objective, specifically by optimizing the objective function shown below:
(LOP)  min_{S_GoF, n_P, O_I(t), O_P(t)}  max( Q_m(T), Q_s(T) )
s.t.  f(S_GoF, n_P) ≥ Λ,
      0 < S_GoF ≤ 1,
      0 < n_P ≤ 11,
      O_I(t), O_P(t) ∈ {0, 1}.
In the above formula, LOP is short for Latency Optimization Problem; f(S_GoF, n_P) is the accuracy achievable at sampling rate S_GoF and change-frame number n_P; Λ is the preset accuracy; O_I(t) is the allocation decision for the base frame when the t-th frame grouping arrives; O_P(t) is the allocation decision for the change frames when the t-th frame grouping arrives; S_GoF is the sampling rate; n_P is the number of change frames within a frame grouping; Q_m(T) is the time required for local processing under the allocation; Q_s(T) is the time required for server-side processing under the allocation; and T is the maximum duration limit of the video;
or:
the allocation module takes the minimum delay as an optimization target, and specifically comprises the following steps of optimizing through an objective function shown as the following formula:
Figure GDA0002827972500000041
s.t.OI(t),OP(t)∈{0,1}.
in the above formula, mod-LOP is a name abbreviation of a simplified delay Optimization Problem (modified delay Optimization Problem), OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determination of the allocation of the changed frames when the t-th frame grouping arrives, Qm(T) is the time required for local processing, Q, according to the allocationsAnd (T) is the time required for the server to process according to the distribution condition.
Compared with the prior art, the invention has the advantages that:
1. The invention groups the video into frame groupings consisting of a base frame and change frames and, taking minimum delay as the optimization objective, assigns a processing entity to the base frame and the change frames, that is, allocates the frames within a frame grouping to the local end or the server end for recognition. This not only ensures recognition accuracy but also greatly reduces the delay of image processing, substantially shortening the delay perceived by the user.
2. The invention predicts the sampling rate and the number of change frames of the frame groupings through an intelligent algorithm model. The model is trained on the accuracies achieved under different sampling rates S_GoF and change-frame numbers n_P; once trained, it computes, for a given accuracy, a set of parameters (S_GoF, n_P) that meets the accuracy requirement, effectively reducing the computation required for video processing.
Drawings
Fig. 1 is a schematic diagram of a video motion recognition process according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a system architecture according to an embodiment of the present invention.
Fig. 3 is a diagram illustrating an example of a GoF assignment decision upon arrival according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of the relationship between accuracy, the sampling rate S_GoF, and the number of change frames n_P according to an embodiment of the present invention.
Fig. 5 shows the delay of running ResNet-18 and ResNet-152 on the mobile device Xiaomi MI 8 according to an embodiment of the present invention.
Fig. 6 shows the delay of running ResNet-18 and ResNet-152 on the edge server according to an embodiment of the present invention.
Fig. 7 compares the delays obtained when block search is implemented in OpenCL and in RenderScript (RS, a component of the Android mobile operating system that provides an API for heterogeneous hardware acceleration) according to an embodiment of the present invention.
Fig. 8 shows the GPU occupancy under the two block search implementations, OpenCL and RS, according to an embodiment of the present invention.
Fig. 9 shows the delays obtained when video compression is implemented in the two ways, OpenCL and RS+JNI (Java Native Interface), with the video compression process run in parallel with the I-frame inference process, according to an embodiment of the present invention.
Fig. 10 compares different channel conditions and different accuracy requirements (poor channel states) according to an embodiment of the present invention.
Fig. 11 compares different channel conditions and different accuracy requirements (good channel states) according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments, without thereby limiting the scope of protection of the invention.
As shown in Fig. 1 and Fig. 2, the video processing method of this embodiment comprises: S1, grouping frames of a video to be processed to obtain frame groupings, and dividing the frames within each frame grouping into a base frame and change frames; S2, taking minimum delay as the optimization objective, determining the processing entity for the base frame and the change frames within the frame grouping and dispatching them to the determined processing entities, the processing entities comprising a local end and a server end; and S3, recognizing the base frame and the change frames by means of the processing entities to obtain a recognition result.
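For illustration, the grouping of step S1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: S_GoF is taken as the fraction of frame groupings kept (sampled uniformly over the video), each grouping holds one base frame plus n_P change frames, and the names GoF and build_gofs are illustrative rather than part of the invention.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class GoF:
    """A frame grouping: one base frame (I) plus several change frames (P)."""
    base: np.ndarray            # the I frame
    changes: List[np.ndarray]   # the P frames (MVs/residuals extracted later)


def build_gofs(frames: List[np.ndarray], s_gof: float, n_p: int) -> List[GoF]:
    """Split a frame sequence into groupings of 1 + n_p frames, then keep a
    fraction s_gof of the groupings, spaced evenly over the video."""
    size = 1 + n_p
    gofs = [
        GoF(base=frames[i], changes=frames[i + 1:i + size])
        for i in range(0, len(frames) - size + 1, size)
    ]
    if not gofs:
        return []
    keep = max(1, round(s_gof * len(gofs)))
    idx = np.linspace(0, len(gofs) - 1, keep).astype(int)  # evenly spaced picks
    return [gofs[i] for i in idx]
```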
In this embodiment, for convenience of description, a mobile device (e.g., a smartphone) serves as the local end, and an edge server connected to the mobile device over a network serves as the server end. The mobile device captures video through its camera and then recognizes the actions in the captured video. Specifically, a Xiaomi MI 8 running Android 9 is used as the mobile device, a desktop computer running Ubuntu is used as the edge server (CPU: Intel Core i7-8700K; GPU: GeForce RTX 2080), and the UCF-101 dataset is used to implement the technical solution of the method.
In this embodiment, step S1 is preceded by a prediction step S0, which specifically comprises: predicting the sampling rate and the number of change frames of the frame groupings through an intelligent algorithm model (the offline predictor in Fig. 2) according to a preset accuracy. There is a functional relationship between the frame sampling rate and the accuracy of motion recognition, as shown in Fig. 4. In this embodiment, this relationship is learned by the intelligent algorithm model: the model is trained on the accuracies achieved under different sampling rates S_GoF and change-frame numbers n_P, and the trained model can compute, for a given accuracy, a set of parameters (S_GoF, n_P). The intelligent algorithm model is preferably an offline model that is trained in advance and then loaded onto the mobile device to compute the sampling rate of the frame groupings and the number of change frames.
In this embodiment, f(S_GoF, n_P) is preferably used to characterize the functional relationship between the sampling rate, the number of change frames, and the accuracy of motion recognition, in the form of a bivariate cubic polynomial; by fitting this polynomial to the measured values in Fig. 4, the sampling rate S_GoF and the number of change frames n_P can be obtained. The bivariate cubic polynomial can be expressed as:
f(S_GoF, n_P) = p_00 + p_10·S_GoF + p_01·n_P + p_20·S_GoF^2 + p_11·S_GoF·n_P + p_02·n_P^2 + p_30·S_GoF^3 + p_21·S_GoF^2·n_P + p_12·S_GoF·n_P^2 + p_03·n_P^3
In the above formula, p_00, p_10, p_01, p_20, p_11, p_02, p_30, p_21, p_12, and p_03 are the coefficients of the bivariate cubic polynomial, and the remaining parameters are defined as above. Table 1 below shows a fit of the bivariate cubic polynomial on the UCF-101 dataset.
Table 1 (the fitted values are reproduced only as an image in the original document):
the Accuracy Setting field is input into an intelligent algorithm model, the Sampled GoF is converted into a fractional form because the GoF samples are discrete in the actual operation of the system, 1261 in Data (1261) represents the number of videos selected for testing in a test set of a UCF-101 Data set, tuple content under the field represents the Accuracy rate obtained by actual testing, and Gap represents the difference value between the actually measured Accuracy rate and the input Accuracy rate.
In this embodiment, the first frame within a frame grouping is the base frame (I frame), and the remaining frames are change frames (P frames). The data information recorded by a change frame comprises the change amount between the change frame and the previous frame; the change amount comprises a motion vector and a residual. Preferably, each change frame is encoded using only the previous frame as a reference, recording only the change between the two. The change consists of a motion vector (MV) and a residual: the motion vectors, obtained by block search, represent the movement of pixel blocks between the two frames, and the residual represents the difference between the actual frame and the frame restored from the motion vectors. In this embodiment, the delays of the two block search implementations, OpenCL and RenderScript (RS), are shown in Fig. 7, and the GPU occupancy of the two implementations is shown in Fig. 8.
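The block search that produces the motion vectors and the residual can be sketched on the CPU as follows (this embodiment runs it on the mobile GPU via OpenCL or RenderScript). The sketch assumes single-channel (grayscale) frames, an exhaustive search using the sum of absolute differences (SAD) as the matching criterion, and frame dimensions divisible by the block size; block size and search radius are illustrative.

```python
import numpy as np


def block_search(prev: np.ndarray, cur: np.ndarray, block: int = 16, radius: int = 7):
    """For each block of `cur`, find the best match in `prev` within
    +/-`radius` pixels; return the motion vectors and the residual
    (the difference between `cur` and its motion-compensated prediction)."""
    h, w = cur.shape
    mvs = np.zeros((h // block, w // block, 2), dtype=int)
    pred = np.zeros_like(cur)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            ref = cur[by:by + block, bx:bx + block].astype(np.int32)
            best_sad, best_mv = None, (0, 0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue
                    cand = prev[y:y + block, x:x + block].astype(np.int32)
                    sad = int(np.abs(ref - cand).sum())
                    if best_sad is None or sad < best_sad:
                        best_sad, best_mv = sad, (dy, dx)
            mvs[by // block, bx // block] = best_mv
            dy, dx = best_mv
            pred[by:by + block, bx:bx + block] = \
                prev[by + dy:by + dy + block, bx + dx:bx + dx + block]
    residual = cur.astype(np.int32) - pred.astype(np.int32)
    return mvs, residual
```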
In this embodiment, the base frame is preferably recognized with the large CNN model ResNet-152, and the change frames are recognized with the small CNN model ResNet-18 (that is, the motion vectors and residuals are recognized). By filtering out the redundant information between frames, this design significantly reduces the complexity of motion recognition while achieving good accuracy. Within each frame grouping, the motion vectors and the residuals of all the change frames are respectively accumulated, which enhances the information contained in the change frames and reduces the number of inference passes of the small CNN model (ResNet-18). Both ResNet-152 and ResNet-18 are deep learning models.
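The two-model split and the per-GoF accumulation of motion vectors and residuals can be sketched as follows; big_model and small_model stand in for the ResNet-152 and ResNet-18 inference calls, whose exact input formats are placeholders here, and block_search is the routine sketched above.

```python
def recognize_gof(base_frame, p_frames, big_model, small_model, block_search):
    """Run the large model on the I frame; accumulate the MVs and residuals
    of all P frames so the small model runs once per GoF, not once per frame."""
    base_scores = big_model(base_frame)              # e.g. ResNet-152
    mv_sum, res_sum, prev = None, None, base_frame
    for p in p_frames:
        mv, res = block_search(prev, p)
        mv_sum = mv if mv_sum is None else mv_sum + mv
        res_sum = res if res_sum is None else res_sum + res
        prev = p
    change_scores = small_model(mv_sum, res_sum)     # e.g. ResNet-18
    return base_scores, change_scores
```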
In this embodiment, as shown in Fig. 2, after the frame groupings are divided and the base frame and change frames within each grouping are determined, the base frame and change frames are allocated either to local processing or to server-side processing. Different allocations take different amounts of time, and the delay experienced by the user differs accordingly. Therefore, in this embodiment, taking minimum delay as the optimization objective, the processing entities of the base frame and change frames within a frame grouping are determined, and the frames are dispatched to the determined processing entities. Taking minimum delay as the optimization objective in step S2 specifically comprises optimizing the objective function given by formula (LOP) or formula (mod-LOP) below.
In this embodiment, formula (LOP) is:
(LOP)  min_{S_GoF, n_P, O_I(t), O_P(t)}  max( Q_m(T), Q_s(T) )
s.t.  f(S_GoF, n_P) ≥ Λ,
      0 < S_GoF ≤ 1,
      0 < n_P ≤ 11,
      O_I(t), O_P(t) ∈ {0, 1}.
In the above formula, LOP is short for Latency Optimization Problem; f(S_GoF, n_P) is the accuracy achievable at sampling rate S_GoF and change-frame number n_P; Λ is the preset accuracy; O_I(t) is the allocation decision for the base frame when the t-th frame grouping arrives; O_P(t) is the allocation decision for the change frames when the t-th frame grouping arrives; S_GoF is the sampling rate; n_P is the number of change frames within a frame grouping; Q_m(T) is the time required for local processing under the allocation; Q_s(T) is the time required for server-side processing under the allocation; and T is the maximum duration limit of the video.
Since this embodiment uses the intelligent algorithm model to predict the sampling rate S_GoF and the number of change frames n_P, these two variables can be removed from (LOP), reducing it to a scheduling problem with the allocation decisions O_I(t) and O_P(t) as variables. Considering changes in the system state, the values of Q_s(t) and Q_m(t) change as successive frame groupings arrive; therefore, the objective function can be optimized as follows:
(mod-LOP)  min_{O_I(t), O_P(t)}  max( Q_m(t), Q_s(t) )
s.t.  O_I(t), O_P(t) ∈ {0, 1}.
In the above formula, mod-LOP is short for the simplified (modified) Latency Optimization Problem; O_I(t) is the allocation decision for the base frame when the t-th frame grouping arrives; O_P(t) is the allocation decision for the change frames when the t-th frame grouping arrives; Q_m(t) is the time required for local processing under the allocation; and Q_s(t) is the time required for server-side processing under the allocation. Q_m(t) and Q_s(t) are determined from the system state obtained by the system analyzer and are updated after a frame grouping is completed.
In this embodiment, for the deep learning models, the mobile device uses TensorFlow Lite, a deep learning framework designed for mobile devices, and the edge server uses PyTorch.
In this embodiment, when the base frame and/or change frames are allocated to the server for processing, they are compressed and then sent to the server over the network. During compression, a base frame is compressed with intra prediction, Discrete Cosine Transform (DCT), and Entropy Coding (EC) according to the H.264 standard. For motion vectors and residuals, DCT and EC are applied directly; to avoid any impact on precision, the quantization step is removed so that the compression is lossless. The compressed data is packaged and sent to the server over TCP/IP. When the server receives the data, it decodes the data with a decoder and then runs the CNN models to recognize the actions in the frames, obtaining a score for each recognized action. When the base frame and/or change frames are recognized locally, the mobile device runs the CNN models on them directly to obtain the action scores. The scores of all actions are combined by weighted summation, and the action with the highest score is taken as the recognition result (that is, the label). To improve the performance of compression-based deep learning inference, OpenCL is used to implement the compression-related operations on the mobile GPU, so that inference running on the mobile CPU can proceed in parallel with video compression. As shown in Fig. 9, video compression is implemented in two ways, OpenCL and RS+JNI (Java Native Interface), with the video compression process run in parallel with the I-frame inference process. In Fig. 9, the legend "reference line" denotes the delay when only I-frame inference runs alone; "OpenCL implementation" denotes the delay when video compression implemented in OpenCL runs in parallel with I-frame inference; and "RS+JNI implementation" denotes the delay when video compression implemented in RS+JNI runs in parallel with I-frame inference.
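The final fusion described above (weighted summation of the action scores, then taking the highest-scoring action as the label) can be sketched as follows; the weighting scheme is not specified in this description, so the weights argument is a placeholder.

```python
import numpy as np


def fuse_scores(score_list, weights, labels):
    """Weighted-sum fusion of per-stream class scores; the top class wins."""
    total = sum(w * np.asarray(s, dtype=float) for w, s in zip(weights, score_list))
    return labels[int(np.argmax(total))]
```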
In this embodiment, the allocation decisions O_I(t) and O_P(t) give rise to four cases. Q_s(t) and Q_m(t) have different update rules under the different choices, and on this basis the optimal allocation decision is selected to minimize the value of the (mod-LOP) objective function. Since the computation for the frame grouping at time t-1 may not be completed when the frame grouping (GoF) at time t arrives, g(t) is used to denote the time interval between the (t-1)-th and t-th frame groupings. The remaining computation time is given by r_s(t) = max(Q_s(t-1) - g(t), 0) and r_m(t) = max(Q_m(t-1) - g(t), 0), where r_s(t) and r_m(t) denote the remaining computation time of the server and of the mobile device, respectively. O_I(t) = 0 indicates that the base frame is allocated to local processing, and O_I(t) = 1 indicates that it is allocated to server processing; O_P(t) = 0 indicates that the change frames are allocated to local processing, and O_P(t) = 1 indicates that they are allocated to server processing. The timing of allocation decisions upon GoF arrival is shown in Fig. 3: at t = 1, the online scheduler decides to offload the I frame to the edge server and keep the P frames for local computation, i.e., O_I(t) = 1 and O_P(t) = 0. Q_m(t) and Q_s(t) are obtained from the system state provided by the system analyzer and are updated after the GoF is completed. The delay perceived by the user is the time interval between the arrival and the completion of the last processed GoF, e.g., the second processed GoF in Fig. 3.
Case 1: O_I(t) = 0, O_P(t) = 0, i.e., both the base frame (I frame) and the change frames (P frames) of the frame grouping arriving at time t are computed locally. The delay is then:
delay(t) = r_m(t) + max( d_I,m(t), d_sch(t) ) + d_P,m(t)
In the above formula, d_I,m(t) is the predicted delay of running ResNet-152 on the mobile device at time t; d_sch(t) is the delay of the block search that obtains the information required for the P frames at time t; d_P,m(t) is the delay of running ResNet-18 on the mobile device at time t; and the remaining parameters are defined as above. Since the GPU is used to obtain the MVs and residuals, d_I,m(t) and d_sch(t) can run in parallel. The delays of running ResNet-18 and ResNet-152 on the mobile device Xiaomi MI 8 are shown in Fig. 5.
Case two: o isI(t)=1,OpAnd (t) 1, namely, distributing the I frame and the P frame in the frame grouping GoF arriving at the time t to a server for processing, namely, the mobile equipment needs to compress the frames in the frame grouping and then sends the frames to an edge server, the edge server waits for data to arrive and receive, and then runs ResNet-152 and ResNet-18 on the I frame and the P frame respectively, and the time for the edge server to wait for the data is equal to the sum of video compression and data transmission. Then, the delay is shown as follows:
delay(t) = max( max( r_s(t), d_I,W(t) ) + d_I,s(t), d_P,W(t) ) + d_P,s(t)
In the above formula, the I and P frames incur compression and acquisition delays d_I,C(t) + d_sch(t) + d_P,C(t), where d_I,C(t) is the predicted delay of compressing the I frame at time t and d_P,C(t) is the predicted delay of compressing the P frames at time t; d_I,W(t) is the predicted waiting time for the I frame at time t; d_P,W(t) is the predicted waiting time for the P frames at time t; d_I,s(t) is the predicted delay of running ResNet-152 on the edge server at time t; d_P,s(t) is the predicted delay of running ResNet-18 on the edge server at time t; and the remaining parameters are defined as above. The delays of running ResNet-18 and ResNet-152 on the edge server are shown in Fig. 6.
Case three: o isI(t)=0,OpAnd (t) is 1, i.e. an I frame in a frame group GoF arriving at the time t is distributed to a local end, P frames are distributed to a server for processing, ResNet-152 running on a CPU of the mobile device can be compressed in parallel with Block Search and P frames running on a GPU, and the edge server can perform residual calculation in the process of waiting for the arrival of a calculation task. Then, the delay is shown as follows:
delay(t) = max( r_m(t) + d_I,m(t), max( r_s(t), d_P,W(t) ) + d_P,s(t) )
In the above formula, each parameter is defined as above.
Case four: o isI(t)=1,OpWhen t is 0, I-frames in a frame grouping GoF arriving at the time t are distributed to a server, P-frames are distributed to local processing, ResNet-18 running on a CPU of the mobile device can be compressed in parallel with the I-frames running on a GPU, and an edge server can perform residual calculation in the process of waiting for a calculation task to arrive. Then, the delay is shown as follows:
delay(t) = max( r_m(t) + d_sch(t) + d_P,m(t), max( r_s(t), d_I,W(t) ) + d_I,s(t) )
In the above formula, each parameter is defined as above.
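The choice among the four cases amounts to a small online scheduler: enumerate the four (O_I(t), O_P(t)) decisions, evaluate the resulting queues, and keep the decision minimizing the (mod-LOP) objective max(Q_m, Q_s). The Python sketch below is one consistent reading of the four cases described above (the closed-form delay expressions appear only as images in the original document); the dictionary keys and the simplification that on-device compression fully overlaps CPU inference are assumptions made for illustration.

```python
def schedule_gof(r_m, r_s, d):
    """Pick (O_I, O_P) for the arriving GoF. `r_m`/`r_s` are the remaining
    computation times of the mobile device and the server; `d` holds the
    per-GoF predicted delays keyed "I_m", "sch", "P_m", "I_W", "P_W",
    "I_s", "P_s" for d_I,m, d_sch, d_P,m, d_I,W, d_P,W, d_I,s and d_P,s."""
    q = {}
    # Case 1 (0, 0): all local; I-frame inference overlaps the block search.
    q[(0, 0)] = (r_m + max(d["I_m"], d["sch"]) + d["P_m"], r_s)
    # Case 2 (1, 1): all on the server; the waits d_I,W / d_P,W already
    # include compression and transmission time.
    finish_i = max(r_s, d["I_W"]) + d["I_s"]
    q[(1, 1)] = (r_m, max(finish_i, d["P_W"]) + d["P_s"])
    # Case 3 (0, 1): I local on the CPU; block search and P-frame compression
    # on the GPU overlap it, so only d_P,W and d_P,s burden the server side.
    q[(0, 1)] = (r_m + d["I_m"], max(r_s, d["P_W"]) + d["P_s"])
    # Case 4 (1, 0): I offloaded; locally, block search then P-frame inference.
    q[(1, 0)] = (r_m + d["sch"] + d["P_m"], max(r_s, d["I_W"]) + d["I_s"])
    decision = min(q, key=lambda k: max(q[k]))
    return decision, q[decision]  # ((O_I, O_P), (Q_m, Q_s))
```

At each GoF arrival, r_m(t) and r_s(t) are first updated from the previous queues as r(t) = max(Q(t-1) - g(t), 0), and schedule_gof is then called with the delays predicted by the system analyzer.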
In this embodiment, under combinations of 4 poor wireless channel states and 3 different accuracy requirements, the two schemes "DeepAction" and "local execution" are compared, as shown in Fig. 10 (in Fig. 10, (a), (b), and (c) represent three different combinations). The legend "DeepAction" denotes the result obtained with the complete execution flow of the method, while "local execution" denotes not using the method and instead computing all sampled frames locally. As the figure shows, even when the channel state is very poor (a bandwidth of 0.75 Mbps), DeepAction still effectively reduces the computation delay.
In this embodiment, under combinations of 4 good wireless channel states and 3 different accuracy requirements, the three schemes "DeepAction", "local execution", and "remote execution" are compared, as shown in Fig. 11 (in Fig. 11, (a), (b), and (c) represent three different combinations). The legend "DeepAction" denotes the result obtained with the complete execution flow of the method; "local execution" denotes not using the method and computing all sampled frames locally; and "remote execution" denotes not using the method and allocating all frames to the edge server for computation. As the figure shows, even when the channel state is excellent (a bandwidth of 93.84 Mbps), DeepAction still effectively reduces the computation delay compared with remote execution.
The video processing system of this embodiment comprises a frame grouping module, an allocation module, and a result processing module. The frame grouping module groups frames of a video to be processed to obtain frame groupings and divides the frames within each frame grouping into a base frame and change frames. The allocation module, taking minimum delay as the optimization objective, determines the processing entity for the base frame and the change frames within the frame grouping and dispatches them to the determined processing entities, the processing entities comprising a local end and a server end. The result processing module acquires the results of the processing entities recognizing the base frame and the change frames to obtain a recognition result. The first frame within a frame grouping is the base frame, and the remaining frames are change frames; the data information recorded by a change frame comprises the change amount between the change frame and the previous frame; and the change amount comprises a motion vector and a residual.
In this embodiment, the system further comprises a prediction module for predicting the sampling rate and the number of change frames of the frame groupings through an intelligent algorithm model according to a preset accuracy. The allocation module takes minimum delay as the optimization objective, specifically by optimizing the objective function shown below:
(LOP)  min_{S_GoF, n_P, O_I(t), O_P(t)}  max( Q_m(T), Q_s(T) )
s.t.  f(S_GoF, n_P) ≥ Λ,
      0 < S_GoF ≤ 1,
      0 < n_P ≤ 11,
      O_I(t), O_P(t) ∈ {0, 1}.
In the above formula, LOP is short for Latency Optimization Problem; f(S_GoF, n_P) is the accuracy achievable at sampling rate S_GoF and change-frame number n_P; Λ is the preset accuracy; O_I(t) is the allocation decision for the base frame when the t-th frame grouping arrives; O_P(t) is the allocation decision for the change frames when the t-th frame grouping arrives; S_GoF is the sampling rate; n_P is the number of change frames within a frame grouping; Q_m(T) is the time required for local processing under the allocation; Q_s(T) is the time required for server-side processing under the allocation; and T is the maximum duration limit of the video;
or:
the allocation module takes the minimum delay as an optimization target, and specifically comprises the following steps of optimizing through an objective function shown as the following formula:
Figure GDA0002827972500000113
s.t.OI(t),OP(t)∈{0,1}.
in the above formula, mod-LOP is a name abbreviation of a simplified delay Optimization Problem (modified delay Optimization Problem), OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determination of the allocation of the changed frames when the t-th frame grouping arrives, Qm(T) time required for local processing, Q, according to allocationsAnd (T) is the time required for the server to process according to the distribution situation.
The system of this embodiment implements the processing method described above, effectively reducing recognition delay while ensuring recognition accuracy.
The foregoing is merely a description of preferred embodiments of the invention and is not intended to limit the invention in any way. Although the invention has been described with reference to preferred embodiments, they are not intended to be limiting. Any simple modification, equivalent change, or refinement made to the above embodiments in accordance with the technical spirit of the invention, without departing from the content of the technical solution of the invention, shall fall within the scope of protection of the technical solution of the invention.

Claims (8)

1. A video processing method, characterized by comprising:
S1, grouping frames of a video to be processed to obtain frame groupings, and dividing the frames within each frame grouping into a base frame and change frames;
S2, taking minimum delay as the optimization objective, determining the processing entity for the base frame and the change frames within the frame grouping, and dispatching them to the determined processing entities; the processing entities comprise a local end and a server end;
S3, recognizing the base frame and the change frames by means of the processing entities to obtain a recognition result;
wherein step S1 is preceded by a prediction step S0, which specifically comprises: predicting the sampling rate and the number of change frames of the frame groupings through an intelligent algorithm model according to a preset accuracy.
2. The video processing method of claim 1, wherein the first frame within the frame grouping is the base frame and the remaining frames are change frames.
3. The video processing method of claim 2, wherein the data information recorded by a change frame comprises the change amount between the change frame and the previous frame;
and the change amount comprises a motion vector and a residual.
4. The video processing method of any one of claims 1 to 3, wherein taking minimum delay as the optimization objective in step S2 specifically comprises optimizing the objective function shown below:
(LOP)  min_{S_GoF, n_P, O_I(t), O_P(t)}  max( Q_m(T), Q_s(T) )
s.t.  f(S_GoF, n_P) ≥ Λ,
      0 < S_GoF ≤ 1,
      0 < n_P ≤ 11,
      O_I(t), O_P(t) ∈ {0, 1}.
In the above formula, LOP is short for latency optimization problem; f(S_GoF, n_P) is the accuracy achievable at sampling rate S_GoF and change-frame number n_P; Λ is the preset accuracy; O_I(t) is the allocation decision for the base frame when the t-th frame grouping arrives; O_P(t) is the allocation decision for the change frames when the t-th frame grouping arrives; S_GoF is the sampling rate; n_P is the number of change frames within a frame grouping; Q_m(T) is the time required for local processing under the allocation; Q_s(T) is the time required for server-side processing under the allocation; and T is the maximum duration limit of the video.
5. The video processing method of any one of claims 1 to 3, wherein taking minimum delay as the optimization objective in step S2 specifically comprises optimizing the objective function shown below:
(mod-LOP)  min_{O_I(t), O_P(t)}  max( Q_m(t), Q_s(t) )
s.t.  O_I(t), O_P(t) ∈ {0, 1}.
In the above formula, mod-LOP is short for the simplified latency optimization problem; O_I(t) is the allocation decision for the base frame when the t-th frame grouping arrives; O_P(t) is the allocation decision for the change frames when the t-th frame grouping arrives; Q_m(t) is the time required for local processing under the allocation; and Q_s(t) is the time required for server-side processing under the allocation.
6. A video processing system, characterized by comprising a frame grouping module, an allocation module, and a result processing module;
the frame grouping module is configured to group frames of a video to be processed to obtain frame groupings, and to divide the frames within each frame grouping into a base frame and change frames;
the allocation module is configured to determine, taking minimum delay as the optimization objective, the processing entity for the base frame and the change frames within the frame grouping, and to dispatch them to the determined processing entities; the processing entities comprise a local end and a server end;
the result processing module is configured to acquire the results of the processing entities recognizing the base frame and the change frames to obtain a recognition result;
and the system further comprises a prediction module configured to predict the sampling rate and the number of change frames of the frame groupings through an intelligent algorithm model according to a preset accuracy.
7. The video processing system of claim 6, wherein the first frame within the frame grouping is the base frame and the remaining frames are change frames; the data information recorded by a change frame comprises the change amount between the change frame and the previous frame; and the change amount comprises a motion vector and a residual.
8. The video processing system of claim 6 or 7, wherein the allocation module takes minimum delay as the optimization objective, specifically by optimizing the objective function shown below:
(LOP)  min_{S_GoF, n_P, O_I(t), O_P(t)}  max( Q_m(T), Q_s(T) )
s.t.  f(S_GoF, n_P) ≥ Λ,
      0 < S_GoF ≤ 1,
      0 < n_P ≤ 11,
      O_I(t), O_P(t) ∈ {0, 1}.
In the above formula, LOP is short for latency optimization problem; f(S_GoF, n_P) is the accuracy achievable at sampling rate S_GoF and change-frame number n_P; Λ is the preset accuracy; O_I(t) is the allocation decision for the base frame when the t-th frame grouping arrives; O_P(t) is the allocation decision for the change frames when the t-th frame grouping arrives; S_GoF is the sampling rate; n_P is the number of change frames within a frame grouping; Q_m(T) is the time required for local processing under the allocation; Q_s(T) is the time required for server-side processing under the allocation; and T is the maximum duration limit of the video;
or:
the allocation module takes the minimum delay as an optimization target, and specifically comprises the following steps of optimizing through an objective function shown as the following formula:
Figure FDA0002827972490000031
s.t.OI(t),OP(t)∈{0,1}.
in the above formula, mod-LOP is a name abbreviation for the simplified delay optimization problem, OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determination of the allocation of the changed frames when the t-th frame grouping arrives, Qm(T) is the time required for local processing, Q, according to the allocationsAnd (T) is the time required for the server to process according to the distribution condition.
CN201911410027.7A 2019-12-31 2019-12-31 Video processing method and system Active CN111131835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911410027.7A CN111131835B (en) 2019-12-31 2019-12-31 Video processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911410027.7A CN111131835B (en) 2019-12-31 2019-12-31 Video processing method and system

Publications (2)

Publication Number Publication Date
CN111131835A CN111131835A (en) 2020-05-08
CN111131835B true CN111131835B (en) 2021-02-26

Family

ID=70506265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911410027.7A Active CN111131835B (en) 2019-12-31 2019-12-31 Video processing method and system

Country Status (1)

Country Link
CN (1) CN111131835B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110018834A (en) * 2019-04-11 2019-07-16 北京理工大学 It is a kind of to mix the task unloading for moving cloud/edge calculations and data cache method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8477661B2 (en) * 2009-08-14 2013-07-02 Radisys Canada Ulc Distributed media mixing and conferencing in IP networks
US8646021B2 (en) * 2011-04-20 2014-02-04 Verizon Patent And Licensing Inc. Method and apparatus for providing an interactive application within a media stream
CN107333267B (en) * 2017-06-23 2019-11-01 电子科技大学 A kind of edge calculations method for 5G super-intensive networking scene
CN108809723B (en) * 2018-06-14 2021-03-23 重庆邮电大学 Edge server joint task unloading and convolutional neural network layer scheduling method
CN108540406B (en) * 2018-07-13 2021-06-08 大连理工大学 Network unloading method based on hybrid cloud computing
CN110096362B (en) * 2019-04-24 2023-04-14 重庆邮电大学 Multitask unloading method based on edge server cooperation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110018834A (en) * 2019-04-11 2019-07-16 北京理工大学 It is a kind of to mix the task unloading for moving cloud/edge calculations and data cache method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Task splitting, offloading, and scheduling decisions for resource-constrained mobile edge computing; Zhang Genshan, Liu Xuning; Computer Applications and Software (《计算机应用与软件》); 2019-10-12; full text *

Also Published As

Publication number Publication date
CN111131835A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN108780499B (en) System and method for video processing based on quantization parameters
US8724704B2 (en) Apparatus and method for motion estimation and image processing apparatus
US11657264B2 (en) Content-specific neural network distribution
US9635374B2 (en) Systems and methods for coding video data using switchable encoders and decoders
CN100586180C (en) Be used to carry out the method and system of de-blocking filter
CN101039434B (en) Video coding apparatus
KR20140110008A (en) Object detection informed encoding
CN112383777B (en) Video encoding method, video encoding device, electronic equipment and storage medium
JP5766877B2 (en) Frame coding selection based on similarity, visual quality, and interest
US20230062752A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
US10623744B2 (en) Scene based rate control for video compression and video streaming
CN110248189B (en) Video quality prediction method, device, medium and electronic equipment
CN110891177B (en) Denoising processing method, device and machine equipment in video denoising and video transcoding
CN110430436A (en) A kind of cloud mobile video compression method, system, device and storage medium
CN111787322B (en) Video coding method and device, electronic equipment and computer readable storage medium
US20120195364A1 (en) Dynamic mode search order control for a video encoder
WO2017180201A1 (en) Adaptive directional loop filter
CN114363649A (en) Video processing method, device, equipment and storage medium
Xiao et al. Dnn-driven compressive offloading for edge-assisted semantic video segmentation
US20180007366A1 (en) Adaptive tile data size coding for video and image compression
CN110418134B (en) Video coding method and device based on video quality and electronic equipment
CN111131835B (en) Video processing method and system
Li et al. Fleet: improving quality of experience for low-latency live video streaming
US20210103813A1 (en) High-Level Syntax for Priority Signaling in Neural Network Compression
CA3182110A1 (en) Reinforcement learning based rate control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant