CN111131835B - Video processing method and system - Google Patents

Video processing method and system Download PDF

Info

Publication number
CN111131835B
CN111131835B CN201911410027.7A CN201911410027A CN111131835B CN 111131835 B CN111131835 B CN 111131835B CN 201911410027 A CN201911410027 A CN 201911410027A CN 111131835 B CN111131835 B CN 111131835B
Authority
CN
China
Prior art keywords
frame
frames
grouping
gof
change
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911410027.7A
Other languages
Chinese (zh)
Other versions
CN111131835A (en
Inventor
张德宇
罗云臻
张尧学
贾富程
段思婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN201911410027.7A priority Critical patent/CN111131835B/en
Publication of CN111131835A publication Critical patent/CN111131835A/en
Application granted granted Critical
Publication of CN111131835B publication Critical patent/CN111131835B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/13Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/625Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]

Abstract

The invention discloses a video processing method and a system, wherein the method comprises the following steps: s1, grouping frames in a video to be processed to obtain a frame grouping, and dividing the frames in the frame grouping into a basic frame and a change frame; s2, with the minimum delay as an optimization target, determining the processing main bodies of the basic frames and the change frames in the frame grouping, and distributing the processing main bodies to the determined processing main bodies; the processing main body comprises a local end and a server end; and S3, identifying the basic frame and the change frame through the processing main body to obtain an identification result. The method has the advantages of effectively reducing the identification delay and ensuring the identification accuracy.

Description

Video processing method and system
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method and a system for video processing, and more particularly, to a method and a system for video processing and dynamic object recognition.
Background
With the development of mobile equipment photographing and camera shooting technology, recording our daily life by using short videos has become an increasingly obvious development trend. According to the short video industry report of the iiMedia research, the amount of video uploaded from mobile devices to video platforms in china alone is already very large, e.g., the amount of short video on "tremble" and "watermelon video" has been over 2000 million. Through life experience and experimental verification, the video can be found to contain a great deal of information, such as abnormal events, human-human interaction and human-human interaction.
However, deep learning, which is one of the better ways to identify and extract information in a video, is performed, and specifically, a Convolutional Neural Network (CNN) is generally used to process a video frame. However, deep learning generally has a large amount of calculation tasks, and thus causes a large calculation delay, as shown in document 1(l.n. huynh, y.lee, and r.k.balan, "deep: Mobile global estimated missing frame for continuous vision Applications," in Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (Mobile), 2017.), and even with the support of a Mobile device GPU, it takes 600 milliseconds to perform typical CNN processing on a video frame.
Therefore, much attention has recently been paid to improving the efficiency of performing deep learning tasks on Mobile devices, such as document 2(m.xu, m.zhu, y.liu, f.x.lin, and x.liu), "deep cache: printed cache for Mobile device vision," in Proceedings of the 24th Annual International Conference on Mobile Computing and Networking (Mobile com), 2018.) proposed deep cache that uses input frame content as a cache key and infers the result as a cache value. By utilizing information redundancy between continuous frames in the video, the DeepCache can reuse the cached inference result between two frames, thereby obviously reducing the execution time and energy consumption. Also, as in document 1, there is a method of improving the efficiency of a deep learning task of a mobile device, such as increasing the computation speed by decomposing a CNN model and unloading a convolution layer to a GPU of the mobile device.
Some of these works of the conventional mobile device deep learning technology are to perform deep learning processing on a mobile device with a single video frame as an object, that is, the relationship between video frames is not deeply considered, so that motion information included in one short video cannot be efficiently recognized, and other parts are to focus on static information such as target detection, and intensive research is necessary for recognizing dynamic information.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the technical problems in the prior art, the invention provides a video processing method and a video processing system which can effectively reduce the identification delay and ensure the identification accuracy.
In order to solve the technical problems, the technical scheme provided by the invention is as follows: a video processing method, comprising:
s1, grouping frames in a video to be processed to obtain a Frame grouping, and dividing frames in the Frame grouping (GoF) into a basic Frame and a change Frame;
s2, with the minimum delay as an optimization target, determining the processing main bodies of the basic frames and the change frames in the frame grouping, and distributing the processing main bodies to the determined processing main bodies; the processing main body comprises a local end and a server end;
and S3, identifying the basic frame and the change frame through the processing main body to obtain an identification result.
Further, the first frame within the frame grouping is a base frame, and the remaining frames are change frames.
Further, the data information recorded by the change frame includes the change amount of the change frame and the previous frame;
the variance includes a motion vector and a residual.
Further, the step S1 is preceded by a prediction step S0, which specifically includes: and predicting the sampling rate and the number of the changed frames of the frame grouping through an intelligent algorithm model according to a preset accuracy rate.
Further, the step S2 of using the minimum delay time as the optimization target specifically includes performing optimization through an objective function shown in the following formula:
Figure GDA0002827972500000021
s.t.f(SGoF,nP)≥Λ,
0<SGoF≤1,
0<nP≤11,
Figure GDA0002827972500000023
in the above formula, LOP is the name abbreviation of delay Optimization Problem (Latency Optimization Problem), f (S)GoF,nP) At a sampling rate SGoFAnd the number of changed frames nPThe accuracy rate, Λ, that can be achieved is the predetermined accuracy rate, OI(t) division of base frame when the t-th frame group arrivesDetermination of formula Op(t) determining the assignment of the changed frames when the t-th frame group arrives, SGoFIs the sampling rate, nPFor varying the number of frames, Q, within a grouping of framesm(T) is the time required for local processing, Q, according to the allocationsAnd (T) is the time required by the service end to process according to the distribution condition, and T is the maximum duration limit of the video.
Further, the step S2 of using the minimum delay time as the optimization target specifically includes performing optimization through an objective function shown in the following formula:
Figure GDA0002827972500000022
s.t.OI(t),OP(t)∈{0,1}.
in the above equation, mod-LOP is a simplified name reduction of delay Optimization Problem (modified delay Optimization Problem), OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determination of the allocation of the changed frames when the t-th frame grouping arrives, Qm(T) is the time required for local processing, Q, according to the allocationsAnd (T) is the time required for the server to process according to the distribution condition.
A video processing system comprises a frame grouping module, an allocation module and a result processing module;
the frame grouping module is used for grouping frames in a video to be processed to obtain a frame grouping, and dividing the frames in the frame grouping into a basic frame and a change frame;
the allocation module is used for determining the processing subjects of the base frame and the change frame in the frame grouping by taking the minimum delay as an optimization target and allocating the processing subjects to the determined processing subjects; the processing main body comprises a local end and a server end;
the result processing module is used for acquiring the result of the processing main body for identifying the basic frame and the change frame to obtain an identification result.
Further, a first frame in the frame grouping is a basic frame, and the rest frames are changed frames; the data information recorded by the change frame comprises the change amount of the change frame and the previous frame; the variance includes a motion vector and a residual.
And the prediction module is used for predicting the sampling rate and the number of the changed frames of the frame grouping through an intelligent algorithm model according to a preset accuracy rate.
Further, the allocation module takes the minimum delay as an optimization target, and specifically includes optimizing by an objective function shown in the following formula:
Figure GDA0002827972500000031
s.t.f(SGoF,nP)≥Λ,
0<SGoF≤1,
0<nP≤11,
Figure GDA0002827972500000032
in the above formula, LOP is the name abbreviation of delay Optimization Problem (Latency Optimization Problem), f (S)GoF,nP) At a sampling rate SGoFAnd the number of changed frames nPThe accuracy rate, Λ, that can be achieved is the predetermined accuracy rate, OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determining the assignment of the changed frames when the t-th frame group arrives, SGoFIs the sampling rate, nPFor varying the number of frames, Q, within a grouping of framesm(T) is the time required for local processing, Q, according to the allocations(T) is the time required for the server to process according to the distribution condition, and T is the maximum duration limit of the video;
or:
the allocation module takes the minimum delay as an optimization target, and specifically comprises the following steps of optimizing through an objective function shown as the following formula:
Figure GDA0002827972500000041
s.t.OI(t),OP(t)∈{0,1}.
in the above formula, mod-LOP is a name abbreviation of a simplified delay Optimization Problem (modified delay Optimization Problem), OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determination of the allocation of the changed frames when the t-th frame grouping arrives, Qm(T) is the time required for local processing, Q, according to the allocationsAnd (T) is the time required for the server to process according to the distribution condition.
Compared with the prior art, the invention has the advantages that:
1. the invention takes the minimum time delay as an optimization target to distribute a processing main body for the basic frame and the change frame by grouping the video frequency into the frame group consisting of the basic frame and the change frame, namely distributing the frame in the frame group to a local end or a service end for identification, thereby not only ensuring the identification accuracy, but also greatly reducing the time delay of image processing; the delay perceived by the user is greatly shortened.
2. The invention predicts the sampling rate and the number of the changed frames of the frame grouping through an intelligent algorithm model, and the intelligent algorithm model predicts the sampling rate S of the frame grouping based on different frame groupingGoFAnd the number of changed frames nPTraining an intelligent algorithm model according to the accuracy, and calculating a set of sampling rate S for the given accuracy by the intelligent algorithm model through the trained intelligent algorithm modelGoFAnd the number of changed frames nPThe parameters can meet the requirement of accuracy, and the calculation amount required by video processing is effectively reduced.
Drawings
Fig. 1 is a schematic diagram of a video motion recognition process according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a system architecture according to an embodiment of the present invention.
Fig. 3 is a diagram illustrating an example of a GoF assignment decision upon arrival according to an embodiment of the present invention.
FIG. 4 shows the accuracy and sampling rate S according to an embodiment of the present inventionGoFAnd the number of changed frames nPSchematic of the relationship between.
Fig. 5 shows the delay of the operation of ResNet-18 and ResNet-152 on the mobile device MI 8 according to an embodiment of the present invention.
Fig. 6 shows the delay of the operation of the ResNet-18 and ResNet-152 on the edge server according to an embodiment of the present invention.
Fig. 7 is a diagram illustrating a comparison of delay obtained after block search is implemented in OpenCL and RenderScript (RS, a component of the Android operating system for mobile devices, which provides an API using heterogeneous hardware acceleration) according to an embodiment of the present invention.
Fig. 8 is a diagram illustrating an occupancy rate of a GPU in the case of implementing block search in two ways, OpenCL and RS, according to an embodiment of the present invention.
Fig. 9 is a delay situation diagram obtained by adopting two implementation manners, OpenCL and RS + jni (java Native interface), for video compression and paralleling the video compression process and the inference process of I frames according to the embodiment of the present invention.
Fig. 10 is a diagram of a comparison of different channels and different accuracy requirements according to an embodiment of the present invention.
Fig. 11 is a diagram of a comparison of different channels and different accuracy requirements in an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
As shown in fig. 1 and fig. 2, the video processing method of the present embodiment includes: s1, grouping frames in a video to be processed to obtain a frame grouping, and dividing the frames in the frame grouping into a basic frame and a change frame; s2, determining a processing main body of a basic frame and a processing main body of a change frame in a frame grouping by taking the minimum delay as an optimization target, and distributing the processing main body to the determined processing main body; the processing main body comprises a local end and a server end; and S3, identifying the basic frame and the change frame through the processing main body to obtain an identification result.
In this embodiment, a mobile device (e.g., a smart phone) is taken as a local side, and an edge server connected to the mobile device via a network is taken as a server side for convenience of description. The mobile equipment shoots video images through the camera and then identifies the actions of the shot video images. In this embodiment, specifically, millet MI 8 running Android 9 is used as a mobile device, a desktop computer running Ubuntu is used as an edge server, a CPU of the desktop computer is Intel Core i 78700K, a GPU is GeForce RTX 2080, and a UCF-101 data set is used to implement the technical scheme of the method.
In this embodiment, before the step S1, a prediction step S0 is further included, which specifically includes: and predicting the sampling rate and the number of the changed frames of the frame grouping through an intelligent algorithm model (an off-line predictor in the figure 2) according to a preset accuracy rate. Since there is a functional relationship between the sampling rate of the frames and the accuracy rate of the motion recognition, as shown in fig. 4, in the present embodiment, the relationship between the sampling rate and the accuracy rate of the motion recognition is learned through an intelligent algorithm model, i.e., through the sampling rate S based on different frame groupsGoFAnd the number of changed frames nPThe intelligent algorithm model is trained according to the accuracy, and the trained intelligent algorithm model can calculate a group of sampling rates S under the condition of given accuracyGoFAnd the number of changed frames nPAnd (4) parameters. The intelligent algorithm model preferably adopts an off-line model, is trained in advance and then is loaded to the mobile equipment, and the sampling rate of frame grouping and the number of changed frames are calculated.
In the present embodiment, f (S) is preferably usedGoF,nP) To characterize the functional relationship between the sampling rate of a frame and the accuracy of motion recognition, the sampling rate S is characterized by a binary cubic polynomialGoFAnd the number of changed frames nPBy fitting a polynomial to the measured values in fig. 4, the sampling rate S can be obtainedGoFAnd the number of changed frames nP. The binary cubic polynomial may be represented by the following formula:
f(SGoF,nP)=p00+p10×SGoF+p01×nP+p20×SGoF 2+p11×SGoF×nP+p02×nP 2+p30×SGoF 3+p21×SGoF 2×nP+p12×SGoF×nP 2+p03×nP 3
in the above formula, p00, p10, p20, p11, p02, p30, p21, p12 and p03 are all binary cubic polynomial coefficients, and the rest parameters are defined as above. Table 1 below is a case where a binary cubic polynomial is fitted through the UCF-101 dataset.
Table 1:
Figure GDA0002827972500000061
the Accuracy Setting field is input into an intelligent algorithm model, the Sampled GoF is converted into a fractional form because the GoF samples are discrete in the actual operation of the system, 1261 in Data (1261) represents the number of videos selected for testing in a test set of a UCF-101 Data set, tuple content under the field represents the Accuracy rate obtained by actual testing, and Gap represents the difference value between the actually measured Accuracy rate and the input Accuracy rate.
In this embodiment, the first frame within a frame grouping is the base frame (I-frame), and the remaining frames are the change frames (P-frames). The data information recorded by the change frame comprises the change amount of the change frame and the previous frame; the variance includes a motion vector and a residual. Preferably, the variance is encoded only between itself and the previous frame using the previous frame as a reference, the variance is composed of a Motion Vector (MV) and a residual, the Motion Vector represents the movement of a pixel Block between the two frames, and the residual represents the difference between the base frame and the frame restored from the Motion Vector by means of Block Search. In this embodiment, delay conditions in two block search modes, OpenCL and renderscript (rs), are shown in fig. 7. The GPU occupancy situation by the two block search methods of OpenCL and RS is shown in fig. 8.
In this embodiment, it is preferable to identify the base frame with the large CNN model ResNet-152 and the change frame with the small CNN model ResNet-18 (i.e., identify the motion vectors and residuals). By filtering out redundant information between dropped frames, the CNN model can significantly reduce the complexity of motion recognition while achieving better accuracy performance. In each frame grouping, the motion vectors and residuals of all the change frames are respectively added together to enhance the information contained in the change frames and reduce the inference times of the change frames in the CNN model (ResNet 18). Both the CNN model ResNet-152 and CNN model ResNet-18 are deep learning models.
In this embodiment, as shown in fig. 2, after the frame grouping is divided and the base frame and the change frame in the frame grouping are determined, the base frame and the change frame are allocated to the local side process and the server side process, the time spent on allocating the base frame and the change frame is different, and the time delay experienced by the user is different. Therefore, in the present embodiment, by taking the minimum delay as an optimization target, the processing subjects of the base frame and the change frame within the frame grouping are determined and assigned to the determined processing subjects. The step S2 of using the minimum delay time as the optimization target specifically includes performing optimization by an objective function expressed by the following two expressions (LOP) or (mod-LOP).
In this embodiment, formula (LOP) is:
Figure GDA0002827972500000071
s.t.f(SGoF,nP)≥Λ,
0<SGoF≤1,
0<nP≤11,
Figure GDA0002827972500000072
in the above formula, LOP is the name abbreviation of delay optimization problem (Latency Optimiz)ation Problem),f(SGoF,nP) At a sampling rate SGoFAnd the number of changed frames nPThe accuracy rate, Λ, that can be achieved is the predetermined accuracy rate, OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determining the assignment of the changed frames when the t-th frame group arrives, SGoFIs the sampling rate, nPFor varying the number of frames, Q, within a grouping of framesm(T) time required for local processing, Q, according to allocationsAnd (T) is the time required by the server side for processing according to the distribution condition, and T is the maximum duration limit of the video.
Since the embodiment adopts the intelligent recognition model to predict the sampling rate SGoFAnd the number of changed frames nPTherefore, the sampling rate S can also be removed from (LOP) in the equationGoFAnd the number of changed frames nPAnd reduces it to a scheduling problem to allocate decision OI(t) and Op(t) as a variable. While taking into account changes in the state of the system, Qs(T) and QmThe value of (T) changes during the arrival of the concatenated frame grouping, and therefore, the objective function can be optimized as shown in the following equation:
Figure GDA0002827972500000073
s.t.OI(t),OP(t)∈{0,1}.
in the above formula, mod-LOP is a name abbreviation of a simplified delay Optimization Problem (modified delay Optimization Problem), OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determination of the allocation of the changed frames when the t-th frame grouping arrives, Qm(t) time required for local processing, Q, depending on the allocationsAnd (t) is the time required for the server to process according to the distribution situation. Qm(t) and Qs(t) is determined by the system status obtained by the system analyzer and updated after a frame grouping is completed.
In this embodiment, for the deep learning model, the mobile device uses a deep learning framework tensoflow Lite specially designed for the mobile device, and the edge server uses a Pytorch.
In this embodiment, when the basic frame and/or the change frame are allocated to the server for processing, the basic frame and/or the change frame are compressed and then sent to the server through the network for processing. In the compression process, for a base frame, the frame is compressed with Discrete Cosine Transform (DCT) and Entropy Coding (EC) using intra prediction according to the h.264 standard. Under the condition of motion vectors and residual errors, DCT and EC are directly operated to carry out compression, and in order to avoid the influence on precision, lossless compression is achieved by removing the quantization process. And packaging the compressed data, and sending the data to a server through a TCP/IP protocol. When the server receives the data, a Decoder (Decoder) is used for decoding the data, and then a CNN model is operated to identify the actions in the frame, so that the scores of the identified actions are obtained. When the base frame and/or the change frame are identified by the local terminal, the mobile device directly identifies the actions in the base frame and/or the change frame through the CNN model to obtain the scores of the identified actions. And after the scores of all the actions are weighted and summed, the action with the highest score is taken as a recognition result (namely a label). To improve the performance of compression-based deep learning model inference, OpenCL is used to implement compression-related operations on mobile GPUs, so that inference running on the mobile device CPU can be parallel to video compression. As shown in fig. 9, in this embodiment, two implementation manners, OpenCL and RS + jni (java Native interface), are adopted for video compression, and a video compression process and an inference process of an I frame are performed in parallel to obtain a delay condition. In fig. 9, an example "reference line" represents a delay situation when only I-frame inference is run alone, an "OpenCL implementation" represents a delay situation in parallel with video compression and I-frame inference implemented by OpenCL, and an "RS + JNI implementation" represents a delay situation in parallel with video compression and I-frame inference implemented by RS + JNI.
In this embodiment, the mobile device is at OI(t) and OpThere are four cases on (t). For Q under different choicess(t) and Qm(t) there will be different update rules inOn this basis we select the optimal allocation decision to minimize the value of the (mod-LOP) objective function. Since the calculation of the GoF at time t-1 may not be completed when the frame grouping GoF at time t arrives, g (t) is used to represent the time interval between the (t-1) th and the t-th frame grouping GoF. The remaining computation time is available rs(t)=max(Qs(t-1) -g (t),0) and rm(t)=max(Qm(t-1) -g (t),0) represents rs(t)、rm(t) is used to represent the remaining computation time of the server and the mobile device, respectively. O isI(t) '0' indicates that the base frame is allocated to local side processing, OI(t) ═ 1 indicates that the base frame is allocated to server processing; o isp(t) '0' indicates that the change frame is allocated to the local side process, Op(t)' 1 indicates that the change frame is allocated to the server process. The time for allocating decision processing when the frame grouping arrives is shown in fig. 3, when t is 1, the online scheduler decides to unload the I frame to the edge server and keeps the P frame to be calculated locally, i.e. OI(t) 1 and Op(t)=0,Qm(t) and Qs(t) is obtained from the system state provided by the system analyzer and updated after the completion of the GoF. The user perceived delay is the time interval between the arrival and completion of the last to processed GoF, e.g., the second to processed GoF in fig. 3.
The first situation is as follows: o isI(t)=0,OpIf (t) is 0, that is, the base frame (I frame) and the changed frame (P frame) of the frame group GoF arriving at time t are both calculated locally, then the delay is as follows:
Figure GDA0002827972500000091
in the above formula, dI,m(t) the predicted delay to run ResNet-152 on the mobile device at time t, dsch(t) delay of block search for obtaining information required for P frame at time t, dP,m(t) denotes the delay of running ResNet-18 on the mobile device at time t, with the remaining parameters defined as above. Since the GPU is used to obtain the MV and the residual, dI,m(t) and dsch(t) may be in parallel. In movingThe delay for running ResNet-18 and ResNet-152 on mobile device millet MI 8 is shown in FIG. 5.
Case two: o isI(t)=1,OpAnd (t) 1, namely, distributing the I frame and the P frame in the frame grouping GoF arriving at the time t to a server for processing, namely, the mobile equipment needs to compress the frames in the frame grouping and then sends the frames to an edge server, the edge server waits for data to arrive and receive, and then runs ResNet-152 and ResNet-18 on the I frame and the P frame respectively, and the time for the edge server to wait for the data is equal to the sum of video compression and data transmission. Then, the delay is shown as follows:
Figure GDA0002827972500000092
in the above equation, I and P frames have compression and acquisition delays dI,C(t)+dsch(t)+dP,C(t) wherein dI,C(t) represents the delay of the predicted compressed I frame at time t, dP,C(t) represents the delay of the predicted compressed P frame at time t, dI,W(t) latency of predicted I-frame at time t, dP,W(t) latency of P frames predicted at time t, dI,s(t) delay to run ResNet-152 on the edge server predicted for time t, dP,s(t) delay to run ResNet-18 on the edge server predicted for time t, with the remaining parameters defined as above. The delay scenario for running ResNet-18 and ResNet-152 on an edge server is shown in FIG. 6.
Case three: o isI(t)=0,OpAnd (t) is 1, i.e. an I frame in a frame group GoF arriving at the time t is distributed to a local end, P frames are distributed to a server for processing, ResNet-152 running on a CPU of the mobile device can be compressed in parallel with Block Search and P frames running on a GPU, and the edge server can perform residual calculation in the process of waiting for the arrival of a calculation task. Then, the delay is shown as follows:
Figure GDA0002827972500000093
in the above formula, the definition of each parameter is the same as above.
Case four: o isI(t)=1,OpWhen t is 0, I-frames in a frame grouping GoF arriving at the time t are distributed to a server, P-frames are distributed to local processing, ResNet-18 running on a CPU of the mobile device can be compressed in parallel with the I-frames running on a GPU, and an edge server can perform residual calculation in the process of waiting for a calculation task to arrive. Then, the delay is shown as follows:
Figure GDA0002827972500000101
in the above formula, the definition of each parameter is the same as above.
In this embodiment, under the combination of poor 4 wireless channel states and 3 different accuracy requirements, two cases of "deep action" and "local execution" are performed, for example, as shown in fig. 10 (in fig. 10, (a), (b), and (c) respectively represent three different case combinations), where the legend deep action represents a result obtained by a complete execution flow using the method, and the "local execution" represents that a method is not used, but all results obtained by sampling are locally calculated. It can be seen from the figure that even in the case of very bad channel state (bandwidth is 0.75Mbps), the DeepAction can still effectively reduce the computation delay.
In this embodiment, under the combination of better 4 wireless channel states and 3 different accuracy requirements, the three cases of "deep action", "local execution", and "remote execution" are performed, for example, as shown in fig. 11 (in fig. 11, (a), (b), (c) respectively represent three different case combinations), the legend "deep action" represents the result obtained by the complete execution flow using the method, "local execution" represents that the method is not used, but the results obtained by sampling are all put in the local calculation, and the legend "remote execution" represents that the method is not used, but all frames are all allocated to the edge server for calculation. It can be seen from the figure that even under the condition that the channel state is excellent (bandwidth is 93.84Mbps), the DeepAction can effectively reduce the calculation delay compared with the remote execution.
The video processing system of the embodiment comprises a frame grouping module, a distribution module and a result processing module; the frame grouping module is used for grouping frames in a video to be processed to obtain a frame grouping, and dividing the frames in the frame grouping into a basic frame and a change frame; the distribution module is used for determining the processing main bodies of the basic frames and the change frames in the frame grouping by taking the minimum delay as an optimization target and distributing the processing main bodies to the determined processing main bodies; the processing main body comprises a local end and a server end; and the result processing module is used for acquiring the result of the processing main body for identifying the basic frame and the change frame to obtain the identification result. The first frame in the frame grouping is a basic frame, and the rest frames are change frames; the data information recorded by the change frame comprises the change amount of the change frame and the previous frame; the variance includes a motion vector and a residual.
In this embodiment, the method further includes a predicting module, configured to predict a sampling rate and a number of changed frames of the frame grouping through an intelligent algorithm model according to a preset accuracy. The allocation module takes the minimum delay as an optimization target, and specifically comprises the following steps of optimizing through an objective function shown as the following formula:
Figure GDA0002827972500000111
s.t.f(SGoF,nP)≥Λ,
0<SGoF≤1,
0<nP≤11,
Figure GDA0002827972500000112
in the above formula, LOP is the name abbreviation of delay Optimization Problem (Latency Optimization Problem), f (S)GoF,nP) At a sampling rate SGoFAnd the number of changed frames nPThe accuracy rate, Λ, that can be achieved is the predetermined accuracy rate, OI(t) isDetermination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determining the assignment of the changed frames when the t-th frame group arrives, SGoFIs the sampling rate, nPFor varying the number of frames, Q, within a grouping of framesm(T) time required for local processing, Q, according to allocations(T) is the time required by the server side for processing according to the distribution condition, and T is the maximum duration limit of the video;
or:
the allocation module takes the minimum delay as an optimization target, and specifically comprises the following steps of optimizing through an objective function shown as the following formula:
Figure GDA0002827972500000113
s.t.OI(t),OP(t)∈{0,1}.
in the above formula, mod-LOP is a name abbreviation of a simplified delay Optimization Problem (modified delay Optimization Problem), OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determination of the allocation of the changed frames when the t-th frame grouping arrives, Qm(T) time required for local processing, Q, according to allocationsAnd (T) is the time required for the server to process according to the distribution situation.
Through the system of the embodiment, the processing method can be realized, the identification delay is effectively reduced, and the identification accuracy is ensured.
The foregoing is considered as illustrative of the preferred embodiments of the invention and is not to be construed as limiting the invention in any way. Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto. Therefore, any simple modification, equivalent change and modification made to the above embodiments according to the technical spirit of the present invention should fall within the protection scope of the technical scheme of the present invention, unless the technical spirit of the present invention departs from the content of the technical scheme of the present invention.

Claims (8)

1. A video processing method, characterized by:
s1, grouping frames in a video to be processed to obtain a frame grouping, and dividing the frames in the frame grouping into a basic frame and a change frame;
s2, with the minimum delay as an optimization target, determining the processing main bodies of the basic frames and the change frames in the frame grouping, and distributing the processing main bodies to the determined processing main bodies; the processing main body comprises a local end and a server end;
s3, identifying the basic frame and the change frame through the processing main body to obtain an identification result;
the step S1 is preceded by a prediction step S0, which specifically includes: and predicting the sampling rate and the number of the changed frames of the frame grouping through an intelligent algorithm model according to a preset accuracy rate.
2. The video processing method of claim 1, wherein: the first frame in the frame grouping is a base frame and the remaining frames are change frames.
3. The video processing method according to claim 2, wherein: the data information recorded by the change frame comprises the change amount of the change frame and the previous frame;
the variance includes a motion vector and a residual.
4. A video processing method according to any one of claims 1 to 3, characterized by: the step S2 of using the minimum delay as the optimization target specifically includes optimizing by an objective function shown in the following formula:
Figure FDA0002827972490000011
s.t.f(SGoF,nP)≥Λ,
0<SGoF≤1,
0<nP≤11,
Figure FDA0002827972490000012
in the above formula, LOP is the name abbreviation of delay optimization problem, f (S)GoF,nP) At a sampling rate SGoFAnd the number of changed frames nPThe accuracy rate, Λ, that can be achieved is the predetermined accuracy rate, OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determining the assignment of the changed frames when the t-th frame group arrives, SGoFIs the sampling rate, nPFor varying the number of frames, Q, within a grouping of framesm(T) is the time required for local processing, Q, according to the allocationsAnd (T) is the time required by the service end to process according to the distribution condition, and T is the maximum duration limit of the video.
5. A video processing method according to any one of claims 1 to 3, characterized by: the step S2 of using the minimum delay as the optimization target specifically includes optimizing by an objective function shown in the following formula:
Figure FDA0002827972490000021
s.t.OI(t),OP(t)∈{0,1}.
in the above formula, mod-LOP is a name abbreviation for the simplified delay optimization problem, OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determination of the allocation of the changed frames when the t-th frame grouping arrives, Qm(T) is the time required for local processing, Q, according to the allocationsAnd (T) is the time required for the server to process according to the distribution condition.
6. A video processing system, characterized by: the system comprises a frame grouping module, a distribution module and a result processing module;
the frame grouping module is used for grouping frames in a video to be processed to obtain a frame grouping, and dividing the frames in the frame grouping into a basic frame and a change frame;
the allocation module is used for determining the processing subjects of the base frame and the change frame in the frame grouping by taking the minimum delay as an optimization target and allocating the processing subjects to the determined processing subjects; the processing main body comprises a local end and a server end;
the result processing module is used for acquiring the result of the processing main body for identifying the basic frame and the change frame to obtain an identification result;
the device also comprises a prediction module which is used for predicting the sampling rate of the frame grouping and the number of the changed frames through an intelligent algorithm model according to the preset accuracy rate.
7. The video processing system of claim 6, wherein: the first frame in the frame grouping is a basic frame, and the rest frames are change frames; the data information recorded by the change frame comprises the change amount of the change frame and the previous frame; the variance includes a motion vector and a residual.
8. The video processing system according to claim 6 or 7, wherein: the allocation module takes the minimum delay as an optimization target, and specifically comprises the following steps of optimizing through an objective function shown as the following formula:
Figure FDA0002827972490000022
s.t.f(SGoF,nP)≥Λ,
0<SGoF≤1,
0<nP≤11,
Figure FDA0002827972490000023
in the above formula, LOP is the name abbreviation of delay optimization problem, f (S)GoF,nP) At a sampling rate SGoFAnd the number of changed frames nPThe accuracy rate, Λ, that can be achieved is the predetermined accuracy rate, OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determining the assignment of the changed frames when the t-th frame group arrives, SGoFIs the sampling rate, nPFor varying the number of frames, Q, within a grouping of framesm(T) is the time required for local processing, Q, according to the allocations(T) is the time required for the server to process according to the distribution condition, and T is the maximum duration limit of the video;
or:
the allocation module takes the minimum delay as an optimization target, and specifically comprises the following steps of optimizing through an objective function shown as the following formula:
Figure FDA0002827972490000031
s.t.OI(t),OP(t)∈{0,1}.
in the above formula, mod-LOP is a name abbreviation for the simplified delay optimization problem, OI(t) determination of the allocation of the base frame when the t-th frame grouping arrives, Op(t) determination of the allocation of the changed frames when the t-th frame grouping arrives, Qm(T) is the time required for local processing, Q, according to the allocationsAnd (T) is the time required for the server to process according to the distribution condition.
CN201911410027.7A 2019-12-31 2019-12-31 Video processing method and system Active CN111131835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911410027.7A CN111131835B (en) 2019-12-31 2019-12-31 Video processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911410027.7A CN111131835B (en) 2019-12-31 2019-12-31 Video processing method and system

Publications (2)

Publication Number Publication Date
CN111131835A CN111131835A (en) 2020-05-08
CN111131835B true CN111131835B (en) 2021-02-26

Family

ID=70506265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911410027.7A Active CN111131835B (en) 2019-12-31 2019-12-31 Video processing method and system

Country Status (1)

Country Link
CN (1) CN111131835B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110018834A (en) * 2019-04-11 2019-07-16 北京理工大学 It is a kind of to mix the task unloading for moving cloud/edge calculations and data cache method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8477661B2 (en) * 2009-08-14 2013-07-02 Radisys Canada Ulc Distributed media mixing and conferencing in IP networks
US8646021B2 (en) * 2011-04-20 2014-02-04 Verizon Patent And Licensing Inc. Method and apparatus for providing an interactive application within a media stream
CN107333267B (en) * 2017-06-23 2019-11-01 电子科技大学 A kind of edge calculations method for 5G super-intensive networking scene
CN108809723B (en) * 2018-06-14 2021-03-23 重庆邮电大学 Edge server joint task unloading and convolutional neural network layer scheduling method
CN108540406B (en) * 2018-07-13 2021-06-08 大连理工大学 Network unloading method based on hybrid cloud computing
CN110096362B (en) * 2019-04-24 2023-04-14 重庆邮电大学 Multitask unloading method based on edge server cooperation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110018834A (en) * 2019-04-11 2019-07-16 北京理工大学 It is a kind of to mix the task unloading for moving cloud/edge calculations and data cache method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
资源受限移动边缘计算任务拆分卸载调度决策;张艮山,刘旭宁;《计算机应用与软件》;20191012;全文 *

Also Published As

Publication number Publication date
CN111131835A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN108780499B (en) System and method for video processing based on quantization parameters
US8724704B2 (en) Apparatus and method for motion estimation and image processing apparatus
US11657264B2 (en) Content-specific neural network distribution
US9635374B2 (en) Systems and methods for coding video data using switchable encoders and decoders
CN100586180C (en) Be used to carry out the method and system of de-blocking filter
CN101039434B (en) Video coding apparatus
KR20140110008A (en) Object detection informed encoding
CN112383777B (en) Video encoding method, video encoding device, electronic equipment and storage medium
JP5766877B2 (en) Frame coding selection based on similarity, visual quality, and interest
US20230062752A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
US10623744B2 (en) Scene based rate control for video compression and video streaming
CN110248189B (en) Video quality prediction method, device, medium and electronic equipment
CN110891177B (en) Denoising processing method, device and machine equipment in video denoising and video transcoding
CN110430436A (en) A kind of cloud mobile video compression method, system, device and storage medium
CN111787322B (en) Video coding method and device, electronic equipment and computer readable storage medium
US20120195364A1 (en) Dynamic mode search order control for a video encoder
WO2017180201A1 (en) Adaptive directional loop filter
CN114363649A (en) Video processing method, device, equipment and storage medium
Xiao et al. Dnn-driven compressive offloading for edge-assisted semantic video segmentation
US20180007366A1 (en) Adaptive tile data size coding for video and image compression
CN110418134B (en) Video coding method and device based on video quality and electronic equipment
CN111131835B (en) Video processing method and system
Li et al. Fleet: improving quality of experience for low-latency live video streaming
US20210103813A1 (en) High-Level Syntax for Priority Signaling in Neural Network Compression
CA3182110A1 (en) Reinforcement learning based rate control

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant