CN112084949A - Video real-time identification segmentation and detection method and device

Video real-time identification segmentation and detection method and device

Info

Publication number
CN112084949A
Authority
CN
China
Prior art keywords
frame image
image data
frame
result
type
Prior art date
Legal status
Granted
Application number
CN202010946166.8A
Other languages
Chinese (zh)
Other versions
CN112084949B (en)
Inventor
景乃锋
宋卓然
吴飞洋
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202010946166.8A
Publication of CN112084949A
Application granted
Publication of CN112084949B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/49 - Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 - Motion estimation or motion compensation
    • H04N19/513 - Processing of motion vectors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for real-time video identification, segmentation and detection. The segmentation method comprises the steps of: decoding a target video to obtain its I-type frame image data, P-type frame image data and motion vector table; acquiring the motion vectors of the B-type frames based on the motion vector table; obtaining an I-type frame image segmentation result and a P-type frame image segmentation result through a first preset neural network; acquiring the reconstruction result of the B-type frames according to the I-type frame image segmentation result, the P-type frame image segmentation result, the motion vectors of the B-type frames and the reconstruction results of the already-acquired B-frame image data; and inputting the reconstruction result of the B-type frames, the I-type frame image segmentation result and the P-type frame image segmentation result into a second preset neural network to obtain the image segmentation result of the B-type frames. The method achieves higher performance while maintaining accuracy, and solves the problem that conventional methods for video recognition tasks cannot reduce computation and energy consumption while guaranteeing high precision.

Description

Video real-time identification segmentation and detection method and device
Technical Field
The invention relates to the technical field of neural networks, in particular to a method and a device for identifying, segmenting and detecting videos in real time.
Background
Deep convolutional neural networks have found widespread application in image recognition, for example in the classification, detection and segmentation of images. With their development, their range of application has gradually been extended to the video domain.
Deep learning is well suited to image recognition tasks. For object segmentation, fully convolutional networks (FCNs) have been widely applied in the field; for object detection, the R-CNN family occupies a dominant position. However, directly applying an image recognition model to every frame of a video incurs an unacceptable amount of computation and power consumption. Based on this limitation of image recognition, researchers have proposed many neural networks for video recognition: for example, OSVOS proposes a dual-stream FCN model for foreground and contour respectively; for better performance, FAVOS proposes local segmentation based on tracked target objects, followed by an ROI SegNet, a robust but still large network for segmenting targets. These techniques, however, achieve high accuracy at the cost of high computation and energy consumption.
Furthermore, it is known that image information changes slowly between video frames, and this data redundancy can be exploited to reduce computation. To achieve real-time video segmentation, DFF proposes the deep feature flow method, which was the first to directly combine optical flow with key-frame features: the optical flow is extracted by a neural network, and key features are extracted from key frames by a large convolutional neural network. However, key frames are selected at a fixed frame interval, which affects recognition accuracy, and extracting optical flow is itself expensive.
Disclosure of Invention
The invention aims to solve the technical problem that existing methods for video recognition tasks consume a large amount of computation and energy when high recognition precision is required, and sacrifice recognition precision when computation and energy consumption are reduced.
In order to solve the technical problem, the invention provides a video real-time identification and segmentation method, which comprises the following steps:
decoding a target video through a preset video decoder to obtain I-type frame image data, P-type frame image data and a motion vector table of the target video;
acquiring a motion vector of the B-type frame based on the motion vector table;
inputting the I-type frame image data and the P-type frame image data into a first preset neural network to obtain an I-type frame image segmentation result and a P-type frame image segmentation result;
acquiring a reconstruction result of the B-type frame according to the I-type frame image segmentation result, the P-type frame image segmentation result, the motion vector of the B-type frame and the acquired reconstruction result of the B-type frame image data;
inputting the reconstruction result of the B-type frame, the image segmentation result of the I-type frame and the image segmentation result of the P-type frame into a second preset neural network according to a preset input mode to obtain the image segmentation result of the B-type frame;
the video coding and decoding standard of the target video classifies frames into I-frame image data, B-frame image data and P-frame image data, provides a motion vector table, and divides each frame of image data into a plurality of small divided blocks in a preset manner.
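For illustration only, the following Python-style sketch outlines how the above steps could fit together; the function names and the decoder interface are assumptions introduced here, not the claimed implementation, which depends on the chosen video codec and neural networks.

```python
def segment_video(video_path, decode, first_net, second_net, reconstruct_b_frames):
    """Illustrative flow only; all callables are injected assumptions."""
    # Step 1: decode into I-type frames, P-type frames, B-type frames and the
    # motion vector table (frame index -> motion vectors of that frame).
    i_frames, p_frames, b_frames, mv_table = decode(video_path)

    # Step 2: keep only the motion vectors that belong to B-type frames.
    b_mvs = {idx: mv_table[idx] for idx in b_frames}

    # Step 3: segment I-type and P-type frames with the first preset network.
    ref_results = {idx: first_net(img) for idx, img in {**i_frames, **p_frames}.items()}

    # Step 4: reconstruct every B-type frame from the reference segmentation
    # results and its motion vectors (in decoding order, reusing already
    # reconstructed B frames as additional references).
    b_recons = reconstruct_b_frames(ref_results, b_mvs)

    # Step 5: form (preceding I/P result, B reconstruction, following I/P result)
    # input groups and refine them with the second preset network.
    b_results = {}
    for idx, recon in b_recons.items():
        prev_idx = max(i for i in ref_results if i < idx)
        next_idx = min(i for i in ref_results if i > idx)
        b_results[idx] = second_net(ref_results[prev_idx], recon, ref_results[next_idx])

    return {**ref_results, **b_results}
```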
Preferably, the step of obtaining a reconstruction result of the B-class frame according to the segmentation result of the I-class frame image, the segmentation result of the P-class frame image, the motion vector of the B-class frame, and the reconstruction result of the obtained B-frame image data includes:
sequentially acquiring the reconstruction results of B frame image data which is not acquired in the B frame image data according to a decoding sequence on the basis of the I frame image segmentation result, the P frame image segmentation result, the motion vector of the B frame and the reconstruction results of the acquired B frame image data; when the reconstruction result of B frame image data which is not acquired is acquired, all the reconstruction results of the acquired B frame image data are used as the reconstruction results of the acquired B frame image data;
acquiring the reconstruction result of a single B-frame image data in the B-type frame image data based on the I-type frame image segmentation result, the P-type frame image segmentation result, the B-type frame motion vectors and the reconstruction results of the already-acquired B-frame image data comprises the following steps:
sequentially acquiring the reconstruction result of each divided small block in the single B-frame image data based on the I-type frame image segmentation result, the P-type frame image segmentation result, the B-type frame motion vectors and the reconstruction results of the already-acquired B-frame image data;
and collecting the reconstruction results of all the segmentation small blocks in the B frame image data to obtain the reconstruction result of the B frame image data in the B-type frame image data.
Preferably, the obtaining a reconstruction result of a single segmentation small block in a single frame image data in a type B frame image data based on the type I frame image segmentation result, the type P frame image segmentation result, the type B frame motion vector, and the obtained reconstruction result of the type B frame image data comprises:
acquiring all motion vectors corresponding to a single segmentation small block based on the motion vectors of the B-type frame;
acquiring all reference divided small blocks of the divided small blocks based on all motion vectors of the divided small blocks;
and averaging all the reference divided small blocks of the divided small block pixel by pixel using a mean filter to obtain the reconstruction result of the divided small block.
Preferably, the step of inputting the B-class frame image reconstruction result, the I-class frame image segmentation result, and the P-class frame image segmentation result into a second preset neural network according to a preset input mode, and obtaining the B-class frame image segmentation result includes:
respectively taking, as one input group, the reconstruction result of a single frame of image data in the B-type frame reconstruction result, the image segmentation result of the first preceding I-frame or P-frame image data of that frame in the video playing order, and the image segmentation result of the first following I-frame or P-frame image data;
sequentially inputting all input groups into the second preset neural network to respectively obtain image segmentation results of B frame image data corresponding to each input group;
the image segmentation results of all the B frame image data constitute the image segmentation results of the B-class frame.
Preferably, the first preset neural network is a large deep neural network for video segmentation; the second preset neural network is a convolutional neural network comprising three layers, namely a convolutional layer, a pooling layer and an activation layer, wherein the first layer has three input image channels.
In order to solve the above technical problem, the present invention further provides a video real-time detection method, including:
decoding a target video through a preset video decoder to obtain I-type frame image data, P-type frame image data and a motion vector table of the target video;
acquiring a motion vector of the B-type frame based on a motion vector table;
inputting the I-type frame image data and the P-type frame image data into a third preset neural network to obtain an I-type frame image detection result and a P-type frame image detection result;
respectively acquiring the detected frames (i.e. detection bounding boxes) in all reference frame image detection results corresponding to each B-frame image data in the B-type frame image data, and respectively sorting the detected frames in all reference frame image detection results corresponding to each B-frame image data to obtain the frame-to-be-set ordering of each B-frame image data;
setting corresponding B frame image data based on the frame to be set of each B frame image data in an ordering mode, obtaining the reconstruction result of each B frame image data, and enabling the reconstruction results of all B frame image data to form the reconstruction result of a B-type frame;
inputting the reconstruction result of the B-type frame, the set I-type frame image detection result and the set P-type frame image detection result into a second preset neural network according to a preset input mode to obtain the image detection result of the B-type frame;
the video coding and decoding standard of the target video classifies frames into I-frame image data, B-frame image data and P-frame image data, provides a motion vector table, and divides each frame of image data into a plurality of small divided blocks in the encoder's preset manner;
the set I-frame image detection result is composed of the image detection results of all set I-frame image data, and the set P-frame image detection result is composed of the image detection results of all set P-frame image data;
the image detection result of a set I-frame or P-frame image data is obtained by setting all detected frames and their internal pixel points in the image detection result of that frame image data to colors of a first type, and setting all other pixel points outside the detected frames to a color of a second type; the first type of color comprises a plurality of colors, and detected frames of different classes, together with their internal pixel points, in all reference frame image data corresponding to each B-frame image data are set to different colors.
Preferably, the step of obtaining the detected frames in the detection results of all reference frame images corresponding to each B frame image data in the B-type frame image data, and sorting the detected frames in the detection results of all reference frame images corresponding to each B frame image data, respectively, to obtain the ordering of frames to be set of each B frame image data includes:
taking the first B frame image data as target B frame image data in decoding order;
acquiring a reference frame of the target B frame image data based on the motion vector of the B frame, classifying all detected frames in all reference frames corresponding to the target B frame image data according to a preset classification requirement, and sequencing the classified detected frames according to a preset classification sequence to obtain a to-be-set frame sequence of the target B frame image data;
and judging whether the current target B frame image data is the last B frame image data in the decoding sequence, if so, finishing the sequencing, otherwise, taking the next B frame image data as new target B frame image data according to the decoding sequence, and acquiring the sequencing of frames to be set of the new target B frame image data.
Preferably, the setting of the B-frame image data based on the to-be-set frame ordering of the single B-frame image data, and the obtaining of the reconstruction result of the B-frame image data includes:
taking the first class of detected frame as the frame to be set according to the frame-to-be-set ordering;
setting the frame to be set and the internal pixel points thereof as a first color and setting other pixel points outside the frame to be set as a second color in the image detection results of all the reference frames with the frame to be set of the B frame image data, and acquiring the image detection result of the set reference frame of the B frame image data aiming at the frame to be set;
acquiring a sub-reconstruction result in the B frame image data according to an image detection result of a set reference frame corresponding to the B frame image data of the frame to be set, a motion vector of the B frame and a reconstruction result of the acquired B frame image data;
and judging whether the current frame to be set is the last detected frame in the frame sequence to be set, if so, merging all sub-reconstruction results of the B frame image data to obtain the reconstruction result of the B frame image data, otherwise, taking the next detected frame as a new frame to be set according to the frame sequence to be set, and obtaining a sub-reconstruction result of the B frame image data corresponding to the new frame to be set.
Preferably, the third preset neural network is a large deep neural network for video detection; the second preset neural network is a convolutional neural network comprising three layers, namely a convolutional layer, a pooling layer and an activation layer, wherein the first layer has three input image channels.
In order to solve the technical problem, the invention also provides a video real-time identification, segmentation and detection device, which comprises a decoding module, a motion vector acquisition module, an I/P frame image segmentation or detection result acquisition module, a frame to be set ordering module, a reconstruction result acquisition module and a B frame image segmentation/detection result acquisition module;
the decoding module is used for decoding a target video through a preset video decoder to obtain I-type frame image data, P-type frame image data and a motion vector table of the target video;
the motion vector acquisition module is used for acquiring the motion vector of the B-type frame based on the motion vector table;
the I/P frame image segmentation or detection result acquisition module is used for inputting the I frame image data and the P frame image data into a first preset neural network to obtain an I frame image segmentation result and a P frame image segmentation result, and inputting the I frame image data and the P frame image data into a third preset neural network to obtain an I frame image detection result and a P frame image detection result;
the frame ordering module to be set is used for respectively acquiring detected frames in all reference frame image detection results corresponding to each B frame image data in the B-type frame image data, and respectively sorting the detected frames in all reference frame image detection results corresponding to each B frame image data to obtain the frame ordering to be set of each B frame image data;
the reconstruction result acquisition module is used for acquiring the reconstruction result of the B-type frame according to the I-type frame image segmentation result, the P-type frame image segmentation result, the motion vector of the B-type frame and the acquired reconstruction result of the B-type frame image data, and is used for setting the corresponding B-type frame image data respectively based on the frame sequence to be set of each B-type frame image data to acquire the reconstruction result of each B-type frame image data, and the reconstruction results of all B-type frame image data form the reconstruction result of the B-type frame;
the B-type frame image segmentation/detection result acquisition module is used for inputting a reconstruction result of the B-type frame, an I-type frame image segmentation result and a P-type frame image segmentation result into a second preset neural network according to a preset input mode to obtain an image segmentation result of the B-type frame, and is used for inputting the reconstruction result of the B-type frame, a set I-type frame image detection result and a set P-type frame image detection result into the second preset neural network according to the preset input mode to obtain an image detection result of the B-type frame;
the video coding and decoding standard of the target video classifies frames into I-frame image data, B-frame image data and P-frame image data, provides a motion vector table, and divides each frame of image data into a plurality of small divided blocks in the encoder's preset manner.
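Purely as an illustration of how these modules could be wired together, a minimal Python sketch is given below; the class name, field names and method are hypothetical and each module is injected as a callable rather than implemented here.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class VideoRecognitionDevice:
    """Illustrative wiring of the modules listed above; names are assumptions."""
    decoding_module: Callable[..., Any]        # preset video decoder
    motion_vector_module: Callable[..., Any]   # extracts B-type frame motion vectors
    ip_result_module: Callable[..., Any]       # first/third preset network wrapper
    box_ordering_module: Callable[..., Any]    # frame-to-be-set ordering (detection only)
    reconstruction_module: Callable[..., Any]  # B-type frame reconstruction
    b_result_module: Callable[..., Any]        # second preset network wrapper

    def run_segmentation(self, video):
        # Decoding module -> motion vector module -> I/P results -> reconstruction
        # -> B-type frame results, mirroring the module list above.
        i, p, b, mv_table = self.decoding_module(video)
        b_mvs = self.motion_vector_module(mv_table, b)
        ip_results = self.ip_result_module(i, p, task="segmentation")
        b_recons = self.reconstruction_module(ip_results, b_mvs)
        return self.b_result_module(b_recons, ip_results)
```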
In order to solve the above technical problem, the present invention further provides a storage medium having a computer program stored thereon, which when executed by a processor, implements the video real-time recognition segmentation method.
In order to solve the above technical problem, the present invention further provides a storage medium having a computer program stored thereon, which when executed by a processor, implements the video real-time detection method.
In order to solve the above technical problem, the present invention further provides a terminal, comprising a processor, a first memory and a second memory, wherein the first memory and the second memory are each communicatively connected to the processor;
the first memory is used for storing computer programs, and the processor is used for executing the computer programs stored in the first memory so as to enable the terminal to execute a real-time video identification and segmentation method;
the second memory is used for storing a computer program, and the processor is used for executing the computer program stored in the second memory so as to enable the terminal to execute the video real-time detection method.
Compared with the prior art, one or more embodiments in the above scheme can have the following advantages or beneficial effects:
the video real-time identification, segmentation and detection method provided by the embodiment of the invention is mainly used for classifying the video coding standard into videos with I-frame image data, B-frame image data and P-frame image data, has a motion vector table, and each frame of image data is segmented and detected into a plurality of segmented small blocks according to the preset mode of an encoder, and further divides a target video into I-type frame image data and P-type frame image data through a video decoder, and then obtains an I-type frame image segmentation/detection side result and a P-type frame image segmentation/detection result through the processing of a neural network, and then obtains an image segmentation result of a B-type frame by taking the I-type frame image segmentation/detection result and the P-type frame image segmentation/detection result as the basis, and further obtains the segmentation result of the target video. The method of the invention realizes higher performance while maintaining accuracy by closely connecting the video decoder and the neural network, and solves the problem that the existing processing method for the video identification task can not reduce the calculation amount and the energy consumption on the basis of ensuring higher precision.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a video real-time recognition and segmentation method according to an embodiment of the present invention;
FIG. 2 is a table showing partial class B frame motion vectors of a horse-riding video in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a horse riding video result reconstruction according to an embodiment of the invention;
FIG. 4 is a flow chart of a video real-time detection method according to a second embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a video real-time recognition segmentation and detection apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The following detailed description of the embodiments of the present invention will be provided with reference to the drawings and examples, so that how to apply the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, as long as there is no conflict, the embodiments and the features of the embodiments of the present invention may be combined with each other, and the technical solutions formed are within the scope of the present invention.
In current deep-learning-based image recognition, segmentation and detection, fully convolutional networks are widely used for object segmentation tasks, and the R-CNN family dominates object detection. However, if an image recognition model is directly applied to every frame of a video, the computation and energy consumption are unaffordable. Therefore, based on the limitations of image recognition, researchers have proposed many neural networks for video recognition, such as the dual-stream FCN model proposed by OSVOS and the local segmentation based on tracked target objects proposed by FAVOS, but these achieve high accuracy at the cost of high computation and energy consumption. Further, in order to achieve real-time video segmentation, DFF proposes the deep feature flow method, which directly combines optical flow and key-frame features; however, the key frames are selected at a fixed frame interval, which affects recognition accuracy, and extracting optical flow carries a large overhead.
Example one
In order to solve the technical problems in the prior art, the embodiment of the invention provides a video real-time identification and segmentation method.
FIG. 1 is a flow chart of a video real-time recognition and segmentation method according to an embodiment of the present invention; referring to fig. 1, a video real-time identification and segmentation method according to an embodiment of the present invention includes the following steps.
Step S101, decoding the target video through a preset video decoder to obtain I-type frame image data, P-type frame image data and a motion vector table of the target video.
Specifically, the video coding and decoding standard of the target video in the embodiment of the present invention classifies frames into I-frame image data, B-frame image data and P-frame image data, divides each frame of image data into blocks of a preset size, and provides a motion vector table. For example, the target video may follow the H.265 standard, in which case the divided small blocks are coding tree blocks; or it may follow the H.264 standard, in which case the divided small blocks are macroblocks. A motion vector is the quantity by which the decoder represents the motion trajectory of a divided small block, recorded in the code stream as a dependency relationship. The frame and the divided small block that are depended upon are referred to as the reference frame and the reference divided small block, respectively. The motion vector table therefore includes the reference frames corresponding to each B-frame and P-frame image data in the target video, as well as the reference divided small blocks corresponding to each divided small block in the B-frame and P-frame image data; here the motion vectors of the B-frame image data are the ones we need to use.
It should be noted that each frame of image data may be divided according to the basic unit of divided small blocks, a typical size of which is 8 × 8 pixels. The processing of I-type, P-type and B-type frame image data by the decoder has the following characteristics. For I-type frame image data, each divided small block undergoes intra-frame prediction through 14 prediction modes; the decoder runs a different prediction algorithm in each mode and computes the sum of absolute differences between the current divided small block and already-processed divided small blocks, then selects the mode and the divided small block that minimize the sum of absolute differences. For P-type frame image data, the decoder searches over a larger range that includes already-encoded divided small blocks of the current frame (intra prediction) and of preceding frames (inter prediction); the mode and the divided small block are again selected so as to minimize the sum of absolute differences. For B-type frame image data, the decoder searches over a still larger range that may also include already-encoded divided small blocks in subsequent frames; the mode and the divided small block are again selected so as to minimize the sum of absolute differences.
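As a concrete illustration of the sum-of-absolute-differences (SAD) criterion mentioned above, the following sketch shows how a best-matching candidate block could be chosen; it assumes NumPy, 8 × 8 blocks and a pre-built candidate set, and is not the codec's actual search procedure.

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(block_a.astype(np.int32) - block_b.astype(np.int32)).sum())

def best_match(current_block, candidate_blocks):
    """Pick the candidate (e.g. intra- or inter-predicted block) with minimal SAD.

    `candidate_blocks` maps a (mode, position) key to an 8x8 block; how the
    candidates are generated depends on the codec and is not shown here.
    """
    return min(candidate_blocks.items(), key=lambda kv: sad(current_block, kv[1]))

# Toy example with random data.
rng = np.random.default_rng(0)
cur = rng.integers(0, 256, (8, 8), dtype=np.uint8)
candidates = {("intra", (0, 0)): rng.integers(0, 256, (8, 8), dtype=np.uint8),
              ("inter", (16, 8)): cur.copy()}  # perfect match, SAD = 0
key, block = best_match(cur, candidates)
print(key)  # ('inter', (16, 8))
```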
The decoder records the decoding order of the frames according to inter-frame dependencies, because B-frame image data may refer to both preceding and subsequent frames. For example, suppose (I0, B1, B2, B3, P4, I5, B6, P7) is the playing order of the video; then (I0, P4, B3, B2, B1, I5, P7, B6) will be the actual decoding order, since B3 depends on I0 and P4. Further, the decoder converts the code stream back into a conventional sequence of frames according to this decoding order. In particular, for I-type frame image data, the divided small blocks are restored using intra prediction according to the selected prediction mode, the divided small blocks referenced in the motion vectors, and the corresponding residuals. For both P-frame and B-frame image data, the decoder uses intra-frame as well as inter-frame prediction. It should be noted that all decoded I, P and B frames are written back to global storage or a buffer for display.
Step S102, the motion vector of the B-type frame is obtained based on the motion vector table.
Specifically, the motion vector table includes the reference frames on which each B-frame and P-frame image data in the target video depends, and the reference divided small blocks on which each divided small block in the B-frame and P-frame image data depends. In this embodiment only the motion vectors of the B-frame image data are used, so the motion vector of each divided small block in each B-frame image data of the B-type frame image data needs to be acquired from the motion vector table.
Step S103, inputting the I-type frame image data and the P-type frame image data into a first preset neural network to obtain an I-type frame image segmentation result and a P-type frame image segmentation result.
Specifically, in this embodiment, the first preset neural network is set as a large deep neural network for video segmentation, and then the acquired I-type frame image data and P-type frame image data are respectively input into the first preset neural network, so as to obtain an I-type frame image segmentation result and a P-type frame image segmentation result.
And step S104, acquiring the reconstruction result of the B-type frame according to the segmentation result of the I-type frame image, the segmentation result of the P-type frame image, the motion vector of the B-type frame and the reconstruction result of the acquired B-frame image data.
Specifically, in order to obtain the reconstruction result of the B-type frame, the reconstruction result of each frame of image data in the B-type frame image data needs to be obtained first, and the reconstruction result of each frame of image data forms the reconstruction result of the B-type frame; and acquiring the reconstruction result of each frame of image data needs to acquire the reconstruction result of each segmentation small block in the corresponding frame of image data, and the reconstruction result of each frame of image data corresponding to all the segmentation small blocks in each frame of image data forms the reconstruction result of each frame of image data. It should be noted that the reconstruction result of each divided small block and the reconstruction result of each frame of image data are obtained on the premise of the segmentation result of the I-type frame image, the segmentation result of the P-type frame image, the motion vector of the B-type frame, and the reconstruction result of the obtained B-type frame image data; wherein the special feature is that the reconstruction result of the first acquired B frame image data is acquired based on the I-type frame image segmentation result, the P-type frame image segmentation result, and the B-type frame motion vector.
Further, the process of obtaining the reconstruction result of a single divided small block in a single B-frame image data, based on the I-type frame image segmentation result, the P-type frame image segmentation result, the B-type frame motion vectors and the reconstruction results of the already-acquired B-frame image data, is as follows: the divided small block whose reconstruction result is to be obtained is taken as the block to be reconstructed; all motion vectors corresponding to the block to be reconstructed are first obtained from the B-type frame motion vectors; all reference divided small blocks of the block to be reconstructed are then obtained from these motion vectors; finally, all the reference divided small blocks are averaged pixel by pixel with a mean filter to obtain the reconstruction result of the block to be reconstructed.
The process of sequentially acquiring the reconstruction result of single B frame image data in the B frame image data based on the I frame image segmentation result, the P frame image segmentation result, the B frame motion vector and the acquired reconstruction result of the B frame image data comprises the following steps: firstly, a reconstruction result of first B frame image data in a decoding sequence is obtained based on an I type frame image segmentation result, a P type frame image segmentation result and a B type frame motion vector, and the obtained reconstruction result of the B frame image data is used as the reconstruction result of the obtained B frame image data. Then, a single B frame image data to be obtained after the first B frame image data is set in turn according to the decoding sequence is used as the current frame image data, and for each different current frame image data: firstly, sequentially acquiring a reconstruction result of each segmented small block in current frame image data based on an I-type frame image segmentation result, a P-type frame image segmentation result, a B-type frame motion vector and an acquired reconstruction result of B-frame image data; and then, the reconstruction results of all the small segmented blocks in the current frame image data are collected to obtain the reconstruction result of the current frame image data. When the reconstruction result of the B frame image data which is not acquired is acquired, the reconstruction results of all the acquired B frame image data are used as the reconstruction results of the acquired B frame image data.
Therefore, based on the above manner of obtaining the reconstruction result of a single frame of image data and of a single divided small block, the process of obtaining the reconstruction result of the B-type frames according to the I-type frame image segmentation result, the P-type frame image segmentation result, the B-type frame motion vectors and the reconstruction results of the already-acquired B-frame image data is specifically as follows: the reconstruction result of each divided small block in a single B-frame image data is acquired based on the I-type frame image segmentation result, the P-type frame image segmentation result, the B-type frame motion vectors and the reconstruction results of the already-acquired B-frame image data, and the reconstruction results of all divided small blocks in the current B-frame image data are aggregated to obtain the reconstruction result of the current frame image data; then the reconstruction result of each divided small block in the next B-frame image data is acquired in decoding order and aggregated to obtain the reconstruction result of that frame; and so on, until the reconstruction results of all B-frame image data are obtained, which together constitute the reconstruction result of the B-type frames.
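A minimal NumPy sketch of this procedure is given below; the 8 × 8 block size and the motion-vector representation (reference frame index plus source coordinates) are assumptions made only for illustration, not the claimed implementation.

```python
import numpy as np

BLOCK = 8  # assumed size of a divided small block, in pixels

def reconstruct_block(mvs, results):
    """Pixel-wise mean (mean filter) over all reference blocks of one divided small block.

    `mvs` is a list of (ref_frame_idx, src_x, src_y) tuples; `results` maps a frame
    index to its segmentation result or already-acquired reconstruction result.
    """
    refs = [results[ref][sy:sy + BLOCK, sx:sx + BLOCK] for ref, sx, sy in mvs]
    return np.mean(np.stack(refs), axis=0)

def reconstruct_b_frames(ref_results, b_mvs, decode_order, shape):
    """Reconstruct every B frame in decoding order.

    `ref_results` holds the I/P-frame segmentation results; `b_mvs[idx]` maps each
    block position (dst_x, dst_y) of B frame `idx` to its motion vectors. Each
    reconstructed B frame is added to `results` so later B frames can reference it.
    """
    results = dict(ref_results)
    b_recons = {}
    for idx in decode_order:                     # indices of B frames only
        frame = np.zeros(shape, dtype=np.float32)
        for (dx, dy), mvs in b_mvs[idx].items():
            frame[dy:dy + BLOCK, dx:dx + BLOCK] = reconstruct_block(mvs, results)
        results[idx] = frame
        b_recons[idx] = frame
    return b_recons
```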
In order to describe the reconstruction process of a single divided small block in a single frame of the B-type frame image data in more detail, a horse-riding video is taken as the target video. FIG. 2 shows a table of motion vectors of part of the B-type frames of the horse-riding video according to an embodiment of the invention, in which cur denotes the frame number of the current frame, ref denotes the frame number of the reference frame, S_cur denotes the image segmentation result of the current frame, S_ref denotes the image segmentation result of the reference frame, srcx and srcy denote the x and y coordinates of the divided small block in S_ref, and dstx and dsty denote the x and y coordinates of the divided small block in S_cur; F indicates that the divided small block can be directly assigned during reconstruction, and T indicates that a mean operation needs to be performed on the divided small block during reconstruction. FIG. 3 is a schematic diagram of the reconstruction of a horse-riding video result according to an embodiment of the invention.
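To make the table fields concrete, the following sketch models one row of the motion vector table of FIG. 2 as a small data record; the field names follow the description above, and the two sample rows reproduce the example discussed below for FIG. 3.

```python
from dataclasses import dataclass

@dataclass
class MotionVector:
    """One row of the motion vector table in FIG. 2 (field names as described above)."""
    cur: str    # frame number of the current frame
    ref: str    # frame number of the reference frame
    srcx: int   # x coordinate of the referenced block in S_ref
    srcy: int   # y coordinate of the referenced block in S_ref
    dstx: int   # x coordinate of the reconstructed block in S_cur
    dsty: int   # y coordinate of the reconstructed block in S_cur
    flag: str   # 'F': block can be assigned directly; 'T': mean over references

# Sample rows corresponding to the example of FIG. 3.
mv_table = [
    MotionVector("B1", "I0", srcx=240, srcy=136, dstx=256, dsty=108, flag="F"),
    MotionVector("B1", "P4", srcx=392, srcy=232, dstx=256, dsty=108, flag="T"),
]
```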
The acquisition of the B-type frame reconstruction result proceeds roughly as follows: the segmentation results of the I-type and P-type frame images are used as reference frames. Taking the reconstruction of a B_X frame image data as an example, the divided small block at (srcx, srcy) is first taken out of the reference frame segmentation result S_ref according to the motion vector, and the extracted divided small block is then placed into the current-frame reconstruction result at coordinates (dstx, dsty).
Further, the example shown in FIG. 3 considers the divided small block at (dstx, dsty) = (256, 108) of the B1 frame image data, which has two motion vectors, (B1, I0, 256, 108, 240, 136, F) and (B1, P4, 256, 108, 392, 232, T), and is therefore reconstructed from two reference divided small blocks. Through the two motion vectors we locate the two reference divided small blocks, in this example at (240, 136) in reference frame I0 and at (392, 232) in reference frame P4. The data of the different reference divided small blocks are then averaged pixel by pixel with a mean filter to obtain the divided small block of B1.
And step S105, inputting the reconstruction result of the B-type frame, the image segmentation result of the I-type frame and the image segmentation result of the P-type frame into a second preset neural network according to a preset input mode to obtain the image segmentation result of the B-type frame.
After the reconstruction result of the B-type frames is obtained in the above steps, the image segmentation result of the B-type frames is obtained on the basis of the I-type and P-type frame image segmentation results. Specifically, the image segmentation result of the B-type frames comprises the image segmentation results of all the B-frame image data, which are acquired one by one. First, the reconstruction result of a single frame of image data in the B-type frame reconstruction result, the image segmentation result of the first preceding I-frame or P-frame image data of that frame in the playing order, and the image segmentation result of the first following I-frame or P-frame image data are taken together as an input group; then all input groups are input in turn into the second preset neural network to obtain the image segmentation result of the B-frame image data corresponding to each input group; finally, the image segmentation results of all B-frame image data are collected to form the image segmentation result of the B-type frames.
It should be noted that the second preset convolutional neural network comprises three layers, namely a convolutional layer, a pooling layer and an activation layer, and the input of the first layer has three image channels. Preferably, when each input group is fed in, the reconstruction result of the B-frame image data in the group is input to the second input channel of the first layer, while the image segmentation result of the first preceding I-frame or P-frame image data in the playing order and the image segmentation result of the first following I-frame or P-frame image data are input to the first and third input channels of the first layer, respectively.
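The structure of the second preset network and the channel ordering described above could look like the following PyTorch sketch; the layer sizes, kernel sizes, input resolution and number of output classes are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

class SecondNet(nn.Module):
    """Three-layer refinement network: convolution, pooling, activation.

    Channel order follows the description above: the first input channel is the
    preceding I/P segmentation result, the second is the B-frame reconstruction
    result, and the third is the following I/P segmentation result.
    """
    def __init__(self, num_classes: int = 2):  # assumed number of classes
        super().__init__()
        self.conv = nn.Conv2d(in_channels=3, out_channels=num_classes,
                              kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.pool(self.conv(x)))

# Forming one input group: preceding I/P result, B reconstruction, following I/P result.
prev_ip = torch.rand(1, 480, 640)   # first input channel
b_recon = torch.rand(1, 480, 640)   # second input channel
next_ip = torch.rand(1, 480, 640)   # third input channel
group = torch.cat([prev_ip, b_recon, next_ip], dim=0).unsqueeze(0)  # shape (1, 3, H, W)

out = SecondNet()(group)
print(out.shape)  # torch.Size([1, 2, 240, 320])
```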
Meanwhile, it should be noted that step S104 and step S105 can be performed in parallel: as soon as the reconstruction result of one B-frame image data is obtained in step S104, step S105 can input that reconstruction result, together with the image segmentation result of the first preceding I-frame or P-frame image data and the image segmentation result of the first following I-frame or P-frame image data in the playing order, into the second preset neural network to obtain the image segmentation result of that B-frame image data. Once the reconstruction result of the last B-frame image data is obtained in step S104 in this manner, step S105 can quickly complete the image segmentation result of the last B-frame image data; this saves time and increases the speed at which the segmentation result of the target video is acquired.
The video real-time identification and segmentation method provided by the embodiment of the invention is mainly applied to videos whose coding standard classifies frames into I-frame, B-frame and P-frame image data, provides a motion vector table, and divides each frame of image data into a plurality of small divided blocks in a preset manner. The target video is decoded by a video decoder into I-type frame image data and P-type frame image data, which are processed by a neural network to obtain the I-type frame image segmentation result and the P-type frame image segmentation result; the image segmentation result of the B-type frames is then obtained on this basis, and the segmentation result of the target video is thus obtained. By closely coupling the video decoder with the neural networks, the method achieves higher performance while maintaining accuracy, and solves the problem that existing methods for video recognition tasks cannot reduce computation and energy consumption while guaranteeing high precision.
Example two
In order to solve the technical problems in the prior art, the embodiment of the invention provides a video real-time detection method.
FIG. 4 is a flow chart of a video real-time detection method according to a second embodiment of the present invention; referring to fig. 4, a video real-time detection method according to an embodiment of the present invention includes the following steps.
Step S201, decoding the target video by a preset video decoder to obtain I-type frame image data, P-type frame image data, and a motion vector table of the target video.
The step is the same as the step S101 in the first embodiment, and may be implemented by referring to the step S101 in the first embodiment, and specific content thereof is not described herein again.
In step S202, a motion vector of the B-class frame is acquired based on the motion vector table.
The step is the same as the step S102 in the first embodiment, and the step S102 in the first embodiment may be specifically implemented, and specific content thereof is not described herein again.
Step S203, inputting the I-type frame image data and the P-type frame image data into a third preset neural network to obtain an I-type frame image detection result and a P-type frame image detection result.
Apart from the neural network used, this step is the same as step S103 in the first embodiment and may be implemented with reference to step S103 in the first embodiment; its details are not repeated here.
It should be noted that the third predetermined neural network is a large deep neural network for video detection.
Step S204, respectively obtaining detected frames in all reference frame image detection results corresponding to each B frame image data in the B-type frame image data, and respectively sorting the detected frames in all reference frame image detection results corresponding to each B frame image data to obtain a frame sequence to be set of each B frame image data.
Specifically, the first B frame image data among them is taken as the target B frame image data in decoding order.
A reference frame of the target B-frame image data is acquired based on the B-type frame motion vectors; all detected frames in all reference frames corresponding to the target B-frame image data are classified according to a preset classification requirement, and the classified detected frames are sorted according to a preset class order, giving the frame-to-be-set ordering of the target B-frame image data. It should be noted that the reference frames of B-frame image data may be I-frame image data, P-frame image data or B-frame image data, and there may be multiple detected frames on each reference frame. Further, the preset classification requirement is that detected frames with the same content belong to the same class; the classification and sorting process is thus: all detected frames in all reference frames corresponding to the target B-frame image data are classified, and the classes are then sorted according to the preset class order set in advance, which yields the frame-to-be-set ordering of the target B-frame image data.
After the frame-to-be-set ordering of the current target B-frame image data is obtained as above, it is judged whether the current target B-frame image data is the last B-frame image data in the decoding order; if so, the ordering task is finished; otherwise, the next B-frame image data in decoding order is taken as the new target B-frame image data and the previous step is repeated to obtain its frame-to-be-set ordering. In this way the frame-to-be-set ordering of every B-frame image data in the B-type frame image data is obtained.
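A sketch of this classification-and-sorting step is given below, assuming each detected frame is represented by a class label and box coordinates and that a preset class order is supplied; the data layout is hypothetical.

```python
from collections import defaultdict

def frames_to_set_ordering(reference_detections, class_order):
    """Group detected frames from all reference frames by class, then order the
    classes according to the preset class order.

    `reference_detections` is a list of per-reference-frame detection results,
    each a list of (class_label, box) pairs; `class_order` is the preset order.
    """
    grouped = defaultdict(list)
    for det_result in reference_detections:
        for label, box in det_result:
            grouped[label].append(box)
    return [(label, grouped[label]) for label in class_order if label in grouped]

# Toy example: two reference frames, boxes given as (x, y, w, h).
refs = [[("person", (10, 20, 40, 80)), ("horse", (60, 30, 90, 70))],
        [("horse", (65, 32, 92, 71))]]
print(frames_to_set_ordering(refs, class_order=["horse", "person"]))
# [('horse', [(60, 30, 90, 70), (65, 32, 92, 71)]), ('person', [(10, 20, 40, 80)])]
```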
Step S205, setting the corresponding B-frame image data based on the frame to be set of each B-frame image data, and obtaining the reconstruction result of each B-frame image data, where the reconstruction results of all B-frame image data form the reconstruction result of the B-frame.
Specifically, the corresponding B-frame image data is set based on the frame ordering to be set of each B-frame image data, respectively, to obtain the reconstruction result of each B-frame image data.
Further, the process of setting the B-frame image data based on the ordering of the frames to be set of the single B-frame image data and obtaining the reconstruction result of the B-frame image data includes: taking the first type detected frame as a frame to be set according to the frame to be set; then starting a sub-reconstruction result construction process corresponding to the frame to be set; setting a frame to be set and internal pixel points thereof as a first color and setting other pixel points outside the frame to be set as a second color in image detection results of all reference frames with frames to be set in the B frame image data, and acquiring an image detection result of the set reference frame of the B frame image data of the frame to be set, namely acquiring an image detection result of the reference frame after partial setting of the B frame image data; and then acquiring a sub-reconstruction result in the B frame image data according to the image detection result of the set reference frame corresponding to the B frame image data of the frame to be set, the motion vector of the B frame and the reconstruction result of the acquired B frame image data.
It should be noted that there may be one or more reference frames containing the frame to be set. If only one reference frame contains the current frame to be set, then in the image detection result of that reference frame, the frame to be set and its internal pixel points are set to a color of the first type and the other pixel points outside the frame to be set are set to a color of the second type; if several reference frames contain the current frame to be set, the same setting is applied in the image detection results of all of them. The colors of the first type differ from the colors of the second type; the first type comprises a plurality of colors, and detected frames of different classes, together with their internal pixel points, in all reference frame image data corresponding to each B-frame image data are set to different colors. That is, supposing that the detection results of all reference frames corresponding to a B-frame image data contain N classes of detected frames, the N classes of detected frames and their internal pixel points are set to N distinct colors, and the other pixel points outside all detected frames are set to an (N+1)-th color; the first N colors are colors of the first type, and the (N+1)-th color is a color of the second type.
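A NumPy sketch of this color-setting step follows; it assumes boxes given as (x, y, w, h) and colors represented as small integer indices (N class colors plus one background color), which is only one possible encoding.

```python
import numpy as np

def color_detection_result(shape, boxes_by_class, class_colors, background_color):
    """Paint a reference-frame detection result.

    Pixels inside every detected frame get the color of that frame's class
    (first-type colors); all remaining pixels get the background color
    (second-type color).
    """
    canvas = np.full(shape, background_color, dtype=np.uint8)
    for label, boxes in boxes_by_class.items():
        for x, y, w, h in boxes:
            canvas[y:y + h, x:x + w] = class_colors[label]
    return canvas

# Toy example with N = 2 classes: colors 1..N for the classes, N + 1 for background.
boxes = {"horse": [(4, 2, 6, 5)], "person": [(0, 0, 3, 3)]}
mask = color_detection_result((10, 12), boxes, {"horse": 1, "person": 2}, background_color=3)
print(mask[3, 5], mask[1, 1], mask[9, 11])  # 1 (horse), 2 (person), 3 (background)
```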
After the sub-reconstruction result construction process corresponding to the current frame to be set is completed, whether the current frame to be set is the last detected frame in the frame ordering to be set is judged, if yes, all sub-reconstruction results of the B frame image data are merged to obtain the reconstruction result of the B frame image data, otherwise, the next detected frame is used as a new frame to be set according to the frame ordering to be set, and the sub-reconstruction result construction process corresponding to the frame to be set is repeated to obtain a sub-reconstruction result of the B frame image data corresponding to the new frame to be set. And analogizing in turn, and finally obtaining the reconstruction result of the B frame image data.
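The merging of the sub-reconstruction results into one B-frame reconstruction could be sketched as overlaying every non-background pixel of each sub-result onto the merged frame, assuming each sub-result is a color-index map with a known background color; this is an illustrative assumption, not the claimed merging rule.

```python
import numpy as np

def merge_sub_reconstructions(sub_results, background_color):
    """Merge per-detected-frame-class sub-reconstruction results.

    Each sub-result is a color-index map produced for one class of detected
    frame; the non-background pixels of each sub-result are overlaid onto the
    merged reconstruction of the B-frame image data.
    """
    merged = np.full_like(sub_results[0], background_color)
    for sub in sub_results:
        foreground = sub != background_color
        merged[foreground] = sub[foreground]
    return merged

# Toy example: two 4x4 sub-results with background color 0.
a = np.zeros((4, 4), dtype=np.uint8); a[0:2, 0:2] = 1   # class-1 region
b = np.zeros((4, 4), dtype=np.uint8); b[2:4, 2:4] = 2   # class-2 region
print(merge_sub_reconstructions([a, b], background_color=0))
```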
And repeating the reconstruction result acquisition process of the single B frame image data to acquire the reconstruction result of each B frame image data in the B type frame image data.
It should be further noted that, step S204 and step S205 may also be performed synchronously, that is, after a to-be-set frame sequence of B frame image data is acquired in step S204, the acquired to-be-set frame sequence of B frame image data directly enters the operation of step S205, that is, a reconstruction result of the B frame image data is acquired according to the to-be-set frame sequence of the acquired B frame image data.
And step S206, inputting the reconstruction result of the B-type frame, the set I-type frame image detection result and the set P-type frame image detection result into a second preset neural network according to a preset input mode to obtain the image detection result of the B-type frame.
In this step, the set I-type frame image detection result takes the place of the I-type frame image segmentation result in step S105 of the first embodiment, and the set P-type frame image detection result takes the place of the P-type frame image segmentation result. Apart from this, the step is the same as step S105 of the first embodiment and may be implemented with reference to it; its details are not repeated here.
The set I-type frame image detection result is composed of the image detection results of all set I-frame image data, and the set P-type frame image detection result is composed of the image detection results of all set P-frame image data. The image detection result of a set I-frame or P-frame image data is obtained by setting all detected frames and their internal pixel points in the image detection result of that frame to colors of the first type, and setting the other pixel points outside all detected frames to a color of the second type. The set I-type and P-type frame image detection results can be taken from the set reference frame image detection results obtained while acquiring the reconstruction result of each B-frame image data in step S205. The colors of the first type differ from the colors of the second type; the first type comprises a plurality of colors, and detected frames of different classes, together with their internal pixel points, in all reference frame image detection results corresponding to each B-frame image data are set to different colors, where the reference frame image detection results corresponding to a B-frame image data may include image detection results of B-frame, I-frame and P-frame image data. That is, supposing that the detection results of all reference frames corresponding to a B-frame image data contain N classes of detected frames, the N classes of detected frames and their internal pixel points are set to N distinct colors, and the other pixel points outside all detected frames are set to an (N+1)-th color; the first N colors are colors of the first type, and the (N+1)-th color is a color of the second type. This setting ensures that, in each set I-frame or P-frame image detection result, detected frames of different classes (together with their internal pixel points) have colors different from each other and from the region outside all detected frames.
The video real-time identification and detection method provided by the embodiment of the invention is mainly directed to videos whose coding standard classifies frames into I frame image data, B frame image data and P frame image data, provides a motion vector table, and divides each frame of image data into a plurality of small segmented blocks in a manner preset by the decoder. The method divides a target video into I-type frame image data and P-type frame image data through a video decoder, obtains the I-type frame image detection result and the P-type frame image detection result through the processing of a neural network, and then obtains the image detection result of the B-type frame on the basis of the I-type and P-type frame image detection results, so as to obtain the detection result of the target video. By closely coupling the video decoder and the neural networks, the method achieves higher performance while maintaining accuracy, and solves the problem that existing processing methods for video identification tasks cannot reduce the amount of computation and the energy consumption while guaranteeing high precision.
Example Three
In order to solve the technical problems in the prior art, the embodiment of the invention provides a video real-time identification, segmentation and detection device, which comprises a decoding module, a motion vector acquisition module, an I/P frame image segmentation or detection result acquisition module, a to-be-set frame ordering module, a reconstruction result acquisition module and a B-type frame image segmentation/detection result acquisition module;
the decoding module is used for decoding the target video through a preset video decoder to obtain I-type frame image data, P-type frame image data and a motion vector table of the target video;
the motion vector acquisition module is used for acquiring a motion vector of the B-type frame based on the motion vector table;
the I/P frame image segmentation or detection result acquisition module is used for inputting the I frame image data and the P frame image data into a first preset neural network to obtain an I frame image segmentation result and a P frame image segmentation result, and inputting the I frame image data and the P frame image data into a third preset neural network to obtain an I frame image detection result and a P frame image detection result;
the to-be-set frame ordering module is used for acquiring the detected frames in all reference frame image detection results corresponding to each B frame image data in the B-type frame image data, and for sorting the detected frames in all reference frame image detection results corresponding to each B frame image data to obtain the to-be-set frame ordering of each B frame image data;
the reconstruction result acquisition module is used for acquiring the reconstruction result of the B-type frame according to the I-type frame image segmentation result, the P-type frame image segmentation result, the motion vector of the B-type frame and the acquired reconstruction results of B frame image data, and is also used for setting the corresponding B frame image data based on the to-be-set frame ordering of each B frame image data to acquire the reconstruction result of each B frame image data, the reconstruction results of all B frame image data forming the reconstruction result of the B-type frame;
the B-class frame image segmentation/detection result acquisition module is used for inputting a reconstruction result of a B-class frame, an I-class frame image segmentation result and a P-class frame image segmentation result into a second preset neural network according to a preset input mode to obtain an image segmentation result of the B-class frame, and is used for inputting the reconstruction result of the B-class frame, a set I-class frame image detection result and a set P-class frame image detection result into the second preset neural network according to the preset input mode to obtain an image detection result of the B-class frame;
the video coding and decoding standard of the target video is a classification with I frame image data, B frame image data and P frame image data, a motion vector table is provided, and each frame of image data is divided into a plurality of small divided blocks according to a preset mode of a decoder.
The video real-time identification, segmentation and detection device provided by the embodiment of the invention is mainly directed to videos whose coding standard classifies frames into I frame image data, B frame image data and P frame image data, provides a motion vector table, and divides each frame of image data into a plurality of small segmented blocks in a manner preset by the decoder. The device divides a target video into I-type frame image data and P-type frame image data through a video decoder, obtains the I-type frame image segmentation/detection result and the P-type frame image segmentation/detection result through the processing of a neural network, and then obtains the image segmentation/detection result of the B-type frame on the basis of the I-type and P-type frame image segmentation/detection results, thereby obtaining the segmentation and detection results of the target video. By closely coupling the video decoder and the neural networks, the device achieves higher performance while maintaining accuracy, and solves the problem that existing processing methods for video identification tasks cannot reduce the amount of computation and the energy consumption while guaranteeing high precision.
Example Four
To solve the technical problems in the prior art, an embodiment of the present invention provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the video real-time identification and segmentation method in the first embodiment.
The specific steps of the video real-time identification and segmentation method and the beneficial effects obtained by applying the readable storage medium provided by the embodiment of the invention are the same as those of the first embodiment, and are not described herein again.
It should be noted that the storage medium includes various media capable of storing program codes, such as a ROM, a RAM, a magnetic disk or an optical disk.
Example Five
To solve the technical problems in the prior art, an embodiment of the present invention provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the video real-time detection method in the second embodiment.
The specific steps of the video real-time detection method and the beneficial effects obtained by applying the readable storage medium provided by the embodiment of the invention are the same as those of the second embodiment, and are not described herein again.
It should be noted that the storage medium includes various media capable of storing program codes, such as a ROM, a RAM, a magnetic disk or an optical disk.
Example Six
In order to solve the technical problems in the prior art, the embodiment of the invention also provides a terminal.
Fig. 6 is a schematic structural diagram of a terminal according to the sixth embodiment of the present invention. Referring to Fig. 6, the terminal of this embodiment comprises a first memory, a second memory, a first processor and a second processor that are connected to one another; the first memory and the second memory are in communication connection with the first processor and the second processor respectively and are used for storing computer programs. The first processor is used for executing the computer program stored in the first memory, so that when the program is executed the terminal implements all steps of the video real-time identification and segmentation method of the first embodiment; the second processor is used for executing the computer program stored in the second memory, so that when the program is executed the terminal implements all steps of the video real-time detection method of the second embodiment.
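A rough sketch of how such a terminal could dispatch the two methods onto two workers, assuming the segmentation and detection methods of the first and second embodiments are available as Python callables (the process pool below merely stands in for the first and second processors; nothing in this snippet is prescribed by the patent):

```python
from concurrent.futures import ProcessPoolExecutor

def run_on_terminal(video, segmentation_method, detection_method):
    """Run the identification/segmentation method and the detection method
    concurrently, mirroring the two-processor arrangement described above."""
    with ProcessPoolExecutor(max_workers=2) as pool:
        seg_future = pool.submit(segmentation_method, video)   # "first processor"
        det_future = pool.submit(detection_method, video)      # "second processor"
        return seg_future.result(), det_future.result()
```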
The specific steps of the video real-time identification and segmentation method and the beneficial effects obtained by the terminal applying the embodiment of the invention are the same as those of the first embodiment, and are not described herein again.
The specific steps of the video real-time detection method and the beneficial effects obtained by the terminal applying the embodiment of the invention are the same as those of the second embodiment, and are not described herein again.
It should be noted that the memory may include a Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. Similarly, the processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A real-time video identification and segmentation method comprises the following steps:
decoding a target video through a preset video decoder to obtain I-type frame image data, P-type frame image data and a motion vector table of the target video;
acquiring a motion vector of the B-type frame based on the motion vector table;
inputting the I-type frame image data and the P-type frame image data into a first preset neural network to obtain an I-type frame image segmentation result and a P-type frame image segmentation result;
acquiring a reconstruction result of the B-type frame according to the I-type frame image segmentation result, the P-type frame image segmentation result, the motion vector of the B-type frame and the acquired reconstruction result of the B-type frame image data;
inputting the reconstruction result of the B-type frame, the image segmentation result of the I-type frame and the image segmentation result of the P-type frame into a second preset neural network according to a preset input mode to obtain the image segmentation result of the B-type frame;
the video coding and decoding standard of the target video is classification of I-frame image data, B-frame image data and P-frame image data, a motion vector table is provided, and each frame of image data is divided into a plurality of small divided blocks according to a preset mode.
2. The method of claim 1, wherein the step of obtaining the reconstruction result of the B-class frame according to the segmentation result of the I-class frame image, the segmentation result of the P-class frame image, the motion vector of the B-class frame, and the reconstruction result of the obtained B-frame image data comprises:
sequentially acquiring, in decoding order, the reconstruction results of the B frame image data that have not yet been acquired in the B-type frame image data on the basis of the I-type frame image segmentation result, the P-type frame image segmentation result, the motion vector of the B-type frame and the reconstruction results of the already acquired B frame image data; when the reconstruction result of a not-yet-acquired B frame image data is acquired, the reconstruction results of all B frame image data acquired so far serve as the acquired reconstruction results of B frame image data;
wherein acquiring the reconstruction result of a single B frame image data in the B-type frame image data based on the I-type frame image segmentation result, the P-type frame image segmentation result, the motion vector of the B-type frame and the acquired reconstruction results of B frame image data comprises the following steps:
sequentially acquiring the reconstruction result of each divided small block in the single B frame image data based on the I-type frame image segmentation result, the P-type frame image segmentation result, the motion vector of the B-type frame and the acquired reconstruction results of B frame image data;
and collecting the reconstruction results of all the divided small blocks in the B frame image data to obtain the reconstruction result of that B frame image data in the B-type frame image data.
3. The method of claim 2, wherein obtaining the reconstruction result of a single divided small block in a single B frame image data in the B-type frame image data based on the I-type frame image segmentation result, the P-type frame image segmentation result, the motion vector of the B-type frame, and the acquired reconstruction results of B frame image data comprises:
acquiring all motion vectors corresponding to the single divided small block based on the motion vector of the B-type frame;
acquiring all reference divided small blocks of the divided small block based on all the motion vectors of the divided small block;
and averaging the data of all the reference divided small blocks of the divided small block pixel by pixel using an averaging filter to obtain the reconstruction result of the divided small block.
4. The method according to claim 3, wherein the step of inputting the B-class frame image reconstruction result, the I-class frame image segmentation result and the P-class frame image segmentation result into a second preset neural network according to a preset input mode to obtain the B-class frame image segmentation result comprises:
respectively taking, as one input group, the reconstruction result of a single frame image data in the reconstruction result of the B-type frame, the image segmentation result of the first preceding I frame image data or P frame image data corresponding to that frame in the video playing order, and the image segmentation result of the first subsequent I frame image data or P frame image data;
sequentially inputting all input groups into the second preset neural network to respectively obtain image segmentation results of B frame image data corresponding to each input group;
the image segmentation results of all the B frame image data constitute the image segmentation results of the B-class frame.
5. The method of claim 4, wherein the first preset neural network is a large deep neural network for video segmentation; the second preset neural network is a convolutional neural network comprising three layers, namely a convolutional layer, a pooling layer and an activation layer, wherein the first layer has three input image channels.
6. A video real-time detection method comprises the following steps:
decoding a target video through a preset video decoder to obtain I-type frame image data, P-type frame image data and a motion vector table of the target video;
acquiring a motion vector of the B-type frame based on a motion vector table;
inputting the I-type frame image data and the P-type frame image data into a third preset neural network to obtain an I-type frame image detection result and a P-type frame image detection result;
respectively acquiring detected frames in all reference frame image detection results corresponding to each B frame image data in the B-type frame image data, and respectively sorting the detected frames in all reference frame image detection results corresponding to each B frame image data to obtain a frame sequence to be set of each B frame image data;
setting the corresponding B frame image data based on the frame sequence to be set of each B frame image data to obtain the reconstruction result of each B frame image data, the reconstruction results of all B frame image data forming the reconstruction result of the B-type frame;
inputting the reconstruction result of the B-type frame, the set I-type frame image detection result and the set P-type frame image detection result into a second preset neural network according to a preset input mode to obtain the image detection result of the B-type frame;
the video coding and decoding standard of the target video is classification of I-frame image data, B-frame image data and P-frame image data, a motion vector table is provided, and each frame of image data is divided into a plurality of small divided blocks according to a preset mode of an encoder;
the set I-frame image detection result is composed of the image detection results of all set I-frame image data, and the set P-frame image detection result is composed of the image detection results of all set P-frame image data;
the image detection result of the set I or P frame image data is that all detected frames and internal pixel points thereof in the image detection result of the set frame image data are set to be of a first type of color, and other pixel points outside all detected frames in the image detection result of the set frame image data are set to be of a second type of color; the first type of color comprises a plurality of colors, and the colors set by the different types of detected frames and the internal pixel points in the detection results of all the reference frame images corresponding to each B frame image data are different.
7. The method according to claim 6, wherein the step of obtaining the detected frames in all the detection results of the reference frame images corresponding to each B frame image data in the B-type frame image data respectively, and the step of sorting the detected frames in all the detection results of the reference frame images corresponding to each B frame image data respectively, and the step of obtaining the frames to be set for each B frame image data comprises:
taking the first B frame image data as target B frame image data in decoding order;
acquiring a reference frame of the target B frame image data based on the motion vector of the B frame, classifying all detected frames in all reference frames corresponding to the target B frame image data according to a preset classification requirement, and sequencing the classified detected frames according to a preset classification sequence to obtain a to-be-set frame sequence of the target B frame image data;
and judging whether the current target B frame image data is the last B frame image data in the decoding sequence, if so, finishing the sequencing, otherwise, taking the next B frame image data as new target B frame image data according to the decoding sequence, and acquiring the sequencing of frames to be set of the new target B frame image data.
8. The method of claim 7, wherein the setting of the B-frame image data based on the frame ordering to be set of the single B-frame image data, and the obtaining of the reconstruction result of the B-frame image data comprises:
taking the first type of detected frame as the frame to be set according to the frame ordering to be set;
setting the frame to be set and the internal pixel points thereof as a first color and setting other pixel points outside the frame to be set as a second color in the image detection results of all the reference frames with the frame to be set of the B frame image data, and acquiring the image detection result of the set reference frame of the B frame image data aiming at the frame to be set;
acquiring a sub-reconstruction result in the B frame image data according to an image detection result of a set reference frame corresponding to the B frame image data of the frame to be set, a motion vector of the B frame and a reconstruction result of the acquired B frame image data;
and judging whether the current frame to be set is the last detected frame in the frame sequence to be set, if so, merging all sub-reconstruction results of the B frame image data to obtain the reconstruction result of the B frame image data, otherwise, taking the next detected frame as a new frame to be set according to the frame sequence to be set, and obtaining a sub-reconstruction result of the B frame image data corresponding to the new frame to be set.
9. The method of claim 6, wherein the third preset neural network is a large deep neural network for video detection; the second preset neural network is a convolutional neural network comprising three layers, namely a convolutional layer, a pooling layer and an activation layer, wherein the first layer has three input image channels.
10. A video real-time identification, segmentation and detection device is characterized by comprising a decoding module, a motion vector acquisition module, an I/P frame image segmentation or detection result acquisition module, a frame to be set ordering module, a reconstruction result acquisition module and a B frame image segmentation/detection result acquisition module;
the decoding module is used for decoding a target video through a preset video decoder to obtain I-type frame image data, P-type frame image data and a motion vector table of the target video;
the motion vector acquisition module is used for acquiring the motion vector of the B-type frame based on the motion vector table;
the I/P frame image segmentation or detection result acquisition module is used for inputting the I frame image data and the P frame image data into a first preset neural network to obtain an I frame image segmentation result and a P frame image segmentation result, and inputting the I frame image data and the P frame image data into a third preset neural network to obtain an I frame image detection result and a P frame image detection result;
the frame to be set ordering module is used for respectively acquiring detected frames in all reference frame image detection results corresponding to each B frame image data in the B-type frame image data, and respectively sorting the detected frames in all reference frame image detection results corresponding to each B frame image data to obtain the frame ordering to be set of each B frame image data;
the reconstruction result acquisition module is used for acquiring the reconstruction result of the B-type frame according to the I-type frame image segmentation result, the P-type frame image segmentation result, the motion vector of the B-type frame and the acquired reconstruction result of the B-type frame image data, and is used for setting the corresponding B-type frame image data respectively based on the frame sequence to be set of each B-type frame image data to acquire the reconstruction result of each B-type frame image data, and the reconstruction results of all B-type frame image data form the reconstruction result of the B-type frame;
the B-type frame image segmentation/detection result acquisition module is used for inputting a reconstruction result of the B-type frame, an I-type frame image segmentation result and a P-type frame image segmentation result into a second preset neural network according to a preset input mode to obtain an image segmentation result of the B-type frame, and is used for inputting the reconstruction result of the B-type frame, a set I-type frame image detection result and a set P-type frame image detection result into the second preset neural network according to the preset input mode to obtain an image detection result of the B-type frame;
the video coding and decoding standard of the target video is classification of I-frame image data, B-frame image data and P-frame image data, a motion vector table is provided, and each frame of image data is divided into a plurality of small divided blocks according to a preset mode of an encoder.
CN202010946166.8A 2020-09-10 2020-09-10 Video real-time identification segmentation and detection method and device Active CN112084949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010946166.8A CN112084949B (en) 2020-09-10 2020-09-10 Video real-time identification segmentation and detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010946166.8A CN112084949B (en) 2020-09-10 2020-09-10 Video real-time identification segmentation and detection method and device

Publications (2)

Publication Number Publication Date
CN112084949A true CN112084949A (en) 2020-12-15
CN112084949B CN112084949B (en) 2022-07-19

Family

ID=73732281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010946166.8A Active CN112084949B (en) 2020-09-10 2020-09-10 Video real-time identification segmentation and detection method and device

Country Status (1)

Country Link
CN (1) CN112084949B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108632625A (en) * 2017-03-21 2018-10-09 华为技术有限公司 A kind of method for video coding, video encoding/decoding method and relevant device
CN108289248A (en) * 2018-01-18 2018-07-17 福州瑞芯微电子股份有限公司 A kind of deep learning video encoding/decoding method and device based on content forecast
CN108495129A (en) * 2018-03-22 2018-09-04 北京航空航天大学 The complexity optimized method and device of block partition encoding based on deep learning method
CN109272509A (en) * 2018-09-06 2019-01-25 郑州云海信息技术有限公司 A kind of object detection method of consecutive image, device, equipment and storage medium
CN110796662A (en) * 2019-09-11 2020-02-14 浙江大学 Real-time semantic video segmentation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAIPING WU, ET AL.: "Sequence Level Semantics Aggregation for Video Object Detection", 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 31 December 2019 (2019-12-31) *
纪巍 (JI WEI): "面向视频压缩域的实时目标识别技术研究 (Research on Real-Time Object Recognition Technology for the Video Compression Domain)", China Excellent Master's Theses Full-text Database (Information Science and Technology), 15 August 2019 (2019-08-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255564A (en) * 2021-06-11 2021-08-13 上海交通大学 Real-time video recognition accelerator architecture based on key object splicing
CN113255564B (en) * 2021-06-11 2022-05-06 上海交通大学 Real-time video identification accelerator based on key object splicing

Also Published As

Publication number Publication date
CN112084949B (en) 2022-07-19

Similar Documents

Publication Publication Date Title
CN110087087B (en) VVC inter-frame coding unit prediction mode early decision and block division early termination method
CN111104903B (en) Depth perception traffic scene multi-target detection method and system
CN111985456B (en) Video real-time identification, segmentation and detection architecture
CN109446967B (en) Face detection method and system based on compressed information
CN114286093A (en) Rapid video coding method based on deep neural network
Zhang et al. Fast CU decision-making algorithm based on DenseNet network for VVC
CN112084949B (en) Video real-time identification segmentation and detection method and device
CN114157863B (en) Video coding method, system and storage medium based on digital retina
CN113435370B (en) Method and device for acquiring vehicle queuing length based on image feature fusion
CN112087624A (en) Coding management method based on high-efficiency video coding
CN113378717B (en) Video identification method based on key object splicing, device storage medium and terminal
CN110753228A (en) Garage monitoring video compression method and system based on Yolov1 target detection algorithm
Wang et al. A two-stage H.264 based video compression method for automotive cameras
CN115190300A (en) Method for predicting attribute information, encoder, decoder, and storage medium
CN116033153A (en) Method and system for rapidly dividing coding units under VVC standard
CN112954350A (en) Video post-processing optimization method and device based on frame classification
CN113902000A (en) Model training, synthetic frame generation, video recognition method and device and medium
CN112437308A (en) WebP coding method and device
CN112929662B (en) Coding method for solving object overlapping problem in code stream structured image coding method
CN113255564B (en) Real-time video identification accelerator based on key object splicing
Luo et al. Super-High-Fidelity Image Compression via Hierarchical-ROI and Adaptive Quantization
CN115529459B (en) Center point searching method, center point searching device, computer equipment and storage medium
CN113556551B (en) Encoding and decoding method, device and equipment
Pattimi et al. Deep Learning Based Low-Complexity and High-Speed Coding Unit Partition Decision for High Efficiency Video Coding
CN113438483B (en) Crowdsourcing video coding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant