CN110991365B - Video motion information acquisition method, system and electronic equipment - Google Patents
- Publication number
- CN110991365B (application CN201911249221.1A)
- Authority
- CN
- China
- Prior art keywords
- image
- loss function
- network model
- motion information
- video
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The application relates to a video motion information acquisition method, a video motion information acquisition system and electronic equipment, comprising: constructing a network model for video motion information by a method of predicting surrounding frames from the current frame; and inputting a single-frame image into the constructed network model, the network model outputting, according to a loss function, the surrounding frame images corresponding to the single-frame image; wherein each individual loss function is the per-pixel mean squared error of the corresponding image, and the total loss function L is: L = L_T + α_1(L_{T+1} + L_{T−1}) + α_2(L_{T+2} + L_{T−2}) + α_3(L_{T+3} + L_{T−3}) + … + α_n(L_{T+n} + L_{T−n}). By obtaining the motion information of the video through predicting surrounding frames from the current frame, the method helps the network understand moving objects in the video; it is simple to implement, requires no change to the original network structure, does not increase the number of parameters, the computation of the network, or the amount of extra storage, and is therefore faster and cheaper to compute.
Description
Technical Field
The application belongs to the technical field of video task processing, and particularly relates to a video motion information acquisition method, a video motion information acquisition system and electronic equipment.
Background
In video tasks, the network model needs reliable motion features that reflect the dynamic changes occurring in the video and help the model predict more accurately. The motion information in a video includes both camera motion and the motion of objects in the scene, and is further affected by illumination changes and similar factors; the actual environment between consecutive frames is complex and changeable, so acquiring video motion information is often a difficulty in video processing tasks. Existing processing methods include the following:
1. three-dimensional convolution
Convolutional neural networks have been widely used in computer vision in recent years. For video analysis, two-dimensional convolution cannot capture temporal information, so three-dimensional convolution is adopted, adding a time dimension. The input of a three-dimensional convolution is [time length, feature map length, feature map width, number of channels], and the convolution kernel has size [time span, kernel length, kernel width, number of channels]; the convolution is equivalent to sliding over three dimensions (time, feature map length, feature map width), producing new feature maps at different time scales. Because three-dimensional convolution has one more time dimension than two-dimensional convolution, the number of parameters grows by a factor of the time span, so it suffers from a large parameter count, a large amount of computation, and a tendency to overfit. A minimal sketch of such a layer is given below.
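The following is an illustrative sketch, not taken from the patent, of a single 3D convolution layer; it assumes PyTorch, whose Conv3d expects the (batch, channels, time, height, width) layout rather than the ordering listed above.

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)               # 16 RGB frames of 112x112 pixels
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=1)  # slides over time, height and width
features = conv3d(clip)                               # -> torch.Size([1, 64, 16, 112, 112])

# A 2D 3x3 kernel mapping 3->64 channels has 3*64*3*3 = 1,728 weights; the 3x3x3
# kernel has 3*64*3*3*3 = 5,184, i.e. the parameter count grows by the time span.
print(features.shape)
```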
2. Optical flow method
The optical flow method is a method for calculating the motion information of an object between adjacent frames by using the change of pixels in an image sequence in a time domain and the correlation between the adjacent frames to find the corresponding relation between the previous frame and the current frame. Typically, optical flow is due to movement of the foreground objects themselves in the scene, movement of the camera, or a combination of both.
The motion speed and direction of every pixel in an image form an optical flow field. The purpose of studying the optical flow field is to approximate the motion field that cannot be obtained directly from the picture sequence, where the motion field is the motion of objects in the three-dimensional real world. If point A in the picture of frame T is at position (x_1, y_1) and the corresponding position of point A in frame T+1 is (x_2, y_2), then the motion of point A is: (u_x, v_y) = (x_2, y_2) − (x_1, y_1). There is a wide variety of optical flow methods for calculating the position of A in frame T+1, including gradient-based, matching-based, energy-based and phase-based methods. Although end-to-end learned representations have been successful, hand-crafted optical flow features are still widely used in video analysis tasks. However, optical flow extraction is expensive in both space and time: the extracted optical flow must be written to disk for training and testing, which incurs significant storage costs, and the optical flow calculation is not necessarily accurate. A sketch of a classical dense optical flow computation is given below.
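As an illustrative sketch, not part of the patent, the following computes dense optical flow between two adjacent frames with OpenCV's Farneback method; the file names are placeholders.

```python
import cv2

prev_gray = cv2.cvtColor(cv2.imread("frame_T.png"), cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(cv2.imread("frame_T_plus_1.png"), cv2.COLOR_BGR2GRAY)

# Positional arguments: pyr_scale=0.5, levels=3, winsize=15, iterations=3,
# poly_n=5, poly_sigma=1.2, flags=0.
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# flow[y, x] = (u_x, v_y), the per-pixel displacement from frame T to frame T+1;
# for a whole video these arrays are typically written to disk, which is the
# storage cost mentioned above.
print(flow.shape)  # (H, W, 2)
```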
Disclosure of Invention
The application provides a video motion information acquisition method, a video motion information acquisition system and electronic equipment, and aims to at least solve one of the technical problems in the prior art to a certain extent.
In order to solve the above problems, the present application provides the following technical solutions:
a video motion information acquisition method comprises the following steps:
step a: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
step b: inputting a single-frame image into the constructed network model, and outputting, by the network model according to a loss function, the surrounding frame images corresponding to the single-frame image; wherein each individual loss function is the per-pixel mean squared error of the corresponding image:

L_t = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²

in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;

the total loss function L is:

L = L_T + α_1(L_{T+1} + L_{T−1}) + α_2(L_{T+2} + L_{T−2}) + α_3(L_{T+3} + L_{T−3}) + … + α_n(L_{T+n} + L_{T−n})

in the above formula, L_T is the loss function of the current frame, α_1(L_{T+1} + L_{T−1}), α_2(L_{T+2} + L_{T−2}), α_3(L_{T+3} + L_{T−3}), …, α_n(L_{T+n} + L_{T−n}) are the loss terms of the surrounding frames, and α_n (n = 1, 2, 3, …) is the coefficient in front of the loss function of the corresponding preceding and following frames.
The technical scheme adopted by the embodiment of the application further comprises: in the step a, the network structure of the network model comprises a target identification, image classification, super resolution and image segmentation network structure.
The technical scheme adopted by the embodiment of the application further comprises: in the step b, the inputting a single frame image to the network model, and the outputting, by the network model, a surrounding frame image corresponding to the single frame image according to the loss function specifically includes: and (3) inputting a gray scale image corresponding to the video at the time T into the network model, and outputting colored color images corresponding to the time T, the time T+n (n=1, 2 and 3 …) and the time T-n (n=1, 2 and 3 …) according to a loss function by the network model.
The technical scheme adopted by the embodiment of the application further comprises: in the step b, the inputting the single frame image to the network model, and the outputting, by the network model, the surrounding frame image corresponding to the single frame image according to the loss function further includes: and inputting an image corresponding to the video at the time T into the network model, and outputting an object classification result of the image at the time T, the image corresponding to the time T+n (n=1, 2, 3 …) and the time T-n (n=1, 2, 3 …) by the network model according to a loss function.
The technical scheme adopted by the embodiment of the application further comprises: the loss function of the image at the moment T is a multi-classification cross entropy loss function, and is defined as follows:
in the above formula, M represents the number of categories, y c Is 0 or 1, y if the class is the same as the class of the sample c Has a value of 1, p c Representing the probability that the predicted sample belongs to category c.
The embodiment of the application adopts another technical scheme that: a video motion information acquisition system comprising:
model construction module: a network model is built for video motion information by a method for predicting surrounding frames through a current frame;
The image processing module is used for inputting a single-frame image into the constructed network model, the network model outputting, according to a loss function, the surrounding frame images corresponding to the single-frame image; wherein each individual loss function is the per-pixel mean squared error of the corresponding image:

L_t = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²

in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;

the total loss function L is:

L = L_T + α_1(L_{T+1} + L_{T−1}) + α_2(L_{T+2} + L_{T−2}) + α_3(L_{T+3} + L_{T−3}) + … + α_n(L_{T+n} + L_{T−n})

in the above formula, L_T is the loss function of the current frame, α_1(L_{T+1} + L_{T−1}), α_2(L_{T+2} + L_{T−2}), α_3(L_{T+3} + L_{T−3}), …, α_n(L_{T+n} + L_{T−n}) are the loss terms of the surrounding frames, and α_n (n = 1, 2, 3, …) is the coefficient in front of the loss function of the corresponding preceding and following frames.
The technical scheme adopted by the embodiment of the application further comprises: the network structure of the network model comprises a target identification, image classification, super resolution and image segmentation network structure.
The technical scheme adopted by the embodiment of the application further comprises: the image processing module includes:
An image conversion unit: used for inputting the gray-scale image corresponding to the video at time T into the network model, the network model outputting color images corresponding to time T, time T+n (n = 1, 2, 3, …) and time T−n (n = 1, 2, 3, …) according to the loss function.
The technical scheme adopted by the embodiment of the application further comprises: the image processing module further includes:

An image classification unit: used for inputting the image corresponding to the video at time T into the network model, the network model outputting, according to a loss function, the object classification result of the image at time T and the images corresponding to time T+n (n = 1, 2, 3, …) and time T−n (n = 1, 2, 3, …).
The technical scheme adopted by the embodiment of the application further comprises: the loss function of the image at time T is a multi-class cross entropy loss function, defined as follows:

L_T = − Σ_{c=1}^{M} y_c log(p_c)

in the above formula, M represents the number of categories, y_c is 0 or 1 and takes the value 1 if category c is the same as the category of the sample, and p_c represents the predicted probability that the sample belongs to category c.
The embodiment of the application adopts the following technical scheme: an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the following operations of the video motion information acquisition method described above:
step a: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
step b: inputting a single-frame image into the constructed network model, and outputting, by the network model according to a loss function, the surrounding frame images corresponding to the single-frame image; wherein each individual loss function is the per-pixel mean squared error of the corresponding image:

L_t = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²

in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;

the total loss function L is:

L = L_T + α_1(L_{T+1} + L_{T−1}) + α_2(L_{T+2} + L_{T−2}) + α_3(L_{T+3} + L_{T−3}) + … + α_n(L_{T+n} + L_{T−n})

in the above formula, L_T is the loss function of the current frame, α_1(L_{T+1} + L_{T−1}), α_2(L_{T+2} + L_{T−2}), α_3(L_{T+3} + L_{T−3}), …, α_n(L_{T+n} + L_{T−n}) are the loss terms of the surrounding frames, and α_n (n = 1, 2, 3, …) is the coefficient in front of the loss function of the corresponding preceding and following frames.
Compared with the prior art, the beneficial effects produced by the embodiments of the present application are as follows: the video motion information acquisition method, system and electronic equipment obtain the motion information of a video by predicting surrounding frames from the current frame; a single-frame image is input into the network model, and the network model outputs multiple processed frames or classification results, which helps the network understand moving objects in the video. The method is simple to implement, requires no change to the original network structure, does not increase the number of parameters, the computation of the network, or the amount of extra storage, and is therefore faster and cheaper to compute.
Drawings
Fig. 1 is a flowchart of a video motion information acquisition method according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a network model image processing procedure according to a first embodiment of the present application;
fig. 3 is a flowchart of a video motion information acquisition method according to a second embodiment of the present application;
FIG. 4 is a schematic diagram of a network model image processing procedure according to a second embodiment of the present application;
fig. 5 is a schematic structural diagram of a video motion information acquisition system according to an embodiment of the present application;
fig. 6 is a schematic diagram of a hardware device of a video motion information obtaining method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In order to overcome the defects in the prior art, the present application constructs a network model for video motion information by predicting surrounding frames from the current frame. In actual application, a single-frame image is input into the network model, and the network model outputs the surrounding frame images corresponding to the single-frame image, or an image object classification result, so that the moving objects in the video are obtained. The method and the device can be applied to various video image processing tasks such as image conversion, video classification and video recognition; to describe the technical scheme of the present application more clearly, image conversion and video classification are taken as examples for specific description.
Referring to fig. 1, a flowchart of a video motion information acquisition method according to a first embodiment of the present application is shown. The embodiment needs to convert black-and-white video into color video, which specifically includes:
step 100: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
in step 100, the network structure of the network model may be designed with reference to the network structure of image classification, super resolution, image segmentation, etc.
Step 110: inputting the gray-scale image corresponding to the video at time T into the network model, and outputting, by the network model according to the loss function, color images corresponding to time T, time T+n (n = 1, 2, 3, …) and time T−n (n = 1, 2, 3, …);
In step 110, the number of predicted surrounding frames may be set according to the task characteristics, and the weights of the corresponding loss functions of the surrounding frames can be set manually or learned by the network. In order to predict the color images of the surrounding frames more accurately, the network model continuously learns the motion process of objects as the loss function is updated, encoding the motion information in the model weights; by predicting the context frames corresponding to time T, the model can learn contextual information in space and time. The image processing of the network model is shown in fig. 2, and a minimal model sketch is given below.
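A minimal sketch, assuming a PyTorch implementation, of a network that maps one gray-scale frame at time T to colour predictions for frames T−n through T+n; the class name and layer sizes are illustrative placeholders, not the patented architecture.

```python
import torch
import torch.nn as nn

class SurroundFrameColorizer(nn.Module):
    """Predicts colour images for frames T-n ... T+n from the gray frame at time T."""
    def __init__(self, n=2):
        super().__init__()
        self.num_frames = 2 * n + 1                            # T-n, ..., T, ..., T+n
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3 * self.num_frames, 3, padding=1),  # 3 colour channels per frame
        )

    def forward(self, gray):                                   # gray: (B, 1, H, W)
        out = self.net(gray)                                   # (B, 3*(2n+1), H, W)
        b, _, h, w = out.shape
        return out.view(b, self.num_frames, 3, h, w)           # one colour image per output frame
```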
The loss function is defined as follows. Each individual loss function is the per-pixel mean squared error (MSE) of the corresponding image:

L_t = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²   (1)

in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value.

The coefficient α_n (n = 1, 2, 3, …) in front of the loss function of the corresponding preceding and following frames may be set manually or learned automatically by the network model. The total loss function includes the loss function of the current frame and the loss functions of the surrounding frames:

L = L_T + α_1(L_{T+1} + L_{T−1}) + α_2(L_{T+2} + L_{T−2}) + α_3(L_{T+3} + L_{T−3}) + … + α_n(L_{T+n} + L_{T−n})   (2)
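The following is a minimal sketch of the total loss in equations (1) and (2), assuming PyTorch tensors; the dictionary-by-offset layout and the function name are illustrative, not from the patent.

```python
import torch
import torch.nn.functional as F

def total_loss(preds, targets, alphas):
    """preds / targets: dicts mapping frame offset (0, +1, -1, +2, ...) to image tensors;
    alphas[k-1] is the weight alpha_k applied to the offset +-k loss terms."""
    loss = F.mse_loss(preds[0], targets[0])                         # L_T: per-pixel MSE of the current frame
    for k, alpha in enumerate(alphas, start=1):
        loss = loss + alpha * (F.mse_loss(preds[k], targets[k])     # L_{T+k}
                               + F.mse_loss(preds[-k], targets[-k]))  # L_{T-k}
    return loss
```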
referring to fig. 3, a flowchart of a video motion information acquisition method according to a second embodiment of the present application is shown. The embodiment needs to identify the object category of each frame of image of the video, which specifically includes:
step 200: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
In step 200, the network structure of the network model may be designed with reference to network structures such as those used for target recognition.
Step 210: inputting an image corresponding to the video at the time T to a network model, and outputting an object classification result of the image at the time T, the image corresponding to the time T+n (n=1, 2, 3 …) and the time T-n (n=1, 2, 3 …) by the network model according to a loss function;
In step 210, the number of predicted surrounding frames may be set according to the task characteristics, and the weights of the corresponding loss functions of the surrounding frames can be set manually or learned by the network. In order to predict the object category of the image at time T and the images of the surrounding frames more accurately, the network model continuously learns the motion process of objects as the loss function is updated, encoding the motion information in the model weights. At the same time, by predicting the context frames corresponding to time T, the model can learn contextual information in space and time. The image processing of the network model is shown in fig. 4.
The loss function at time T is a multi-class cross entropy loss function, defined as follows:

L_T = − Σ_{c=1}^{M} y_c log(p_c)

in the above formula, M represents the number of categories, y_c is 0 or 1 and takes the value 1 if category c is the same as the category of the sample, and p_c represents the predicted probability that the sample belongs to category c.
The loss function at other moments and the definition of the total loss function are the same as those of the first embodiment, and will not be described here again.
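A sketch of the combined loss in this second embodiment, assuming (as stated above) a cross entropy term at time T and the embodiment-one MSE terms for the predicted surrounding frames; the function name and tensor layouts are illustrative.

```python
import torch
import torch.nn.functional as F

def classification_total_loss(logits_T, label_T, frame_preds, frame_targets, alphas):
    """logits_T: (B, M) class scores for the frame at time T; label_T: (B,) class indices;
    frame_preds / frame_targets: dicts keyed by frame offset +-k; alphas: per-offset weights."""
    loss = F.cross_entropy(logits_T, label_T)                  # multi-class cross entropy at time T
    for k, alpha in enumerate(alphas, start=1):
        loss = loss + alpha * (F.mse_loss(frame_preds[k], frame_targets[k])
                               + F.mse_loss(frame_preds[-k], frame_targets[-k]))
    return loss
```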
Fig. 5 is a schematic structural diagram of a video motion information acquisition system according to an embodiment of the present application. The video motion information acquisition system comprises a model building module and an image processing module.
Model construction module: a network model is built for video motion information by a method for predicting surrounding frames through a current frame; the network structure of the network model can be designed by referring to the network structures of target recognition, image classification, super resolution, image segmentation and the like.
An image processing module: used for inputting a single-frame image into the network model, the network model outputting the surrounding frame images corresponding to the single-frame image, or an image object classification result, so as to obtain the moving objects in the video. Specifically, the image processing module can handle a plurality of tasks such as image conversion, video classification and video recognition, and at least includes:
An image conversion unit: used for inputting the gray-scale image corresponding to the video at time T into the network model, the network model outputting color images corresponding to time T, time T+n (n = 1, 2, 3, …) and time T−n (n = 1, 2, 3, …) according to the loss function. The number of predicted surrounding frames may be set according to the task characteristics, and the weights of the corresponding loss functions of the surrounding frames can be set manually or learned by the network.

In order to predict the color images of the surrounding frames more accurately, the network model continuously learns the motion process of objects as the loss function is updated, encoding the motion information in the model weights; by predicting the context frames corresponding to time T, the model can learn contextual information in space and time.
The loss function is defined as follows. Each individual loss function is the per-pixel mean squared error (MSE) of the corresponding image:

L_t = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²   (1)

in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value.

The coefficient α_n (n = 1, 2, 3, …) in front of the loss function of the corresponding preceding and following frames may be set manually or learned automatically by the network model. The total loss function includes the loss function of the current frame and the loss functions of the surrounding frames:

L = L_T + α_1(L_{T+1} + L_{T−1}) + α_2(L_{T+2} + L_{T−2}) + α_3(L_{T+3} + L_{T−3}) + … + α_n(L_{T+n} + L_{T−n})   (2)
An image classification unit: used for inputting the image corresponding to the video at time T into the network model, the network model outputting, according to a loss function, the object classification result of the image at time T and the images corresponding to time T+n (n = 1, 2, 3, …) and time T−n (n = 1, 2, 3, …). In order to predict the object category of the image at time T and the images of the surrounding frames more accurately, the network model continuously learns the motion process of objects as the loss function is updated, encoding the motion information in the model weights.
The loss function at time T is a multi-class cross entropy loss function, defined as follows:

L_T = − Σ_{c=1}^{M} y_c log(p_c)

in the above formula, M represents the number of categories, y_c is 0 or 1 and takes the value 1 if category c is the same as the category of the sample, and p_c represents the predicted probability that the sample belongs to category c.
Fig. 6 is a schematic diagram of a hardware device of a video motion information obtaining method according to an embodiment of the present application. As shown in fig. 6, the device includes one or more processors and memory. Taking a processor as an example, the apparatus may further comprise: an input system and an output system.
The processor, memory, input system, and output system may be connected by a bus or other means, with a bus connection being illustrated in fig. 6.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules. The processor executes various functional applications of the electronic device and data processing, i.e., implements the processing methods of the method embodiments described above, by running non-transitory software programs, instructions, and modules stored in the memory.
The memory may include a program storage area and a data storage area, wherein the program storage area may store an operating system and at least one application program required for a function, and the data storage area may store data, etc. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, which may be connected to the processing system via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input system may receive input numeric or character information and generate a signal input. The output system may include a display device such as a display screen.
The one or more modules are stored in the memory and when executed by the one or more processors perform the following operations of any of the method embodiments described above:
step a: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
step b: inputting a single-frame image into the constructed network model, and outputting, by the network model according to a loss function, the surrounding frame images corresponding to the single-frame image; wherein each individual loss function is the per-pixel mean squared error of the corresponding image:

L_t = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²

in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;

the total loss function L is:

L = L_T + α_1(L_{T+1} + L_{T−1}) + α_2(L_{T+2} + L_{T−2}) + α_3(L_{T+3} + L_{T−3}) + … + α_n(L_{T+n} + L_{T−n})

in the above formula, L_T is the loss function of the current frame, α_1(L_{T+1} + L_{T−1}), α_2(L_{T+2} + L_{T−2}), α_3(L_{T+3} + L_{T−3}), …, α_n(L_{T+n} + L_{T−n}) are the loss terms of the surrounding frames, and α_n (n = 1, 2, 3, …) is the coefficient in front of the loss function of the corresponding preceding and following frames.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
Embodiments of the present application provide a non-transitory (non-volatile) computer storage medium storing computer-executable instructions that are operable to:
step a: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
step b: inputting a single-frame image into the constructed network model, and outputting, by the network model according to a loss function, the surrounding frame images corresponding to the single-frame image; wherein each individual loss function is the per-pixel mean squared error of the corresponding image:

L_t = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²

in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;

the total loss function L is:

L = L_T + α_1(L_{T+1} + L_{T−1}) + α_2(L_{T+2} + L_{T−2}) + α_3(L_{T+3} + L_{T−3}) + … + α_n(L_{T+n} + L_{T−n})

in the above formula, L_T is the loss function of the current frame, α_1(L_{T+1} + L_{T−1}), α_2(L_{T+2} + L_{T−2}), α_3(L_{T+3} + L_{T−3}), …, α_n(L_{T+n} + L_{T−n}) are the loss terms of the surrounding frames, and α_n (n = 1, 2, 3, …) is the coefficient in front of the loss function of the corresponding preceding and following frames.
The present embodiments provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to:
step a: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
step b: inputting a single-frame image into the constructed network model, and outputting, by the network model according to a loss function, the surrounding frame images corresponding to the single-frame image; wherein each individual loss function is the per-pixel mean squared error of the corresponding image:

L_t = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²

in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;

the total loss function L is:

L = L_T + α_1(L_{T+1} + L_{T−1}) + α_2(L_{T+2} + L_{T−2}) + α_3(L_{T+3} + L_{T−3}) + … + α_n(L_{T+n} + L_{T−n})

in the above formula, L_T is the loss function of the current frame, α_1(L_{T+1} + L_{T−1}), α_2(L_{T+2} + L_{T−2}), α_3(L_{T+3} + L_{T−3}), …, α_n(L_{T+n} + L_{T−n}) are the loss terms of the surrounding frames, and α_n (n = 1, 2, 3, …) is the coefficient in front of the loss function of the corresponding preceding and following frames.
According to the video motion information acquisition method, system and electronic equipment, the motion information of a video is acquired by utilizing the method that the current frame predicts surrounding frames, a single-frame image is input into a network model, and the network model outputs multi-frame processed pictures or classification results, so that the understanding of moving objects in the video is facilitated. The method is simple to realize, does not need to change the original network structure, does not increase the parameter quantity, does not increase the calculated amount of the network, does not increase extra storage, and is faster in speed and low in calculation cost.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (11)
1. A method for acquiring video motion information, comprising the steps of:
step a: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
step b: inputting a single-frame image into the constructed network model, and outputting, by the network model according to a loss function, the surrounding frame images corresponding to the single-frame image; wherein each individual loss function is the per-pixel mean squared error of the corresponding image:

L_t = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²

in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;

the total loss function L is:

L = L_T + α_1(L_{T+1} + L_{T−1}) + α_2(L_{T+2} + L_{T−2}) + α_3(L_{T+3} + L_{T−3}) + … + α_n(L_{T+n} + L_{T−n})

in the above formula, L_T is the loss function of the current frame, α_1(L_{T+1} + L_{T−1}), α_2(L_{T+2} + L_{T−2}), α_3(L_{T+3} + L_{T−3}), …, α_n(L_{T+n} + L_{T−n}) are the loss terms of the surrounding frames, and α_n (n = 1, 2, 3, …) is the coefficient in front of the loss function of the corresponding preceding and following frames.
2. The method according to claim 1, wherein in the step a, the network structure of the network model includes a target recognition, image classification, super resolution, image segmentation network structure.
3. The video motion information acquisition method according to claim 1 or 2, wherein in the step b, the inputting of a single-frame image into the network model and the outputting, by the network model according to the loss function, of the surrounding frame images corresponding to the single-frame image specifically comprises: inputting the gray-scale image corresponding to the video at time T into the network model, and outputting, by the network model according to a loss function, color images corresponding to time T, time T+n and time T−n, wherein n = 1, 2, 3, ….
4. The video motion information acquisition method according to claim 3, wherein in the step b, the inputting of a single-frame image into the network model and the outputting, by the network model according to the loss function, of the surrounding frame images corresponding to the single-frame image further comprises: inputting the image corresponding to the video at time T into the network model, and outputting, by the network model according to a loss function, the object classification result of the image at time T and the images corresponding to time T+n and time T−n, wherein n = 1, 2, 3, ….
5. The method of claim 4, wherein the loss function of the image at time T is a multi-class cross entropy loss function, defined as follows:

L_T = − Σ_{c=1}^{M} y_c log(p_c)

in the above formula, M represents the number of categories, y_c is 0 or 1 and takes the value 1 if category c is the same as the category of the sample, and p_c represents the predicted probability that the sample belongs to category c.
6. A video motion information acquisition system, comprising:
model construction module: a network model is built for video motion information by a method for predicting surrounding frames through a current frame;
the image processing module is used for inputting a single-frame image into the constructed network model, the network model outputting, according to a loss function, the surrounding frame images corresponding to the single-frame image; wherein each individual loss function is the per-pixel mean squared error of the corresponding image:

L_t = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²

in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;

the total loss function L is:

L = L_T + α_1(L_{T+1} + L_{T−1}) + α_2(L_{T+2} + L_{T−2}) + α_3(L_{T+3} + L_{T−3}) + … + α_n(L_{T+n} + L_{T−n})

in the above formula, L_T is the loss function of the current frame, α_1(L_{T+1} + L_{T−1}), α_2(L_{T+2} + L_{T−2}), α_3(L_{T+3} + L_{T−3}), …, α_n(L_{T+n} + L_{T−n}) are the loss terms of the surrounding frames, and α_n (n = 1, 2, 3, …) is the coefficient in front of the loss function of the corresponding preceding and following frames.
7. The video motion information acquisition system of claim 6, wherein the network structure of the network model includes a target recognition, image classification, super resolution, image segmentation network structure.
8. The video motion information acquisition system according to claim 6 or 7, wherein the image processing module includes:
An image conversion unit: used for inputting the gray-scale image corresponding to the video at time T into the network model, the network model outputting color images corresponding to time T, time T+n and time T−n according to the loss function, wherein n = 1, 2, 3, ….
9. The video motion information acquisition system of claim 8, wherein the image processing module further comprises:
An image classification unit: used for inputting the image corresponding to the video at time T into the network model, the network model outputting, according to a loss function, the object classification result of the image at time T and the images corresponding to time T+n and time T−n, wherein n = 1, 2, 3, ….
10. The video motion information acquisition system of claim 9, wherein the loss function of the image at time T is a multi-class cross entropy loss function, defined as follows:

L_T = − Σ_{c=1}^{M} y_c log(p_c)

in the above formula, M represents the number of categories, y_c is 0 or 1 and takes the value 1 if category c is the same as the category of the sample, and p_c represents the predicted probability that the sample belongs to category c.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the one processor to enable the at least one processor to perform the video motion information acquisition method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911249221.1A CN110991365B (en) | 2019-12-09 | 2019-12-09 | Video motion information acquisition method, system and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911249221.1A CN110991365B (en) | 2019-12-09 | 2019-12-09 | Video motion information acquisition method, system and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110991365A CN110991365A (en) | 2020-04-10 |
CN110991365B true CN110991365B (en) | 2024-02-20 |
Family
ID=70091453
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911249221.1A Active CN110991365B (en) | 2019-12-09 | 2019-12-09 | Video motion information acquisition method, system and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110991365B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109064507A (en) * | 2018-08-21 | 2018-12-21 | 北京大学深圳研究生院 | A kind of flow depth degree convolutional network model method of doing more physical exercises for video estimation |
CN109919087A (en) * | 2019-03-06 | 2019-06-21 | 腾讯科技(深圳)有限公司 | A kind of method of visual classification, the method and device of model training |
CN110059605A (en) * | 2019-04-10 | 2019-07-26 | 厦门美图之家科技有限公司 | A kind of neural network training method calculates equipment and storage medium |
-
2019
- 2019-12-09 CN CN201911249221.1A patent/CN110991365B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109064507A (en) * | 2018-08-21 | 2018-12-21 | 北京大学深圳研究生院 | A kind of flow depth degree convolutional network model method of doing more physical exercises for video estimation |
CN109919087A (en) * | 2019-03-06 | 2019-06-21 | 腾讯科技(深圳)有限公司 | A kind of method of visual classification, the method and device of model training |
CN110059605A (en) * | 2019-04-10 | 2019-07-26 | 厦门美图之家科技有限公司 | A kind of neural network training method calculates equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110991365A (en) | 2020-04-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11468697B2 (en) | Pedestrian re-identification method based on spatio-temporal joint model of residual attention mechanism and device thereof | |
US11176381B2 (en) | Video object segmentation by reference-guided mask propagation | |
WO2020238560A1 (en) | Video target tracking method and apparatus, computer device and storage medium | |
CN108665496B (en) | End-to-end semantic instant positioning and mapping method based on deep learning | |
CN110378288B (en) | Deep learning-based multi-stage space-time moving target detection method | |
CN111402130B (en) | Data processing method and data processing device | |
CN111814621A (en) | Multi-scale vehicle and pedestrian detection method and device based on attention mechanism | |
CN113837938B (en) | Super-resolution method for reconstructing potential image based on dynamic vision sensor | |
CN110363770B (en) | Training method and device for edge-guided infrared semantic segmentation model | |
CN110555420B (en) | Fusion model network and method based on pedestrian regional feature extraction and re-identification | |
CN110826428A (en) | Ship detection method in high-speed SAR image | |
CN115393687A (en) | RGB image semi-supervised target detection method based on double pseudo-label optimization learning | |
CN110334703B (en) | Ship detection and identification method in day and night image | |
CN112967341A (en) | Indoor visual positioning method, system, equipment and storage medium based on live-action image | |
CN109657538B (en) | Scene segmentation method and system based on context information guidance | |
CN116977674A (en) | Image matching method, related device, storage medium and program product | |
CN114049483A (en) | Target detection network self-supervision training method and device based on event camera | |
CN110991365B (en) | Video motion information acquisition method, system and electronic equipment | |
Zhang et al. | Video object detection base on rgb and optical flow analysis | |
CN110929632A (en) | Complex scene-oriented vehicle target detection method and device | |
CN111597967A (en) | Infrared image multi-target pedestrian identification method | |
Su et al. | Small target detection method based on feature fusion for deep learning in state grid environment evaluation | |
CN116994065B (en) | Cloud cluster classification and cloud evolution trend prediction method | |
CN116758029B (en) | Window cleaning machine movement control method and system based on machine vision | |
CN118397038B (en) | Moving object segmentation method, system, equipment and medium based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||