CN110991365B - Video motion information acquisition method, system and electronic equipment - Google Patents

Video motion information acquisition method, system and electronic equipment

Info

Publication number
CN110991365B
Authority
CN
China
Prior art keywords
image
loss function
network model
motion information
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911249221.1A
Other languages
Chinese (zh)
Other versions
CN110991365A (en)
Inventor
邬晶晶
张涌
文森特·周
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201911249221.1A priority Critical patent/CN110991365B/en
Publication of CN110991365A publication Critical patent/CN110991365A/en
Application granted granted Critical
Publication of CN110991365B publication Critical patent/CN110991365B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a video motion information acquisition method, a video motion information acquisition system and an electronic device, comprising the following steps: constructing a network model for video motion information by the method of predicting surrounding frames from the current frame; and inputting a single frame image into the constructed network model, the network model outputting the surrounding frame images corresponding to the single frame image according to a loss function, wherein each individual loss function is the per-pixel root mean square error of its image and the total loss function L is L = L_T + α_1(L_{T+1} + L_{T-1}) + α_2(L_{T+2} + L_{T-2}) + α_3(L_{T+3} + L_{T-3}) + … + α_n(L_{T+n} + L_{T-n}). By acquiring the motion information of a video through predicting surrounding frames from the current frame, the method facilitates understanding of the moving objects in the video, is simple to implement, requires no change to the original network structure, does not increase the number of parameters, the computation of the network, or extra storage, and therefore runs faster with low computational cost.

Description

Video motion information acquisition method, system and electronic equipment
Technical Field
The application belongs to the technical field of video task processing, and in particular relates to a video motion information acquisition method and system and to an electronic device.
Background
In video tasks, a network model needs reliable motion features that reflect the dynamic changes occurring in the video and help the model predict more accurately. The motion information in a video includes the motion of the camera and the motion of objects in the scene; combined with influences such as illumination change, the actual conditions between consecutive frames are complex and changeable, so acquiring video motion information is often a difficulty in processing video tasks. Existing processing methods include the following:
1. Three-dimensional convolution
Convolutional neural networks have been widely used in computer vision in recent years. For video analysis, two-dimensional convolution cannot capture temporal information, so three-dimensional convolution is adopted, adding a time dimension. The input data of a three-dimensional convolution is [time length, feature map length, feature map width, number of channels], and the convolution kernel size is [time span, kernel length, kernel width, number of channels]; the convolution process is equivalent to sliding over three dimensions (the time dimension, feature map length and feature map width), producing new feature maps at different time scales. Because three-dimensional convolution has one more time dimension than two-dimensional convolution, the number of parameters grows by a factor of the time span, so it suffers from a large parameter count, heavy computation and a tendency to overfit.
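For illustration only (not part of the original patent text), a minimal sketch of the shape bookkeeping described above, assuming PyTorch and purely illustrative layer sizes:

```python
# Sketch of 3D convolution over a video clip (PyTorch).
# The concrete sizes below are illustrative assumptions, not values from the patent.
import torch
import torch.nn as nn

# Input clip laid out as [batch, channels, time length, height, width]
clip = torch.randn(1, 3, 16, 112, 112)

# The kernel spans 3 frames in time and 3x3 pixels in space, so compared with a
# 2D kernel the parameter count grows by the time-span factor (here 3x).
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=1)

features = conv3d(clip)   # slides over time, height and width simultaneously
print(features.shape)     # torch.Size([1, 64, 16, 112, 112])
```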
2. Optical flow method
The optical flow method calculates the motion of objects between adjacent frames by using the temporal changes of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame. Typically, optical flow arises from the movement of foreground objects in the scene, the movement of the camera, or both.
The motion speed and motion direction of every pixel in an image form an optical flow field. The purpose of studying the optical flow field is to approximate the motion field, which cannot be obtained directly from the picture sequence; the motion field is the motion of objects in the three-dimensional real world. If the position of point A in frame T is (x_1, y_1) and the corresponding position of point A in frame T+1 is (x_2, y_2), the motion of point A is: (u_x, v_y) = (x_2, y_2) - (x_1, y_1). There are a wide variety of optical flow methods for calculating the position of A in frame T+1, including gradient-based, matching-based, energy-based and phase-based methods. Although end-to-end learned representations have been successful, hand-crafted optical flow features are still widely used in video analysis tasks. However, optical flow extraction is expensive in both space and time: the extracted optical flow must be written to disk for training and testing, which incurs significant storage costs, and the optical flow calculation is not necessarily accurate.
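For illustration only (not part of the original patent text), a minimal sketch of computing dense optical flow between two adjacent frames, assuming OpenCV's Farneback method and random arrays standing in for real video frames:

```python
# Dense optical flow between frame T and frame T+1 (OpenCV Farneback method).
import cv2
import numpy as np

# Two adjacent grayscale frames; random data is used here only as a placeholder.
prev_gray = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
next_gray = np.random.randint(0, 256, (240, 320), dtype=np.uint8)

# Positional arguments: flow, pyr_scale, levels, winsize, iterations, poly_n,
# poly_sigma, flags. The result flow[y, x] = (u, v) is the per-pixel displacement.
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)  # (240, 320, 2)
```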
Disclosure of Invention
The application provides a video motion information acquisition method and system and an electronic device, aiming to solve, at least to some extent, one of the technical problems in the prior art.
In order to solve the above problems, the present application provides the following technical solutions:
a video motion information acquisition method comprises the following steps:
step a: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
step b: inputting a single frame image into the constructed network model, and outputting surrounding frame images corresponding to the single frame image by the network model according to a loss function; wherein each individual loss function is a per pixel root mean square error for each image:
in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;
the total loss function L is:
L = L_T + α_1(L_{T+1} + L_{T-1}) + α_2(L_{T+2} + L_{T-2}) + α_3(L_{T+3} + L_{T-3}) + … + α_n(L_{T+n} + L_{T-n})
in the above formula, L_T is the loss function of the current frame; α_1(L_{T+1} + L_{T-1}), α_2(L_{T+2} + L_{T-2}), α_3(L_{T+3} + L_{T-3}), …, α_n(L_{T+n} + L_{T-n}) are the loss functions of the surrounding frames; and α_n (n = 1, 2, 3 …) is the coefficient preceding the corresponding surrounding-frame loss term.
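For illustration only (not part of the original patent text), a minimal sketch of computing this total loss, assuming PyTorch, manually set coefficients α_n, and the per-image term written as mean squared error (matching the MSE abbreviation used later in the description):

```python
# Total loss: current-frame loss plus weighted surrounding-frame losses.
import torch
import torch.nn.functional as F

def total_loss(pred_frames, true_frames, alphas):
    """pred_frames / true_frames: dicts keyed by frame offset (0 = frame T,
    +/-k = surrounding frames); alphas[k-1] is the coefficient for offset +/-k."""
    loss = F.mse_loss(pred_frames[0], true_frames[0])            # L_T
    for k, alpha_k in enumerate(alphas, start=1):                 # alpha_1 ... alpha_n
        loss = loss + alpha_k * (F.mse_loss(pred_frames[k], true_frames[k]) +
                                 F.mse_loss(pred_frames[-k], true_frames[-k]))
    return loss

# Example with two surrounding frames on each side and manually chosen coefficients:
frames = {k: torch.rand(1, 3, 64, 64) for k in range(-2, 3)}
preds = {k: torch.rand(1, 3, 64, 64) for k in range(-2, 3)}
print(total_loss(preds, frames, alphas=[0.5, 0.25]))
```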
The technical scheme adopted by the embodiment of the application further comprises: in the step a, the network structure of the network model comprises a target identification, image classification, super resolution and image segmentation network structure.
The technical scheme adopted by the embodiment of the application further comprises: in the step b, the inputting a single frame image to the network model, and the outputting, by the network model, a surrounding frame image corresponding to the single frame image according to the loss function specifically includes: inputting a gray scale image corresponding to the video at the time T into the network model, and outputting, by the network model according to a loss function, colored color images corresponding to the time T, the time T+n (n = 1, 2, 3 …) and the time T-n (n = 1, 2, 3 …).
The technical scheme adopted by the embodiment of the application further comprises: in the step b, the inputting the single frame image to the network model, and the outputting, by the network model, the surrounding frame image corresponding to the single frame image according to the loss function further includes: inputting an image corresponding to the video at the time T into the network model, and outputting, by the network model according to a loss function, an object classification result of the image at the time T and the images corresponding to the time T+n (n = 1, 2, 3 …) and the time T-n (n = 1, 2, 3 …).
The technical scheme adopted by the embodiment of the application further comprises: the loss function of the image at the moment T is a multi-classification cross entropy loss function, and is defined as follows:
in the above formula, M represents the number of categories; the value of y_c is 0 or 1, and y_c is 1 if the category is the same as the category of the sample; p_c represents the probability that the predicted sample belongs to category c.
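For illustration only (not part of the original patent text), a minimal numerical sketch of this multi-class cross-entropy loss, assuming NumPy and a hypothetical three-category example:

```python
# Multi-class cross-entropy: minus the sum over categories of y_c * log(p_c).
import numpy as np

def cross_entropy(p, y):
    """p: predicted probabilities over M categories; y: one-hot ground truth."""
    eps = 1e-12                    # guard against log(0)
    return -np.sum(y * np.log(p + eps))

p = np.array([0.7, 0.2, 0.1])      # predicted probabilities, M = 3 categories
y = np.array([1.0, 0.0, 0.0])      # the sample belongs to category 0
print(cross_entropy(p, y))         # roughly -log(0.7) ~ 0.357
```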
The embodiment of the application adopts another technical scheme that: a video motion information acquisition system comprising:
model construction module: a network model is built for video motion information by a method for predicting surrounding frames through a current frame;
the image processing module is used for inputting a single frame image into the constructed network model, and the network model outputs surrounding frame images corresponding to the single frame image according to the loss function; wherein each individual loss function is a per pixel root mean square error for each image:
in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;
the total loss function L is:
L = L_T + α_1(L_{T+1} + L_{T-1}) + α_2(L_{T+2} + L_{T-2}) + α_3(L_{T+3} + L_{T-3}) + … + α_n(L_{T+n} + L_{T-n})
in the above formula, L_T is the loss function of the current frame; α_1(L_{T+1} + L_{T-1}), α_2(L_{T+2} + L_{T-2}), α_3(L_{T+3} + L_{T-3}), …, α_n(L_{T+n} + L_{T-n}) are the loss functions of the surrounding frames; and α_n (n = 1, 2, 3 …) is the coefficient preceding the corresponding surrounding-frame loss term.
The technical scheme adopted by the embodiment of the application further comprises: the network structure of the network model comprises a target identification, image classification, super resolution and image segmentation network structure.
The technical scheme adopted by the embodiment of the application further comprises: the image processing module includes:
an image conversion unit: and the gray level image corresponding to the video at the time T is input into the network model, and the network model outputs colored color images corresponding to the time T, the time T+n (n=1, 2, 3 …) and the time T-n (n=1, 2, 3 …) according to the loss function.
The technical scheme adopted by the embodiment of the application further comprises: the image processing module further includes:
an image classification unit: and the network model is used for inputting an image corresponding to the video at the time T to the network model, and outputting an object classification result of the image at the time T, the image corresponding to the time T+n (n=1, 2, 3 …) and the image corresponding to the time T-n (n=1, 2, 3 …) according to a loss function.
The technical scheme adopted by the embodiment of the application further comprises: the loss function of the image at the moment T is a multi-classification cross entropy loss function, and is defined as follows:
in the above formula, M represents the number of categories; the value of y_c is 0 or 1, and y_c is 1 if the category is the same as the category of the sample; p_c represents the probability that the predicted sample belongs to category c.
The embodiment of the application adopts the following technical scheme: an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the one processor to enable the at least one processor to perform the following operations of the video motion information acquisition method described above:
step a: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
step b: inputting a single frame image into the constructed network model, and outputting surrounding frame images corresponding to the single frame image by the network model according to a loss function; wherein each individual loss function is a per pixel root mean square error for each image:
in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;
the total loss function L is:
L = L_T + α_1(L_{T+1} + L_{T-1}) + α_2(L_{T+2} + L_{T-2}) + α_3(L_{T+3} + L_{T-3}) + … + α_n(L_{T+n} + L_{T-n})
in the above formula, L_T is the loss function of the current frame; α_1(L_{T+1} + L_{T-1}), α_2(L_{T+2} + L_{T-2}), α_3(L_{T+3} + L_{T-3}), …, α_n(L_{T+n} + L_{T-n}) are the loss functions of the surrounding frames; and α_n (n = 1, 2, 3 …) is the coefficient preceding the corresponding surrounding-frame loss term.
Compared with the prior art, the beneficial effects produced by the embodiments of the application are as follows: the video motion information acquisition method, system and electronic device acquire the motion information of a video by predicting surrounding frames from the current frame; a single frame image is input into the network model, and the network model outputs multi-frame processed pictures or classification results, which facilitates understanding of the moving objects in the video. The method is simple to implement, requires no change to the original network structure, does not increase the number of parameters, the computation of the network, or extra storage, and therefore runs faster with low computational cost.
Drawings
Fig. 1 is a flowchart of a video motion information acquisition method according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a network model image processing procedure according to a first embodiment of the present application;
fig. 3 is a flowchart of a video motion information acquisition method according to a second embodiment of the present application;
FIG. 4 is a schematic diagram of a network model image processing procedure according to a second embodiment of the present application;
fig. 5 is a schematic structural diagram of a video motion information acquisition system according to an embodiment of the present application;
fig. 6 is a schematic diagram of a hardware device of a video motion information obtaining method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
To overcome the defects of the prior art, the application constructs a network model for video motion information by the method of predicting surrounding frames from the current frame; in actual application, a single frame image is input into the network model, and the network model outputs the surrounding frame images corresponding to the single frame image, or recognizes an image object classification result, so as to obtain the moving objects in the video. The application can be applied to various video image processing tasks such as image conversion, video classification and video recognition; to describe the technical solution of the application more clearly, image conversion and video classification are taken as examples below.
Referring to fig. 1, a flowchart of a video motion information acquisition method according to a first embodiment of the present application is shown. The embodiment needs to convert black-and-white video into color video, which specifically includes:
step 100: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
in step 100, the network structure of the network model may be designed with reference to the network structure of image classification, super resolution, image segmentation, etc.
Step 110: inputting a gray scale image corresponding to the video at the time T to a network model, and outputting colored color images corresponding to the time T, the time T+n (n=1, 2, 3 …) and the time T-n (n=1, 2, 3 …) according to a loss function by the network model;
in step 110, the predicted number of surrounding frames may be set according to the task characteristics. The predicted weights of the corresponding loss functions of the surrounding frames can be set manually or can be learned by the network. In order to more accurately predict the color image of the surrounding frame, the network model continuously learns the motion process of the object under the update of the loss function, hides the motion information in the weight of the model, predicts the corresponding context frame at the moment T, and can learn the context information in space and time. The image processing of the network model is shown in fig. 2.
The loss function is defined as follows:
the calculation of each individual loss function is the root mean square error (MSE) per pixel of each image:
in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the actual pixel value.
The coefficients α_n (n = 1, 2, 3 …) preceding the loss terms of the preceding and following frames may be set manually or learned automatically by the network model. The total loss function includes the loss function of the current frame and the loss functions of the surrounding frames:
L = L_T + α_1(L_{T+1} + L_{T-1}) + α_2(L_{T+2} + L_{T-2}) + α_3(L_{T+3} + L_{T-3}) + … + α_n(L_{T+n} + L_{T-n})    (2)
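For illustration only (not part of the original patent text), a minimal sketch of this first embodiment under stated assumptions: PyTorch, a toy convolutional backbone standing in for the unspecified network structure, two surrounding frames on each side, and the coefficients α_n made learnable, which is one of the two options described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SurroundingFramePredictor(nn.Module):
    """Toy colorization network: one grayscale frame in, color frames for
    T and T +/- 1..n out, with learnable surrounding-frame coefficients."""

    def __init__(self, n_surround=2):
        super().__init__()
        self.n_surround = n_surround
        out_ch = 3 * (2 * n_surround + 1)          # RGB for frame T and T +/- 1..n
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, padding=1),
        )
        # alpha_1 ... alpha_n, learned jointly with the network weights
        self.alpha = nn.Parameter(torch.ones(n_surround))

    def forward(self, gray_t):
        # gray_t: [B, 1, H, W] -> predictions: [B, 2n+1, 3, H, W]
        out = self.net(gray_t)
        b, _, h, w = out.shape
        return out.view(b, 2 * self.n_surround + 1, 3, h, w)

    def loss(self, preds, targets):
        # preds / targets: [B, 2n+1, 3, H, W]; by convention index n_surround
        # holds frame T, index n_surround + k holds T+k, n_surround - k holds T-k.
        c = self.n_surround
        total = F.mse_loss(preds[:, c], targets[:, c])            # L_T
        for k in range(1, c + 1):
            total = total + self.alpha[k - 1] * (
                F.mse_loss(preds[:, c + k], targets[:, c + k]) +
                F.mse_loss(preds[:, c - k], targets[:, c - k]))
        return total

# Example: one grayscale frame at time T, color targets for T and T +/- 1..2.
model = SurroundingFramePredictor(n_surround=2)
gray = torch.rand(1, 1, 64, 64)
targets = torch.rand(1, 5, 3, 64, 64)
print(model.loss(model(gray), targets))
```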
referring to fig. 3, a flowchart of a video motion information acquisition method according to a second embodiment of the present application is shown. The embodiment needs to identify the object category of each frame of image of the video, which specifically includes:
step 200: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
in step 200, the network structure of the network model may be designed with reference to the network structure such as the target identification.
Step 210: inputting an image corresponding to the video at the time T to a network model, and outputting an object classification result of the image at the time T, the image corresponding to the time T+n (n=1, 2, 3 …) and the time T-n (n=1, 2, 3 …) by the network model according to a loss function;
in step 210, the predicted number of surrounding frames may be set according to the task characteristics. The predicted weights of the corresponding loss functions of the surrounding frames can be set manually or can be learned by the network. In order to more accurately predict the object type of the image at the moment T and the images of surrounding frames, the network model continuously learns the motion process of the object under the update of the loss function, and the motion information is hidden in the weight of the model. And simultaneously, predicting the corresponding context frame at the moment T, and learning the context information in space and time. The image processing of the network model is shown in fig. 4.
The loss function at time T is a multi-class cross entropy loss function defined as follows:
in the above formula, M represents the number of categories; the value of y_c is 0 or 1, and y_c is 1 if the category is the same as the category of the sample; p_c represents the probability that the predicted sample belongs to category c.
The loss function at other moments and the definition of the total loss function are the same as those of the first embodiment, and will not be described here again.
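For illustration only (not part of the original patent text), a minimal sketch of how the second embodiment's total loss can combine the two kinds of terms, assuming PyTorch, a classification output for the frame at time T, frame-prediction outputs for the surrounding frames, and manually set coefficients α_n:

```python
import torch
import torch.nn.functional as F

def classification_total_loss(logits_t, label_t, pred_frames, true_frames, alphas):
    """logits_t: [B, M] class scores for the frame at time T; label_t: [B] class
    indices; pred_frames / true_frames: dicts keyed by offset +/-k holding the
    predicted / true surrounding frames; alphas[k-1] weights offset +/-k."""
    loss = F.cross_entropy(logits_t, label_t)                     # L_T: multi-class CE
    for k, alpha_k in enumerate(alphas, start=1):                  # surrounding frames: MSE
        loss = loss + alpha_k * (F.mse_loss(pred_frames[k], true_frames[k]) +
                                 F.mse_loss(pred_frames[-k], true_frames[-k]))
    return loss

# Example with M = 10 classes and one surrounding frame on each side:
logits = torch.randn(2, 10)
labels = torch.randint(0, 10, (2,))
preds = {k: torch.rand(2, 3, 32, 32) for k in (-1, 1)}
truth = {k: torch.rand(2, 3, 32, 32) for k in (-1, 1)}
print(classification_total_loss(logits, labels, preds, truth, alphas=[0.5]))
```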
Fig. 5 is a schematic structural diagram of a video motion information acquisition system according to an embodiment of the present application. The video motion information acquisition system comprises a model building module and an image processing module.
Model construction module: used for constructing a network model for video motion information by the method of predicting surrounding frames from the current frame; the network structure of the network model can be designed with reference to network structures such as those for target recognition, image classification, super resolution and image segmentation.
An image processing module: used for inputting a single frame image into the network model, the network model outputting the surrounding frame images corresponding to the single frame image, or recognizing an image object classification result, so as to obtain the moving objects in the video. Specifically, the image processing module can handle a plurality of tasks such as image conversion, video classification and video recognition, and at least includes:
an image conversion unit: the method comprises the steps that a gray scale image corresponding to a video at the time T is input to a network model, and the network model outputs colored color images corresponding to the time T, the time T+n (n=1, 2, 3 …) and the time T-n (n=1, 2, 3 …) according to a loss function; the predicted number of surrounding frames may be set according to the task characteristics. The predicted weights of the corresponding loss functions of the surrounding frames can be set manually or can be learned by the network.
To predict the color images of the surrounding frames more accurately, the network model continuously learns the motion process of objects under updates of the loss function and hides the motion information in the model weights; by predicting the context frames corresponding to the moment T, it can learn contextual information in space and time.
The loss function is defined as follows:
the calculation of each individual loss function is the root mean square error (MSE) per pixel of each image:
in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the actual pixel value.
The coefficients α_n (n = 1, 2, 3 …) preceding the loss terms of the preceding and following frames may be set manually or learned automatically by the network model. The total loss function includes the loss function of the current frame and the loss functions of the surrounding frames:
L = L_T + α_1(L_{T+1} + L_{T-1}) + α_2(L_{T+2} + L_{T-2}) + α_3(L_{T+3} + L_{T-3}) + … + α_n(L_{T+n} + L_{T-n})    (2)
an image classification unit: the method comprises the steps that an image corresponding to a video at the time T is input to a network model, and the network model outputs an object classification result of the image at the time T, an image corresponding to the time T+n (n=1, 2, 3 …) and the time T-n (n=1, 2, 3 …) according to a loss function; in order to more accurately predict the object type of the image at the moment T and the images of surrounding frames, the network model continuously learns the motion process of the object under the update of the loss function, and the motion information is hidden in the weight of the model.
The loss function at time T is a multi-class cross entropy loss function defined as follows:
in the above formula, M represents the number of categories; the value of y_c is 0 or 1, and y_c is 1 if the category is the same as the category of the sample; p_c represents the probability that the predicted sample belongs to category c.
Fig. 6 is a schematic diagram of a hardware device of a video motion information obtaining method according to an embodiment of the present application. As shown in fig. 6, the device includes one or more processors and memory. Taking a processor as an example, the apparatus may further comprise: an input system and an output system.
The processor, memory, input system, and output system may be connected by a bus or other means, with a bus connection being illustrated in fig. 6.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules. The processor executes various functional applications of the electronic device and data processing, i.e., implements the processing methods of the method embodiments described above, by running non-transitory software programs, instructions, and modules stored in the memory.
The memory may include a memory program area and a memory data area, wherein the memory program area may store an operating system, at least one application program required for a function; the storage data area may store data, etc. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, which may be connected to the processing system via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input system may receive input numeric or character information and generate a signal input. The output system may include a display device such as a display screen.
The one or more modules are stored in the memory and when executed by the one or more processors perform the following operations of any of the method embodiments described above:
step a: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
step b: inputting a single frame image into the constructed network model, and outputting surrounding frame images corresponding to the single frame image by the network model according to a loss function; wherein each individual loss function is a per pixel root mean square error for each image:
in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;
the total loss function L is:
L = L_T + α_1(L_{T+1} + L_{T-1}) + α_2(L_{T+2} + L_{T-2}) + α_3(L_{T+3} + L_{T-3}) + … + α_n(L_{T+n} + L_{T-n})
in the above formula, L_T is the loss function of the current frame; α_1(L_{T+1} + L_{T-1}), α_2(L_{T+2} + L_{T-2}), α_3(L_{T+3} + L_{T-3}), …, α_n(L_{T+n} + L_{T-n}) are the loss functions of the surrounding frames; and α_n (n = 1, 2, 3 …) is the coefficient preceding the corresponding surrounding-frame loss term.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
Embodiments of the present application provide a non-transitory (non-volatile) computer storage medium storing computer-executable instructions that are operable to:
step a: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
step b: inputting a single frame image into the constructed network model, and outputting surrounding frame images corresponding to the single frame image by the network model according to a loss function; wherein each individual loss function is a per pixel root mean square error for each image:
in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;
the total loss function L is:
L = L_T + α_1(L_{T+1} + L_{T-1}) + α_2(L_{T+2} + L_{T-2}) + α_3(L_{T+3} + L_{T-3}) + … + α_n(L_{T+n} + L_{T-n})
in the above formula, L_T is the loss function of the current frame; α_1(L_{T+1} + L_{T-1}), α_2(L_{T+2} + L_{T-2}), α_3(L_{T+3} + L_{T-3}), …, α_n(L_{T+n} + L_{T-n}) are the loss functions of the surrounding frames; and α_n (n = 1, 2, 3 …) is the coefficient preceding the corresponding surrounding-frame loss term.
The present embodiments provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to:
step a: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
step b: inputting a single frame image into the constructed network model, and outputting surrounding frame images corresponding to the single frame image by the network model according to a loss function; wherein each individual loss function is a per pixel root mean square error for each image:
in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;
the total loss function L is:
L = L_T + α_1(L_{T+1} + L_{T-1}) + α_2(L_{T+2} + L_{T-2}) + α_3(L_{T+3} + L_{T-3}) + … + α_n(L_{T+n} + L_{T-n})
in the above formula, L_T is the loss function of the current frame; α_1(L_{T+1} + L_{T-1}), α_2(L_{T+2} + L_{T-2}), α_3(L_{T+3} + L_{T-3}), …, α_n(L_{T+n} + L_{T-n}) are the loss functions of the surrounding frames; and α_n (n = 1, 2, 3 …) is the coefficient preceding the corresponding surrounding-frame loss term.
According to the video motion information acquisition method, system and electronic device, the motion information of a video is acquired by predicting surrounding frames from the current frame; a single frame image is input into the network model, and the network model outputs multi-frame processed pictures or classification results, which facilitates understanding of the moving objects in the video. The method is simple to implement, requires no change to the original network structure, does not increase the number of parameters, the computation of the network, or extra storage, and therefore runs faster with low computational cost.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method for acquiring video motion information, comprising the steps of:
step a: constructing a network model for video motion information by a method of predicting surrounding frames by a current frame;
step b: inputting a single frame image into the constructed network model, and outputting surrounding frame images corresponding to the single frame image by the network model according to a loss function; wherein each individual loss function is a per pixel root mean square error for each image:
in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;
the total loss function L is:
L = L_T + α_1(L_{T+1} + L_{T-1}) + α_2(L_{T+2} + L_{T-2}) + α_3(L_{T+3} + L_{T-3}) + … + α_n(L_{T+n} + L_{T-n})
in the above formula, L_T is the loss function of the current frame; α_1(L_{T+1} + L_{T-1}), α_2(L_{T+2} + L_{T-2}), α_3(L_{T+3} + L_{T-3}), …, α_n(L_{T+n} + L_{T-n}) are the loss functions of the surrounding frames; and α_n (n = 1, 2, 3 …) is the coefficient preceding the corresponding surrounding-frame loss term.
2. The method according to claim 1, wherein in the step a, the network structure of the network model includes a target recognition, image classification, super resolution, image segmentation network structure.
3. The video motion information acquisition method according to claim 1 or 2, wherein in the step b, the inputting a single frame image to the network model, and the outputting, by the network model, a surrounding frame image corresponding to the single frame image according to the loss function, specifically includes: and inputting a gray scale image corresponding to the video at the moment T to the network model, and outputting colored color images corresponding to the moment T, the moment T+n and the moment T-n according to a loss function by the network model, wherein n=1, 2 and 3 ….
4. The video motion information acquiring method according to claim 3, wherein in the step b, the inputting a single frame image to the network model, the network model outputting a surrounding frame image corresponding to the single frame image according to a loss function further comprises: and inputting an image corresponding to the video at the time T to the network model, and outputting an object classification result of the image at the time T, the time T+n and the image corresponding to the time T-n by the network model according to a loss function, wherein n=1, 2 and 3 ….
5. The method of claim 4, wherein the loss function of the image at the moment T is a multi-class cross entropy loss function, defined as follows:
in the above formula, M represents the number of categories; the value of y_c is 0 or 1, and y_c is 1 if the category is the same as the category of the sample; p_c represents the probability that the predicted sample belongs to category c.
6. A video motion information acquisition system, comprising:
model construction module: a network model is built for video motion information by a method for predicting surrounding frames through a current frame;
the image processing module is used for inputting a single frame image into the constructed network model, and the network model outputs surrounding frame images corresponding to the single frame image according to the loss function; wherein each individual loss function is a per pixel root mean square error for each image:
in the above formula, n represents the number of pixels of the entire image, ŷ_i represents the pixel value predicted by the network, and y_i represents the true pixel value;
the total loss function L is:
L = L_T + α_1(L_{T+1} + L_{T-1}) + α_2(L_{T+2} + L_{T-2}) + α_3(L_{T+3} + L_{T-3}) + … + α_n(L_{T+n} + L_{T-n})
in the above formula, L_T is the loss function of the current frame; α_1(L_{T+1} + L_{T-1}), α_2(L_{T+2} + L_{T-2}), α_3(L_{T+3} + L_{T-3}), …, α_n(L_{T+n} + L_{T-n}) are the loss functions of the surrounding frames; and α_n (n = 1, 2, 3 …) is the coefficient preceding the corresponding surrounding-frame loss term.
7. The video motion information acquisition system of claim 6, wherein the network structure of the network model includes a target recognition, image classification, super resolution, image segmentation network structure.
8. The video motion information acquisition system according to claim 6 or 7, wherein the image processing module includes:
an image conversion unit: and the gray level image corresponding to the video at the time T is input to the network model, and the network model outputs colored color images corresponding to the time T, the time T+n and the time T-n according to the loss function, wherein n=1, 2 and 3 ….
9. The video motion information acquisition system of claim 8, wherein the image processing module further comprises:
an image classification unit: and the network model is used for inputting an image corresponding to the video at the time T to the network model, and outputting an object classification result of the image at the time T, the time T+n and the image corresponding to the time T-n according to a loss function, wherein n=1, 2 and 3 ….
10. The video motion information acquisition system of claim 9, wherein the loss function of the image at the moment T is a multi-class cross entropy loss function, defined as follows:
in the above formula, M represents the number of categories; the value of y_c is 0 or 1, and y_c is 1 if the category is the same as the category of the sample; p_c represents the probability that the predicted sample belongs to category c.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the one processor to enable the at least one processor to perform the video motion information acquisition method of any one of claims 1 to 5.
CN201911249221.1A 2019-12-09 2019-12-09 Video motion information acquisition method, system and electronic equipment Active CN110991365B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911249221.1A CN110991365B (en) 2019-12-09 2019-12-09 Video motion information acquisition method, system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911249221.1A CN110991365B (en) 2019-12-09 2019-12-09 Video motion information acquisition method, system and electronic equipment

Publications (2)

Publication Number Publication Date
CN110991365A CN110991365A (en) 2020-04-10
CN110991365B true CN110991365B (en) 2024-02-20

Family

ID=70091453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911249221.1A Active CN110991365B (en) 2019-12-09 2019-12-09 Video motion information acquisition method, system and electronic equipment

Country Status (1)

Country Link
CN (1) CN110991365B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109064507A (en) * 2018-08-21 2018-12-21 北京大学深圳研究生院 A kind of flow depth degree convolutional network model method of doing more physical exercises for video estimation
CN109919087A (en) * 2019-03-06 2019-06-21 腾讯科技(深圳)有限公司 A kind of method of visual classification, the method and device of model training
CN110059605A (en) * 2019-04-10 2019-07-26 厦门美图之家科技有限公司 A kind of neural network training method calculates equipment and storage medium

Also Published As

Publication number Publication date
CN110991365A (en) 2020-04-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant