CN114973410A - Method and device for extracting motion characteristics of video frame


Info

Publication number
CN114973410A
CN114973410A (application CN202210550792.4A)
Authority
CN
China
Prior art keywords
feature
frame
map
neighborhood
next frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210550792.4A
Other languages
Chinese (zh)
Inventor
龙拂尘
邱钊凡
潘滢炜
姚霆
梅涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202210550792.4A priority Critical patent/CN114973410A/en
Publication of CN114973410A publication Critical patent/CN114973410A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4007Interpolation-based scaling, e.g. bilinear interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person

Abstract

The invention provides a method and a device for extracting motion characteristics of a video frame. The method obtains a feature map of each frame in the video data; for any target frame, a first feature corresponding to a set spatial coordinate showing a moving object in the target frame and a second feature corresponding to a neighborhood of a set size showing the moving object in the next frame are determined according to the feature map of the target frame and the feature map of the next frame; the aggregation feature of the neighborhood in the next frame is then determined according to the first feature and the second feature, and the action feature in the target frame is determined based on the aggregation feature. Because the aggregation feature is determined based on the relevance between the moving object shown at the set spatial coordinate corresponding to the first feature and the moving object shown in the neighborhood corresponding to the second feature, compared with the prior art, which only captures the relevance between corresponding coordinate points in adjacent frames, the method and the device enlarge the collection area and improve the accuracy of the motion characteristics of the video frame.

Description

Method and device for extracting motion characteristics of video frame
Technical Field
The invention relates to the technical field of deep learning, in particular to a method and a device for extracting motion characteristics of video frames.
Background
With the increasing demand of people for shooting videos, a large amount of video data has accumulated in various scenes, so video understanding, namely the automatic identification and analysis of video content, is required. Motion recognition is a core area of video understanding and is used for recognizing the motions appearing in a video, generally the motions of people in the video, although the motions of objects other than human bodies can also be recognized.
In the related art, it is usually assumed by default that the motion between consecutive frames of a video is well aligned in space, and feature aggregation is performed on the features at the same position between adjacent frames so that motion features are collected from the aggregated features. Assuming by default that the motion between consecutive video frames is well aligned in space narrows the application range and can lead to inaccurate collection of video motion features when the object moves fast or the motion deforms strongly.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present invention is to provide a method for extracting motion features of a video frame, so as to determine an aggregation feature of a neighborhood region in a next frame according to a first feature corresponding to a set spatial coordinate showing a motion object in a target frame and a second feature corresponding to a neighborhood region showing a set size of the motion object in the next frame, determine motion features in the target frame based on the aggregation feature, and improve accuracy of the motion features of the video frame.
The second purpose of the present invention is to provide an apparatus for extracting motion characteristics of video frames.
A third objective of the present invention is to provide an electronic device.
A fourth object of the invention is to propose a non-transitory computer-readable storage medium.
A fifth object of the invention is to propose a computer program product.
To achieve the above object, an embodiment of a first aspect of the present invention provides a method for extracting motion features of a video frame, including:
acquiring a characteristic map of each frame in video data;
aiming at any target frame, according to the feature map of the target frame and the feature map of the next frame, determining a first feature corresponding to a set space coordinate for displaying a moving object in the target frame and a second feature corresponding to a neighborhood for displaying a set size of the moving object in the next frame; the first feature is used for indicating a moving object shown by a corresponding set space coordinate in the target frame; the second feature is used for indicating the moving objects shown in the corresponding same set spatial coordinate neighborhood in the next frame;
determining the aggregation characteristic of the neighborhood in the next frame according to the first characteristic and the second characteristic; the aggregation feature is used for indicating the relevance between the moving object shown by the set space coordinate corresponding to the first feature and the moving object shown in the neighborhood corresponding to the second feature;
and fusing the first feature and the aggregation feature to obtain the action feature in the target frame.
Optionally, as a first possible implementation manner of the first aspect, the determining, for the arbitrary target frame, second features corresponding to a neighborhood showing a set size of the moving object in a next frame according to the feature map of the target frame and the feature map of the next frame includes:
aiming at any one target frame, determining a motion significance map according to the feature map of the target frame and the feature map of the next frame;
for the motion significance map, predicting the deviation of the characteristics of each space coordinate in the neighborhood with the set size in the next frame by adopting a coordinate deviation estimator so as to determine a plurality of sampling coordinates in the neighborhood according to the deviation;
determining sampling characteristics corresponding to the sampling coordinates by a bilinear interpolation method;
and determining a second feature corresponding to a neighborhood of a set size of the set space coordinate in the next frame according to each sampling feature.
Optionally, as a second possible implementation manner of the first aspect, the determining, for the any one target frame, a motion saliency map according to the feature map of the target frame and the feature map of the next frame includes:
aiming at any one target frame, determining an inter-frame feature difference according to the feature map of the target frame and the feature map of the next frame;
normalizing the inter-frame feature difference by adopting an activation function to obtain a corresponding attention map; wherein the attention map is used to indicate a spatial position of a moving object having motion between two frames;
and multiplying the attention map and the feature map of the next frame to obtain a motion significance map.
Optionally, as a third possible implementation manner of the first aspect, the determining, according to the first feature and the second feature, an aggregated feature of the neighborhood in a next frame includes:
performing matrix multiplication operation on the first characteristic and the second characteristic to obtain a similarity matrix; the similarity matrix is used for representing the similarity between the space coordinate features in the neighborhood corresponding to the first feature and the second feature;
and taking the similarity matrix as a weight, and performing matrix multiplication operation with the transposed matrix of the second feature to obtain the aggregation feature of the neighborhood in the next frame.
Optionally, as a fourth possible implementation manner of the first aspect, the method further includes:
and aiming at any one target frame, determining a feature map of the target frame and a feature map of a next frame by adopting a two-dimensional convolutional neural network or a visual feature extraction transform network.
Optionally, as a fifth possible implementation manner of the first aspect, the determining, by using a two-dimensional convolutional neural network and/or a visual feature extraction Transformer network, a feature map of the target frame and a feature map of a next frame includes:
determining a feature map of the target frame and the next frame by adopting a convolution kernel in a basic residual module of the two-dimensional convolutional neural network; alternatively,
determining a feature map for the target frame and a next frame using an MSA module in the Transformer network with a regular window configuration.
According to the method for extracting the action features of a video frame provided by the embodiment of the invention, a feature map of each frame in the video data is obtained; for any target frame, a first feature corresponding to a set spatial coordinate showing the moving object in the target frame and a second feature corresponding to a neighborhood of a set size showing the moving object in the next frame are determined according to the feature map of the target frame and the feature map of the next frame; the aggregation feature of the neighborhood in the next frame is then determined according to the first feature and the second feature, and the action feature in the target frame is determined based on the aggregation feature. Because the aggregation feature is determined based on the relevance between the moving object shown at the set spatial coordinate corresponding to the first feature and the moving object shown in the neighborhood corresponding to the second feature, compared with the prior art, which only captures the relevance between corresponding coordinate points in adjacent frames, the method enlarges the collection area and improves the accuracy of the motion features of the video frame.
To achieve the above object, a second aspect of the present invention provides an apparatus for extracting motion characteristics of a video frame, including:
the acquisition module is used for acquiring a feature map of each frame in the video data;
a first determining module, configured to determine, for any one target frame, a first feature corresponding to a set spatial coordinate showing a moving object in the target frame and a second feature corresponding to a neighborhood showing a set size of the moving object in a next frame according to a feature map of the target frame and a feature map of the next frame; the first feature is used for indicating a moving object shown by a corresponding set space coordinate in the target frame; the second feature is used for indicating the moving object shown in the corresponding same set spatial coordinate neighborhood in the next frame;
a second determining module, configured to determine an aggregation feature of the neighborhood in a next frame according to the first feature and the second feature; the aggregation feature is used for indicating the relevance between the moving object shown by the set space coordinate corresponding to the first feature and the moving object shown in the neighborhood corresponding to the second feature;
and the fusion module is used for fusing the first characteristic and the aggregation characteristic to obtain the action characteristic in the target frame.
Optionally, as a first possible implementation manner of the second aspect, the first determining module includes:
a first determining unit, configured to determine, for the any one target frame, a motion saliency map according to the feature map of the target frame and the feature map of the next frame;
a prediction unit, configured to, for the motion saliency map, predict, using a coordinate offset estimator, an offset of a feature of each spatial coordinate in the neighborhood of the set size in the next frame, so as to determine a plurality of sampling coordinates in the neighborhood according to the offset;
the second determining unit is used for determining sampling characteristics corresponding to the sampling coordinates by adopting a bilinear interpolation method;
and the third determining unit is used for determining second characteristics corresponding to the neighborhood of the set size of the set space coordinate in the next frame according to the sampling characteristics.
Optionally, as a second possible implementation manner of the second aspect, the first determining unit is further configured to:
aiming at any one target frame, determining an inter-frame feature difference according to the feature map of the target frame and the feature map of the next frame;
normalizing the inter-frame feature difference by adopting an activation function to obtain a corresponding attention map; wherein the attention map is used to indicate a spatial position of a moving object having motion between two frames;
and multiplying the attention map and the feature map of the next frame to obtain a motion significance map.
Optionally, as a third possible implementation manner of the second aspect, the second determining module includes:
the first processing unit is used for carrying out matrix multiplication operation on the first characteristic and the second characteristic to obtain a similarity matrix; the similarity matrix is used for representing the similarity between the space coordinate features in the neighborhood corresponding to the first feature and the second feature;
and the second processing unit is used for performing matrix multiplication operation on the similarity matrix serving as a weight and the transposed matrix of the second characteristic to obtain the aggregation characteristic of the neighborhood in the next frame.
Optionally, as a fourth possible implementation manner of the second aspect, the apparatus further includes:
and the third determining module is used for determining the feature map of the target frame and the feature map of the next frame by adopting a two-dimensional convolutional neural network or a visual feature extraction Transformer network aiming at any one target frame.
Optionally, as a fifth possible implementation manner of the second aspect, the third determining module is further configured to:
determining a feature map of the target frame and the next frame by adopting a convolution kernel in a basic residual module of the two-dimensional convolutional neural network; alternatively,
determining a feature map for the target frame and a next frame using an MSA module in the Transformer network with a regular window configuration.
The action feature extraction device for video frames provided in the embodiment of the present invention, by obtaining the feature map of each frame in the video data, determines, for any one target frame, a first feature corresponding to a set spatial coordinate showing a moving object in the target frame and a second feature corresponding to a neighborhood showing a set size of the moving object in the next frame according to the feature map of the target frame and the feature map of the next frame, thereby determining an aggregation feature of the neighborhood in the next frame according to the first feature and the second feature, and determining an action feature in the target frame based on the aggregation feature. Because the aggregation characteristic is determined based on the relevance between the moving object shown by the set space coordinate corresponding to the first characteristic and the moving object shown by the neighborhood corresponding to the second characteristic, compared with the prior art in which the relevance between corresponding coordinate points in adjacent frames is only captured, the method and the device enlarge the acquisition region and improve the accuracy of the motion characteristic of the video frame.
To achieve the above object, a third aspect of the present invention provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In order to achieve the above object, a fourth aspect of the present invention provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect.
In order to achieve the above object, an embodiment of a fifth aspect of the present invention provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the method of the first aspect.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a method for extracting motion characteristics of a video frame according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a time-domain convolution scheme, a time-domain self-attention scheme and the present invention according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of another method for extracting motion characteristics of a video frame according to an embodiment of the present invention;
fig. 4 is a schematic flow chart of determining a motion saliency map according to an embodiment of the present invention;
fig. 5 is a schematic flowchart of another method for extracting motion characteristics of a video frame according to an embodiment of the present invention;
fig. 6 is a schematic diagram illustrating a method for extracting motion characteristics of a video frame according to an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating another method for extracting motion characteristics of video frames according to an embodiment of the present invention;
fig. 8 is a flowchart illustrating another method for extracting motion characteristics of video frames according to an embodiment of the present invention;
fig. 9 is a schematic diagram illustrating an application of a method for extracting motion features of a video frame according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an apparatus for extracting motion characteristics of a video frame according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of another motion feature extraction apparatus for video frames according to an embodiment of the present invention; and
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
A motion feature extraction method, apparatus, electronic device, storage medium, and computer program product of a video frame according to embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a method for extracting motion characteristics of a video frame according to an embodiment of the present invention.
In the related art, with the continuous development of deep learning, the following two schemes are mainly adopted to extract the action features of a video frame: one is a temporal convolution scheme, which uses spatio-temporal three-dimensional convolution or decomposed spatio-temporal three-dimensional convolution (spatial convolution + temporal convolution) to capture motion; the other is a temporal self-attention scheme, which uses a self-attention mechanism to extract features in the time domain. However, both schemes rest on a common assumption that the motion can be well aligned in space between consecutive frames, so that feature aggregation can be performed on the motion corresponding to the same position between adjacent frames and motion features can then be collected from the aggregated features. Assuming by default that the motion between consecutive video frames is well aligned in space narrows the application range and can cause inaccurate collection of video motion features when the object moves fast or the motion deforms strongly.
To solve this problem, an embodiment of the present invention provides a method for extracting motion features of a video frame, which determines the aggregation feature of a neighborhood in the next frame according to a first feature corresponding to a set spatial coordinate showing the moving object in the target frame and a second feature corresponding to a neighborhood of a set size showing the moving object in the next frame, determines the motion features in the target frame based on the aggregation feature, and thereby improves the accuracy of the motion features of the video frame. As shown in fig. 1, the method for extracting motion features of a video frame includes the following steps:
step 101, obtaining a feature map of each frame in video data.
It should be noted that the method for extracting motion characteristics of a video frame according to the embodiment of the present invention may be executed by a device for extracting motion characteristics of a video frame. The motion feature extraction device of the video frame may be an electronic device, or may be configured in the electronic device. The electronic device may be any stationary or mobile computing device capable of performing data processing, for example, a mobile computing device such as a notebook computer, a smart phone, and a wearable device, or a stationary computing device such as a desktop computer, or a server, or other types of computing devices, and the like, which is not limited in the embodiment of the present invention.
It will be appreciated that a video may be viewed as being made up of a series of sequences of video frames. In this embodiment, the motion feature extraction device for video frames may obtain a feature map of each frame in the video data, and perform the subsequent steps. As a possible implementation manner, the motion feature extraction device of the video frame may employ a neural network to process the video data to obtain a feature map of each frame in the video data. It should be noted that the motion feature extraction device for video frames in this embodiment may obtain the feature map of each frame in the video data in various public, legal, and compliant manners.
In a possible implementation manner of this embodiment, the motion feature extraction device of the video frame may acquire the feature map of each frame in the video data in an online acquisition manner or an offline acquisition manner. For example, the motion feature extraction device of the video frame may acquire the feature map of each frame in the video data in real time at a historical time after being authorized, or may acquire the feature map of each frame in the video data in a manual manner through offline after being authorized, and the like, which is not limited in this embodiment.
In another possible implementation manner of this embodiment, the motion feature extraction device of the video frame may obtain the feature map of each frame in the video data by means of network transmission or physical copy. For example, the motion feature extraction device of the video frame may obtain the feature map of each frame in the video data after authorization, and the feature map is transmitted or physically copied from other devices through a network.
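As a hedged illustration of the neural-network-based acquisition described above, the following sketch extracts a feature map for every frame of a clip with an ordinary 2D convolutional backbone. The choice of a torchvision ResNet-50, the (batch, length, channels, height, width) input layout, and the `weights=None` argument (recent torchvision versions) are assumptions made for this sketch only; the embodiment does not prescribe a particular backbone.

```python
# Minimal sketch: obtaining a feature map for every frame of a video clip.
# Assumptions: PyTorch + a recent torchvision are available; ResNet-50 and the
# B x L x 3 x H x W input layout are illustrative choices, not mandated by the patent.
import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop pool + fc
backbone.eval()

def per_frame_feature_maps(video: torch.Tensor) -> torch.Tensor:
    """video: (B, L, 3, H, W) -> feature maps F: (B, C, L, H', W')."""
    b, l, c, h, w = video.shape
    with torch.no_grad():
        feats = backbone(video.reshape(b * l, c, h, w))            # (B*L, C', H', W')
    _, c2, h2, w2 = feats.shape
    return feats.reshape(b, l, c2, h2, w2).permute(0, 2, 1, 3, 4)  # (B, C', L, H', W')

clip = torch.randn(1, 8, 3, 224, 224)        # one clip of 8 frames
F = per_frame_feature_maps(clip)
print(F.shape)                               # e.g. torch.Size([1, 2048, 8, 7, 7])
```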
Step 102, for any one target frame, according to the feature map of the target frame and the feature map of the next frame, determining a first feature corresponding to a set space coordinate showing a moving object in the target frame and a second feature corresponding to a neighborhood of a set size showing the moving object in the next frame.
In this embodiment, the first feature may be used to indicate a moving object shown by a corresponding set spatial coordinate in the target frame, and the second feature is used to indicate a moving object shown in a neighboring area of the same set spatial coordinate corresponding to the next frame, so that, for any one target frame, the first feature corresponding to the set spatial coordinate showing the moving object in the target frame and the second feature corresponding to the neighboring area of the set size showing the moving object in the next frame may be determined according to the obtained feature map of the target frame and the feature map of the next frame. The set space coordinates may be represented by (x, y), and the neighborhood of the set size may be represented by a k × k grid. It should be noted that, the specific values of x and y in the set spatial coordinates (x and y) and k in the set size k × k are not limited in this embodiment, and optionally, the specific values may be set according to manual experience, or may be dynamically adjusted according to actual application requirements, which is not limited in this embodiment.
In a possible implementation manner of this embodiment, the first feature may be a query feature, that is, the feature of the item to be queried, which may be understood as the feature corresponding to the set spatial coordinate showing the moving object in the target frame; the second feature may be a key feature and a value feature, that is, the features of the items in the set to be queried, which may be understood as the features corresponding to each spatial coordinate in the neighborhood of the set size showing the moving object in the next frame. The key feature and the value feature differ only in notation. Optionally, in response to the obtained feature map of the video frame being a three-dimensional feature map, the three-dimensional feature map may be converted into a two-dimensional sequence, and then, for any one target frame, the first feature, namely the query feature, corresponding to the set spatial coordinate showing the moving object in the target frame and the second feature, namely the key feature and the value feature, corresponding to the neighborhood of the set size showing the moving object in the next frame are determined. For example, assume that the feature map of each frame in the video data is a three-dimensional feature map F with dimension C × L × H × W, where C, H × W and L represent the channel size, the two-dimensional spatial size and the temporal length of the features, respectively. The three-dimensional feature map F can first be converted into a two-dimensional sequence form, i.e., the spatial map of each frame is flattened so that f_t ∈ R^{C×(H·W)} for t = 1, …, L. Then, for any target frame, assumed to be the t-th frame, the feature at the set spatial coordinate (x, y) in the t-th frame, namely the query feature, is determined and denoted Q_t ∈ R^C, and the features of the neighborhood of the set size surrounding the set spatial coordinate (x, y) in the (t+1)-th frame, namely the key feature and the value feature, are denoted K_{t+1} ∈ R^{C×(k×k)} and V_{t+1} ∈ R^{C×(k×k)}.
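To make this notation concrete, the following minimal sketch gathers the query feature Q_t at a set coordinate (x, y) of the t-th frame and the key/value features K_{t+1}, V_{t+1} from a regular k × k neighborhood around the same coordinate in the (t+1)-th frame. PyTorch tensors, zero padding at the borders, and the omission of the learned coordinate offsets introduced in a later embodiment are assumptions of this illustration.

```python
# Illustrative sketch of the first/second features (query vs. neighborhood key/value).
# Assumption: a regular k x k grid around (x, y); the learned coordinate offsets
# described in the later embodiment are omitted here for clarity.
import torch
import torch.nn.functional as F

def query_and_neighborhood(feat_t, feat_t1, x, y, k=3):
    """feat_t, feat_t1: (C, H, W) feature maps of frame t and frame t+1.
    Returns Q_t: (C,) and K_{t+1} = V_{t+1}: (C, k*k)."""
    C, H, W = feat_t.shape
    q = feat_t[:, y, x]                                   # first feature Q_t
    pad = k // 2
    padded = F.pad(feat_t1, (pad, pad, pad, pad))         # zero-pad the next frame
    patch = padded[:, y:y + k, x:x + k]                   # k x k neighborhood at (x, y)
    kv = patch.reshape(C, k * k)                          # second feature K_{t+1} / V_{t+1}
    return q, kv, kv

feat_t = torch.randn(256, 14, 14)
feat_t1 = torch.randn(256, 14, 14)
Q, K, V = query_and_neighborhood(feat_t, feat_t1, x=5, y=7, k=3)
print(Q.shape, K.shape)   # torch.Size([256]) torch.Size([256, 9])
```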
Step 103, determining the aggregation feature of the neighborhood in the next frame according to the first feature and the second feature.
In this embodiment, the aggregation feature is used to indicate the association between the moving object shown by the set spatial coordinates corresponding to the first feature and the moving object shown in the neighborhood corresponding to the second feature. Since the first feature may be used to indicate a moving object displayed in a corresponding set spatial coordinate in the target frame, and the second feature is used to indicate a moving object displayed in a corresponding neighboring area of the same set spatial coordinate in the next frame, the aggregation feature of the neighboring area in the next frame may be determined based on the first feature and the second feature. As a possible implementation manner, similarity between a moving object shown by a set space coordinate corresponding to a first feature and a moving object shown in a neighborhood corresponding to a second feature may be measured according to the first feature and the second feature to obtain a corresponding similarity matrix, and feature fusion is performed on the similarity matrix and the second feature in a channel-by-channel manner to obtain an aggregation feature of the neighborhood in a next frame.
Step 104, fusing the first feature and the aggregation feature to obtain the action feature in the target frame.
In this embodiment, since the aggregation feature is used to indicate the association between the moving object shown by the set spatial coordinate corresponding to the first feature and the moving object shown in the neighborhood corresponding to the second feature, the aggregation feature is used to enhance the first feature corresponding to the set spatial coordinate showing the moving object in the target frame, so as to obtain the motion feature in the target frame. Therefore, the method can capture the characteristics of the same space coordinate in the time domain, and also can consider the relevance between adjacent frames, so that the finally obtained action characteristics in the target frame are more accurate.
In one possible implementation manner of this embodiment, the first feature and the aggregation feature may be summed to obtain the action feature in the target frame. For example, assume that the first feature is denoted Q_t and the aggregation feature is denoted A_{t+1}; the action feature Y_t in the target frame can then be calculated as:

Y_t = Q_t + A_{t+1}
it can be understood that, in any set spatial coordinate position of any target frame, the technical solution will mine the correlation between the feature of the position in the target frame and the feature of each spatial coordinate position in the neighborhood of the set size around the position in the next frame, so that the feature map of the target frame will be enhanced by the feature map of the next frame in this way. Thus, the present solution operates between each pair of adjacent frames. It should be noted that, as for the last frame in the video data, since no correlation processing is performed on the next frame, it is possible to enhance itself by calculating the correlation between the feature of any set spatial coordinate position of the frame and the feature of each spatial coordinate position in the neighborhood of the set size around the position, thereby ensuring that the technical solution can maintain a fixed time domain length in the time domain.
According to the method for extracting the action features of video frames provided by the embodiment of the invention, a feature map of each frame in the video data is obtained; for any target frame, a first feature corresponding to a set spatial coordinate showing the moving object in the target frame and a second feature corresponding to a neighborhood of a set size showing the moving object in the next frame are determined according to the feature map of the target frame and the feature map of the next frame; the aggregation feature of the neighborhood in the next frame is then determined according to the first feature and the second feature, and the action feature in the target frame is determined based on the aggregation feature. Because the aggregation feature is determined based on the relevance between the moving object shown at the set spatial coordinate corresponding to the first feature and the moving object shown in the neighborhood corresponding to the second feature, compared with the prior art, which only captures the relevance between corresponding coordinate points in adjacent frames, the method enlarges the collection area and improves the accuracy of the motion features of the video frame.
As can be seen from the above analysis, there are three different schemes for extracting features of a video frame, namely, a time domain convolution scheme, a time domain self-attention scheme, and the present technical scheme, and in order to clearly illustrate differences between the three schemes, the present invention further provides schematic diagrams of the three schemes, and fig. 2 is a schematic diagram of a time domain convolution scheme, a time domain self-attention scheme, and a schematic diagram of the technical scheme provided in an embodiment of the present invention. Fig. 2(a) is a time domain convolution scheme, fig. 2(b) is a time domain self-attention scheme, and fig. 2(c) is the present technical solution.
As shown in fig. 2(a), in the temporal convolution scheme, the features of multiple consecutive video frames at the same spatial coordinate position are convolved in temporal order. When the motion deformation in the video is too large, the same spatial coordinate position shows different moving objects in consecutive frames; for example, the position shows the sports ground in the (t-1)-th and (t+1)-th frames but shows a pole vaulter in the t-th frame. Consequently, when temporal feature aggregation is performed with temporal convolution, i.e., features are aggregated at the same spatial position over time, the video motion features are lost.
As shown in fig. 2(b), in the temporal self-attention scheme, a feature of any set spatial coordinate position of a target frame, i.e., a t-th frame, is used as a query feature, a feature of the same spatial coordinate position of a previous frame, i.e., a t-1-th frame, is used as a key feature, and a feature of the same spatial coordinate position of a next frame, i.e., a t + 1-th frame, is used as a value feature to perform temporal feature aggregation, and similarly, when the amplitude of motion deformation in a video frame is too large, different moving objects are represented at the same spatial coordinate position between consecutive frames, which causes a problem that video motion features are lost.
As shown in fig. 2(c), in the present technical solution, the feature at any set spatial coordinate position of the target frame, i.e., the t-th frame, is used as the query feature, and the features at each spatial coordinate position within a neighborhood of the set size around the same spatial coordinate position in the next frame, i.e., the (t+1)-th frame, are used as the key features and value features for temporal feature aggregation. Feature aggregation is thus expanded from the same spatial coordinate position to every spatial coordinate position within the neighborhood. On the one hand, this enlarges the receptive field of the temporal aggregation; on the other hand, each spatial coordinate position in the neighborhood is explicitly used to align the overall spatial motion. Therefore, even when the motion deforms greatly, the temporal feature aggregation performed between consecutive frames can be guaranteed to occur at the corresponding motion region, which enhances the temporal feature aggregation effect and alleviates the loss of video motion features caused by large-scale movement or deformation.
As can be seen from the above analysis, in the embodiment of the present invention, for any one target frame, according to the feature map of the target frame and the feature map of the next frame, the second feature corresponding to the neighborhood displaying the set size of the moving object in the next frame is determined.
Fig. 3 is a flowchart illustrating another method for extracting motion characteristics of a video frame according to an embodiment of the present invention.
As shown in fig. 3, the method for extracting motion features of a video frame may include the following steps:
step 301, obtaining a feature map of each frame in video data.
It should be noted that the execution process of this step may refer to the execution process of step 101 in the above embodiment, and the principle is the same, and is not described herein again.
Step 302, for any target frame, according to the feature map of the target frame and the feature map of the next frame, determining a first feature and a motion saliency map corresponding to the set spatial coordinates of the motion object shown in the target frame.
In this embodiment, the first feature may be used to indicate a moving object shown by a corresponding set spatial coordinate in the target frame, and for any one target frame, the first feature corresponding to the set spatial coordinate showing the moving object in the target frame may be determined according to the obtained feature map of the target frame.
In this embodiment, since the feature map of any one target frame and the feature map of the next frame can be obtained, a Motion Saliency Map (MSM) can be determined from the feature map of the target frame and the feature map of the next frame, denoted f_m.
Step 303, predicting the characteristic offset of each spatial coordinate in the neighborhood of a set size in the next frame by using a coordinate offset estimator according to the motion saliency map, so as to determine a plurality of sampling coordinates in the neighborhood according to the offset.
In the present embodiment, for the motion saliency map f_m, a coordinate offset estimator may be employed to predict the offset of the feature at each spatial coordinate within the neighborhood of the set size in the next frame, and a plurality of sampling coordinates within the neighborhood are then determined from the offset values. The coordinate offset estimator can be realized by a two-dimensional convolution whose number of output channels is 2k². Optionally, the offset of the feature at each spatial coordinate in the neighborhood of the set size in the next frame may be expressed as (Δa, Δb), i.e., (Δa, Δb) is the coordinate displacement corresponding to a spatial coordinate point p = (a, b) within the k × k grid neighborhood of the set size centered on the set spatial coordinate (x, y) of the target frame, so that the corresponding sampling coordinate in the neighborhood can be expressed as p' = (a + Δa, b + Δb); a plurality of sampling coordinates in the neighborhood can thereby be determined.
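A brief sketch of the coordinate offset estimator is given below; the 3 × 3 kernel size, the zero initialisation, and the use of a single Conv2d layer are illustrative assumptions, as the embodiment only specifies a two-dimensional convolution with 2k² output channels.

```python
# Sketch of the coordinate offset estimator: a 2D convolution over the motion
# saliency map predicts, for every query position, 2*k*k offsets (da, db) that
# displace the k x k regular neighborhood in the next frame.
# Assumptions: PyTorch; the 3x3 kernel and zero initialisation are illustrative.
import torch
import torch.nn as nn

k = 3
C = 256
offset_estimator = nn.Conv2d(C, 2 * k * k, kernel_size=3, padding=1)
nn.init.zeros_(offset_estimator.weight)   # start from the regular (undeformed) grid
nn.init.zeros_(offset_estimator.bias)

motion_saliency = torch.randn(1, C, 14, 14)          # f_m
offsets = offset_estimator(motion_saliency)          # (1, 2*k*k, 14, 14)
offsets = offsets.reshape(1, k * k, 2, 14, 14)       # (da, db) per neighborhood cell
print(offsets.shape)
```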
Step 304, determining sampling features corresponding to each sampling coordinate by using a bilinear interpolation method.
In this embodiment, after determining the plurality of sampling coordinates in the neighborhood, a bilinear interpolation method may be used to determine the sampling feature corresponding to each sampling coordinate. Optionally, for any sampling coordinate p', the sampling feature K'_{t+1}(p') can be determined by the following formula:

K'_{t+1}(p') = Σ_p G(p, p') · K_{t+1}(p)

where p' denotes the sampling coordinate, i.e. the differentiable spatial position (the spatial position containing the offset), p ranges over all integer spatial positions within the neighborhood (i.e. the original regular positions), K_{t+1}(p) denotes the feature corresponding to position p in the regular k × k neighborhood grid, and G is the bilinear interpolation kernel.
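The bilinear sampling step can be sketched as follows; using torch.nn.functional.grid_sample (which applies the same bilinear kernel G) and normalising the sampling coordinates to [-1, 1] are assumptions of this illustration rather than requirements of the embodiment.

```python
# Sketch: sampling the key/value features of frame t+1 at the fractional coordinates
# p' = p + (da, db) with bilinear interpolation. grid_sample realises the kernel G of
# the formula above; coordinate normalisation to [-1, 1] is required by that API.
import torch
import torch.nn.functional as F

def sample_at(feat_t1: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """feat_t1: (1, C, H, W); coords: (N, 2) fractional (x, y) pixel coordinates.
    Returns sampled features of shape (C, N)."""
    _, C, H, W = feat_t1.shape
    # normalise pixel coordinates to the [-1, 1] range expected by grid_sample
    norm = coords.clone()
    norm[:, 0] = 2 * coords[:, 0] / (W - 1) - 1
    norm[:, 1] = 2 * coords[:, 1] / (H - 1) - 1
    grid = norm.view(1, 1, -1, 2)                         # (1, 1, N, 2)
    out = F.grid_sample(feat_t1, grid, mode='bilinear', align_corners=True)
    return out.view(C, -1)                                # (C, N)

feat_t1 = torch.randn(1, 256, 14, 14)
coords = torch.tensor([[4.3, 6.8], [5.1, 7.2]])           # two sampling coordinates p'
print(sample_at(feat_t1, coords).shape)                   # torch.Size([256, 2])
```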
Step 305, according to each sampling feature, determining a second feature corresponding to a neighborhood of a set size of a set spatial coordinate in a next frame.
In this embodiment, after obtaining the sampling feature corresponding to each sampling coordinate, the second feature corresponding to the neighborhood of the set size around the set spatial coordinate in the next frame can be determined from the sampling features. Optionally, the k² sampling features corresponding to the sampling coordinates in the neighborhood of the next frame may be taken as the second feature corresponding to the neighborhood of the set size around the set spatial coordinate in the next frame, namely the key feature and the value feature, denoted K_{t+1} ∈ R^{C×(k×k)} and V_{t+1} ∈ R^{C×(k×k)}.
Step 306, determining the aggregation feature of the neighborhood in the next frame according to the first feature and the second feature.
Step 307, fusing the first feature and the aggregation feature to obtain the action feature in the target frame.
It should be noted that the execution process of steps 306-307 may refer to the execution process of steps 103-104 in the above embodiment, and the principle is the same, and is not described herein again.
According to the method for extracting the action features of the video frame, provided by the embodiment of the invention, a first feature and a motion significance map corresponding to a set space coordinate showing a motion object in a target frame are determined according to a feature map of the target frame and a feature map of a next frame aiming at any target frame, so that a coordinate offset estimator is adopted to predict the offset of the feature of each space coordinate in a neighborhood with a set size in the next frame aiming at the motion significance map, a plurality of sampling coordinates in the neighborhood are determined according to the offset, and after the sampling feature corresponding to each sampling coordinate is determined by adopting a bilinear interpolation method, a second feature corresponding to the neighborhood with the set size of the space coordinate in the next frame is determined according to each sampling feature. Therefore, by predicting the coordinate offset of the neighborhood on the motion significance characteristic map, the problem that geometric deformation caused by motion of a moving object is neglected because calculation is directly carried out in the neighborhood with a set size is avoided.
In order to clearly illustrate the process of determining the motion saliency map according to the feature map of the target frame and the feature map of the next frame in step 302 in the embodiment shown in fig. 3, this embodiment provides a flowchart of determining the motion saliency map shown in fig. 4, and as shown in fig. 4, determining the motion saliency map may include the following steps:
step 401, for any target frame, determining an inter-frame feature difference according to the feature map of the target frame and the feature map of the next frame.
Here, for any one target frame, the inter-frame feature difference may be determined from the feature map of the target frame and the feature map of the next frame. Optionally, the feature map of the target frame (the t-th frame) may be denoted f_t and the feature map of the next frame (the (t+1)-th frame) may be denoted f_{t+1}, so that the inter-frame feature difference Δf between the two feature maps can be calculated by the following formula:

Δf = f_{t+1} − f_t
and 402, normalizing the inter-feature difference by using an activation function to obtain a corresponding attention map.
Here, after determining the inter-frame feature difference, the inter-frame feature difference may be normalized by using an activation function to obtain a corresponding attention map. Wherein the attention map is used to indicate the spatial position of a moving object with motion between two frames. Alternatively, the inter-frame feature difference may be expressed as Δ f, and the activation function may be a sigmoid function, so that the attention map may be expressed as sigmoid (Δ f).
Step 403, multiplying the attention map and the feature map of the next frame to obtain the motion saliency map.
Here, since the attention map is used to indicate the spatial position of a moving object for which there is motion between the two frames, that is, the attention map dynamically indicates the spatial positions in the next frame where the motion is relatively large, the motion saliency map is obtained by multiplying the attention map by the feature map of the next frame. Optionally, with the attention map expressed as sigmoid(Δf) and the feature map of the next frame (the (t+1)-th frame) denoted f_{t+1}, the motion saliency map f_m can be calculated by the following formula:

f_m = sigmoid(Δf) × f_{t+1}
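The three steps above reduce to a few tensor operations; the following sketch reproduces them, assuming PyTorch and illustrative tensor shapes.

```python
# Sketch of the motion saliency map: inter-frame feature difference, sigmoid
# normalisation into an attention map, and element-wise weighting of the next frame.
import torch

def motion_saliency_map(f_t: torch.Tensor, f_t1: torch.Tensor) -> torch.Tensor:
    """f_t, f_t1: (C, H, W) feature maps of frame t and frame t+1."""
    delta_f = f_t1 - f_t                      # inter-frame feature difference
    attention = torch.sigmoid(delta_f)        # attention map of moving regions
    return attention * f_t1                   # f_m = sigmoid(Δf) × f_{t+1}

f_t, f_t1 = torch.randn(256, 14, 14), torch.randn(256, 14, 14)
f_m = motion_saliency_map(f_t, f_t1)
print(f_m.shape)   # torch.Size([256, 14, 14])
```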
In summary, for any target frame, the inter-frame feature difference is determined according to the feature map of the target frame and the feature map of the next frame, and normalization processing is performed on the inter-frame feature difference by using an activation function to obtain a corresponding attention map, so that the attention map and the feature map of the next frame are multiplied to obtain a motion saliency map. Therefore, the motion significance map can be determined according to the feature map of the target frame and the feature map of the next frame aiming at any target frame.
Through the above analysis, in the embodiment of the present invention, the aggregation feature of the neighborhood in the next frame may be determined according to the first feature and the second feature, and in order to clearly illustrate how to determine the aggregation feature of the neighborhood in the next frame according to the first feature and the second feature, the present invention further provides a method for extracting an action feature of a video frame.
Fig. 5 is a flowchart illustrating another method for extracting motion characteristics of a video frame according to an embodiment of the present invention.
As shown in fig. 5, the method for extracting motion features of a video frame may include the following steps:
step 501, obtaining a feature map of each frame in video data.
Step 502, for any one target frame, according to the feature map of the target frame and the feature map of the next frame, determining a first feature corresponding to a set space coordinate showing a moving object in the target frame and a second feature corresponding to a neighborhood of a set size showing the moving object in the next frame.
It should be noted that, the execution process of steps 501-502 may refer to the execution process of steps 101-102 in the foregoing embodiment, and the principle is the same, which is not described herein again.
Step 503, performing matrix multiplication operation on the first feature and the second feature to obtain a similarity matrix.
In this embodiment, the first feature and the second feature may be subjected to a matrix multiplication operation to obtain a similarity matrix, which is used to represent the similarity between the first feature and the spatial coordinate features in the neighborhood corresponding to the second feature. Optionally, the first feature may be the query feature, denoted Q_t ∈ R^C, and the second feature may be the key feature and the value feature, denoted K_{t+1} ∈ R^{C×(k×k)} and V_{t+1} ∈ R^{C×(k×k)} (the key feature and the value feature differ only in notation); the similarity matrix is denoted W_cor. A matrix multiplication operation can therefore be performed on the query feature and the key feature to obtain the similarity matrix W_cor, with the following calculation formula:

W_cor = Q_t ⊗ K_{t+1}

where ⊗ denotes a matrix multiplication, i.e. dot product, operation. The similarity matrix W_cor measures the similarity between the query feature and the spatial coordinate features within the neighborhood covered by the key feature.
Step 504, the similarity matrix is used as a weight, and is subjected to matrix multiplication operation with the transpose matrix of the second feature to obtain the aggregation feature of the neighborhood in the next frame.
In this embodiment, the similarity matrix may be used as a weight and subjected to a matrix multiplication operation with the transposed matrix of the second feature to obtain the aggregation feature of the neighborhood in the next frame. Optionally, with the key feature and value feature denoted K_{t+1} ∈ R^{C×(k×k)} and V_{t+1} ∈ R^{C×(k×k)} (differing only in notation), the similarity matrix denoted W_cor, and the aggregation feature denoted A_{t+1}, W_cor can be used as the weight and matrix-multiplied with the transposed matrix of the value feature to obtain the aggregation feature A_{t+1} of the neighborhood in the next frame, with the following calculation formula:

A_{t+1} = W_cor ⊗ V_{t+1}^T

where ⊗ denotes a matrix multiplication, i.e. dot product, operation and [·]^T denotes matrix transposition. Used as a weight in this way, the similarity matrix aggregates the spatial coordinate features within the neighborhood of the adjacent frame, thereby enhancing the query feature.
Step 505, fusing the first feature and the aggregation feature to obtain the action feature in the target frame.
It should be noted that the execution process of this step may refer to the execution process of step 104 in the foregoing embodiment, and the principle is the same, and is not described herein again.
According to the action feature extraction method for video frames provided by the embodiment of the invention, the first feature and the second feature are subjected to a matrix multiplication operation to obtain a similarity matrix, where the similarity matrix expresses the similarity between the first feature and the spatial coordinate features in the neighborhood corresponding to the second feature; the similarity matrix is then used as a weight and multiplied with the transposed matrix of the second feature to obtain the aggregation feature of the neighborhood in the next frame. In this way, aggregation is expanded from the same spatial coordinate position alone to every spatial coordinate position in the neighborhood, each spatial coordinate position in the neighborhood is explicitly used to align the overall spatial motion, and the temporal feature aggregation effect is enhanced.
In order to more clearly illustrate the above embodiments, the description will now be made by way of example.
Fig. 6 is a schematic diagram illustrating a method for extracting motion characteristics of a video frame according to an embodiment of the present invention. Fig. 6(a) shows the Stand-alone Inter-Frame Attention (SIFA) operator proposed by the method for extracting motion features of a video frame provided by the present invention, Fig. 6(b) shows a joint spatio-temporal self-attention operator, and Fig. 6(c) shows a decoupled spatio-temporal self-attention operator.
As shown in the time aggregation between the adjacent frames in the upper left corner of fig. 6, in the motion feature extraction method for video frames, motion feature extraction may be performed on consecutive video frames from frame 1 to frame t +1, and the feature map of the target frame is enhanced by the feature map of the next frame each time.
As shown in the upper right corner of Fig. 6 for determining the action feature of the t-th frame, a matrix multiplication operation may be performed on the query feature corresponding to the set spatial coordinate of the t-th frame and the key feature corresponding to the neighborhood of the set size of the (t+1)-th frame to obtain the similarity matrix W_cor. The similarity matrix W_cor is then multiplied with the value feature corresponding to the neighborhood of the set size of the (t+1)-th frame to obtain the aggregation feature A_{t+1}, and the aggregation feature A_{t+1} is added to the query feature corresponding to the set spatial coordinate of the t-th frame to enhance the query feature.
As shown in Fig. 6(a), in the SIFA operator proposed by the method for extracting motion features of a video frame provided by the present invention, the 4th matrix in the second row of the feature map of the t-th frame may be used as the set spatial coordinate, and the square region surrounded by the first to third rows and the third to fifth columns of the feature map of the (t+1)-th frame may be used as the neighborhood of the set size in the (t+1)-th frame. Because the correlation calculation is performed only within the neighborhood of the next frame, the amount of calculation is low. Moreover, because the fusion feature is determined from a larger neighborhood rather than by simply mining the temporal change at the same spatial position, the information interaction between frames is enriched.
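As a rough sketch of how such a fixed k x k neighborhood could be gathered for every spatial coordinate at once, the following example uses torch.nn.functional.unfold; it shows a plain, non-shifted neighborhood only, and the feature-map size, the neighborhood size, and the chosen cell are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

C, H, W, k = 64, 7, 7, 3
feat_t1 = torch.randn(1, C, H, W)                  # feature map of frame t+1

# Collect the k x k neighborhood around every spatial coordinate (zero padding at borders).
# unfold returns (1, C*k*k, H*W); each column is one C x (k*k) key/value block.
neigh = F.unfold(feat_t1, kernel_size=k, padding=k // 2)
neigh = neigh.view(1, C, k * k, H * W)

# Neighborhood used for the cell in the second row, fourth column (0-indexed row 1, col 3):
cell_index = 1 * W + 3
k_t1 = neigh[0, :, :, cell_index]                  # shape (C, k*k)
```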
As shown in Fig. 6(b), in the joint spatio-temporal self-attention operator, the 4th matrix in the second row of the feature map of the t-th frame is likewise used as the set spatial coordinate, i.e. the target matrix. When the feature of this matrix is calculated, the features of all matrices in the feature map of the (t-1)-th frame and the features of all matrices in the feature map of the (t+1)-th frame are involved, so that the feature of the target matrix is determined from all matrices of the (t-1)-th frame and all matrices of the (t+1)-th frame, and the amount of calculation is large.
As shown in Fig. 6(c), in the decoupled spatio-temporal self-attention operator, the 4th matrix in the second row of the feature map of the t-th frame is also used as the set spatial coordinate, i.e. the target matrix. When the feature of this matrix is calculated, not only the features of the feature maps of the (t-1)-th and (t+1)-th frames, but also the features of the 4th matrix in the second row of the feature map of the (t-1)-th frame and of the 4th matrix in the second row of the feature map of the (t+1)-th frame are calculated to determine the feature of the target matrix, and the amount of calculation is likewise large.
Fig. 7 is a schematic diagram illustrating a method for extracting motion characteristics of a video frame according to another embodiment of the present invention. As shown in Fig. 7, first, the query feature corresponding to the set spatial coordinate is determined from the feature map f_t of the t-th frame, and the inter-frame feature difference Δf is determined from the feature map f_t of the t-th frame and the feature map f_{t+1} of the (t+1)-th frame. The inter-frame feature difference Δf is then normalized with a sigmoid function, and the normalized difference is multiplied with the feature map f_{t+1} of the (t+1)-th frame to obtain the motion saliency map f_m. Next, based on the motion saliency map f_m, a coordinate offset estimator predicts the feature offset of each spatial coordinate in the neighborhood of the set size in the next frame, yielding a plurality of sampling coordinates in the neighborhood; the sampling feature corresponding to each sampling coordinate is determined by bilinear interpolation, so as to determine the key feature and the value feature corresponding to the neighborhood of the set size in the (t+1)-th frame. Finally, a matrix multiplication operation is performed on the query feature corresponding to the set spatial coordinate of the t-th frame and the key feature corresponding to the neighborhood of the set size of the (t+1)-th frame to obtain the similarity matrix W_cor; the similarity matrix W_cor is multiplied with the value feature corresponding to the neighborhood of the set size of the (t+1)-th frame to obtain the aggregation feature A_{t+1}, and the aggregation feature A_{t+1} is added to the query feature corresponding to the set spatial coordinate of the t-th frame to enhance the query feature.
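The whole pipeline of Fig. 7 can be summarized in the following sketch. The offset estimator is modelled here as a single 3x3 convolution, the bilinear interpolation as grid_sample, and the offset channel layout is an assumed convention; none of these concrete choices are asserted to be the exact disclosed structure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, H, W, k = 64, 7, 7, 3
f_t  = torch.randn(1, C, H, W)   # feature map of frame t
f_t1 = torch.randn(1, C, H, W)   # feature map of frame t+1

# 1) Inter-frame feature difference and motion saliency map f_m.
delta_f = f_t1 - f_t
f_m = torch.sigmoid(delta_f) * f_t1

# 2) Coordinate offset estimator (assumed here to be a single 3x3 convolution):
#    for every spatial coordinate it predicts 2D offsets for the k*k sampling
#    positions of the neighborhood in frame t+1.
offset_estimator = nn.Conv2d(C, 2 * k * k, kernel_size=3, padding=1)
offsets = offset_estimator(f_m)                                     # (1, 2*k*k, H, W)
offsets = offsets.view(1, k * k, 2, H, W).permute(0, 1, 3, 4, 2)    # (1, k*k, H, W, 2), x/y order assumed

# 3) Bilinear sampling of frame t+1 at the offset coordinates. For simplicity the
#    offsets are taken relative to the cell itself rather than to a base k x k grid.
base_y, base_x = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
base = torch.stack((base_x, base_y), dim=-1).float()                # (H, W, 2)
coords = base + offsets                                             # (1, k*k, H, W, 2)
coords = 2 * coords / torch.tensor([W - 1.0, H - 1.0]) - 1          # normalise to [-1, 1]
grid = coords.reshape(1, k * k * H, W, 2)
sampled = F.grid_sample(f_t1, grid, mode="bilinear", align_corners=True)
keys_values = sampled.reshape(1, C, k * k, H, W)                    # sampled neighborhood per coordinate
```

In this sketch, keys_values plays the role of the key and value features K_{t+1} and V_{t+1} used in the attention described for Fig. 6.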
It can be understood that the method for extracting motion features of a video frame provided by the present invention can be applied to a two-dimensional convolutional neural network and to a visual Transformer network. In order to clearly illustrate how this is done, the present invention further provides another method for extracting motion features of a video frame.
Fig. 8 is a flowchart illustrating another method for extracting motion characteristics of a video frame according to an embodiment of the present invention.
As shown in fig. 8, the method for extracting motion features of a video frame may include the following steps:
step 801, obtaining a feature map of each frame in video data.
It should be noted that the execution process of this step may refer to the execution process of step 101 in the foregoing embodiment, and the principle is the same, and is not described herein again.
Step 802, aiming at any one target frame, a two-dimensional convolutional neural network or a visual feature extraction Transformer network is adopted to determine a feature map of the target frame and a feature map of a next frame.
In this embodiment, for any one target frame, a two-dimensional convolutional neural network or a visual feature extraction Transformer network may be adopted to determine the feature map of the target frame and the feature map of the next frame. Specifically, the feature maps of the target frame and the next frame may be determined by using a convolution kernel in a basic residual module of the two-dimensional convolutional neural network, or by using an MSA module with a regular window configuration in the Transformer network.
Step 803, for any one target frame, according to the feature map of the target frame and the feature map of the next frame, determining a first feature corresponding to the set spatial coordinates showing the moving object in the target frame and a second feature corresponding to the neighborhood of the set size showing the moving object in the next frame.
And step 804, determining the aggregation characteristics of the neighborhood in the next frame according to the first characteristics and the second characteristics.
Step 805, the first feature and the aggregation feature are fused to obtain the action feature in the target frame.
It should be noted that, the process of the steps 803-805 can refer to the process of the steps 102-104 in the above embodiments, and the principle is the same, which is not described herein again.
According to the method for extracting the action features of the video frame provided by the embodiment of the invention, for any target frame, the feature map of the target frame and the feature map of the next frame are determined by adopting a two-dimensional convolutional neural network or a visual feature extraction Transformer network, so that video feature learning is improved.
In order to more clearly illustrate the above embodiments, the description will now be made by way of example.
Fig. 9 is a schematic application diagram of a method for extracting motion features of a video frame according to an embodiment of the present invention. Fig. 9(a) illustrates the application of the motion feature extraction method for video frames provided by the present invention to a two-dimensional convolutional neural network, and fig. 9(b) illustrates the application of the motion feature extraction method for video frames provided by the present invention to a visual Transformer network.
As shown in Fig. 9(a), most video network structures decouple a space-time three-dimensional convolution into a two-dimensional spatial convolution and a one-dimensional temporal convolution, and the one-dimensional temporal convolution is usually embedded after the 2D spatial convolution for temporal modeling. Therefore, the SIFA-Block module provided by the method for extracting motion characteristics of video frames provided by the present invention can be embedded after the 3x3 convolution in the ResNet basic residual module, where the SIFA-Block module performs the process shown in Fig. 7. It will be appreciated that, since the SIFA-Block module is embedded only in the last three stages of ResNet, only a small amount of computation is added. Moreover, a global pooling operation is applied on the output features to obtain frame-level features for final video action classification optimization.
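A minimal sketch of this placement, assuming a standard ResNet-style basic block and using an identity placeholder SIFABlock for the process of Fig. 7, could look as follows (channel counts and module names are illustrative only):

```python
import torch
import torch.nn as nn

class SIFABlock(nn.Module):
    """Placeholder for the inter-frame attention of Fig. 7 (identity here)."""
    def forward(self, x):                          # x: (N*T, C, H, W) frame-level features
        return x

class BasicBlockWithSIFA(nn.Module):
    """ResNet-style basic residual block with a SIFA-Block inserted after the first 3x3 conv."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.sifa = SIFABlock()                    # temporal modelling after the 3x3 convolution
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.sifa(out)                       # enhance each frame with its next frame
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)                  # residual connection
```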
The SIFA-Block module provided by the method for extracting the motion characteristics of the video frame provided by the present invention can also be embedded into a Swin-Transformer base network to construct a SIFA-Transformer for modeling video features. Specifically, for two consecutive Swin-Transformer base modules, as shown in Fig. 9(b), the SIFA-Block module may be placed directly after the MSA module with regular window configuration (W-MSA). Here, the model may reshape the patch sequence dimensions output by the W-MSA module to C×L×H×W as the input of the SIFA-Block module. For the final reshaped output, the model may use global pooling to obtain frame-level features for feature learning.
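A hedged sketch of this reshaping step, assuming a batch of B clips with L frames and H x W patches of C channels each (all values illustrative), could be:

```python
import torch

B, L, H, W, C = 2, 8, 7, 7, 96
tokens = torch.randn(B * L, H * W, C)        # patch sequence output of the W-MSA module

# Reshape to (B, C, L, H, W) so that a SIFA-Block can attend between adjacent frames.
x = tokens.view(B, L, H, W, C).permute(0, 4, 1, 2, 3).contiguous()

# ... SIFA-Block would operate on x here (see Fig. 7) ...

# Reshape back to the patch-sequence layout expected by the following Swin modules.
tokens_out = x.permute(0, 2, 3, 4, 1).reshape(B * L, H * W, C)
```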
In order to implement the above embodiments, the present invention further provides a device for extracting motion characteristics of a video frame.
Fig. 10 is a schematic structural diagram of an apparatus for extracting motion characteristics of a video frame according to an embodiment of the present invention.
As shown in fig. 10, the motion feature extraction device for video frames includes: an acquisition module 11, a first determination module 12, a second determination module 13 and a fusion module 14.
The acquiring module 11 is configured to acquire a feature map of each frame in the video data;
a first determining module 12, configured to determine, for any target frame, a first feature corresponding to a set spatial coordinate showing a moving object in the target frame and a second feature corresponding to a neighborhood showing a set size of the moving object in a next frame according to a feature map of the target frame and a feature map of the next frame; the first feature is used for indicating a moving object shown by a corresponding set space coordinate in the target frame; the second feature is used for indicating the moving object shown in the corresponding same set spatial coordinate neighborhood in the next frame;
a second determining module 13, configured to determine, according to the first feature and the second feature, an aggregation feature of the neighborhood in a next frame; the aggregation feature is used for indicating the relevance between the moving object shown by the set space coordinate corresponding to the first feature and the moving object shown in the neighborhood corresponding to the second feature;
and a fusion module 14, configured to fuse the first feature and the aggregated feature to obtain an action feature in the target frame.
Further, in a possible implementation manner of the embodiment of the present invention, the first determining module 12 includes:
a first determining unit 1201, configured to determine, for the any one target frame, a motion saliency map according to the feature map of the target frame and the feature map of the next frame;
a prediction unit 1202, configured to predict, for the motion saliency map, a deviation of a feature of each spatial coordinate in the neighborhood of the set size in the next frame by using a coordinate deviation estimator, so as to determine a plurality of sampling coordinates in the neighborhood according to the deviation;
a second determining unit 1203, configured to determine, by using a bilinear interpolation method, a sampling feature corresponding to each sampling coordinate;
a third determining unit 1204, configured to determine, according to each of the sampling features, a second feature corresponding to a neighborhood of a set size of a set spatial coordinate in a next frame.
Further, in a possible implementation manner of the embodiment of the present invention, the first determining unit 1201 is further configured to:
aiming at any one target frame, determining an inter-frame feature difference according to the feature map of the target frame and the feature map of the next frame;
normalizing the inter-frame feature difference by adopting an activation function to obtain a corresponding attention map; wherein the attention map is used to indicate a spatial position of a moving object having motion between two frames;
and multiplying the attention map and the feature map of the next frame to obtain a motion significance map.
Further, in a possible implementation manner of the embodiment of the present invention, the second determining module 13 includes:
the first processing unit is used for carrying out matrix multiplication operation on the first characteristic and the second characteristic to obtain a similarity matrix; the similarity matrix is used for representing the similarity between the space coordinate features in the neighborhood corresponding to the first feature and the second feature;
and the second processing unit is used for performing matrix multiplication operation on the similarity matrix serving as a weight and the transposed matrix of the second feature to obtain the aggregation feature of the neighborhood in the next frame.
It should be noted that the foregoing explanation on the embodiment of the motion feature extraction method for video frames is also applicable to the motion feature extraction apparatus for video frames in this embodiment, and is not repeated here.
Based on the foregoing embodiment, an embodiment of the present invention further provides a possible implementation manner of a motion feature extraction apparatus for a video frame, fig. 11 is a schematic structural diagram of another motion feature extraction apparatus for a video frame according to an embodiment of the present invention, and on the basis of the foregoing embodiment, the motion feature extraction apparatus for a video frame further includes: a third determination module 15.
And a third determining module 15, configured to determine, for any one of the target frames, a feature map of the target frame and a feature map of a next frame by using a two-dimensional convolutional neural network or a visual feature extraction Transformer network.
Further, in a possible implementation manner of the embodiment of the present invention, the third determining module 15 is further configured to:
determining the feature maps of the target frame and the next frame by adopting a convolution kernel in a basic residual module of the two-dimensional convolutional neural network; or,
determining a feature map for the target frame and a next frame using an MSA module in the Transformer network with a regular window configuration.
According to the action feature extraction device for video frames provided in the embodiment of the present invention, the feature map of each frame in the video data is obtained; for any one target frame, a first feature corresponding to the set spatial coordinate showing a moving object in the target frame and a second feature corresponding to the neighborhood of a set size showing the moving object in the next frame are determined according to the feature map of the target frame and the feature map of the next frame; the aggregation feature of the neighborhood in the next frame is then determined according to the first feature and the second feature, and the action feature in the target frame is determined based on the aggregation feature. Because the aggregation feature is determined based on the relevance between the moving object shown by the set spatial coordinate corresponding to the first feature and the moving object shown in the neighborhood corresponding to the second feature, compared with the prior art which only captures the relevance between corresponding coordinate points in adjacent frames, the device expands the aggregation area and improves the accuracy of the motion features of the video frame.
In order to implement the above embodiments, the present invention further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method for extracting motion characteristics of video frames according to any of the embodiments of the present invention.
Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, which may implement the processes in the embodiments shown in fig. 1 to 11 of the present invention, and as shown in fig. 12, the electronic device may include: the device comprises a shell 1, a processor 2, a memory 3, a circuit board 4 and a power circuit 5, wherein the circuit board 4 is arranged in a space enclosed by the shell 1, and the processor 2 and the memory 3 are arranged on the circuit board 4; a power supply circuit 5 for supplying power to each circuit or device of the electronic apparatus; the memory 3 is used for storing executable program codes; the processor 2 reads the executable program code stored in the memory 3 to run a program corresponding to the executable program code, so as to execute the motion feature extraction method for video frames according to any of the foregoing embodiments.
The specific execution process of the above steps by the processor 2 and the steps further executed by the processor 2 by running the executable program code may refer to the description of the embodiments shown in fig. 1 to 11 of the present invention, and are not described herein again.
In order to achieve the above embodiments, the present invention further provides a computer readable storage medium storing computer instructions, wherein the computer instructions are configured to cause the computer to execute the method for extracting motion characteristics of a video frame according to any one of the foregoing embodiments of the present invention.
In order to implement the foregoing embodiments, the present invention further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the motion feature extraction method for video frames according to any of the foregoing embodiments of the present invention.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of various embodiments or examples described in this specification can be combined and brought together by those skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques known in the art may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
Persons of ordinary skill in the art will appreciate that all or a portion of the steps carried in implementing the methods of the embodiments described above may be implemented by associated hardware that is instructed to execute a program, which may be stored in a computer-readable storage medium, that when executed, includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. While embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and not to be construed as limiting the present invention, and that variations, modifications, substitutions and alterations thereof may be made by those of ordinary skill in the art within the scope of the present invention.

Claims (15)

1. A motion feature extraction method of a video frame is characterized by comprising the following steps:
acquiring a feature map of each frame in video data;
aiming at any target frame, determining a first feature corresponding to a set space coordinate for displaying a moving object in the target frame and a second feature corresponding to a neighborhood for displaying a set size of the moving object in a next frame according to a feature map of the target frame and a feature map of the next frame; the first feature is used for indicating a moving object shown by the corresponding set space coordinate in the target frame; the second feature is used for indicating the moving object shown in the corresponding same set spatial coordinate neighborhood in the next frame;
determining an aggregation characteristic of the neighborhood in a next frame according to the first characteristic and the second characteristic; the aggregation feature is used for indicating the relevance between the moving object shown by the set space coordinate corresponding to the first feature and the moving object shown in the neighborhood corresponding to the second feature;
and fusing the first feature and the aggregation feature to obtain the action feature in the target frame.
2. The method according to claim 1, wherein the determining, for the any one target frame, a second feature corresponding to a neighborhood showing a set size of the moving object in a next frame according to the feature map of the target frame and a feature map of the next frame comprises:
aiming at any one target frame, determining a motion significance map according to the feature map of the target frame and the feature map of the next frame;
for the motion significance map, predicting the deviation of the characteristics of each space coordinate in a neighborhood with a set size in the next frame by adopting a coordinate deviation estimator so as to determine a plurality of sampling coordinates in the neighborhood according to the deviation;
determining sampling characteristics corresponding to the sampling coordinates by a bilinear interpolation method;
and determining a second feature corresponding to a neighborhood with a set size of a set space coordinate in the next frame according to each sampling feature.
3. The method according to claim 2, wherein the determining, for the any one target frame, a motion saliency map according to the feature map of the target frame and the feature map of the next frame comprises:
aiming at any one target frame, determining an inter-frame feature difference according to the feature map of the target frame and the feature map of the next frame;
normalizing the inter-frame feature difference by adopting an activation function to obtain a corresponding attention map; wherein the attention map is used to indicate a spatial position of a moving object having motion between two frames;
and multiplying the attention map and the feature map of the next frame to obtain a motion significance map.
4. The method according to any of claims 1-3, wherein said determining an aggregated feature of said neighborhood in a next frame based on said first feature and said second feature comprises:
performing matrix multiplication operation on the first characteristic and the second characteristic to obtain a similarity matrix; the similarity matrix is used for representing the similarity between the space coordinate features in the neighborhood corresponding to the first feature and the second feature;
and taking the similarity matrix as a weight, and performing matrix multiplication operation on the similarity matrix and the transposed matrix of the second characteristic to obtain the aggregation characteristic of the neighborhood in the next frame.
5. The method according to any one of claims 1-3, further comprising:
and aiming at any one target frame, determining a feature map of the target frame and a feature map of a next frame by adopting a two-dimensional convolutional neural network or a visual feature extraction Transformer network.
6. The method according to claim 5, wherein the determining the feature map of the target frame and the feature map of the next frame by adopting a two-dimensional convolutional neural network or a visual feature extraction Transformer network comprises:
determining the feature maps of the target frame and the next frame by adopting a convolution kernel in a basic residual module of the two-dimensional convolutional neural network; or,
determining a feature map for the target frame and a next frame using an MSA module in the Transformer network with a regular window configuration.
7. An apparatus for extracting motion characteristics of a video frame, comprising:
the acquisition module is used for acquiring a feature map of each frame in the video data;
a first determining module, configured to determine, for any target frame, a first feature corresponding to a set spatial coordinate showing a moving object in the target frame and a second feature corresponding to a neighborhood showing a set size of the moving object in a next frame according to a feature map of the target frame and a feature map of the next frame; the first feature is used for indicating a moving object shown by a corresponding set space coordinate in the target frame; the second feature is used for indicating the moving object shown in the corresponding same set spatial coordinate neighborhood in the next frame;
a second determining module, configured to determine an aggregation feature of the neighborhood in a next frame according to the first feature and the second feature; the aggregation feature is used for indicating the relevance between the moving object shown by the set space coordinate corresponding to the first feature and the moving object shown in the neighborhood corresponding to the second feature;
and the fusion module is used for fusing the first characteristic and the aggregation characteristic to obtain the action characteristic in the target frame.
8. The apparatus of claim 7, wherein the first determining module comprises:
a first determining unit, configured to determine, for the any one target frame, a motion saliency map according to the feature map of the target frame and the feature map of the next frame;
a prediction unit, configured to, for the motion saliency map, predict, using a coordinate offset estimator, an offset of a feature of each spatial coordinate in the neighborhood of the set size in the next frame, so as to determine a plurality of sampling coordinates in the neighborhood according to the offset;
the second determining unit is used for determining sampling characteristics corresponding to the sampling coordinates by adopting a bilinear interpolation method;
and the third determining unit is used for determining a second feature corresponding to a neighborhood of a set size of the set space coordinate in the next frame according to each sampling feature.
9. The apparatus of claim 8, wherein the first determining unit is further configured to:
aiming at any one target frame, determining an inter-frame feature difference according to the feature map of the target frame and the feature map of the next frame;
normalizing the inter-frame feature difference by adopting an activation function to obtain a corresponding attention map; wherein the attention map is used to indicate a spatial position of a moving object having motion between two frames;
and multiplying the attention map and the feature map of the next frame to obtain a motion significance map.
10. The apparatus of any of claims 7-9, wherein the second determining module comprises:
the first processing unit is used for carrying out matrix multiplication operation on the first characteristic and the second characteristic to obtain a similarity matrix; the similarity matrix is used for representing the similarity between the space coordinate features in the neighborhood corresponding to the first feature and the second feature;
and the second processing unit is used for performing matrix multiplication operation on the similarity matrix serving as a weight and the transposed matrix of the second feature to obtain the aggregation feature of the neighborhood in the next frame.
11. The apparatus according to any one of claims 7-9, further comprising:
and the third determining module is used for determining the feature map of the target frame and the feature map of the next frame by adopting a two-dimensional convolutional neural network or a visual feature extraction Transformer network aiming at any one target frame.
12. The apparatus of claim 11, wherein the third determining module is further configured to:
determining the feature maps of the target frame and the next frame by adopting a convolution kernel in a basic residual module of the two-dimensional convolutional neural network; or,
determining a feature map for the target frame and a next frame using an MSA module in the Transformer network with a regular window configuration.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-6.
CN202210550792.4A 2022-05-20 2022-05-20 Method and device for extracting motion characteristics of video frame Pending CN114973410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210550792.4A CN114973410A (en) 2022-05-20 2022-05-20 Method and device for extracting motion characteristics of video frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210550792.4A CN114973410A (en) 2022-05-20 2022-05-20 Method and device for extracting motion characteristics of video frame

Publications (1)

Publication Number Publication Date
CN114973410A true CN114973410A (en) 2022-08-30

Family

ID=82985082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210550792.4A Pending CN114973410A (en) 2022-05-20 2022-05-20 Method and device for extracting motion characteristics of video frame

Country Status (1)

Country Link
CN (1) CN114973410A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863003A (en) * 2023-05-29 2023-10-10 阿里巴巴(中国)有限公司 Video generation method, method and device for training video generation model

Similar Documents

Publication Publication Date Title
US11151725B2 (en) Image salient object segmentation method and apparatus based on reciprocal attention between foreground and background
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
CN111192292B (en) Target tracking method and related equipment based on attention mechanism and twin network
US11274922B2 (en) Method and apparatus for binocular ranging
CN111754541B (en) Target tracking method, device, equipment and readable storage medium
CN109960742B (en) Local information searching method and device
CN108875931B (en) Neural network training and image processing method, device and system
CN112668608B (en) Image recognition method and device, electronic equipment and storage medium
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
CN111914878A (en) Feature point tracking training and tracking method and device, electronic equipment and storage medium
CN111680678A (en) Target area identification method, device, equipment and readable storage medium
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
CN115205150A (en) Image deblurring method, device, equipment, medium and computer program product
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN114973410A (en) Method and device for extracting motion characteristics of video frame
CN111914809A (en) Target object positioning method, image processing method, device and computer equipment
CN110956131A (en) Single-target tracking method, device and system
CN111091099A (en) Scene recognition model construction method, scene recognition method and device
CN114820755B (en) Depth map estimation method and system
Bak et al. Camera motion detection for story and multimedia information convergence
CN113763313A (en) Text image quality detection method, device, medium and electronic equipment
CN113033397A (en) Target tracking method, device, equipment, medium and program product
CN112016571A (en) Feature extraction method and device based on attention mechanism and electronic equipment
CN113807354A (en) Image semantic segmentation method, device, equipment and storage medium
CN111753729A (en) False face detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination