CN110852256B - Method, device and equipment for generating time sequence action nomination and storage medium - Google Patents

Method, device and equipment for generating time sequence action nomination and storage medium

Info

Publication number
CN110852256B
Authority
CN
China
Prior art keywords
action
time sequence
boundary
video
nomination
Prior art date
Legal status
Active
Application number
CN201911087939.5A
Other languages
Chinese (zh)
Other versions
CN110852256A (en)
Inventor
李剑
林楚铭
王亚彪
汪铖杰
李季檩
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911087939.5A priority Critical patent/CN110852256B/en
Publication of CN110852256A publication Critical patent/CN110852256A/en
Application granted granted Critical
Publication of CN110852256B publication Critical patent/CN110852256B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/269 - Analysis of motion using gradient-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence

Abstract

The application discloses a method, a device, equipment and a storage medium for generating a time sequence action nomination. The method includes: acquiring a plurality of video frames in a video; calling a time sequence action nomination generation model to perform prediction processing on the plurality of video frames to obtain a time sequence boundary confidence map and an action integrity probability map corresponding to the video; fusing the time sequence boundary confidence map and the action integrity probability map to obtain a fusion feature map; and outputting the time sequence action nomination of the video according to the fusion feature map. Because the two dense-boundary time sequence boundary confidence maps and the action integrity probability map predict the boundaries of the time sequence action nominations based on global nomination-level information (the L × L dimension), rather than predicting them only from local information as the BMN does, more accurate boundaries can be predicted in the task of generating time sequence action nominations.

Description

Method, device and equipment for generating time sequence action nomination and storage medium
Technical Field
The present application relates to the field of machine learning, and in particular, to a method, an apparatus, a device, and a storage medium for generating a time-series action nomination.
Background
The task of generating time sequence action nominations is: for an untrimmed long video, generate a certain number of time sequence action nominations, where one time sequence action nomination is a time sequence interval (from a start boundary to an end boundary) that may contain an action segment. A high-quality time sequence action nomination should have several characteristics: (1) a flexible timing length; (2) precise timing boundaries; (3) a reliable confidence score. Time sequence action nomination is a key step of tasks such as action detection and video analysis.
Boundary-based methods are used in the related art to accomplish this generation task. Typical boundary-based methods include the Boundary-Sensitive Network (BSN) and the Boundary-Matching Network (BMN) for time sequence action nomination generation. The BSN includes two processing stages: (1) locating timing boundaries and generating action nominations by combining the boundaries; (2) constructing features for the time sequence action nominations and predicting, from these features, the confidence corresponding to each time sequence action nomination. The BMN improves the BSN into an end-to-end method: it mainly improves the second stage of the BSN through a boundary matching layer, and predicts the confidences of all action nominations at once.
However, it is still difficult for the BMN to predict boundaries with high accuracy. In particular, when a video contains complicated motion, a cluttered background, fuzzy boundaries, or actions with a large time span, the accuracy of the boundaries predicted by the BMN is poor.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for generating a time sequence action nomination, which can solve the problem of poor boundary accuracy of BMN prediction in some scenes in the related art. The technical scheme is as follows:
according to an aspect of the present application, there is provided a method for generating a time series action nomination, the method including:
acquiring a plurality of video frames in a video;
calling a time sequence action nomination generation model to carry out prediction processing on the plurality of video frames to obtain a time sequence boundary confidence map and an action integrity probability map corresponding to the video, wherein the time sequence boundary confidence map is used for predicting a starting boundary and an ending boundary of the time sequence action nomination, and the action integrity probability map is used for representing the action integrity probabilities of the starting boundary and the ending boundary of the same time sequence action nomination;
fusing the time sequence boundary confidence graph and the action integrity probability graph to obtain a fusion characteristic graph;
and outputting the time sequence action nomination of the video according to the fusion feature map.
According to an aspect of the present application, there is provided an apparatus for generating a time series action nomination, the apparatus including:
the acquisition module is used for acquiring a plurality of video frames in a video;
the calling module is used for calling a time sequence action nomination generating model to carry out prediction processing on the plurality of video frames to obtain a time sequence boundary confidence map and an action integrity probability map corresponding to the video; the time sequence boundary confidence map is used for predicting a starting boundary and an ending boundary of time sequence action nomination, and the action integrity probability map is used for representing action integrity probabilities of the starting boundary and the ending boundary of the same time sequence action nomination;
the fusion module is used for fusing the time sequence boundary confidence map and the action integrity probability map to obtain a fusion characteristic map;
and the output module is used for outputting the time sequence action nomination of the video according to the fusion feature map.
According to an aspect of the present application, there is provided a computer device including: a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method of generating a time series action nomination as described above.
According to an aspect of the present application, there is provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method for generating a time sequence action nomination as described above.
The embodiment of the application has at least the following beneficial effects:
the method comprises the steps of obtaining two time sequence boundary confidence graphs and action integrity probability graphs corresponding to a video by carrying out prediction processing on a plurality of video frames in the video, fusing the time sequence boundary confidence graphs and the action integrity probability graphs to obtain a fusion feature graph, and outputting time sequence action nominations of the video according to the fusion feature graph. Because the fused feature map based on the dense boundaries predicts the boundaries of the time-series action nominations in the global dimension (L x L) instead of predicting the boundaries of the time-series action nominations only based on local information similarly to the BMN, more accurate boundaries can be predicted in the task of generating the time-series action nominations.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a block diagram of a video analytics system provided in one illustrative embodiment of the present application;
FIG. 2 is a flow chart of a method for generating a time series action nomination according to another exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a generative model of time series action nomination provided by another illustrative embodiment of the present application;
FIG. 4 is a block diagram illustrating a model for generating a time series action nomination according to another exemplary embodiment of the present application;
FIG. 5 is a network architecture diagram of a generative model of a temporal action nomination provided in another illustrative embodiment of the present application;
FIG. 6 is a flow chart of a method for generating a time series action nomination provided by another illustrative embodiment of the present application;
FIG. 7 is a flowchart of a method for generating a time series action nomination according to another exemplary embodiment of the present application;
FIG. 8 is a flowchart of a method for generating a time series action nomination according to another illustrative embodiment of the present application;
FIG. 9 is a flowchart of a method for generating a time series action nomination according to another illustrative embodiment of the present application;
FIG. 10 is a block diagram of a model for generation of a time series action nomination according to another exemplary embodiment of the present application;
FIG. 11 is a block diagram of an apparatus for generating a time series action nomination according to another exemplary embodiment of the present application;
FIG. 12 is a block diagram of a computer device provided in another illustrative embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, the following detailed description of the embodiments of the present application will be made with reference to the accompanying drawings.
First, a number of terms referred to in this application will be introduced:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision technology (CV) is a science that studies how to make machines "see"; it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking and measurement of targets, and further performs image processing so that the processed image is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Key technologies of Speech Technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is a development direction of future human-computer interaction, and speech is expected to become one of the best human-computer interaction modes in the future.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Autonomous driving technology generally includes technologies such as high-precision maps, environment perception, behavior decision-making, path planning and motion control, and has broad application prospects.
The scheme provided by the embodiments of the present application relates to the computer vision technology of artificial intelligence. The present application designs a dense boundary based time sequence action nomination generation model (DBG). The model is end-to-end and runs very fast. The model improves the boundary generation manner of the BMN, so that dense boundary confidences can be predicted for all possible time sequence action nominations.
Fig. 1 shows a block diagram of a video analysis system provided by an exemplary embodiment of the present application. The system comprises: a head end device 120 and a computer device 140. The head-end device 120 and the computer device 140 are connected via a communication network.
The front-end device 120 may be a surveillance camera, a smart home device, a chat robot, a desktop computer, a smart phone, or the like. The front-end device 120 can capture video, generate video, store video, or download video. The front-end device 120 provides the video to be processed to the computer device 140.
The computer device 140 is provided with a dense boundary based time sequence action nomination generation model 142. The time sequence action nomination generation model 142 is used for predicting the two timing boundary confidence maps and the action integrity probability map corresponding to the video, and the time sequence action nominations of the video are output according to the two timing boundary confidence maps and the action integrity probability map. The time sequence action nominations can be used for subsequent action detection, video analysis, security alarms and the like.
The video analysis system can be used for analysis tasks such as analyzing a teacher's classroom actions in an education scene, analyzing students' classroom performance in an education scene, extracting highlight clips from long videos, and detecting key actions in short videos.
Fig. 2 is a flowchart illustrating a method for generating a time-series action nomination according to an exemplary embodiment of the present application. The present embodiment is illustrated with the method applied to the computer device shown in fig. 1. The method comprises the following steps:
step 201, acquiring a plurality of video frames in a video;
the video to be processed comprises a plurality of video frames which are arranged in sequence. Each video frame corresponds to temporal information. Optionally, the time information is a frame number or a timestamp.
Step 202, calling a time sequence action nomination generating model to carry out prediction processing on a plurality of video frames to obtain a time sequence boundary confidence map and an action integrity probability map corresponding to a video;
optionally, the time-series action nomination generation model is a dense boundary based time-series action nomination generation model.
The timing boundary confidence map is used for predicting a start boundary and an end boundary of a time sequence action nomination. Optionally, the timing boundary confidence maps include a start boundary confidence map and an end boundary confidence map. The start boundary confidence map is a confidence map describing the start boundary, and the end boundary confidence map is a confidence map describing the end boundary. Optionally, each timing boundary confidence map is of size L × L. Each timing boundary confidence map is a two-dimensional map described in terms of a starting dimension (starting dim) and an ending dimension (ending dim).
The action integrity probability map is a two-dimensional map used for representing the action integrity probability of the start boundary and the end boundary of the same time sequence action nomination. Optionally, the action integrity probability map is also L × L. In one example, L = 100.
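As a concrete illustration of how a candidate nomination is indexed in these L × L maps, consider the following minimal NumPy sketch; the array names and the random example values are purely illustrative and are not part of the patent.

```python
import numpy as np

L = 100  # number of temporal positions, as in the example above

# Hypothetical L x L maps produced by the model: rows index the ending
# dimension (i), columns index the starting dimension (j).
start_map = np.random.rand(L, L)      # start boundary confidence map
end_map = np.random.rand(L, L)        # end boundary confidence map
integrity_map = np.random.rand(L, L)  # action integrity probability map

# A candidate time sequence action nomination starting at position j = 20
# and ending at position i = 55 is read off each map at row i, column j.
i, j = 55, 20
print(start_map[i, j], end_map[i, j], integrity_map[i, j])
```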
And step 203, fusing the time sequence boundary confidence map and the action integrity probability map to obtain a fused characteristic map.
The fusion feature map is used for predicting L × L candidate time sequence action nominations, and redundant time sequence action nominations exist among these L × L candidates. Thus, the fusion feature map can predict time sequence action nominations with dense boundaries.
And step 204, outputting the time sequence action nomination of the video according to the fusion feature map.
And after removing the redundant time sequence action nomination in the fusion characteristic diagram, the computer equipment outputs the time sequence action nomination of the video. The video may be named one or more time-series actions. Each time sequence action nomination comprises the following steps: a start boundary, an end boundary, and a confidence level.
In summary, in the method provided by this embodiment, the two timing boundary confidence maps and the action integrity probability map corresponding to the video are obtained by performing prediction processing on a plurality of video frames in the video, the timing boundary confidence maps and the action integrity probability map are fused to obtain a fusion feature map, and the time sequence action nominations of the video are output according to the fusion feature map. Because the fusion feature map based on dense boundaries predicts the boundaries of the time sequence action nominations in the global (L × L) dimension, instead of predicting them only from local information as the BMN does, more accurate boundaries can be predicted in the task of generating time sequence action nominations.
Referring to fig. 3, the dense boundary based time sequence action nomination generation model (DBG) includes: a dual-stream base network (DSB), an action integrity regression (ACR) module, and a timing boundary classification (TBC) module. Step 202 above may alternatively be implemented as the following steps 202a to 202c, as shown in fig. 4:
step 202a, calling a double-flow base network to process video characteristics of a plurality of video frames to obtain action probability characteristics and double-flow characteristics;
dual stream-based networks are used to explore the local rich behavior in video sequences. The dual-flow-based network outputs two features: a Dual Stream Feature (DSF) at a lower level and an Action probability Feature (ASF) at a higher level.
In one example, the video characteristics of the video frame include: RGB (red green blue ) features and optical flow features. The dual-stream features are generated by fusing the RGB features and the optical flow features. The motion probability feature is generated by extracting motion features in the RGB feature and the optical flow feature.
Optionally, the motion probability feature is learned under an additional motion classification loss function.
Step 202b, calling an action integrity regression module to perform first prediction processing on the action probability characteristics to obtain an action integrity probability graph corresponding to the video;
and the action integrity regression module ACR is used for carrying out global prediction processing on action dimensions on the action probability characteristics to obtain an action integrity probability chart for nominating all candidate time sequence actions. The action integrity probability map is used for representing the action integrity of the starting boundary and the ending boundary of each candidate time sequence action nomination.
Step 202c, calling a time sequence boundary classification module to perform second prediction processing on the double-flow characteristics to obtain a time sequence boundary confidence map corresponding to the video.
The timing boundary classification module (TBC) is used for performing boundary prediction processing in the spatio-temporal dimensions on the dual-stream feature to obtain the timing boundary confidence maps corresponding to the video. The timing boundary confidence maps include a start boundary confidence map and an end boundary confidence map.
The start boundary confidence map, the end boundary confidence map and the action integrity probability map are fused into a fusion feature map, and the fusion feature map gives the overall prediction of the time sequence action nominations.
In summary, the method provided by this embodiment uses the dual-stream base network as the backbone of the dense boundary based time sequence action nomination generation model, which can capture sufficient features for identifying boundaries and actions and thereby explore locally rich behaviors in the video sequence.
FIG. 5 illustrates an architecture diagram of the dense boundary based time sequence action nomination generation model provided by an exemplary embodiment of the present application. The generation model includes: a video encoding part 520, a dense boundary timing action generator 540, and a post-processing part 560.
The video encoding part 520 includes a spatial network and a temporal network. The spatial network is used for encoding the video frames to obtain the RGB features of the video frames. The temporal network is used for encoding the video frames to obtain the optical flow features of the video frames.
The dense boundary timing action generator 540 includes: a dual-stream base network 542, an action integrity regression module 544, and a timing boundary classification module 546.
The dual-stream base network 542 includes: a first convolutional layer network 51, a second convolutional layer network 52, an element-wise sum layer, three prediction convolutional layers 53 to 55, and an averaging layer.
Illustratively, the first convolutional layer network 51 includes 2 stacked one-dimensional convolutional layers, and the second convolutional layer network 52 includes 2 stacked one-dimensional convolutional layers. The first convolutional layer network 51 is used for performing convolutional feature extraction on the RGB features of the video frames to obtain a spatial feature sf; the second convolutional layer network 52 is used for performing convolutional feature extraction on the optical flow features of the video frames to obtain a temporal feature tf. The element-wise sum layer is used for performing an element-wise sum of the spatial feature sf and the temporal feature tf to construct a dual-stream feature dsf.
Illustratively, the prediction convolutional layer 53 is used for predicting the spatial feature sf to obtain a first action probability; the prediction convolutional layer 54 is used for predicting the temporal feature tf to obtain a second action probability; and the prediction convolutional layer 55 is used for predicting the dual-stream feature dsf to obtain a third action probability. The averaging layer is used for averaging the first action probability, the second action probability and the third action probability to obtain the higher-level action probability feature asf.
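A minimal PyTorch sketch of a dual-stream base network with this layout is given below; the feature dimensions, kernel sizes and activation functions are illustrative assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn

class DualStreamBase(nn.Module):
    """Sketch of the dual-stream base network: two stacked 1D-conv branches, an
    element-wise sum, three prediction conv layers and an averaging step."""

    def __init__(self, in_dim=400, hid_dim=256):  # in_dim/hid_dim are assumed values
        super().__init__()

        def branch():  # two stacked one-dimensional convolutional layers
            return nn.Sequential(
                nn.Conv1d(in_dim, hid_dim, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(hid_dim, hid_dim, kernel_size=3, padding=1), nn.ReLU())

        self.spatial_branch = branch()    # first convolutional layer network (RGB features)
        self.temporal_branch = branch()   # second convolutional layer network (optical flow)

        def head():  # one prediction convolutional layer producing an action probability
            return nn.Sequential(nn.Conv1d(hid_dim, 1, kernel_size=1), nn.Sigmoid())

        self.pred_s, self.pred_t, self.pred_d = head(), head(), head()

    def forward(self, rgb_feat, flow_feat):
        # rgb_feat, flow_feat: (batch, in_dim, L) frame-level feature sequences
        sf = self.spatial_branch(rgb_feat)     # spatial feature sf
        tf = self.temporal_branch(flow_feat)   # temporal feature tf
        dsf = sf + tf                          # element-wise sum -> dual-stream feature dsf
        # three prediction convolutional layers, followed by the averaging layer
        asf = (self.pred_s(sf) + self.pred_t(tf) + self.pred_d(dsf)) / 3.0
        return asf, dsf                        # action probability feature, dual-stream feature
```

For example, calling `DualStreamBase()(torch.randn(1, 400, 100), torch.randn(1, 400, 100))` returns an action probability feature of shape (1, 1, 100) and a dual-stream feature of shape (1, 256, 100).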
The action integrity regression module 544 includes: a first action nomination feature generation layer (PEG layer) and an action convolutional network. The action convolutional network includes n first two-dimensional convolution kernels, where n is a positive integer. In one example, n is 3.
The first PEG layer is used for converting the higher-level action probability feature asf into an action probability feature in matrix form. The action convolutional network is used for performing convolutional feature extraction on the matrix-form action probability feature to obtain the action integrity probability map.
The timing boundary classification module 546 includes: a second PEG layer and a timing convolutional network. The timing convolutional network includes: a three-dimensional convolution kernel and m second two-dimensional convolution kernels, where m is a positive integer. In one example, m is 2.
The second PEG layer is used for converting the lower-level dual-stream feature dsf into a dual-stream feature in matrix form. The timing convolutional network is used for performing convolutional feature extraction on the matrix-form dual-stream feature to obtain the start boundary confidence map and the end boundary confidence map.
The post-processing part includes: a fusion layer and a soft non-maximum suppression (Soft-NMS) layer. The fusion layer is used for fusing the two timing boundary confidence maps and the action integrity probability map to obtain the fusion feature map. The Soft-NMS layer is used for performing Soft-NMS processing on the fusion feature map to remove redundant time sequence action nominations and output the sparse time sequence action nominations of the video.
Table 1 schematically shows the network architecture design of the dense boundary timing action generator 540 described above.
Table 1
[The layer-by-layer network architecture of the dense boundary timing action generator is given as a table image in the original patent.]
Wherein 1D denotes one dimension, 2D denotes two dimensions, and 3D denotes three dimensions.
Fig. 6 is a flowchart illustrating a method for generating a dense boundary-based time-series action nomination according to another exemplary embodiment of the present application. This embodiment is illustrated by applying the method to the generative model shown in fig. 5. The method comprises the following steps:
step 601, calling a video coding part to code a plurality of video frames to obtain RGB (red, green and blue) characteristics and optical flow characteristics of each video frame;
the video encoding unit includes: spatial networks and temporal networks.
For each video frame in the plurality of video frames, the computer device invokes a spatial network to encode the video frame to obtain RGB features of the video frame. And the computer equipment calls a time network to encode the video frame to obtain the optical flow characteristics of the video frame.
Step 602, calling a double-flow base network to process video characteristics of a plurality of video frames to obtain action probability characteristics and double-flow characteristics;
the dual-flow-based network includes: a first convolutional layer network, a second convolutional layer network, an element sum layer, three predicted convolutional layers, and an Averaging layer. This step optionally includes the following substeps, as shown in FIG. 7:
s6021, RGB characteristics and optical flow characteristics of each video frame in a plurality of video frames are obtained. S6022, calling the first convolution layer network to perform convolution processing on the RGB characteristics of the video frame to obtain spatial characteristics sf; and calling the second convolution layer to perform convolution processing on the optical flow characteristics of the video frame to obtain the time characteristics tf. And S6023, calling the elements and the layers to perform element and operation on the spatial feature sf and the temporal feature tf to obtain a double-flow feature dsf. S6024, calling the three predicted convolutional layers to predict the spatial feature sf, the temporal feature tf, and the dual-flow feature dsf, respectively, to obtain a first action probability corresponding to the spatial feature sf, a second action probability corresponding to the temporal feature tf, and a third action probability corresponding to the dual-flow feature dsf. And S6025, calling the average layer to average the first action probability, the second action probability and the third action probability to obtain action probability characteristics.
Step 603, calling an action integrity regression module to perform first prediction processing on the action probability characteristics to obtain an action integrity probability map corresponding to the video;
the action integrity regression module comprises: a first PEG layer and an action convolutional network. This step optionally includes the following substeps, as shown in FIG. 8:
s6031, the first PEG layer is called to convert the action probability characteristic into a first characteristic diagram in a matrix form. And S6032, calling the motion convolution network to carry out convolution processing on the first characteristic diagram in the matrix form to obtain a motion integrity probability diagram corresponding to the video.
Illustratively, the action convolutional network includes: a three-dimensional convolution kernel and n first two-dimensional convolution kernels, where n is a positive integer. For example, n is 2.
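A minimal PyTorch sketch of this step follows. The patent text specifies only that a PEG layer produces a matrix-form feature map that is then processed by convolution layers; the sampling rule used here (taking the feature at the start position, the midpoint and the end position of each candidate nomination), the use of purely two-dimensional convolutions, the final sigmoid, and all layer sizes are illustrative assumptions rather than the patent's actual design.

```python
import torch
import torch.nn as nn

class ActionIntegrityRegression(nn.Module):
    """Sketch of the ACR module: a simplified PEG layer followed by 2D convolutions
    that output an L x L action integrity probability map."""

    def __init__(self, in_dim=1, hid_dim=64, num_samples=3):
        super().__init__()
        self.convs = nn.Sequential(            # action convolutional network (assumed sizes)
            nn.Conv2d(in_dim * num_samples, hid_dim, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hid_dim, hid_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hid_dim, 1, kernel_size=1), nn.Sigmoid())

    @staticmethod
    def peg_layer(feat):
        # feat: (batch, C, L). Build a (batch, 3*C, L, L) matrix-form feature map by
        # sampling, for every candidate (ending index i, starting index j), the feature
        # at the start, the midpoint and the end. This sampling rule is an assumption.
        b, c, L = feat.shape
        i_idx = torch.arange(L).view(L, 1).expand(L, L)   # ending dimension (rows)
        j_idx = torch.arange(L).view(1, L).expand(L, L)   # starting dimension (columns)
        m_idx = (i_idx + j_idx) // 2                      # midpoint position

        def gather(idx):
            return feat[:, :, idx.reshape(-1)].view(b, c, L, L)

        return torch.cat([gather(j_idx), gather(m_idx), gather(i_idx)], dim=1)

    def forward(self, asf):
        # asf: action probability feature of shape (batch, C, L)
        matrix_feat = self.peg_layer(asf)          # first feature map in matrix form
        return self.convs(matrix_feat).squeeze(1)  # action integrity probability map (batch, L, L)
```

With the `DualStreamBase` sketch above, `ActionIntegrityRegression()(asf)` yields a map of shape (batch, L, L).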
Step 604, calling a time sequence boundary classification module to perform second prediction processing on the double-flow characteristics to obtain a time sequence boundary confidence map corresponding to the video;
optionally, the timing boundary confidence map comprises: a start boundary confidence map and an end boundary confidence map.
The timing boundary classification module 546 includes: a second PEG layer and a time-sequential convolutional network. This step optionally includes the following substeps, as shown in FIG. 9:
and S6041, calling a second PEG layer to convert the double-current characteristic into a second characteristic diagram in a matrix form. And S6042, calling a time sequence convolution network to carry out convolution processing on the second characteristic graph in the matrix form to obtain a time sequence boundary confidence graph corresponding to the video.
Illustratively, the time-sequential convolutional network comprises: a three-dimensional convolution kernel and m second two-dimensional convolution kernels, m being a positive integer. For example, m is 2.
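A companion sketch for the timing boundary classification head, under similar assumptions: here the matrix-form dual-stream feature is assumed to keep its sampled points as a separate dimension of size num_samples (unlike the channel concatenation used in the ACR sketch), so that the single three-dimensional convolution kernel described above can collapse it; all layer sizes are again assumptions.

```python
import torch
import torch.nn as nn

class TimingBoundaryClassification(nn.Module):
    """Sketch of the TBC module: one 3D convolution over the sampled dimension of the
    matrix-form dual-stream feature, then 2D convolutions producing a 2-channel
    L x L output (start and end boundary confidence maps)."""

    def __init__(self, in_dim=256, hid_dim=128, num_samples=3):
        super().__init__()
        # One three-dimensional convolution kernel collapsing the sampled dimension...
        self.conv3d = nn.Sequential(
            nn.Conv3d(in_dim, hid_dim, kernel_size=(num_samples, 1, 1)), nn.ReLU())
        # ...followed by m = 2 two-dimensional convolution kernels.
        self.conv2d = nn.Sequential(
            nn.Conv2d(hid_dim, hid_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hid_dim, 2, kernel_size=1), nn.Sigmoid())

    def forward(self, matrix_dsf):
        # matrix_dsf: (batch, in_dim, num_samples, L, L), the second feature map in
        # matrix form produced by the second PEG layer (sampling rule assumed).
        x = self.conv3d(matrix_dsf).squeeze(2)   # (batch, hid_dim, L, L)
        x = self.conv2d(x)                       # (batch, 2, L, L)
        start_map, end_map = x[:, 0], x[:, 1]    # start / end boundary confidence maps
        return start_map, end_map
```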
Step 605, fusing the time sequence boundary confidence map and the action integrity probability map to obtain a fusion characteristic map;
and the computer equipment multiplies the start boundary confidence map, the end boundary confidence map and the action integrity probability map, and fuses to obtain a fusion characteristic map.
And multiplying and fusing the generated three confidence maps to obtain the final P.
Figure BDA0002266000430000111
Wherein i is an integer not greater than L, and j is an integer not greater than L. i represents the position (coordinate value) in the ending dimension and j represents the position (coordinate value) in the starting dimension. The fused feature map is a map of L x L, P c Is a probability map of motion integrity, P s To start the boundary confidence map, P e To end boundary confidence map.
Optionally, the present application performs a smoothing process on the two timing boundary confidence maps before the fusion, averaging each map over its other dimension:

$$\tilde{P}^{s}_{i,j} = \frac{1}{L}\sum_{k=1}^{L} P^{s}_{k,j}, \qquad \tilde{P}^{e}_{i,j} = \frac{1}{L}\sum_{k=1}^{L} P^{e}_{i,k}$$

where k is an integer not greater than L.
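A small NumPy sketch of the fusion and the optional smoothing described above; the interpretation of the smoothing as an average over the opposite dimension follows the reconstruction above and is an assumption, as are the random example maps.

```python
import numpy as np

def fuse_score_maps(start_map, end_map, integrity_map, smooth=True):
    """Fuse the start/end boundary confidence maps and the action integrity
    probability map (all L x L) by element-wise multiplication, optionally
    smoothing the boundary maps first (assumed averaging rule)."""
    if smooth:
        # Average the start map over the ending dimension (rows) and the end map
        # over the starting dimension (columns), broadcast back to L x L.
        start_map = np.broadcast_to(start_map.mean(axis=0, keepdims=True), start_map.shape)
        end_map = np.broadcast_to(end_map.mean(axis=1, keepdims=True), end_map.shape)
    return integrity_map * start_map * end_map   # P = Pc * Ps * Pe, element-wise

# Example with random maps, L = 100:
L = 100
fused = fuse_score_maps(np.random.rand(L, L), np.random.rand(L, L), np.random.rand(L, L))
print(fused.shape)  # (100, 100)
```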
Step 606, obtaining the L × L candidate time sequence action nominations in the fusion feature map;
Since the above generation model produces L × L candidate time sequence action nominations, the present application needs to perform NMS, that is, to remove redundant time sequence action nominations through a non-maximum suppression operation, so as to obtain the final sparse time sequence action nominations.
Step 607, removing the redundant time sequence action nominations from the L × L candidate time sequence action nominations, and outputting the time sequence action nominations of the video.
Each output time sequence action nomination of the video has a boundary and a confidence.
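A minimal sketch of a Gaussian Soft-NMS post-processing step of the kind described here; the Gaussian decay form and the values of sigma and top_k are common choices assumed for illustration and are not parameters stated in the patent.

```python
import numpy as np

def temporal_iou(cand, others):
    """Temporal IoU between one candidate (start, end) and an array of (start, end)."""
    inter = np.maximum(0.0, np.minimum(cand[1], others[:, 1]) - np.maximum(cand[0], others[:, 0]))
    union = (cand[1] - cand[0]) + (others[:, 1] - others[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def soft_nms(proposals, scores, sigma=0.4, top_k=100):
    """Gaussian Soft-NMS over candidate time sequence action nominations.
    proposals: (N, 2) array of (start, end); scores: (N,) confidences."""
    proposals, scores = proposals.copy(), scores.copy()
    kept, kept_scores = [], []
    while len(kept) < top_k and scores.size > 0:
        best = int(np.argmax(scores))
        kept.append(proposals[best])
        kept_scores.append(scores[best])
        proposals = np.delete(proposals, best, axis=0)
        scores = np.delete(scores, best)
        if scores.size:
            iou = temporal_iou(kept[-1], proposals)
            scores = scores * np.exp(-(iou ** 2) / sigma)   # decay overlapping nominations
    return np.array(kept), np.array(kept_scores)

# Example: turn a fused L x L map into candidate nominations and filter them.
L = 100
fused = np.random.rand(L, L)                    # stand-in for the fusion feature map
ends, starts = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
valid = starts < ends                           # keep nominations with start before end
cands = np.stack([starts[valid], ends[valid]], axis=1).astype(float)
boxes, confs = soft_nms(cands, fused[valid])    # sparse nominations with confidences
```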
Optionally, a post-processing procedure is required at the time of boundary prediction. Since the action classification loss $L_{DSB}$ is an additional loss function, the action probability predicted by the dual-stream base network DSB does not participate in the calculation of the final action nominations.
In summary, the method provided by this embodiment explores rich global semantic information by using the PEG layer and several convolutional layers. The ACR finally outputs an L × L action integrity confidence map $P^{c}$, and the TBC finally outputs L × L × 2 boundary confidence maps $P^{s}$ and $P^{e}$. Illustratively, L = 100. The PEG layer samples global action nomination features, so the boundaries and integrity of time sequence actions can be extracted more accurately from global information.
The method provided by the embodiment can obtain a more accurate boundary classification result by smoothing the starting boundary confidence map and the ending boundary confidence map before fusion.
Fig. 10 shows a comparison of BMN in the related art and DBG of the present application at the time of boundary prediction.
The BMN in the related art predicts a boundary probability at each time point (the start boundary confidence sequence in the figure) using only local information. This local approach lacks global information about the action, so it may have difficulty handling actions with fuzzy boundaries or a large time span.
The DBG in the present application performs boundary classification using global nomination-level information. The global nomination-level information is extracted by the PEG layer and the timing boundary classification module, so dense-boundary time sequence action nominations can be extracted based on the global nomination-level information, yielding more accurate boundaries.
In the training process of the generation model, three loss functions are used, respectively called the action classification loss $L_{DSB}$, the boundary classification loss $L_{TBC}$ and the integrity regression loss $L_{ACR}$.
The boundary classification loss $L_{TBC}$ is composed of two loss functions, one from the start boundary and one from the end boundary:

$$L_{TBC} = L_{s} + L_{e}$$

For the action classification loss $L_{DSB}$, the present application uses a binary regression loss. For the integrity regression loss $L_{ACR}$, a smooth L1 loss is used.
Here, $g^{a}$ is the loss label for action classification, $g^{s}$ is the boundary classification loss label for the start boundary, $g^{e}$ is the boundary classification loss label for the end boundary, and $g^{c}$ is the loss label for integrity regression; $p$ denotes a predicted probability, and the superscripts $a$, $c$, $s$ and $e$ denote action, integrity, start boundary and end boundary, respectively. The final training loss function is a weighting of the above loss functions:
$$L = L_{DSB} + L_{TBC} + \lambda \cdot L_{ACR}$$

where $\lambda$ is the corresponding weight, set to 2.
In the following, reference is made to the embodiments of the apparatus of the present application, and for details not described in detail in the embodiments of the apparatus, reference is made to the embodiments of the method described above.
Fig. 11 is a block diagram of a training apparatus for a generative model of time-series action nomination provided in an exemplary embodiment of the present application, the apparatus including:
an obtaining module 1120, configured to obtain a plurality of video frames in a video;
a calling module 1140, configured to call a time sequence action nomination generation model to perform prediction processing on the multiple video frames, so as to obtain a time sequence boundary confidence map and an action integrity probability map corresponding to the video; the time sequence boundary confidence map is used for predicting a starting boundary and an ending boundary of time sequence action nomination, and the action integrity probability map is used for representing action integrity probabilities of the starting boundary and the ending boundary of the same time sequence action nomination;
the fusion module 1160 is used for fusing the time sequence boundary confidence map and the action integrity probability map to obtain a fusion characteristic map;
and the output module 1180 is configured to output the time sequence action nomination of the video according to the fusion feature map.
In an optional embodiment, the time-series action nomination generating model comprises: a dense boundary timing action generator, the dense boundary timing action generator comprising: the system comprises a double-flow base network, an action integrity regression module and a time sequence boundary classification module;
the invoking module 1140 is configured to invoke the dual-stream-based network to process the video features of the multiple video frames, so as to obtain an action probability feature and a dual-stream feature; calling the action integrity regression module to perform first prediction processing on the action probability characteristics to obtain the action integrity probability graph corresponding to the video; and calling the time sequence boundary classification module to perform second prediction processing on the double-current characteristics to obtain the time sequence boundary confidence map corresponding to the video.
In an alternative embodiment, the dual-flow-based network includes: a first convolutional layer network, a second convolutional layer network, an additive layer, three prediction convolutional layers and an average layer;
the invoking module 1140, configured to obtain RGB features and optical flow features of each of the plurality of video frames; calling the first convolution layer network to carry out convolution processing on the RGB characteristics of the video frame to obtain spatial characteristics sf; calling the second convolution layer to carry out convolution processing on the optical flow characteristics of the video frame to obtain time characteristics tf; calling the addition layer to perform element and operation on the spatial feature sf and the temporal feature tf to obtain a double-current feature dsf; calling the three predicted convolutional layers to predict the spatial feature sf, the temporal feature tf and the double-flow feature dsf respectively to obtain a first action probability corresponding to the spatial feature sf, a second action probability corresponding to the temporal feature tf and a third action probability corresponding to the double-flow feature dsf; and calling the averaging layer to average the first action probability, the second action probability and the third action probability to obtain the action probability characteristic.
In an optional embodiment, the action integrity regression module comprises: a first action nomination feature generation layer and an action convolution network;
the invoking module 1140 is configured to invoke the first action nomination feature generation layer to convert the action probability feature into a first feature map in a matrix form; and calling the action convolution network to carry out convolution processing on the first characteristic graph in the matrix form to obtain an action integrity probability graph corresponding to the video.
In an alternative embodiment, the action convolution network includes: n first two-dimensional convolution kernels stacked in sequence, wherein n is a positive integer.
In an optional embodiment, the timing boundary classification module comprises: a second action nomination feature generation layer and a time series convolution network;
the calling module 1140 is configured to call the second action nomination feature generation layer to convert the dual-stream feature into a second feature map in a matrix form; and calling the time sequence convolution network to carry out convolution processing on the second characteristic diagram in the matrix form to obtain the time sequence boundary confidence diagram corresponding to the video.
In an alternative embodiment, the time-sequential convolutional network comprises:
and sequentially stacking a three-dimensional convolution kernel and m second two-dimensional convolution kernels, wherein m is a positive integer.
In an alternative embodiment, the output module 1180 is configured to obtain L × L candidate time sequence action nominations in the fusion feature map, remove redundant time sequence action nominations from the L × L candidate time sequence action nominations, and output the time sequence action nominations of the video, where the time sequence action nominations have boundaries and confidences.
In an optional embodiment, the time-series action nomination generating model further comprises: a video encoding unit; the invoking module 1140 is further configured to invoke the video encoding part to encode the plurality of video frames to obtain RGB features and optical flow features of each video frame.
The application further provides a computer device, which includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the training method for the generation model of the time sequence action nomination or the generation method of the time sequence action nomination provided by the above method embodiments. It should be noted that the computer device may be a computer device as provided in fig. 12 below.
Referring to fig. 12, a schematic structural diagram of a computer device according to an exemplary embodiment of the present application is shown. Specifically, the method comprises the following steps: the computer apparatus 1200 includes a Central Processing Unit (CPU) 1201, a system memory 1204 including a Random Access Memory (RAM) 1202 and a Read Only Memory (ROM) 1203, and a system bus 1205 connecting the system memory 1204 and the central processing unit 1201. The computer device 1200 also includes a basic input/output system (I/O system) 1206 for facilitating information transfer between various devices within the computer, and a mass storage device 1207 for storing an operating system 1213, application programs 1214, and other program modules 1210.
The basic input/output system 1206 includes a display 1208 for displaying information and an input device 1209, such as a mouse, keyboard, etc., for user input of information. Wherein a display 1208 and an input device 1209 are connected to the central processing unit 1201 through an input-output controller 1210 connected to the system bus 1205. The basic input/output system 1206 may also include an input/output controller 1210 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 1210 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1207 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1207 and its associated computer-readable media provide non-volatile storage for the computer device 1200. That is, the mass storage device 1207 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1204 and mass storage device 1207 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1201, the one or more programs including instructions for implementing a training method for a generation model of a time series action nomination or a generation method for a time series action nomination as described above, and the central processing unit 1201 executes the one or more programs to implement the training method for the generation model of a time series action nomination or the generation method for the time series action nomination as provided by the various method embodiments described above.
According to various embodiments of the present application, the computer device 1200 may also run by being connected through a network, such as the Internet, to a remote computer on the network. That is, the computer device 1200 may connect to the network 1212 through a network interface unit 1211 connected to the system bus 1205, or may connect to other types of networks and remote computer systems (not shown) using the network interface unit 1211.
The memory further includes one or more programs, the one or more programs are stored in the memory, and the one or more programs include a generation method for performing the time series action nomination provided by the embodiment of the application.
The embodiment of the present application further provides a computer device, where the computer device includes a memory and a processor, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded by the processor and implements the method for generating the time sequence action nomination.
The embodiment of the present application further provides a computer-readable storage medium, where at least one instruction, at least one program, a code set, or an instruction set is stored in the computer-readable storage medium, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the training method of the generation model for time-series action nomination or the generation method for time-series action nomination described above.
The present application further provides a computer program product, which when running on a computer, causes the computer to execute the training method of the generation model of the time series action nomination or the generation method of the time series action nomination provided by the above-mentioned method embodiments.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The above-mentioned serial numbers of the embodiments of the present application are merely for description, and do not represent the advantages and disadvantages of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. A method for generating a time series action nomination performed by a computer device running a time series action nomination generation model, wherein the time series action nomination generation model comprises a dense boundary time series action generator, the dense boundary time series action generator comprises a dual-flow-base network, an action integrity regression module and a time series boundary classification module, and the method comprises the following steps:
acquiring a plurality of video frames in a video;
calling the double-flow base network to process the video characteristics of the video frames to obtain action probability characteristics and double-flow characteristics; calling the action integrity regression module to perform first prediction processing on the action probability characteristics to obtain an action integrity probability graph corresponding to the video, wherein the action integrity probability graph is used for representing action integrity probabilities of a starting boundary and an ending boundary of the same time sequence action nomination; calling the time sequence boundary classification module to perform second prediction processing on the double-current features to obtain a time sequence boundary confidence map corresponding to the video, wherein the time sequence boundary confidence map is used for predicting a start boundary and an end boundary of time sequence action nomination;
fusing the time sequence boundary confidence graph and the action integrity probability graph to obtain a fusion characteristic graph;
and outputting the time sequence action nomination of the video according to the fusion feature graph.
2. The method of claim 1, wherein the dual-stream base network comprises: a first convolutional layer network, a second convolutional layer network, an addition layer, three prediction convolutional layers and an averaging layer;
and calling the dual-stream base network to process the video features of the plurality of video frames to obtain the action probability features and the dual-stream features comprises:
acquiring red, green and blue (RGB) features and optical flow features of each video frame in the plurality of video frames;
calling the first convolutional layer network to perform convolution processing on the RGB features of the video frame to obtain spatial features; calling the second convolutional layer network to perform convolution processing on the optical flow features of the video frame to obtain temporal features;
calling the addition layer to perform an element-wise sum operation on the spatial features and the temporal features to obtain the dual-stream features;
calling the three prediction convolutional layers to perform prediction on the spatial features, the temporal features and the dual-stream features respectively, to obtain a first action probability corresponding to the spatial features, a second action probability corresponding to the temporal features and a third action probability corresponding to the dual-stream features; and
calling the averaging layer to average the first action probability, the second action probability and the third action probability to obtain the action probability features.
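As a rough illustration of the base-network layout in claim 2 (one convolutional branch per stream, an element-wise addition layer, three prediction convolution layers and an averaging layer), consider the sketch below; the feature dimensions, kernel sizes and activation functions are assumptions.

import torch
import torch.nn as nn

class DualStreamBaseSketch(nn.Module):
    """Illustrative stand-in for the dual-stream base network of claim 2."""
    def __init__(self, rgb_dim=2048, flow_dim=1024, hidden=256):
        super().__init__()
        self.spatial_conv = nn.Conv1d(rgb_dim, hidden, 3, padding=1)    # first convolutional layer network
        self.temporal_conv = nn.Conv1d(flow_dim, hidden, 3, padding=1)  # second convolutional layer network
        # Three prediction convolution layers (spatial / temporal / dual-stream).
        self.pred_s = nn.Conv1d(hidden, 1, 1)
        self.pred_t = nn.Conv1d(hidden, 1, 1)
        self.pred_d = nn.Conv1d(hidden, 1, 1)

    def forward(self, rgb_feats, flow_feats):     # each (B, C, T)
        spatial = torch.relu(self.spatial_conv(rgb_feats))
        temporal = torch.relu(self.temporal_conv(flow_feats))
        dual = spatial + temporal                  # addition layer: element-wise sum
        p1 = torch.sigmoid(self.pred_s(spatial))   # first action probability
        p2 = torch.sigmoid(self.pred_t(temporal))  # second action probability
        p3 = torch.sigmoid(self.pred_d(dual))      # third action probability
        action_prob = (p1 + p2 + p3) / 3.0         # averaging layer
        return action_prob, dual                   # action probability features, dual-stream features

probs, dual = DualStreamBaseSketch()(torch.randn(2, 2048, 100), torch.randn(2, 1024, 100))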
3. The method of claim 1, wherein the action integrity regression module comprises: a first action nomination feature generation layer and an action convolutional network;
and calling the action integrity regression module to perform the first prediction processing on the action probability features to obtain the action integrity probability map corresponding to the video comprises:
calling the first action nomination feature generation layer to convert the action probability features into a first feature map in matrix form; and
calling the action convolutional network to perform convolution processing on the first feature map in matrix form to obtain the action integrity probability map corresponding to the video.
4. The method of claim 3, wherein the action convolutional network comprises: n first two-dimensional convolution kernels stacked in sequence, where n is a positive integer.
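Claims 3 and 4 describe a proposal-feature layer that reshapes the action probability features into a matrix-form map, followed by n stacked two-dimensional convolutions. A minimal sketch, with an assumed outer-product pairing of start and end positions and assumed layer widths:

import torch
import torch.nn as nn

class ActionCompletenessSketch(nn.Module):
    """Illustrative action integrity regression module (claims 3-4); sizes assumed."""
    def __init__(self, L=100, n=3, channels=128):
        super().__init__()
        self.L = L
        convs, in_ch = [], 1
        for _ in range(n):                        # n stacked first 2D convolution kernels
            convs += [nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU()]
            in_ch = channels
        convs.append(nn.Conv2d(channels, 1, 1))   # reduce to a single probability channel
        self.action_conv = nn.Sequential(*convs)

    def forward(self, action_prob):               # (B, 1, T) per-time action probabilities
        p = action_prob[..., :self.L]             # (B, 1, L)
        # Matrix-form proposal features: entry (i, j) pairs start i with end j.
        feat_map = p.unsqueeze(-1) * p.unsqueeze(-2)         # (B, 1, L, L)
        return torch.sigmoid(self.action_conv(feat_map))      # action integrity probability map

completeness = ActionCompletenessSketch()(torch.rand(1, 1, 100))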
5. The method of claim 1, wherein the time sequence boundary classification module comprises: a second action nomination feature generation layer and a time sequence convolutional network;
and calling the time sequence boundary classification module to perform the second prediction processing on the dual-stream features to obtain the time sequence boundary confidence map corresponding to the video comprises:
calling the second action nomination feature generation layer to convert the dual-stream features into a second feature map in matrix form; and
calling the time sequence convolutional network to perform convolution processing on the second feature map in matrix form to obtain the time sequence boundary confidence map corresponding to the video.
6. The method of claim 5, wherein the time sequence convolutional network comprises:
a three-dimensional convolution kernel and m second two-dimensional convolution kernels stacked in sequence, where m is a positive integer.
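Claims 5 and 6 describe the boundary branch: proposal features in matrix form processed by one three-dimensional convolution kernel followed by m two-dimensional convolution kernels, yielding start and end confidences. The sketch below illustrates the shape flow only; the sampling scheme, channel widths and sample count are assumptions.

import torch
import torch.nn as nn

class BoundaryClassifierSketch(nn.Module):
    """Illustrative time sequence boundary classification module (claims 5-6); sizes assumed."""
    def __init__(self, feat_dim=256, samples=4, L=100, m=2, channels=128):
        super().__init__()
        self.L, self.samples = L, samples
        # One 3D convolution kernel collapses the per-proposal sample dimension.
        self.conv3d = nn.Conv3d(feat_dim, channels, kernel_size=(samples, 1, 1))
        convs = []
        for _ in range(m):                        # m second 2D convolution kernels
            convs += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        convs.append(nn.Conv2d(channels, 2, 1))   # channel 0: start confidence, channel 1: end confidence
        self.conv2d = nn.Sequential(*convs)

    def forward(self, dual_stream):               # (B, C, T) dual-stream features
        B, C, _ = dual_stream.shape
        # Toy matrix-form proposal features: repeat each time step's feature for every
        # (start, end) pair and sample slot; a real model samples inside each segment.
        base = dual_stream[..., :self.L]                              # (B, C, L)
        grid = base.unsqueeze(-1).expand(B, C, self.L, self.L)        # (B, C, L, L)
        feat = grid.unsqueeze(2).expand(B, C, self.samples, self.L, self.L)
        x = torch.relu(self.conv3d(feat)).squeeze(2)                  # (B, channels, L, L)
        return torch.sigmoid(self.conv2d(x))                          # (B, 2, L, L) boundary confidence map

conf_map = BoundaryClassifierSketch()(torch.randn(1, 256, 100))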
7. The method according to any one of claims 1 to 6, wherein outputting the time sequence action nomination of the video according to the fused feature map comprises:
obtaining L x L candidate time sequence action nominations from the fused feature map; and
removing redundant time sequence action nominations from the L x L candidate time sequence action nominations, and outputting the time sequence action nominations of the video, each time sequence action nomination having a boundary and a confidence.
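Claim 7 can be read as enumerating the L x L cells of the fused feature map as candidate (start boundary, end boundary) pairs and then suppressing redundant overlapping candidates. The sketch below uses a plain temporal non-maximum suppression for the redundancy-removal step; the use of NMS, the tIoU threshold and the top-k cap are assumptions.

import torch

def proposals_from_fused_map(fused, tiou_threshold=0.8, top_k=100):
    """Illustrative post-processing for claim 7: (L, L) fused score map ->
    (start, end, confidence) proposals with redundant overlaps removed."""
    L = fused.shape[-1]
    cands = []
    for s in range(L):
        for e in range(s + 1, L):                 # keep only valid start < end cells
            cands.append((s, e, float(fused[s, e])))
    cands.sort(key=lambda c: c[2], reverse=True)  # highest-confidence candidates first

    kept = []
    for s, e, conf in cands:
        redundant = False
        for ks, ke, _ in kept:                    # temporal IoU against already-kept proposals
            inter = max(0, min(e, ke) - max(s, ks))
            union = max(e, ke) - min(s, ks)
            if union > 0 and inter / union >= tiou_threshold:
                redundant = True
                break
        if not redundant:
            kept.append((s, e, conf))
            if len(kept) == top_k:
                break
    return kept

proposals = proposals_from_fused_map(torch.rand(100, 100))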
8. The method of any one of claims 1 to 6, wherein the time sequence action nomination generation model further comprises a video encoding unit, and the method further comprises:
calling the video encoding unit to encode the plurality of video frames to obtain the RGB features and the optical flow features of each video frame.
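For the video encoding unit of claim 8, a production system would typically obtain RGB and optical flow features from a pretrained two-stream backbone. The sketch below is only a stand-in that extracts raw frames and dense Farneback optical flow with OpenCV; the function name, parameters and input path are illustrative.

import cv2
import numpy as np

def encode_video(path, max_frames=64):
    """Illustrative stand-in for a video encoding unit: per-frame RGB data plus
    dense optical flow between consecutive frames."""
    cap = cv2.VideoCapture(path)
    rgb_frames, flows, prev_gray = [], [], None
    while len(rgb_frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        rgb_frames.append(frame)                          # per-frame RGB data
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            # Dense optical flow between consecutive frames (Farneback method).
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            flows.append(flow)                            # (H, W, 2) motion field
        prev_gray = gray
    cap.release()
    return np.array(rgb_frames), np.array(flows)

# rgb, flow = encode_video("example.mp4")  # hypothetical input path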
9. A device for generating a time sequence action nomination, wherein the device runs a time sequence action nomination generation model, the time sequence action nomination generation model comprises a dense boundary time sequence action generator, and the dense boundary time sequence action generator comprises a dual-stream base network, an action integrity regression module and a time sequence boundary classification module, the device comprising:
an acquisition module, configured to acquire a plurality of video frames of a video;
a calling module, configured to call the dual-stream base network to process video features of the plurality of video frames to obtain action probability features and dual-stream features; call the action integrity regression module to perform first prediction processing on the action probability features to obtain an action integrity probability map corresponding to the video, wherein the action integrity probability map is used for representing the action integrity probability of a start boundary and an end boundary of a same time sequence action nomination; and call the time sequence boundary classification module to perform second prediction processing on the dual-stream features to obtain a time sequence boundary confidence map corresponding to the video, wherein the time sequence boundary confidence map is used for predicting the start boundary and the end boundary of a time sequence action nomination;
a fusion module, configured to fuse the time sequence boundary confidence map and the action integrity probability map to obtain a fused feature map; and
an output module, configured to output the time sequence action nomination of the video according to the fused feature map.
10. A computer device, comprising a processor and a memory, wherein the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method for generating a time sequence action nomination according to any one of claims 1 to 8.
11. A computer-readable storage medium having stored therein at least one instruction, wherein the at least one instruction is loaded and executed by a processor to implement the method for generating a time sequence action nomination according to any one of claims 1 to 8.
CN201911087939.5A 2019-11-08 2019-11-08 Method, device and equipment for generating time sequence action nomination and storage medium Active CN110852256B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911087939.5A | 2019-11-08 | 2019-11-08 | Method, device and equipment for generating time sequence action nomination and storage medium


Publications (2)

Publication Number | Publication Date
CN110852256A (en) | 2020-02-28
CN110852256B (en) | 2023-04-18

Family

ID=69599967

Family Applications (1)

Application Number | Title | Status
CN201911087939.5A (CN110852256B) | Method, device and equipment for generating time sequence action nomination and storage medium | Active

Country Status (1)

Country Link
CN (1) CN110852256B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200103A (en) * 2020-04-07 2021-01-08 北京航空航天大学 Video analysis system and method based on graph attention
CN111860289B (en) * 2020-07-16 2024-04-02 北京思图场景数据科技服务有限公司 Time sequence action detection method and device and computer equipment
CN112434604A (en) * 2020-11-24 2021-03-02 中国科学院深圳先进技术研究院 Action time interval positioning method based on video characteristics and computer equipment
CN112364852B (en) * 2021-01-13 2021-04-20 成都考拉悠然科技有限公司 Action video segment extraction method fusing global information
CN112906586A (en) * 2021-02-26 2021-06-04 上海商汤科技开发有限公司 Time sequence action nomination generating method and related product
CN112990013B (en) * 2021-03-15 2024-01-12 西安邮电大学 Time sequence behavior detection method based on dense boundary space-time network
CN113591570A (en) * 2021-06-28 2021-11-02 北京百度网讯科技有限公司 Video processing method and device, electronic equipment and storage medium
CN114627556B (en) 2022-03-15 2023-04-07 北京百度网讯科技有限公司 Motion detection method, motion detection device, electronic apparatus, and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190297326A1 (en) * 2018-03-21 2019-09-26 Nvidia Corporation Video prediction using spatially displaced convolution

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784269A (en) * 2019-01-11 2019-05-21 China University of Petroleum (East China) Human action detection and localization method based on spatio-temporal combination
CN110110648A (en) * 2019-04-30 2019-08-09 Beihang University Action nomination method based on visual perception and artificial intelligence
CN110222574A (en) * 2019-05-07 2019-09-10 Hangzhou Zhishangyunke Information Technology Co., Ltd. Production operation activity recognition method, apparatus, device, system and storage medium based on structured dual-stream convolutional neural networks
CN110188733A (en) * 2019-06-10 2019-08-30 University of Electronic Science and Technology of China Temporal action detection method and system based on 3D region convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tianwei Lin et al. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. arXiv, 2018, 1-17. *
Luo Huilan et al. Video action recognition with spatio-temporal squeeze-and-excitation residual multiplicative networks. Journal on Communications, 2019, Vol. 40, No. 9, 1-10. *

Also Published As

Publication Number | Publication Date
CN110852256A (en) | 2020-02-28

Similar Documents

Publication Publication Date Title
CN110852256B (en) Method, device and equipment for generating time sequence action nomination and storage medium
CN110781843B (en) Classroom behavior detection method and electronic equipment
Ogale et al. View-invariant modeling and recognition of human actions using grammars
CN110796111B (en) Image processing method, device, equipment and storage medium
CN111898696A (en) Method, device, medium and equipment for generating pseudo label and label prediction model
CN110472002B (en) Text similarity obtaining method and device
CN110795549B (en) Short text conversation method, device, equipment and storage medium
KR101996371B1 (en) System and method for creating caption for image and computer program for the same
CN115205949A (en) Image generation method and related device
CN110163052B (en) Video action recognition method and device and machine equipment
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN110457523B (en) Cover picture selection method, model training method, device and medium
CN113283336A (en) Text recognition method and system
CN114339450A (en) Video comment generation method, system, device and storage medium
CN115131849A (en) Image generation method and related device
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN115131801A (en) Multi-modal-based document recognition method, device, equipment and storage medium
CN113657272B (en) Micro video classification method and system based on missing data completion
CN114328943A (en) Question answering method, device, equipment and storage medium based on knowledge graph
Ogale et al. View-invariant identification of pose sequences for action recognition
Robert The Role of Deep Learning in Computer Vision
CN117152815A (en) Student activity accompanying data analysis method, device and equipment
CN114937277B (en) Image-based text acquisition method and device, electronic equipment and storage medium
CN115130461A (en) Text matching method and device, electronic equipment and storage medium
CN114333069A (en) Object posture processing method, device, equipment and storage medium

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
REG | Reference to a national code (country: HK; legal event code: DE; ref document number: 40022970)
GR01 | Patent grant