CN112307885A - Model construction and training method and device, and time sequence action positioning method and device - Google Patents

Model construction and training method and device, and time sequence action positioning method and device

Info

Publication number
CN112307885A
Authority
CN
China
Prior art keywords
video data
action
positioning
training
network model
Legal status
Pending
Application number
CN202010851281.7A
Other languages
Chinese (zh)
Inventor
姚霆
梅涛
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202010851281.7A
Publication of CN112307885A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a model construction and training method and device and a video time sequence action positioning method and device. The model training method comprises: training an action prediction network model with first video data and second video data, so that the trained action prediction network model can perform time sequence action positioning on video data to be tested, wherein the first video data are long video data containing time domain annotations, the second video data are action short-segment video data, the video data to be tested are long video data not containing time domain annotations, and the amount of the first video data is smaller than the amount of the second video data. By providing a time sequence action positioning model suitable for large-scale action categories, the positioning model can be successfully extended to a much wider set of action categories.

Description

Model construction and training method and device, and time sequence action positioning method and device
Technical Field
The disclosure relates to the field of video data processing, in particular to a model construction and training method and device and a video time sequence action positioning method and device.
Background
With the continued development of internet cloud technology, user video data has grown explosively, and video-based intelligent services are increasingly part of users' daily lives. This trend has driven the rapid development of intelligent video understanding techniques. One of these is time sequence action positioning: locating the time period in which an action occurs in a long video and classifying the action according to the content of that segment. At present, most time sequence action positioning algorithms require dense time domain annotation of the training video data, and then train an action positioning model in a fully supervised manner on that basis.
Disclosure of Invention
The inventors have found through research that the time sequence action positioning approaches in the related art require strong supervision information, so whenever a new action category appears, the related videos must be fully annotated. This limits the number of applicable action categories (for example, the ActivityNet dataset has only 200 categories) and makes it impossible to localize large-scale sets of action categories.
In view of at least one of the above technical problems, the present disclosure provides a model building and training method and apparatus, and a video time sequence action positioning method and apparatus, which are suitable for time sequence action positioning of large-scale action categories.
According to an aspect of the present disclosure, there is provided an action prediction network model construction method, including:
constructing a first prediction submodel, wherein the first prediction submodel is used for performing action recognition training and time sequence action positioning training with first video data, and the first video data are long video data containing time domain annotations;
constructing a second prediction submodel, wherein the second prediction submodel is used for performing action recognition training and time sequence action positioning training with second video data, the second video data are action short-segment video data, and the amount of the first video data is smaller than the amount of the second video data;
wherein the action prediction network model comprises the first prediction submodel and the second prediction submodel; the trained action prediction network model is used for performing time sequence action positioning on video data to be tested, the video data to be tested being long video data not containing time domain annotations.
In some embodiments of the present disclosure, constructing the first prediction submodel includes:
constructing a segment extraction submodel, wherein the segment extraction submodel is used for extracting action segments from the first video data as foreground short segments;
constructing a first action recognition submodel, wherein the first action recognition submodel is used for performing action recognition training with the foreground short segments;
constructing a first time sequence action positioning submodel, wherein the first time sequence action positioning submodel is used for performing time sequence action positioning training with the first video data;
and constructing a first weight migration submodel, wherein the first weight migration submodel is used for bridging the action recognition of the foreground short segments and the time sequence action positioning of the first video data through a first weight migration function (an illustrative sketch of such a function follows this list).
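The disclosure does not provide source code; as a purely illustrative sketch, assuming a PyTorch implementation, the first weight migration submodel could be realized as a small fully connected network that maps video-level classification weights to anchor-level classification weights, with one parameter set per temporal scale shared across action categories. All names and layer sizes below are assumptions.

```python
import torch
import torch.nn as nn

class WeightMigration(nn.Module):
    """Hypothetical sketch of a weight migration submodel: maps a video-level
    classification weight vector to an anchor-level classification weight vector.
    One module per temporal scale j; its parameters (theta_j) are shared across
    all action categories."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.transfer = nn.Sequential(       # plays the role of T(.; theta_j)
            nn.Linear(dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, dim),
        )

    def forward(self, video_level_weights: torch.Tensor) -> torch.Tensor:
        # video_level_weights: (num_classes, dim) -> anchor-level weights, same shape
        return self.transfer(video_level_weights)

# one migration module per temporal scale (8 scales, as mentioned later in the description)
migrations = nn.ModuleList([WeightMigration(dim=512) for _ in range(8)])
```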
In some embodiments of the present disclosure, the action prediction network model construction method further includes:
constructing an adversarial training model, wherein the adversarial training model uses a discriminator on each time domain scale to distinguish the anchor features corresponding to the real backgrounds in the first video data from the anchor features corresponding to the backgrounds generated for the second video data, so that the background generator, guided by the pre- and post-action backgrounds in the first video data, can hallucinate the pre- and post-action backgrounds of the second video data.
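For illustration only, the per-scale discriminator and a standard GAN-style loss (the disclosure does not fix the exact loss form, so the losses below are assumptions) could look as follows in PyTorch:

```python
import torch
import torch.nn as nn

class BackgroundDiscriminator(nn.Module):
    """Hypothetical per-scale discriminator: scores whether a background anchor
    feature comes from a real untrimmed video (first video data) or was
    generated for an action short segment (second video data)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, anchor_feat: torch.Tensor) -> torch.Tensor:
        return self.net(anchor_feat)          # unnormalized real/fake score

def adversarial_losses(disc: BackgroundDiscriminator,
                       real_bg: torch.Tensor, fake_bg: torch.Tensor):
    """Discriminator separates real from generated background anchors; the
    generator tries to make generated anchors look real."""
    bce = nn.BCEWithLogitsLoss()
    d_loss = bce(disc(real_bg), torch.ones(real_bg.size(0), 1)) + \
             bce(disc(fake_bg.detach()), torch.zeros(fake_bg.size(0), 1))
    g_loss = bce(disc(fake_bg), torch.ones(fake_bg.size(0), 1))
    return d_loss, g_loss
```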
In some embodiments of the present disclosure, the action prediction network model construction method further includes:
constructing a weight sharing submodel, wherein the weight sharing submodel is used for sharing the weights of a first weight migration function and a second weight migration function, the first weight migration function being the weight migration function of the first weight migration submodel in the first prediction submodel, and the second weight migration function being the weight migration function of the second weight migration submodel in the second prediction submodel.
In some embodiments of the present disclosure, the action prediction network model construction method further includes: constructing a joint optimization submodel, wherein the joint optimization submodel achieves convergence of the overall positioning loss function of the action prediction network model on the target categories by continuously and alternately optimizing the time sequence action positioning loss function over the first video data and the complete synthesized feature sequences of the second video data, the action recognition loss function over the foreground segments of the first video data and the second video data, and the adversarial training loss function, the target categories belonging to the action categories of the first video data or of the second video data.
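A minimal sketch of this alternating optimization, assuming PyTorch and with all helper callables passed in as hypothetical parameters (the disclosure does not specify the exact loss implementations), might be structured as follows:

```python
def alternating_train_step(step, batch_untrimmed, batch_moment, model, disc,
                           loc_loss_fn, cls_loss_fn, adv_loss_fn,
                           opt_model, opt_disc):
    """One step of the alternating joint optimization (hypothetical signatures)."""
    loc = loc_loss_fn(model, batch_untrimmed, batch_moment)   # localization loss (real + synthesized sequences)
    cls = cls_loss_fn(model, batch_untrimmed, batch_moment)   # recognition loss on foreground segments
    d_loss, g_loss = adv_loss_fn(model, disc, batch_untrimmed, batch_moment)  # adversarial losses
    if step % 2 == 0:                        # alternate: update the discriminator ...
        opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()
    else:                                    # ... or the localization/recognition model
        total = loc + cls + g_loss
        opt_model.zero_grad(); total.backward(); opt_model.step()
    return loc.item(), cls.item(), d_loss.item(), g_loss.item()
```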
According to another aspect of the present disclosure, there is provided an action prediction network model training method, including:
training an action prediction network model with first video data and second video data, so that the trained action prediction network model can perform time sequence action positioning on video data to be tested, wherein the first video data are long video data containing time domain annotations, the second video data are action short-segment video data, the video data to be tested are long video data not containing time domain annotations, the amount of the first video data is smaller than the amount of the second video data, and the action category of the video data to be tested belongs to the action categories of the first video data or of the second video data.
In some embodiments of the present disclosure, the action prediction network model is constructed by the action prediction network model construction method according to any of the above embodiments.
In some embodiments of the present disclosure, the time series action positioning comprises:
an action category and an action start time of the video data are determined.
In some embodiments of the present disclosure, the action prediction network model comprises a first prediction submodel and a second prediction submodel.
In some embodiments of the disclosure, the training of the action prediction network model with the first video data and the second video data includes:
performing action recognition training and time sequence action positioning training on the first prediction submodel with the first video data;
and performing action recognition training and time sequence action positioning training on the second prediction submodel with the second video data.
In some embodiments of the present disclosure, the first prediction submodel includes a first action recognition submodel and a first time sequence action positioning submodel.
In some embodiments of the present disclosure, performing action recognition training and time sequence action positioning training on the first prediction submodel with the first video data includes:
extracting action segments from the first video data as foreground short segments, inputting the foreground short segments into the first action recognition submodel, and obtaining action recognition results for the foreground short segments;
inputting the first video data into the first time sequence action positioning submodel to obtain a time sequence action positioning result for the first video data;
and bridging the action recognition of the foreground short segments and the time sequence action positioning of the first video data through a first weight migration function.
In some embodiments of the present disclosure, inputting the foreground short segments into the first action recognition submodel and obtaining the action recognition results for the foreground short segments includes: extracting segment-level features of the foreground short segments, combining the segment-level features of the foreground short segments into a feature sequence, and generating feature maps at different time domain scales; performing a global pooling operation on the feature maps of the foreground short segments to obtain a feature vector at each time domain scale; mapping the feature vectors of the foreground short segments with a first video-level classification weight matrix to obtain the action recognition features of the foreground short segments; and obtaining the action recognition results of the foreground short segments from these action recognition features.
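As a purely illustrative sketch of this recognition branch (layer sizes, the use of strided 1D convolutions, and all names are assumptions), a PyTorch version could be:

```python
import torch
import torch.nn as nn

class VideoLevelRecognitionHead(nn.Module):
    """Illustrative sketch: builds a 1D convolutional feature pyramid over a
    segment-level feature sequence, globally pools each temporal scale, and
    classifies with a video-level classification weight matrix."""
    def __init__(self, dim: int = 512, num_classes: int = 200, num_scales: int = 8):
        super().__init__()
        self.pyramid = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
            for _ in range(num_scales)
        ])
        # the rows of this linear layer play the role of the video-level classification weights
        self.video_cls = nn.Linear(dim, num_classes, bias=False)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim, num_segments), segment-level features from a 3D CNN backbone
        logits = []
        x = feats
        for conv in self.pyramid:
            x = torch.relu(conv(x))                 # feature map at one temporal scale
            pooled = x.mean(dim=-1)                 # global pooling over time
            logits.append(self.video_cls(pooled))   # video-level classification
        return torch.stack(logits, dim=0).mean(dim=0)
```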
In some embodiments of the disclosure, inputting the first video data into the first time sequence action positioning submodel and obtaining the time sequence action positioning result for the first video data includes: extracting segment-level features of the first video data, combining the segment-level features of the first video data into a feature sequence, and generating feature maps at different time domain scales; and applying the first anchor-level classification weight matrix to each feature anchor on the feature maps of the first video data to obtain the time sequence action positioning result for the first video data.
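For illustration, applying an anchor-level classification weight matrix at every temporal anchor can be written as a 1D convolution with kernel size 1, together with a small temporal-offset regression head (the layer shapes and offset parameterization below are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def anchor_level_localization(feature_map: torch.Tensor,
                              anchor_cls_weights: torch.Tensor,
                              regress: nn.Conv1d):
    """Illustrative sketch: classify and regress every temporal anchor of one
    feature map using anchor-level classification weights (e.g. produced by a
    weight migration function from video-level weights)."""
    # feature_map: (batch, dim, T_j); anchor_cls_weights: (num_classes, dim)
    cls_logits = F.conv1d(feature_map, anchor_cls_weights.unsqueeze(-1))  # (batch, num_classes, T_j)
    offsets = regress(feature_map)                                        # (batch, 2, T_j): center/length offsets
    return cls_logits, offsets

# usage sketch
regress_head = nn.Conv1d(512, 2, kernel_size=1)
cls_logits, offsets = anchor_level_localization(torch.randn(1, 512, 16),
                                                torch.randn(200, 512),
                                                regress_head)
```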
In some embodiments of the present disclosure, bridging the action recognition of the foreground short segments and the time sequence action positioning of the first video data through the first weight migration function includes:
applying the first weight migration function to the first video-level classification weight matrix to generate the first anchor-level classification weight matrix.
In some embodiments of the present disclosure, the second prediction submodel includes a second action recognition submodel and a second time sequence action positioning submodel.
In some embodiments of the present disclosure, performing action recognition training and time sequence action positioning training on the second prediction submodel with the second video data includes:
extracting the segment-level features of the second video data, and combining the segment-level features of the second video data into a feature sequence of the second video data;
hallucinating the pre- and post-action backgrounds of the second video data with the pre- and post-action backgrounds in the first video data as a guide, and combining the generated background features with the original second video data features to form a complete synthesized feature sequence (an illustrative sketch follows this list);
inputting the feature sequence of the second video data into the second action recognition submodel to obtain an action recognition result for the second video data;
inputting the complete synthesized feature sequence into the second time sequence action positioning submodel to obtain a time sequence action positioning result for the second video data;
and bridging the action recognition of the second video data and the time sequence action positioning of the context-extended version (i.e., the version extended with content before and after) of the second video data through a second weight migration function.
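A minimal sketch of the context generators and of assembling the complete synthesized feature sequence, assuming PyTorch (the generator architecture and all names are assumptions):

```python
import torch
import torch.nn as nn

class ContextGenerator(nn.Module):
    """Illustrative 1D convolutional generator that hallucinates background
    features preceding or following an action short segment, conditioned on the
    segment's own feature sequence."""
    def __init__(self, dim: int = 512, context_len: int = 4):
        super().__init__()
        self.context_len = context_len
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, moment_feats: torch.Tensor) -> torch.Tensor:
        # moment_feats: (batch, dim, T_moment) -> generated background: (batch, dim, context_len)
        return self.net(moment_feats)[..., :self.context_len]

# two generators (before / after the action), as in the description
g_before, g_after = ContextGenerator(), ContextGenerator()
moment = torch.randn(2, 512, 8)                 # segment-level features of an action short segment
synthesized = torch.cat([g_before(moment), moment, g_after(moment)], dim=-1)  # complete synthesized sequence
```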
In some embodiments of the present disclosure, inputting the feature sequence of the second video data into the second action recognition submodel and obtaining the action recognition result for the second video data includes: generating feature maps at different time domain scales from the feature sequence of the second video data; performing a global pooling operation to obtain a feature vector at each time domain scale; mapping the feature vectors of the second video data with a second video-level classification weight matrix to obtain the action recognition features of the second video data; and obtaining the action recognition result of the second video data from these action recognition features.
In some embodiments of the present disclosure, inputting the complete synthesized feature sequence into the second time sequence action positioning submodel and obtaining the time sequence action positioning result for the second video data includes: generating feature maps at different time domain scales from the complete synthesized feature sequence; and applying the second anchor-level classification weight matrix to each feature anchor on the feature maps of the context-extended version of the target short segment to obtain the time sequence action positioning result for the context-extended version of the target short segment.
In some embodiments of the present disclosure, bridging the action recognition of the second video data and the time sequence action positioning of the context-extended version of the second video data through the second weight migration function includes:
applying the second weight migration function to the second video-level classification weight matrix to generate the second anchor-level classification weight matrix.
In some embodiments of the present disclosure, the action prediction network model training method further includes:
sharing the weights of the first weight migration function and the second weight migration function.
In some embodiments of the present disclosure, the action prediction network model comprises an adversarial training model;
hallucinating the pre- and post-action backgrounds of the second video data with the pre- and post-action backgrounds in the first video data as a guide comprises:
on each time domain scale, using a discriminator to distinguish the anchor features corresponding to the real backgrounds in the first video data from the anchor features corresponding to the backgrounds generated for the second video data;
and hallucinating, through the adversarial training model, the pre- and post-action backgrounds of the second video data with the pre- and post-action backgrounds in the first video data as a guide.
In some embodiments of the present disclosure, the action prediction network model training method further includes:
achieving convergence of the overall positioning loss function of the action prediction network model on the target categories by continuously and alternately optimizing the time sequence action positioning loss function over the first video data and the complete synthesized feature sequences of the second video data, the action recognition loss function over the foreground segments of the first video data and the second video data, and the adversarial training loss function, wherein the target categories belong to the action categories of the first video data or of the second video data.
According to another aspect of the present disclosure, there is provided a video time sequence action positioning method, including:
inputting video data to be tested into an action prediction network model, wherein the action prediction network model is obtained by training according to the action prediction network model training method of any of the above embodiments;
and performing time sequence action positioning on the video data to be tested with the action prediction network model.
In some embodiments of the present disclosure, performing time sequence action positioning on the video data to be tested with the action prediction network model includes:
extracting segment-level features from the video data to be tested, combining them into a feature sequence, and generating feature maps at different time domain scales;
generating anchor features at the anchor layer of each time domain scale, and inferring from each anchor feature the corresponding time segment position in the video data to be tested;
performing action classification and time sequence offset regression on the anchor features to obtain candidate action positioning results, wherein the second anchor-level classification weight matrix used for the action classification of the anchor features is obtained by migration from the second video-level classification weight matrix;
and determining the final action positioning results from the candidate action positioning results according to their positioning ranking scores (an illustrative sketch of these steps follows this list).
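An end-to-end inference sketch in PyTorch, purely illustrative (the helper structures, the offset parameterization, and the simple score-based ranking without non-maximum suppression are all assumptions):

```python
import torch

def localize_actions(feature_maps, cls_weight_per_scale, reg_head_per_scale,
                     anchor_spans_per_scale, top_k=100):
    """Illustrative inference: for each temporal scale, classify and regress every
    anchor, map it back to a time span in the video under test, and keep the
    highest-scoring candidate action positioning results."""
    candidates = []
    for fmap, cls_w, reg, spans in zip(feature_maps, cls_weight_per_scale,
                                       reg_head_per_scale, anchor_spans_per_scale):
        # fmap: (dim, T_j); cls_w: (num_classes, dim), transferred from video-level weights
        logits = cls_w @ fmap                          # (num_classes, T_j)
        offsets = reg(fmap.unsqueeze(0)).squeeze(0)    # (2, T_j): center / length offsets
        scores, labels = logits.softmax(dim=0).max(dim=0)
        for t, (start, end) in enumerate(spans):       # spans: anchor -> (start, end) in seconds
            center = (start + end) / 2 + offsets[0, t].item() * (end - start)
            length = (end - start) * torch.exp(offsets[1, t]).item()
            candidates.append((center - length / 2, center + length / 2,
                               labels[t].item(), scores[t].item()))
    # rank by positioning score and keep the best candidates (NMS could follow here)
    return sorted(candidates, key=lambda c: c[-1], reverse=True)[:top_k]
```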
In some embodiments of the present disclosure, performing time sequence action positioning on the video data to be tested with the action prediction network model includes:
performing time sequence action positioning on the video data to be tested with the second time sequence action positioning submodel of the action prediction network model obtained by training according to the action prediction network model training method of any of the above embodiments.
According to another aspect of the present disclosure, there is provided an action prediction network model construction apparatus, including:
a first construction module for constructing a first prediction submodel, wherein the first prediction submodel is used for performing action recognition training and time sequence action positioning training with first video data, the first video data being long video data containing time domain annotations;
and a second construction module for constructing a second prediction submodel, wherein the second prediction submodel is used for performing action recognition training and time sequence action positioning training with second video data, the second video data being action short-segment video data, and the amount of the first video data being smaller than the amount of the second video data;
wherein the action prediction network model comprises the first prediction submodel and the second prediction submodel; the trained action prediction network model is used for performing time sequence action positioning on video data to be tested, the video data to be tested being long video data not containing time domain annotations.
According to another aspect of the present disclosure, there is provided an action prediction network model training apparatus, configured to train an action prediction network model with first video data and second video data, so that the trained action prediction network model can perform time sequence action positioning on video data to be tested, wherein the first video data are long video data containing time domain annotations, the second video data are action short-segment video data, the video data to be tested are un-clipped long video data, the amount of the first video data is smaller than the amount of the second video data, and the action category of the video data to be tested belongs to the action categories of the first video data or of the second video data.
According to another aspect of the present disclosure, there is provided a video time-series action positioning device, including:
a data input module for inputting video data to be tested into an action prediction network model, wherein the action prediction network model is obtained by training according to the action prediction network model training method of any of the above embodiments;
and a positioning module for performing time sequence action positioning on the video data to be tested with the action prediction network model.
According to another aspect of the present disclosure, there is provided a computer apparatus comprising:
a memory to store instructions;
a processor, configured to execute the instructions, so that the computer apparatus performs operations implementing the action prediction network model construction method according to any of the above embodiments, the action prediction network model training method according to any of the above embodiments, or the video time sequence action positioning method according to any of the above embodiments.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the action prediction network model construction method according to any of the above embodiments, the action prediction network model training method according to any of the above embodiments, or the video time sequence action positioning method according to any of the above embodiments.
By providing a time sequence action positioning model suitable for large-scale action categories, the present disclosure enables the time sequence action positioning model to be successfully extended to a much wider set of action categories.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present disclosure, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of some embodiments of a method of action-predictive network model construction and training according to the present disclosure.
Fig. 2 is a schematic diagram of some embodiments of a method for constructing an action-predictive network model according to the present disclosure.
FIG. 3 is a schematic diagram of some embodiments of an action-predictive network model according to the present disclosure.
FIG. 4 is a schematic diagram of some embodiments of an action prediction network model training method of the present disclosure.
FIG. 5 is a diagram illustrating the generation of contextual information for an action through adversarial learning in some embodiments of the present disclosure.
Fig. 6 is a schematic diagram of some embodiments of a video timing action positioning method according to the present disclosure.
Fig. 7 is a schematic diagram of some embodiments of an action prediction network model construction apparatus of the present disclosure.
FIG. 8 is a schematic diagram of some embodiments of a video timing motion positioning apparatus according to the present disclosure.
FIG. 9 is a schematic diagram of some embodiments of a computer apparatus of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings of those embodiments. It is obvious that the described embodiments are only some of the embodiments of the present disclosure, and not all of them. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or its uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without creative effort, shall fall within the protection scope of the present disclosure.
The relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The inventors have found through research that, in the related art, a temporal action localization model requires the training data to carry complete time domain annotation information; because of the huge cost of manual annotation, the number of action categories to which such a localization model can be applied is limited, and a large-scale action localization algorithm cannot be obtained.
Meanwhile, various large-scale action short-segment datasets (e.g., Kinetics) in the related art are used for training action recognition models; the segments in such datasets are generally short (about 10 seconds) and contain only one complete action and its corresponding action label.
Thus, the inventors propose an action prediction network model construction method and apparatus, an action prediction network model training method and apparatus, a video time sequence action positioning method and apparatus, a computer apparatus, and a storage medium, from which a time sequence action positioning model can be learned. These are described below by way of specific embodiments.
FIG. 1 is a schematic diagram of some embodiments of the action prediction network model construction and training methods according to the present disclosure. Preferably, the present embodiment may be executed by the action prediction network model construction apparatus and the action prediction network model training apparatus of the present disclosure. As shown in fig. 1, the action prediction network model training method of the present disclosure may include steps 11 to 12, where:
and 11, constructing an action prediction network model.
In some embodiments of the present disclosure, step 11 may comprise: and improving and combining the Action recognition Model and the time sequence Action positioning Model, and adding a new Model and function to construct an Action prediction network Model (Temporal Action Location Model).
Fig. 1 also presents a schematic view of some embodiments of the action prediction network model of the present disclosure. As shown in fig. 1, the action prediction network model is constructed by improving and combining an action recognition model and a time sequence action positioning model.
In some embodiments of the present disclosure, the action recognition model is used to predict the action category of the video data to be tested. The action recognition model is trained on various large-scale action short-segment datasets (such as Kinetics); the segments in such datasets are generally short (about 10 seconds) and contain only one complete action and its corresponding action label.
In some embodiments of the present disclosure, the time sequence action positioning model is used to determine the action category and the action start time of video data. Generally, a time sequence action positioning model requires the training data to carry complete time domain annotation (temporal annotation) information; because of the huge cost of manual annotation, the number of action categories to which such a positioning model can be applied is limited, and a large-scale action positioning algorithm cannot be obtained.
Step 12, training the action prediction network model.
In some embodiments of the present disclosure, step 12 may comprise: training the action prediction network model with the first video data and the second video data, so that the trained action prediction network model can perform time sequence action positioning on the video data to be tested.
In some embodiments of the present disclosure, the time sequence action positioning includes determining the start and end positions of an action segment in the video data to be tested, and predicting the action category of the corresponding segment.
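For concreteness only, one positioning result could be represented by a small record such as the following (this structure is illustrative and not defined in the disclosure):

```python
from dataclasses import dataclass

@dataclass
class ActionLocalization:
    """One time sequence action positioning result for a video under test."""
    start_time: float   # start of the action segment, in seconds
    end_time: float     # end of the action segment, in seconds
    category: str       # predicted action category, e.g. "long jump"
    score: float        # confidence used for ranking candidate results
```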
In some embodiments of the present disclosure, the first Video data may be un-clipped long Video (Untrimmed Video) data containing a temporal annotation.
In some embodiments of the present disclosure, as shown in fig. 1, the first video data may specifically be long video data including a long jump segment and a background segment before and after the long jump motion.
In some embodiments of the present disclosure, the second video data may be motion short segment video data.
In some embodiments of the present disclosure, the second video data may be taken from Kinetics, a current action dataset with a large number of categories and a large amount of data.
In some embodiments of the present disclosure, the video data to be tested may be long video data that do not contain time domain annotations.
In some embodiments of the present disclosure, the amount of the first video data is smaller than the amount of the second video data.
In some embodiments of the present disclosure, the first video data is small-scale data and the second video data is large-scale data.
In some embodiments of the present disclosure, the data ratio of the first video data to the second video data is 1:9.
In some embodiments of the present disclosure, the action category of the video data to be tested belongs to the action categories of the first video data or of the second video data.
In some embodiments of the present disclosure, step 12 may include steps 121-122, wherein:
and step 121, performing motion recognition training and timing motion positioning training on the first video data.
In some embodiments of the present disclosure, step 121 may comprise: extracting action fragments from the first video data as foreground short fragments, and performing action recognition training on the foreground short fragments; performing time sequence action positioning training on the first video data; motion recognition of the foreground short segment and time sequence motion positioning of the first video data are bridged by a first weight transfer function.
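Purely as an illustration of how annotated action segments can be cropped out of a long video's feature sequence (the index computation and helper names are assumptions):

```python
import torch

def extract_foreground_segments(feature_seq: torch.Tensor,
                                annotations: list,
                                seconds_per_feature: float = 1.0):
    """Crop the annotated action intervals out of an untrimmed video's
    segment-level feature sequence, returning (features, label) pairs."""
    # feature_seq: (dim, T) features of the untrimmed long video
    # annotations: list of (start_seconds, end_seconds, label) time domain annotations
    segments = []
    for start_s, end_s, label in annotations:
        lo = int(start_s / seconds_per_feature)
        hi = max(lo + 1, int(end_s / seconds_per_feature))
        segments.append((feature_seq[:, lo:hi], label))   # foreground short segment
    return segments
```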
Step 122, performing action recognition training and time sequence action positioning training with the second video data.
In some embodiments of the present disclosure, step 122 may comprise: performing action recognition training on the second video data; modeling the background features of the content before and after the short segments with a generative adversarial model, hallucinating the pre- and post-action backgrounds of the second video data with the pre- and post-action backgrounds in the first video data as a guide, and performing time sequence action positioning training on the context-extended version of the second video data; and bridging the action recognition of the second video data and the time sequence action positioning of the context-extended version of the second video data through a second weight migration function.
In some embodiments of the present disclosure, step 122 may comprise: performing time sequence action positioning learning on the short-segment data, for example predicting that the category of the short-segment data is "riding a bicycle", determining the background segments before and after the short segment, and determining the start and end positions of the action segment.
The present disclosure thus relates to a method for training a time sequence action positioning model with short segments. The embodiment creatively provides a way of building a bridge between action recognition and time sequence action positioning through weight migration, and a way of modeling the background features of the content before and after a short segment with a generative adversarial model; on this basis it completes the optimization of the action positioning model on the target categories, thereby successfully extending the time sequence action positioning model to a much wider set of action categories.
The action prediction network model construction and training methods are suitable for a time sequence action positioning model covering large-scale action categories, where the training data comprise action short segments corresponding to the large-scale action categories and a small number of long videos of other categories with time domain annotations.
The action prediction network model construction method and the action prediction network model training method of the present disclosure are each described below with specific embodiments.
Fig. 2 is a schematic diagram of some embodiments of an action prediction network model construction method according to the present disclosure. Preferably, the present embodiment may be performed by the action prediction network model construction apparatus of the present disclosure.
FIG. 3 is a schematic diagram of some embodiments of an action prediction network model according to the present disclosure. As shown in fig. 3, the action prediction network model of the present disclosure may include a first prediction submodel 31 and a second prediction submodel 32, wherein the first video data are used for performing action recognition training and time sequence action positioning training on the first prediction submodel 31, and the second video data are used for performing action recognition training and time sequence action positioning training on the second prediction submodel 32.
As shown in fig. 2, the action prediction network model construction method of the present disclosure may include steps 21 to 22, where:
Step 21, constructing a first prediction submodel, wherein the first prediction submodel is used for performing action recognition training and time sequence action positioning training with first video data, and the first video data are long video data containing time domain annotations.
In some embodiments of the present disclosure, step 21 may include steps 211-214, wherein:
and step 211, constructing a segment extraction submodel, wherein the segment extraction submodel is used for extracting the action segment from the first video data as a foreground short segment.
And 212, constructing a first action recognition submodel, wherein the first action recognition submodel is used for carrying out action recognition training by adopting the foreground short segment.
And 213, constructing a second time sequence action positioning sub-model, wherein the second time sequence action positioning sub-model is used for performing time sequence action positioning training by adopting the first video data.
Step 214, a first weight transfer submodel is constructed, wherein the first weight transfer submodel is used for bridging the action recognition of the foreground short segment and the time sequence action positioning of the first video data through a first weight transfer function.
In some embodiments of the present disclosure, as shown in fig. 3, the first prediction submodel 31 may include a first action recognition submodel 311, a first time sequence action positioning submodel 312, a segment extraction submodel 313, and a first weight migration submodel 314.
Step 22, constructing a second prediction submodel, wherein the second prediction submodel is used for performing action recognition training and time sequence action positioning training with second video data, the second video data are action short-segment video data, and the amount of the first video data is smaller than the amount of the second video data.
In some embodiments of the present disclosure, the trained action prediction network model is used to perform time sequence action positioning on video data to be tested, where the video data to be tested are long video data that do not contain time domain annotations.
In some embodiments of the present disclosure, step 22 may include steps 221 to 225, wherein:
step 221, constructing a feature sequence generation submodel, wherein the feature sequence generation submodel is used for extracting the segment-level features of the second video data and combining the segment-level features of the second video data into a feature sequence of the second video data;
step 222, constructing a background generator, wherein the background generator hallucinates the pre- and post-action backgrounds of the second video data with the pre- and post-action backgrounds in the first video data as a guide, and combines the generated background features with the original second video data features to form a complete synthesized feature sequence;
step 223, constructing a second action recognition submodel, wherein the second action recognition submodel is used for performing action recognition training with the feature sequence of the second video data;
step 224, constructing a second time sequence action positioning submodel, wherein the second time sequence action positioning submodel is used for performing time sequence action positioning training with the complete synthesized feature sequence;
and step 225, constructing a second weight migration submodel, wherein the second weight migration submodel is used for bridging the action recognition of the second video data and the time sequence action positioning of the context-extended version of the second video data.
In some embodiments of the present disclosure, as shown in fig. 3, the second prediction submodel 32 may include a second action recognition submodel 321, a second time sequence action positioning submodel 322, a feature sequence generation submodel 323, a context generator 324, and a second weight migration submodel 325.
In some embodiments of the present disclosure, the action prediction network model construction method may further include: constructing an adversarial training model, wherein the adversarial training model uses a discriminator on each time domain scale to distinguish the anchor features corresponding to the real backgrounds in the first video data from the anchor features corresponding to the backgrounds generated for the second video data, so that the background generator, guided by the pre- and post-action backgrounds in the first video data, can hallucinate the pre- and post-action backgrounds of the second video data.
In some embodiments of the present disclosure, as shown in fig. 3, the action prediction network model of the present disclosure may also include an adversarial training model 33.
In some embodiments of the present disclosure, the action prediction network model construction method may further include: constructing a weight sharing submodel, wherein the weight sharing submodel is used for sharing the weights of a first weight migration function and a second weight migration function, the first weight migration function being the weight migration function of the first weight migration submodel in the first prediction submodel, and the second weight migration function being the weight migration function of the second weight migration submodel in the second prediction submodel.
In some embodiments of the present disclosure, the action prediction network model construction method further includes: constructing a joint optimization submodel, wherein the joint optimization submodel achieves convergence of the overall positioning loss function of the action prediction network model on the target categories by continuously and alternately optimizing the time sequence action positioning loss function over the first video data and the complete synthesized feature sequences of the second video data, the action recognition loss function over the foreground segments of the first video data and the second video data, and the adversarial training loss function, the target categories belonging to the action categories of the first video data or of the second video data.
In some embodiments of the present disclosure, as shown in fig. 3, the action prediction network model of the present disclosure may also include a joint optimization model 34.
FIG. 4 is a schematic diagram of some embodiments of an action prediction network model training method of the present disclosure. Preferably, the present embodiment may be performed by the action prediction network model training apparatus of the present disclosure. As shown in fig. 4, the action prediction network model training method of the present disclosure may include step 41, where:
and step 41, training the motion prediction network model by using first video data and second video data, so that the trained motion prediction network model is used for realizing time sequence motion positioning of the video data to be tested, wherein the first video data is long video data containing time domain marks, the second video data is motion short-segment video data, the video data to be tested is long video data not containing time domain marks, and the number of the first video data is smaller than that of the second video data.
Fig. 3 also illustrates other embodiments of the action prediction network model training method according to the present disclosure. The action prediction network model training method of the present disclosure is described below with reference to the action prediction network model in the embodiment of fig. 3.
In some embodiments of the present disclosure, as shown in fig. 3, the action prediction network model training method of the present disclosure may include steps S100 to S400, where:
and S100, performing motion recognition training and time sequence motion positioning training on the first pre-known sub-model by adopting the first video data.
In some embodiments of the present disclosure, as shown in fig. 3, step S100 may include steps S110-S130, wherein:
step S110, extracting an action segment from the first Video data as a Foreground (forkrounded) short segment, inputting the Foreground short segment into the first action recognition sub-model, and obtaining an action recognition result of the Foreground short segment, wherein the first Video data is an un-clipped long Video (Untrimmed Video).
In some embodiments of the present disclosure, as shown in fig. 3, step S110 may include first to fifth steps in which:
firstly, 3D CNN (Three-dimensional Convolutional network) is adopted as a basic network, and the segment level characteristics of the short foreground segment are extracted.
And secondly, combining the segment-level features of the foreground short segments to form a feature sequence.
And thirdly, generating Feature maps (Feature maps) under different scales by using a 1D Conv Net (One-dimensional Convolutional network) aiming at the Feature sequences of the foreground short segments.
And fourthly, carrying out Global pooling (Global pooling) operation on the feature map of the foreground short segment to obtain a feature vector under each time domain scale.
Fifthly, a first video-level classification weight matrix is adopted to map the feature vectors of the foreground short segments into the action recognition features of the foreground short segments, and the action recognition results of the foreground short segments are obtained from these action recognition features.
Step S120, inputting the first video data into the first time sequence action positioning sub-model, and obtaining the time sequence action positioning result of the first video data.
In some embodiments of the present disclosure, as shown in fig. 3, step S120 may include first to fifth steps in which:
firstly, extracting the segment level characteristics of first video data by adopting 3D CNN as a basic network.
Second, the segment-level features of the first video data are combined to form a feature sequence.
Thirdly, generating feature maps under different scales by using one 1D Conv Net aiming at the feature sequence of the first video data.
Fourthly, for the feature map of the long video, the anchor point feature (Background Cell) corresponding to the Background short segment in the long video and the anchor point feature (Foreground Cell) corresponding to the Foreground short segment in the long video are distinguished.
Fifthly, the first anchor-level classification weight matrix is applied to each feature anchor on the feature maps of the first video data to obtain the time sequence action positioning result of the first video data.
In some embodiments of the present disclosure, the fifth step may include: performing time sequence offset regression on the anchor features corresponding to the background short segments and on the anchor features corresponding to the foreground short segments according to the first anchor-level weight matrix (temporal action nomination); and performing action classification on the anchor features corresponding to the foreground short segments with the first anchor-level classification weight matrix generated in step S130.
Step S130, the action recognition of the foreground short segment and the time sequence action positioning of the first video data are bridged by the first weight transfer function.
In some embodiments of the present disclosure, as shown in fig. 3, step S130 may include: applying the weight migration function T to the first video-level classification weight matrix to generate the first anchor-level classification weight matrix.
Step S200, performing action recognition training and time sequence action positioning training on the second prediction submodel with the second video data, wherein the second video data are action moment video data.
In some embodiments of the present disclosure, the second video data may be a target short segment.
In some embodiments of the present disclosure, step S200 may include: operations similar to steps 101-115 are performed on the target short segment (Action Moment) and its contextual extension.
In some embodiments of the present disclosure, step S200 may include steps S210-S250, wherein:
step S210, extracting the segment-level features of the second video data, and combining the segment-level features of the second video data to form a feature sequence of the second video data.
In some embodiments of the present disclosure, as shown in fig. 3, step S210 may include: and extracting the segment level characteristics of the target short segment by adopting the 3D CNN as a basic network, and forming a characteristic sequence of the second video data by combination.
Step S220, hallucinating the pre- and post-action backgrounds of the second video data with the pre- and post-action backgrounds in the first video data as a guide, and combining the generated background features with the original second video data features to form a complete synthesized feature sequence.
In some embodiments of the present disclosure, as shown in fig. 3, step S220 may include: using two 1D convolutional networks as context generators (G1 and G2) to generate the background content features before and after the action, respectively.
Step S230, inputting the feature sequence of the second video data into the second motion recognition submodel, and obtaining a motion recognition result of the second video data.
In some embodiments of the present disclosure, step S230 may include: generating feature maps at different time domain scales from the feature sequence of the second video data through a 1D Conv Net; performing a global pooling operation on the second video data to obtain a feature vector at each time domain scale; mapping the feature vectors of the second video data with the second video-level classification weight matrix to obtain the action recognition features of the second video data; and obtaining the action recognition result of the second video data from these action recognition features.
Step S240, inputting the complete synthesized feature sequence into the second time sequence motion positioning sub-model, and obtaining a time sequence motion positioning result of the second video data.
In some embodiments of the present disclosure, step S240 may include: passing the complete synthesized feature sequence through a 1D Conv Net to generate feature maps at different time domain scales; and applying the second anchor-level classification weight matrix to each feature anchor on the feature maps of the context-extended version of the target short segment to obtain the time sequence action positioning result of the context-extended version of the target short segment.
In some embodiments of the present disclosure, step S240 may include: for the feature maps of the context-extended version of the second video data, distinguishing the hallucinated anchor features corresponding to the background short segments (fake background cells) from the anchor features corresponding to the second video data itself, i.e. the foreground short segments (foreground cells); performing time sequence offset regression on the anchor features corresponding to the background short segments and on the anchor features corresponding to the foreground short segments according to the anchor-level weight matrix; and performing action classification on the anchor features corresponding to the foreground short segments with the second anchor-level classification weight matrix generated in step S250.
Step S250, bridging the action recognition of the second video data and the time sequence action positioning of the context-extended version of the second video data through the second weight migration function.
In some embodiments of the present disclosure, step S250 may include: applying the second weight migration function to the second video-level classification weight matrix to generate the second anchor-level classification weight matrix.
In some embodiments of the present disclosure, the action-predictive network model training method may further include sharing weights of the first weight migration function and the second weight migration function.
The weight migration in step S130 and step S250 is further described below by way of specific examples.
In some embodiments of the present disclosure, features at the video segment level are extracted through a 3D network, combined into a feature sequence and then sent to a cascaded 1D convolutional network (anchor layer), to obtain different feature maps at 8 temporal scales.
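For illustration, the cascade of 1D convolutions (anchor layers) that produces feature maps at 8 temporal scales could look like the sketch below; the channel dimensions and the use of stride-2 convolutions are assumptions for the sketch rather than the disclosed configuration.

```python
import torch.nn as nn

class CascadedAnchorLayers(nn.Module):
    """Hypothetical cascade of strided 1D convolutions that turns a segment-level
    feature sequence into feature maps at 8 progressively coarser temporal scales."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_scales=8):
        super().__init__()
        layers, in_dim = [], feat_dim
        for _ in range(num_scales):
            layers.append(nn.Sequential(
                nn.Conv1d(in_dim, hidden_dim, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            in_dim = hidden_dim
        self.layers = nn.ModuleList(layers)

    def forward(self, feat_seq):       # feat_seq: (batch, feat_dim, T)
        maps = []
        x = feat_seq
        for layer in self.layers:      # each layer halves the temporal resolution
            x = layer(x)
            maps.append(x)             # one feature map per temporal scale
        return maps
```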
In some embodiments of the present disclosure, for un-clipped long videos, temporal boundary regression and action classification are generally optimized at each anchor point on the feature map. For foreground short segments or target-class short segments, a video-level feature vector is usually obtained through a global pooling operation for classification optimization.
In some embodiments of the present disclosure, temporal action localization is considered to consist of temporal action nomination and action classification, where the classification weights for a particular class can be derived from the parameters of the video-level recognition model. In order to establish the connection between the two, a weight migration function is provided.
In some embodiments of the present disclosure, on the j-th layer feature map of an input foreground short segment or target short segment, a whole feature vector may be obtained after global pooling. Then a weight matrix $w^{cls}_{j,c}$ is used to perform video-level classification for the action category c.
For the time sequence action localization of long videos, the above-described embodiment of the present disclosure uses a 1D convolutional layer with step size 1 acting on each anchor point of the feature map for anchor-level action classification (time sequence action nomination), and takes the weight for predicting category c in this 1D convolutional layer as the anchor-level classification weight $w^{loc}_{j,c}$.
For the same class c, a weight migration function $\mathcal{T}$ can be used to predict the anchor-level classification weight $w^{loc}_{j,c}$ from the video-level classification weight $w^{cls}_{j,c}$, as shown in the following equation (1):

$$w^{loc}_{j,c} = \mathcal{T}(w^{cls}_{j,c};\ \theta_j) \qquad (1)$$
In formula (1), $\theta_j$ is a parameter of the migration function and is independent of class c. The migration function may be implemented by several fully connected layers and different activation functions. By sharing $\theta_j$, the migration function can be widely applied to the action categories corresponding to the target short segments and used to predict the anchor-level action classification weights in the time sequence action positioning model. Therefore, the weight migration function $\mathcal{T}$ can be regarded as a bridge for carrying out time sequence action positioning learning with the video-level classification weights.
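A minimal sketch of such a migration function, assuming it is realized as a small MLP with shared parameters $\theta_j$ applied row-wise to the video-level weight matrix, is given below. The layer sizes and the LeakyReLU activation are assumptions for illustration.

```python
import torch
import torch.nn as nn

class WeightTransfer(nn.Module):
    """Hypothetical weight migration function T(.; theta_j): a small MLP that maps
    the video-level classification weight of class c at scale j (w_cls) to the
    anchor-level classification weight (w_loc). theta_j is shared across classes."""
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, dim))

    def forward(self, w_cls):           # w_cls: (num_classes, dim)
        return self.mlp(w_cls)          # predicted w_loc: (num_classes, dim)

# Usage sketch: the predicted weights act as a 1x1 conv applied at every anchor point.
# transfer = WeightTransfer(dim=512)
# w_loc = transfer(w_cls)                                  # (num_classes, 512)
# anchor_logits = torch.einsum('bdt,cd->bct', fmap, w_loc) # per-anchor class scores
```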
Step S300, through the confrontation training model, using the background before and after the action in the first video data as a guide, generate the magic background of the second video data.
In some embodiments of the present disclosure, step S300 may include: on each time domain scale, distinguishing anchor point characteristics corresponding to the background in the first video data and anchor point characteristics corresponding to the background generated by the second video data by using a discriminator; and performing magic generation on the background before and after the action of the second video data by using the background before and after the action in the first video data as a guide through the confrontation training model.
After the anchor-level classification weights are obtained through the migration function, the training method of the action positioning model cannot be used directly, because the target short segments do not contain the action background. Therefore, the above-described embodiment of the present disclosure introduces an Adversarial Training scheme, using the content before and after actions in the un-clipped long videos as a guide to magically generate the content information before and after the target segments. FIG. 5 is a diagram illustrating the generation of the contextual information of an action through adversarial learning in some embodiments of the present disclosure. Fig. 5 is a detailed illustration of the embodiment in fig. 3, and fig. 5 includes descriptions of step S120, step S210 to step S220, step S240, and step S300 in the embodiment in fig. 3.
In some embodiments of the present disclosure, as shown in fig. 3 and 5, the action-prediction network model may further include an antagonistic training model 33.
As shown in fig. 5, the above-described embodiment of the present disclosure denotes the feature sequences of the second video data (target short segment) and the first video data (un-clipped long video segment) output by the 3D network as $f_m$ and $f_u$. Taking $f_m$ as prior knowledge, two 1D convolutional networks are used as context generators ($G_1$ and $G_2$) to generate the background content features before and after the action, respectively. The complete synthesized feature sequence is obtained by splicing the generated front and back background content features with the original short-segment features, as shown in formula (2):

$$\hat{f}_m = G_1(f_m) \oplus f_m \oplus G_2(f_m) \qquad (2)$$

In formula (2), $\oplus$ represents the feature splicing operation. The synthesized complete feature sequence $\hat{f}_m$ of the target short segment and the un-clipped long video feature sequence $f_u$ are input into the 1D convolutional network to generate feature maps at different time domain scales. On each temporal scale, a Background Discriminator is used to distinguish the anchor point features corresponding to the background in the long video (the BG cells of the un-clipped video) from the anchor point features corresponding to the background generated for the target short segment (the BG cells of the synthesized segment). Finally, on the feature map at each scale, the time sequence boundary optimization and the anchor-level action classification optimization act on each anchor point feature to carry out action positioning training on the synthesized target short segments and the long videos.
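The following sketch illustrates, under stated assumptions, how the two context generators and the per-scale background discriminator could be organized; the fixed context length, layer sizes and interpolation step are hypothetical choices for the sketch, not the disclosed design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextGenerator(nn.Module):
    """Hypothetical 1D-conv context generator (G1 or G2): from the short-segment
    features f_m it generates background features before or after the action."""
    def __init__(self, dim=2048, ctx_len=16):
        super().__init__()
        self.ctx_len = ctx_len
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1))

    def forward(self, f_m):                         # f_m: (batch, dim, T_m)
        ctx = self.net(f_m)
        # resample to a fixed context length (an assumption for this sketch)
        return F.interpolate(ctx, size=self.ctx_len, mode='linear', align_corners=False)

class BackgroundDiscriminator(nn.Module):
    """Hypothetical per-scale discriminator separating real background anchor features
    (from un-clipped long videos) from generated background anchor features."""
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

    def forward(self, anchor_feats):                # (num_anchors, dim)
        return self.fc(anchor_feats)                # real/fake logits

# Formula (2) as a sketch: splice generated contexts around the original features.
# f_hat_m = torch.cat([g1(f_m), f_m, g2(f_m)], dim=2)
```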
In step S400, the whole network obtains the final model through Joint Learning of the time sequence nomination loss function (Proposal Loss), the classification loss function (Classification Loss) and the countermeasure loss function (Adversarial Loss).
In some embodiments of the present disclosure, step S400 may include: the convergence of the overall positioning loss function of the action prediction network model on the target category is realized by continuously and alternately optimizing a time sequence action positioning loss function of a complete feature sequence synthesized by the first video data and the second video data, an action recognition loss function of a foreground segment of the first video data and the second video data, and an antagonistic training optimization loss function, wherein the target category is the action category belonging to the first video data or the second video data.
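A minimal sketch of one training step under this alternating optimization is shown below. The helper methods `background_cells` and `localization_and_recognition_losses`, and the assumption that `opt_main` holds only the generator/localization parameters while `opt_disc` holds only the discriminator parameters, are hypothetical and stand in for the patent's unspecified implementation details.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def training_step(batch, model, discriminator, opt_main, opt_disc):
    """Hypothetical alternating optimization of the three losses named in step S400:
    time sequence positioning/nomination loss, classification loss, adversarial loss."""
    # 1) Discriminator step: real long-video background cells vs. generated ones.
    real_bg, fake_bg = model.background_cells(batch)          # assumed helper
    d_loss = bce(discriminator(real_bg.detach()), torch.ones(len(real_bg), 1)) + \
             bce(discriminator(fake_bg.detach()), torch.zeros(len(fake_bg), 1))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # 2) Main step: positioning + recognition losses, plus fooling the discriminator.
    loc_loss, cls_loss = model.localization_and_recognition_losses(batch)  # assumed helper
    adv_loss = bce(discriminator(fake_bg), torch.ones(len(fake_bg), 1))
    total = loc_loss + cls_loss + adv_loss
    opt_main.zero_grad(); total.backward(); opt_main.step()
    return d_loss.item(), total.item()
```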
The action prediction network model training method provided by the embodiments of the present disclosure is suitable for a time sequence action positioning model over large-scale action categories, and the training data are action short segments corresponding to those categories and a small number of long videos with time sequence labels of other categories. Accordingly, the above embodiments of the present disclosure provide an Action Herald Network. In the network, a weight transfer function is included to bridge the recognition of short action segments and the time sequence action positioning of long videos. Meanwhile, based on the background content information in the existing long videos, the front and back background content information of each short segment is generated through an adversarial network. Finally, the generated background features are combined with the original short-segment features to form a complete artificial feature sequence, and action positioning model training is performed on this basis to obtain a time sequence action positioning model for the categories corresponding to the short-segment data set.
Fig. 6 is a schematic diagram of some embodiments of a video timing action positioning method according to the present disclosure. Preferably, the present embodiment can be executed by the video time sequence action positioning device of the present disclosure. As shown in fig. 6, the video time sequence action positioning method of the present disclosure may include steps 61 and 62, where:
step 61, inputting the video data to be tested into a motion prediction network model, wherein the motion prediction network model is obtained by training according to the motion prediction network model training method described in any of the above embodiments (for example, any of the embodiments in fig. 1 to 4).
In some embodiments of the present disclosure, the video data under test is a new original video belonging to the second video data action category.
In some embodiments of the present disclosure, the motion category of the video data under test belongs to a motion category of the first video data or a motion category of the second video data.
In some embodiments of the present disclosure, the video data to be tested is video that is not cropped; the video data to be tested comprises an action segment and a non-action background segment.
And step 62, performing time sequence action positioning on the video data to be detected.
In some embodiments of the present disclosure, step 62 may comprise: and performing time sequence action positioning on the video data to be detected by adopting a second time sequence action positioning sub-model in the action prediction network model obtained by training the action prediction network model training method in any embodiment.
In some embodiments of the present disclosure, step 62 may comprise: and finding the time position of the action through the trained model, and classifying the action of the segment.
In some embodiments of the present disclosure, step 62 may include step S210 and step S240 of the fig. 3 or fig. 5 embodiments.
In some embodiments of the present disclosure, step 62 may include steps 621-624, wherein:
step 621, extracting segment level features from the video data to be detected, combining the segment level features to form a feature sequence, and generating feature maps under different time domain scales.
In some embodiments of the present disclosure, step 621 may comprise: firstly, extracting a 1-dimensional characteristic sequence of the video data to be detected through a 3D CNN network; generating feature maps at different time-domain scales by using a 1D Conv Net.
Step 622, generating anchor point characteristics on the anchor point layer under each time domain scale, and reversely deducing the corresponding time segment position in the original video data to be detected through each anchor point characteristic.
Step 623, performing action classification and time sequence offset regression on the anchor point features to obtain a candidate action positioning result, wherein a second anchor point level classification weight matrix for performing action classification on the anchor point features is obtained by migration of a second video level classification weight matrix, and the action positioning result comprises action classification scores, time segment positions and overlapping scores.
In some embodiments of the present disclosure, the classification score is a confidence prediction for the corresponding action category. The overlap score specifically refers to the degree of overlap (intersection over union) between the time segment corresponding to the anchor point and the real action segment (temporal label); it is a regressed value, and the larger the value, the more likely the segment contains an action (i.e., the more it overlaps with the temporal label).
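For reference, the intersection-over-union of two temporal segments can be computed as in the sketch below; this only illustrates the quantity the overlap score approximates, not the patent's exact regression target.

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two temporal segments given as (start, end) in seconds."""
    start = max(seg_a[0], seg_b[0])
    end = min(seg_a[1], seg_b[1])
    inter = max(0.0, end - start)
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. temporal_iou((12.0, 18.5), (13.0, 20.0)) == 0.6875
```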
And step 624, determining a final action positioning result from the candidate action positioning results according to the positioning ranking score.
In some embodiments of the present disclosure, step 624 may include: multiplying the action classification score by the overlap score to obtain the final positioning ranking score, and removing redundant results with large overlap through an NMS (Non-Maximum Suppression) algorithm to obtain the final time sequence action positioning results on the target short-segment categories.
In some embodiments of the present disclosure, the ranking score may specifically be the product of the classification score and the overlap score, which is the basis for ranking all candidate results. After NMS post-processing, the top-ranked results are taken as the final positioning results.
In some embodiments of the present disclosure, step 624 may include: after the ranking scores of the results are sorted in descending order, taking the top several positioning results as the final results.
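A sketch of this post-processing (ranking by the product score, then temporal NMS) is given below; it reuses the `temporal_iou` helper from the previous sketch, and the candidate dictionary keys and thresholds are assumptions for illustration.

```python
def nms_temporal(candidates, iou_threshold=0.5, top_k=100):
    """Hypothetical post-processing: rank candidates by cls_score * overlap_score,
    then suppress highly overlapping (redundant) temporal segments.
    Each candidate: {'segment': (start, end), 'cls_score': float, 'overlap_score': float}."""
    ranked = sorted(candidates,
                    key=lambda c: c['cls_score'] * c['overlap_score'],
                    reverse=True)
    keep = []
    for cand in ranked:
        if all(temporal_iou(cand['segment'], k['segment']) < iou_threshold for k in keep):
            keep.append(cand)
        if len(keep) >= top_k:
            break
    return keep
```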
According to the video time sequence action positioning method disclosed by the embodiment of the disclosure, action positioning of large-scale action categories can be realized, and the number of action categories which can be applied is greatly increased.
Fig. 7 is a schematic diagram of some embodiments of an action-predictive network model building apparatus according to the present disclosure. As shown in fig. 7, the action-aware network model building apparatus of the present disclosure may include a first building module 71 and a second building module 72, wherein:
the first constructing module 71 is configured to construct a first prediction submodel, where the first prediction submodel is configured to perform motion recognition training and timing motion positioning training by using first video data, where the first video data is long video data including a time domain label;
and a second constructing module 72, configured to construct a second predictive submodel, where the second predictive submodel is configured to perform motion recognition training and timing motion positioning training by using second video data, the second video data is motion short segment video data, and the number of the first video data is smaller than the number of the second video data.
In some embodiments of the present disclosure, the action-anticipation network model comprises a first anticipation sub-model and a second anticipation sub-model; the trained action prediction network model is used for realizing the time sequence action positioning of the video data to be tested, wherein the video data to be tested is long video data without time domain marks.
In some embodiments of the present disclosure, the action-aware network model building apparatus of the present disclosure may be configured to perform operations of the action-aware network model building method according to any of the above embodiments of the present disclosure.
According to another aspect of the present disclosure, a motion-predicted network model training device is provided, where the motion-predicted network model training device may be configured to train a motion-predicted network model by using first video data and second video data, so that the trained motion-predicted network model is used to implement time sequence motion positioning of video data to be tested, where the first video data is long video data including a time domain label, the second video data is motion short segment video data, the video data to be tested is un-clipped long video data, the number of the first video data is smaller than the number of the second video data, and a motion category of the video data to be tested belongs to a motion category of the first video data or a motion category of the second video data.
In some embodiments of the present disclosure, the time series action positioning may include: an action category and an action start time of the video data are determined.
In some embodiments of the present disclosure, the action-anticipation network model may include a first anticipation sub-model and a second anticipation sub-model.
In some embodiments of the present disclosure, the action-prediction network model training apparatus may be configured to perform action recognition training and timing action positioning training on the first prediction submodel using the first video data; and performing action recognition training and time sequence action positioning training on the second pre-known sub-model by adopting second video data.
In some embodiments of the present disclosure, the first pre-known submodel may include a first motion recognition submodel and a first time series motion localization submodel.
In some embodiments of the present disclosure, the action prediction network model training device may be configured to extract an action fragment from the first video data as a foreground short fragment, input the foreground short fragment into the first action recognition submodel, and obtain an action recognition result of the foreground short fragment, under the condition that the first video data is used to perform action recognition training and timing sequence action positioning training on the first prediction submodel; inputting the first video data into a first time sequence action positioning sub-model to obtain a time sequence action positioning result of the first video data; motion recognition of the foreground short segment and time sequence motion positioning of the first video data are bridged by a first weight transfer function.
In some embodiments of the present disclosure, the action-prediction network model training device may be configured to extract segment-level features of the foreground short segment, combine the segment-level features of the foreground short segment to form a feature sequence, and generate feature maps in different time-domain scales, when the foreground short segment is input into the first action recognition submodel and the action recognition result of the foreground short segment is obtained; performing global pooling operation on the feature map of the foreground short segment to obtain a feature vector under each time domain scale; mapping the feature vectors of the foreground short segments by adopting a first video level classification weight matrix to be used as action identification features of the foreground short segments; and obtaining the action recognition result of the foreground short segment according to the action recognition characteristics of the foreground short segment.
In some embodiments of the present disclosure, the action-prediction network model training apparatus may be configured to, when the first video data is input into the first time-series action positioning sub-model and a time-series action positioning result of the first video data is obtained, extract segment-level features of the first video data, combine the segment-level features of the first video data to form a feature sequence, and generate feature maps at different time-domain scales; and adopting the first anchor point level classification weight matrix to act on each feature anchor point on the first video data feature map to obtain a time sequence action positioning result of the first video data.
In some embodiments of the present disclosure, the action-prediction network model training apparatus may be configured to apply a weight transfer function to the first video-level classification weight matrix to generate the first anchor-level classification weight matrix in a case where action recognition of the foreground short segment and timing action positioning of the first video data are bridged by the weight transfer function.
In some embodiments of the present disclosure, the second prediction submodel may include a second motion recognition submodel and a second sequential motion localization submodel.
In some embodiments of the present disclosure, the motion prediction network model training apparatus may be configured to, in a case where motion recognition training and timing motion positioning training are performed on the second prediction sub-model by using the second video data, extract segment-level features of the second video data, and combine the segment-level features of the second video data to form a feature sequence of the second video data; performing magic generation on the background before and after the action of the second video data by using the background before and after the action in the first video data as a guide, and combining the generated background feature and the original second video data feature to form a complete synthesized feature sequence; inputting the characteristic sequence of the second video data into a second action recognition submodel to obtain an action recognition result of the second video data; inputting the complete synthesized characteristic sequence into a second time sequence action positioning sub-model to obtain a time sequence action positioning result of second video data; and bridging the action identification of the second video data and the time sequence action positioning of the front and back content extended versions of the second video data through a second weight transfer function.
In some embodiments of the present disclosure, the action-prediction network model training apparatus may be configured to generate feature maps in different time domain scales according to the feature sequence of the second video data when the feature sequence of the second video data is input into the second action recognition submodel and the action recognition result of the second video data is obtained; performing global pooling operation on the second video data to obtain a feature vector under each time domain scale; mapping the feature vector of the second video data by adopting a second video level classification weight matrix to be used as the action identification feature of the second video data; and acquiring a motion recognition result of the second video data according to the motion recognition characteristics of the second video data.
In some embodiments of the present disclosure, the action prediction network model training apparatus may be configured to, in the case of inputting the complete synthesized feature sequence into the second time sequence action positioning sub-model and obtaining the time sequence action positioning result of the second video data, generate feature maps at different time domain scales according to the complete synthesized feature sequence; and apply the second anchor point level classification weight matrix to each feature anchor point on the feature map of the front and rear content extended versions of the target short segment to obtain the time sequence action positioning result of the front and rear content extended versions of the target short segment.
In some embodiments of the present disclosure, the action-prediction network model training apparatus may be configured to apply the second weight transfer function to the second video-level classification weight matrix to generate a second anchor-level classification weight matrix in a case where the action recognition of the second video data and the time-series action positioning of the front-back content extension version of the second video data are bridged by the second weight transfer function.
In some embodiments of the present disclosure, the action-predictive network model training device may be further configured to share the weights of the first weight migration function and the second weight migration function.
In some embodiments of the present disclosure, as shown in fig. 3 and 5, the action-prediction network model further includes an antagonistic training model 33.
In some embodiments of the present disclosure, the action-aware network model training apparatus may be configured to distinguish, on each time domain scale, anchor point features corresponding to a background in the first video data and anchor point features corresponding to a background generated by the second video data, using a discriminator, in a case where a pre-and-post-action background in the first video data is used as a guide to perform magic generation on a pre-and-post-action background of the second video data; and performing magic generation on the background before and after the action of the second video data by using the background before and after the action in the first video data as a guide through the confrontation training model.
In some embodiments of the present disclosure, the motion-predicted network model training apparatus may further be configured to implement convergence of the overall positioning loss function of the motion-predicted network model on a target class by continuously and alternately optimizing a time-series motion positioning loss function of a complete feature sequence synthesized by the first video data and the second video data, a motion recognition loss function of a foreground segment of the first video data and the second video data, and a confrontation training optimization loss function, where the target class is a motion class belonging to the first video data or a motion class of the second video data.
In some embodiments of the present disclosure, the action-predictive network model training apparatus of the present disclosure may be used to perform operations of the action-predictive network model training method according to any of the above embodiments of the present disclosure.
The above-described embodiments of the present disclosure are, for the first time, directed to training a time sequence action positioning model from short segments. The embodiments creatively provide a method for building a bridge between action recognition and time sequence action positioning using weight migration, and a method for modeling the background features of the content before and after a short segment using a confrontation generation model, and on this basis complete the optimization of the action positioning model on the target categories, thereby successfully extending the time sequence action positioning model to a much wider set of action categories.
FIG. 8 is a schematic diagram of some embodiments of a video timing motion positioning apparatus according to the present disclosure. As shown in fig. 8, the video time sequence action positioning apparatus of the present disclosure may include a data input module 81 and a positioning module 82, wherein:
the data input module 81 is configured to input video data to be tested into an action prediction network model, where the action prediction network model is obtained by training according to the action prediction network model training method according to any one of the embodiments.
And the positioning module 82 is used for performing time sequence action positioning on the video data to be detected.
In some embodiments of the present disclosure, the positioning module 82 may be configured to extract segment-level features from the video data to be detected, combine the segment-level features to form a feature sequence, and generate feature maps at different time domain scales; generating anchor point characteristics on the anchor point layer under each time domain scale, and reversely deducing the corresponding time segment position in the original video data to be detected through each anchor point characteristic; performing action classification and time sequence offset regression on the anchor point characteristics to obtain a candidate action positioning result, wherein a second anchor point level classification weight matrix for performing action classification on the anchor point characteristics is obtained through migration of a second video level classification weight matrix; and determining a final action positioning result from the candidate action positioning results according to the positioning sorting score.
In some embodiments of the present disclosure, the video time sequence action positioning apparatus of the present disclosure may be used to perform the operations of the video time sequence action positioning method according to any of the above embodiments.
The video time sequence action positioning device disclosed by the embodiment of the disclosure can realize action positioning of large-scale action categories, and greatly improves the number of action categories which can be applied.
FIG. 9 is a schematic diagram of some embodiments of a computer apparatus of the present disclosure. As shown in fig. 9, the computer apparatus of the present disclosure may include a memory 91 and a processor 92, wherein:
a memory 91 for storing instructions.
A processor 92, configured to execute the instructions, so that the computer device performs operations of implementing the action-prediction network model building method according to any one of the above embodiments, the action-prediction network model training method according to any one of the above embodiments, or the video time sequence action positioning method according to any one of the above embodiments.
Based on the computer device provided by the above embodiment of the present disclosure, through a concept of migration learning, a time sequence action positioning model suitable for large-scale action categories is learned, and training data are action short segments corresponding to the categories and a small part of long videos with time sequence labels in other categories. Accordingly, the above embodiments of the present disclosure provide an Action Herald Networks (Action Herald Networks). In the network, a weight transfer function is included to bridge the identification of short segments of motion and the positioning of the temporal motion of long videos. Meanwhile, the background content information in the existing long video is used as a basis, and the front background content information and the rear background content information of each short segment are generated through a countermeasure network. And finally, combining the generated background features and the original short segment features to form a complete artificial feature sequence, and performing motion positioning model training on the basis to obtain a time sequence motion positioning model of the corresponding category of the short segment data set.
The embodiment of the disclosure can realize action positioning of large-scale action categories, and greatly improve the number of the action categories which can be applied.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium is provided, wherein the computer-readable storage medium stores computer instructions, which when executed by a processor, implement the action-aware network model building method according to any of the above embodiments, the action-aware network model training method according to any of the above embodiments, or the video time-series action positioning method according to any of the above embodiments.
Based on the non-transitory computer-readable storage medium provided by the above embodiments of the present disclosure, the training of a time sequence action positioning model from short segments is addressed for the first time. The embodiments creatively provide a method for building a bridge between action recognition and time sequence action positioning using weight migration, and a method for modeling the background features of the content before and after a short segment using a confrontation generation model, and on this basis complete the optimization of the action positioning model on the target categories, thereby successfully extending the time sequence action positioning model to a much wider set of action categories.
The predictive network model building apparatus, video timing action positioning apparatus and computer apparatus described above may be implemented as a general purpose processor, a Programmable Logic Controller (PLC), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components or any suitable combination thereof designed to perform the functions described herein.
Thus far, the present disclosure has been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware to implement the above embodiments, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (25)

1. A method for constructing an action prediction network model is characterized by comprising the following steps:
constructing a first pre-known submodel, wherein the first pre-known submodel is used for performing action recognition training and time sequence action positioning training by adopting first video data, and the first video data are long video data containing time domain marks;
constructing a second pre-known submodel, wherein the second pre-known submodel is used for performing action recognition training and time sequence action positioning training by adopting second video data, the second video data are action short-segment video data, and the quantity of the first video data is less than that of the second video data;
wherein the action-prediction network model comprises a first prediction submodel and a second prediction submodel; the trained action prediction network model is used for realizing the time sequence action positioning of the video data to be tested, wherein the video data to be tested is long video data without time domain marks.
2. The method according to claim 1, wherein the constructing the first prediction submodel comprises:
constructing a fragment extraction submodel, wherein the fragment extraction submodel is used for extracting action fragments from the first video data as foreground short fragments;
constructing a first action recognition submodel, wherein the first action recognition submodel is used for carrying out action recognition training by adopting the short foreground fragment;
constructing a first time sequence action positioning sub-model, wherein the first time sequence action positioning sub-model is used for performing time sequence action positioning training by adopting the first video data;
and constructing a first weight transfer submodel, wherein the first weight transfer submodel is used for bridging the action recognition of the foreground short segment and the time sequence action positioning of the first video data through a first weight transfer function.
3. The method according to claim 1, wherein the constructing a second predictive sub-model comprises:
constructing a feature sequence generation submodel, wherein the feature sequence generation submodel is used for extracting the segment level features of the second video data and combining the segment level features of the second video data to form a feature sequence of the second video data;
constructing a background generator, wherein the background generator utilizes the background before and after the action in the first video data as a guide to perform magic generation on the background before and after the action of the second video data, and combines the generated background feature and the original second video data feature to form a complete synthesized feature sequence;
constructing a second action recognition submodel, wherein the second action recognition submodel is used for carrying out action recognition training by adopting a characteristic sequence of second video data;
constructing a second time sequence action positioning sub-model, wherein the second time sequence action positioning sub-model is used for performing time sequence action positioning training by adopting a complete synthetic characteristic sequence;
and constructing a second weight transfer submodel, wherein the second weight transfer submodel is used for bridging the action recognition of the second video data and the time sequence action positioning of the front and back content extension versions of the second video data.
4. The method of constructing a behavior prediction network model according to claim 3, further comprising:
and constructing a confrontation training model, and distinguishing the anchor point characteristics corresponding to the background in the first video data and the anchor point characteristics corresponding to the background generated by the second video data by using a discriminator on each time domain scale, so that the background generator can utilize the background before and after the action in the first video data as a guide to perform magic generation on the background before and after the action of the second video data.
5. The method for constructing a behavior prediction network model according to any one of claims 1 to 4, further comprising:
and constructing a weight sharing submodel for sharing weights of the first weight migration function and the second weight migration function, wherein the first weight migration function is a first weight migration function of a first weight transfer submodel in the first pre-known submodel, and the second weight migration function is a second weight migration function of a second weight transfer submodel in the second pre-known submodel.
6. The method for constructing a behavior prediction network model according to any one of claims 1 to 4, further comprising:
and constructing a joint optimization submodel, wherein the joint optimization submodel is used for realizing the convergence of the overall positioning loss function of the action predictive network model on the target category through continuously and alternately optimizing a time sequence action positioning loss function of a complete characteristic sequence synthesized by the first video data and the second video data, an action recognition loss function of a foreground fragment of the first video data and the second video data and an antagonistic training optimization loss function, and the target category belongs to the action category of the first video data or the action category of the second video data.
7. A motion prediction network model training method is characterized by comprising the following steps:
the method comprises the steps of training an action prediction network model by adopting first video data and second video data, so that the trained action prediction network model is used for realizing time sequence action positioning of video data to be tested, wherein the first video data are long video data containing time domain marks, the second video data are action short-segment video data, the video data to be tested are long video data not containing the time domain marks, and the number of the first video data is smaller than that of the second video data.
8. The action-aware network model training method according to claim 7, wherein the action-aware network model is constructed by the action-aware network model construction method according to any one of claims 1 to 6.
9. The method according to claim 7 or 8, wherein the motion category of the video data to be tested belongs to a motion category of first video data or a motion category of second video data.
10. The action-anticipation network model training method of claim 7 or 8, wherein the action-anticipation network model comprises a first anticipation sub-model and a second anticipation sub-model;
the training of the motion prediction network model by using the first video data and the second video data comprises:
performing action recognition training and time sequence action positioning training on the first pre-known sub-model by adopting first video data;
performing action recognition training and timing sequence action positioning training on the second pre-known sub-model by adopting second video data;
the time sequence action positioning comprises: an action category and an action start time of the video data are determined.
11. The action-prediction network model training method of claim 10, wherein the first prediction submodel comprises a first action recognition submodel and a first time-series action positioning submodel;
the action recognition training and the time sequence action positioning training of the first pre-known submodel by adopting the first video data comprise the following steps:
extracting action fragments from the first video data as foreground short fragments, inputting the foreground short fragments into a first action recognition sub-model, and obtaining action recognition results of the foreground short fragments;
inputting the first video data into a first time sequence action positioning sub-model to obtain a time sequence action positioning result of the first video data;
motion recognition of the foreground short segment and time sequence motion positioning of the first video data are bridged by a first weight transfer function.
12. The method for training the action prediction network model according to claim 11, wherein the step of inputting the foreground short segment into the first action recognition submodel and the step of obtaining the action recognition result of the foreground short segment comprises: extracting segment level characteristics of the foreground short segments, combining the segment level characteristics of the foreground short segments to form a characteristic sequence, and generating characteristic maps under different time domain scales; performing global pooling operation on the feature map of the foreground short segment to obtain a feature vector under each time domain scale; mapping the feature vectors of the foreground short segments by adopting a first video level classification weight matrix to be used as action identification features of the foreground short segments; acquiring a motion recognition result of the foreground short segment according to the motion recognition characteristics of the foreground short segment;
the inputting the first video data into the first time sequence action positioning sub-model and the obtaining the time sequence action positioning result of the first video data comprise: extracting segment level features of the first video data, combining the segment level features of the first video data to form a feature sequence, and generating feature maps under different time domain scales; the method comprises the steps that a first anchor point level classification weight matrix is adopted to act on each feature anchor point on a first video data feature map, and a time sequence action positioning result of first video data is obtained;
the step of bridging the action recognition of the foreground short segment and the time sequence action positioning of the first video data through the weight transfer function comprises the following steps:
and applying a weight transfer function to the first video level classification weight matrix to generate a first anchor point level classification weight matrix.
13. The action-prediction network model training method of claim 11, wherein the second prediction submodel comprises a second action recognition submodel and a second time-series action positioning submodel;
the action recognition training and the time sequence action positioning training of the second pre-known sub-model by adopting the second video data comprise the following steps:
extracting the segment level characteristics of the second video data, and combining the segment level characteristics of the second video data to form a characteristic sequence of the second video data;
performing magic generation on the background before and after the action of the second video data by using the background before and after the action in the first video data as a guide, and combining the generated background feature and the original second video data feature to form a complete synthesized feature sequence;
inputting the characteristic sequence of the second video data into a second action recognition submodel to obtain an action recognition result of the second video data;
inputting the complete synthesized characteristic sequence into a second time sequence action positioning sub-model to obtain a time sequence action positioning result of second video data;
and bridging the action identification of the second video data and the time sequence action positioning of the front and back content extended versions of the second video data through a second weight transfer function.
14. The method of claim 13, wherein the inputting the feature sequence of the second video data into the second motion recognition submodel and the obtaining the motion recognition result of the second video data comprises: generating feature maps under different time domain scales according to the feature sequence of the second video data; performing global pooling operation on the second video data to obtain a feature vector under each time domain scale; mapping the feature vector of the second video data by adopting a second video level classification weight matrix to be used as the action identification feature of the second video data; acquiring a motion recognition result of the second video data according to the motion recognition feature of the second video data;
the step of inputting the complete synthesized characteristic sequence into the second time sequence action positioning sub-model and acquiring the time sequence action positioning result of the second video data comprises the following steps: generating feature maps under different time domain scales according to the complete synthesized feature sequence; adopting a second anchor point level classification weight matrix to act on each feature anchor point on the feature map of the content extension versions before and after the target short segment, and obtaining the time sequence action positioning result of the content extension versions before and after the target short segment;
the step of bridging the action recognition of the second video data and the time sequence action positioning of the front and back content extension versions of the second video data through the second weight transfer function comprises:
and applying a second weight migration function to the second video level classification weight matrix to generate a second anchor point level classification weight matrix.
15. The method of training a motion-predictive network model according to claim 13, further comprising:
and sharing the weights of the first weight migration function and the second weight migration function.
16. The action-predictive network model training method of claim 13, wherein the action-predictive network model comprises a confrontation training model;
the generating the magic of the background before and after the action of the second video data by using the background before and after the action in the first video data as a guide comprises:
on each time domain scale, distinguishing anchor point characteristics corresponding to the background in the first video data and anchor point characteristics corresponding to the background generated by the second video data by using a discriminator;
and performing magic generation on the background before and after the action of the second video data by using the background before and after the action in the first video data as a guide through the confrontation training model.
17. The method of training a motion-predictive network model according to claim 13, further comprising:
the convergence of the overall positioning loss function of the action prediction network model on the target category is realized by continuously and alternately optimizing a time sequence action positioning loss function of a complete feature sequence synthesized by the first video data and the second video data, an action recognition loss function of a foreground segment of the first video data and the second video data, and an antagonistic training optimization loss function, wherein the target category is the action category belonging to the first video data or the action category belonging to the second video data.
18. A video time sequence action positioning method is characterized by comprising the following steps:
inputting video data to be tested into a motion prediction network model, wherein the motion prediction network model is obtained by training according to the motion prediction network model training method of any one of claims 7-17;
and performing time sequence action positioning on the video data to be detected by adopting an action prediction network model.
19. The video time series motion positioning method according to claim 18, wherein the time series motion positioning of the video data to be measured by using the motion prediction network model comprises:
extracting segment level characteristics from video data to be detected, combining the segment level characteristics to form a characteristic sequence, and generating characteristic maps under different time domain scales;
generating anchor point characteristics on the anchor point layer under each time domain scale, and reversely deducing the corresponding time segment position in the video data to be detected through each anchor point characteristic;
performing action classification and time sequence offset regression on the anchor point characteristics to obtain a candidate action positioning result, wherein a second anchor point level classification weight matrix for performing action classification on the anchor point characteristics is obtained through migration of a second video level classification weight matrix;
and determining a final action positioning result from the candidate action positioning results according to the positioning sorting score.
20. The video time series motion positioning method according to claim 18 or 19, wherein the time series motion positioning of the video data to be measured by using the motion prediction network model comprises:
and performing time sequence action positioning on the video data to be detected by adopting a second time sequence action positioning sub-model in the action prediction network model obtained by training the action prediction network model training method according to any one of claims 13 to 17.
21. An action-prediction network model building device, comprising:
the system comprises a first construction module, a second construction module and a third construction module, wherein the first construction module is used for constructing a first prediction submodel, the first prediction submodel is used for performing action recognition training and time sequence action positioning training by adopting first video data, and the first video data are long video data containing time domain labels;
the second construction module is used for constructing a second pre-known submodel, wherein the second pre-known submodel is used for performing action recognition training and time sequence action positioning training by adopting second video data, the second video data are action short segment video data, and the quantity of the first video data is less than that of the second video data;
wherein the action-prediction network model comprises a first prediction submodel and a second prediction submodel; the trained action prediction network model is used for realizing the time sequence action positioning of the video data to be tested, wherein the video data to be tested is long video data without time domain marks.
22. The action prediction network model training device is characterized in that the action prediction network model training device is used for training an action prediction network model by adopting first video data and second video data, so that the trained action prediction network model is used for realizing time sequence action positioning of video data to be tested, wherein the first video data is long video data containing time domain marks, the second video data is action short-segment video data, the video data to be tested is uncut long video data, the number of the first video data is smaller than that of the second video data, and the action category of the video data to be tested belongs to the action category of the first video data or the action category of the second video data.
23. A video time series motion positioning apparatus, comprising:
a data input module, configured to input video data to be tested into a motion prediction network model, where the motion prediction network model is obtained by training according to the motion prediction network model training method according to any one of claims 7 to 17;
and the positioning module is used for carrying out time sequence action positioning on the video data to be detected by adopting an action prediction network model.
24. A computer device, comprising:
a memory to store instructions;
a processor for executing the instructions to cause the computer apparatus to perform operations to implement the action-aware network model building method of any one of claims 1-6, the action-aware network model training method of any one of claims 7-17, or the video temporal action localization method of claim 18 or 19.
25. A non-transitory computer-readable storage medium storing computer instructions which, when executed by a processor, implement the action-aware network model construction method of any one of claims 1-6, the action-aware network model training method of any one of claims 7-17, or the video temporal action localization method of claim 18 or 19.
CN202010851281.7A 2020-08-21 2020-08-21 Model construction and training method and device, and time sequence action positioning method and device Pending CN112307885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010851281.7A CN112307885A (en) 2020-08-21 2020-08-21 Model construction and training method and device, and time sequence action positioning method and device


Publications (1)

Publication Number Publication Date
CN112307885A true CN112307885A (en) 2021-02-02

Family

ID=74483195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010851281.7A Pending CN112307885A (en) 2020-08-21 2020-08-21 Model construction and training method and device, and time sequence action positioning method and device

Country Status (1)

Country Link
CN (1) CN112307885A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184406A (en) * 2009-11-11 2011-09-14 索尼公司 Information processing device, information processing method, and program
US20200012725A1 (en) * 2018-07-05 2020-01-09 Disney Enterprises, Inc. Alignment of video and textual sequences for metadata analysis
CN111460876A (en) * 2019-06-05 2020-07-28 北京京东尚科信息技术有限公司 Method and apparatus for identifying video
CN110751224A (en) * 2019-10-25 2020-02-04 Oppo广东移动通信有限公司 Training method of video classification model, video classification method, device and equipment
CN111291699A (en) * 2020-02-19 2020-06-16 山东大学 Substation personnel behavior identification method based on monitoring video time sequence action positioning and abnormity detection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FUCHEN LONG,TING YAO,ZHAOFAN QIU,XINMEI TIAN,JIEBO LUO,TAO MEI: "Learning to Localize Actions from Moments", HTTPS://WWW.ECVA.NET/PAPERS/ECCV_2020/PAPERS_ECCV/PAPERS/123480137.PDF, 17 July 2020 (2020-07-17), pages 1 - 14 *
SANDRA AIGNER, MARCO KÖRNER: "[1810.01325] FutureGAN: Anticipating the Future Frames of Video Sequences using Spatio-Temporal 3d Convolutions in Progressively Growing GANs", ARXIV, 26 November 2018 (2018-11-26) *

Similar Documents

Publication Publication Date Title
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
KR102001798B1 (en) Circular networks by motion-based attention for video understanding
CN109697434B (en) Behavior recognition method and device and storage medium
CN110633745B (en) Image classification training method and device based on artificial intelligence and storage medium
CN107330074B (en) Image retrieval method based on deep learning and Hash coding
EP3561734A1 (en) Generating a machine learning model for objects based on augmenting the objects with physical properties
CN105844283B (en) Method, image search method and the device of image classification ownership for identification
CN111741330B (en) Video content evaluation method and device, storage medium and computer equipment
CN108694225A (en) A kind of image search method, the generation method of feature vector, device and electronic equipment
CN109522450A (en) A kind of method and server of visual classification
CN114004328A (en) AI model updating method, device, computing equipment and storage medium
Han et al. CookGAN: Meal image synthesis from ingredients
CN112381227B (en) Neural network generation method and device, electronic equipment and storage medium
CN112131944B (en) Video behavior recognition method and system
CN113344016A (en) Deep migration learning method and device, electronic equipment and storage medium
CN108304376A (en) Determination method, apparatus, storage medium and the electronic device of text vector
CN114492601A (en) Resource classification model training method and device, electronic equipment and storage medium
CN109871891A (en) A kind of object identification method, device and storage medium
CN113609819B (en) Punctuation mark determination model and determination method
CN108810551B (en) Video frame prediction method, terminal and computer storage medium
CN111914949B (en) Zero sample learning model training method and device based on reinforcement learning
CN117173512A (en) Flow detection model training method, flow detection device and electronic equipment
JP7073171B2 (en) Learning equipment, learning methods and programs
CN115713669B (en) Image classification method and device based on inter-class relationship, storage medium and terminal
CN112307885A (en) Model construction and training method and device, and time sequence action positioning method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination