CN115861902B - Unsupervised action migration and discovery method, system, device and medium
Abstract
The invention discloses an unsupervised action migration and discovery method, system, device, and medium, wherein the method comprises the following steps: acquiring a target data set without labels; constructing a convolutional network model of a decomposed action flow, slicing all videos, calculating the cluster centers of all slices with a clustering algorithm as pseudo labels of the slice actions, and learning the decomposed actions expressed by the video slices with the pseudo labels; constructing a convolutional network model of a complete action flow, calculating the cluster centers of all complete videos with a clustering algorithm as pseudo labels of the video actions, and learning the complete actions expressed by the complete videos with the pseudo labels; the convolutional network model of the decomposed action flow and the convolutional network model of the complete action flow learn from each other, so that the model can discover new action types and learn more accurate decomposed action information. The invention can complete the action recognition task without supervision, and improves the accuracy of action recognition and the efficiency of the overall algorithm by means of transfer learning.
Description
Technical Field
The invention belongs to the technical field of action recognition, and particularly relates to an unsupervised action migration and discovery method, system, device, and medium.
Background
Unsupervised action migration aims at applying a pre-trained network to an unsupervised target data set to complete the task of action recognition, and the prior art comprises two aspects:
(1) Unsupervised action recognition. Fully supervised action recognition has developed over many years; the most representative work at present is the two-stream network, which contains an RGB-frame convolutional network and an optical-flow convolutional network, providing temporal motion information for action recognition. The prior art has also explored efficient 3D convolutional networks, which model the relationship between spatial position and action information. For unsupervised action recognition, existing methods mainly propose self-supervised labeling schemes: the model is pre-trained through carefully designed unsupervised proxy tasks and then fine-tuned with the existing labels of the target dataset.
(2) Unsupervised transfer learning. In transfer learning, the training data come from two different domains, a source domain and a target domain. The main task of transfer learning is to exploit training on the source dataset to improve model performance on the target dataset. A popular approach is unsupervised domain adaptation (UDA). UDA assumes an annotated source dataset and an unlabeled target dataset, with the source task consistent with the target task (e.g., the action types are consistent). Most UDA work focuses on minimizing the domain discrepancy.
Migrating a network model pre-trained on a large dataset to a small dataset and performing fully supervised fine-tuning only on the target dataset can remarkably improve action recognition performance on the target dataset (compared with training from random initialization). However, in real-life applications, the manual labels required for supervised fine-tuning are difficult to obtain. Current work on unsupervised action recognition mainly concerns self-supervised training methods, which still require fully supervised fine-tuning with labeled data, so a pre-trained model cannot be directly migrated to an unlabeled target dataset. In the transfer learning part, conventional UDA methods are not fully applicable to unsupervised transfer learning, because the target task is often inconsistent with the source task.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art, and provides an unsupervised action migration and discovery method, system, device, and medium, which complete the action recognition task under the unsupervised condition and utilize transfer learning to improve action recognition accuracy and overall algorithm efficiency.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
In a first aspect, the present invention provides an unsupervised action migration and discovery method, comprising the steps of:
acquiring a target data set without labels, wherein the target data set is acquired video;
constructing a decomposition action and complete action bidirectional learning MUSIC model, wherein the MUSIC model comprises a convolution network model for decomposing action flow and a convolution network model for complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow uses a clustering algorithm to calculate the clustering centers of all complete videos as pseudo tags of video actions, and learns the complete actions expressed by the complete videos by using the pseudo tags;
the convolution network model of the decomposed action flow and the convolution network model of the complete action flow are mutually learned to obtain a trained MUSIC model; in the mutual learning process, an integrity constraint is added between the decomposed action flow and the complete action flow, so that the expression of the complete action is constructed from the decomposed actions that have been learned, and a similar complete action distinguishing strategy is adopted to distinguish similar complete actions, the strategy being that if the decomposed actions are different, the complete actions to which they belong are divided into different categories; finally, a decomposed action alignment strategy is introduced, so that the convolution network model of the decomposed action flow and the convolution network model of the complete action flow both learn shared decomposed actions;
And completing the action recognition task under the unsupervised condition by using the learned MUSIC model.
As a preferable technical scheme, the action learning of the convolutional network model for decomposing the action flow comprises a clustering step for decomposing the action flow and a learning step for decomposing the action flow;
in the step of clustering the decomposed action stream, the features of all video slices are extracted, and the features of all video slices are clustered into a plurality of decomposed actions by using a clustering algorithm to obtain a decomposed action feature set A, wherein the extraction method of the decomposed action feature set A is as follows:
$$A=\bigcup_{i=1}^{N}\bigcup_{b\in B_i} a_i^{b},\qquad a_i^{b}=f\!\left(x_i^{b:b+l-1};\,\theta_f\right)$$

where $N$ represents the total number of videos, $\bigcup$ is a union operation, $a_i^{b}$ denotes the decomposed action feature extracted from the $b$-th slice of the $i$-th video, $x_i^{b:b+l-1}$ is the video slice consisting of the $b$-th to $(b+l-1)$-th frames of the $i$-th video, $f(\cdot)$ represents the convolutional network model of the decomposed action flow, $\theta_f$ represents the parameters of that convolutional network, $l$ indicates the slice length, $B_i=\{1,\,1+\delta,\,\ldots,\,1+(n_i-1)\delta\}$ is the set of video slice start frames obtained by sampling one slice every $\delta$ frames of the video, $T_i$ indicates the total frame number of the $i$-th video, and $n_i=\lfloor (T_i-l)/\delta\rfloor+1$ represents the total number of slices of a video.

Clustering the decomposed action feature set $A$ with a clustering algorithm yields the pseudo-label set of all slice decomposed actions $P=\{p_i^{b}\mid i=1,\ldots,N;\ b\in B_i\}$ and the cluster-center set $H=\{h_k\mid k=1,\ldots,K_s\}$, wherein $p_i^{b}$ indicates the pseudo label of the $b$-th slice of the $i$-th video, $h_k$ indicates the cluster-center feature of the $k$-th cluster, $k$ is the subscript of a decomposed action cluster, and $K_s$ represents the total number of decomposed action clusters.
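As an illustration of this clustering step, the following is a minimal sketch, assuming a slice-level backbone `f` that maps a stack of `l` frames to a feature vector; the helper names are hypothetical placeholders, not the patented implementation, and k-means stands in for the unspecified clustering algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def slice_starts(T, l, delta):
    # B_i: start frames of length-l slices, one every delta frames (0-indexed)
    return range(0, T - l + 1, delta)

def build_decomposed_feature_set(videos, f, l=16, delta=8):
    """Build the set A of slice features a_i^b = f(x_i^{b:b+l-1}).

    videos: list of arrays of shape (T_i, H, W, 3); f: slice -> 1-D feature.
    Returns stacked features plus (video, start-frame) bookkeeping."""
    feats, index = [], []
    for i, video in enumerate(videos):
        for b in slice_starts(len(video), l, delta):
            feats.append(f(video[b:b + l]))
            index.append((i, b))
    return np.stack(feats), index

def cluster_pseudo_labels(features, num_clusters):
    # Cluster assignments are the pseudo labels P; centers form the dictionary H
    km = KMeans(n_clusters=num_clusters, n_init=10).fit(features)
    return km.labels_, km.cluster_centers_
```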
As an preferable technical solution, in the learning step of decomposing the motion stream, random slice feature sampling is performed on all videos, and the classification probability of each slice feature is calculated, where the calculation formula is as follows:
$$q_i^{b}=\operatorname{softmax}\!\left(W_s\, a_i^{b}\right)$$

wherein $q_i^{b}$ is the action prediction probability vector of the $b$-th slice of the $i$-th video, $[q_i^{b}]_k$ denotes the $k$-th column of $q_i^{b}$, i.e. the predicted probability of the $k$-th cluster in the prediction probability vector, $W_s$ denotes the softmax parameter obtained by training the deep learning network and is reset at each iteration, $W_s\in\mathbb{R}^{K_s\times d}$ is a matrix over the real number field ($d$ being the feature dimension), and $a_i^{b}$ denotes the decomposed action feature vector of the $b$-th slice of the $i$-th video.

Let $Q$ denote the set of prediction vectors of all slices; the pseudo label $p_i^{b}$ provides self-supervision information for the $b$-th slice. Training $f(\cdot)$ yields the loss function:

$$\mathcal{L}_s=-\frac{1}{|P|}\sum_{i=1}^{N}\sum_{b\in B_i}\log\left[q_i^{b}\right]_{p_i^{b}}$$
As a preferable technical scheme, the action learning of the convolution network model of the complete action flow comprises a clustering step of the complete action flow and a learning step of the complete action flow;
in the clustering step of the complete action flow, the characteristics of the complete action are extracted as follows:
$$v_i=g\!\left(c_i^{1},c_i^{2},\ldots,c_i^{M}\right),\qquad c_i^{m}=F\!\left(x_i^{s_m:s_m+l-1};\,\theta_F\right)$$

where $v_i$ represents the complete feature of the $i$-th video, $g(\cdot)$ is an aggregation function of any kind, $c_i^{m}$ represents the partial feature extracted from the $m$-th segment of the $i$-th video, $s_m$ represents the start frame of the $m$-th video segment, $x_i^{s_m:s_m+l-1}$ is the video segment from the $s_m$-th frame to the $(s_m+l-1)$-th frame of the $i$-th video, $l$ denotes the video segment length, $F(\cdot)$ denotes the convolutional network of the complete action flow, and $\theta_F$ is a parameter of the complete action flow convolutional network; let $V$ represent the complete action feature set of all videos.

Clustering the complete action feature set $V$ with a clustering algorithm yields the pseudo-label set of all video complete actions $Y=\{y_i\mid i=1,\ldots,N\}$, wherein $y_i$ indicates the pseudo label of the $i$-th video and $N$ represents the total number of videos.

In the learning step of the complete action flow, the classification probability of each complete feature is computed as

$$z_i=\operatorname{softmax}\!\left(W_c\, v_i\right)$$

wherein $z_i$ is the prediction probability vector of the complete feature over each cluster, and $W_c$ represents the softmax parameter obtained after training, reset at each iteration.
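A hedged sketch of the complete-feature extraction, with mean pooling standing in for the arbitrary aggregation function $g(\cdot)$ and `F` denoting the complete-stream backbone (hypothetical names, not the patented implementation):

```python
import torch

def complete_feature(video, F, M=3, l=16):
    """v_i = g(c_i^1, ..., c_i^M): aggregate M evenly divided segment features.

    video: tensor (T, C, H, W); F: length-l segment -> 1-D feature (c_i^m).
    Mean pooling plays the role of g; max pooling or an LSTM could be
    substituted without changing the surrounding pipeline."""
    T = video.shape[0]
    starts = [m * (T - l) // max(M - 1, 1) for m in range(M)]  # s_m, evenly spaced
    segments = [F(video[s:s + l]) for s in starts]
    return torch.stack(segments).mean(dim=0)
```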
As a preferred technical solution, the similar complete action distinguishing strategy specifically includes:
The complete actions are distinguished by the most representative decomposed action. The representative decomposed action $r_i$ is obtained by taking the arg max of the mean of the decomposed action prediction probabilities over the video segments:

$$r_i=\arg\max_{k}\ \frac{1}{M}\sum_{m=1}^{M}\left[q_i^{m}\right]_k$$

wherein $\arg\max(\cdot)$ is a function returning the subscript of the maximum, $[q_i^{m}]_k$ denotes the prediction probability of decomposed action $k$ for the $m$-th segment of video $i$, and $M$ represents the total number of segments of the current video $i$;
The complete actions are then classified according to the representative decomposed action, i.e. complete actions containing different representative decomposed actions should be identified as different action types and clustered into different clusters. Specifically, the complete action cluster set is:

$$C_{j,k}=\left\{\,i \;\middle|\; y_i=j,\ r_i=k\,\right\},\qquad j=1,\ldots,K_c,\quad k=1,\ldots,K_s$$

wherein $C_{j,k}$ represents the subset of the complete action cluster set in accordance with the above conditions, $K_c$ represents the number of clusters of complete actions, and $K_s$ represents the number of decomposed action clusters;
A new set of complete action cluster labels $\hat{Y}=\{\hat{y}_i\mid i=1,\ldots,N\}$ is then obtained, where $\hat{y}_i$ indicates which cluster $C_{j,k}$ the complete action of the $i$-th video falls in. Finally, $\hat{Y}$ is used to train $F(\cdot)$, giving the loss function:

$$\mathcal{L}_c=-\frac{1}{N}\sum_{i=1}^{N}\log\left[z_i\right]_{\hat{y}_i}$$

wherein $K_c'$ represents the total number of cluster labels after applying the similar complete action discrimination strategy, and $[z_i]_{\hat{y}_i}$ represents the probability that video $i$ is predicted as action $\hat{y}_i$.
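A sketch of the discrimination strategy under the reconstruction above: the representative decomposed action is the arg max of the segment-averaged decomposed-action probabilities, and each complete-action cluster is split by that index (helper names are hypothetical):

```python
import numpy as np

def representative_action(segment_probs):
    # segment_probs: (M, K_s) decomposed-action probabilities for one video
    return int(np.argmax(segment_probs.mean(axis=0)))          # r_i

def refine_complete_labels(pseudo_labels, rep_actions):
    """Split each complete-action cluster j into sub-clusters C_{j,k} by r_i,
    yielding the new labels y_hat used to retrain the complete stream."""
    pairs = sorted(set(zip(pseudo_labels, rep_actions)))
    relabel = {pair: new_id for new_id, pair in enumerate(pairs)}
    return [relabel[(y, r)] for y, r in zip(pseudo_labels, rep_actions)]
```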
As a preferred technical solution, the decomposition action alignment policy specifically includes:
The decomposed action flow and the complete action flow are forced to learn shared decomposed actions by minimizing a loss function $\mathcal{L}_{align}$ that aligns the distribution of each decomposed action in the complete action flow, $\Phi_c(k)$, with its distribution in the decomposed action flow, $\Phi_s(k)$. The specific loss function is:

$$\mathcal{L}_{align}=\sum_{k=1}^{K_s} D\!\left(\Phi_c(k),\,\Phi_s(k)\right)$$

wherein $D(\cdot,\cdot)$ is any function representing the distance between two distributions, $\Phi_c(k)$ represents the distribution of decomposed action $k$ in the complete action flow, and $\Phi_s(k)$ represents its distribution in the decomposed action flow. Taking account of the effectiveness and simplicity of calculation, a loss function simplifying the 2-Wasserstein distance is used to calculate the distance between the distributions:

$$\mathcal{L}_{align}=\sum_{k=1}^{K_s}\left(\left\|\mu_c^{k}-\mu_s^{k}\right\|_2^{2}+\left\|\sigma_c^{k}-\sigma_s^{k}\right\|_2^{2}\right)^{1/2}$$

wherein $\mu^{k}$ and $\sigma^{k}$ denote the mean and standard deviation of decomposed action $k$'s feature distribution in the respective stream.
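Under a diagonal-Gaussian summary of each decomposed action's features (an assumption consistent with the simplified 2-Wasserstein form above, not necessarily the verbatim patented formula), the alignment loss can be sketched as:

```python
import torch

def simplified_w2(mu_a, sigma_a, mu_b, sigma_b, eps=1e-8):
    """Simplified 2-Wasserstein distance between two diagonal Gaussians."""
    return torch.sqrt(((mu_a - mu_b) ** 2).sum()
                      + ((sigma_a - sigma_b) ** 2).sum() + eps)

def alignment_loss(complete_feats, decomposed_feats):
    """L_align: sum over decomposed actions k of the distance between the two
    streams' feature distributions; inputs map k -> (n_k, d) feature tensors."""
    loss = 0.0
    for k in complete_feats:
        fc, fs = complete_feats[k], decomposed_feats[k]
        loss = loss + simplified_w2(fc.mean(0), fc.std(0), fs.mean(0), fs.std(0))
    return loss
```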
Finally, the learning-step loss function of the MUSIC model is expressed as:

$$\mathcal{L}=\mathcal{L}_s+\mathcal{L}_c+\lambda_1\mathcal{L}_{comp}+\lambda_2\mathcal{L}_{align}$$

wherein $\mathcal{L}_s$ and $\mathcal{L}_c$ are the classification loss functions guided by the decomposed action flow and complete action flow pseudo labels, $\mathcal{L}_{comp}$ is the integrity constraint loss function, $\mathcal{L}_{align}$ is the decomposed action alignment loss function, and $\lambda_1$ and $\lambda_2$ are the weights balancing the individual losses.
In a second aspect, the invention provides an unsupervised action migration and discovery system, which is applied to the unsupervised action migration and discovery method, and comprises a data acquisition module, a model construction module, a mutual learning module and an action recognition module;
The data acquisition module is used for acquiring a target data set without labels, and the target data set is acquired video;
the model construction module is used for constructing a decomposition action and complete action bidirectional learning MUSIC model, wherein the MUSIC model comprises a convolution network model for decomposing action flow and a convolution network model for complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow uses a clustering algorithm to calculate the clustering centers of all complete videos as pseudo tags of video actions, and learns the complete actions expressed by the complete videos by using the pseudo tags;
the mutual learning module is used for the mutual learning of the convolutional network model of the decomposed action flow and the convolutional network model of the complete action flow; in the mutual learning process, an integrity constraint is added between the decomposed action flow and the complete action flow, so that the expression of the complete action is constructed from the decomposed actions that have been learned, and a similar complete action distinguishing strategy is adopted to distinguish similar complete actions, the strategy being that if the decomposed actions are different, the complete actions to which they belong are divided into different categories; finally, a decomposed action alignment strategy is introduced, so that the convolutional network model of the decomposed action flow and the convolutional network model of the complete action flow learn the shared decomposed actions;
And the action recognition module is used for completing the action recognition task under the unsupervised condition by utilizing the learned MUSIC model.
In a third aspect, the present invention provides an electronic device, including:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the unsupervised method of action migration and discovery.
In a fourth aspect, the present invention provides a computer readable storage medium storing a program which, when executed by a processor, implements the unsupervised method of action migration and discovery.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention constructs a convolutional network model of a decomposed action flow, slices all videos, calculates the cluster centers of all slices with a clustering algorithm as pseudo labels of the slice actions, and learns the decomposed actions expressed by the video slices with the pseudo labels; constructs a convolutional network model of a complete action flow, calculates the cluster centers of all complete videos with a clustering algorithm as pseudo labels of the video actions, and learns the complete actions expressed by the complete videos with the pseudo labels; the convolutional network model of the decomposed action flow and the convolutional network model of the complete action flow learn from each other, so that the model can discover new action types and learn more accurate decomposed action information. Therefore, the invention can identify brand-new action types in the target data set, and simultaneously trains the two streams with bidirectional mutual learning so as to model the combination relationship between the decomposed actions and the complete actions.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of the decomposed actions when the complete action is a long jump according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the decomposed actions when the complete action is a high jump according to an embodiment of the present invention;
FIG. 3 is a flow chart of an unsupervised method of action migration and discovery according to an embodiment of the present invention;
FIG. 4 is a block diagram of an unsupervised action migration and discovery system according to an embodiment of the present invention;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly understand that the embodiments described herein may be combined with other embodiments.
The unsupervised action migration and discovery method of the present embodiment is implemented based on the proposed MUSIC model (mutually learn the subactions and the complete actions). It will be appreciated that a complete action is accomplished through many small decomposed actions, and the more complex the action, the more decomposed actions it contains. Referring to fig. 1, when the complete action is a long jump, the decomposed actions can be divided into running and jumping; referring to fig. 2, when the complete action is a high jump, the decomposed actions can be divided into running and jumping upward. In order to learn brand-new action types and understand higher-level action semantics, the main idea of the MUSIC algorithm framework is to provide self-supervision by utilizing the relationship between decomposed actions and complete actions. In summary, the MUSIC algorithm is composed of two action learning streams, a decomposed action stream and a complete action stream, and the two parts learn from each other bidirectionally (a schematic training loop is sketched after the list below). In the decomposed action stream, all videos are sliced, the cluster centers of all slices are calculated with a clustering algorithm as pseudo labels of the slice actions, and the decomposed actions expressed by the video slices are learned with the pseudo labels. In the complete action stream, a clustering algorithm calculates the cluster centers of all videos as pseudo labels of the video actions, and the complete actions expressed by the complete videos are learned with the pseudo labels. To realize the bidirectional learning of the decomposed action stream and the complete action stream, the MUSIC model also completes the following work:
(1) Integrity constraints are introduced to model the combined relationship between the decomposed and complete action streams.
(2) A similar complete action discrimination strategy is adopted, i.e. if the resolved actions are different, the complete actions to which they belong are divided into different categories.
(3) A split action alignment policy is introduced that requires both the split action flow and the full action flow to learn the shared split action.
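As referenced above, a schematic training loop illustrates how the two streams alternate between clustering (refreshing pseudo labels) and learning (minimizing the joint loss); the callables are placeholders for the components described in this embodiment, not the patented implementation:

```python
def train_music(videos, cluster_streams, learn_step, iters=10, epochs_per_iter=5):
    """Schematic MUSIC loop: alternate pseudo-label clustering and joint learning.

    cluster_streams(videos) -> (P, Y_hat): slice pseudo labels for the decomposed
    stream and refined video pseudo labels for the complete stream.
    learn_step(videos, P, Y_hat) -> float: resets the softmax heads, minimizes
    L_s + L_c + lambda1 * L_comp + lambda2 * L_align, and returns the loss."""
    history = []
    for _ in range(iters):
        P, Y_hat = cluster_streams(videos)        # clustering step
        for _ in range(epochs_per_iter):          # learning step
            history.append(learn_step(videos, P, Y_hat))
    return history
```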
Referring to fig. 3, an unsupervised action migration and discovery method of the present embodiment specifically includes the following steps:
s1, acquiring a target data set without a label;
in this embodiment, the target data set is an acquired video, such as a running motion video or a high jump motion video.
S2, constructing a decomposition action and complete action bidirectional learning MUSIC model, wherein the MUSIC model comprises a convolution network model for decomposing an action flow and a convolution network model for complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow is to calculate the clustering centers of all complete videos by using a clustering algorithm as pseudo labels of video actions, and learn the complete actions expressed by the complete videos by using the pseudo labels.
S21, decomposing the action flow, wherein action learning of the decomposed action flow is realized by iterative execution of a clustering step and a learning step;
firstly, in the clustering step, the embodiment extracts the features of all video slices, and clusters the features into a plurality of subdivision action types by using a clustering algorithm, and the extraction method of the decomposition action feature set A is as follows:
$$A=\bigcup_{i=1}^{N}\bigcup_{b\in B_i} a_i^{b},\qquad a_i^{b}=f\!\left(x_i^{b:b+l-1};\,\theta_f\right)$$

where $N$ represents the total number of videos, $a_i^{b}$ indicates the decomposed action feature extracted from the $b$-th slice of the $i$-th video, $f(\cdot)$ represents the convolutional network model of the decomposed action flow, $\theta_f$ represents the parameters of the decomposed action convolutional network, $l$ represents the slice length, one slice is sampled every $\delta$ frames of the video, $T_i$ indicates the total frame number of the $i$-th video, and $n_i$ represents the total number of slices of a video.

Subsequently, the set $A$ is clustered by a clustering algorithm (e.g., k-means) to obtain the pseudo-label set of all slice decomposed actions $P=\{p_i^{b}\}$ and the cluster-center set $H=\{h_k\mid k=1,\ldots,K_s\}$ ($K_s$ representing the total number of decomposed action clusters); $P$ denotes the pseudo labels of all slices and $H$ serves as the type dictionary of the decomposed actions.
Secondly, in the learning step, this embodiment performs random slice feature sampling on all videos and obtains the classification probability of each feature according to:

$$q_i^{b}=\operatorname{softmax}\!\left(W_s\, a_i^{b}\right)$$

Let $Q$ denote the set of prediction vectors of all slices; the pseudo label $p_i^{b}$ provides self-supervision information for the $b$-th slice, and training $f(\cdot)$ yields the loss function:

$$\mathcal{L}_s=-\frac{1}{|P|}\sum_{i=1}^{N}\sum_{b\in B_i}\log\left[q_i^{b}\right]_{p_i^{b}}$$
S22, complete action flow;
in the complete action flow, the steps of action learning are equally divided into a clustering step and a learning step, unlike the decomposed action flow, complete actions are represented by complete videos instead of video slices. Specifically, a complete action may be aggregated from M evenly divided video segments in a video.
In the clustering step, the features of the complete action are extracted as:

$$v_i=g\!\left(c_i^{1},\ldots,c_i^{M}\right),\qquad c_i^{m}=F\!\left(x_i^{s_m:s_m+l-1};\,\theta_F\right)$$

wherein $g(\cdot)$ can be any aggregation function (such as a mean function, a max function, or an LSTM), $s_m$ represents the start frame of the $m$-th video segment, and $F(\cdot)$ represents the convolutional network of the complete action flow; let $V$ represent the complete action feature set of all videos.
S3, bidirectional mutual learning;
in this embodiment, the two streams, namely the complete action stream and the decomposed action stream, are co-trained to provide semantic-level pseudo-supervision by using the relationship between actions, and to model the combined relationship between the decomposed action and the complete action. Through bidirectional mutual learning, the MUSIC algorithm model is expected to find new action types and learn more accurate decomposition action information so that action migration can be better adapted to a target domain.
S31, integrity constraint;
Considering that the expression of a complete action contains the expressions of its decomposed actions, this embodiment adds an integrity constraint between the decomposed action flow and the complete action flow, so that the expression of the complete action is constructed from the decomposed actions that have already been learned; this yields the integrity constraint loss $\mathcal{L}_{comp}$, in which $\tilde{q}$ denotes the decomposed actions learned within the complete action flow.
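The exact closed form of $\mathcal{L}_{comp}$ follows the original formula; as one plausible reading (an assumption for illustration only, not the verbatim patented equation), the complete stream's segment features are asked to reproduce the decomposed actions discovered by the decomposed stream, e.g. via a cross-entropy between segment-level decomposed-action predictions $\tilde{q}_i^{m}$ computed from the complete stream and the decomposed-stream pseudo labels of the temporally corresponding slices:

```python
import torch.nn.functional as F_nn

def integrity_constraint_loss(segment_logits, matched_slice_labels):
    """One plausible L_comp (assumption, not the verbatim patented equation).

    segment_logits: (N*M, K_s) decomposed-action logits q~_i^m predicted from
    the complete stream's segment features c_i^m.
    matched_slice_labels: (N*M,) decomposed-stream pseudo labels of the slices
    that temporally overlap each segment."""
    return F_nn.cross_entropy(segment_logits, matched_slice_labels)
```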
S32, similar complete action distinguishing strategies;
since complete action flows tend to merge similar complete actions into the same class, while the operation of the present embodiment requires the discovery of new action types, it is necessary to distinguish between these similar but inconsistent actions. Thus, the decomposition actions identified by the decomposition action flow are utilized to distinguish these similar complete actions and learnMore discriminative feature expression is learned. In particular, complete actions containing different decomposition actions should belong to different categories. However, in the split action flow, the network is likely to give false labels or classification predictions in error, so this embodiment distinguishes complete actions by only the most representative split actions. Representative decomposition actionThe method is obtained by taking the average value maximum value of the decomposition action prediction probability of each video segment, and specifically comprises the following steps:
wherein ,representation->Decomposition action->Prediction probability of +.>Representing the current video +.>Is a fraction of the total number of fragments.
The next step is to classify the complete actions according to this representative decomposed action, i.e. the complete actions containing different representative decomposed actions should be identified as different action types and clustered into different clusters. Specifically, the complete action cluster set is:

$$C_{j,k}=\left\{\,i \;\middle|\; y_i=j,\ r_i=k\,\right\},\qquad j=1,\ldots,K_c,\quad k=1,\ldots,K_s$$

A new set of complete action cluster labels $\hat{Y}=\{\hat{y}_i\}$ is then obtained, where $\hat{y}_i$ indicates which cluster $C_{j,k}$ the complete action of the $i$-th video falls in. Finally, $\hat{Y}$ is used to train $F(\cdot)$ with the following loss function:

$$\mathcal{L}_c=-\frac{1}{N}\sum_{i=1}^{N}\log\left[z_i\right]_{\hat{y}_i}$$
wherein $K_c'$ represents the total number of cluster labels after applying the similar complete action discrimination strategy, and $[z_i]_{\hat{y}_i}$ represents the probability that video $i$ is predicted as action $\hat{y}_i$.
S33, decomposing an action alignment strategy;
Considering that the integrity constraint and the similar complete action discrimination strategy reconstruct and discriminate complete actions from decomposed actions, the two streams need to learn some shared decomposed actions; this embodiment refers to this as decomposed action alignment. In particular, as in the integrity constraint loss formula, $\tilde{q}$ denotes the decomposed actions learned within the complete action flow. Both streams are forced to learn the shared decomposed actions by minimizing the loss function $\mathcal{L}_{align}$, which aligns the distribution of decomposed action $k$ in the complete action flow, $\Phi_c(k)$, with its distribution in the decomposed action flow, $\Phi_s(k)$. The specific loss function is:

$$\mathcal{L}_{align}=\sum_{k=1}^{K_s} D\!\left(\Phi_c(k),\,\Phi_s(k)\right)$$

wherein $D(\cdot,\cdot)$ can be any function representing the distance between two distributions (e.g., the KL divergence or the Wasserstein distance). In view of effectiveness and simplicity of calculation, this embodiment uses a loss function that simplifies the 2-Wasserstein distance to calculate the distance between the distributions:

$$\mathcal{L}_{align}=\sum_{k=1}^{K_s}\left(\left\|\mu_c^{k}-\mu_s^{k}\right\|_2^{2}+\left\|\sigma_c^{k}-\sigma_s^{k}\right\|_2^{2}\right)^{1/2}$$
Finally, the learning-step loss function of the MUSIC algorithm framework is expressed as:

$$\mathcal{L}=\mathcal{L}_s+\mathcal{L}_c+\lambda_1\mathcal{L}_{comp}+\lambda_2\mathcal{L}_{align}$$

wherein $\mathcal{L}_s$ and $\mathcal{L}_c$ are respectively the pseudo-label-guided classification loss functions (cross entropy) of the two streams, $\mathcal{L}_{comp}$ is the integrity constraint loss function, $\mathcal{L}_{align}$ is the decomposed action alignment loss function, and $\lambda_1$ and $\lambda_2$ are the weights balancing the individual losses.
And S4, completing an action recognition task under an unsupervised condition by using the learned MUSIC model.
The following briefly describes the performance of the MUSIC model in performing an action recognition task under unsupervised conditions:
This embodiment employs the two most common large datasets in action recognition as source datasets for pre-training: Kinetics and Ig65m. Meanwhile, two benchmark datasets are adopted as target datasets for testing the performance of the MUSIC algorithm: UCF-101 and HMDB-51. In UCF-101 and HMDB-51, more than 50% of the action types are new action types that are not present in the source datasets.
In the test, this embodiment adopts the cosine distance to measure the similarity of two action features. First, one video of each action type is randomly sampled as the control group. Then, a video of each action type (outside the control group) is randomly selected for testing, and rank-1 and rank-5 accuracies are obtained. The above process is repeated several times, with a different control group selected each time, and the obtained accuracies are finally averaged over the repetitions.
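The retrieval protocol can be sketched as follows (a hedged illustration of rank-k accuracy under cosine similarity; one call corresponds to one randomly drawn control group, and the caller averages over repeats):

```python
import numpy as np

def rank_k_accuracy(query_feats, query_labels, gallery_feats, gallery_labels, k=5):
    """Rank-k retrieval accuracy with cosine similarity for one control group."""
    gallery_labels = np.asarray(gallery_labels)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T                              # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]     # k nearest gallery items per query
    hits = [query_labels[i] in gallery_labels[topk[i]] for i in range(len(q))]
    return float(np.mean(hits))
```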
In the pre-training part, this embodiment pre-trains the 3D-ResNeXt-101 model with Kinetics and the R(2+1)D-34 model with Ig65m. The input to the model is a video segment of 16 consecutive frames at a resolution of 224 × 224. The aggregation function $g(\cdot)$ in the complete action stream is average pooling. Unless otherwise specified, the sampling interval $\delta$ of all video slices, the number of clusters, and the loss-function weights $\lambda_1$ and $\lambda_2$ are fixed, and the number of segments of the complete action video is $M=3$.
The invention re-implements several of the most advanced unsupervised action recognition algorithms and compares them using the same pre-trained model; the detailed performance comparison is shown in the table.
"fully supervised approach" refers to the migration of a pre-trained model onto a target dataset for supervised refinement training, with the algorithm chosen as Temporal Segment Network (TSN). The direct migration method refers to directly migrating the pre-trained model to a target data set for testing without fine training.
Under the unsupervised condition, the invention achieves the best performance, improving to a large extent over other unsupervised algorithms. Compared with work of the same type, the main reason is that the MUSIC algorithm models the relationship between decomposed actions and complete actions, so the network can learn deeper semantic information of actions and can thereby recognize new action types that are not in the pre-training dataset, whereas other types of work do not explicitly solve the new-action-type problem in action migration.
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present invention is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present invention.
Based on the same ideas of the unsupervised action migration and discovery method in the above embodiments, the present invention also provides an unsupervised action migration and discovery system that can be used to perform the unsupervised action migration and discovery method described above. For ease of illustration, only those portions of an unsupervised motion migration and discovery system embodiment are shown in a schematic configuration diagram relating to embodiments of the present invention, and those skilled in the art will appreciate that the illustrated configuration is not limiting of the apparatus and may include more or fewer components than illustrated, or may combine certain components, or a different arrangement of components.
Referring to FIG. 4, in another embodiment of the present application, an unsupervised action migration and discovery system 100 is provided, which includes a data acquisition module 101, a model building module 102, a mutual learning module 103, and an action recognition module 104;
The data acquisition module 101 is configured to acquire a target data set without a tag, where the target data set is an acquired video;
the model building module 102 is configured to build a MUSIC model, where the MUSIC model includes a convolutional network model for decomposing an action flow and a convolutional network model for a complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow uses a clustering algorithm to calculate the clustering centers of all complete videos as pseudo tags of video actions, and learns the complete actions expressed by the complete videos by using the pseudo tags;
the mutual learning module 103 is configured to mutually learn a convolutional network model of a decomposed action stream and a convolutional network model of a complete action stream, and in the mutual learning process, add an integrity constraint between the decomposed action stream and the complete action stream, so that the expression of the complete action is constructed by the decomposed action that has been learned, and differentiate the similar complete action by adopting a similar complete action differentiating policy, where if the decomposed actions are different, the complete actions to which the similar complete action differentiating policy belongs are classified into different categories, and finally introduce a decomposed action alignment policy, so that the convolutional network model of the decomposed action stream and the convolutional network model of the complete action stream learn a shared decomposed action;
The motion recognition module 104 is configured to complete a motion recognition task under an unsupervised condition by using the learned MUSIC model.
It should be noted that, the unsupervised action migration and discovery system of the present invention corresponds to the unsupervised action migration and discovery method of the present invention one by one, and technical features and beneficial effects described in the embodiments of the unsupervised action migration and discovery method are applicable to the unsupervised action migration and discovery embodiments, and specific content may be referred to the description in the embodiments of the method of the present invention, which is not repeated herein, and thus is stated herein.
Furthermore, in the implementation of the unsupervised action migration and discovery system of the foregoing embodiment, the logic division of each program module is merely illustrative, and in practical application, the above-mentioned function allocation may be performed by different program modules according to needs, for example, in view of configuration requirements of corresponding hardware or convenience of implementation of software, that is, the internal structure of the unsupervised action migration and discovery system is divided into different program modules to perform all or part of the functions described above.
Referring to fig. 5, in one embodiment, an electronic device implementing an unsupervised action migration and discovery method is provided, the electronic device 200 may include a first processor 201, a first memory 202, and a bus, and may further include a computer program, such as an unsupervised action migration and discovery program 203, stored in the first memory 202 and executable on the first processor 201.
The first memory 202 includes at least one type of readable storage medium, which includes flash memory, a mobile hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The first memory 202 may in some embodiments be an internal storage unit of the electronic device 200, such as a mobile hard disk of the electronic device 200. The first memory 202 may also be an external storage device of the electronic device 200 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a secure digital (SecureDigital, SD) Card, a Flash memory Card (Flash Card), etc. that are provided on the electronic device 200. Further, the first memory 202 may also include both an internal memory unit and an external memory device of the electronic device 200. The first memory 202 may be used to store not only application software installed in the electronic device 200 and various data, such as codes of the unsupervised action migration and discovery program 203, but also temporarily store data that has been output or is to be output.
The first processor 201 may be formed by an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be formed by a plurality of integrated circuits packaged with the same function or different functions, including one or more central processing units (Central Processing unit, CPU), a microprocessor, a digital processing chip, a graphics processor, a combination of various control chips, and so on. The first processor 201 is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 200 and processes data by running or executing programs or modules stored in the first memory 202 and calling data stored in the first memory 202.
Fig. 5 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 5 is not limiting of the electronic device 200 and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
The unsupervised action migration and discovery program 203 stored in the first memory 202 in the electronic device 200 is a combination of instructions that, when executed in the first processor 201, may implement:
acquiring a target data set without labels, wherein the target data set is acquired video;
constructing a decomposition action and complete action bidirectional learning MUSIC model, wherein the MUSIC model comprises a convolution network model for decomposing action flow and a convolution network model for complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow uses a clustering algorithm to calculate the clustering centers of all complete videos as pseudo tags of video actions, and learns the complete actions expressed by the complete videos by using the pseudo tags;
The convolution network model of the decomposed action flow and the convolution network model of the complete action flow are mutually learned to obtain a trained MUSIC model; in the mutual learning process, an integrity constraint is added between the decomposed action flow and the complete action flow, so that the expression of the complete action is constructed from the decomposed actions that have been learned, and a similar complete action distinguishing strategy is adopted to distinguish similar complete actions, the strategy being that if the decomposed actions are different, the complete actions to which they belong are divided into different categories; finally, a decomposed action alignment strategy is introduced, so that the convolution network model of the decomposed action flow and the convolution network model of the complete action flow both learn shared decomposed actions;
and completing the action recognition task under the unsupervised condition by using the learned MUSIC model.
Further, the modules/units integrated with the electronic device 200 may be stored in a non-volatile computer readable storage medium if implemented in the form of software functional units and sold or used as a stand-alone product. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a non-volatile computer readable storage medium, and where the program, when executed, may include processes in the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be made in the equivalent manner, and the embodiments are included in the protection scope of the present invention.
Claims (6)
1. An unsupervised method of action migration and discovery, comprising the steps of:
acquiring a target data set without labels, wherein the target data set is acquired video;
constructing a decomposition action and complete action bidirectional learning MUSIC model, wherein the MUSIC model comprises a convolution network model for decomposing action flow and a convolution network model for complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow uses a clustering algorithm to calculate the clustering centers of all complete videos as pseudo tags of video actions, and learns the complete actions expressed by the complete videos by using the pseudo tags;
The convolution network model of the decomposed action flow and the convolution network model of the complete action flow are mutually learned to obtain a trained MUSIC model; in the mutual learning process, an integrity constraint is added between the decomposed action flow and the complete action flow, so that the expression of the complete action is constructed from the decomposed actions that have been learned, and a similar complete action distinguishing strategy is adopted to distinguish similar complete actions, the strategy being that if the decomposed actions are different, the complete actions to which they belong are divided into different categories; finally, a decomposed action alignment strategy is introduced, so that the convolution network model of the decomposed action flow and the convolution network model of the complete action flow both learn shared decomposed actions;
the action learning of the convolutional network model for decomposing the action flow comprises a clustering step for decomposing the action flow and a learning step for decomposing the action flow; in the step of clustering the decomposed action stream, the features of all video slices are extracted, and the features of all video slices are clustered into a plurality of decomposed actions by using a clustering algorithm to obtain a decomposed action feature set A, wherein the extraction method of the decomposed action feature set A is as follows:
$$A=\bigcup_{i=1}^{N}\bigcup_{b\in B_i} a_i^{b},\qquad a_i^{b}=f\!\left(x_i^{b:b+l-1};\,\theta_f\right)$$

where $N$ represents the total number of videos, $\bigcup$ is a union operation, $a_i^{b}$ denotes the decomposed action feature extracted from the $b$-th slice of the $i$-th video, $x_i^{b:b+l-1}$ is the video slice consisting of the $b$-th to $(b+l-1)$-th frames of the $i$-th video, $f(\cdot)$ represents the convolutional network model of the decomposed action flow, $\theta_f$ represents the parameters of the convolutional network of the decomposed action flow, $l$ represents the slice length, $B_i=\{1,\,1+\delta,\,\ldots,\,1+(n_i-1)\delta\}$ is the set of video slice start frames obtained by sampling one slice every $\delta$ frames of the video, $T_i$ represents the total frame number of the $i$-th video, and $n_i=\lfloor (T_i-l)/\delta\rfloor+1$ represents the total number of slices of a video;

clustering the decomposed action feature set $A$ by using a clustering algorithm to obtain the pseudo-label set of all slice decomposed actions $P=\{p_i^{b}\mid i=1,\ldots,N;\ b\in B_i\}$ and the cluster-center set $H=\{h_k\mid k=1,\ldots,K_s\}$, wherein $p_i^{b}$ indicates the pseudo label of the $b$-th slice of the $i$-th video, $h_k$ indicates the cluster-center feature of the $k$-th cluster, $k$ indicates the subscript of a decomposed action cluster, and $K_s$ represents the total number of decomposed action clusters;
the action learning of the convolution network model of the complete action flow comprises a clustering step of the complete action flow and a learning step of the complete action flow;
In the clustering step of the complete action flow, the characteristics of the complete action are extracted as follows:
$$v_i=g\!\left(c_i^{1},c_i^{2},\ldots,c_i^{M}\right),\qquad c_i^{m}=F\!\left(x_i^{s_m:s_m+l-1};\,\theta_F\right)$$

wherein $v_i$ represents the complete feature of the $i$-th video, $g(\cdot)$ is an aggregation function of any kind, $c_i^{m}$ represents the partial feature extracted from the $m$-th segment of the $i$-th video, $s_m$ represents the start frame of the $m$-th video segment, $x_i^{s_m:s_m+l-1}$ is the video segment from the $s_m$-th frame to the $(s_m+l-1)$-th frame of the $i$-th video, $l$ denotes the video segment length, $F(\cdot)$ denotes the convolutional network of the complete action flow, and $\theta_F$ is a parameter of the complete action flow convolutional network; let $V$ represent the complete action feature set of all videos;

clustering the complete action feature set $V$ by using a clustering algorithm to obtain the pseudo-label set of all video complete actions $Y=\{y_i\mid i=1,\ldots,N\}$, wherein $y_i$ denotes the pseudo label of the $i$-th video and $N$ represents the total number of videos;
the decomposed action alignment strategy specifically comprises: the decomposed action flow and the complete action flow are forced to learn shared decomposed actions by minimizing a loss function $\mathcal{L}_{align}$ that aligns the distribution of each decomposed action in the complete action flow, $\Phi_c(k)$, with its distribution in the decomposed action flow, $\Phi_s(k)$; the specific loss function is:

$$\mathcal{L}_{align}=\sum_{k=1}^{K_s} D\!\left(\Phi_c(k),\,\Phi_s(k)\right)$$

wherein $D(\cdot,\cdot)$ is any function representing the distance between two distributions, $\Phi_c(k)$ represents the distribution of decomposed action $k$ in the complete action flow, and $\Phi_s(k)$ represents its distribution in the decomposed action flow; taking account of the effectiveness and simplicity of calculation, a loss function simplifying the 2-Wasserstein distance is used to calculate the distance between the distributions:

$$\mathcal{L}_{align}=\sum_{k=1}^{K_s}\left(\left\|\mu_c^{k}-\mu_s^{k}\right\|_2^{2}+\left\|\sigma_c^{k}-\sigma_s^{k}\right\|_2^{2}\right)^{1/2}$$

finally, the learning-step loss function of the MUSIC model is expressed as:

$$\mathcal{L}=\mathcal{L}_s+\mathcal{L}_c+\lambda_1\mathcal{L}_{comp}+\lambda_2\mathcal{L}_{align}$$

wherein $\mathcal{L}_s$ and $\mathcal{L}_c$ are the classification loss functions guided by the decomposed action flow and complete action flow pseudo labels, $\mathcal{L}_{comp}$ is the integrity constraint loss function, $\mathcal{L}_{align}$ is the decomposed action alignment loss function, and $\lambda_1$ and $\lambda_2$ are the weights balancing the individual losses;

wherein $z_i=\operatorname{softmax}(W_c\, v_i)$ is the prediction probability vector of the complete feature over each cluster, and $W_c$ represents the softmax parameter obtained after training, reset at each iteration;
and completing the action recognition task under the unsupervised condition by using the learned MUSIC model.
2. The unsupervised motion migration and discovery method according to claim 1, wherein in the learning step of decomposing the motion stream, random slice feature sampling is performed on all videos and classification probability of each slice feature is calculated as follows:
$$q_i^{b}=\operatorname{softmax}\!\left(W_s\, a_i^{b}\right)$$

wherein $q_i^{b}$ is the action prediction probability vector of the $b$-th slice of the $i$-th video, $[q_i^{b}]_k$ represents the $k$-th column of $q_i^{b}$, i.e. the predicted probability of the $k$-th cluster in the prediction probability vector, $W_s$ represents the softmax parameter obtained by training the deep learning network and is reset at each iteration, $W_s\in\mathbb{R}^{K_s\times d}$ is a matrix over the real number field, and $a_i^{b}$ denotes the decomposed action feature vector of the $b$-th slice of the $i$-th video;

let $Q$ denote the set of prediction vectors of all slices; the pseudo label $p_i^{b}$ provides self-supervision information for the $b$-th slice, and training $f(\cdot)$ yields the loss function:

$$\mathcal{L}_s=-\frac{1}{|P|}\sum_{i=1}^{N}\sum_{b\in B_i}\log\left[q_i^{b}\right]_{p_i^{b}}$$
3. The unsupervised action migration and discovery method according to claim 1, wherein the similar complete action discrimination strategy is specifically:
the complete actions are differentiated by their most representative decomposition actions; the representative decomposition action $r_i$ is obtained by taking the index of the maximum of the mean decomposition action prediction probability over the segments of each video, specifically:

$$r_i = \arg\max_{k}\ \frac{1}{M_i}\sum_{m=1}^{M_i} p^s_{i,m}[k]$$

wherein $\arg\max$ is the function returning the subscript of the maximum, $p^s_{i,m}[k]$ represents the prediction probability of decomposition action $k$ for the $m$-th segment, and $M_i$ represents the total number of segments of the current video $i$;
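A minimal sketch of this selection for a single video; `segment_probs` is a hypothetical tensor of per-segment decomposition-action probabilities:

```python
import torch

def representative_action(segment_probs: torch.Tensor) -> int:
    """segment_probs: (M_i, K^s) decomposition-action probabilities per segment."""
    mean_probs = segment_probs.mean(dim=0)  # average over the video's segments
    return int(mean_probs.argmax())         # subscript of the maximum: r_i
```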
classifying the complete actions according to the representative decomposition actions, i.e. complete actions comprising different representative decomposition actions should be identified as different action types and clustered into different clusters; specifically, the complete action cluster set is as follows:

$$G = \{G_{u,k}\}, \qquad G_{u,k} = \{\, i \mid y^v_i = u,\ r_i = k \,\}$$

wherein $G_{u,k}$ represents the subset of the complete action cluster set whose complete action pseudo label is $u$ and whose representative decomposition action is $k$, $K^v$ denotes the number of clusters of complete actions, $u\in\{1,\dots,K^v\}$, and $k\in\{1,\dots,K^s\}$;

then a new complete action cluster label $\tilde{y}^v_i$ is obtained, wherein $\tilde{y}^v_i$ indicates which subset $G_{u,k}$ the complete action of the $i$-th video falls into; finally, $\tilde{y}^v_i$ is utilized to train $f_v$, obtaining the loss function as follows:

$$L^v_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\log p^v_i\big[\tilde{y}^v_i\big]$$
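A minimal sketch of the relabelling step, assuming one complete-action pseudo label and one representative decomposition action per video; the dictionary-based id allocation is an implementation choice for illustration, not the patented procedure:

```python
def split_clusters(complete_labels, representative_actions):
    """Inputs: per-video complete-action pseudo labels y^v and representative
    decomposition actions r. Returns refined labels, one new cluster id per
    distinct (y^v, r) pair, i.e. per subset G_{u,k}."""
    cluster_ids = {}
    refined = []
    for u, r in zip(complete_labels, representative_actions):
        key = (u, r)
        if key not in cluster_ids:
            cluster_ids[key] = len(cluster_ids)  # allocate a fresh cluster id
        refined.append(cluster_ids[key])
    return refined
```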
4. An unsupervised action migration and discovery system, characterized in that it is applied to the unsupervised action migration and discovery method according to any one of claims 1 to 3, and comprises a data acquisition module, a model construction module, a mutual learning module and an action recognition module;
the data acquisition module is used for acquiring an unlabeled target data set, the target data set consisting of collected videos;
the model construction module is used for constructing a MUSIC model for bidirectional learning of decomposition actions and complete actions, wherein the MUSIC model comprises a convolutional network model of the decomposition action flow and a convolutional network model of the complete action flow; the convolutional network model of the decomposition action flow slices all videos, calculates the cluster centers of all slices with a clustering algorithm to serve as pseudo labels of the slice actions, and learns the decomposition actions expressed by the video slices using the pseudo labels; the convolutional network model of the complete action flow calculates the cluster centers of all complete videos with a clustering algorithm to serve as pseudo labels of the video actions, and learns the complete actions expressed by the complete videos using the pseudo labels;
the mutual learning module is used for making the convolutional network model of the decomposition action flow and the convolutional network model of the complete action flow learn from each other; in the mutual learning process, an integrity constraint is added between the decomposition action flow and the complete action flow, so that the expression of the complete action is constructed from the learned decomposition actions; a similar complete action discrimination strategy is adopted to distinguish similar complete actions, the strategy being that complete actions whose decomposition actions differ are divided into different categories; and finally a decomposition action alignment strategy is introduced, so that the two models learn shared decomposition actions;
and the action recognition module is used for completing the action recognition task under the unsupervised condition by utilizing the learned MUSIC model.
5. An electronic device, the electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the unsupervised action migration and discovery method according to any one of claims 1-3.
6. A computer readable storage medium storing a program, wherein the program, when executed by a processor, implements the unsupervised action migration and discovery method according to any of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310063448.7A CN115861902B (en) | 2023-02-06 | 2023-02-06 | Unsupervised action migration and discovery method, system, device and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115861902A CN115861902A (en) | 2023-03-28 |
CN115861902B (en) | 2023-06-09
Family
ID=85657626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310063448.7A Active CN115861902B (en) | 2023-02-06 | 2023-02-06 | Unsupervised action migration and discovery method, system, device and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115861902B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116821737B (en) * | 2023-06-08 | 2024-04-30 | 哈尔滨工业大学 | Crack acoustic emission signal identification method based on improved weak supervision multi-feature fusion |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180165602A1 (en) * | 2016-12-14 | 2018-06-14 | Microsoft Technology Licensing, Llc | Scalability of reinforcement learning by separation of concerns |
US20190228313A1 (en) * | 2018-01-23 | 2019-07-25 | Insurance Services Office, Inc. | Computer Vision Systems and Methods for Unsupervised Representation Learning by Sorting Sequences |
CA3102439A1 (en) * | 2018-06-08 | 2019-12-12 | Zestfinance, Inc. | Systems and methods for decomposition of non-differentiable and differentiable models |
CN113870315B (en) * | 2021-10-18 | 2023-08-25 | 南京硅基智能科技有限公司 | Multi-algorithm integration-based action migration model training method and action migration method |
CN113947525A (en) * | 2021-11-25 | 2022-01-18 | 中山大学 | Unsupervised action style migration method based on reversible flow network |
Non-Patent Citations (1)
Title |
---|
基于深度神经网络和投影树的高效率动作识别算法 (High-efficiency action recognition algorithm based on deep neural network and projection tree); 郭洪涛 (Guo Hongtao); 龙娟娟 (Long Juanjuan); 计算机应用与软件 (Computer Applications and Software), No. 04, pp. 273-289 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |