CN115861902B - Unsupervised action migration and discovery method, system, device and medium - Google Patents

Unsupervised action migration and discovery method, system, device and medium Download PDF

Info

Publication number
CN115861902B
CN115861902B CN202310063448.7A CN202310063448A CN115861902B CN 115861902 B CN115861902 B CN 115861902B CN 202310063448 A CN202310063448 A CN 202310063448A CN 115861902 B CN115861902 B CN 115861902B
Authority
CN
China
Prior art keywords
action
complete
video
actions
decomposition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310063448.7A
Other languages
Chinese (zh)
Other versions
CN115861902A (en
Inventor
张恺成
陈泽林
郑伟诗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202310063448.7A priority Critical patent/CN115861902B/en
Publication of CN115861902A publication Critical patent/CN115861902A/en
Application granted granted Critical
Publication of CN115861902B publication Critical patent/CN115861902B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an unsupervised action migration and discovery method, a system, equipment and a medium, wherein the method comprises the following steps: acquiring a target data set without labels; constructing a convolution network model of a decomposition action flow, performing slicing processing on all videos, calculating the clustering centers of all slices by using a clustering algorithm to serve as pseudo tags of slicing actions, and learning the decomposition actions expressed by the video slices by using the pseudo tags; constructing a convolution network model of a complete action flow, calculating the clustering centers of all complete videos by using a clustering algorithm to serve as pseudo tags of video actions, and learning the complete actions expressed by the complete videos by using the pseudo tags; the convolutional network model of the decomposed action flow and the convolutional network model of the complete action flow learn each other, so that the model can discover new action types and learn more accurate decomposed action information. The invention can complete the task of motion recognition under the condition of no supervision, and improves the accuracy of motion recognition and the efficiency of the whole algorithm by utilizing a migration learning method.

Description

Unsupervised action migration and discovery method, system, device and medium
Technical Field
The invention belongs to the technical field of action recognition, and particularly relates to an unsupervised action migration and discovery method, an unsupervised action migration and discovery system, unsupervised action migration and discovery equipment and unsupervised action migration and discovery media.
Background
Unsupervised action migration aims at applying a pre-trained network to an unsupervised target data set to complete the task of action recognition, and the prior art comprises two aspects:
(1) Unsupervised action recognition. Fully supervised motion recognition has evolved over many years, with the most representative sense of operation now being a dual-stream network, which contains a frame convolutional network and an optical flow convolutional network, giving time-sequential motion information to motion recognition. In the prior art, an efficient 3D convolution network is also explored and researched, and modeling of the spatial position and action information relationship is realized. The method mainly provides some self-supervision labeling methods for the non-supervision action recognition, pre-trains the model through carefully designed non-supervision agent tasks, and then performs fine training on the model through existing labels of the target data set.
(2) Unsupervised transfer learning. In transfer learning, training data comes from two different domains, namely a source domain and a target domain. The main task of the transfer learning is to utilize the training of the source data set to improve the model performance of the target data set. A more popular approach to transfer learning is unsupervised domain adaptation UDA (unsupervised domain adaptation). UDA applies to annotated source data sets and unlabeled target data sets, and the source task is consistent with the target task (e.g., the action type is consistent). Most UDA work is focused on minimizing the field differences.
The network model obtained by pre-training under a large data set is migrated to a small data set, and only the target data set is subjected to fine complete supervision training, so that the action recognition performance on the target data set can be remarkably improved (compared with random initialization training). However, in real life applications, manual labels to refine supervised training are difficult to easily obtain. The main work at present of the non-supervision action recognition is a self-supervision training method, and full-supervision fine adjustment is still needed by using marked data, so that a pre-training model cannot be directly migrated to a non-marked target data set for use. In the transfer learning section, the conventional UDA method is not fully applicable to unsupervised transfer learning, because the target task often appears inconsistent with the source task.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art, and provides an unsupervised motion migration and discovery method, an unsupervised motion migration and discovery system, unsupervised motion recognition equipment and medium, wherein a motion recognition task is completed under an unsupervised condition, and a migration learning method is utilized to improve the motion recognition accuracy and the overall algorithm efficiency.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
In a first aspect, the present invention provides an unsupervised action migration and discovery method, comprising the steps of:
acquiring a target data set without labels, wherein the target data set is acquired video;
constructing a decomposition action and complete action bidirectional learning MUSIC model, wherein the MUSIC model comprises a convolution network model for decomposing action flow and a convolution network model for complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow uses a clustering algorithm to calculate the clustering centers of all complete videos as pseudo tags of video actions, and learns the complete actions expressed by the complete videos by using the pseudo tags;
the convolution network model of the decomposed action flow and the convolution network model of the complete action flow are mutually learned to obtain a trained MUSIC model; in the mutual learning process, integrity constraint is added between the decomposition action flow and the complete action flow, so that the expression of the complete action is constructed by the learned decomposition action, and similar complete action distinguishing strategies are adopted to distinguish the similar complete actions, wherein the similar complete action distinguishing strategies are that if the decomposition actions are different, the complete actions to which the similar complete action distinguishing strategies belong are divided into different categories, and finally, a decomposition action alignment strategy is introduced, so that a convolution network model of the decomposition action flow and a convolution network model of the complete action flow both learn shared decomposition actions;
And completing the action recognition task under the unsupervised condition by using the learned MUSIC model.
As a preferable technical scheme, the action learning of the convolutional network model for decomposing the action flow comprises a clustering step for decomposing the action flow and a learning step for decomposing the action flow;
in the step of clustering the decomposed action stream, the features of all video slices are extracted, and the features of all video slices are clustered into a plurality of decomposed actions by using a clustering algorithm to obtain a decomposed action feature set A, wherein the extraction method of the decomposed action feature set A is as follows:
Figure SMS_1
Figure SMS_2
where N represents the total number of videos,
Figure SMS_4
is a union operation, ++>
Figure SMS_6
Indicate->
Figure SMS_8
No. H of personal video>
Figure SMS_10
Decomposing action features extracted from individual slices, < >>
Figure SMS_12
Is the firstiB th to b th frames of the videob+l-1 frame of video slice, +.>
Figure SMS_14
Convolutional network model representing a resolved action flow, +.>
Figure SMS_16
Parameters of the convolutional network representing the split action flow, +.>
Figure SMS_3
The slice length is indicated as such,
Figure SMS_5
is the set of video slice start frames, +.>
Figure SMS_7
Every>
Figure SMS_9
Frame slice sampling of video,/>
Figure SMS_11
Indicate->
Figure SMS_13
Total frame number of video>
Figure SMS_15
Representing the total number of slices of a video;
clustering the decomposition action feature set A by using a clustering algorithm to obtain a pseudo tag set of all the slice decomposition actions
Figure SMS_17
And clustering center set->
Figure SMS_19
, wherein />
Figure SMS_21
Indicate->
Figure SMS_23
Pseudo tag of the b-th slice of the video,>
Figure SMS_25
indicate->
Figure SMS_27
Video of->
Figure SMS_30
N represents the total number of videos, ">
Figure SMS_18
Represents slice b, +.>
Figure SMS_20
Representing the total number of slices representing a video, T i Indicate->
Figure SMS_22
Total frame number of the video, delta represents delta frame,/delta frame>
Figure SMS_24
Indicate->
Figure SMS_26
Cluster center feature of individual clusters,/>
Figure SMS_28
Subscript number indicating the cluster of resolved actions, < ->
Figure SMS_29
Representing the total number of clusters of resolved actions.
As an preferable technical solution, in the learning step of decomposing the motion stream, random slice feature sampling is performed on all videos, and the classification probability of each slice feature is calculated, where the calculation formula is as follows:
Figure SMS_31
wherein ,
Figure SMS_33
is->
Figure SMS_35
Video->
Figure SMS_38
Motion prediction probability vector for each slice, +.>
Figure SMS_40
Representation->
Figure SMS_41
Is>
Figure SMS_43
Column, i.e. pair +.in the predictive probability vector>
Figure SMS_45
Predictive probability of individual clusters, +.>
Figure SMS_32
A parameter representing softmax obtained by training a deep learning network,>
Figure SMS_34
resetting each iteration; />
Figure SMS_36
Is the real number field [ ]>
Figure SMS_37
]Is a matrix of (a); />
Figure SMS_39
Indicate the decomposition action->
Figure SMS_42
Video->
Figure SMS_44
Feature vectors of the individual slices;
order the
Figure SMS_46
Set of prediction vectors representing all slices, pseudo tag +.>
Figure SMS_47
Give->
Figure SMS_48
The individual slices provide self-supervision information, training- >
Figure SMS_49
The loss function is obtained as follows:
Figure SMS_50
wherein ,
Figure SMS_51
is an indicator function.
As a preferable technical scheme, the action learning of the convolution network model of the complete action flow comprises a clustering step of the complete action flow and a learning step of the complete action flow;
in the clustering step of the complete action flow, the characteristics of the complete action are extracted as follows:
Figure SMS_52
Figure SMS_53
wherein ,
Figure SMS_54
represent the firstiComplete characteristics of the individual videos,/->
Figure SMS_56
Is an aggregation function of any kind, +.>
Figure SMS_57
Representing the partial features extracted from the mth segment of the ith video,/th video>
Figure SMS_59
Start frame representing mth video clip, < >>
Figure SMS_61
An overview of video clips is presented,
Figure SMS_62
is the (th) of the (th) video>
Figure SMS_63
Frame to->
Figure SMS_55
A video clip of frames, l denotes the video clip length,
Figure SMS_58
convolutional network showing complete action flow, +.>
Figure SMS_60
Is a parameter of the complete action flow convolution network, and let V represent the complete action feature set of all videos;
clustering the complete action feature set V by using a clustering algorithm to obtain a pseudo tag set of all the complete actions of the video
Figure SMS_64
, wherein />
Figure SMS_65
Indicate->
Figure SMS_66
Pseudo tag of individual video->
Figure SMS_67
Indicate->
Figure SMS_68
Video of->
Figure SMS_69
N represents the total number of videos.
As a preferred solution, the integrity constraint
Figure SMS_70
The realization of which is as follows:
Figure SMS_71
Figure SMS_72
wherein ,
Figure SMS_73
predictive probability vector representing complete features for each cluster,/- >
Figure SMS_74
The softmax parameter obtained after training is represented and reset per iteration.
As a preferred technical solution, the similar complete action distinguishing strategy specifically includes:
differentiating the complete actions by the most representative decomposition actions, the representative decomposition actions
Figure SMS_75
The method is obtained by taking the average value maximum value of the decomposition action prediction probability of each video segment, and specifically comprises the following steps:
Figure SMS_76
wherein ,
Figure SMS_77
is a function of the subscript of the maximum, +.>
Figure SMS_78
Representation->
Figure SMS_79
Decomposition action->
Figure SMS_80
Prediction probability of +.>
Figure SMS_81
Representing the current video +.>
Figure SMS_82
Is a total number of fragments;
classifying complete actions according to the representative decomposition actions, i.e. complete actions comprising different representative decomposition actions, should be identified as different action types and clustered into different clusters, in particular, a complete action cluster set as follows:
Figure SMS_83
wherein ,
Figure SMS_84
representing a set of consistent complete action clustersA subset of the formulaic conditions, +.>
Figure SMS_85
Figure SMS_86
,/>
Figure SMS_87
Number of clusters representing complete actions, +.>
Figure SMS_88
Then a new complete action cluster label is obtained
Figure SMS_89
,/>
Figure SMS_90
Indicate->
Figure SMS_91
The complete motion of the individual video is +.>
Figure SMS_92
In (3), finally, utilize +.>
Figure SMS_93
For->
Figure SMS_94
Training was performed to obtain the loss function as follows:
Figure SMS_95
wherein ,
Figure SMS_96
representing the total number of cluster labels after applying a similar complete action discrimination strategy,/for >
Figure SMS_97
Representing video
Figure SMS_98
Is predicted as action +.>
Figure SMS_99
Is a probability of (2).
As a preferred technical solution, the decomposition action alignment policy specifically includes:
forced decomposition action flow and complete action flow learn shared decomposition actions by minimizing loss functions
Figure SMS_100
To align the split actions in the complete action stream +.>
Figure SMS_101
And decomposing +.>
Figure SMS_102
The specific loss function is as follows:
Figure SMS_103
wherein ,
Figure SMS_104
is any function representing the distance between the two distributions,/->
Figure SMS_105
Representing a disaggregation act->
Figure SMS_106
Distribution of complete action flow, +.>
Figure SMS_107
Representing a disaggregation act->
Figure SMS_108
Distribution in decomposition action stream, taking account of effectiveness and simplicity of calculation, use of loss function simplifying 2-Wasserstein distance calculation distribution ≡>
Figure SMS_109
Figure SMS_110
wherein ,
Figure SMS_111
indicating desire(s)>
Figure SMS_112
Representing the variance;
finally, the learning step loss function of the MUSIC model is expressed as:
Figure SMS_113
wherein ,
Figure SMS_114
、/>
Figure SMS_115
is a class loss function directed by decomposing action flow and complete action flow pseudo tags,/for>
Figure SMS_116
Is an integrity constraint loss function,/->
Figure SMS_117
Is a decomposition action alignment loss function, +.>
Figure SMS_118
and />
Figure SMS_119
Is the weight that balances the individual losses.
In a second aspect, the invention provides an unsupervised action migration and discovery system, which is applied to the unsupervised action migration and discovery method, and comprises a data acquisition module, a model construction module, a mutual learning module and an action recognition module;
The data acquisition module is used for acquiring a target data set without labels, and the target data set is acquired video;
the model construction module is used for constructing a decomposition action and complete action bidirectional learning MUSIC model, wherein the MUSIC model comprises a convolution network model for decomposing action flow and a convolution network model for complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow uses a clustering algorithm to calculate the clustering centers of all complete videos as pseudo tags of video actions, and learns the complete actions expressed by the complete videos by using the pseudo tags;
the mutual learning module is used for mutually learning a convolutional network model of a decomposition action flow and a convolutional network model of a complete action flow, in the mutual learning process, integrity constraint is added between the decomposition action flow and the complete action flow, so that the expression of the complete action is constructed by the learned decomposition action, and similar complete action distinguishing strategies are adopted to distinguish the similar complete action, wherein the similar complete action distinguishing strategies are that if the decomposition actions are different, the complete actions to which the similar complete action distinguishing strategies belong are divided into different categories, and finally a decomposition action alignment strategy is introduced, so that the convolutional network model of the decomposition action flow and the convolutional network model of the complete action flow learn the shared decomposition action;
And the action recognition module is used for completing the action recognition task under the unsupervised condition by utilizing the learned MUSIC model.
In a third aspect, the present invention provides an electronic device, including:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the unsupervised method of action migration and discovery.
In a fourth aspect, the present invention provides a computer readable storage medium storing a program which, when executed by a processor, implements the unsupervised method of action migration and discovery.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention carries out slicing processing on all videos by constructing a convolution network model of a decomposition action flow, calculates the clustering centers of all slices by using a clustering algorithm as pseudo tags of slicing actions, and learns the decomposition actions expressed by the video slices by using the pseudo tags; constructing a convolution network model of a complete action flow, calculating the clustering centers of all complete videos by using a clustering algorithm to serve as pseudo tags of video actions, and learning the complete actions expressed by the complete videos by using the pseudo tags; the convolutional network model of the decomposed action flow and the convolutional network model of the complete action flow learn each other, so that the model can discover new action types and learn more accurate decomposed action information. Therefore, the invention can identify brand new action types in the target data set, and simultaneously train two streams by utilizing bidirectional mutual learning so as to achieve the purpose of modeling the combination relation of the analysis action and the complete action.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an exploded motion when the complete motion is a long jump according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating an exploded action when the complete action is a turn-up according to an embodiment of the present invention;
FIG. 3 is a flow chart of an unsupervised method of action migration and discovery according to an embodiment of the present invention;
FIG. 4 is a block diagram of an unsupervised action migration and discovery system according to an embodiment of the present invention;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly understand that the embodiments described herein may be combined with other embodiments.
The unsupervised action migration and discovery method of the present embodiment is implemented based on the proposed MUSIC model (mutuallylearn the subactions and the complete actions). It will be appreciated that the completion of a complete motion is accomplished by many small discrete motions, including more discrete motions as the more complex the motion is, referring to fig. 1, when the complete motion is a long jump, the discrete motions can be divided into running and long jump; referring to fig. 2, when the complete exercise is a jump, the split exercise can be divided into running and jumping upward. In order to learn the brand new action types and understand the action semantics of the higher layers, the main idea of the MUSIC algorithm framework is to provide self-supervision by utilizing the relationship between the decomposed action and the complete action. In summary, the MUSIC algorithm is composed of two parts of motion learning streams, namely, a split motion stream and a complete motion stream, and the two parts are allowed to learn each other bidirectionally. In the decomposition action flow, all videos are subjected to slice processing, the clustering centers of all slices are calculated by using a clustering algorithm to serve as pseudo tags of the slice actions, and the decomposition actions expressed by the video slices are learned by using the pseudo tags. In the complete action flow, using a clustering algorithm to calculate the clustering centers of all videos as pseudo labels of the video actions, and learning the complete actions expressed by the complete videos by using the pseudo labels. To realize the bidirectional learning of the decomposed action flow and the complete action flow, the MUSIC model also completes the following work:
(1) Integrity constraints are introduced to model the combined relationship between the decomposed and complete action streams.
(2) A similar complete action discrimination strategy is adopted, i.e. if the resolved actions are different, the complete actions to which they belong are divided into different categories.
(3) A split action alignment policy is introduced that requires both the split action flow and the full action flow to learn the shared split action.
Referring to fig. 3, an unsupervised action migration and discovery method of the present embodiment specifically includes the following steps:
s1, acquiring a target data set without a label;
in this embodiment, the target data set is an acquired video, such as a running motion video or a high jump motion video.
S2, constructing a decomposition action and complete action bidirectional learning MUSIC model, wherein the MUSIC model comprises a convolution network model for decomposing an action flow and a convolution network model for complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow is to calculate the clustering centers of all complete videos by using a clustering algorithm as pseudo labels of video actions, and learn the complete actions expressed by the complete videos by using the pseudo labels.
S21, decomposing the action flow, wherein action learning of the decomposed action flow is realized by iterative execution of a clustering step and a learning step;
firstly, in the clustering step, the embodiment extracts the features of all video slices, and clusters the features into a plurality of subdivision action types by using a clustering algorithm, and the extraction method of the decomposition action feature set A is as follows:
Figure SMS_120
Figure SMS_121
where N represents the total number of videos,
Figure SMS_124
indicate->
Figure SMS_126
No. H of personal video>
Figure SMS_127
Decomposing action features extracted from individual slices, < >>
Figure SMS_129
Convolutional network model representing a resolved action flow, +.>
Figure SMS_130
Parameters representing the deconstructed action convolutional network, +.>
Figure SMS_131
Represents slice length, +.>
Figure SMS_132
Every>
Figure SMS_122
Frame slice sampling of video, ++>
Figure SMS_123
Indicate->
Figure SMS_125
Total number of frames of each video
Figure SMS_128
Representing the total number of slices of a video.
Subsequently, clustering the set a by a clustering algorithm (e.g., k-means) to obtain a pseudo tag set p=for all slice decomposition actions
Figure SMS_133
(/>
Figure SMS_134
) And clustering center set h= =>
Figure SMS_135
(/>
Figure SMS_136
Representing the total number of clusters of resolved actions); let P denote the pseudo tag sets of all slices and H denote the type dictionary of the decomposition action.
Secondly, in the learning step, the embodiment samples the random slice feature of all videos and obtains the classification probability of each feature according to the following formula:
Figure SMS_137
wherein ,
Figure SMS_138
is->
Figure SMS_139
Motion prediction probability vector for individual video, +.>
Figure SMS_140
Each iteration is reset.
Order the
Figure SMS_141
Representing prediction vectors for all slicesIs set of pseudo tag->
Figure SMS_142
Give->
Figure SMS_143
The individual slices provide self-supervision information, training->
Figure SMS_144
The loss function can be obtained as follows:
Figure SMS_145
wherein ,
Figure SMS_146
is an indicator function.
S22, complete action flow;
in the complete action flow, the steps of action learning are equally divided into a clustering step and a learning step, unlike the decomposed action flow, complete actions are represented by complete videos instead of video slices. Specifically, a complete action may be aggregated from M evenly divided video segments in a video.
In the clustering step, the feature extraction of the complete action is as follows:
Figure SMS_147
Figure SMS_148
wherein ,
Figure SMS_149
can be any aggregation function (such as mean function, maximum function, LSTM), -or-a->
Figure SMS_150
Start frame representing mth video clip, < >>
Figure SMS_151
A convolutional network representing the complete motion stream and let V represent the complete motion feature set for all videos.
S3, bidirectional mutual learning;
in this embodiment, the two streams, namely the complete action stream and the decomposed action stream, are co-trained to provide semantic-level pseudo-supervision by using the relationship between actions, and to model the combined relationship between the decomposed action and the complete action. Through bidirectional mutual learning, the MUSIC algorithm model is expected to find new action types and learn more accurate decomposition action information so that action migration can be better adapted to a target domain.
S31, integrity constraint;
considering that the expression of the complete action is the expression containing the decomposition action, the present embodiment adds an integrity constraint between the decomposition action flow and the complete action flow, so that the expression of the complete action is constructed by the decomposition action which has been learned
Figure SMS_152
The realization of which is as follows:
Figure SMS_153
Figure SMS_154
wherein ,
Figure SMS_155
and reset each iteration.
S32, similar complete action distinguishing strategies;
since complete action flows tend to merge similar complete actions into the same class, while the operation of the present embodiment requires the discovery of new action types, it is necessary to distinguish between these similar but inconsistent actions. Thus, the decomposition actions identified by the decomposition action flow are utilized to distinguish these similar complete actions and learnMore discriminative feature expression is learned. In particular, complete actions containing different decomposition actions should belong to different categories. However, in the split action flow, the network is likely to give false labels or classification predictions in error, so this embodiment distinguishes complete actions by only the most representative split actions. Representative decomposition action
Figure SMS_156
The method is obtained by taking the average value maximum value of the decomposition action prediction probability of each video segment, and specifically comprises the following steps:
Figure SMS_157
wherein ,
Figure SMS_158
representation->
Figure SMS_159
Decomposition action->
Figure SMS_160
Prediction probability of +.>
Figure SMS_161
Representing the current video +.>
Figure SMS_162
Is a fraction of the total number of fragments.
The next step is to classify the complete actions according to this representative decomposition action, i.e. the complete actions comprising different representative decomposition actions should be identified as different action types and clustered into different clusters. Specifically, the complete action cluster set is as follows:
Figure SMS_163
wherein ,
Figure SMS_164
,(/>
Figure SMS_165
number of clusters representing complete actions),>
Figure SMS_166
then a new complete action cluster label is obtained
Figure SMS_167
,/>
Figure SMS_168
Indicate->
Figure SMS_169
The complete motion of the individual video is +.>
Figure SMS_170
Is included in the cluster tag. Finally, use->
Figure SMS_171
For->
Figure SMS_172
Training was performed with the following loss functions:
Figure SMS_173
wherein ,
Figure SMS_174
representing the total number of cluster labels after applying a similar complete action discrimination strategy,/for>
Figure SMS_175
Representing every video +.>
Figure SMS_176
Is predicted as action +.>
Figure SMS_177
Is a probability of (2).
S33, decomposing an action alignment strategy;
considering that integrity constraints and similar complete action discrimination policies are employed, complete actions are reconstructed and discriminated from split actions, so it is necessary for two flows to learn some shared split actions, this embodiment is referred to as split action alignment. In particular, as shown in the integrity constraint loss function formula,
Figure SMS_178
The learned split actions in the complete action stream are shown. That forces both streams to learn those shared decomposition actions, by minimizing the loss function +.>
Figure SMS_179
To align the split actions in the complete action stream +.>
Figure SMS_180
And decomposing +.>
Figure SMS_181
The specific loss function is as follows:
Figure SMS_182
wherein ,
Figure SMS_183
can be any function representing the distance between two distributions (e.g. KL divergence or Wasserstein distance), ->
Figure SMS_184
Representing a disaggregation act->
Figure SMS_185
Distribution of complete action flow, +.>
Figure SMS_186
Representing a disaggregation act->
Figure SMS_187
In the dividingSolving for the distribution in the action stream. In view of the availability and simplicity of the calculation, the present embodiment decides to use a simplified 2-Wasserstein distance calculation distribution of the loss function +.>
Figure SMS_188
Figure SMS_189
wherein ,
Figure SMS_190
indicating desire(s)>
Figure SMS_191
Representing the variance.
Finally, the learning step loss function of the MUSIC algorithm framework is expressed as:
Figure SMS_192
wherein ,
Figure SMS_193
and />
Figure SMS_194
Two flow pseudo tag directed class loss functions (cross entropy) respectively,/and->
Figure SMS_195
Is an integrity constraint loss function,/->
Figure SMS_196
Is a decomposition action alignment loss function, +.>
Figure SMS_197
and />
Figure SMS_198
Is the weight that balances the individual losses.
And S4, completing an action recognition task under an unsupervised condition by using the learned MUSIC model.
The following briefly describes the performance of the MUSIC model in performing an action recognition task under unsupervised conditions:
The present embodiment employs two of the most common large datasets in terms of motion recognition as the source dataset for pre-training: kinetics and Ig65m. Meanwhile, two reference data sets are adopted as target data sets for testing the performance of the MUSIC algorithm: UCF-101 and HMDB-51. In UCF-101 and HMDB-51, more than 50% of the action types are new action types that are not useful in the source dataset.
In the test, the present embodiment adopts the cosine distance to measure the similarity of the two motion features. First, the video of each action type is randomly sampled as a control group. Then, the video (non-control group) of each action type is randomly selected for testing, and rank-1 and rank-5 are obtained. The above process is repeated for several times, and different control groups are selected again each time, and finally the obtained accuracy is averaged for several times.
Pre-training section, this example pre-trains the 3D-ResNeXt-101 model with Kinetics, and R (2+1) D-34[10 ] with Ig65m]And (5) a model. The input to the model is a video segment of 16 consecutive frames with a resolution of 224 x 224. Aggregation functions in a complete action stream
Figure SMS_199
Is an average pooling function, unless otherwise specified, the sampling interval of all video slices +. >
Figure SMS_200
. The cluster number is set to->
Figure SMS_201
. The number of segments m=3 of the full motion video. Loss function parameter->
Figure SMS_202
The invention realizes some most advanced unsupervised action recognition algorithms again, and uses the same pre-training model to compare, and the detailed performance is compared with that shown in the table.
Figure SMS_203
"fully supervised approach" refers to the migration of a pre-trained model onto a target dataset for supervised refinement training, with the algorithm chosen as Temporal Segment Network (TSN). The direct migration method refers to directly migrating the pre-trained model to a target data set for testing without fine training.
Under the condition of no supervision, the invention obtains the optimal performance, and compared with other non-supervision algorithms, the invention is improved to a great extent. Compared with the same type of work, the method has the main reason that the MUSIC algorithm models the relation between the decomposed action and the complete action, so that the network can learn the semantic information of the action deeper, and further can identify some new action types which are not in the pre-training data set. While other types of work do not explicitly solve the new action type problem in the action migration problem.
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present invention is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present invention.
Based on the same ideas of the unsupervised action migration and discovery method in the above embodiments, the present invention also provides an unsupervised action migration and discovery system that can be used to perform the unsupervised action migration and discovery method described above. For ease of illustration, only those portions of an unsupervised motion migration and discovery system embodiment are shown in a schematic configuration diagram relating to embodiments of the present invention, and those skilled in the art will appreciate that the illustrated configuration is not limiting of the apparatus and may include more or fewer components than illustrated, or may combine certain components, or a different arrangement of components.
Referring to FIG. 4, in another embodiment of the present application, an unsupervised action migration and discovery system 100 is provided, which includes a data acquisition module 101, a model building module 102, a mutual learning module 103, and an action recognition module 104;
The data acquisition module 101 is configured to acquire a target data set without a tag, where the target data set is an acquired video;
the model building module 102 is configured to build a MUSIC model, where the MUSIC model includes a convolutional network model for decomposing an action flow and a convolutional network model for a complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow uses a clustering algorithm to calculate the clustering centers of all complete videos as pseudo tags of video actions, and learns the complete actions expressed by the complete videos by using the pseudo tags;
the mutual learning module 103 is configured to mutually learn a convolutional network model of a decomposed action stream and a convolutional network model of a complete action stream, and in the mutual learning process, add an integrity constraint between the decomposed action stream and the complete action stream, so that the expression of the complete action is constructed by the decomposed action that has been learned, and differentiate the similar complete action by adopting a similar complete action differentiating policy, where if the decomposed actions are different, the complete actions to which the similar complete action differentiating policy belongs are classified into different categories, and finally introduce a decomposed action alignment policy, so that the convolutional network model of the decomposed action stream and the convolutional network model of the complete action stream learn a shared decomposed action;
The motion recognition module 104 is configured to complete a motion recognition task under an unsupervised condition by using the learned MUSIC model.
It should be noted that, the unsupervised action migration and discovery system of the present invention corresponds to the unsupervised action migration and discovery method of the present invention one by one, and technical features and beneficial effects described in the embodiments of the unsupervised action migration and discovery method are applicable to the unsupervised action migration and discovery embodiments, and specific content may be referred to the description in the embodiments of the method of the present invention, which is not repeated herein, and thus is stated herein.
Furthermore, in the implementation of the unsupervised action migration and discovery system of the foregoing embodiment, the logic division of each program module is merely illustrative, and in practical application, the above-mentioned function allocation may be performed by different program modules according to needs, for example, in view of configuration requirements of corresponding hardware or convenience of implementation of software, that is, the internal structure of the unsupervised action migration and discovery system is divided into different program modules to perform all or part of the functions described above.
Referring to fig. 5, in one embodiment, an electronic device implementing an unsupervised action migration and discovery method is provided, the electronic device 200 may include a first processor 201, a first memory 202, and a bus, and may further include a computer program, such as an unsupervised action migration and discovery program 203, stored in the first memory 202 and executable on the first processor 201.
The first memory 202 includes at least one type of readable storage medium, which includes flash memory, a mobile hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The first memory 202 may in some embodiments be an internal storage unit of the electronic device 200, such as a mobile hard disk of the electronic device 200. The first memory 202 may also be an external storage device of the electronic device 200 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a secure digital (SecureDigital, SD) Card, a Flash memory Card (Flash Card), etc. that are provided on the electronic device 200. Further, the first memory 202 may also include both an internal memory unit and an external memory device of the electronic device 200. The first memory 202 may be used to store not only application software installed in the electronic device 200 and various data, such as codes of the unsupervised action migration and discovery program 203, but also temporarily store data that has been output or is to be output.
The first processor 201 may be formed by an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be formed by a plurality of integrated circuits packaged with the same function or different functions, including one or more central processing units (Central Processing unit, CPU), a microprocessor, a digital processing chip, a graphics processor, a combination of various control chips, and so on. The first processor 201 is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 200 and processes data by running or executing programs or modules stored in the first memory 202 and calling data stored in the first memory 202.
Fig. 5 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 5 is not limiting of the electronic device 200 and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
The unsupervised action migration and discovery program 203 stored in the first memory 202 in the electronic device 200 is a combination of instructions that, when executed in the first processor 201, may implement:
acquiring a target data set without labels, wherein the target data set is acquired video;
constructing a decomposition action and complete action bidirectional learning MUSIC model, wherein the MUSIC model comprises a convolution network model for decomposing action flow and a convolution network model for complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow uses a clustering algorithm to calculate the clustering centers of all complete videos as pseudo tags of video actions, and learns the complete actions expressed by the complete videos by using the pseudo tags;
The convolution network model of the decomposed action flow and the convolution network model of the complete action flow are mutually learned to obtain a trained MUSIC model; in the mutual learning process, integrity constraint is added between the decomposition action flow and the complete action flow, so that the expression of the complete action is constructed by the learned decomposition action, and similar complete action distinguishing strategies are adopted to distinguish the similar complete actions, wherein the similar complete action distinguishing strategies are that if the decomposition actions are different, the complete actions to which the similar complete action distinguishing strategies belong are divided into different categories, and finally, a decomposition action alignment strategy is introduced, so that a convolution network model of the decomposition action flow and a convolution network model of the complete action flow both learn shared decomposition actions;
and completing the action recognition task under the unsupervised condition by using the learned MUSIC model.
Further, the modules/units integrated with the electronic device 200 may be stored in a non-volatile computer readable storage medium if implemented in the form of software functional units and sold or used as a stand-alone product. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a non-volatile computer readable storage medium, and where the program, when executed, may include processes in the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be made in the equivalent manner, and the embodiments are included in the protection scope of the present invention.

Claims (6)

1. An unsupervised method of action migration and discovery, comprising the steps of:
acquiring a target data set without labels, wherein the target data set is acquired video;
constructing a decomposition action and complete action bidirectional learning MUSIC model, wherein the MUSIC model comprises a convolution network model for decomposing action flow and a convolution network model for complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow uses a clustering algorithm to calculate the clustering centers of all complete videos as pseudo tags of video actions, and learns the complete actions expressed by the complete videos by using the pseudo tags;
The convolution network model of the decomposed action flow and the convolution network model of the complete action flow are mutually learned to obtain a trained MUSIC model; in the mutual learning process, integrity constraint is added between the decomposition action flow and the complete action flow, so that the expression of the complete action is constructed by the learned decomposition action, and similar complete action distinguishing strategies are adopted to distinguish the similar complete actions, wherein the similar complete action distinguishing strategies are that if the decomposition actions are different, the complete actions to which the similar complete action distinguishing strategies belong are divided into different categories, and finally, a decomposition action alignment strategy is introduced, so that a convolution network model of the decomposition action flow and a convolution network model of the complete action flow both learn shared decomposition actions;
the action learning of the convolutional network model for decomposing the action flow comprises a clustering step for decomposing the action flow and a learning step for decomposing the action flow; in the step of clustering the decomposed action stream, the features of all video slices are extracted, and the features of all video slices are clustered into a plurality of decomposed actions by using a clustering algorithm to obtain a decomposed action feature set A, wherein the extraction method of the decomposed action feature set A is as follows:
Figure QLYQS_1
Figure QLYQS_2
Where N represents the total number of videos,
Figure QLYQS_4
is a union operation, ++>
Figure QLYQS_6
Indicate->
Figure QLYQS_8
No. H of personal video>
Figure QLYQS_10
Decomposing action features extracted from individual slices, < >>
Figure QLYQS_12
Is the firstiB th to b th frames of the videob+l-1 frame of video slice, +.>
Figure QLYQS_14
Convolutional network model representing a resolved action flow, +.>
Figure QLYQS_16
Parameters of the convolutional network representing the split action flow, +.>
Figure QLYQS_3
Represents slice length, +.>
Figure QLYQS_5
Is the set of video slice start frames, +.>
Figure QLYQS_7
Every>
Figure QLYQS_9
Frame slice sampling of video, ++>
Figure QLYQS_11
Indicate->
Figure QLYQS_13
Total frame number of video>
Figure QLYQS_15
Representing the total number of slices of a video;
clustering the decomposition action feature set A by using a clustering algorithm to obtain a pseudo tag set of all the slice decomposition actions
Figure QLYQS_18
And clustering center set->
Figure QLYQS_20
, wherein />
Figure QLYQS_22
Indicate->
Figure QLYQS_24
Pseudo tag of the b-th slice of the video,>
Figure QLYQS_26
indicate->
Figure QLYQS_28
Video of->
Figure QLYQS_30
N represents the total number of videos, ">
Figure QLYQS_17
Represents slice b, +.>
Figure QLYQS_19
Representing the total number of slices, T, of a video i Indicate->
Figure QLYQS_21
Total frame number of the video, delta represents delta frame,/delta frame>
Figure QLYQS_23
Indicate->
Figure QLYQS_25
Cluster center feature of individual clusters,/>
Figure QLYQS_27
Subscript number indicating the cluster of resolved actions, < ->
Figure QLYQS_29
Representing the total number of the decomposed action clusters;
the action learning of the convolution network model of the complete action flow comprises a clustering step of the complete action flow and a learning step of the complete action flow;
In the clustering step of the complete action flow, the characteristics of the complete action are extracted as follows:
Figure QLYQS_31
Figure QLYQS_32
wherein ,
Figure QLYQS_34
represent the firstiComplete characteristics of the individual videos,/->
Figure QLYQS_36
Is an aggregation function of any kind, +.>
Figure QLYQS_38
Representing the partial features extracted from the mth segment of the ith video,/th video>
Figure QLYQS_39
Start frame representing mth video clip, < >>
Figure QLYQS_40
An overview of video clips is presented,
Figure QLYQS_41
is the (th) of the (th) video>
Figure QLYQS_42
Frame to->
Figure QLYQS_33
A video clip of frames, l denotes the video clip length,
Figure QLYQS_35
convolutional network showing complete action flow, +.>
Figure QLYQS_37
Is a parameter of the complete action flow convolution network, and let V represent the complete action feature set of all videos;
clustering the complete action feature set V by using a clustering algorithm to obtain a pseudo tag set of all the complete actions of the video
Figure QLYQS_43
, wherein />
Figure QLYQS_44
Indicate->
Figure QLYQS_45
Pseudo tag of individual video->
Figure QLYQS_46
Indicate->
Figure QLYQS_47
Video of->
Figure QLYQS_48
N represents the total number of videos;
the decomposition action alignment strategy specifically comprises the following steps: forced decomposition action flow and complete action flow learn shared decomposition actions by minimizing loss functions
Figure QLYQS_49
To align the split actions in the complete action stream +.>
Figure QLYQS_50
And decomposing +.>
Figure QLYQS_51
The specific loss function is as follows:
Figure QLYQS_52
wherein ,
Figure QLYQS_53
is any function representing the distance between the two distributions,/->
Figure QLYQS_54
Representing a disaggregation act- >
Figure QLYQS_55
Distribution of complete action flow, +.>
Figure QLYQS_56
Representing a disaggregation act->
Figure QLYQS_57
Distribution in decomposition action stream, taking account of effectiveness and simplicity of calculation, use of loss function simplifying 2-Wasserstein distance calculation distribution ≡>
Figure QLYQS_58
Figure QLYQS_59
wherein ,
Figure QLYQS_60
indicating desire(s)>
Figure QLYQS_61
Representing the variance;
finally, the learning step loss function of the MUSIC model is expressed as:
Figure QLYQS_62
wherein ,
Figure QLYQS_63
、/>
Figure QLYQS_64
is a class loss function directed by decomposing action flow and complete action flow pseudo tags,/for>
Figure QLYQS_65
Is an integrity constraint loss function,/->
Figure QLYQS_66
Is a decomposition action alignment loss function, +.>
Figure QLYQS_67
and />
Figure QLYQS_68
Is the weight to balance each loss;
the integrity constraints
Figure QLYQS_69
The realization of which is as follows:
Figure QLYQS_70
Figure QLYQS_71
wherein ,
Figure QLYQS_72
predictive probability vector representing complete features for each cluster,/->
Figure QLYQS_73
Representing the softmax parameter obtained after training and resetting each iteration;
and completing the action recognition task under the unsupervised condition by using the learned MUSIC model.
2. The unsupervised motion migration and discovery method according to claim 1, wherein in the learning step of decomposing the motion stream, random slice feature sampling is performed on all videos and classification probability of each slice feature is calculated as follows:
Figure QLYQS_74
,/>
wherein ,
Figure QLYQS_76
is->
Figure QLYQS_78
Video->
Figure QLYQS_79
Motion prediction probability vector for each slice, +. >
Figure QLYQS_81
Representation of
Figure QLYQS_83
Is>
Figure QLYQS_84
Column, i.e. pair +.in the predictive probability vector>
Figure QLYQS_86
Predictive probability of individual clusters, +.>
Figure QLYQS_75
A parameter representing softmax obtained by training a deep learning network,>
Figure QLYQS_77
resetting each iteration; />
Figure QLYQS_80
Is the real number field [ ]>
Figure QLYQS_82
]Is a matrix of (a); />
Figure QLYQS_85
Indicate the decomposition action->
Figure QLYQS_87
Video->
Figure QLYQS_88
Feature vectors of the individual slices;
order the
Figure QLYQS_89
Set of prediction vectors representing all slices, pseudo tag +.>
Figure QLYQS_90
Give->
Figure QLYQS_91
The individual slices provide self-supervision information, training->
Figure QLYQS_92
The loss function is obtained as follows:
Figure QLYQS_93
wherein ,
Figure QLYQS_94
is an indicator function.
3. The unsupervised action migration and discovery method according to claim 1, wherein the similar complete action discrimination strategy is specifically:
differentiating the complete actions by the most representative decomposition actions, the representative decomposition actions
Figure QLYQS_95
The method is obtained by taking the average value maximum value of the decomposition action prediction probability of each video segment, and specifically comprises the following steps:
Figure QLYQS_96
wherein ,
Figure QLYQS_97
is a function of the subscript of the maximum, +.>
Figure QLYQS_98
Representation->
Figure QLYQS_99
Decomposition action->
Figure QLYQS_100
Prediction probability of +.>
Figure QLYQS_101
Representing the current video +.>
Figure QLYQS_102
Is a total number of fragments;
classifying complete actions according to the representative decomposition actions, i.e. complete actions comprising different representative decomposition actions, should be identified as different action types and clustered into different clusters, in particular, a complete action cluster set as follows:
Figure QLYQS_103
wherein ,
Figure QLYQS_104
representing a subset of the complete action cluster set in accordance with the formula +.>
Figure QLYQS_105
Figure QLYQS_106
,/>
Figure QLYQS_107
Number of clusters representing complete actions, +.>
Figure QLYQS_108
Then a new complete action cluster label is obtained
Figure QLYQS_109
,/>
Figure QLYQS_110
Indicate->
Figure QLYQS_111
The complete motion of the individual video is +.>
Figure QLYQS_112
In (3), finally, utilize +.>
Figure QLYQS_113
For->
Figure QLYQS_114
Training was performed to obtain the loss function as follows:
Figure QLYQS_115
wherein ,
Figure QLYQS_116
representing the total number of cluster labels after applying a similar complete action discrimination strategy,/for>
Figure QLYQS_117
Representing video +.>
Figure QLYQS_118
Is predicted as action +.>
Figure QLYQS_119
Is a probability of (2).
4. An unsupervised action migration and discovery system, characterized in that it is applied to the unsupervised action migration and discovery method according to any one of claims 1 to 3, and comprises a data acquisition module, a model construction module, a mutual learning module and an action recognition module;
the data acquisition module is used for acquiring a target data set without labels, and the target data set is acquired video;
the model construction module is used for constructing a decomposition action and complete action bidirectional learning MUSIC model, wherein the MUSIC model comprises a convolution network model for decomposing action flow and a convolution network model for complete action flow; the convolution network model of the decomposition action flow is that all videos are subjected to slicing treatment, the clustering centers of all slices are calculated by using a clustering algorithm to be used as pseudo labels of slicing actions, and the decomposition actions expressed by the video slices are learned by using the pseudo labels; the convolution network model of the complete action flow uses a clustering algorithm to calculate the clustering centers of all complete videos as pseudo tags of video actions, and learns the complete actions expressed by the complete videos by using the pseudo tags;
The mutual learning module is used for mutually learning a convolutional network model of a decomposition action flow and a convolutional network model of a complete action flow, in the mutual learning process, integrity constraint is added between the decomposition action flow and the complete action flow, so that the expression of the complete action is constructed by the learned decomposition action, and similar complete action distinguishing strategies are adopted to distinguish the similar complete action, wherein the similar complete action distinguishing strategies are that if the decomposition actions are different, the complete actions to which the similar complete action distinguishing strategies belong are divided into different categories, and finally a decomposition action alignment strategy is introduced, so that the convolutional network model of the decomposition action flow and the convolutional network model of the complete action flow learn the shared decomposition action;
and the action recognition module is used for completing the action recognition task under the unsupervised condition by utilizing the learned MUSIC model.
5. An electronic device, the electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the unsupervised action migration and discovery method according to any one of claims 1-3.
6. A computer readable storage medium storing a program, wherein the program, when executed by a processor, implements the unsupervised action migration and discovery method according to any of claims 1-3.
CN202310063448.7A 2023-02-06 2023-02-06 Unsupervised action migration and discovery method, system, device and medium Active CN115861902B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310063448.7A CN115861902B (en) 2023-02-06 2023-02-06 Unsupervised action migration and discovery method, system, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310063448.7A CN115861902B (en) 2023-02-06 2023-02-06 Unsupervised action migration and discovery method, system, device and medium

Publications (2)

Publication Number Publication Date
CN115861902A CN115861902A (en) 2023-03-28
CN115861902B true CN115861902B (en) 2023-06-09

Family

ID=85657626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310063448.7A Active CN115861902B (en) 2023-02-06 2023-02-06 Unsupervised action migration and discovery method, system, device and medium

Country Status (1)

Country Link
CN (1) CN115861902B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116821737B (en) * 2023-06-08 2024-04-30 哈尔滨工业大学 Crack acoustic emission signal identification method based on improved weak supervision multi-feature fusion

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180165602A1 (en) * 2016-12-14 2018-06-14 Microsoft Technology Licensing, Llc Scalability of reinforcement learning by separation of concerns
US20190228313A1 (en) * 2018-01-23 2019-07-25 Insurance Services Office, Inc. Computer Vision Systems and Methods for Unsupervised Representation Learning by Sorting Sequences
CA3102439A1 (en) * 2018-06-08 2019-12-12 Zestfinance, Inc. Systems and methods for decomposition of non-differentiable and differentiable models
CN113870315B (en) * 2021-10-18 2023-08-25 南京硅基智能科技有限公司 Multi-algorithm integration-based action migration model training method and action migration method
CN113947525A (en) * 2021-11-25 2022-01-18 中山大学 Unsupervised action style migration method based on reversible flow network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于深度神经网络和投影树的高效率动作识别算法;郭洪涛;龙娟娟;;计算机应用与软件(第04期);第273-289页 *

Also Published As

Publication number Publication date
CN115861902A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
Kieu et al. Outlier detection for multidimensional time series using deep neural networks
Aakur et al. A perceptual prediction framework for self supervised event segmentation
CN111738532B (en) Method and system for acquiring influence degree of event on object
CN110334186B (en) Data query method and device, computer equipment and computer readable storage medium
US20210182602A1 (en) Flexible imputation of missing data
CN111052128A (en) Descriptor learning method for detecting and locating objects in video
CN115861902B (en) Unsupervised action migration and discovery method, system, device and medium
CN115695950B (en) Video abstract generation method based on content perception
CN113763385A (en) Video object segmentation method, device, equipment and medium
Nie et al. Classification-enhancement deep hashing for large-scale video retrieval
CN113764034B (en) Method, device, equipment and medium for predicting potential BGC in genome sequence
CN115062709A (en) Model optimization method, device, equipment, storage medium and program product
Lin et al. The design of error-correcting output codes based deep forest for the micro-expression recognition
CN117349494A (en) Graph classification method, system, medium and equipment for space graph convolution neural network
Yue et al. Vambc: A variational approach for mobility behavior clustering
CN116956171A (en) Classification method, device, equipment and storage medium based on AI model
CN115035455A (en) Cross-category video time positioning method, system and storage medium based on multi-modal domain resisting self-adaptation
Luo et al. Memory enhanced spatial-temporal graph convolutional autoencoder for human-related video anomaly detection
CN113420821A (en) Multi-label learning method based on local correlation of labels and features
JP5623344B2 (en) Reduced feature generation apparatus, method, program, model construction apparatus and method
Yu et al. Construction of garden landscape design system based on multimodal intelligent computing and deep neural network
CN111581469B (en) Multi-subspace representation-based partial multi-mark learning method
Wang et al. An Assessment of Fitness of Undergraduates in BITZH by Using SMOTE and Machine Learning Algorithms
Yang et al. Multi-scale Siamese prediction network for video anomaly detection
Li et al. Joint Domain Alignment and Adversarial Learning for Domain Generalization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant