CN116469171A - Self-supervision skeleton action recognition method based on cross-view consistency mining - Google Patents

Self-supervision skeleton action recognition method based on cross-view consistency mining

Info

Publication number
CN116469171A
CN116469171A (application CN202310490627.9A)
Authority
CN
China
Prior art keywords
view
skeleton
data
mining
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310490627.9A
Other languages
Chinese (zh)
Inventor
徐增敏
王露露
蒙儒省
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin Anview Technology Co ltd
Guilin University of Electronic Technology
Original Assignee
Guilin Anview Technology Co ltd
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin Anview Technology Co ltd, Guilin University of Electronic Technology filed Critical Guilin Anview Technology Co ltd
Priority to CN202310490627.9A
Publication of CN116469171A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762: Using clustering, e.g. of similar faces in social networks
    • G06V10/764: Using classification, e.g. of video objects
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/82: Using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of video processing, and in particular to a self-supervision skeleton action recognition method based on cross-view consistency mining. The method comprises: acquiring unlabeled skeleton data and preprocessing it; generating multiple views of the 3D skeleton from the skeleton sequence; obtaining augmented sequences of the multi-view data by combining multiple data augmentation methods; obtaining encoded features of the different views through an encoder and establishing a single-view contrastive learning framework; performing parallel self-supervised training on the multiple view branches through instance-discrimination contrastive learning to generate multiple single-view embedded features; constructing single-view semantic-level contrastive learning through a nearest-neighbor positive-sample mining method; learning a multi-view collaborative representation by combining a cross-view consistency mining module; and finally evaluating the model on an action recognition task with test data to obtain its recognition performance. The invention improves the performance of human skeleton action recognition models and addresses the shortcomings of existing human skeleton action recognition methods.

Description

Self-supervision skeleton action recognition method based on cross-view consistency mining
Technical Field
The invention relates to the technical field of video processing, in particular to a self-supervision skeleton action recognition method based on cross-view consistency mining.
Background
With the rapid development of 5G and Internet technologies, multimedia information such as images and videos has grown explosively. Processing this massive, redundant visual data purely by traditional manual means consumes enormous manpower and financial resources, posing great challenges for data analysis and understanding. Computer vision technology can capture the effective information in data quickly and in real time, enabling tasks such as detection, tracking, and behavior recognition of targets in video; it is one of the most widely deployed technologies in the artificial intelligence field.
Human action recognition is a research hotspot in computer vision. It determines human behaviors mainly by analyzing the correlation and visual appearance features of image frames in a video sequence, and has broad application prospects in fields such as intelligent monitoring, intelligent security, and video retrieval. Deep neural networks based on supervised learning have been successfully applied to various computer vision tasks, but such methods require training on labeled datasets, so model performance depends to some extent on the quantity and quality of labeled data; manual annotation consumes substantial resources and is costly, which has led more and more researchers to exploit existing unlabeled data. As a form of unsupervised learning, self-supervised learning can effectively exploit the large amounts of unlabeled data available in the big-data era: by supervising itself on relationships within the data, it learns generalizable features, and transferring these to downstream tasks can effectively improve the performance of human behavior recognition models.
Video data has multiple modalities that can represent human motion characteristics. Many action recognition techniques based on the RGB modality have achieved remarkable results, but RGB data is easily affected by factors such as background, appearance, and illumination changes, which makes such models less robust to environmental factors; RGB video is also voluminous and computationally expensive. With the development of depth sensors, skeleton data represents the human body structure through the 3D position coordinates of skeleton joints. This lightweight representation is easy to extract and is more robust to viewpoint transformation, person appearance, and environmental change, so action recognition using skeleton data is a direction of great interest.
Existing patents related to skeleton action recognition each have shortcomings of one kind or another; they are discussed below in two categories:
Supervised skeleton action recognition: the invention patent 'A human body action recognition method based on a recurrent neural network' (University of Science and Technology Beijing, 2022) uses a recurrent neural network for skeleton action recognition, selecting 15 motion-critical joints from the skeleton joint points and combining them pairwise to form new direction vectors that enhance the spatial features of the skeleton sequence; the invention patent 'A human skeleton action recognition method integrating internal and external dependencies' (Fujian University of Technology, 2022) effectively improves recognition accuracy by integrating graph convolutions over the internal and external dependencies of the skeleton graph and combining them with temporal convolutions into a spatio-temporal graph convolution model; the invention patent 'An action recognition method based on fusion of skeleton and image data' (Shanghai University, 2023) fuses the per-class prediction probabilities obtained from skeleton data and local image data, enriching the model's treatment of action details; the invention patent 'A deep ensemble method for skeleton action recognition' (Anhui Polytechnic University, 2023) integrates a convolutional neural network and a long short-term memory network to capture the diverse spatio-temporal dynamics of action recognition tasks, modeling the temporal dynamics of skeleton action sequences through ensemble learning.
Self-supervised skeleton action recognition: the Tianjin University invention patent 'A skeleton action recognition method based on segment-driven contrastive learning' (2022) takes segments of the skeleton sequence as instances, constructs segment-level and sequence-level positive and negative samples, and models sequence temporal information through segment order prediction to mine better skeleton sequence features; the Peking University Shenzhen Graduate School patent 'A contrastive self-supervised human behavior recognition method and system based on temporal-spatial information aggregation' (2022) performs intra-data fusion and inter-data voting on skeleton action sequences, motion information, and bone information to effectively aggregate spatio-temporal video information and obtain more reliable representations; the Hefei University of Technology invention patent 'A human action recognition method and system based on graph neural networks' (2022) uses downsampling with skip connections and corresponding upsampling layers to perform 2D feature extraction and joint recognition, feeding the resulting 2D joint information into a graph neural network to improve 3D action recognition; the Shanghai Artificial Intelligence Innovation Center patent 'A skeleton self-supervision method and model based on partial spatio-temporal data' (2023) uses spatial and temporal masking strategies to model the correlations between spatial nodes, enhancing the robustness and adaptability of self-supervised representations.
Disclosure of Invention
The invention aims to provide a self-supervision skeleton action recognition method based on cross-view consistency mining, so as to solve the technical problem that existing human action recognition methods extract little human action information from unlabeled skeleton data.
In order to achieve the above purpose, the invention provides a self-supervision skeleton action recognition method based on cross-view consistency mining, comprising the following steps:
acquiring unlabeled skeleton data and preprocessing the data;
generating multiple views of the 3D skeleton from the skeleton sequence;
combining multiple data augmentation methods to obtain augmented sequences of the multi-view data;
obtaining encoded features of the different views through an encoder and projecting them to a low-dimensional embedding space;
performing parallel self-supervised training on multiple view branches based on a single-view skeleton contrastive learning framework to generate multiple single-view embedded features;
constructing single-view semantic-level contrastive learning through a nearest-neighbor positive-sample mining method;
learning a collaborative representation of the multiple views in combination with a cross-view consistency mining module;
and evaluating the model on the action recognition task using test data to obtain its recognition performance.
Preferably, in acquiring the unlabeled skeleton data and preprocessing the data, a depth sensor collects unlabeled human skeleton data from different viewing angles, and the skeleton sequence represents the human body using the 3D position coordinates of the skeleton joints; for preprocessing, invalid frames are first removed from each skeleton sequence, and the sequence is then resized to 50 frames through linear interpolation.
Preferably, the multiple views of the 3D skeleton are specifically three views of the skeleton data: the joint, motion, and bone views. Given a 3D human skeleton sequence x ∈ R^{C×T×J} containing T frames, where J is the number of skeleton joints and C = 3 is the dimension of each joint's position vector, the motion view of the skeleton sequence is expressed as the temporal displacement between adjacent frames, x_{:,t+1,:} - x_{:,t,:}, and the bone view is expressed as the difference between two adjacent joints in the same frame, x_{:,:,j} - x_{:,:,j′} for joints j and j′ connected by a bone.
preferably, in the process of obtaining the amplification sequence of the multi-view data by combining multiple data enhancement methods, the batch number of the multi-view data of the non-tag skeleton is used as input, data enhancement is performed by three strategies of miscut transformation, time sequence clipping and joint point masking, different examples of the non-tag skeleton sequence are obtained, and the different examples are used as positive sample sets.
Preferably, the encoder is the graph convolutional neural network ST-GCN; the single-view contrastive learning framework comprises a data augmentation module, a feature encoder module, and a projection layer module; augmented sample data is input to the encoder to obtain a visual representation of the features, and the features are projected to a low-dimensional embedding space through an MLP layer.
Preferably, in generating the multiple single-view embedded features, the output features of the single-view skeleton contrastive learning network are constrained on the similarity of different instances through instance contrastive learning, and parallel self-supervised training is performed on the multiple view branches to obtain multiple preliminary single-view models.
Preferably, the nearest-neighbor positive-sample mining method compares the similarity between sample features and the features in the memory bank and selects the instance closest in the embedding space as a positive sample, thereby expanding the positive sample set in the contrastive learning process and promoting the clustering of high-confidence samples.
Preferably, a cross-view consistency mining method is adopted to learn the collaborative representation of the multiple views: prior knowledge is accumulated after single-view training, cross-view positive-sample mining is then performed, and the embedded features of samples in one view guide the representation learning of another view, thereby obtaining the collaborative multi-view representation.
Preferably, in evaluating the model on the action recognition task with test data to obtain its recognition performance, model performance is verified through linear evaluation of the action recognition task: the parameters of the self-supervised pre-trained encoder are fixed, a linear classifier, namely a fully connected layer and a softmax layer, is added, the classifier is trained with supervision, and the trained model is evaluated using test data to obtain its recognition performance.
The invention provides a self-supervision skeleton action recognition method based on cross-view consistency mining. The method acquires unlabeled skeleton data and preprocesses it, generates multiple views of the 3D skeleton from the skeleton sequence, obtains augmented sequences of the multi-view data by combining multiple data augmentation methods, obtains encoded features of the different views through an encoder, and establishes a single-view contrastive learning framework; it then performs parallel self-supervised training on the multiple view branches through instance-discrimination contrastive learning to generate multiple single-view embedded features, constructs single-view semantic-level contrastive learning through a nearest-neighbor positive-sample mining method, learns a multi-view collaborative representation by combining a cross-view consistency mining module, and finally evaluates the model on an action recognition task with test data to obtain its recognition performance. By using contrastive learning to build a self-supervised model that mines the semantic consistency among the multiple views of a skeleton sequence, the invention obtains good representations that benefit action recognition and addresses the problem that existing action recognition methods extract little human action information from unlabeled skeleton data.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the embodiments or the description of the prior art are briefly described below. The drawings in the following description are only some embodiments of the invention; other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow diagram of the self-supervision skeleton action recognition method based on cross-view consistency mining.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to fig. 1, the present invention provides a self-supervision skeleton action recognition method based on cross-view consistency mining, which is further described below with reference to specific steps:
S1: acquiring unlabeled skeleton data and preprocessing the data;
S2: generating multiple views of the 3D skeleton from the skeleton sequence;
S3: combining multiple data augmentation methods to obtain augmented sequences of the multi-view data;
S4: obtaining encoded features of the different views through an encoder and projecting them to a low-dimensional embedding space;
S5: performing parallel self-supervised training on multiple view branches based on a single-view skeleton contrastive learning framework to generate multiple single-view embedded features;
S6: constructing single-view semantic-level contrastive learning through a nearest-neighbor positive-sample mining method;
S7: learning a collaborative representation of the multiple views in combination with a cross-view consistency mining module;
S8: evaluating the model on the action recognition task using test data to obtain its recognition performance.
The specific implementation steps are further described as follows:
S1: acquiring unlabeled skeleton data and preprocessing the data;
Specifically, a depth sensor collects unlabeled human skeleton data from different viewing angles, and each skeleton sequence represents the human body using the 3D position coordinates of its skeleton joints. For preprocessing, invalid frames are first removed from each skeleton sequence, and the sequence is then resized to 50 frames through linear interpolation.
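As a concrete illustration (not code from the patent), the following minimal NumPy sketch performs this preprocessing, assuming a sequence array of shape (C, T, J) in which invalid frames are all-zero:

```python
import numpy as np

def preprocess(seq: np.ndarray, target_len: int = 50) -> np.ndarray:
    """Drop invalid (all-zero) frames, then resample to target_len frames
    by linear interpolation along the time axis."""
    C, T, J = seq.shape
    valid = ~np.all(seq == 0, axis=(0, 2))   # True for frames containing data
    seq = seq[:, valid, :]
    t_old = np.linspace(0.0, 1.0, seq.shape[1])
    t_new = np.linspace(0.0, 1.0, target_len)
    out = np.empty((C, target_len, J), dtype=seq.dtype)
    for c in range(C):
        for j in range(J):
            out[c, :, j] = np.interp(t_new, t_old, seq[c, :, j])
    return out
```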
S2: generating multiple views of the 3D skeleton from the skeleton sequence;
Specifically, three views of the skeleton data are used: the joint, motion, and bone views. Given a 3D human skeleton sequence x ∈ R^{C×T×J} containing T frames, where J is the number of skeleton joints and C = 3 is the dimension of each joint's position vector, the motion view of the skeleton sequence is expressed as the temporal displacement between adjacent frames, x_{:,t+1,:} - x_{:,t,:}, and the bone view is expressed as the difference between two adjacent joints in the same frame, x_{:,:,j} - x_{:,:,j′} for joints j and j′ connected by a bone.
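As an illustration, assuming the tensor layout above and a `bones` list of (child, parent) joint-index pairs (the patent does not specify the topology), the motion and bone views can be derived as follows:

```python
import numpy as np

def motion_view(x: np.ndarray) -> np.ndarray:
    """Temporal displacement x[:, t+1, :] - x[:, t, :]; last frame zero-padded."""
    m = np.zeros_like(x)
    m[:, :-1, :] = x[:, 1:, :] - x[:, :-1, :]
    return m

def bone_view(x: np.ndarray, bones: list) -> np.ndarray:
    """Difference between each joint and its adjacent (parent) joint per frame."""
    b = np.zeros_like(x)
    for child, parent in bones:
        b[:, :, child] = x[:, :, child] - x[:, :, parent]
    return b
```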
S3: combining multiple data augmentation methods to obtain augmented sequences of the multi-view data;
Specifically, batches of the unlabeled multi-view skeleton data are taken as input, and data augmentation is performed through three strategies: shear transformation, temporal cropping, and joint masking. Given a single-view skeleton sequence x, different data augmentation methods produce a positive sample pair (x^q, x^k) ∈ R^{C×T×J}, where T is the number of frames, J the number of skeleton joints, and C the number of channels; the remaining skeleton sequences serve as negative samples.
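A minimal sketch of the three augmentation strategies follows; the magnitudes are assumed values, since the patent gives no concrete parameters:

```python
import numpy as np

def shear(x: np.ndarray, s: float = 0.3) -> np.ndarray:
    """Apply a random shear matrix to the 3D joint coordinates."""
    A = np.eye(3) + np.random.uniform(-s, s, (3, 3)) * (1 - np.eye(3))
    return np.einsum('dc,ctj->dtj', A, x)

def temporal_crop(x: np.ndarray, ratio: float = 0.6) -> np.ndarray:
    """Crop a random temporal segment, then resample back to the original length."""
    C, T, J = x.shape
    L = max(2, int(T * ratio))
    start = np.random.randint(0, T - L + 1)
    seg = x[:, start:start + L, :]
    t_old, t_new = np.linspace(0, 1, L), np.linspace(0, 1, T)
    out = np.empty_like(x)
    for c in range(C):
        for j in range(J):
            out[c, :, j] = np.interp(t_new, t_old, seg[c, :, j])
    return out

def joint_mask(x: np.ndarray, p: float = 0.1) -> np.ndarray:
    """Zero out a random subset of joints."""
    keep = (np.random.rand(x.shape[2]) >= p).astype(x.dtype)
    return x * keep[None, None, :]
```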
S4: obtaining encoded features of the different views through an encoder and establishing a single-view contrastive learning framework;
Specifically, the feature encoder is the graph convolutional neural network ST-GCN, and the single-view contrastive learning framework comprises a data augmentation module, a feature encoder module, and a projection layer module. The augmented sample pair is input to the encoders f_q and f_k to obtain the feature representations h = f_q(x^q; θ_q) and h′ = f_k(x^k; θ_k), where θ_q and θ_k are the learnable parameters of the two encoders; the key encoder's parameters follow a momentum update, θ_k ← m·θ_k + (1 - m)·θ_q, with momentum coefficient m. The features are projected into a low-dimensional embedding space by an MLP layer g(·) for similarity comparison, written z = g_q(h) and z′ = g_k(h′).
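The following PyTorch sketch shows one way to realize this dual-encoder scheme; `Encoder` stands in for the ST-GCN backbone (not reproduced here), and the dimensions and momentum value are assumptions:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleViewBranch(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim=256, emb_dim=128, m=0.999):
        super().__init__()
        self.m = m
        self.f_q = encoder                          # query encoder f_q
        self.g_q = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, emb_dim))
        self.f_k = copy.deepcopy(encoder)           # key encoder, momentum-updated
        self.g_k = copy.deepcopy(self.g_q)
        for p in list(self.f_k.parameters()) + list(self.g_k.parameters()):
            p.requires_grad = False

    @torch.no_grad()
    def momentum_update(self):
        # theta_k <- m * theta_k + (1 - m) * theta_q
        for q, k in zip(self.f_q.parameters(), self.f_k.parameters()):
            k.data = self.m * k.data + (1 - self.m) * q.data
        for q, k in zip(self.g_q.parameters(), self.g_k.parameters()):
            k.data = self.m * k.data + (1 - self.m) * q.data

    def forward(self, x_q, x_k):
        z = F.normalize(self.g_q(self.f_q(x_q)), dim=1)   # z = g_q(h)
        with torch.no_grad():
            self.momentum_update()
            z_k = F.normalize(self.g_k(self.f_k(x_k)), dim=1)  # z' = g_k(h')
        return z, z_k
```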
S5: performing parallel self-supervised training on multiple view branches through instance-discrimination contrastive learning to generate multiple single-view embedded features;
Specifically, for the output features of the single-view skeleton contrastive learning network, the similarity between sample instances is constrained by instance-discrimination contrastive learning, and the multiple view branches are trained in parallel to obtain multiple preliminary single-view representation models. Self-supervised training minimizes the loss

L_ins = -log [ exp(z·z′/τ) / ( exp(z·z′/τ) + Σ_{i=1}^{N} exp(z·m_i/τ) ) ]

where z·z′ is the dot product of the two embedding vectors, τ is the temperature coefficient, the m_i are the embedded features of negative samples, and N is the number of negative samples in the memory bank M. The memory bank is a first-in-first-out queue: after each update iteration, the new key embedding z′ is enqueued as a negative sample, while the earliest embedded feature in M is dequeued.
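A sketch of this loss and queue update (the standard InfoNCE form implied by the definitions above; batch handling is an assumption):

```python
import torch
import torch.nn.functional as F

def info_nce(z, z_k, queue, tau=0.07):
    """z, z_k: (B, D) normalized embeddings; queue: (N, D) negative memory bank."""
    pos = (z * z_k).sum(dim=1, keepdim=True)          # (B, 1) positive dot products
    neg = z @ queue.t()                               # (B, N) negative similarities
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(z.size(0), dtype=torch.long, device=z.device)
    return F.cross_entropy(logits, labels)            # -log softmax of the positive

def update_queue(queue, z_k):
    """FIFO: enqueue the new key embeddings, dequeue the oldest."""
    return torch.cat([z_k.detach(), queue], dim=0)[:queue.size(0)]
```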
S6: constructing single-view semantic-level contrastive learning through a nearest-neighbor positive-sample mining method;
Specifically, within each view the sample features are compared for similarity against the features in the memory bank, and the instance closest in the embedding space is selected as an additional positive sample, promoting the clustering of high-confidence samples during contrastive learning. Model training minimizes the loss

L_nn = -(1/|P|) Σ_{z_p ∈ P} log [ exp(z·z_p/τ) / ( exp(z·z′/τ) + Σ_{i=1}^{N} exp(z·m_i/τ) ) ]

where P is the positive sample set and topK(·) selects the k most similar sample features among the N samples in the memory bank and outputs their index values (here k = 1: since the data is unlabeled, a larger k would hurt contrastive performance). That is, the positive set for a sample x contains not only its data augmentation but also its nearest neighbor in the memory bank. Using nearest-neighbor mining in single-view contrastive learning pulls together more high-confidence positive samples, making the contrastive learning process more reasonable.
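An illustrative sketch of the mining step and loss with k = 1; the exact normalization is an assumption consistent with the formula above:

```python
import torch

def nn_contrastive_loss(z, z_k, queue, tau=0.07, k=1):
    """Positive set P = {augmented key z_k, nearest neighbor mined from queue}."""
    sim = z @ queue.t()                               # (B, N) similarities
    nn_idx = sim.topk(k, dim=1).indices.squeeze(1)    # topK with k = 1
    pos_nn = queue[nn_idx]                            # (B, D) mined positives
    exp_pos_aug = torch.exp((z * z_k).sum(1) / tau)   # augmented positive term
    exp_pos_nn = torch.exp((z * pos_nn).sum(1) / tau) # mined positive term
    denom = exp_pos_aug + torch.exp(sim / tau).sum(1)
    loss = -0.5 * (torch.log(exp_pos_aug / denom) +
                   torch.log(exp_pos_nn / denom))     # average over |P| = 2
    return loss.mean()
```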
S7: learning a collaborative representation of multiple views in combination with a cross-view consistency mining module;
Specifically, the cross-view consistency mining method uses the embedded features of samples in one view to guide the representation learning of another, promoting semantic exchange between views. For two view samples x^u and x^v, nearest-neighbor single-view contrastive learning first mines high-confidence positive samples for each view in the embedding space, enriching the number of latent positive pairs within each view; the nearest-neighbor contrastive loss of view u is then

L_u = -(1/|P_u|) Σ_{z_p ∈ P_u} log [ exp(z^u·z_p/τ) / ( exp(z^u·z′^u/τ) + Σ_{i=1}^{N} exp(z^u·m_i^u/τ) ) ]

where P_u contains the k-nearest neighbors of sample x^u in the negative sample bank of view u; the nearest-neighbor contrastive loss L_v of view v is obtained analogously.
For cross-view semantic interaction, x^u in view u is taken as the input sample, its embedding mined from view v as the positive sample, and the memory bank M_v of view v as the source of negatives, so that the embedded features of view v guide the contrastive learning of view u; symmetrically, view u guides the representation learning of view v. The cross-view contrastive loss is

L_CVKM = L_{u→v} + L_{v→u}

where L_{v→u} denotes taking the embedded features of view v as positive samples for view u. The multi-view contrastive learning process then minimizes the overall loss combining the single-view and cross-view terms, L = L_u + L_v + L_CVKM.
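A sketch of how these terms might combine, reusing `nn_contrastive_loss` from the sketch above; equal weighting of the four terms is an assumption:

```python
def total_loss(z_u, zk_u, queue_u, z_v, zk_v, queue_v):
    # single-view nearest-neighbor contrastive terms
    l_u = nn_contrastive_loss(z_u, zk_u, queue_u)
    l_v = nn_contrastive_loss(z_v, zk_v, queue_v)
    # cross-view consistency: view v embeddings/bank supervise view u, and vice versa
    l_u_to_v = nn_contrastive_loss(z_u, zk_v, queue_v)
    l_v_to_u = nn_contrastive_loss(z_v, zk_u, queue_u)
    return l_u + l_v + l_u_to_v + l_v_to_u
```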
S8: evaluating the model on the action recognition task using test data to obtain its recognition performance.
Specifically, model performance is verified through linear evaluation on the action recognition task: the parameters of the self-supervised pre-trained encoder are fixed, a linear classifier (a fully connected layer plus a softmax layer) is added, the classifier is then trained with supervision, and the trained model is evaluated with test data to obtain its recognition performance.
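A minimal sketch of this linear-evaluation setup; the optimizer choice and hyperparameters are assumptions:

```python
import torch
import torch.nn as nn

def build_linear_eval(encoder: nn.Module, feat_dim: int, num_classes: int):
    for p in encoder.parameters():
        p.requires_grad = False                    # freeze pre-trained weights
    classifier = nn.Linear(feat_dim, num_classes)  # fully connected layer
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()              # applies the softmax internally
    return classifier, optimizer, criterion
```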
Deep neural networks based on supervised learning have been successfully applied to various computer vision tasks, but they must be trained on labeled datasets, so model performance depends to some extent on the quantity and quality of the annotations, which require substantial resources to produce manually and are costly. Compared with supervised deep learning, self-supervised learning obtains generalizable representations by supervising itself on large amounts of unlabeled data; transferring these representations to downstream tasks can effectively improve the performance of human behavior recognition models while avoiding the need for large amounts of labeled data.
With the development of depth sensors, skeleton data represents the human body structure through the 3D position coordinates of skeleton joints. This lightweight representation is easy to extract and is strongly robust to viewpoint transformation, person appearance, and environmental change. Self-supervised learning methods based on contrastive learning have recently attracted research interest; they do not depend on labeled data and obtain their training signal chiefly by judging whether two inputs are similar.
In summary, the invention uses contrastive learning to build a self-supervised model that mines the semantic consistency among the multiple views of a skeleton sequence, thereby obtaining good representations that benefit action recognition and addressing the problem that existing action recognition methods extract little human action information from unlabeled skeleton data.
The above disclosure is only a preferred embodiment of the present invention and is not intended to limit its scope; those skilled in the art will appreciate that all or part of the procedures described above, and equivalent changes made within the scope of the claims, still fall within the coverage of the present invention.

Claims (9)

1. A self-supervision skeleton action recognition method based on cross-view consistency mining, characterized by comprising the following steps:
acquiring unlabeled skeleton data and preprocessing the data;
generating multiple views of the 3D skeleton from the skeleton sequence;
combining multiple data augmentation methods to obtain augmented sequences of the multi-view data;
obtaining encoded features of the different views through an encoder and projecting them to a low-dimensional embedding space;
performing parallel self-supervised training on multiple view branches based on a single-view skeleton contrastive learning framework to generate multiple single-view embedded features;
constructing single-view semantic-level contrastive learning through a nearest-neighbor positive-sample mining method;
learning a collaborative representation of the multiple views in combination with a cross-view consistency mining module;
and evaluating the model on the action recognition task using test data to obtain its recognition performance.
2. The self-supervision skeleton action recognition method based on cross-view consistency mining of claim 1, wherein
in acquiring the unlabeled skeleton data and preprocessing the data, a depth sensor collects unlabeled human skeleton data from different viewing angles, the skeleton sequence represents the human body using the 3D position coordinates of the skeleton joints, invalid frames are first removed from each skeleton sequence for preprocessing, and the sequence is then resized to 50 frames through linear interpolation.
3. The self-supervision skeleton action recognition method based on cross-view consistency mining of claim 2, wherein
the multiple views of the 3D skeleton are specifically three views of the skeleton data: the joint, motion, and bone views; given a 3D human skeleton sequence x ∈ R^{C×T×J} containing T frames, where J is the number of skeleton joints and C = 3 is the dimension of each joint's position vector, the motion view of the skeleton sequence is expressed as the temporal displacement between adjacent frames, x_{:,t+1,:} - x_{:,t,:}, and the bone view is expressed as the difference between two adjacent joints in the same frame, x_{:,:,j} - x_{:,:,j′} for joints j and j′ connected by a bone.
4. The self-supervision skeleton action recognition method based on cross-view consistency mining of claim 3, wherein
in combining multiple data augmentation methods to obtain the augmented sequences of the multi-view data, batches of the unlabeled multi-view skeleton data are used as input, data augmentation is performed through the three strategies of shear transformation, temporal cropping, and joint masking to obtain different instances of the unlabeled skeleton sequence, and these different instances are taken as the positive sample set.
5. The self-supervision skeleton action recognition method based on cross-view consistency mining of claim 4, wherein
the encoder is the graph convolutional neural network ST-GCN; the single-view contrastive learning framework comprises a data augmentation module, a feature encoder module, and a projection layer module; augmented sample data is input to the encoder to obtain a visual representation of the features, and the features are projected to a low-dimensional embedding space through an MLP layer.
6. The self-supervision skeleton action recognition method based on cross-view consistency mining of claim 5, wherein
in generating the multiple single-view embedded features, the output features of the single-view skeleton contrastive learning network are constrained on the similarity of different instances through instance contrastive learning, and parallel self-supervised training is performed on the multiple view branches to obtain multiple preliminary single-view models.
7. The self-supervision skeleton action recognition method based on cross-view consistency mining of claim 6, wherein
the nearest-neighbor positive-sample mining method compares the similarity between sample features and the features in the memory bank and selects the instance closest in the embedding space as a positive sample, thereby expanding the positive sample set in the contrastive learning process and promoting the clustering of high-confidence samples.
8. The self-supervision skeleton action recognition method based on cross-view consistency mining of claim 7, wherein
the cross-view consistency mining method learns the collaborative representation of the multiple views by accumulating prior knowledge after single-view training and then performing cross-view positive-sample mining, in which the embedded features of samples in one view guide the representation learning of another view, thereby obtaining the collaborative multi-view representation.
9. The self-supervision skeleton action recognition method based on cross-view consistency mining of claim 8, wherein
in evaluating the model on the action recognition task with test data to obtain its recognition performance, model performance is verified through linear evaluation of the action recognition task: the parameters of the self-supervised pre-trained encoder are fixed, a linear classifier, namely a fully connected layer and a softmax layer, is added, the classifier is trained with supervision, and the trained model is evaluated using test data to obtain its recognition performance.
CN202310490627.9A (filed 2023-05-04, priority 2023-05-04): Self-supervision skeleton action recognition method based on cross-view consistency mining. Status: Pending. Publication: CN116469171A (en).

Priority Applications (1)

CN202310490627.9A (priority date 2023-05-04, filing date 2023-05-04): CN116469171A (en), Self-supervision skeleton action recognition method based on cross-view consistency mining


Publications (1)

CN116469171A, published 2023-07-21

Family

ID=87182431

Family Applications (1)

CN202310490627.9A (pending, CN116469171A (en)): Self-supervision skeleton action recognition method based on cross-view consistency mining

Country Status (1)

CN: CN116469171A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690192A (en) * 2024-02-02 2024-03-12 天度(厦门)科技股份有限公司 Abnormal behavior identification method and equipment for multi-view instance-semantic consensus mining
CN117690192B (en) * 2024-02-02 2024-04-26 天度(厦门)科技股份有限公司 Abnormal behavior identification method and equipment for multi-view instance-semantic consensus mining
CN118194099A (en) * 2024-05-17 2024-06-14 浙江大学 Method, system, medium and apparatus for unsupervised action detection
CN118194099B (en) * 2024-05-17 2024-08-13 浙江大学 Method, system, medium and apparatus for unsupervised action detection


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination