CN115965995B - Skeleton self-supervision method and model based on partial space-time data - Google Patents

Skeleton self-supervision method and model based on partial space-time data

Info

Publication number
CN115965995B
CN115965995B CN202211687076.7A
Authority
CN
China
Prior art keywords
skeleton
data
loss function
stream data
dimensional space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211687076.7A
Other languages
Chinese (zh)
Other versions
CN115965995A (en)
Inventor
周彧杰
段浩东
王佳琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202211687076.7A priority Critical patent/CN115965995B/en
Publication of CN115965995A publication Critical patent/CN115965995A/en
Application granted granted Critical
Publication of CN115965995B publication Critical patent/CN115965995B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a skeleton self-supervision method and model based on partial space-time data. The method comprises the following steps: performing data enhancement on skeleton time-sequence data to obtain three original input streams; inputting the anchor input stream directly to an encoder to obtain skeleton features of complete semantics, and inputting these features to a multi-layer perceptron to map them to a first high-dimensional space vector; inputting the CSM-processed spatial stream to the encoder to extract the feature vector corresponding to the spatial stream data, inputting that feature vector to the multi-layer perceptron to obtain a second high-dimensional space vector, and calculating a cross-correlation matrix and a loss function against the first high-dimensional space vector; inputting the MATM-processed temporal stream to the encoder to extract the feature vector corresponding to the temporal stream data, inputting that feature vector to the multi-layer perceptron to obtain a third high-dimensional space vector, and again calculating a cross-correlation matrix and a loss function against the first high-dimensional space vector; and iteratively updating the parameters of the skeleton self-supervision model. The invention can capture the space-time redundant information in skeleton data, thereby promoting the extraction of time-sequence action features of human skeletons.

Description

Skeleton self-supervision method and model based on partial space-time data
Technical Field
The invention relates to the technical field of electronics, in particular to a skeleton self-supervision method and model based on partial space-time data.
Background
Human skeleton self-supervised learning mainly studies how to pre-train a model on an unlabeled training set to obtain effective data representations, so that the model generalizes better to downstream tasks. The prior art mainly adopts a contrastive learning paradigm and uses various data-enhancement methods in the pre-training stage to enlarge the pool of negative samples (an enhanced sample obtained by data enhancement of the current skeleton data is a positive sample of the current sample, while enhanced samples obtained from the skeleton data of other samples are its negative samples). These methods are limited by the number of negative samples and consider only global data enhancement.
Human skeleton self-supervision learning is divided into two phases: firstly, in a label-free pre-training stage, a model learns a distinguishing characteristic by utilizing a clustering relation of data; and secondly, in a downstream task stage, the model is initialized by using the weights trained in the pre-training stage and then is migrated to various downstream tasks, such as linear probes, fine tuning, semi-supervised learning and the like.
Contrastive learning is a common paradigm of human skeleton self-supervised learning. The model first performs two data enhancements on the same input sample to construct a positive sample pair, and treats the other enhanced samples in the same batch as negative samples. The contrastive learning objective pulls positive samples closer and pushes negative samples apart, so that the model can use the clustering relations of the data to produce discriminative representations that then generalize to various downstream tasks. For example, one prior-art scheme uses MoCo (Momentum Contrast for Unsupervised Visual Representation Learning) as the base model: in the pre-training stage, the model obtains enhanced samples through two data enhancements, spatial shearing and temporal cropping, while using a large storage space to hold large-scale negative samples; contrastive learning then forces the distance between positive samples to decrease and the distance between negative samples to increase, so that the model can distinguish the specificities of different samples and learn a discriminative representation for downstream tasks. However, such contrastive models consider only global data-enhancement methods to generate positive and negative sample pairs, and must maintain a large-scale negative-sample storage space to keep the learned features from collapsing. When the skeleton dataset is small, the negative samples cannot be guaranteed to meet the pre-training requirement, which harms model performance; moreover, the space-time redundancy of skeleton data is not exploited, and the learned features are not robust to the loss of skeleton nodes.
Disclosure of Invention
The invention aims to construct a skeleton self-supervision model without negative samples, avoid dependence on a large number of negative samples, and enable the model to obtain the characteristic with space-time robustness more effectively by utilizing redundancy of skeleton data in time and space.
The invention mainly aims to provide a skeleton self-supervision method based on partial space-time data.
It is another object of the present invention to provide a skeletal self-supervising model based on partial spatio-temporal data.
In order to achieve the above purpose, the technical scheme of the invention is specifically realized as follows:
The invention provides a skeleton self-supervision method based on partial space-time data, which comprises the following steps: acquiring skeleton time-sequence data; obtaining data of three original input streams from the skeleton time-sequence data through data enhancement, wherein the data of the original input streams comprise anchor input stream data, spatial stream data, and temporal stream data; directly inputting the anchor input stream data to an encoder to obtain skeleton features of complete semantics, and inputting the skeleton features of complete semantics to a multi-layer perceptron to map them to a first high-dimensional space vector; performing CSM processing on the spatial stream data by using a CSM spatial mask strategy, inputting the CSM-processed spatial stream data to the encoder to extract the feature vector corresponding to the spatial stream data, inputting that feature vector to the multi-layer perceptron to map it to a second high-dimensional space vector, calculating a first cross-correlation matrix using the second and first high-dimensional space vectors, and calculating a first loss function using the first cross-correlation matrix; performing MATM processing on the temporal stream data by using a MATM time-sequence mask strategy, inputting the MATM-processed temporal stream data to the encoder to extract the feature vector corresponding to the temporal stream data, inputting that feature vector to the multi-layer perceptron to map it to a third high-dimensional space vector, calculating a second cross-correlation matrix using the third and first high-dimensional space vectors, and calculating a second loss function using the second cross-correlation matrix; and iteratively updating the parameters of the skeleton self-supervision model according to the first loss function and the second loss function.
In another aspect, the invention provides a skeleton self-supervision model based on partial space-time data, comprising: an acquisition module for acquiring the skeleton time-sequence data; an enhancement module for obtaining data of three original input streams from the skeleton time-sequence data through data enhancement, wherein the data of the original input streams comprise anchor input stream data, spatial stream data, and temporal stream data; an anchor input stream module for directly inputting the anchor input stream data to the encoder to obtain skeleton features of complete semantics, and inputting the skeleton features of complete semantics to the multi-layer perceptron to map them to a first high-dimensional space vector; a spatial stream module that performs CSM processing on the spatial stream data by using a CSM spatial mask strategy, inputs the CSM-processed spatial stream data to the encoder to extract the feature vector corresponding to the spatial stream data, inputs that feature vector to the multi-layer perceptron to map it to a second high-dimensional space vector, calculates a first cross-correlation matrix using the second and first high-dimensional space vectors, and calculates a first loss function using the first cross-correlation matrix; a temporal stream module for performing MATM processing on the temporal stream data by using a MATM time-sequence mask strategy, inputting the MATM-processed temporal stream data to the encoder to extract the feature vector corresponding to the temporal stream data, inputting that feature vector to the multi-layer perceptron to map it to a third high-dimensional space vector, calculating a second cross-correlation matrix using the third and first high-dimensional space vectors, and calculating a second loss function using the second cross-correlation matrix; and an updating module for iteratively updating the parameters of the skeleton self-supervision model according to the first loss function and the second loss function.
As can be seen from the technical scheme provided above, the invention provides a skeleton self-supervision method and model based on partial space-time data. Its core is a negative-sample-free three-stream skeleton self-supervision framework: after the same data enhancement, the data is fed into three streams, wherein the anchor input stream directly uses the model to extract feature vectors, so as to preserve the original action-semantic information; the spatial stream introduces a centrality masking strategy based on the skeleton node graph to mask part of the skeleton nodes; and the temporal stream introduces a frame-attention masking strategy based on action-change amplitude to mask part of the time-sequence frames. By exploring node correlation and frame correlation, the two strategies further promote the model's use of space-time redundant information, so that the model can capture the space-time redundant information in skeleton data and the extraction of skeleton features is promoted.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a skeleton self-supervision method based on partial spatio-temporal data provided in embodiment 1 of the present invention;
FIG. 2 is a specific flow chart of the skeleton self-supervision method based on partial spatio-temporal data provided in embodiment 1 of the present invention;
fig. 3 is a schematic structural diagram of a skeleton self-supervision model based on partial spatio-temporal data according to embodiment 1 of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or quantity or position.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Example 1
The present embodiment provides a skeleton self-supervision method based on partial spatiotemporal data, as shown in fig. 1, which is a flowchart of the skeleton self-supervision method based on partial spatiotemporal data of the present embodiment, and fig. 2 shows a specific flowchart structure of the skeleton self-supervision method based on partial spatiotemporal data of the present embodiment. The flow chart shown in fig. 1 includes:
Step S101, acquiring skeleton time-sequence data. Specifically, skeleton space-time data refers to the skeleton data generated when a human body moves in space and time. For acquisition, a Kinect (a 3D somatosensory camera released by Microsoft Corporation) sensor or another 3D somatosensory sensor can be used to capture depth information of the human-body image, thereby obtaining the distance between the object and the camera; combining this with the color data provided by a color camera gives a more accurate description of the object; finally, a skeleton-point estimation algorithm estimates the 3D coordinate information and confidence of each point of the human body, thereby obtaining the skeleton time-sequence data. As shown in fig. 2 (a), the acquired skeleton time-sequence data is denoted as s.
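For orientation only (the shapes, names, and the confidence array below are illustrative assumptions, not taken from the patent), the skeleton time-sequence data s can be sketched as a coordinate array over channels, frames, and nodes:

```python
import numpy as np

rng = np.random.default_rng(0)

C, T, V = 3, 50, 25                    # xyz channels, frames, skeleton nodes (Kinect-style sizes)
s = rng.normal(size=(C, T, V))         # 3D coordinates of each node in each frame
confidence = rng.uniform(size=(T, V))  # per-node confidence from the skeleton-point estimator
```

A real pipeline would fill `s` from the depth sensor and estimation algorithm rather than random values.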
Step S102, performing data enhancement on the skeleton time-sequence data to obtain data of three original input streams, wherein the data of the original input streams comprise: anchor input stream data, spatial stream data, and temporal stream data. Specifically, as shown in fig. 2 (a), for the acquired skeleton time-sequence data s, the same data-enhancement method τ is first used to obtain the original inputs of the three streams, denoted as x', x, and x̃, wherein the middle stream is the anchor input stream x, the upper stream is the spatial stream x', and the lower stream is the temporal stream x̃.
Step S103, directly inputting the anchor input stream data to an encoder to obtain skeleton features of complete semantics, and inputting the skeleton features of complete semantics to a multi-layer perceptron to map them to a first high-dimensional space vector. Specifically, the anchor input stream is the original skeleton data, which is first data-enhanced and then input directly into the encoder, without passing through any space-time mask module, to obtain complete skeleton feature data carrying all the action semantics. As shown in fig. 2 (a), the anchor input stream data x is directly input to the encoder f to obtain the skeleton features h of complete semantics, which are then mapped through the multi-layer perceptron g to a vector z in a high-dimensional space, kept for later use.
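The anchor-stream pipeline described above (x through encoder f to features h, then through perceptron g to vector z) can be sketched as follows; the random linear maps standing in for f and g are purely hypothetical stand-ins, not the patent's actual networks:

```python
import numpy as np

rng = np.random.default_rng(42)

# A batch of enhanced skeleton sequences, flattened per sample (sizes are illustrative).
batch, in_dim, feat_dim, proj_dim = 8, 3 * 50 * 25, 256, 128
x = rng.normal(size=(batch, in_dim))

# Stand-in encoder f and multi-layer perceptron g: random linear maps for illustration.
W_f = rng.normal(size=(in_dim, feat_dim)) / np.sqrt(in_dim)
W_g = rng.normal(size=(feat_dim, proj_dim)) / np.sqrt(feat_dim)

h = np.maximum(x @ W_f, 0.0)  # f: skeleton features of complete semantics
z = h @ W_g                   # g: first high-dimensional space vector, kept for later use
```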
Step S104, performing CSM processing on the spatial stream data by using a CSM spatial mask strategy, inputting the CSM-processed spatial stream data to the encoder to extract the feature vector corresponding to the spatial stream data, inputting that feature vector to the multi-layer perceptron to map it to a second high-dimensional space vector, calculating a first cross-correlation matrix using the second and first high-dimensional space vectors, and calculating a first loss function using the first cross-correlation matrix. Specifically, a centrality masking strategy based on the skeleton node graph (Central Spatial Masking, CSM) is introduced for the spatial stream data: the spatial stream data x' passes through the CSM spatial mask module; the CSM-processed spatial stream data x'_c is passed through the encoder f to extract the feature vector h'_c, which is mapped through the multi-layer perceptron g to a vector z'_c in the high-dimensional space; a cross-correlation matrix is calculated between z'_c and the previously kept high-dimensional embedding z of the anchor input stream, and the first loss function is calculated using this cross-correlation matrix.
In an alternative embodiment, performing CSM processing on the spatial stream data using the CSM spatial mask strategy comprises: for each skeleton node in the spatial stream data, acquiring the degree d of the skeleton node, wherein the degree of a skeleton node is the number of skeleton nodes adjacent to it; and calculating the mask probability p_i of each skeleton node, wherein p_i is used to let the encoder recover the masked skeleton-node information from the information of adjacent skeleton nodes, and the calculation formula of p_i is: p_i = d_i / Σ_j d_j
where p_i is the mask probability of the i-th node and d_i is the degree of the i-th node. Specifically, as shown in fig. 2 (b), in the human skeleton node graph different nodes interact with their adjacent nodes in different ways: a node with a larger degree has more adjacent nodes, so more nodes can exchange information with it and its information is easier to integrate. Concretely, some nodes have 4 or 3 neighbors, some contain only 2, and edge nodes have only a single neighbor. In the CSM module, the higher a node's degree, the higher its probability of being masked; by calculating the mask probability of each node, the encoder f is helped to recover the masked node information from the adjacent node information more easily. The masked skeleton-node data x'_c is passed through the encoder f to extract the feature vector h'_c, which is then mapped through the multi-layer perceptron g to the vector z'_c in the high-dimensional space; finally a cross-correlation matrix is calculated with the previously kept high-dimensional embedding z of the anchor input stream. The loss function drives this cross-correlation matrix toward the identity matrix as far as possible, i.e., diagonal elements tend to 1 and off-diagonal elements tend to 0. In an alternative embodiment, the first loss function is calculated as: L1 = Σ_i (1 - C_ii)² + λ Σ_i Σ_{j≠i} C_ij²
wherein L1 represents the first loss function; λ represents a balance hyperparameter whose main function is to balance the influence of the diagonal elements and the off-diagonal elements of the cross-correlation matrix on the loss (in practice λ = 0.0002 can be used); C represents the cross-correlation matrix calculated between the two vectors; and i, j index positions in the cross-correlation matrix. C_ij is calculated as: C_ij = Σ_b z_{b,i} z'_{b,j} / ( √(Σ_b z_{b,i}²) · √(Σ_b z'_{b,j}²) )
where z_{b,i} and z'_{b,j} are components of the first and second high-dimensional space vectors, respectively, and b indexes the batch. z_{b,i} and z'_{b,j} correspond to fig. 2 (a) and refer to the vectors z and z'_c in the high-dimensional space after x and x' are mapped; the batch size b is the number of input samples in each training iteration, and b = 128 can be used.
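A minimal sketch of the CSM mask probabilities and the identity-driven cross-correlation loss described above, assuming a degree-proportional mask probability and Barlow-Twins-style loss terms (the graph degrees and sizes are illustrative, not the patent's actual data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skeleton-graph degrees d_i for a few nodes.
degrees = np.array([1, 2, 2, 3, 4, 2, 2, 1])
mask_prob = degrees / degrees.sum()  # p_i: higher degree -> higher mask probability

# Embeddings of a batch for the anchor stream (z) and the masked spatial stream (zp).
b, dim = 128, 16
z = rng.normal(size=(b, dim))
zp = rng.normal(size=(b, dim))

def cross_correlation(z, zp):
    """C_ij = sum_b z[b,i]*zp[b,j] / (||z[:,i]|| * ||zp[:,j]||)."""
    num = z.T @ zp
    den = np.linalg.norm(z, axis=0)[:, None] * np.linalg.norm(zp, axis=0)[None, :]
    return num / den

def first_loss(c, lam=0.0002):
    """Drive C toward the identity: diagonal -> 1, off-diagonal -> 0."""
    on_diag = ((1.0 - np.diag(c)) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lam * off_diag

c = cross_correlation(z, zp)
loss = first_loss(c)
```

When the two streams produce identical embeddings, the diagonal of C is exactly 1 and the loss is near zero, which is the behavior the text describes.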
Step S105, performing MATM processing on the temporal stream data by using a MATM time-sequence mask strategy, inputting the MATM-processed temporal stream data to the encoder to extract the feature vector corresponding to the temporal stream data, inputting that feature vector to the multi-layer perceptron to map it to a third high-dimensional space vector, calculating a second cross-correlation matrix using the third and first high-dimensional space vectors, and calculating a second loss function using the second cross-correlation matrix. Specifically, a frame-attention masking strategy based on motion-change amplitude (Motion Attention Temporal Masking, MATM) is introduced for the temporal stream data, so that the temporal stream data x̃ passes through the MATM time-sequence mask module.
In an alternative embodiment, performing MATM processing on the temporal stream data using the MATM time-sequence mask strategy comprises: for each action frame in the temporal stream data, calculating the variation amplitude of the skeleton nodes in the action frame; judging whether the action frame is a key frame according to the variation amplitude of the skeleton nodes; and masking the key frames. Specifically, as shown in fig. 2 (c), for a human action, the larger the rate of change of the skeleton nodes in a segment, the larger the change amplitude of the action at that moment and the more likely that segment is a key frame of the action. By masking the key frames, MATM forces the model not to rely on the key frames when recognizing actions but to decide the action semantics through the other, redundant frames, which helps the model infer complete action information from partial time-sequence frames. The attention of each frame is calculated as: a_t = Σ_{i,c} |m_{i,t,c}|
wherein a_t represents the attention of frame t, t is the frame index, and m is the variation amplitude of each frame; the variation amplitude of the current frame can be calculated from the position difference of the skeleton nodes between adjacent frames of the sequence, with the formula:
m_{:,t,:} = x_{:,t+1,:} - x_{:,t,:}
where m_{:,t,:} represents the variation amplitude of frame t; in the subscripts of m, the first position indexes the node, the second the frame, and the third the channel. x_{:,t,:} denotes the skeleton position information of frame t and x_{:,t+1,:} that of frame t+1, with the same subscript convention for x. The MATM-processed data is likewise mapped through the encoder f and the multi-layer perceptron g to a vector z̃ in the high-dimensional space; a cross-correlation matrix is then calculated with the previously kept high-dimensional embedding z of the anchor input stream, and the second loss function L2 likewise forces this cross-correlation matrix toward the identity matrix as far as possible. The calculation of L2 is similar to that of the first loss function L1 and is not repeated here.
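The MATM computation above (frame-to-frame variation amplitude, per-frame attention, then masking of key frames) can be sketched as follows; the aggregation used for the attention and the top-k masking rule are assumptions for illustration, not the patent's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

V, T, C = 25, 20, 3             # nodes, frames, channels (illustrative sizes)
x = rng.normal(size=(V, T, C))  # skeleton sequence indexed [node, frame, channel]

# Frame-to-frame variation amplitude: m[:, t, :] = x[:, t+1, :] - x[:, t, :]
m = x[:, 1:, :] - x[:, :-1, :]  # shape (V, T-1, C)

# Per-frame attention: total motion magnitude of the frame (one plausible reading of a_t).
a = np.abs(m).sum(axis=(0, 2))  # shape (T-1,)

# Mask the k most-attended ("key") frames so the model must rely on redundant frames.
k = 4
key_frames = np.argsort(a)[-k:]
x_masked = x.copy()
x_masked[:, key_frames, :] = 0.0
```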
Steps S103 to S105 have no required temporal order and may be performed simultaneously or sequentially.
Step S106, iteratively updating the parameters of the skeleton self-supervision model according to the first loss function and the second loss function. Specifically, the model parameters are updated through the loss function, thereby realizing the self-supervised learning of the model.
In an alternative embodiment, iteratively updating the parameters of the skeleton self-supervision model according to the first and second loss functions comprises: adding the first loss function and the second loss function to calculate the final loss value; reducing the final loss value by gradient descent; and iteratively updating the parameters of the skeleton self-supervision model according to the final loss value. Specifically, the final loss is the sum of the spatial-stream loss and the temporal-stream loss, and the loss value is reduced by gradient descent so as to iteratively update the model parameters. The total loss function L is: L = L1 + L2
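As a toy illustration of summing the two losses and reducing the total by gradient descent, the following uses stand-in quadratic losses with an analytic gradient (none of this is the patent's actual model; it only shows the update rule):

```python
import numpy as np

theta = np.array([4.0, -3.0])           # pretend model parameters

def loss1(t): return (t[0] - 1.0) ** 2  # stands in for the spatial-stream loss L1
def loss2(t): return (t[1] + 2.0) ** 2  # stands in for the temporal-stream loss L2

def grad(t):                            # analytic gradient of loss1 + loss2
    return np.array([2.0 * (t[0] - 1.0), 2.0 * (t[1] + 2.0)])

lr = 0.1
history = []
for _ in range(100):
    history.append(loss1(theta) + loss2(theta))  # total loss L = L1 + L2
    theta -= lr * grad(theta)                    # iterative parameter update
```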
In the skeleton self-supervision method based on partial space-time data provided by this embodiment, a negative-sample-free three-stream skeleton self-supervision framework is proposed: after the same data enhancement, the data is fed into three streams, wherein the anchor input stream directly uses the model to extract feature vectors, so as to preserve the original action-semantic information; the spatial stream introduces a centrality masking strategy based on the skeleton node graph to mask part of the skeleton nodes; and the temporal stream introduces a frame-attention masking strategy based on action-change amplitude to mask part of the time-sequence frames. By exploring node correlation and frame correlation, the two strategies further promote the model's use of space-time redundant information, so that the model can capture the space-time redundant information in skeleton data and the extraction of skeleton features is promoted.
The embodiment also provides a skeleton self-supervision model based on partial space-time data, as shown in fig. 3, which includes:
an acquisition module 301, configured to acquire the skeleton time-sequence data. Specifically, skeleton space-time data refers to the skeleton data generated when a human body moves in space and time. For acquisition, a Kinect (a 3D somatosensory camera released by Microsoft Corporation) sensor or another 3D somatosensory camera can be used to capture depth information of the human-body image, thereby obtaining the distance between the object and the camera; combining this with the color data provided by a color camera gives a more accurate description of the object; finally, a skeleton-point estimation algorithm estimates the 3D coordinate information and confidence of each point of the human body, thereby obtaining the skeleton time-sequence data. As shown in fig. 2 (a), the acquired skeleton time-sequence data is denoted as s.
The enhancement module 302 is configured to perform data enhancement on the skeleton time-sequence data to obtain data of three original input streams, wherein the data of the original input streams comprise: anchor input stream data, spatial stream data, and temporal stream data. Specifically, as shown in fig. 2 (a), for the acquired skeleton time-sequence data s, the same data-enhancement method τ is first used to obtain the original inputs of the three streams, denoted as x', x, and x̃, wherein the middle stream is the anchor input stream, the upper stream is the spatial stream, and the lower stream is the temporal stream.
The anchor input stream module 303 is configured to directly input the anchor input stream data to an encoder to obtain skeleton features of complete semantics, and to input the skeleton features of complete semantics to a multi-layer perceptron to map them to a first high-dimensional space vector. Specifically, the anchor input stream is the original skeleton data, which is first data-enhanced and then input directly into the encoder, without passing through any space-time mask module, to obtain complete skeleton feature data carrying all the action semantics. As shown in fig. 2 (a), the anchor input stream data x is directly input to the encoder f to obtain the skeleton features h of complete semantics, which are then mapped through the multi-layer perceptron g to a vector z in a high-dimensional space, kept for later use.
The spatial stream module 304 performs CSM processing on the spatial stream data by using a CSM spatial mask strategy, inputs the CSM-processed spatial stream data to the encoder to extract the feature vector corresponding to the spatial stream data, inputs that feature vector to the multi-layer perceptron to map it to a second high-dimensional space vector, calculates a first cross-correlation matrix using the second and first high-dimensional space vectors, and calculates a first loss function using the first cross-correlation matrix. Specifically, a centrality masking strategy based on the skeleton node graph (CSM) is introduced for the spatial stream data: the spatial stream data x' passes through the CSM spatial mask module; the CSM-processed spatial stream data x'_c is passed through the encoder f to extract the feature vector h'_c, which is mapped through the multi-layer perceptron g to the vector z'_c in the high-dimensional space; a cross-correlation matrix is calculated between z'_c and the previously kept high-dimensional embedding z of the anchor input stream, and the first loss function is calculated using this cross-correlation matrix.
In an alternative embodiment, performing CSM processing on the spatial stream data using a CSM spatial masking strategy includes: for each skeleton node in the spatial stream data, acquiring the degree d of the skeleton node, where the degree of a skeleton node refers to the number of skeleton nodes adjacent to the current skeleton node; and calculating a mask probability p_i for each skeleton node, where p_i is designed so that the encoder can recover masked skeleton node information from the information of the adjacent skeleton nodes, and the calculation formula of p_i is as follows:
Where p_i is the mask probability of the i-th node and d_i is the degree of the i-th node. Specifically, as shown in fig. 2 (b), the skeleton data processed by the CSM module differs from the original human skeleton node graph in how information is exchanged between a node and its adjacent nodes: a node with a larger degree has more adjacent nodes, so more nodes can interact with it and its information is easier to reconstruct. For example, some nodes have 4 or 3 neighbours, others have only 2, and edge nodes have as few as one neighbour. In the CSM module, the higher a node's degree, the higher its probability of being masked; by calculating a mask probability for each node in this way, the encoder f is helped to recover the masked node's information from its neighbours' information. The masked skeleton node data x'_c is passed to the encoder f to extract the feature vector h'_c, which is then mapped by the multi-layer perceptron g to the vector z'_c in the high-dimensional space; finally a cross-correlation matrix is calculated between z'_c and the embedded vector z of the anchor input stream, and the loss function drives this cross-correlation matrix towards the identity matrix, i.e. its diagonal elements towards 1 and its off-diagonal elements towards 0. In an alternative embodiment, the first loss function is calculated as:
$$\mathcal{L}_1 = \sum_i \left(1 - \mathcal{C}_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2$$
Where $\mathcal{L}_1$ represents the first loss function; λ represents a balance hyper-parameter, whose main function is to balance the influence of the diagonal elements and the off-diagonal elements of the cross-correlation matrix on the loss, and λ = 0.0002 may be used in practice; $\mathcal{C}$ represents the cross-correlation matrix calculated between the two vectors; and i, j index positions in the cross-correlation matrix. $\mathcal{C}_{ij}$ is calculated as:
$$\mathcal{C}_{ij} = \frac{\sum_b z_{b,i}\, z'_{b,j}}{\sqrt{\sum_b \left(z_{b,i}\right)^2}\,\sqrt{\sum_b \left(z'_{b,j}\right)^2}}$$
Where $z_{b,i}$ and $z'_{b,j}$ denote components of the first and second high-dimensional space vectors respectively, and b indexes the batch. $z_{b,i}$ and $z'_{b,j}$ correspond to fig. 2 (a) and refer to the vectors z and z'_c in the high-dimensional space to which x and x' are mapped; the batch size b refers to the number of input samples in each iteration during the experiments, and b = 128 may be used.
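The cross-correlation matrix between the anchor-stream and spatial-stream embeddings, and the loss that drives it towards the identity (diagonal → 1, off-diagonal → 0, balanced by λ = 0.0002), can be sketched as follows. The per-column normalisation over the batch is an assumption consistent with the surrounding description, and the embedding dimension of 16 is purely illustrative.

```python
import numpy as np

def cross_correlation(z, z_prime):
    """C_ij = sum_b z[b,i] z'[b,j], normalised by the batch norms of columns i and j."""
    num = z.T @ z_prime
    den = np.outer(np.linalg.norm(z, axis=0), np.linalg.norm(z_prime, axis=0))
    return num / den

def redundancy_loss(c, lam=0.0002):
    """Drive C towards the identity matrix: diagonal terms -> 1, off-diagonal -> 0."""
    on_diag = np.sum((1.0 - np.diag(c)) ** 2)
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.standard_normal((128, 16))      # anchor-stream embeddings, batch size b = 128
c_self = cross_correlation(z, z)        # two identical views: C has an exact unit diagonal
print(round(float(redundancy_loss(c_self)), 6))
```

With two identical views the diagonal term of the loss vanishes and only the small λ-weighted off-diagonal term remains; with independently masked views, minimising this loss aligns the embedding dimensions without any negative samples.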
The time stream module 305 performs MATM processing on the time stream data using a MATM temporal mask strategy, inputs the MATM-processed time stream data to the encoder to extract the feature vector corresponding to the time stream data, inputs that feature vector to the multi-layer perceptron to obtain a third high-dimensional space vector, calculates a second cross-correlation matrix from the third and first high-dimensional space vectors, and calculates a second loss function from the second cross-correlation matrix. Specifically, a frame attention mechanism mask strategy based on motion variation amplitude (MATM) is introduced for the time stream data, so that the time stream data passes through the MATM temporal mask module.
In an alternative embodiment, performing MATM processing on the time stream data using a MATM temporal mask strategy includes: for each action frame in the time stream data, calculating the variation amplitude of the skeleton nodes in that action frame; judging, from the variation amplitude of the skeleton nodes, whether the action frame is a key frame; and masking the key frames. Specifically, as shown in fig. 2 (c), for a human motion, the larger the rate of change of the skeleton nodes in a segment, the larger the variation amplitude of the motion at that moment, and the more likely that segment is a key frame of the motion. By masking the key frames, MATM forces the model not to rely on the key frames when recognizing actions but to decide the action semantics from the remaining redundant frames, which helps the model infer complete action information from a partial temporal sequence. The formula for calculating the attention of each frame is:
Where a_t represents the attention of frame t, t is the frame number, and m is the variation amplitude of each frame. The variation amplitude of the current frame can be calculated from the positional difference between nodes in consecutive frames of the skeleton sequence, according to the formula:
$$m_{:,t,:} = x_{:,t+1,:} - x_{:,t,:}$$
Where m_{:,t,:} represents the variation amplitude at frame t; for the subscripts of m, the first position indexes the node, the second indexes frame t, and the third indexes the channel. x_{:,t,:} denotes the skeleton position information of frame t and x_{:,t+1,:} that of frame t+1, with the subscripts of x indexing node, frame, and channel in the same way. The MATM-processed data is likewise mapped, through the encoder f and the multi-layer perceptron g, to a vector in the high-dimensional space; a cross-correlation matrix is then calculated with the retained high-dimensional embedded vector z of the anchor input stream, and the second loss function again forces this cross-correlation matrix towards the identity matrix. The calculation of the second loss function is analogous to that of the first loss function and is not repeated here.
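A sketch of the MATM computation: the frame-to-frame variation amplitude m, a per-frame attention derived from it, and masking of the highest-attention (key) frames. The patent's attention formula appears only as a figure, so the normalised total displacement used for a_t, and the convention that the transition at t masks frame t+1, are assumptions.

```python
import numpy as np

def motion_amplitude(x):
    """m[:, t, :] = x[:, t+1, :] - x[:, t, :], axes ordered (node, frame, channel)."""
    return x[:, 1:, :] - x[:, :-1, :]

def frame_attention(m):
    # Assumed attention: total absolute joint displacement per transition,
    # normalised to sum to 1 -- larger motion amplitude means a likelier key frame.
    mag = np.abs(m).sum(axis=(0, 2))
    return mag / mag.sum()

def matm_mask(x, a, num_masked):
    """Zero out the frames following the `num_masked` largest-attention transitions."""
    key = np.argsort(a)[-num_masked:] + 1   # transition t is attributed to frame t+1
    x = x.copy()
    x[:, key, :] = 0.0
    return x, key

# A toy sequence of 5 nodes x 8 frames x 3 channels where all of the motion
# happens at the transition from frame 3 to frame 4.
x = np.zeros((5, 8, 3))
x[:, 4:, :] = 1.0
a = frame_attention(motion_amplitude(x))
x_m, key = matm_mask(x, a, num_masked=1)
print(a.round(3), key)
```

All of the attention mass lands on the single moving transition, so the masked "key frame" is frame 4, exactly the frame carrying the motion.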
And an updating module 306, configured to iteratively update the parameters of the skeleton self-supervision model according to the first loss function and the second loss function. Specifically, the model parameters are updated through the loss function, thereby realizing self-supervised learning of the model.
In an alternative embodiment, iteratively updating the parameters of the skeleton self-supervision model according to the first and second loss functions includes: adding the first loss function and the second loss function to obtain a final loss function value; reducing the final loss function value by gradient descent; and iteratively updating the parameters of the skeleton self-supervision model according to the final loss function value. Specifically, the final loss is the sum of the spatial-stream loss and the temporal-stream loss, and the loss value is reduced by gradient descent so that the model parameters are updated iteratively. The total loss function is:
$$\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2$$
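The update itself is ordinary gradient descent on the summed loss. A toy numeric sketch, with stand-in quadratic terms playing the role of the spatial-stream and temporal-stream losses (they are not the real cross-correlation losses):

```python
def total_loss(theta):
    # Stand-ins for the first (spatial-stream) and second (temporal-stream) losses.
    loss_1 = (theta - 2.0) ** 2
    loss_2 = (theta + 1.0) ** 2
    return loss_1 + loss_2          # final loss = sum of the two stream losses

theta, lr, eps = 0.0, 0.1, 1e-6
for _ in range(200):
    # Central-difference numerical gradient of the summed loss.
    grad = (total_loss(theta + eps) - total_loss(theta - eps)) / (2 * eps)
    theta -= lr * grad              # one gradient-descent update of the parameter
print(round(theta, 3))              # settles at 0.5, the minimiser of the sum
```

In the real model an automatic-differentiation optimiser would play the role of the numerical gradient, but the principle, descending on the sum of the two stream losses, is the same.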
The skeleton self-supervision model based on partial space-time data provides a negative-sample-free three-stream skeleton self-supervision framework. After the same data enhancement, the data are fed into three streams: the anchor input stream extracts feature vectors directly with the model, preserving the original action semantic information; the spatial stream introduces a degree-centrality masking strategy based on the skeleton node graph to mask part of the skeleton nodes; and the temporal stream introduces a frame attention mechanism masking strategy based on motion variation amplitude to mask part of the temporal frames. By exploring node correlation and frame correlation, the two strategies further promote the model's use of spatio-temporal redundancy, so that the model can capture the spatio-temporal redundancy information in the skeleton data and improve the extraction of skeleton features.
Because the invention models the correlation between space-time nodes, it can maintain high accuracy under the downstream linear probe even when part of the node information is lost, which demonstrates that the invention greatly enhances the robustness and adaptability of the extracted features. In addition, the invention significantly improves the performance of downstream classification tasks, including linear probing, fine-tuning, and 5% and 10% semi-supervised testing. The invention was validated on three large-scale skeleton datasets, NTU-RGB+D60, NTU-RGB+D120 and PKU-MMD, and achieves state-of-the-art performance in every test scenario. NTU-RGB+D is a human skeleton behaviour recognition dataset proposed by Nanyang Technological University. The NTU-RGB+D60 dataset includes 56578 skeleton sequences over 60 action classes, each skeleton represented by 25 skeleton nodes. Two split protocols exist for the NTU-RGB+D60 dataset, xsub (cross-subject split) and xview (cross-camera-view split): in the xsub split, the data of one half of the subjects is used for training and the data of the other half for testing; in the xview split, the data captured by cameras 2 and 3 is used for training and the data captured by camera 1 for testing. NTU-RGB+D120 is an extended version of NTU-RGB+D60, comprising a total of 113945 skeleton sequences over 120 classes. NTU-RGB+D120 likewise has two split protocols, xsub (cross-subject split) and xset (cross-camera-setting split): in the xsub split, the data of one half of the subjects is used for training and the data of the other half for testing; in the xset split, the data captured by odd-id cameras is used for training and the data captured by even-id cameras for testing.
The PKU-MMD dataset includes a total of about twenty thousand skeleton sequences over 51 action categories. PKU-MMD has two partitions, called Part1 (first part) and Part2 (second part): Part1 contains 21539 sequences and Part2 contains 6940 sequences, and Part2 is more difficult because of the presence of noise. On the downstream linear probe, the invention improves by 3.0, 2.1, 2.8, 4.3, 5.0 and 12.5 percentage points under the six splits respectively; on downstream fine-tuning, the invention improves by 1.5, 2.8, 1.4 and 2.9 percentage points under the four splits of NTU-RGB+D60 and NTU-RGB+D120 respectively; on downstream 5% semi-supervised classification, the invention improves by 5.0 and 1.8 percentage points under the two partitions of PKU-MMD respectively; and on downstream 10% semi-supervised classification, the invention improves by 0.8 and 8.6 percentage points under the two partitions of PKU-MMD respectively.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. Further implementations are included within the scope of the preferred embodiments of the present invention, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those reasonably skilled in the art.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives, and variations may be made in the above embodiments by those skilled in the art without departing from the spirit and principles of the invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. The skeleton self-supervision method based on partial space-time data is characterized by comprising the following steps of:
acquiring skeleton time sequence data;
Respectively obtaining data of three original input streams from the skeleton time sequence data through data enhancement, wherein the data of the original input streams comprise: anchor input stream data, spatial stream data, and temporal stream data;
directly inputting the anchor input stream data to an encoder to obtain skeleton features of complete semantics, and inputting the full-semantic skeleton features to a multi-layer perceptron to map to obtain a first high-dimensional space vector;
Performing CSM processing on the spatial stream data by using a CSM spatial mask strategy, inputting the spatial stream data subjected to CSM processing to the encoder to extract feature vectors corresponding to the spatial stream data, inputting the feature vectors corresponding to the spatial stream data to the multi-layer perceptron to map to obtain second high-dimensional spatial vectors, calculating a first cross-correlation matrix by using the second high-dimensional spatial vectors and the first high-dimensional spatial vectors, and calculating a first loss function by using the first cross-correlation matrix;
performing MATM processing on the time stream data by using a MATM temporal mask strategy, inputting the MATM-processed time stream data into the encoder to extract a feature vector corresponding to the time stream data, inputting the feature vector corresponding to the time stream data into the multi-layer perceptron to map to obtain a third high-dimensional space vector, calculating a second cross-correlation matrix by using the third high-dimensional space vector and the first high-dimensional space vector, and calculating a second loss function by using the second cross-correlation matrix;
iteratively updating parameters of the skeleton self-supervision model according to the first loss function and the second loss function;
wherein the CSM spatial mask strategy is a degree-centrality masking strategy based on the skeleton node graph, introduced for the spatial stream data;
the MATM temporal mask strategy is a frame attention mechanism masking strategy based on motion variation amplitude, introduced for the time stream data;
the anchor input stream refers to the original skeleton data, which is first subjected to data enhancement and then directly input into the encoder, without any space-time mask module, to obtain complete skeleton feature data with all action semantics.
2. The method of claim 1, wherein CSM processing the spatial stream data using a CSM spatial mask strategy comprises:
Acquiring the degree d of each skeleton node aiming at each skeleton node in the spatial stream data, wherein the degree of the skeleton node refers to the number of the skeleton nodes adjacent to the current skeleton node;
calculating a mask probability p_i for each of the skeleton nodes, wherein p_i is used for enabling the encoder to obtain masked skeleton node information from adjacent skeleton node information, and the calculation formula of p_i is as follows:
wherein p_i is the mask probability of the i-th node and d_i is the degree of the i-th node.
3. The partial spatiotemporal data based skeleton self-supervision method of claim 1, wherein the performing MATM processing on the time stream data using the MATM temporal mask strategy comprises:
Calculating the variation amplitude of skeleton nodes in the action frames aiming at each action frame in the time stream data;
judging whether the action frame is a key frame or not according to the variation amplitude of the skeleton node;
Masking the keyframes.
4. The method for skeleton self-supervision based on partial spatiotemporal data according to claim 1, wherein the first loss function has a calculation formula:
$$\mathcal{L}_1 = \sum_i \left(1 - \mathcal{C}_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2$$
wherein $\mathcal{L}_1$ represents the first loss function, λ represents a balance hyper-parameter, $\mathcal{C}$ represents the cross-correlation matrix calculated between the two vectors, i and j represent positions in the cross-correlation matrix, and $\mathcal{C}_{ij}$ is calculated as:
$$\mathcal{C}_{ij} = \frac{\sum_b z_{b,i}\, z'_{b,j}}{\sqrt{\sum_b \left(z_{b,i}\right)^2}\,\sqrt{\sum_b \left(z'_{b,j}\right)^2}}$$
wherein $z_{b,i}$ and $z'_{b,j}$ represent components of the first and second high-dimensional space vectors respectively, and b represents the batch size.
5. The method of partial spatiotemporal data based skeleton self supervision according to any one of claims 1 to 4, wherein iteratively updating parameters of the skeleton self supervision model according to the first and second loss functions includes:
Adding the first loss function and the second loss function to calculate a final loss function value;
Reducing the final loss function value by a gradient descent method;
And iteratively updating parameters of the skeleton self-supervision model according to the final loss function value.
6. A skeleton self-supervision model based on partial spatiotemporal data, comprising:
the acquisition module is used for acquiring the skeleton time sequence data;
The enhancement module is used for respectively obtaining data of three original input streams from the skeleton time sequence data through data enhancement, and the data of the original input streams comprise: anchor input stream data, spatial stream data, and temporal stream data;
The anchor input stream module is used for directly inputting the anchor input stream data to the encoder to obtain skeleton features of complete semantics, and inputting the full-semantic skeleton features to the multi-layer perceptron to map to obtain a first high-dimensional space vector;
The space flow module performs CSM processing on the space flow data by using a CSM space mask strategy, inputs the space flow data subjected to CSM processing to the encoder to extract feature vectors corresponding to the space flow data, inputs the feature vectors corresponding to the space flow data to the multi-layer perceptron to map to obtain a second high-dimensional space vector, calculates a first cross-correlation matrix by using the second high-dimensional space vector and the first high-dimensional space vector, and calculates a first loss function by using the first cross-correlation matrix;
The time stream module is used for performing MATM processing on the time stream data by using a MATM temporal mask strategy, inputting the MATM-processed time stream data into the encoder to extract a feature vector corresponding to the time stream data, inputting the feature vector corresponding to the time stream data into the multi-layer perceptron to map to obtain a third high-dimensional space vector, calculating a second cross-correlation matrix by using the third high-dimensional space vector and the first high-dimensional space vector, and calculating a second loss function by using the second cross-correlation matrix;
And the updating module is used for iteratively updating parameters of the skeleton self-supervision model according to the first loss function and the second loss function.
7. The partially spatiotemporal data based skeletal self-supervision model of claim 6, wherein the CSM processing the spatial stream data using a CSM spatial mask strategy comprises:
Acquiring the degree d of each skeleton node aiming at each skeleton node in the spatial stream data, wherein the degree of the skeleton node refers to the number of the skeleton nodes adjacent to the current skeleton node;
calculating a mask probability p_i for each of the skeleton nodes, wherein p_i is used for enabling the encoder to obtain masked skeleton node information from adjacent skeleton node information, and the calculation formula of p_i is as follows:
wherein p_i is the mask probability of the i-th node and d_i is the degree of the i-th node.
8. The partially spatiotemporal data based skeletal self-supervision model of claim 6, wherein the performing MATM processing on the time stream data using the MATM temporal mask strategy comprises:
Calculating the variation amplitude of skeleton nodes in the action frames aiming at each action frame in the time stream data;
judging whether the action frame is a key frame or not according to the variation amplitude of the skeleton node;
Masking the keyframes.
9. The partially spatiotemporal data based skeletal self-supervision model of claim 6, wherein the first loss function is calculated as:
$$\mathcal{L}_1 = \sum_i \left(1 - \mathcal{C}_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2$$
wherein $\mathcal{L}_1$ represents the first loss function, λ represents a balance hyper-parameter, $\mathcal{C}$ represents the cross-correlation matrix calculated between the two vectors, i and j represent positions in the cross-correlation matrix, and $\mathcal{C}_{ij}$ is calculated as:
$$\mathcal{C}_{ij} = \frac{\sum_b z_{b,i}\, z'_{b,j}}{\sqrt{\sum_b \left(z_{b,i}\right)^2}\,\sqrt{\sum_b \left(z'_{b,j}\right)^2}}$$
wherein $z_{b,i}$ and $z'_{b,j}$ represent components of the first and second high-dimensional space vectors respectively, and b represents the batch size.
10. The partially spatiotemporal data based skeletal self-supervision model according to any one of claims 6 to 9, wherein the iteratively updating parameters of the skeletal self-supervision model according to the first and second loss functions comprises:
Adding the first loss function and the second loss function to calculate a final loss function value;
Reducing the final loss function value by a gradient descent method;
And iteratively updating parameters of the skeleton self-supervision model according to the final loss function value.
CN202211687076.7A 2022-12-27 2022-12-27 Skeleton self-supervision method and model based on partial space-time data Active CN115965995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211687076.7A CN115965995B (en) 2022-12-27 2022-12-27 Skeleton self-supervision method and model based on partial space-time data


Publications (2)

Publication Number Publication Date
CN115965995A CN115965995A (en) 2023-04-14
CN115965995B true CN115965995B (en) 2024-05-28

Family

ID=87362938


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299535A (en) * 2021-12-09 2022-04-08 河北大学 Feature aggregation human body posture estimation method based on Transformer
CN114663798A (en) * 2022-01-12 2022-06-24 上海人工智能创新中心 Single-step video content identification method based on reinforcement learning
CN115019397A (en) * 2022-06-15 2022-09-06 北京大学深圳研究生院 Comparison self-monitoring human behavior recognition method and system based on temporal-spatial information aggregation
CN115147676A (en) * 2022-06-23 2022-10-04 浙江工商大学 Self-supervision action identification method and device based on hierarchical multi-view angle

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210334544A1 (en) * 2020-04-28 2021-10-28 Leela AI, Inc. Computer Vision Learning System


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Deformation behavior of Al-rich metallic glasses under nanoindentation; A.P. Wang et al.; 《Dairy Science》; 20121231; pp. 5127-5132 *
Research on multi-model fusion action recognition; Tian Man et al.; 《Electronic Measurement Technology》; 20181023 (No. 20); pp. 118-123 *
Application of bone marrow mesenchymal stem cells in the construction of tissue-engineered cartilage; Zhang Ling et al.; 《Chinese Journal of Aesthetic Medicine》; 20090228; pp. 274-277 *

Also Published As

Publication number Publication date
CN115965995A (en) 2023-04-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant