CN112668492B - Behavior recognition method based on self-supervised learning and skeleton information - Google Patents


Info

Publication number
CN112668492B (application CN202011616079.2A)
Authority
CN (China)
Prior art keywords
network, training, video, fine-tuning
Priority date
2020-12-30
Legal status
Active (granted)
Other languages
Chinese (zh)
Other versions
CN112668492A
Inventors
Zhang Dongyu (张冬雨), Cheng Yibin (成奕彬), Lin Liang (林倞)
Assignee (current and original)
Sun Yat-sen University
Filing date
2020-12-30
Publication of application CN112668492A
2021-04-16
Publication of grant CN112668492B
2023-06-20

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a behavior recognition method based on self-supervised learning and skeleton information, and relates to the technical field of computer vision. The method comprises the following steps: S1, constructing a configurable depth model; S2, in a network pre-training stage, obtaining pre-training samples according to a preset optical flow prediction task, each pre-training sample comprising a skeleton video and a machine-generated label for the optical flow prediction task, and training the transformer network with the pre-training samples to obtain its initial parameters θ′; S3, in a network fine-tuning stage, initializing the transformer network with the initial parameters θ′ and constructing a fine-tuning depth model by combining the initialized transformer network with a randomly initialized fine-tuning classification network; S4, inputting the skeleton video to be recognized into the trained fine-tuning depth model, the fine-tuning classification network outputting the classification prediction result. While maintaining high accuracy, the method achieves human behavior recognition with better effectiveness, robustness, and generalization.

Description

Behavior recognition method based on self-supervised learning and skeleton information
Technical Field
The invention relates to the technical field of computer vision, and in particular to a behavior recognition method based on self-supervised learning and skeleton information.
Background
Human behavior recognition is an important and active basic research topic in the field of computer vision. The technique predicts the ongoing behavior in a scene by parsing and classifying images or videos containing human actions. With the development of video acquisition sensors and video surveillance, human behavior recognition has become a research area with wide application in intelligent surveillance, human-computer interaction, intelligent robots, and the like, and has attracted increasing attention from researchers. At present, human behavior recognition is mainly directed at understanding video data.
Current research on human behavior recognition is mainly divided into two types of methods: recognition based on RGB video and recognition based on three-dimensional human skeleton video. Methods based on RGB video are inevitably affected by factors such as illumination changes, background interference, and dynamic environment changes. In addition, deep learning methods rely on a large amount of manually labelled data for model training, and manual annotation is quite expensive.
With the development of video acquisition technology and the appearance of Kinect cameras, the three-dimensional structure of a human body in a scene can be acquired more easily and conveniently. Compared with a traditional camera, the Kinect camera can capture depth information of the scene and 3D coordinate data of key points of the human skeleton. Human skeleton information is more robust to problems such as illumination, scene changes, and occlusion. In addition, it contains high-level visual cues such as human posture and motion that cannot be obtained directly from RGB video. Recognition using 3D human skeleton video data is therefore feasible and efficient.
However, in human behavior recognition based on human skeleton video, how to effectively extract the temporal dependency information between different video frames is a major difficulty. Skeleton data is sparser than RGB images yet carries complex information, and human behavior is closely related to the course of the action over the whole video; if the information hidden in the temporal sequence cannot be exploited effectively, recognition accuracy suffers.
Disclosure of Invention
The invention aims to provide a behavior recognition method based on self-supervised learning and skeleton information, which accurately recognizes the behavior occurring in a video by fully exploiting the temporal dependency relationships of human skeleton video data.
To achieve the above object, an embodiment of the present invention provides a behavior recognition method based on self-supervised learning and skeleton information, comprising the steps of: S1, constructing a configurable depth model, the depth model comprising a transformer network, a pre-training classification network, and a fine-tuning classification network, wherein the transformer network and the pre-training classification network act in a network pre-training stage, and the transformer network and the fine-tuning classification network act in a network fine-tuning stage; S2, in the network pre-training stage, obtaining pre-training samples according to a preset optical flow prediction task, each pre-training sample comprising a skeleton video and a machine-generated label for the optical flow prediction task, and training the transformer network with the pre-training samples to obtain its initial parameters θ′; S3, in the network fine-tuning stage, initializing the transformer network with the initial parameters θ′, and constructing a fine-tuning depth model by combining the initialized transformer network with a randomly initialized fine-tuning classification network; S4, inputting the skeleton video to be recognized into the trained fine-tuning depth model, the fine-tuning classification network outputting the classification prediction result.
Preferably, before training the transformer network with the pre-training samples in the network pre-training stage, the method further comprises: initializing the parameters of the depth model randomly.
Preferably, the skeleton video in a pre-training sample is expressed as X = (x_1, x_2, …, x_N), where x_i is the i-th skeleton frame in video X and N is the total number of frames. In the network pre-training stage, 15% of the frames in the skeleton video are selected for random masking, giving the masked skeleton video X_{\i} = (x_1, …, x_{i−1}, MASK, x_{i+1}, …, x_N), where MASK is a mask frame. The optical-flow motion direction between the mask frame and the next frame is computed as the task label; the motion direction is expressed as Y = {Y_i}, i = 1, 2, …, M, where M is the number of discrete motion directions. The output obtained after the skeleton video in the pre-training sample passes through the depth model is f = Ψ_flow[T_θ(X_{\i})], where T_θ denotes the function of the transformer network and Ψ_flow the function of the pre-training classification network.
Preferably, in the network pre-training stage, the parameters of the depth model are trained using a cross-entropy loss function combined with stochastic gradient descent.
Preferably, in the network fine-tuning stage, the input skeleton video is expressed as X = (x_1, x_2, …, x_N) and requires no masking. The video features obtained after the skeleton video passes through the transformer network of the fine-tuning stage are f′ = T_θ(X); after the fine-tuning classification network, the prediction result of the behavior class is output as p = Ψ_cls(f′), where Ψ_cls denotes the function of the fine-tuning classification network.
Preferably, in the network fine-tuning stage, the depth model of this stage is trained using a cross-entropy loss function combined with stochastic gradient descent.
An embodiment of the invention also provides a behavior recognition system based on self-supervised learning and skeleton information, comprising: a configuration module for constructing a configurable depth model comprising a transformer network, a pre-training classification network, and a fine-tuning classification network, wherein the transformer network and the pre-training classification network act in a network pre-training stage, and the transformer network and the fine-tuning classification network act in a network fine-tuning stage; a pre-training module for obtaining pre-training samples according to a preset optical flow prediction task in the network pre-training stage, each pre-training sample comprising a skeleton video and a machine-generated label for the optical flow prediction task, and training the transformer network with the pre-training samples to obtain its initial parameters θ′; a fine-tuning module for initializing the transformer network with the initial parameters θ′ in the network fine-tuning stage and constructing a fine-tuning depth model by combining the initialized transformer network with a randomly initialized fine-tuning classification network; and an output module for inputting the skeleton video to be recognized into the trained fine-tuning depth model, the fine-tuning classification network outputting the classification prediction result.
An embodiment of the invention also provides a storage device storing a plurality of programs, the programs being adapted to be loaded and executed by a processor to implement the above behavior recognition method based on self-supervised learning and skeleton information.
An embodiment of the invention also provides a control device, comprising: a processor adapted to execute each program; and a storage device adapted to store a plurality of programs; the programs being adapted to be loaded and executed by the processor to implement the above behavior recognition method based on self-supervised learning and skeleton information.
Compared with the prior art, the embodiment of the invention has at least the following beneficial effects:
the invention provides a behavior recognition method based on self-supervision learning and skeleton information, which comprises the following steps: s1, constructing a configurable depth model, wherein the depth model comprises a transformer network, a pre-training classification network and a fine-tuning classification network; wherein the transformer network and the pre-training classification network act in a network pre-training phase, and the transformer network and the fine-tuning classification network act in a network fine-tuning phase; s2, in a network pre-training stage, a pre-training sample is obtained according to a preset optical flow prediction task; the pre-training sample comprises a skeleton video and a label of an optical flow prediction task automatically generated by a machine; training a transformation network by using the pre-training sample to obtain an initial parameter theta' of the transformation network; s3, initializing a converter network according to the initial parameter theta' in a network fine tuning stage, and constructing a fine tuning depth model by combining the initialized converter network and a fine tuning classification network which is randomly initialized; s4, inputting the bone video to be identified into the fine-tuning depth model after training, and outputting a classification prediction result by a fine-tuning classification network.
The invention can automatically construct the samples and pseudo-labels required by the optical flow prediction task, pre-train the depth model in a self-supervised manner, and then fine-tune the model parameters with manually annotated real labels, thereby automatically processing an input human skeleton video and returning the behavior class occurring in the video. By using human skeleton data, the invention overcomes problems of traditional RGB video such as illumination changes, background interference, and dynamic environment changes, enhancing the robustness of the model. The invention builds the depth model around a bidirectional-encoding transformer network and uses it to deeply mine long-range temporal dependencies in the video sequence, remedying the shortcomings of existing methods, which under-use temporal information and cannot capture long-range temporal relationships. By introducing self-supervised learning into the pre-training stage, a large amount of unlabelled data is used to optimize the parameter search space of the deep network, addressing the insufficient performance and weak generalization of deep models on human behavior recognition tasks.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a behavior recognition method based on self-supervised learning and skeleton information according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a depth model according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the step numbers used herein are for convenience of description only and are not limiting as to the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to fig. 1, fig. 1 is a flowchart of a behavior recognition method based on self-supervised learning and skeleton information according to an embodiment of the invention. The behavior recognition method based on self-supervision learning and skeleton information provided by the embodiment comprises the following steps:
s1, constructing a configurable depth model, wherein the depth model comprises a converter network, a pre-training classification network and a fine-tuning classification network; wherein the transformer network and the pre-training classification network act in a network pre-training phase and the transformer network and the fine-tuning classification network act in a network fine-tuning phase.
The transformer network uses bidirectional encoding with a classical multi-head attention mechanism and is responsible for extracting the spatio-temporal features of the human skeleton video. For an input video, the features are obtained after it passes through the transformer network and are then fed to the classification network of the corresponding stage for learning.
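As a concrete illustration, the following is a minimal sketch of such a bidirectional transformer encoder over a skeleton frame sequence. It is not the patent's implementation: the joint count, layer sizes, and the class name SkeletonTransformer are this sketch's own assumptions, written in PyTorch.

```python
# Illustrative sketch only -- sizes and names are assumptions, not the patent's.
import torch
import torch.nn as nn

class SkeletonTransformer(nn.Module):
    """Bidirectional transformer encoder T_theta over a skeleton frame sequence."""
    def __init__(self, num_joints=25, coord_dim=3, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        # Embed each frame's flattened joint coordinates into the model dimension.
        self.embed = nn.Linear(num_joints * coord_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):        # x: (batch, N frames, num_joints * coord_dim)
        h = self.embed(x)        # (batch, N, d_model)
        # Full (unmasked) self-attention gives every frame a bidirectional view.
        return self.encoder(h)   # (batch, N, d_model)
```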
Self-supervised learning, used by the pre-training classification network, is a learning method that requires no manual labels and can exploit a large amount of unlabelled video data for network self-training.
After the transformer network is pre-trained, it has acquired a specific set of parameters. These parameters are kept, the fine-tuning classification network replaces the pre-training classification network, and data with behavior class labels are used to fine-tune the parameters of the newly constructed model.
A configurable depth model framework constructed in accordance with an embodiment of the invention is shown in Fig. 2. The transformer network extracts both the spatial features of the skeleton data and the temporal dependency information in the video. In the pre-training stage, the pre-training classification network takes the video features extracted by the transformer network as input and outputs predictions for the designed self-supervised optical flow prediction task. In the fine-tuning stage, the fine-tuning classification network fine-tunes the parameters of the whole network with labelled data and outputs improved classification results.
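The configurable model of Fig. 2 can then be sketched as one shared encoder with a swappable head. Again a hedged illustration: the `stage` switch and the defaults of 8 motion directions and 60 behavior classes are assumptions of this sketch, not values stated in the patent.

```python
# Hedged sketch of the Fig. 2 structure: shared transformer, two swappable heads.
class ConfigurableDepthModel(nn.Module):
    def __init__(self, d_model=256, n_directions=8, n_classes=60):
        super().__init__()
        self.transformer = SkeletonTransformer(d_model=d_model)  # from sketch above
        self.flow_head = nn.Linear(d_model, n_directions)  # Psi_flow, pre-training
        self.cls_head = nn.Linear(d_model, n_classes)      # Psi_cls, fine-tuning

    def forward(self, x, stage="pretrain"):
        feats = self.transformer(x)                  # (batch, N, d_model)
        if stage == "pretrain":
            return self.flow_head(feats)             # per-frame direction logits
        return self.cls_head(feats.mean(dim=1))      # pooled video-level class logits
```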
S2, in the network pre-training stage, pre-training samples are obtained according to a preset optical flow prediction task; each pre-training sample comprises a skeleton video and a machine-generated label for the optical flow prediction task; the transformer network is trained with the pre-training samples to obtain its initial parameters θ′.
In the network pre-training stage, the model consists of the transformer network and the pre-training classification network; its input is a human skeleton video with masking applied, and its prediction target is the optical-flow motion direction of each masked frame relative to the next frame. In the fine-tuning stage, the model consists of the transformer and the fine-tuning classification network; its input is an unmasked human skeleton video, and its prediction target is the human behavior class. Training samples are prepared for the designed optical flow prediction task and used to train the transformer network; the parameters of every part of the constructed model are learned with forward and back-propagation algorithms.
Before training the transformer network with the pre-training samples, the method further comprises randomly initializing the parameters of the depth model. The optical flow prediction task serves as the learning target of the model constructed at this stage and optimizes its parameter search space. The skeleton video in a pre-training sample is expressed as X = (x_1, x_2, …, x_N), where x_i is the i-th skeleton frame in video X and N is the total number of frames.
During the network pre-training stage, 15% of the frames in the skeleton video are selected for random masking, giving the masked skeleton video X_{\i} = (x_1, …, x_{i−1}, MASK, x_{i+1}, …, x_N), where MASK is a mask frame. The optical-flow motion direction between the mask frame and the next frame is computed as the task label; the motion direction is expressed as Y = {Y_i}, i = 1, 2, …, M, where M is the number of discrete motion directions.
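A sketch of this sample construction follows. The patent does not specify how the optical-flow direction label is computed from skeleton data, so this sketch assumes a simple stand-in: the mean joint displacement between the masked frame and its successor, quantized into M angular bins, with a zeroed frame as the MASK placeholder.

```python
# Hedged sketch of pre-training sample construction; the label computation is
# an assumption (mean joint displacement quantized into M direction bins).
import math
import torch

def make_pretrain_sample(video, M=8, mask_ratio=0.15):
    """video: (N, J*3) skeleton frames -> (masked video, masked indices, labels)."""
    N = video.size(0)
    n_mask = max(1, int(N * mask_ratio))
    idx = torch.randperm(N - 1)[:n_mask]   # skip last frame: labels need a successor
    masked = video.clone()
    masked[idx] = 0.0                      # MASK frame as a zeroed placeholder
    joints = video.view(N, -1, 3)
    disp = (joints[idx + 1] - joints[idx]).mean(dim=1)   # (n_mask, 3) mean motion
    angle = torch.atan2(disp[:, 1], disp[:, 0])          # direction in the x-y plane
    labels = ((angle + math.pi) / (2 * math.pi) * M).long().clamp(max=M - 1)
    return masked, idx, labels
```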
The output obtained after the skeleton video in the pre-training sample passes through the depth model is

f = Ψ_flow[T_θ(X_{\i})];

where T_θ denotes the function of the transformer network and Ψ_flow the function of the pre-training classification network.
In the network pre-training stage, the parameters of the depth model are trained using a cross-entropy loss function combined with stochastic gradient descent. The loss function used in this process is defined as:

L_flow = −E_{X∼D, i∼{1,…,N}}[Y_i log(f)].
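Putting the pieces together, one pre-training step might look like the sketch below, reusing the illustrative classes above; the learning rate and momentum are placeholders, not values from the patent.

```python
# Hedged pre-training step: cross-entropy over the M directions at the masked
# frames, optimized with stochastic gradient descent.
import torch.optim as optim

model = ConfigurableDepthModel()
opt = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # placeholder hyperparameters
criterion = nn.CrossEntropyLoss()

def pretrain_step(masked_video, mask_idx, labels):
    # masked_video: (N, J*3); logits computed per frame, supervised only where masked.
    logits = model(masked_video.unsqueeze(0), stage="pretrain")[0]  # (N, M)
    loss = criterion(logits[mask_idx], labels)                      # L_flow
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```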
S3, in the network fine-tuning stage, the transformer network is initialized with the initial parameters θ′, and a fine-tuning depth model is constructed by combining the initialized transformer network with a randomly initialized fine-tuning classification network. The transformer network is initialized with the parameters learned in step S2, the fine-tuning classification network replaces the pre-training classification network, and the parameters of every part of the constructed model are further learned with forward and back-propagation algorithms.
At this stage the input video X = (x_1, x_2, …, x_N) does not need to be masked. Let t denote the behavior class label, t ∈ {1, 2, …, C}, where C is the number of classes.
Let Ψ_cls denote the fine-tuning classification network. It replaces the pre-training classification network and, together with the transformer network, forms the depth model of the fine-tuning stage; the transformer network is initialized with the parameters θ′ obtained in step S2, and the parameters of the fine-tuning classification network are randomly initialized. The video features obtained after the input video passes through the transformer network of this stage can be expressed as

f′ = T_θ(X);

and after the fine-tuning classification network, the prediction result of the behavior class is output:

p = Ψ_cls(f′).
In the network fine-tuning stage, the depth model of this stage is trained using a cross-entropy loss function combined with stochastic gradient descent. The loss function used in this process is defined as: L_cls = −E_{X∼D}[t log(p)].
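A corresponding fine-tuning step, continuing the same sketch: the transformer keeps its pre-trained parameters θ′ (loaded here from a hypothetical checkpoint file), the randomly initialized classification head replaces the pre-training head, and the whole model is trained with the cross-entropy loss above.

```python
# Hedged fine-tuning step; "theta_prime.pt" is a hypothetical checkpoint name.
model.transformer.load_state_dict(torch.load("theta_prime.pt"))
opt = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # placeholder hyperparameters

def finetune_step(video, target):
    # video: (N, J*3) unmasked skeleton frames; target: tensor holding the class index t
    logits = model(video.unsqueeze(0), stage="finetune")        # (1, C) class logits
    loss = nn.functional.cross_entropy(logits, target.view(1))  # L_cls
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```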
After the parameters of every part of the constructed depth model have been learned through step S3, the depth model is re-initialized with these parameters, and human behavior recognition can then formally begin.
S4, the skeleton video to be recognized is input into the trained fine-tuning depth model, and the fine-tuning classification network outputs the classification prediction result. The trained depth model computes the features of the input human skeleton video to obtain prediction scores for all classes. The classes are then sorted by prediction score from high to low, and the ranked results are returned.
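In the same illustrative terms, inference reduces to one forward pass followed by the sorting step described above:

```python
# Hedged inference sketch: score all classes and return them ranked high to low.
@torch.no_grad()
def recognize(video):
    scores = model(video.unsqueeze(0), stage="finetune").softmax(dim=-1).squeeze(0)
    ranked = scores.argsort(descending=True)
    return [(int(c), float(scores[c])) for c in ranked]
```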
The embodiment of the invention also provides a behavior recognition system based on self-supervised learning and skeleton information; parts of the system identical to the above embodiment are not repeated here. The system comprises: a configuration module for constructing a configurable depth model comprising a transformer network, a pre-training classification network, and a fine-tuning classification network, wherein the transformer network and the pre-training classification network act in the network pre-training stage, and the transformer network and the fine-tuning classification network act in the network fine-tuning stage; a pre-training module for obtaining pre-training samples according to a preset optical flow prediction task in the network pre-training stage, each pre-training sample comprising a skeleton video and a machine-generated label for the optical flow prediction task, and training the transformer network with the pre-training samples to obtain its initial parameters θ′; a fine-tuning module for initializing the transformer network with the initial parameters θ′ in the network fine-tuning stage and constructing a fine-tuning depth model by combining the initialized transformer network with a randomly initialized fine-tuning classification network; and an output module for inputting the skeleton video to be recognized into the trained fine-tuning depth model, the fine-tuning classification network outputting the classification prediction result.
Based on the above method embodiment, the invention also provides a storage device in which a plurality of programs can be stored, the programs being adapted to be loaded by a processor to execute the behavior recognition method based on self-supervised learning and skeleton information.
Based on the above method embodiment, the invention also provides a control device, which can comprise a processor and a storage device; the processor is adapted to execute each program; the storage device is adapted to store a plurality of programs; the programs are adapted to be loaded by the processor to execute the behavior recognition method based on self-supervised learning and skeleton information as described above.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention; such changes and modifications are also intended to be within the scope of the invention.

Claims (7)

1. A behavior recognition method based on self-supervised learning and skeleton information, characterized by comprising the following steps:
S1, constructing a configurable depth model, the depth model comprising a transformer network, a pre-training classification network, and a fine-tuning classification network, wherein the transformer network and the pre-training classification network act in a network pre-training stage, and the transformer network and the fine-tuning classification network act in a network fine-tuning stage;
S2, in the network pre-training stage, obtaining pre-training samples according to a preset optical flow prediction task, each pre-training sample comprising a skeleton video and a machine-generated label for the optical flow prediction task, and training the transformer network with the pre-training samples to obtain its initial parameters θ′;
S3, in the network fine-tuning stage, initializing the transformer network with the initial parameters θ′, and constructing a fine-tuning depth model by combining the initialized transformer network with a randomly initialized fine-tuning classification network;
S4, inputting the skeleton video to be recognized into the trained fine-tuning depth model, the fine-tuning classification network outputting the classification prediction result;
wherein the skeleton video in a pre-training sample is expressed as X = (x_1, x_2, …, x_N), where x_i is the i-th skeleton frame in video X and N is the total number of frames;
in the network pre-training stage, 15% of the frames in the skeleton video are selected for random masking, giving the masked skeleton video:
X_{\i} = (x_1, …, x_{i−1}, MASK, x_{i+1}, …, x_N);
wherein MASK is a mask frame;
the optical-flow motion direction between the mask frame and the next frame is computed as the task label, the motion direction being expressed as Y = {Y_i}, i = 1, 2, …, M, where M is the number of discrete motion directions;
the output obtained after the skeleton video in the pre-training sample passes through the depth model is f = Ψ_flow[T_θ(X_{\i})];
wherein T_θ denotes the function of the transformer network and Ψ_flow the function of the pre-training classification network;
in the network fine-tuning stage, the input skeleton video is expressed as X = (x_1, x_2, …, x_N) and requires no masking;
the video features obtained after the skeleton video passes through the transformer network of the fine-tuning stage are f′ = T_θ(X);
after the fine-tuning classification network, the prediction result of the behavior class is output as p = Ψ_cls(f′);
wherein Ψ_cls denotes the function of the fine-tuning classification network.
2. The behavior recognition method based on self-supervised learning and skeleton information of claim 1, further comprising, before training the transformer network with the pre-training samples in the network pre-training stage: initializing the parameters of the depth model randomly.
3. The behavior recognition method based on self-supervised learning and skeleton information of claim 1, wherein in the network pre-training stage the parameters of the depth model are trained using a cross-entropy loss function combined with stochastic gradient descent.
4. The behavior recognition method based on self-supervised learning and skeleton information of claim 1, wherein in the network fine-tuning stage the depth model of this stage is trained using a cross-entropy loss function combined with stochastic gradient descent.
5. A behavior recognition system based on self-supervised learning and skeleton information, comprising:
a configuration module for constructing a configurable depth model comprising a transformer network, a pre-training classification network, and a fine-tuning classification network, wherein the transformer network and the pre-training classification network act in a network pre-training stage, and the transformer network and the fine-tuning classification network act in a network fine-tuning stage;
a pre-training module for obtaining pre-training samples according to a preset optical flow prediction task in the network pre-training stage, each pre-training sample comprising a skeleton video and a machine-generated label for the optical flow prediction task, and training the transformer network with the pre-training samples to obtain its initial parameters θ′;
a fine-tuning module for initializing the transformer network with the initial parameters θ′ in the network fine-tuning stage and constructing a fine-tuning depth model by combining the initialized transformer network with a randomly initialized fine-tuning classification network;
and an output module for inputting the skeleton video to be recognized into the trained fine-tuning depth model, the fine-tuning classification network outputting the classification prediction result;
wherein the skeleton video in a pre-training sample is expressed as X = (x_1, x_2, …, x_N), where x_i is the i-th skeleton frame in video X and N is the total number of frames;
in the network pre-training stage, 15% of the frames in the skeleton video are selected for random masking, giving the masked skeleton video:
X_{\i} = (x_1, …, x_{i−1}, MASK, x_{i+1}, …, x_N);
wherein MASK is a mask frame;
the optical-flow motion direction between the mask frame and the next frame is computed as the task label, the motion direction being expressed as Y = {Y_i}, i = 1, 2, …, M, where M is the number of discrete motion directions;
the output obtained after the skeleton video in the pre-training sample passes through the depth model is f = Ψ_flow[T_θ(X_{\i})];
wherein T_θ denotes the function of the transformer network and Ψ_flow the function of the pre-training classification network;
in the network fine-tuning stage, the input skeleton video is expressed as X = (x_1, x_2, …, x_N) and requires no masking;
the video features obtained after the skeleton video passes through the transformer network of the fine-tuning stage are f′ = T_θ(X);
after the fine-tuning classification network, the prediction result of the behavior class is output as p = Ψ_cls(f′);
wherein Ψ_cls denotes the function of the fine-tuning classification network.
6. A storage device storing a plurality of programs, characterized in that the programs are adapted to be loaded and executed by a processor to implement the behavior recognition method based on self-supervised learning and skeleton information of any of claims 1-4.
7. A control apparatus comprising:
a processor adapted to execute each program;
a storage device adapted to store a plurality of programs;
characterized in that the programs are adapted to be loaded and executed by the processor to implement the behavior recognition method based on self-supervised learning and skeleton information of any of claims 1-4.
CN202011616079.2A 2020-12-30 2020-12-30 Behavior recognition method based on self-supervised learning and skeleton information Active CN112668492B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011616079.2A | 2020-12-30 | 2020-12-30 | Behavior recognition method based on self-supervised learning and skeleton information

Publications (2)

Publication Number | Publication Date
CN112668492A (en) | 2021-04-16
CN112668492B (en) | 2023-06-20

Family

ID=75411373

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011616079.2A (CN112668492B, Active) | Behavior recognition method based on self-supervised learning and skeleton information | 2020-12-30 | 2020-12-30

Country Status (1)

Country Link
CN (1) CN112668492B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113723341B (en) * | 2021-09-08 | 2023-09-01 | Beijing Youzhuju Network Technology Co., Ltd. | Video identification method and device, readable medium and electronic equipment
CN114612685B (en) * | 2022-03-22 | 2022-12-23 | Aerospace Information Research Institute, Chinese Academy of Sciences | Self-supervision information extraction method combining depth features and contrast learning
CN114764899B (en) * | 2022-04-12 | 2024-03-22 | South China University of Technology | Method for predicting next interaction object based on transformation first view angle
CN116453220B (en) * | 2023-04-19 | 2024-05-10 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Target object posture determining method, training device and electronic equipment
CN117274656B (en) * | 2023-06-06 | 2024-04-05 | Tianjin University | Multi-mode model countermeasure training method based on self-adaptive depth supervision module

Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102609686A * | 2012-01-19 | 2012-07-25 | Ningbo University | Pedestrian detection method
CN109215080A * | 2018-09-25 | 2019-01-15 | Tsinghua University | 6D attitude estimation network training method and device based on deep learning iterative matching
CN111191630A * | 2020-01-07 | 2020-05-22 | Communication University of China | Performance action identification method suitable for intelligent interactive viewing scene
CN111914778A * | 2020-08-07 | 2020-11-10 | Chongqing University | Video behavior positioning method based on weakly supervised learning
CN111967433A * | 2020-08-31 | 2020-11-20 | Chongqing University of Science and Technology | Action identification method based on self-supervised learning network
CN112070027A * | 2020-09-09 | 2020-12-11 | Tencent Technology (Shenzhen) Co., Ltd. | Network training and action recognition method, device, equipment and storage medium
CN112102237A * | 2020-08-10 | 2020-12-18 | Tsinghua University | Brain tumor recognition model training method and device based on semi-supervised learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Human action recognition based on skeleton information from the Kinect sensor; Zhu Guogang; Cao Lin; Computer Simulation, No. 12; pp. 1-4 *
Research on deep convolutional neural networks for Caltech-101 image classification; Duan Jian; Zhai Huimin; Computer Applications and Software, No. 12; pp. 1-3 *
A survey of deep-learning-based object tracking algorithms; Li Xi; Zha Yufei; Zhang Tianzhu; Cui Zhen; Zuo Wangmeng; Hou Zhiqiang; Lu Huchuan; Wang Hanzi; Journal of Image and Graphics, No. 12; pp. 1-5 *

Also Published As

Publication number | Publication date
CN112668492A (en) | 2021-04-16

Similar Documents

Publication Publication Date Title
CN112668492B (en) Behavior recognition method based on self-supervised learning and skeleton information
Chen et al. Repetitive assembly action recognition based on object detection and pose estimation
KR102220174B1 (en) Learning-data enhancement device for machine learning model and method for learning-data enhancement
JP7016522B2 (en) Machine vision with dimensional data reduction
CN111046821B (en) Video behavior recognition method and system and electronic equipment
CN110147797A (en) A kind of sketch completion and recognition method and device based on a generative adversarial network
CN110705412A (en) Video target detection method based on motion history image
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN111709295A (en) SSD-MobileNet-based real-time gesture detection and recognition method and system
CN108921032B (en) Novel video semantic extraction method based on deep learning model
Wang et al. Multiscale deep alternative neural network for large-scale video classification
John et al. A comparative study of various object detection algorithms and performance analysis
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
Dai et al. Tan: Temporal aggregation network for dense multi-label action recognition
Jin et al. Cvt-assd: convolutional vision-transformer based attentive single shot multibox detector
CN114359892A (en) Three-dimensional target detection method and device and computer readable storage medium
Wang et al. Summary of object detection based on convolutional neural network
CN111651038A (en) Gesture recognition control method based on ToF and control system thereof
CN110929632A (en) Complex scene-oriented vehicle target detection method and device
CN113887373B (en) Attitude identification method and system based on urban intelligent sports parallel fusion network
Li et al. Complete video-level representations for action recognition
CN115565146A (en) Perception model training method and system for acquiring aerial view characteristics based on self-encoder
CN115311544A (en) Underwater fish target detection method and device
Oufqir et al. Deep Learning for the Improvement of Object Detection in Augmented Reality
CN115063724A (en) Fruit tree ridge identification method and electronic equipment

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant