CN112668492B - Behavior recognition method based on self-supervised learning and skeleton information - Google Patents


Info

Publication number
CN112668492B (application CN202011616079.2A)
Authority
CN (China)
Prior art keywords
network, training, video, fine-tuning
Priority date
2020-12-30
Legal status
Active (granted)
Other languages
Chinese (zh)
Other versions
CN112668492A
Inventors
Zhang Dongyu (张冬雨), Cheng Yibin (成奕彬), Lin Liang (林倞)
Assignee (current and original)
Sun Yat-sen University
Filing date
2020-12-30
Publication of application CN112668492A
2021-04-16
Publication of grant CN112668492B
2023-06-20

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a behavior recognition method based on self-supervised learning and skeleton information, and relates to the technical field of computer vision. The method comprises the following steps: S1, constructing a configurable depth model; S2, in a network pre-training stage, obtaining pre-training samples according to a preset optical flow prediction task, each pre-training sample comprising a skeleton video and a machine-generated label for the optical flow prediction task, and training the transformer network with the pre-training samples to obtain its initial parameters θ′; S3, in a network fine-tuning stage, initializing the transformer network with the initial parameters θ′ and constructing a fine-tuning depth model by combining the initialized transformer network with a randomly initialized fine-tuning classification network; S4, inputting the skeleton video to be recognized into the trained fine-tuning depth model, the fine-tuning classification network outputting the classification prediction result. While maintaining high accuracy, the method achieves human behavior recognition with better effectiveness, robustness, and generalization.

Description

Behavior recognition method based on self-supervised learning and skeleton information
Technical Field
The invention relates to the technical field of computer vision, and in particular to a behavior recognition method based on self-supervised learning and skeleton information.
Background
Human behavior recognition is an important and active basic research topic in the field of computer vision. The technique predicts the ongoing behavior in a scene by parsing and classifying images or videos containing human actions. With the development of video acquisition sensors and video surveillance, human behavior recognition has become a research area with wide application in intelligent surveillance, human-computer interaction, intelligent robots, and the like, and has attracted increasing attention from researchers. At present, human behavior recognition is mainly directed at understanding video data.
Current research on human behavior recognition is mainly divided into two types of methods: recognition based on RGB video and recognition based on three-dimensional human skeleton video. Methods based on RGB video are inevitably affected by factors such as illumination changes, background interference, and dynamic environment changes. In addition, deep learning methods rely on a large amount of manually labelled data for model training, and manual annotation is quite expensive.
With the development of video acquisition technology and the appearance of Kinect cameras, the three-dimensional structure of a human body in a scene can be acquired more easily and conveniently. Compared with a traditional camera, the Kinect camera can capture depth information of the scene and 3D coordinate data of key points of the human skeleton. Human skeleton information is more robust to problems such as illumination, scene changes, and occlusion. In addition, it contains high-level visual cues such as human posture and motion that cannot be obtained directly from RGB video. Recognition using 3D human skeleton video data is therefore feasible and efficient.
However, in human behavior recognition based on human skeleton video, how to effectively extract the temporal dependency information between different video frames is a major difficulty. Skeleton data is sparser than RGB images yet carries complex information, and human behavior is closely related to the course of the action over the whole video; if the information hidden in the temporal sequence cannot be exploited effectively, recognition accuracy suffers.
Disclosure of Invention
The invention aims to provide a behavior recognition method based on self-supervised learning and skeleton information, which accurately recognizes the behavior occurring in a video by fully exploiting the temporal dependency relationships of human skeleton video data.
To achieve the above object, an embodiment of the present invention provides a behavior recognition method based on self-supervised learning and skeleton information, comprising the steps of: S1, constructing a configurable depth model, the depth model comprising a transformer network, a pre-training classification network, and a fine-tuning classification network, wherein the transformer network and the pre-training classification network act in a network pre-training stage, and the transformer network and the fine-tuning classification network act in a network fine-tuning stage; S2, in the network pre-training stage, obtaining pre-training samples according to a preset optical flow prediction task, each pre-training sample comprising a skeleton video and a machine-generated label for the optical flow prediction task, and training the transformer network with the pre-training samples to obtain its initial parameters θ′; S3, in the network fine-tuning stage, initializing the transformer network with the initial parameters θ′, and constructing a fine-tuning depth model by combining the initialized transformer network with a randomly initialized fine-tuning classification network; S4, inputting the skeleton video to be recognized into the trained fine-tuning depth model, the fine-tuning classification network outputting the classification prediction result.
Preferably, before training the transformer network with the pre-training samples in the network pre-training stage, the method further comprises: initializing the parameters of the depth model randomly.
Preferably, the skeleton video in a pre-training sample is expressed as X = (x_1, x_2, …, x_N), where x_i is the i-th skeleton frame in video X and N is the total number of frames. In the network pre-training stage, 15% of the frames in the skeleton video are selected for random masking, giving the masked skeleton video X_{\i} = (x_1, …, x_{i−1}, MASK, x_{i+1}, …, x_N), where MASK is a mask frame. The optical-flow motion direction between the mask frame and the next frame is computed as the task label; the motion direction is expressed as Y = {Y_i}, i = 1, 2, …, M, where M is the number of discrete motion directions. The output obtained after the skeleton video in the pre-training sample passes through the depth model is f = Ψ_flow[T_θ(X_{\i})], where T_θ denotes the function of the transformer network and Ψ_flow the function of the pre-training classification network.
Preferably, in the network pre-training stage, the parameters of the depth model are trained using a cross-entropy loss function combined with stochastic gradient descent.
Preferably, in the network fine-tuning stage, the input skeleton video is expressed as X = (x_1, x_2, …, x_N) and requires no masking. The video features obtained after the skeleton video passes through the transformer network of the fine-tuning stage are f′ = T_θ(X); after the fine-tuning classification network, the prediction result of the behavior class is output as p = Ψ_cls(f′), where Ψ_cls denotes the function of the fine-tuning classification network.
Preferably, in the network fine-tuning stage, the depth model of this stage is trained using a cross-entropy loss function combined with stochastic gradient descent.
An embodiment of the invention also provides a behavior recognition system based on self-supervised learning and skeleton information, comprising: a configuration module for constructing a configurable depth model comprising a transformer network, a pre-training classification network, and a fine-tuning classification network, wherein the transformer network and the pre-training classification network act in a network pre-training stage, and the transformer network and the fine-tuning classification network act in a network fine-tuning stage; a pre-training module for obtaining pre-training samples according to a preset optical flow prediction task in the network pre-training stage, each pre-training sample comprising a skeleton video and a machine-generated label for the optical flow prediction task, and training the transformer network with the pre-training samples to obtain its initial parameters θ′; a fine-tuning module for initializing the transformer network with the initial parameters θ′ in the network fine-tuning stage and constructing a fine-tuning depth model by combining the initialized transformer network with a randomly initialized fine-tuning classification network; and an output module for inputting the skeleton video to be recognized into the trained fine-tuning depth model, the fine-tuning classification network outputting the classification prediction result.
An embodiment of the invention also provides a storage device storing a plurality of programs, the programs being adapted to be loaded and executed by a processor to implement the above behavior recognition method based on self-supervised learning and skeleton information.
An embodiment of the invention also provides a control device, comprising: a processor adapted to execute each program; and a storage device adapted to store a plurality of programs; the programs being adapted to be loaded and executed by the processor to implement the above behavior recognition method based on self-supervised learning and skeleton information.
Compared with the prior art, the embodiment of the invention has at least the following beneficial effects:
the invention provides a behavior recognition method based on self-supervision learning and skeleton information, which comprises the following steps: s1, constructing a configurable depth model, wherein the depth model comprises a transformer network, a pre-training classification network and a fine-tuning classification network; wherein the transformer network and the pre-training classification network act in a network pre-training phase, and the transformer network and the fine-tuning classification network act in a network fine-tuning phase; s2, in a network pre-training stage, a pre-training sample is obtained according to a preset optical flow prediction task; the pre-training sample comprises a skeleton video and a label of an optical flow prediction task automatically generated by a machine; training a transformation network by using the pre-training sample to obtain an initial parameter theta' of the transformation network; s3, initializing a converter network according to the initial parameter theta' in a network fine tuning stage, and constructing a fine tuning depth model by combining the initialized converter network and a fine tuning classification network which is randomly initialized; s4, inputting the bone video to be identified into the fine-tuning depth model after training, and outputting a classification prediction result by a fine-tuning classification network.
The invention can automatically construct the samples and pseudo-labels required by the optical flow prediction task, pre-train the depth model in a self-supervised manner, and then fine-tune the model parameters with manually annotated real labels, thereby automatically processing an input human skeleton video and returning the behavior class occurring in the video. By using human skeleton data, the invention overcomes problems of traditional RGB video such as illumination changes, background interference, and dynamic environment changes, enhancing the robustness of the model. The invention builds the depth model around a bidirectional-encoding transformer network and uses it to deeply mine long-range temporal dependencies in the video sequence, remedying the shortcomings of existing methods, which under-use temporal information and cannot capture long-range temporal relationships. By introducing self-supervised learning into the pre-training stage, a large amount of unlabelled data is used to optimize the parameter search space of the deep network, addressing the insufficient performance and weak generalization of deep models on human behavior recognition tasks.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a behavior recognition method based on self-supervised learning and skeleton information according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a depth model according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the step numbers used herein are for convenience of description only and are not limiting as to the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to fig. 1, fig. 1 is a flowchart of a behavior recognition method based on self-supervised learning and skeleton information according to an embodiment of the invention. The behavior recognition method based on self-supervision learning and skeleton information provided by the embodiment comprises the following steps:
s1, constructing a configurable depth model, wherein the depth model comprises a converter network, a pre-training classification network and a fine-tuning classification network; wherein the transformer network and the pre-training classification network act in a network pre-training phase and the transformer network and the fine-tuning classification network act in a network fine-tuning phase.
The transformer network uses bidirectional encoding with a classical multi-head attention mechanism and is responsible for extracting the spatio-temporal features of the human skeleton video. For an input video, the features are obtained after it passes through the transformer network and are then fed to the classification network of the corresponding stage for learning.
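As a concrete illustration, the following is a minimal sketch of such a bidirectional transformer encoder over a skeleton frame sequence. It is not the patent's implementation: the joint count, layer sizes, and the class name SkeletonTransformer are this sketch's own assumptions, written in PyTorch.

```python
# Illustrative sketch only -- sizes and names are assumptions, not the patent's.
import torch
import torch.nn as nn

class SkeletonTransformer(nn.Module):
    """Bidirectional transformer encoder T_theta over a skeleton frame sequence."""
    def __init__(self, num_joints=25, coord_dim=3, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        # Embed each frame's flattened joint coordinates into the model dimension.
        self.embed = nn.Linear(num_joints * coord_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):        # x: (batch, N frames, num_joints * coord_dim)
        h = self.embed(x)        # (batch, N, d_model)
        # Full (unmasked) self-attention gives every frame a bidirectional view.
        return self.encoder(h)   # (batch, N, d_model)
```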
Self-supervised learning, used by the pre-training classification network, is a learning method that requires no manual labels and can exploit a large amount of unlabelled video data for network self-training.
After the transformer network is pre-trained, it has acquired a specific set of parameters. These parameters are kept, the fine-tuning classification network replaces the pre-training classification network, and data with behavior class labels are used to fine-tune the parameters of the newly constructed model.
A configurable depth model framework constructed in accordance with an embodiment of the invention is shown in Fig. 2. The transformer network extracts both the spatial features of the skeleton data and the temporal dependency information in the video. In the pre-training stage, the pre-training classification network takes the video features extracted by the transformer network as input and outputs predictions for the designed self-supervised optical flow prediction task. In the fine-tuning stage, the fine-tuning classification network fine-tunes the parameters of the whole network with labelled data and outputs improved classification results.
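The configurable model of Fig. 2 can then be sketched as one shared encoder with a swappable head. Again a hedged illustration: the `stage` switch and the defaults of 8 motion directions and 60 behavior classes are assumptions of this sketch, not values stated in the patent.

```python
# Hedged sketch of the Fig. 2 structure: shared transformer, two swappable heads.
class ConfigurableDepthModel(nn.Module):
    def __init__(self, d_model=256, n_directions=8, n_classes=60):
        super().__init__()
        self.transformer = SkeletonTransformer(d_model=d_model)  # from sketch above
        self.flow_head = nn.Linear(d_model, n_directions)  # Psi_flow, pre-training
        self.cls_head = nn.Linear(d_model, n_classes)      # Psi_cls, fine-tuning

    def forward(self, x, stage="pretrain"):
        feats = self.transformer(x)                  # (batch, N, d_model)
        if stage == "pretrain":
            return self.flow_head(feats)             # per-frame direction logits
        return self.cls_head(feats.mean(dim=1))      # pooled video-level class logits
```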
S2, in the network pre-training stage, pre-training samples are obtained according to a preset optical flow prediction task; each pre-training sample comprises a skeleton video and a machine-generated label for the optical flow prediction task; the transformer network is trained with the pre-training samples to obtain its initial parameters θ′.
In the network pre-training stage, the model consists of the transformer network and the pre-training classification network; its input is a human skeleton video with masking applied, and its prediction target is the optical-flow motion direction of each masked frame relative to the next frame. In the fine-tuning stage, the model consists of the transformer and the fine-tuning classification network; its input is an unmasked human skeleton video, and its prediction target is the human behavior class. Training samples are prepared for the designed optical flow prediction task and used to train the transformer network; the parameters of every part of the constructed model are learned with forward and back-propagation algorithms.
Before training the transformer network with the pre-training samples, the method further comprises randomly initializing the parameters of the depth model. The optical flow prediction task serves as the learning target of the model constructed at this stage and optimizes its parameter search space. The skeleton video in a pre-training sample is expressed as X = (x_1, x_2, …, x_N), where x_i is the i-th skeleton frame in video X and N is the total number of frames.
During the network pre-training stage, 15% of the frames in the skeleton video are selected for random masking, giving the masked skeleton video X_{\i} = (x_1, …, x_{i−1}, MASK, x_{i+1}, …, x_N), where MASK is a mask frame. The optical-flow motion direction between the mask frame and the next frame is computed as the task label; the motion direction is expressed as Y = {Y_i}, i = 1, 2, …, M, where M is the number of discrete motion directions.
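A sketch of this sample construction follows. The patent does not specify how the optical-flow direction label is computed from skeleton data, so this sketch assumes a simple stand-in: the mean joint displacement between the masked frame and its successor, quantized into M angular bins, with a zeroed frame as the MASK placeholder.

```python
# Hedged sketch of pre-training sample construction; the label computation is
# an assumption (mean joint displacement quantized into M direction bins).
import math
import torch

def make_pretrain_sample(video, M=8, mask_ratio=0.15):
    """video: (N, J*3) skeleton frames -> (masked video, masked indices, labels)."""
    N = video.size(0)
    n_mask = max(1, int(N * mask_ratio))
    idx = torch.randperm(N - 1)[:n_mask]   # skip last frame: labels need a successor
    masked = video.clone()
    masked[idx] = 0.0                      # MASK frame as a zeroed placeholder
    joints = video.view(N, -1, 3)
    disp = (joints[idx + 1] - joints[idx]).mean(dim=1)   # (n_mask, 3) mean motion
    angle = torch.atan2(disp[:, 1], disp[:, 0])          # direction in the x-y plane
    labels = ((angle + math.pi) / (2 * math.pi) * M).long().clamp(max=M - 1)
    return masked, idx, labels
```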
The output obtained after the skeleton video in the pre-training sample passes through the depth model is

f = Ψ_flow[T_θ(X_{\i})];

where T_θ denotes the function of the transformer network and Ψ_flow the function of the pre-training classification network.
In the network pre-training stage, the parameters of the depth model are trained using a cross-entropy loss function combined with stochastic gradient descent. The loss function used in this process is defined as:

L_flow = −E_{X∼D, i∼{1,…,N}}[Y_i log(f)].
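Putting the pieces together, one pre-training step might look like the sketch below, reusing the illustrative classes above; the learning rate and momentum are placeholders, not values from the patent.

```python
# Hedged pre-training step: cross-entropy over the M directions at the masked
# frames, optimized with stochastic gradient descent.
import torch.optim as optim

model = ConfigurableDepthModel()
opt = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # placeholder hyperparameters
criterion = nn.CrossEntropyLoss()

def pretrain_step(masked_video, mask_idx, labels):
    # masked_video: (N, J*3); logits computed per frame, supervised only where masked.
    logits = model(masked_video.unsqueeze(0), stage="pretrain")[0]  # (N, M)
    loss = criterion(logits[mask_idx], labels)                      # L_flow
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```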
S3, in the network fine-tuning stage, the transformer network is initialized with the initial parameters θ′, and a fine-tuning depth model is constructed by combining the initialized transformer network with a randomly initialized fine-tuning classification network. The transformer network is initialized with the parameters learned in step S2, the fine-tuning classification network replaces the pre-training classification network, and the parameters of every part of the constructed model are further learned with forward and back-propagation algorithms.
At this stage the input video X = (x_1, x_2, …, x_N) does not need to be masked. Let t denote the behavior class label, t ∈ {1, 2, …, C}, where C is the number of classes.
Let Ψ_cls denote the fine-tuning classification network. It replaces the pre-training classification network and, together with the transformer network, forms the depth model of the fine-tuning stage; the transformer network is initialized with the parameters θ′ obtained in step S2, and the parameters of the fine-tuning classification network are randomly initialized. The video features obtained after the input video passes through the transformer network of this stage can be expressed as

f′ = T_θ(X);

and after the fine-tuning classification network, the prediction result of the behavior class is output:

p = Ψ_cls(f′).
In the network fine-tuning stage, the depth model of this stage is trained using a cross-entropy loss function combined with stochastic gradient descent. The loss function used in this process is defined as: L_cls = −E_{X∼D}[t log(p)].
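A corresponding fine-tuning step, continuing the same sketch: the transformer keeps its pre-trained parameters θ′ (loaded here from a hypothetical checkpoint file), the randomly initialized classification head replaces the pre-training head, and the whole model is trained with the cross-entropy loss above.

```python
# Hedged fine-tuning step; "theta_prime.pt" is a hypothetical checkpoint name.
model.transformer.load_state_dict(torch.load("theta_prime.pt"))
opt = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # placeholder hyperparameters

def finetune_step(video, target):
    # video: (N, J*3) unmasked skeleton frames; target: tensor holding the class index t
    logits = model(video.unsqueeze(0), stage="finetune")        # (1, C) class logits
    loss = nn.functional.cross_entropy(logits, target.view(1))  # L_cls
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```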
After the parameters of every part of the constructed depth model have been learned through step S3, the depth model is re-initialized with these parameters, and human behavior recognition can then formally begin.
S4, the skeleton video to be recognized is input into the trained fine-tuning depth model, and the fine-tuning classification network outputs the classification prediction result. The trained depth model computes the features of the input human skeleton video to obtain prediction scores for all classes. The classes are then sorted by prediction score from high to low, and the ranked results are returned.
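In the same illustrative terms, inference reduces to one forward pass followed by the sorting step described above:

```python
# Hedged inference sketch: score all classes and return them ranked high to low.
@torch.no_grad()
def recognize(video):
    scores = model(video.unsqueeze(0), stage="finetune").softmax(dim=-1).squeeze(0)
    ranked = scores.argsort(descending=True)
    return [(int(c), float(scores[c])) for c in ranked]
```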
The embodiment of the invention also provides a behavior recognition system based on self-supervised learning and skeleton information; parts of the system identical to the above embodiment are not repeated here. The system comprises: a configuration module for constructing a configurable depth model comprising a transformer network, a pre-training classification network, and a fine-tuning classification network, wherein the transformer network and the pre-training classification network act in the network pre-training stage, and the transformer network and the fine-tuning classification network act in the network fine-tuning stage; a pre-training module for obtaining pre-training samples according to a preset optical flow prediction task in the network pre-training stage, each pre-training sample comprising a skeleton video and a machine-generated label for the optical flow prediction task, and training the transformer network with the pre-training samples to obtain its initial parameters θ′; a fine-tuning module for initializing the transformer network with the initial parameters θ′ in the network fine-tuning stage and constructing a fine-tuning depth model by combining the initialized transformer network with a randomly initialized fine-tuning classification network; and an output module for inputting the skeleton video to be recognized into the trained fine-tuning depth model, the fine-tuning classification network outputting the classification prediction result.
Based on the above method embodiment, the invention also provides a storage device in which a plurality of programs can be stored, the programs being adapted to be loaded by a processor to execute the behavior recognition method based on self-supervised learning and skeleton information.
Based on the above method embodiment, the invention also provides a control device, which can comprise a processor and a storage device; the processor is adapted to execute each program; the storage device is adapted to store a plurality of programs; the programs are adapted to be loaded by the processor to execute the behavior recognition method based on self-supervised learning and skeleton information as described above.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention; such changes and modifications are also intended to be within the scope of the invention.

Claims (7)

1. A behavior recognition method based on self-supervised learning and skeleton information, characterized by comprising the following steps:
S1, constructing a configurable depth model, the depth model comprising a transformer network, a pre-training classification network, and a fine-tuning classification network, wherein the transformer network and the pre-training classification network act in a network pre-training stage, and the transformer network and the fine-tuning classification network act in a network fine-tuning stage;
S2, in the network pre-training stage, obtaining pre-training samples according to a preset optical flow prediction task, each pre-training sample comprising a skeleton video and a machine-generated label for the optical flow prediction task, and training the transformer network with the pre-training samples to obtain its initial parameters θ′;
S3, in the network fine-tuning stage, initializing the transformer network with the initial parameters θ′, and constructing a fine-tuning depth model by combining the initialized transformer network with a randomly initialized fine-tuning classification network;
S4, inputting the skeleton video to be recognized into the trained fine-tuning depth model, the fine-tuning classification network outputting the classification prediction result;
wherein the skeleton video in a pre-training sample is expressed as X = (x_1, x_2, …, x_N), where x_i is the i-th skeleton frame in video X and N is the total number of frames;
in the network pre-training stage, 15% of the frames in the skeleton video are selected for random masking, giving the masked skeleton video:
X_{\i} = (x_1, …, x_{i−1}, MASK, x_{i+1}, …, x_N);
wherein MASK is a mask frame;
the optical-flow motion direction between the mask frame and the next frame is computed as the task label, the motion direction being expressed as Y = {Y_i}, i = 1, 2, …, M, where M is the number of discrete motion directions;
the output obtained after the skeleton video in the pre-training sample passes through the depth model is f = Ψ_flow[T_θ(X_{\i})];
wherein T_θ denotes the function of the transformer network and Ψ_flow the function of the pre-training classification network;
in the network fine-tuning stage, the input skeleton video is expressed as X = (x_1, x_2, …, x_N) and requires no masking;
the video features obtained after the skeleton video passes through the transformer network of the fine-tuning stage are f′ = T_θ(X);
after the fine-tuning classification network, the prediction result of the behavior class is output as p = Ψ_cls(f′);
wherein Ψ_cls denotes the function of the fine-tuning classification network.
2. The behavior recognition method based on self-supervised learning and skeleton information of claim 1, further comprising, before training the transformer network with the pre-training samples in the network pre-training stage: initializing the parameters of the depth model randomly.
3. The behavior recognition method based on self-supervised learning and skeleton information of claim 1, wherein in the network pre-training stage the parameters of the depth model are trained using a cross-entropy loss function combined with stochastic gradient descent.
4. The behavior recognition method based on self-supervised learning and skeleton information of claim 1, wherein in the network fine-tuning stage the depth model of this stage is trained using a cross-entropy loss function combined with stochastic gradient descent.
5. A behavior recognition system based on self-supervised learning and skeleton information, comprising:
a configuration module for constructing a configurable depth model comprising a transformer network, a pre-training classification network, and a fine-tuning classification network, wherein the transformer network and the pre-training classification network act in a network pre-training stage, and the transformer network and the fine-tuning classification network act in a network fine-tuning stage;
a pre-training module for obtaining pre-training samples according to a preset optical flow prediction task in the network pre-training stage, each pre-training sample comprising a skeleton video and a machine-generated label for the optical flow prediction task, and training the transformer network with the pre-training samples to obtain its initial parameters θ′;
a fine-tuning module for initializing the transformer network with the initial parameters θ′ in the network fine-tuning stage and constructing a fine-tuning depth model by combining the initialized transformer network with a randomly initialized fine-tuning classification network;
and an output module for inputting the skeleton video to be recognized into the trained fine-tuning depth model, the fine-tuning classification network outputting the classification prediction result;
wherein the skeleton video in a pre-training sample is expressed as X = (x_1, x_2, …, x_N), where x_i is the i-th skeleton frame in video X and N is the total number of frames;
in the network pre-training stage, 15% of the frames in the skeleton video are selected for random masking, giving the masked skeleton video:
X_{\i} = (x_1, …, x_{i−1}, MASK, x_{i+1}, …, x_N);
wherein MASK is a mask frame;
the optical-flow motion direction between the mask frame and the next frame is computed as the task label, the motion direction being expressed as Y = {Y_i}, i = 1, 2, …, M, where M is the number of discrete motion directions;
the output obtained after the skeleton video in the pre-training sample passes through the depth model is f = Ψ_flow[T_θ(X_{\i})];
wherein T_θ denotes the function of the transformer network and Ψ_flow the function of the pre-training classification network;
in the network fine-tuning stage, the input skeleton video is expressed as X = (x_1, x_2, …, x_N) and requires no masking;
the video features obtained after the skeleton video passes through the transformer network of the fine-tuning stage are f′ = T_θ(X);
after the fine-tuning classification network, the prediction result of the behavior class is output as p = Ψ_cls(f′);
wherein Ψ_cls denotes the function of the fine-tuning classification network.
6. A storage device storing a plurality of programs, characterized in that the programs are adapted to be loaded and executed by a processor to implement the behavior recognition method based on self-supervised learning and skeleton information of any of claims 1-4.
7. A control apparatus comprising:
a processor adapted to execute each program;
a storage device adapted to store a plurality of programs;
characterized in that the programs are adapted to be loaded and executed by the processor to implement the behavior recognition method based on self-supervised learning and skeleton information of any of claims 1-4.
CN202011616079.2A 2020-12-30 2020-12-30 Behavior recognition method based on self-supervised learning and skeleton information Active CN112668492B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011616079.2A | 2020-12-30 | 2020-12-30 | Behavior recognition method based on self-supervised learning and skeleton information

Publications (2)

Publication Number | Publication Date
CN112668492A (en) | 2021-04-16
CN112668492B (en) | 2023-06-20

Family

ID=75411373

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011616079.2A (CN112668492B, Active) | Behavior recognition method based on self-supervised learning and skeleton information | 2020-12-30 | 2020-12-30

Country Status (1)

Country Link
CN (1) CN112668492B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113723341B (en) * | 2021-09-08 | 2023-09-01 | Beijing Youzhuju Network Technology Co., Ltd. | Video identification method and device, readable medium and electronic equipment
CN114612685B (en) * | 2022-03-22 | 2022-12-23 | Aerospace Information Research Institute, Chinese Academy of Sciences | Self-supervision information extraction method combining depth features and contrast learning
CN114764899B (en) * | 2022-04-12 | 2024-03-22 | South China University of Technology | Method for predicting next interaction object based on transformation first view angle
CN116453220B (en) * | 2023-04-19 | 2024-05-10 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Target object posture determining method, training device and electronic equipment
CN117274656B (en) * | 2023-06-06 | 2024-04-05 | Tianjin University | Multi-mode model countermeasure training method based on self-adaptive depth supervision module

Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102609686A * | 2012-01-19 | 2012-07-25 | Ningbo University | Pedestrian detection method
CN109215080A * | 2018-09-25 | 2019-01-15 | Tsinghua University | 6D attitude estimation network training method and device based on deep learning iterative matching
CN111191630A * | 2020-01-07 | 2020-05-22 | Communication University of China | Performance action identification method suitable for intelligent interactive viewing scene
CN111914778A * | 2020-08-07 | 2020-11-10 | Chongqing University | Video behavior positioning method based on weakly supervised learning
CN111967433A * | 2020-08-31 | 2020-11-20 | Chongqing University of Science and Technology | Action identification method based on self-supervised learning network
CN112070027A * | 2020-09-09 | 2020-12-11 | Tencent Technology (Shenzhen) Co., Ltd. | Network training and action recognition method, device, equipment and storage medium
CN112102237A * | 2020-08-10 | 2020-12-18 | Tsinghua University | Brain tumor recognition model training method and device based on semi-supervised learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Human action recognition based on skeleton information from the Kinect sensor; Zhu Guogang; Cao Lin; Computer Simulation, No. 12; pp. 1-4 *
Research on deep convolutional neural networks for Caltech-101 image classification; Duan Jian; Zhai Huimin; Computer Applications and Software, No. 12; pp. 1-3 *
A survey of deep-learning-based object tracking algorithms; Li Xi; Zha Yufei; Zhang Tianzhu; Cui Zhen; Zuo Wangmeng; Hou Zhiqiang; Lu Huchuan; Wang Hanzi; Journal of Image and Graphics, No. 12; pp. 1-5 *

Also Published As

Publication number | Publication date
CN112668492A (en) | 2021-04-16

Similar Documents

Publication Publication Date Title
CN112668492B (en) Behavior recognition method based on self-supervised learning and skeleton information
Chen et al. Repetitive assembly action recognition based on object detection and pose estimation
KR102220174B1 (en) Learning-data enhancement device for machine learning model and method for learning-data enhancement
JP7016522B2 (en) Machine vision with dimensional data reduction
CN111046821B (en) Video behavior recognition method and system and electronic equipment
CN110147797A (en) A kind of sketch completion and recognition method and device based on a generative adversarial network
CN110705412A (en) Video target detection method based on motion history image
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN111709295A (en) SSD-MobileNet-based real-time gesture detection and recognition method and system
CN108921032B (en) Novel video semantic extraction method based on deep learning model
Wang et al. Multiscale deep alternative neural network for large-scale video classification
John et al. A comparative study of various object detection algorithms and performance analysis
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
Dai et al. Tan: Temporal aggregation network for dense multi-label action recognition
Jin et al. Cvt-assd: convolutional vision-transformer based attentive single shot multibox detector
CN114359892A (en) Three-dimensional target detection method and device and computer readable storage medium
Wang et al. Summary of object detection based on convolutional neural network
CN111651038A (en) Gesture recognition control method based on ToF and control system thereof
CN110929632A (en) Complex scene-oriented vehicle target detection method and device
CN113887373B (en) Attitude identification method and system based on urban intelligent sports parallel fusion network
Li et al. Complete video-level representations for action recognition
CN115565146A (en) Perception model training method and system for acquiring aerial view characteristics based on self-encoder
CN115311544A (en) Underwater fish target detection method and device
Oufqir et al. Deep Learning for the Improvement of Object Detection in Augmented Reality
CN115063724A (en) Fruit tree ridge identification method and electronic equipment

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant