Disclosure of Invention
The invention aims to provide a motion recognition method and a motion recognition device, which solve the problem that current recognition of action categories from bone sequences is inaccurate.
According to an aspect of the present invention, there is provided a motion recognition method including:
acquiring skeleton data of an object to be identified in a video during motion;
generating a bone sequence of the object to be identified according to the bone data;
generating a bone feature image corresponding to the bone sequence, wherein the bone feature image comprises a plurality of bone points;
and inputting the bone characteristic image into a preset convolutional neural network model for classification to obtain an action category corresponding to the bone characteristic image.
Further, generating a bone feature image corresponding to the bone sequence, comprising:
arranging three-dimensional point coordinates of a skeleton sequence in each frame of image in a video into three-channel data according to a preset sequence;
arranging the three-channel data into a three-channel matrix according to a time sequence;
and carrying out normalization processing on the three-channel matrix to obtain a bone characteristic image.
Further, normalizing the three-channel matrix to obtain a bone feature image includes:
performing the normalization as follows:
I^c(i,j) = round(255 · (M^c(i,j) − min^c) / max_{c'}(max^{c'} − min^{c'}))
wherein I^c(i,j) is the pixel value at position coordinate (i, j) on the c-th channel of the bone feature image, M^c(i,j) is the corresponding element of the three-channel matrix, min^c and max^c are respectively the minimum value and the maximum value of the pixels on the c-th channel, and round(·) is a rounding function.
Further, inputting the bone feature image into a preset convolutional neural network model for classification to obtain an action category corresponding to the bone feature image, including:
extracting the characteristics of the bone characteristic image by using a preset convolutional neural network model;
converting the features into feature vectors using a full-connectivity layer;
and determining the type of the bone feature image according to the feature vector, wherein the type is the action category of the object to be identified.
According to another aspect of the present invention, there is disclosed a motion recognition apparatus comprising:
the acquisition module is used for acquiring bone data of an object to be identified in the video during motion;
the generating module is used for generating a bone sequence of the object to be identified according to the bone data;
the characteristic image module is used for generating a bone characteristic image corresponding to the bone sequence, and the bone characteristic image comprises a plurality of bone points;
and the determining module is used for inputting the bone characteristic image into a preset convolutional neural network model for classification to obtain an action category corresponding to the bone characteristic image.
Further, the feature image module includes:
the first sequencing submodule is used for arranging three-dimensional point coordinates of a skeleton sequence in each frame of image in a video into three-channel data according to a preset sequence;
the second sequencing submodule is used for arranging the three-channel data into a three-channel matrix according to a time sequence;
and the normalization submodule is used for performing normalization processing on the three-channel matrix to obtain a bone characteristic image.
Further, the normalization sub-module is configured to perform the normalization as follows:
I^c(i,j) = round(255 · (M^c(i,j) − min^c) / max_{c'}(max^{c'} − min^{c'}))
wherein I^c(i,j) is the pixel value at position coordinate (i, j) on the c-th channel of the bone feature image, M^c(i,j) is the corresponding element of the three-channel matrix, min^c and max^c are respectively the minimum value and the maximum value of the pixels on the c-th channel, and round(·) is a rounding function.
Further, the determining module comprises:
the extraction submodule is used for extracting the characteristics of the bone characteristic image by utilizing a preset convolutional neural network model;
a conversion submodule for converting the features into feature vectors using a full connection layer;
and the determining submodule is used for determining the type of the bone feature image according to the feature vector, wherein the type is the action category of the object to be identified.
Compared with the closest prior art, the technical scheme has the beneficial effects that:
the technical scheme provided by the invention is that the bone data of an object to be identified in a video during motion is obtained, then the bone data is converted into a bone sequence of the object to be identified, the bone sequence is converted into a bone characteristic image, all bone points in the bone characteristic image are sequenced by using a preset replacement network, and finally the sequenced bone characteristic image is classified by using a convolutional neural network to obtain the action characteristic corresponding to the bone characteristic image. The invention converts the problem of motion recognition into the problem of bone sequence image classification, converts the bone sequence into the bone characteristic image, and then classifies the bone characteristic image, so that the recognition is more accurate and the efficiency is higher.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, the present invention provides an action recognition method, which comprises the following steps:
s101, obtaining skeleton data of an object to be identified in a video during motion;
s102, generating a bone sequence of the object to be identified according to the bone data;
s103, generating a bone characteristic image corresponding to the bone sequence, wherein the bone characteristic image comprises a plurality of bone points;
and S104, inputting the bone feature image into a preset convolutional neural network model for classification to obtain an action category corresponding to the bone feature image.
In the embodiment of the application, bone data of an object to be identified in a video during motion is acquired, the bone data is converted into a bone sequence of the object to be identified, the bone sequence is converted into a bone feature image, and finally the bone feature image is classified by a convolutional neural network to obtain the action category corresponding to the bone feature image. The invention converts the problem of motion recognition into the problem of bone-sequence image classification: the bone sequence is converted into a bone feature image, and the bone feature image is then classified, so that the recognition is more accurate and the efficiency is higher.
In some embodiments of the present application, given a bone sequence v of T frames, the coordinates of the k-th bone point in the t-th frame are expressed as J_k = (x_k, y_k, z_k), where t ∈ {1, 2, …, T}, k ∈ {1, 2, …, N}, and N represents the number of skeletal points in a frame. The skeletal data of the t-th frame is denoted S_t = {J_1, J_2, …, J_N}. Generating a bone feature image corresponding to the bone sequence mainly comprises three steps:
firstly, arranging three-dimensional point coordinates of a skeleton sequence in each frame of image in a video into three-channel data according to a preset sequence;
step two, arranging the three-channel data into a three-channel matrix according to a time sequence;
and thirdly, carrying out normalization processing on the three-channel matrix to obtain a bone characteristic image.
In step one, the coordinates (x, y, z) in three dimensions are treated as three channels. Taking the x dimension as an example, the x coordinates in S_t are arranged in a predefined order O = (o_1, o_2, …, o_k, …, o_K) into an x-channel feature vector f_t^x = (x_{o_1}, x_{o_2}, …, x_{o_K}); the y- and z-channel feature vectors f_t^y and f_t^z are obtained in the same way, giving the three-channel feature f_t of frame t. The arrangement order O determines the proximity of skeletal points in the image. In step two, the three-channel features f_t of all frames are arranged in time order into a three-channel matrix M; taking the x channel as an example, M^x = [f_1^x, f_2^x, …, f_T^x]. The size of M is 3 × T × K, where T is the length of the video sequence and K is the length of the arrangement order O.
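Steps one and two above can be sketched in Python as follows. This is a minimal illustration only; the array layout (a NumPy array of shape (T, N, 3)) and the function name `build_matrix` are assumptions, since the patent does not prescribe an implementation:

```python
import numpy as np

def build_matrix(skeleton, order):
    """Arrange a skeleton sequence into a three-channel matrix M.

    skeleton: array of shape (T, N, 3); skeleton[t, k] holds the
              (x, y, z) coordinates of bone point k in frame t.
    order:    the predefined arrangement order O, a sequence of K joint indices.
    Returns M of shape (3, T, K): channel c, frame t, ordered joint position.
    """
    T = skeleton.shape[0]
    K = len(order)
    M = np.empty((3, T, K), dtype=np.float64)
    for c in range(3):                     # x, y, z -> three channels
        for t in range(T):
            # feature vector f_t^c: coordinates of dimension c in order O
            M[c, t, :] = skeleton[t, order, c]
    return M

# tiny example: 2 frames, 3 bone points
skel = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
M = build_matrix(skel, order=[2, 0, 1])
print(M.shape)  # (3, 2, 3)
```

Each row of a channel of M is one frame's feature vector, so time runs along one image axis and the joint order O along the other, as described above.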
In step three, M is normalized and quantized to an RGB image I as follows:
I^c(i,j) = round(255 · (M^c(i,j) − min^c) / max_{c'}(max^{c'} − min^{c'}))
wherein I^c(i,j) is the pixel value at position coordinate (i, j) on the c-th channel of image I, M^c(i,j) is the corresponding element of M, min^c and max^c are respectively the minimum value and the maximum value of the pixels on the c-th channel, and round(·) is a rounding function. The normalization subtracts the per-channel minimum from each matrix element and divides by the largest value range among the three channels; the values are then quantized to the interval [0, 255].
In some embodiments of the present application, inputting a bone feature image into a preset convolutional neural network model for classification, and obtaining an action category corresponding to the bone feature image, includes:
extracting the characteristics of the bone characteristic image by using a preset convolutional neural network model;
converting the features into feature vectors using a full-connectivity layer;
and determining the type of the bone feature image according to the feature vector, wherein the type is the action category of the object to be identified.
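The classification stage above can be sketched as follows. The convolutional feature extractor is stubbed out with a placeholder feature vector, and the weights and category names are illustrative assumptions, since the patent does not fix a particular network architecture:

```python
import numpy as np

def classify(features, W, b, labels):
    """Map extracted features to an action category.

    features: flattened feature vector produced by the CNN backbone (stubbed here).
    W, b:     weights and bias of the fully connected layer.
    labels:   one action-category name per output unit.
    """
    logits = W @ features + b              # fully connected layer
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()                # softmax over action categories
    return labels[int(np.argmax(probs))]

# illustrative placeholder values
rng = np.random.default_rng(0)
features = rng.standard_normal(8)          # stand-in for CNN-extracted features
W = rng.standard_normal((3, 8))
b = np.zeros(3)
action = classify(features, W, b, ["wave", "walk", "jump"])
```

The returned label is the type of the bone feature image, i.e. the action category of the object to be identified.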
The invention also provides a motion recognition device based on the same inventive concept, which comprises:
the acquisition module is used for acquiring bone data of an object to be identified in the video during motion;
the generating module is used for generating a bone sequence of the object to be identified according to the bone data;
the characteristic image module is used for generating a bone characteristic image corresponding to the bone sequence, and the bone characteristic image comprises a plurality of bone points;
and the determining module is used for inputting the bone characteristic image into a preset convolutional neural network model for classification to obtain an action category corresponding to the bone characteristic image.
Optionally, the feature image module includes:
the first sequencing submodule is used for arranging three-dimensional point coordinates of a skeleton sequence in each frame of image in a video into three-channel data according to a preset sequence;
the second sequencing submodule is used for arranging the three-channel data into a three-channel matrix according to a time sequence;
and the normalization submodule is used for performing normalization processing on the three-channel matrix to obtain a bone characteristic image.
Optionally, the normalization sub-module is configured to perform the normalization as follows:
I^c(i,j) = round(255 · (M^c(i,j) − min^c) / max_{c'}(max^{c'} − min^{c'}))
wherein I^c(i,j) is the pixel value at position coordinate (i, j) on the c-th channel of the bone feature image, M^c(i,j) is the corresponding element of the three-channel matrix, min^c and max^c are respectively the minimum value and the maximum value of the pixels on the c-th channel, and round(·) is a rounding function.
Optionally, the determining module includes:
the extraction submodule is used for extracting the characteristics of the bone characteristic image by utilizing a preset convolutional neural network model;
a conversion submodule for converting the features into feature vectors using a full connection layer;
and the determining submodule is used for determining the type of the bone feature image according to the feature vector, wherein the type is the action category of the object to be identified.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.