CN109101858B - Action recognition method and device - Google Patents

Action recognition method and device

Info

Publication number
CN109101858B
CN109101858B (application number CN201710470470.8A)
Authority
CN
China
Prior art keywords: data, training, frame, target frame, data information
Prior art date
Legal status: Expired - Fee Related
Application number
CN201710470470.8A
Other languages
Chinese (zh)
Other versions
CN109101858A (en)
Inventor
胡越予
刘家瑛
张昊华
郭宗明
Current Assignee
Peking University
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, and Beijing Founder Electronics Co Ltd
Priority to CN201710470470.8A
Publication of CN109101858A
Application granted
Publication of CN109101858B
Status: Expired - Fee Related
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The action recognition method and device determine a target frame and several consecutive frames preceding it in received video data, and extract the data information of the target frame and of those preceding frames from the video data. A preset number of gain parameters are convolved a preset number of times with the data information of the target frame and of the preceding frames to obtain high-order feature data. The high-order feature data are added to the video data to form data to be extracted, time-sequence features are extracted from these data to obtain a feature vector, and an action recognition result is finally obtained from the feature vector. In this way the high-order features of the video data can be extracted and the accuracy of action recognition is improved.

Description

Action recognition method and device
Technical Field
The present invention relates to computer vision technologies, and in particular, to a method and an apparatus for motion recognition.
Background
With the development of computer vision technology, action recognition based on video acquisition equipment has become a research focus. In existing action recognition methods, data such as joint positions are extracted from a video stream and fed into a three-layer bidirectional long short-term memory (LSTM) recurrent neural network, which extracts the dynamic features of the data. The extracted dynamic features are then fed into a classifier network, which finally outputs the action type corresponding to the video stream.
However, limited by its structure, a three-layer bidirectional LSTM recurrent neural network can only extract the dynamic features of the data over the whole time sequence; it cannot extract the high-order features of the data at a particular moment. For example, to distinguish two actions with similar position data such as "push" and "beat", one must rely on the "acceleration" of the "push" or "beat" at a single moment or over a few consecutive moments. When such a network is used to recognize "push" or "beat", it can only extract the dynamic feature of "average acceleration"; it does not extract the "instantaneous acceleration" at, and immediately around, the moment at which the "push" or "beat" occurs. For action types that must be identified through high-order features of the data, existing action recognition methods therefore cannot achieve accurate recognition.
Disclosure of Invention
The present invention provides a motion recognition method and device to solve the technical problem of low recognition accuracy in the prior art.
In one aspect, the present application provides a method for motion recognition, including:
receiving video data;
determining a target frame and a plurality of continuous frames before the target frame, and extracting data information of the target frame and data information of the plurality of continuous frames before the target frame from the video data;
performing convolution processing for preset times on gain parameters with preset number, data information of the target frame and data information of a plurality of continuous frames before the target frame to obtain high-order characteristic data;
adding the high-order characteristic data into the video data to form data to be extracted;
performing time sequence feature extraction on the data to be extracted to obtain a feature vector;
and acquiring a motion recognition result according to the feature vector.
Further, before the receiving the video data, the method includes:
receiving a training data set, wherein the training data set comprises a plurality of training data and a recognition result corresponding to each training data;
selecting training data from the training data set as data to be trained;
determining a training frame in the data to be trained and a plurality of continuous frames before the training frame, and extracting training data information of the training frame and training data information of the plurality of continuous frames before the training frame from the data to be trained;
performing convolution processing for a preset number of times on a preset number of gain parameters to be trained, the training data information of the training frame and the training data information of the plurality of continuous frames before the training frame, to obtain high-order characteristic training data;
adding the high-order characteristic training data into the data to be trained to form training data to be extracted;
performing time sequence feature extraction on the training data to be extracted to obtain a training feature vector of the training data to be extracted;
obtaining a prediction result according to the training feature vector;
acquiring the cross entropy of the recognition result and the prediction result corresponding to the data to be trained, and judging whether the cross entropy is converged;
if the convergence is achieved, taking the gain parameter to be trained as a gain parameter, and executing the step of receiving the video data;
and if not, correcting the gain parameter to be trained according to the cross entropy, selecting next training data from a training data set as the data to be trained, and returning to the step of determining a training frame in the data to be trained and a plurality of continuous frames before the training frame.
Further, the performing convolution processing on a preset number of gain parameters, the data information of the target frame, and the data information of a plurality of consecutive frames before the target frame for a preset number of times to obtain high-order feature data includes:
generating data to be convolved according to the data information of the target frame and the data information of a plurality of continuous frames before the target frame;
the data to be convolved are convolved with the preset number of gain parameters respectively to obtain a plurality of convolution results;
splicing the convolution results to obtain high-order characteristic data;
correspondingly, the performing convolution processing for preset times on the gain parameters to be trained, the training frame training data information and a plurality of continuous frames of training data information before the training frame, which are preset in number, to obtain high-order characteristic training data includes:
generating training data to be convolved according to the training data information of the training frame and training data information of a plurality of continuous frames before the training frame;
the training data to be convolved are convolved with the preset number of training gain parameters respectively to obtain a plurality of training convolution results;
and splicing the training convolution results to obtain high-order characteristic training data.
Further, the adding the high-order feature data to the video data to form data to be extracted includes:
packing the data information of the target frame and the high-order characteristic data to generate updated data information of the target frame;
replacing the data information of the target frame in the video data with the updated data information of the target frame to obtain the data to be extracted;
correspondingly, the performing time sequence feature extraction on the data to be extracted to obtain a feature vector includes:
respectively extracting the characteristics of the data information of each frame in the data to be extracted to obtain the characteristic data of each frame;
and carrying out mean processing on the feature data of all frames in the data to be extracted to obtain the feature vector of the data to be extracted.
Further, each convolution processing corresponds to one gain parameter, and the number of convolution processing is determined according to the number of the high-order characteristic data.
The present invention also provides a motion recognition apparatus, comprising:
the receiving module is used for receiving video data;
the data extraction module is used for determining a target frame and a plurality of continuous frames before the target frame and extracting data information of the target frame and data information of the plurality of continuous frames before the target frame from the video data;
the high-order characteristic extraction module is used for carrying out convolution processing on a preset number of gain parameters, the data information of the target frame and the data information of a plurality of continuous frames before the target frame for a preset number of times to obtain high-order characteristic data; the high-order characteristic data are also used for being added into the video data to form data to be extracted;
the characteristic vector extraction module is used for extracting time sequence characteristics of the data to be extracted to obtain characteristic vectors;
and the identification result acquisition module is used for acquiring an action identification result according to the characteristic vector.
Further, the receiving module is further configured to receive a training data set before receiving the video data, where the training data set includes a plurality of training data and a recognition result corresponding to each training data;
the data extraction module is also used for selecting training data from the training data set as data to be trained, and is further used for determining a training frame in the data to be trained and a plurality of continuous frames before the training frame, and extracting training data information of the training frame and training data information of the plurality of continuous frames before the training frame from the data to be trained;
the high-order feature extraction module is further used for performing convolution processing on a preset number of gain parameters to be trained, training data information of the training frame and training data information of a plurality of continuous frames before the training frame for a preset number of times to obtain high-order feature training data; adding the high-order characteristic training data into the data to be trained to form training data to be extracted;
the feature vector extraction module is further configured to perform timing feature extraction on the training data to be extracted to obtain a training feature vector of the training data to be extracted;
the identification result acquisition module is further used for acquiring a prediction result according to the training feature vector;
the motion recognition device further includes: a decision module; the judging module is used for obtaining the cross entropy of the recognition result and the prediction result corresponding to the data to be trained and judging whether the cross entropy is converged;
the high-order feature extraction module is further configured to use the gain parameter to be trained as a gain parameter when the determination module determines that the cross entropy converges, and the receiving module executes the step of receiving the video data;
the high-order feature extraction module is further configured to correct the gain parameter to be trained according to the cross entropy when the determination module determines that the cross entropy is not convergent, and the data extraction module is further configured to select next training data from a training data set as data to be trained, and return to the step of determining a training frame in the data to be trained and a plurality of consecutive frames before the training frame.
Further, the high-order feature extraction module is specifically configured to:
generating data to be convolved according to the data information of the target frame and the data information of a plurality of continuous frames before the target frame; the data to be convolved are convolved with the preset number of gain parameters respectively to obtain a plurality of convolution results; splicing the convolution results to obtain high-order characteristic data;
generating training data to be convolved according to the training data information of the training frame and training data information of a plurality of continuous frames before the training frame; the training data to be convolved are convolved with the preset number of training gain parameters respectively to obtain a plurality of training convolution results; and splicing the training convolution results to obtain high-order characteristic training data.
Further, the high-order feature extraction module is specifically configured to: packing the data information of the target frame and the high-order characteristic data to generate updated data information of the target frame; replacing the data information of the target frame in the video data with the updated data information of the target frame to obtain the data to be extracted;
correspondingly, the feature vector extraction module is specifically configured to: respectively extracting the characteristics of the data information of each frame in the data to be extracted to obtain the characteristic data of each frame; and carrying out mean processing on the feature data of all frames in the data to be extracted to obtain the feature vector of the data to be extracted.
Further, each convolution processing corresponds to one gain parameter, and the number of convolution processing is determined according to the number of the high-order characteristic data.
The action recognition method and device determine a target frame and several consecutive frames preceding it in received video data, and extract the data information of the target frame and of those preceding frames from the video data. A preset number of gain parameters are convolved a preset number of times with the data information of the target frame and of the preceding frames to obtain high-order feature data. The high-order feature data are added to the video data to form data to be extracted, time-sequence features are extracted from these data to obtain a feature vector, and an action recognition result is finally obtained from the feature vector. In this way the high-order features of the video data can be extracted and the accuracy of action recognition is improved.
Drawings
Fig. 1 is a schematic flow chart of a motion recognition method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart illustrating a motion recognition method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a motion recognition device according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a motion recognition device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
Fig. 1 is a schematic flow chart of a motion recognition method according to an embodiment of the present invention, as shown in fig. 1, the motion recognition method includes:
step 101, receiving video data.
The execution body of the present invention may specifically be a motion recognition device, whose physical form may be a terminal device composed of hardware such as a processor, a memory, logic circuits and electronic chips.
Specifically, in step 101, video data is received, where the video data includes data information of several frames. The video data may be captured by an acquisition device, imported from another storage medium, or downloaded from a network, which is not limited by the present invention.
Step 102, determining a target frame and a plurality of continuous frames before the target frame, and extracting data information of the target frame and data information of a plurality of continuous frames before the target frame from the video data.
Step 103, performing convolution processing on the preset number of gain parameters, the data information of the target frame and the data information of a plurality of continuous frames before the target frame for a preset number of times to obtain high-order characteristic data.
Each convolution processing may correspond to one gain parameter, and the number of times of convolution processing is determined according to the number of high-order characteristic data.
Specifically, there may be one gain parameter corresponding to each kind of high-order characteristic data, and only one gain parameter is used for each convolution processing. For example, for a high-order feature of "acceleration" to be extracted, a gain parameter may be used to correspond to the high-order feature, so that the high-order feature data obtained after performing convolution processing according to the gain parameter may characterize the high-order feature of "acceleration".
Further, in step 103, first, data to be convolved may be generated according to the data information of the target frame and the data information of consecutive frames before the target frame, then the data to be convolved is convolved with the preset number of gain parameters, respectively, to obtain a plurality of convolution results, and finally the plurality of convolution results are spliced, so as to obtain the high-order feature data.
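As an illustration only, the following minimal NumPy sketch shows one way step 103 could be realized; the window length, feature layout and kernel values are hypothetical assumptions and are not fixed by this description.

```python
import numpy as np

def extract_high_order_features(window_frames, gain_params):
    """Sketch of step 103: convolve the target frame plus the preceding
    consecutive frames with each gain parameter and splice the results.

    window_frames: array of shape (window_len, feature_dim) holding the data
                   information of the preceding frames and the target frame.
    gain_params:   list of 1-D kernels, one per kind of high-order feature.
    """
    results = []
    for kernel in gain_params:                      # one convolution per gain parameter
        conv = np.stack([np.convolve(window_frames[:, d], kernel, mode="valid")
                         for d in range(window_frames.shape[1])], axis=1)
        results.append(conv.reshape(-1))            # flatten this convolution result
    return np.concatenate(results)                  # spliced high-order feature data

# Hypothetical usage: 5 preceding frames + 1 target frame, 25 joints x 3 coordinates.
window = np.random.rand(6, 75)
kernels = [np.array([1.0, -2.0, 1.0]), np.array([1.0, -1.0])]
high_order = extract_high_order_features(window, kernels)
```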
Step 104, adding the high-order characteristic data into the video data to form data to be extracted.
Specifically, the obtained high-order feature data is added to the video data received in step 101 to form data to be extracted for extracting the time-series feature.
Further, in step 104, the data information of the target frame and the high-order feature data may be first packed to generate updated data information of the target frame, and then the data information of the target frame in the video data may be replaced with the updated data information of the target frame to obtain the data to be extracted.
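Continuing the hypothetical NumPy sketch above, step 104 could be expressed as follows; the list-of-arrays layout for the video data is an assumption made only for illustration.

```python
import numpy as np

def pack_into_video_data(video_frames, target_idx, high_order_features):
    """Sketch of step 104: append the high-order feature data to the target
    frame's data information and substitute it back into the sequence.

    video_frames:        list of per-frame feature vectors (one array per frame).
    target_idx:          index of the target frame within the list.
    high_order_features: vector produced by the convolution step above.
    """
    updated_target = np.concatenate([video_frames[target_idx], high_order_features])
    data_to_extract = list(video_frames)        # data information of other frames stays unchanged
    data_to_extract[target_idx] = updated_target
    return data_to_extract
```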
Step 105, performing time sequence feature extraction on the data to be extracted to obtain a feature vector.
Specifically, the time-sequence features of the data to be extracted can be extracted with a three-layer bidirectional long short-term memory (LSTM) recurrent neural network to obtain a feature vector. Because the data to be extracted already include the high-order feature data, the feature vector extracted from them can reflect the high-order feature representation of the video data.
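As a hedged illustration of this step, the sketch below uses a standard three-layer bidirectional LSTM from PyTorch; the hidden size is arbitrary, and it assumes all frames have been brought to a common feature dimension (for example by zero-padding the non-target frames), which the description does not prescribe.

```python
import torch
import torch.nn as nn

class TemporalFeatureExtractor(nn.Module):
    """Sketch of step 105: per-frame feature extraction with a three-layer
    bidirectional LSTM, followed by mean pooling over the frames."""

    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=3,
                            bidirectional=True, batch_first=True)

    def forward(self, x):
        # x: (batch, num_frames, input_dim) -- the data to be extracted
        out, _ = self.lstm(x)                # (batch, num_frames, 2 * hidden_dim)
        return out.mean(dim=1)               # mean over frames gives the feature vector
```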
Step 106, acquiring a motion recognition result according to the feature vector.
Specifically, the feature vector may be classified by using a classifier network algorithm, a plurality of action types and corresponding probabilities matched with the feature vector may be obtained, and the action type with the highest probability may be selected from the plurality of action types as the action recognition result.
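Step 106 can likewise be sketched with an ordinary softmax classifier; the single linear layer below is an assumption, since the description only speaks of a classifier network algorithm.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Sketch of step 106: classify the feature vector and pick the most
    probable action type as the action recognition result."""

    def __init__(self, feature_dim, num_actions):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_actions)

    def forward(self, feature_vec):
        # feature_vec: (batch, feature_dim)
        probs = torch.softmax(self.fc(feature_vec), dim=-1)   # probability of each action type
        return probs.argmax(dim=-1), probs                    # action type with the highest probability
```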
The action recognition method provided by the invention determines a target frame and several consecutive frames preceding it in the received video data, and extracts the data information of the target frame and of those preceding frames from the video data. A preset number of gain parameters are convolved a preset number of times with the data information of the target frame and of the preceding frames to obtain high-order feature data. The high-order feature data are added to the video data to form data to be extracted, time-sequence features are extracted from these data to obtain a feature vector, and the action recognition result is finally obtained from the feature vector. In this way the high-order features of the video data can be extracted and the accuracy of action recognition is improved.
On the basis of the method shown in fig. 1, fig. 2 is a schematic flow chart of an action recognition method provided in a second embodiment of the present invention, as shown in fig. 2, the method includes:
step 200, receiving a training data set, wherein the training data set comprises a plurality of training data and a recognition result corresponding to each training data.
Step 201, selecting a training data from the training data set as a data to be trained.
Step 202, determining a training frame and a plurality of continuous frames before the training frame in the data to be trained, and extracting training data information of the training frame and training data information of the plurality of continuous frames before the training frame from the data to be trained.
To further describe the technical solution provided in the second embodiment, the case where the video data contain the position information of each human joint for several frames is taken as an example.
Specifically, unlike the first embodiment, the second embodiment also includes a training process for the gain parameters. In steps 200 to 202 above, a training data set is first received; it comprises a plurality of training data and the recognition result corresponding to each training data, that is, a plurality of groups of training data together with the confirmed action type of each group. Each training data may include the position information of each human joint for several frames, and the recognition result may be the action type expressed by that position information. A piece of training data is then selected from the training data set as the data to be trained, a training frame to be processed and several consecutive frames before it are determined, and the training data information of the training frame and of those preceding frames is extracted from the data to be trained.
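Purely to make the data layout concrete, one hypothetical representation of a training data set entry might look like this; the field names, frame count and joint count are illustrative, not mandated by the description.

```python
import numpy as np

training_sample = {
    "joints": np.random.rand(30, 25, 3),   # 30 frames x 25 human joints x (x, y, z)
    "action": "push",                      # the confirmed recognition result for this clip
}
training_data_set = [training_sample]      # the set holds many such (training data, result) pairs
```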
Step 203, performing convolution processing for preset times on the gain parameters to be trained, the training data information of the training frame and the training data information of a plurality of continuous frames before the training frame, which are preset in number, to obtain high-order characteristic training data.
Each convolution processing may correspond to one gain parameter, and the number of times of convolution processing is determined according to the number of high-order characteristic data.
Specifically, one gain parameter to be trained can be preset for each kind of high-order feature data, so that only one gain parameter to be trained is adopted in each convolution processing. For example, for a high-order feature of "acceleration" that needs to be extracted, a gain parameter to be trained may be used to correspond to the high-order feature, so that the trained gain parameter may be used to extract the high-order feature of "acceleration".
Further, firstly, training data to be convolved can be generated according to training data information of a training frame and training data information of a plurality of continuous frames before the training frame, then the training data to be convolved are convolved with training gain parameters of a preset number respectively to obtain a plurality of training convolution results, and finally the training convolution results are spliced to obtain high-order characteristic training data.
For example, suppose the data to be trained contain 30 frames of training data information, each frame containing the position information of 25 human joints, with frame 20 being the training frame and frames 15-19 the consecutive frames before it. In the above step, for the first human joint, the to-be-convolved training data of that joint may first be assembled and then convolved with a preset first training gain parameter to obtain the training convolution result of the first human joint. The second human joint is then processed in the same way to obtain its training convolution result. After the training convolution results of all human joints for the first training gain parameter have been obtained, the same processing can be carried out for a preset second training gain parameter, and so on, until the training convolution results for all training gain parameters are obtained. The training convolution results are then spliced to obtain the high-order feature training data.
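The per-joint ordering in this example could be sketched as follows; the kernel values are hypothetical and the 0-based indices simply mirror the 1-based frame numbers above.

```python
import numpy as np

num_frames, num_joints = 30, 25
data = np.random.rand(num_frames, num_joints, 3)   # position (x, y, z) of each joint per frame

# 1-based frame 20 is the training frame, frames 15-19 are the preceding consecutive frames,
# so the 0-based temporal window covers indices 14..19.
window = slice(14, 20)

training_gain_params = [np.array([1.0, -2.0, 1.0]),   # hypothetical kernels, one per
                        np.array([1.0, -1.0])]        # kind of high-order feature

results = []
for kernel in training_gain_params:                   # first gain parameter, then the second, ...
    per_joint = []
    for j in range(num_joints):                       # first human joint, second human joint, ...
        joint_window = data[window, j, :]             # training data to be convolved for this joint
        conv = np.stack([np.convolve(joint_window[:, c], kernel, mode="valid")
                         for c in range(3)], axis=1)
        per_joint.append(conv.reshape(-1))            # training convolution result for this joint
    results.append(np.concatenate(per_joint))         # all joints for this training gain parameter
high_order_training_data = np.concatenate(results)    # spliced high-order feature training data
```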
Step 204, adding the high-order characteristic training data into the data to be trained to form training data to be extracted.
Step 205, performing time sequence feature extraction on the training data to be extracted to obtain a training feature vector of the training data to be extracted.
And step 206, obtaining a prediction result according to the training feature vector.
Specifically, in steps 204 to 206, the data information of the training frame and the high-order feature training data may first be packed to generate updated data information of the training frame, and the data information of the training frame in the data to be trained is then replaced with this updated data information to obtain the training data to be extracted. Time-sequence features are then extracted from the training data to be extracted with a three-layer bidirectional LSTM recurrent neural network to obtain the training feature vector of the training data to be extracted. Finally, the training feature vector is classified with a classifier network algorithm, the predicted action types matching the training feature vector and their corresponding probabilities are obtained, and the action type with the highest probability is selected as the prediction result.
And step 207, obtaining the cross entropy of the recognition result and the prediction result corresponding to the data to be trained, and judging whether the cross entropy is converged.
If not, go to step 208; if yes, go to step 209.
Step 208, correcting the gain parameter to be trained according to the cross entropy, selecting next training data from the training data set as the data to be trained, and returning to step 202.
Step 209, taking the gain parameter to be trained as a gain parameter.
Specifically, in steps 207 to 209, the recognition result corresponding to the data to be trained and the prediction result may each be expressed as a vector, the cross entropy of the recognition result and the prediction result is calculated, and whether the cross entropy converges is judged.
When the cross entropy has not converged, it is used to correct the gain parameter to be trained; for example, the sum of the gain parameter to be trained and the cross entropy may be taken as the corrected gain parameter to be trained. The next training data is then selected from the training data set as the data to be trained, and the process returns to the step of determining a training frame in the data to be trained and several consecutive frames before the training frame. When step 203 is reached again, the gain parameter to be trained is the corrected one; that is, whenever the cross entropy has not converged, the corrected gain parameter obtained in this round is used as the gain parameter to be trained in the next round. This cycle continues until the cross entropy converges.
When the cross entropy is converged, the training of the gain parameter to be trained is completed, and the gain parameter to be trained can be used as the gain parameter for identifying the video data.
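For reference, with the recognition result written as a one-hot vector p and the prediction as a probability vector q, the cross entropy is H(p, q) = -Σ_k p_k log q_k; that standard definition is assumed in the hedged sketch below, which only illustrates the control flow of steps 200 to 209. The "parameter + cross entropy" correction follows the example given above; in practice a gradient-based update driven by the same cross entropy would occupy that slot, and the convergence test shown (the change in cross entropy falling below a threshold) is likewise only one possible choice.

```python
import numpy as np

def train_gain_parameters(training_data_set, gain_params_to_train,
                          extract_features, classify, tol=1e-3, max_passes=100):
    """Hedged sketch of the training loop (steps 200-209).

    training_data_set:    iterable of (data_to_train, label_index) pairs.
    gain_params_to_train: list of 1-D kernels corrected during training.
    extract_features:     callable covering steps 202-205 for one sample.
    classify:             callable returning class probabilities (step 206).
    """
    prev_ce = None
    for _ in range(max_passes):
        for data_to_train, label in training_data_set:
            feats = extract_features(data_to_train, gain_params_to_train)
            probs = classify(feats)
            ce = -np.log(probs[label] + 1e-12)        # cross entropy of prediction vs. recognition result
            if prev_ce is not None and abs(prev_ce - ce) < tol:
                return gain_params_to_train            # converged: use as the gain parameters
            # not converged: correct the gain parameters with the cross entropy
            gain_params_to_train = [k + ce for k in gain_params_to_train]
            prev_ce = ce
    return gain_params_to_train
```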
Step 210, receiving video data.
Step 211, determining a target frame and a plurality of continuous frames before the target frame, and extracting data information of the target frame and data information of the plurality of continuous frames before the target frame from the video data;
Step 212, performing convolution processing on the preset number of gain parameters, the data information of the target frame and the data information of a plurality of continuous frames before the target frame for a preset number of times to obtain high-order characteristic data.
Step 213, adding the high-order feature data to the video data to form data to be extracted.
Step 214, performing time sequence feature extraction on the data to be extracted to obtain a feature vector.
Step 215, obtaining the action recognition result according to the feature vector.
Specifically, in steps 210 to 215, video data is first received; the video data may include the position information of the human joints for several frames. A target frame to be processed and several consecutive frames before the target frame are determined, and the data information of the target frame and of those preceding frames is extracted from the video data.
Then, the preset number of trained gain parameters, the data information of the target frame and the data information of the preceding consecutive frames are convolved a preset number of times to obtain the high-order feature data. Specifically, data to be convolved are generated from the data information of the target frame and of the preceding consecutive frames; the data to be convolved are then convolved with the preset number of gain parameters to obtain a plurality of convolution results, which are finally spliced into the high-order feature data. For example, suppose the video data contain 30 frames of data information, each frame containing the position information of 25 human joints, with frame 20 being the target frame and frames 15-19 the consecutive frames before it. For the first human joint, the data to be convolved for that joint may first be assembled and then convolved with a preset first gain parameter to obtain the convolution result of the first human joint; the second human joint is processed in the same way. After the convolution results of all human joints for the first gain parameter have been obtained, the video data can be processed in the same way with the second gain parameter, until the convolution results for all gain parameters are obtained. The convolution results are then spliced to obtain the high-order feature data. Each convolution processing may correspond to one gain parameter, and the number of times of convolution processing is determined by the number of high-order feature data.
The high-order feature data are added to the video data to form the data to be extracted. Specifically, the data information of the target frame, which contains the position information of each human joint, and the high-order feature data obtained in the above steps are packed to generate updated data information of the target frame, and the data information of the target frame in the video data is replaced with this updated data information to obtain the data to be extracted. In other words, compared with the original video data, the data information of the target frame in the data to be extracted additionally contains the high-order feature data, while the data information of the other frames remains unchanged.
Time-sequence features are then extracted from the data to be extracted to obtain a feature vector. Specifically, a three-layer bidirectional LSTM recurrent neural network can be used to extract features from the data information of each frame of the data to be extracted, giving the feature data of each frame. The feature data of the frames may then be spliced directly into the feature vector of the data to be extracted; alternatively, an averaging algorithm may be used to average the feature data of all frames, or a mean-clustering algorithm may be applied to them, to obtain the feature vector. A person skilled in the art may select different averaging algorithms according to actual needs, which is not limited by the present invention.
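The two combination strategies mentioned here (direct splicing and averaging) can be contrasted in a few lines; the mean-clustering variant is left out because the description does not specify its details.

```python
import numpy as np

def pool_frame_features(frame_feats, mode="mean"):
    """Sketch of combining per-frame feature data into one feature vector.

    frame_feats: array of shape (num_frames, feat_dim), one row per frame.
    """
    if mode == "concat":
        return frame_feats.reshape(-1)      # direct splicing of the per-frame feature data
    return frame_feats.mean(axis=0)         # averaging over all frames of the data to be extracted
```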
Finally, the feature vectors are classified with a classifier network algorithm, a plurality of action types matching the feature vectors and their corresponding probabilities are obtained, and the action type with the highest probability is selected as the action recognition result.
In the motion recognition method provided by the invention, before motion recognition is performed on the video data, the gain parameters are first trained with a training data set comprising a plurality of training data and the recognition result corresponding to each training data, to obtain trained gain parameters. Video data is then received, a target frame and several consecutive frames before the target frame are determined, and the data information of the target frame and of the preceding consecutive frames is extracted from the video data. The preset number of trained gain parameters, the data information of the target frame and the data information of the preceding consecutive frames are convolved a preset number of times to obtain high-order feature data; the high-order feature data are added to the video data to form the data to be extracted; time-sequence features are extracted from the data to be extracted to obtain a feature vector; and the action recognition result is finally obtained from the feature vector. In this way the high-order features of the video data can be extracted and the accuracy of action recognition is improved.
Fig. 3 is a schematic structural diagram of a motion recognition device according to a third embodiment of the present invention, and as shown in fig. 3, the motion recognition device according to the third embodiment of the present invention specifically includes:
a receiving module 10, configured to receive video data;
the data extraction module 20 is configured to determine a target frame and a plurality of consecutive frames before the target frame, and extract data information of the target frame and data information of the plurality of consecutive frames before the target frame from the video data;
the high-order feature extraction module 30 is configured to perform convolution processing for preset times on gain parameters of a preset number, data information of a target frame, and data information of a plurality of consecutive frames before the target frame to obtain high-order feature data; the high-order characteristic data are added into the video data to form data to be extracted;
the feature vector extraction module 40 is configured to perform time sequence feature extraction on data to be extracted to obtain a feature vector;
and the identification result acquisition module 50 is used for acquiring the action identification result according to the feature vector.
Further, each convolution processing corresponds to one gain parameter, and the number of convolution processing is determined according to the number of the high-order characteristic data.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and corresponding beneficial effects of the system described above may refer to the corresponding process in the foregoing method embodiments, and are not described herein again.
The action recognition device provided by the invention determines a target frame and several consecutive frames preceding it in the received video data, and extracts the data information of the target frame and of those preceding frames from the video data. A preset number of gain parameters are convolved a preset number of times with the data information of the target frame and of the preceding frames to obtain high-order feature data. The high-order feature data are added to the video data to form data to be extracted, time-sequence features are extracted from these data to obtain a feature vector, and the action recognition result is finally obtained from the feature vector. In this way the high-order features of the video data can be extracted and the accuracy of action recognition is improved.
Based on the structure shown in fig. 3, fig. 4 is a schematic structural diagram of an action recognition device according to a fourth embodiment of the present invention. As shown in fig. 4, similarly to the third embodiment, the motion recognition apparatus includes:
the receiving module 10 is used for receiving video data.
And the data extraction module 20 is configured to determine the target frame and consecutive frames before the target frame, and extract data information of the target frame and data information of consecutive frames before the target frame from the video data.
The high-order feature extraction module 30 is configured to perform convolution processing for preset times on gain parameters of a preset number, data information of a target frame, and data information of a plurality of consecutive frames before the target frame to obtain high-order feature data; and the method is also used for adding high-order characteristic data into the video data to form data to be extracted.
And the feature vector extraction module 40 is configured to perform time sequence feature extraction on the data to be extracted to obtain a feature vector.
And the identification result acquisition module 50 is used for acquiring the action identification result according to the feature vector.
The difference from the third embodiment is that:
the receiving module 10 is further configured to receive a training data set before receiving the video data, where the training data set includes a plurality of training data and a recognition result corresponding to each training data.
The data extraction module 20 is further configured to select a piece of training data from the training data set as the data to be trained, and is further configured to determine a training frame and a plurality of continuous frames before the training frame in the data to be trained, and extract the training data information of the training frame and of the plurality of continuous frames before the training frame from the data to be trained.
The high-order feature extraction module 30 is further configured to perform convolution processing for preset times on gain parameters to be trained, training data information of a training frame, and training data information of consecutive frames before the training frame, which are preset in number, to obtain high-order feature training data; and adding the high-order characteristic training data into the data to be trained to form the training data to be extracted.
The feature vector extraction module 40 is further configured to perform timing feature extraction on the training data to be extracted to obtain a training feature vector of the training data to be extracted.
The recognition result obtaining module 50 is further configured to obtain a prediction result according to the training feature vector.
The motion recognition device further includes: and the judging module 60, wherein the judging module 60 is configured to obtain the cross entropy of the recognition result and the prediction result corresponding to the data to be trained, and judge whether the cross entropy converges.
The high-order feature extraction module 30 is further configured to use the gain parameter to be trained as the gain parameter when the determination module 60 determines that the cross entropy converges, and the receiving module 10 performs the step of receiving the video data.
The high-order feature extraction module 30 is further configured to correct the gain parameter to be trained according to the cross entropy when the determination module 60 determines that the cross entropy is not converged, and the data extraction module 20 is further configured to select next training data from the training data set as the data to be trained, and return to the step of determining a training frame in the data to be trained and several consecutive frames before the training frame.
Further, the high-order feature extraction module 30 is specifically configured to:
generating data to be convolved according to the data information of the target frame and the data information of a plurality of continuous frames before the target frame; convolving the data to be convolved with a preset number of gain parameters respectively to obtain a plurality of convolution results; splicing a plurality of convolution results to obtain high-order characteristic data; generating training data to be convolved according to training data information of a training frame and training data information of a plurality of continuous frames before the training frame; performing convolution on the training data to be convolved and a preset number of training gain parameters respectively to obtain a plurality of training convolution results; and splicing a plurality of training convolution results to obtain high-order characteristic training data.
Further, the high-order feature extraction module 30 is specifically configured to: packing the data information of the target frame and the high-order characteristic data to generate updated data information of the target frame; and replacing the data information of the target frame in the video data with the updated data information of the target frame to obtain the data to be extracted.
Correspondingly, the feature vector extraction module 40 is specifically configured to: respectively extracting the characteristics of the data information of each frame in the data to be extracted to obtain the characteristic data of each frame; and carrying out mean processing on the feature data of all frames in the data to be extracted to obtain the feature vector of the data to be extracted.
Further, each convolution processing corresponds to one gain parameter, and the number of convolution processing is determined according to the number of the high-order characteristic data.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and corresponding beneficial effects of the system described above may refer to the corresponding process in the foregoing method embodiments, and are not described herein again.
The motion recognition device provided by the invention also trains the gain parameters, before performing motion recognition on the video data, with a training data set comprising a plurality of training data and the recognition result corresponding to each training data, to obtain trained gain parameters. Video data is then received, a target frame and several consecutive frames before the target frame are determined, and the data information of the target frame and of the preceding consecutive frames is extracted from the video data. The preset number of trained gain parameters, the data information of the target frame and the data information of the preceding consecutive frames are convolved a preset number of times to obtain high-order feature data; the high-order feature data are added to the video data to form the data to be extracted; time-sequence features are extracted from the data to be extracted to obtain a feature vector; and the action recognition result is finally obtained from the feature vector. In this way the high-order features of the video data can be extracted and the accuracy of action recognition is improved.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by program instructions executed on related hardware. The program may be stored in a computer-readable storage medium; when executed, it performs the steps of the method embodiments described above. The storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disk.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A motion recognition method, comprising:
receiving video data;
determining a target frame and a plurality of continuous frames before the target frame, and extracting data information of the target frame and data information of the plurality of continuous frames before the target frame from the video data;
performing convolution processing for preset times on gain parameters with preset number, data information of the target frame and data information of a plurality of continuous frames before the target frame to obtain high-order characteristic data;
the performing convolution processing for preset times on the gain parameters with the preset number, the data information of the target frame and the data information of a plurality of continuous frames before the target frame to obtain high-order characteristic data includes:
generating data to be convolved according to the data information of the target frame and the data information of a plurality of continuous frames before the target frame;
the data to be convolved are convolved with the gain parameters of the preset number respectively, the frequency of convolution processing is determined according to the number of the high-order characteristic data, and a plurality of convolution results are obtained;
splicing the convolution results to obtain high-order characteristic data;
adding the high-order characteristic data into the video data to form data to be extracted;
performing time sequence feature extraction on the data to be extracted to obtain a feature vector;
and acquiring a motion recognition result according to the feature vector.
2. The motion recognition method according to claim 1, wherein before the receiving video data, the method comprises:
receiving a training data set, wherein the training data set comprises a plurality of training data and a recognition result corresponding to each training data;
selecting training data from the training data set as data to be trained;
determining a training frame in the data to be trained and a plurality of continuous frames before the training frame, and extracting training data information of the training frame and training data information of the plurality of continuous frames before the training frame from the data to be trained;
performing convolution processing for a preset number of times on a preset number of gain parameters to be trained, the training data information of the training frame and the training data information of the plurality of continuous frames before the training frame, to obtain high-order characteristic training data;
adding the high-order characteristic training data into the data to be trained to form training data to be extracted;
performing time sequence feature extraction on the training data to be extracted to obtain a training feature vector of the training data to be extracted;
obtaining a prediction result according to the training feature vector;
acquiring the cross entropy of the recognition result and the prediction result corresponding to the data to be trained, and judging whether the cross entropy is converged;
if the convergence is achieved, taking the gain parameter to be trained as a gain parameter, and executing the step of receiving the video data;
and if not, correcting the gain parameter to be trained according to the cross entropy, selecting next training data from the training data set as the data to be trained, and returning to the step of determining a training frame in the data to be trained and a plurality of continuous frames before the training frame.
3. The motion recognition method according to claim 2, wherein the performing convolution processing on a preset number of gain parameters to be trained, training data information of the training frame, and training data information of a plurality of consecutive frames before the training frame for a preset number of times to obtain high-order feature training data comprises:
generating training data to be convolved according to the training data information of the training frame and training data information of a plurality of continuous frames before the training frame;
the training data to be convolved are convolved with the preset number of training gain parameters respectively to obtain a plurality of training convolution results;
and splicing the training convolution results to obtain high-order characteristic training data.
4. The motion recognition method according to claim 1, wherein the adding the high-order feature data to the video data to form data to be extracted comprises:
packing the data information of the target frame and the high-order characteristic data to generate updated data information of the target frame;
replacing the data information of the target frame in the video data with the updated data information of the target frame to obtain the data to be extracted;
correspondingly, the performing time sequence feature extraction on the data to be extracted to obtain a feature vector includes:
respectively extracting the characteristics of the data information of each frame in the data to be extracted to obtain the characteristic data of each frame;
and carrying out mean processing on the feature data of all frames in the data to be extracted to obtain the feature vector of the data to be extracted.
5. An action recognition device, comprising:
the receiving module is used for receiving video data;
the data extraction module is used for determining a target frame and a plurality of continuous frames before the target frame and extracting data information of the target frame and data information of the plurality of continuous frames before the target frame from the video data;
the high-order characteristic extraction module is used for carrying out convolution processing on a preset number of gain parameters, the data information of the target frame and the data information of a plurality of continuous frames before the target frame for a preset number of times to obtain high-order characteristic data; the high-order characteristic data are also used for being added into the video data to form data to be extracted;
the high-order feature extraction module is specifically configured to:
generating data to be convolved according to the data information of the target frame and the data information of a plurality of continuous frames before the target frame; the data to be convolved are convolved with the gain parameters of the preset number respectively, the frequency of convolution processing is determined according to the number of the high-order characteristic data, and a plurality of convolution results are obtained; splicing the convolution results to obtain high-order characteristic data;
the feature vector extraction module is used for performing time sequence feature extraction on the data to be extracted to obtain a feature vector;
and the recognition result acquisition module is used for obtaining an action recognition result according to the feature vector.
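Read as software, the modules of claim 5 roughly correspond to the class skeleton below. The class name, method names, and interfaces are invented for illustration and the bodies are stubs; this is a structural sketch, not the claimed implementation.

```python
# Structural sketch of claim 5's modules; names and interfaces are assumptions.
class ActionRecognitionDevice:
    def receive(self, video_data):
        """Receiving module: accept the video data."""
        self.video_data = video_data

    def extract_data(self):
        """Data extraction module: pick the target frame and the continuous frames before it."""
        ...

    def extract_high_order_features(self):
        """High-order feature extraction module: convolve the gain parameters with the frame
        data, splice the convolution results, and add them to the video data."""
        ...

    def extract_feature_vector(self):
        """Feature vector extraction module: time sequence feature extraction on the data to be extracted."""
        ...

    def recognize(self):
        """Recognition result acquisition module: map the feature vector to an action recognition result."""
        ...
```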
6. The action recognition device according to claim 5,
wherein the receiving module is further configured to receive a training data set before receiving the video data, where the training data set includes a plurality of training data and a recognition result corresponding to each training data;
the data extraction module is further used for selecting training data from the training data set as data to be trained, determining a training frame in the data to be trained and a plurality of continuous frames before the training frame, and extracting training data information of the training frame and training data information of the plurality of continuous frames before the training frame from the data to be trained;
the high-order feature extraction module is further used for performing convolution processing on a preset number of gain parameters to be trained, the training data information of the training frame and the training data information of the plurality of continuous frames before the training frame for a preset number of times to obtain high-order feature training data, and for adding the high-order feature training data to the data to be trained to form training data to be extracted;
the feature vector extraction module is further configured to perform time sequence feature extraction on the training data to be extracted to obtain a training feature vector of the training data to be extracted;
the recognition result acquisition module is further used for obtaining a prediction result according to the training feature vector;
the action recognition device further comprises a determination module, wherein the determination module is used for obtaining the cross entropy between the recognition result corresponding to the data to be trained and the prediction result, and for judging whether the cross entropy converges;
the high-order feature extraction module is further configured to take the gain parameter to be trained as the gain parameter when the determination module determines that the cross entropy converges, whereupon the receiving module executes the step of receiving the video data;
and the high-order feature extraction module is further configured to correct the gain parameter to be trained according to the cross entropy when the determination module determines that the cross entropy does not converge, in which case the data extraction module is further configured to select next training data from the training data set as the data to be trained and to return to the step of determining a training frame in the data to be trained and a plurality of continuous frames before the training frame.
7. The action recognition device according to claim 6, wherein the high-order feature extraction module is further specifically configured to:
generate training data to be convolved according to the training data information of the training frame and the training data information of the plurality of continuous frames before the training frame; respectively convolve the training data to be convolved with the preset number of gain parameters to be trained to obtain a plurality of training convolution results; and splice the training convolution results to obtain the high-order feature training data.
8. The action recognition device according to claim 5,
wherein the high-order feature extraction module is specifically configured to pack the data information of the target frame and the high-order feature data to generate updated data information of the target frame, and to replace the data information of the target frame in the video data with the updated data information of the target frame to obtain the data to be extracted;
correspondingly, the feature vector extraction module is specifically configured to perform feature extraction on the data information of each frame in the data to be extracted to obtain feature data of each frame, and to perform mean processing on the feature data of all frames in the data to be extracted to obtain the feature vector of the data to be extracted.
CN201710470470.8A 2017-06-20 2017-06-20 Action recognition method and device Expired - Fee Related CN109101858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710470470.8A CN109101858B (en) 2017-06-20 2017-06-20 Action recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710470470.8A CN109101858B (en) 2017-06-20 2017-06-20 Action recognition method and device

Publications (2)

Publication Number Publication Date
CN109101858A CN109101858A (en) 2018-12-28
CN109101858B true CN109101858B (en) 2022-02-18

Family

ID=64795666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710470470.8A Expired - Fee Related CN109101858B (en) 2017-06-20 2017-06-20 Action recognition method and device

Country Status (1)

Country Link
CN (1) CN109101858B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598744A (en) * 2019-08-12 2019-12-20 浙江大学 Real-time human body behavior recognition system and method based on inertial sensor and Edge TPU
CN113573076A (en) * 2020-04-29 2021-10-29 华为技术有限公司 Method and apparatus for video encoding
CN112580577B (en) * 2020-12-28 2023-06-30 出门问问(苏州)信息科技有限公司 Training method and device for generating speaker image based on facial key points

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102306301A (en) * 2011-08-26 2012-01-04 中南民族大学 Motion identification system by simulating spiking neuron of primary visual cortex
CN103413154A (en) * 2013-08-29 2013-11-27 北京大学深圳研究生院 Human motion identification method based on normalized class Google measurement matrix
CN104217214A (en) * 2014-08-21 2014-12-17 广东顺德中山大学卡内基梅隆大学国际联合研究院 Configurable convolutional neural network based red green blue-distance (RGB-D) figure behavior identification method
US9230159B1 (en) * 2013-12-09 2016-01-05 Google Inc. Action recognition and detection on videos
CN106407889A (en) * 2016-08-26 2017-02-15 上海交通大学 Video human body interaction motion identification method based on optical flow graph depth learning model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102306301A (en) * 2011-08-26 2012-01-04 中南民族大学 Motion identification system by simulating spiking neuron of primary visual cortex
CN103413154A (en) * 2013-08-29 2013-11-27 北京大学深圳研究生院 Human motion identification method based on normalized class Google measurement matrix
US9230159B1 (en) * 2013-12-09 2016-01-05 Google Inc. Action recognition and detection on videos
CN104217214A (en) * 2014-08-21 2014-12-17 广东顺德中山大学卡内基梅隆大学国际联合研究院 Configurable convolutional neural network based red green blue-distance (RGB-D) figure behavior identification method
CN106407889A (en) * 2016-08-26 2017-02-15 上海交通大学 Video human body interaction motion identification method based on optical flow graph depth learning model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Differential recurrent neural networks for action recognition; Vivek Veeriah et al.; Proceedings of the IEEE International Conference on Computer Vision; 20151231; pp. 4041-4049 *
Skeleton optical spectra-based action recognition using convolutional neural network; Yonghong Hou et al.; IEEE Transactions on Circuits and Systems for Video Technology; 20161114; vol. 28, no. 3; pp. 807-811 *
Online human action recognition based on temporal deep belief networks; Zhou Fengyu et al.; Acta Automatica Sinica; 20160731; vol. 42, no. 7; pp. 1030-1039 *

Also Published As

Publication number Publication date
CN109101858A (en) 2018-12-28

Similar Documents

Publication Publication Date Title
Lin et al. Bsn: Boundary sensitive network for temporal action proposal generation
US11238612B2 (en) Device and method of tracking poses of multiple objects based on single-object pose estimator
CN111079646B (en) Weak supervision video time sequence action positioning method and system based on deep learning
KR102641116B1 (en) Method and device to recognize image and method and device to train recognition model based on data augmentation
CN109891897B (en) Method for analyzing media content
CN109410974B (en) Voice enhancement method, device, equipment and storage medium
US9053358B2 (en) Learning device for generating a classifier for detection of a target
CN108491794B (en) Face recognition method and device
CN109271958B (en) Face age identification method and device
KR102374747B1 (en) Method and device to recognize object
CN110070029B (en) Gait recognition method and device
CN109101858B (en) Action recognition method and device
WO2021069945A1 (en) Method for recognizing activities using separate spatial and temporal attention weights
CN108875482B (en) Object detection method and device and neural network training method and device
JP2017062778A (en) Method and device for classifying object of image, and corresponding computer program product and computer-readable medium
KR20150099129A (en) Facical expression recognition method using adaptive decision tree based on local feature extraction and apparatus using thereof
CN110288085B (en) Data processing method, device and system and storage medium
KR20190123371A (en) Emotion recognition method and artificial intelligence learning method based on facial image
KR101903684B1 (en) Image characteristic estimation method and device
Das et al. Rgait-net: An effective network for recovering missing information from occluded gait cycles
CN107948721B (en) Method and device for pushing information
CN112487903B (en) Gait data generation method and device based on countermeasure network
CN115223199A (en) Pig behavior data equalization method and device, computer equipment and storage medium
CN109409226B (en) Finger vein image quality evaluation method and device based on cascade optimization CNN
CN110489592B (en) Video classification method, apparatus, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230410

Address after: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee after: Peking University

Address before: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee before: Peking University

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

TR01 Transfer of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220218

CF01 Termination of patent right due to non-payment of annual fee