CN112200041B - Video motion recognition method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN112200041B
CN112200041B
Authority
CN
China
Prior art keywords
dimensional coordinate
key points
video
human body
data
Prior art date
Legal status
Active
Application number
CN202011055889.5A
Other languages
Chinese (zh)
Other versions
CN112200041A (en)
Inventor
尹康
吴宇斌
孔翰
郭烽
Current Assignee
Oppo Chongqing Intelligent Technology Co Ltd
Original Assignee
Oppo Chongqing Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Oppo Chongqing Intelligent Technology Co Ltd filed Critical Oppo Chongqing Intelligent Technology Co Ltd
Priority to CN202011055889.5A priority Critical patent/CN112200041B/en
Publication of CN112200041A publication Critical patent/CN112200041A/en
Application granted granted Critical
Publication of CN112200041B publication Critical patent/CN112200041B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V40/20: Image or video recognition or understanding; recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06N3/045: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/049: Computing arrangements based on biological models; neural networks; architecture; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/084: Computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent
    • G06T17/00: Image data processing or generation; three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06V20/42: Scenes; scene-specific elements in video content; higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items, of sport video content
    • G06V20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The disclosure provides a video action recognition method and device, a computer readable storage medium and electronic equipment, and relates to the technical field of computer vision. The video motion recognition method comprises the following steps: detecting human body key points in image frames of a video to be processed, and forming a two-dimensional coordinate sequence of the human body key points according to position information of the human body key points in each image frame; performing three-dimensional reconstruction based on the two-dimensional coordinate sequence to generate three-dimensional coordinate data of the human body key points; and performing action recognition processing on the three-dimensional coordinate data to obtain an action recognition result of the video to be processed. The video motion recognition accuracy is improved, and the data processing amount is reduced.

Description

Video motion recognition method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to a video motion recognition method, a video motion recognition apparatus, a computer-readable storage medium, and an electronic device.
Background
Video action recognition is an important task in the field of computer vision and is widely applied in scenarios such as video classification, electronic surveillance, and advertisement placement. Compared with still images, video content is more complex and variable, and video capture may suffer from occlusion, camera shake, viewpoint changes, and the like, which makes action recognition more difficult.
In the related art, video motion recognition is mostly implemented on the basis of image motion recognition, and the processing procedure generally includes: first extracting image features from the video frame by frame, then fusing the image features into a global feature of the whole video using a feature fusion method, and finally obtaining the motion recognition result by processing the global feature. However, the recognition accuracy achieved in this way is low, and a large amount of redundant information irrelevant to motion recognition is extracted along with the image features, so the data processing load is high.
Disclosure of Invention
The present disclosure provides a video motion recognition method, a video motion recognition apparatus, a computer-readable storage medium, and an electronic device, thereby solving the problems of low recognition accuracy and high data processing amount in the related art to a certain extent.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a video motion recognition method, including: detecting human body key points in image frames of a video to be processed, and forming a two-dimensional coordinate sequence of the human body key points according to position information of the human body key points in each image frame; performing three-dimensional reconstruction based on the two-dimensional coordinate sequence to generate three-dimensional coordinate data of the human body key points; and performing action recognition processing on the three-dimensional coordinate data to obtain an action recognition result of the video to be processed.
According to a second aspect of the present disclosure, there is provided a video motion recognition apparatus comprising: the key point detection module is used for detecting human key points in image frames of a video to be processed and forming a two-dimensional coordinate sequence of the human key points according to position information of the human key points in each image frame; the three-dimensional reconstruction module is used for performing three-dimensional reconstruction based on the two-dimensional coordinate sequence to generate three-dimensional coordinate data of the human key points; and the action recognition module is used for carrying out action recognition processing on the three-dimensional coordinate data to obtain an action recognition result of the video to be processed.
According to a third aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, implements the video motion recognition method of the first aspect described above and possible implementations thereof.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the video action recognition method of the first aspect and possible embodiments thereof via execution of the executable instructions.
The technical scheme of the disclosure has the following beneficial effects:
A two-dimensional coordinate sequence of the human body key points is extracted from the image frames of the video to be processed, the two-dimensional coordinate sequence is reconstructed into three-dimensional coordinate data, and action recognition is then performed to obtain the action recognition result of the video to be processed. On one hand, the information of the human body key points is strongly correlated with the action, which is equivalent to introducing prior information of the action recognition scene, and the three-dimensional reconstruction yields three-dimensional coordinate data with richer information, so the accuracy of action recognition is improved. On the other hand, the scheme does not need to extract all features of the image frames in the video but only the human body key point information relevant to action recognition, which reduces redundant information and the amount of data to be processed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is apparent that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings can be obtained from those drawings without inventive effort for a person skilled in the art.
FIG. 1 shows a system architecture diagram of an operating environment in the exemplary embodiment;
FIG. 2 illustrates a flow chart of a video motion recognition method in the present exemplary embodiment;
FIG. 3 shows a schematic diagram of a three-dimensional reconstruction network in the present exemplary embodiment;
FIG. 4 illustrates a flow diagram of training a CNN in the present exemplary embodiment;
FIG. 5 illustrates a flow chart for determining a second loss function in the exemplary embodiment;
fig. 6 is a diagram showing a video action recognition flow in the present exemplary embodiment;
FIG. 7 shows a schematic diagram of a CNN training and testing flow in the present exemplary embodiment;
fig. 8 is a block diagram showing a video motion recognition apparatus in the present exemplary embodiment;
fig. 9 shows a block diagram of an electronic device in the present exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In the related art, video motion recognition is performed by extracting and fusing image features. This approach still falls within the scope of a general video classification task and cannot effectively exploit the characteristics of the video motion recognition scene, so the achievable accuracy is limited.
In view of the above, exemplary embodiments of the present disclosure provide a video motion recognition method. Fig. 1 is a system architecture diagram illustrating an operating environment of the video motion recognition method. As shown in fig. 1, the system architecture 100 may include: terminal 110, network 120, and server 130. The terminal 110 may be various electronic devices having a video capturing or video playing function, including but not limited to a mobile phone, a tablet computer, a digital camera, a personal computer, and the like. The medium used by network 120 to provide communication links between terminals 110 and server 130 may include various connection types, such as wired, wireless communication links, and the like. It should be understood that the number of terminals, networks, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, and servers, as desired for an implementation. For example, the server 130 may be a server cluster composed of a plurality of servers, and the like.
The video motion recognition method provided by the embodiment of the present disclosure may be executed by the terminal 110, for example, after the terminal 110 shoots a video, the motion recognition is performed on the video; the server 130 may execute the operation, for example, by uploading a video captured by the terminal 110 to the server 130, and causing the server 130 to recognize the motion of the video. The present disclosure is not limited thereto.
Fig. 2 shows an exemplary flow of a video motion recognition method, which may include the following steps S210 to S230:
step S210, detecting human body key points in image frames of a video to be processed, and forming a two-dimensional coordinate sequence of the human body key points according to position information of the human body key points in each image frame;
step S220, performing three-dimensional reconstruction based on the two-dimensional coordinate sequence to generate three-dimensional coordinate data of the key points of the human body;
step S230, performing motion recognition processing on the three-dimensional coordinate data to obtain a motion recognition result of the video to be processed.
In the method, a two-dimensional coordinate sequence of the key points of the human body is extracted from the image frame of the video to be processed, the two-dimensional coordinate sequence is reconstructed into three-dimensional coordinate data, and then action recognition is carried out to obtain an action recognition result of the video to be processed. On one hand, the information of the key points of the human body has strong correlation with the action, which is equivalent to introducing the prior information of an action recognition scene, and three-dimensional coordinate data with richer information is obtained through three-dimensional reconstruction, so that the accuracy of action recognition is improved. On the other hand, according to the scheme, all features of the image frames in the video do not need to be extracted, and only the information of the human key points relevant to motion recognition is extracted, so that redundant information is reduced, and the data processing amount is reduced.
Each step in fig. 2 is explained in detail below.
In step S210, human body key points are detected in image frames of the video to be processed, and a two-dimensional coordinate sequence of the human body key points is formed according to position information of the human body key points in each image frame.
Human body key points may include the head, neck, shoulders, elbows, hands, waist, knees, feet, and the like. In this exemplary embodiment, J human body key points may be selected in advance. For each image frame of the video to be processed, the J human body key points are detected to obtain their position information in that frame; for example, the two-dimensional pixel coordinates of each human body key point may be obtained, so that each image frame yields J × 2 two-dimensional coordinate data, e.g., a J × 2 matrix. The two-dimensional coordinate data of the image frames are then arranged in the order of the frames in the video to form the two-dimensional coordinate sequence of the human body key points.
In an alternative embodiment, step S210 may include: and extracting the position information of the human body key points from each image frame of the video to be processed by utilizing a pre-trained key point extraction network to obtain a two-dimensional coordinate sequence of the human body key points.
The key point extraction network may adopt an open-source human pose recognition network, such as DeepCut, DeeperCut, or OpenPose. Alternatively, an independently designed network may be used and trained on an image data set of a specific scene, which can achieve better results. An independently designed key point extraction network may take either of the following two structures:
a convolution layer and a full connection layer. The convolution layer can be set to different sizes, and the characteristics of the single-frame image are extracted from multiple scales; the number of output nodes of the full link layer may be set to J × 2, corresponding to two-dimensional coordinate data of J individual body key points.
A fully convolutional structure. The output can be set to J feature maps (Feature Maps), where the point with the maximum value on each feature map corresponds to one key point, and the J × 2 two-dimensional coordinate data are obtained by combining the key-point coordinates of the J feature maps.
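A minimal PyTorch-style sketch of the fully convolutional structure described above is given below. The backbone layers, the number of key points, and the argmax-based coordinate extraction are illustrative assumptions, not the exact architecture of the present disclosure.

    import torch
    import torch.nn as nn

    class FullyConvKeypointNet(nn.Module):
        """Outputs J feature maps and reads the peak of each map as one key point."""
        def __init__(self, num_keypoints: int = 17):
            super().__init__()
            self.backbone = nn.Sequential(                          # illustrative backbone
                nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(64, num_keypoints, kernel_size=1),        # J feature maps
            )

        def forward(self, frame: torch.Tensor) -> torch.Tensor:
            # frame: (N, 3, H, W) -> feature maps: (N, J, H, W)
            heatmaps = self.backbone(frame)
            n, j, h, w = heatmaps.shape
            flat = heatmaps.reshape(n, j, -1)
            idx = flat.argmax(dim=-1)                               # peak of each map
            ys = torch.div(idx, w, rounding_mode="floor").float()
            xs = (idx % w).float()
            return torch.stack([xs, ys], dim=-1)                    # (N, J, 2) coordinates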
The specific scenes refer to different types of action scenes, such as travel, somatosensory games, sports matches, building monitoring and the like. Taking a sports game as an example, a large number of images of the sports game are acquired and the key points of the human body are marked, thereby forming an image data set of a scene of the sports game. By training the key point extraction network by using the image data set, more accurate human body key point detection can be realized in the sports game image.
In an alternative embodiment, an image description algorithm, such as a Scale-Invariant Feature Transform (SIFT) algorithm, may also be used to describe and detect the human body key points in the image frame.
It should be noted that human body key point detection may be performed on a part of image frames in the video, or all image frames may be detected. For example:
human body key points can be detected in each frame of image of the video to be processed, and the video to be processed is assumed to comprise L in total 1 Frame, thus obtaining L 1 The information is most complete by a two-dimensional coordinate sequence of xJx 2. The method may include that human key points cannot be detected due to the fact that no human exists or a human body is blocked in a part of image frames, for the human key points, preset data can be used as position information of the human key points, the preset data can be (0, 0) or other numerical values, and therefore the two-dimensional coordinate data corresponding to each image frame are guaranteed to have the same dimensionality, and subsequent processing is facilitated.
In order to reduce the number of images and improve the processing efficiency, image frames of a video to be processed may be filtered, and in an optional implementation, the image frames may specifically be:
extracting key frame images from a video to be processed;
human key points are detected in the key frame images.
Ways to extract key frame images include, but are not limited to:
randomly extracting key frame images from a video to be processed;
selecting one frame as a key frame image every certain frame number, or deleting one frame every certain frame number, and the rest are key frame images;
selecting frames based on a pre-trained human body pre-detection model (for example, another neural network model different from the above-mentioned key point extraction network): each frame image of the video to be processed is input into the model, which outputs a confidence representing the probability that the image contains a human body; when the confidence is higher than a preset confidence threshold (which may be set according to experience or actual conditions, such as 70% or 80%), the corresponding image is determined to be a key frame image.
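A minimal sketch combining the three key-frame selection strategies listed above is shown below; the function and its parameters are illustrative assumptions:

    import numpy as np

    def sample_keyframes(num_frames, stride=None, k=None, confidences=None, threshold=0.8):
        """Return key-frame indices by confidence filtering, fixed stride, or random sampling."""
        idx = np.arange(num_frames)
        if confidences is not None:                  # human pre-detection confidence per frame
            return idx[np.asarray(confidences) > threshold]
        if stride is not None:                       # keep one frame every `stride` frames
            return idx[::stride]
        if k is not None:                            # randomly extract k key frames
            return np.sort(np.random.choice(idx, size=k, replace=False))
        return idx                                   # default: keep every frame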
Therefore, after the key frame image is determined, the key frame image is only subjected to human body key point detection, and other image frames are not detected, so that the processing efficiency is further improved.
With continued reference to fig. 2, in step S220, three-dimensional reconstruction is performed based on the two-dimensional coordinate sequence, so as to generate three-dimensional coordinate data of the human body key points.
The two-dimensional coordinate sequence contains continuity information of human body motion between different frames, and three-dimensional reconstruction can be carried out based on the continuity information. For example, although the two-dimensional coordinates only include x-axis and y-axis data of a plane, z-axis data can be reconstructed by changing the two-dimensional coordinates between different frames, thereby obtaining three-dimensional coordinate data.
In an alternative embodiment, step S220 includes: and performing three-dimensional reconstruction on the two-dimensional coordinate sequence by using a pre-trained three-dimensional reconstruction network to generate three-dimensional coordinate data of the key points of the human body.
In the three-dimensional reconstruction network, the coordinate data of the third dimension can be predicted through transformations of dimensionality and channel number. For example, an L1 × J × 2 two-dimensional coordinate sequence is input into the three-dimensional reconstruction network, the network computes over the J × 2 two-dimensional coordinate data of different channels, and L2 × J × 3 three-dimensional coordinate data are output, where J × 3 represents a three-dimensional coordinate and L2 represents the number of channels of the three-dimensional coordinates; usually L2 is smaller than L1. In particular, when L2 is greater than 1, the resulting L2 × J × 3 three-dimensional coordinate data is actually a three-dimensional coordinate sequence formed by the data of multiple channels, where the L2 channels can be regarded as L2 virtual three-dimensional image frames obtained by fusing the data of the original L1 image frames, and each virtual three-dimensional image frame corresponds to the three-dimensional coordinates of the J key points.
The process of performing three-dimensional reconstruction with the three-dimensional reconstruction network may specifically be: using convolution kernels in the three-dimensional reconstruction network to extract feature data within a neighborhood from the two-dimensional coordinate sequence, and processing them to obtain the corresponding three-dimensional coordinate data. The L1 × J × 2 two-dimensional coordinate sequence can be regarded as a J × 2 image with L1 channels and processed by convolution; the convolution extracts feature data from within a neighborhood in the two-dimensional coordinate sequence, where the neighborhood may cover adjacent channels or channels spaced within a certain range. Fig. 3 shows a schematic diagram of a three-dimensional reconstruction network: first, the two-dimensional coordinate data of every two adjacent channels in the two-dimensional coordinate sequence are extracted and, through the weighting and bias computation of the convolution kernels, the feature data of the first convolutional layer are obtained; then the data of channels spaced two apart (such as channel 1 and channel 4, or channel 2 and channel 5) are extracted and, through the weighting and bias computation of the convolution kernels, the feature data of the second convolutional layer, i.e., the three-dimensional coordinate data, are obtained.
In the three-dimensional reconstruction network, the data extracted by a convolution kernel are not necessarily from adjacent channels; that is, the convolution kernel can be a dilated convolution kernel. This enlarges the receptive field, so that feature data can be extracted over a larger scale (i.e., a longer time-stamp range of the video) to obtain more effective three-dimensional coordinate data.
In Fig. 3, setting the dilation of the first convolutional layer to 0 and the dilation of the second convolutional layer to 2 is merely exemplary; the present disclosure does not specifically limit the number of convolutional layers, the number of convolution kernels, the dilation values of the convolutional layers, and the like.
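As an illustration of the dilated-convolution reconstruction described above, the following PyTorch-style sketch maps an L1 × J × 2 sequence to L2 × J × 3 data using one-dimensional convolutions over the frame (channel) axis; the layer widths, kernel sizes, and dilation values are assumptions for illustration only, not the exact network of this disclosure.

    import torch
    import torch.nn as nn

    class TemporalReconstructionNet(nn.Module):
        """Maps an L1 x J x 2 two-dimensional sequence to L2 x J x 3 coordinates (L2 < L1)."""
        def __init__(self, num_keypoints: int = 17):
            super().__init__()
            j = num_keypoints
            # The first layer mixes adjacent frames; the second uses a dilated kernel
            # so that frames further apart (e.g. channels 1 and 4) are combined.
            self.conv1 = nn.Conv1d(j * 2, 128, kernel_size=2, dilation=1)
            self.conv2 = nn.Conv1d(128, j * 3, kernel_size=2, dilation=3)
            self.relu = nn.ReLU()
            self.j = j

        def forward(self, seq2d: torch.Tensor) -> torch.Tensor:
            # seq2d: (N, L1, J, 2) -> (N, J*2, L1) for convolution over the frame axis
            n, l1, j, _ = seq2d.shape
            x = seq2d.reshape(n, l1, j * 2).transpose(1, 2)
            x = self.relu(self.conv1(x))
            x = self.conv2(x)                        # (N, J*3, L2), with L2 < L1
            l2 = x.shape[-1]
            return x.transpose(1, 2).reshape(n, l2, self.j, 3)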
With reference to fig. 2, in step S230, the motion recognition processing is performed on the three-dimensional coordinate data to obtain a motion recognition result of the video to be processed.
This simplifies the video motion recognition processing into motion recognition processing of three-dimensional coordinate data. The three-dimensional coordinate data embodies all information related to motion recognition in the video to be processed, so that higher accuracy and lower calculation amount can be realized.
In the present exemplary embodiment, different motion categories, such as jogging, jumping, waving, etc., may be preset, and the motion recognition result of the video to be processed corresponds to one of the categories, i.e., the motion classification result.
In an alternative embodiment, step S230 may include: and processing the three-dimensional coordinate data by using a pre-trained motion recognition network to obtain a motion recognition result of the video to be processed.
The motion recognition network performs feature processing on the three-dimensional coordinate data and can adopt either of the following two structures: a convolutional structure similar to that used in image processing, which is suitable when the three-dimensional coordinate data has few channels; or an LSTM (Long Short-Term Memory network) structure, which is suitable when the three-dimensional coordinate data has many channels. In either case, after feature processing, a D-dimensional vector P = {p1, p2, ..., pD} can be output through one fully connected layer, where D is the preset number of action categories. The vector P is then normalized by a Softmax (normalized exponential function) operation, as shown below:
σ(pi) = exp(pi) / Σ j=1..D exp(pj), i = 1, 2, ..., D  (1)
where σ(·) denotes the Softmax function. The category corresponding to the maximum value in the vector P is selected as the video action recognition result.
The key point extraction network, the three-dimensional reconstruction network and the action recognition network can be three independent neural networks, and any two or all three networks can be arranged in the same large neural network.
In an alternative embodiment, the three-dimensional reconstruction network and the motion recognition network are both sub-networks in the same CNN (Convolutional Neural Network). For example, the last layer of the three-dimensional reconstruction network is connected to the first layer of the motion recognition network.
Referring to fig. 4, the CNN may be trained through the following steps S410 to S450:
step S410, detecting human body key points in image frames of the sample video, and forming a two-dimensional coordinate sample sequence of the human body key points according to position information of the human body key points in each image frame of the sample video.
The two-dimensional coordinate sample sequence is the two-dimensional coordinate sequence corresponding to the sample video. Before training, a video data set may be obtained, which contains a large number of sample videos of the same or similar scenes, and each sample video carries an action label (ground truth), which is generally a manually annotated action recognition result.
The method for detecting the human body key points for the image frames of the sample video is the same as that in the step S210, so as to obtain a two-dimensional coordinate sample sequence corresponding to the sample video, wherein the data format, the dimensionality and the like of the two-dimensional coordinate sample sequence are the same as those of the two-dimensional coordinate sequence obtained in the step S210.
In an alternative embodiment, before human body key point detection is performed, the image frames of the sample video may be normalized to a preset size, such as 448 × 448 pixels; the action label may be converted into a one-hot vector; and the sequence of image frames of the sample video and the action-label vector may be packed into a binary file (for example, in TFRecord format) to speed up the training process.
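A small sketch of this preprocessing (resizing and one-hot conversion) is given below; the use of OpenCV for resizing and the serialization hint in the comment are assumptions for illustration:

    import numpy as np
    import cv2  # assumed available for image resizing

    def preprocess_sample(frames, label, num_classes, size=448):
        """Normalize frames to a preset size and convert the action label to one-hot."""
        resized = [cv2.resize(f, (size, size)) for f in frames]     # e.g. 448 x 448 pixels
        one_hot = np.eye(num_classes, dtype=np.float32)[label]      # one-hot action label
        # The pair can then be serialized into a binary file (e.g. np.savez or TFRecord).
        return np.stack(resized), one_hot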
Step S420, inputting the two-dimensional coordinate sample sequence into the CNN to be trained, outputting the action recognition result of the sample video, and extracting three-dimensional coordinate sample data of the key points of the human body from the middle layer of the CNN.
A two-dimensional coordinate sample sequence and an action label (generally manually labeled) corresponding to a sample video form a pair of training data, the two-dimensional coordinate sample sequence is input into a CNN, an action recognition result of the sample video is output, and three-dimensional coordinate sample data is extracted from an intermediate layer (the last layer of a three-dimensional reconstruction network part) of the CNN. The three-dimensional coordinate sample data is three-dimensional coordinate data corresponding to the sample video, and the data format, the dimensionality and the like of the three-dimensional coordinate sample data are the same as the three-dimensional coordinate data obtained in the step S220.
Step S430, determining a first loss function according to the motion recognition result of the sample video and the motion label of the sample video.
Based on the deviation between the action recognition result and the action label of the sample video, a first loss function, denoted Loss1, can be constructed, for example using cross entropy.
Step S440, determining a second loss function according to the two-dimensional coordinate sample sequence and the three-dimensional coordinate sample data.
The positional relationships among the human body key points reflected by the two-dimensional coordinate sample sequence and by the three-dimensional coordinate sample data should be similar, so a constraint condition can be set and a corresponding second loss function, denoted Loss2, can be constructed. The second loss function represents the deviation between the positional relationship information among different human body key points in the two-dimensional coordinate sample sequence and that in the three-dimensional coordinate sample data.
Step S450, update the parameter of CNN by using the first loss function and the second loss function.
The goal of training the CNN is to optimize both Loss1 and Loss2. For example, a composite loss function Loss0 can be set:
Loss0 = α Loss1 + (1 - α) Loss2  (2)
where α is a weighting factor between 0 and 1 that reflects the weight given to Loss1 and can be set according to the actual situation. The gradient of Loss0 with respect to each parameter of the CNN is computed by back propagation, and the parameters are updated by gradient descent. This update process is iterated until the CNN reaches a certain accuracy, at which point training is determined to be complete.
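One training iteration with the composite loss of formula (2) may be sketched as follows; the assumption that the model returns both the class logits and the intermediate three-dimensional coordinate data, and the helper joint_distance_loss (sketched further below), are illustrative:

    import torch
    import torch.nn.functional as F

    def training_step(model, optimizer, seq2d, labels, joint_distance_loss, alpha=0.7):
        """One parameter update with Loss0 = alpha * Loss1 + (1 - alpha) * Loss2."""
        optimizer.zero_grad()
        logits, coords3d = model(seq2d)              # assumed to expose the intermediate 3D data
        loss1 = F.cross_entropy(logits, labels)      # deviation from the action label (class indices)
        loss2 = joint_distance_loss(seq2d, coords3d) # joint-distance constraint (second loss)
        loss0 = alpha * loss1 + (1.0 - alpha) * loss2
        loss0.backward()                             # gradients by back propagation
        optimizer.step()                             # gradient-descent update of the CNN parameters
        return loss0.item()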
With the method of FIG. 4, the three-dimensional reconstruction network and the motion recognition network can be trained simultaneously, which improves training efficiency. Moreover, by setting both the first loss function and the second loss function, a constraint between the two-dimensional coordinate sample sequence and the three-dimensional coordinate sample data is added on top of the conventional constraint between prediction data and labels, which facilitates more accurate training.
Further, referring to fig. 5, step S440 may be implemented by the following steps S510 to S530:
step S510, obtaining two-dimensional joint distance data through distances between preset human body key points in a two-dimensional coordinate sample sequence;
step S520, obtaining three-dimensional joint distance data through distances between preset human body key points in three-dimensional coordinate sample data;
step S530, determining a second loss function according to the two-dimensional joint distance data and the three-dimensional joint distance data.
For example, several groups of point pairs are preset among the human body key points, such as head-neck, left shoulder-left elbow, left elbow-left hand, right shoulder-right elbow, right elbow-right hand, and neck-waist. In the two-dimensional coordinate sample sequence, the distances between these point pairs are computed from the two-dimensional coordinate data of each channel and recorded as two-dimensional joint distance data; for example, the two-dimensional joint distance data of channel 1 is recorded as H1 = {h11, h12, ..., h1m}, where h11 is the head-neck distance, h12 is the left shoulder-left elbow distance, and so on, and m is the preset number of point pairs; the two-dimensional joint distance data of channel 2 is recorded as H2 = {h21, h22, ..., h2m}. The two-dimensional joint distance data of all channels are averaged along each dimension to obtain a vector recorded as the two-dimensional joint distance vector. The same processing is applied to the three-dimensional coordinate sample data to obtain the three-dimensional joint distance vector, whose components represent the point-pair distances in the three-dimensional coordinates. The Euclidean distance between the two-dimensional joint distance vector and the three-dimensional joint distance vector is then computed, i.e., the mean squared error between the two-dimensional and three-dimensional joint distance data of each group of point pairs, to obtain the second loss function.
In an alternative embodiment, the key point extraction network adopted in step S210 may also be a sub-network in the CNN. Referring to fig. 6, the CNN includes three sub-networks, namely, a key point extraction network, a three-dimensional reconstruction network, and an action recognition network. The video motion recognition process may include: extracting image frames from a video to be processed to form a video frame sequence, inputting the video frame sequence into the key point extraction network to obtain a two-dimensional coordinate sequence, inputting the two-dimensional coordinate sequence into a three-dimensional reconstruction network to obtain three-dimensional coordinate data, inputting the three-dimensional coordinate data into an action recognition network, and outputting a final action recognition result, wherein the two-dimensional coordinate sequence and the three-dimensional coordinate data both belong to intermediate data of the CNN.
Fig. 7 shows a schematic diagram of the CNN training and testing flow in this exemplary embodiment, which includes a training stage and a testing stage. The video data set is divided into a training set and a test set (e.g., in a 6:4 ratio). During training, a sample video from the training set is input into the CNN to obtain prediction data, i.e., the action recognition result of the sample video; the prediction data and the action label of the sample video are substituted into the first loss function, the second loss function is determined at the same time, and the parameters of the CNN are updated according to the first and second loss functions, that is, the three sub-networks are trained simultaneously, which is very efficient. In the testing stage, the test videos from the test set are input into the trained CNN and the corresponding action recognition results are output; the accuracy on the test set is computed, and if it reaches a preset standard, training is determined to be complete and a usable CNN is obtained.
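A simple sketch of the data-set split and test-set accuracy computation is shown below; the predict_fn callable is an assumed stand-in for running the trained CNN on one test video:

    import numpy as np

    def split_and_evaluate(videos, labels, predict_fn, train_ratio=0.6, seed=0):
        """Split the video data set (e.g. 6:4) and report top-1 accuracy on the test part."""
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(videos))
        cut = int(train_ratio * len(videos))
        train_idx, test_idx = order[:cut], order[cut:]
        correct = sum(int(predict_fn(videos[i]) == labels[i]) for i in test_idx)
        return train_idx, test_idx, correct / len(test_idx)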
Exemplary embodiments of the present disclosure also provide a video motion recognition apparatus. Referring to fig. 8, the video motion recognition apparatus 800 may include:
the key point detection module 810 is configured to detect a human body key point in an image frame of a video to be processed, and form a two-dimensional coordinate sequence of the human body key point according to position information of the human body key point in each image frame;
a three-dimensional reconstruction module 820, configured to perform three-dimensional reconstruction based on the two-dimensional coordinate sequence, and generate three-dimensional coordinate data of the human body key points;
and the action recognition module 830 is configured to perform action recognition processing on the three-dimensional coordinate data to obtain an action recognition result of the video to be processed.
In an alternative embodiment, the three-dimensional reconstruction module 820 is configured to:
and performing three-dimensional reconstruction on the two-dimensional coordinate sequence by using a pre-trained three-dimensional reconstruction network to generate three-dimensional coordinate data of the key points of the human body.
In an alternative embodiment, the three-dimensional reconstruction module 820 is configured to:
and extracting characteristic data in the neighborhood from the two-dimensional coordinate sequence by using a convolution kernel in the three-dimensional reconstruction network, and processing to obtain corresponding three-dimensional coordinate data.
In an alternative embodiment, the convolution kernel includes an expanded convolution kernel.
In an alternative embodiment, the action recognition module 830 is configured to:
and processing the three-dimensional coordinate data by using a pre-trained motion recognition network to obtain a motion recognition result of the video to be processed.
In an alternative embodiment, the three-dimensional reconstruction network and the motion recognition network are both sub-networks in the same convolutional neural network.
In an alternative embodiment, the video motion recognition apparatus 800 further includes a network training module configured to:
detecting human body key points in image frames of a sample video, and forming a two-dimensional coordinate sample sequence of the human body key points according to position information of the human body key points in each image frame of the sample video; the two-dimensional coordinate sample sequence is a two-dimensional coordinate sequence corresponding to the sample video;
inputting a two-dimensional coordinate sample sequence into a convolutional neural network to be trained, outputting a motion recognition result of a sample video, and extracting three-dimensional coordinate sample data of key points of a human body from an intermediate layer of the convolutional neural network; the three-dimensional coordinate sample data is three-dimensional coordinate data corresponding to the sample video;
determining a first loss function according to the action recognition result of the sample video and the action label of the sample video;
determining a second loss function according to the two-dimensional coordinate sample sequence and the three-dimensional coordinate sample data;
parameters of the convolutional neural network are updated with the first loss function and the second loss function.
In an optional implementation manner, the network training module is configured to:
obtaining two-dimensional joint distance data through distances between preset human body key points in the two-dimensional coordinate sample sequence;
obtaining three-dimensional joint distance data by presetting the distance between key points of the human body in three-dimensional coordinate sample data;
a second loss function is determined from the two-dimensional joint distance data and the three-dimensional joint distance data.
In an alternative embodiment, the keypoint detection module 810 is configured to:
and extracting the position information of the key points of the human body from each image frame of the video to be processed by utilizing a pre-trained key point extraction network to obtain a two-dimensional coordinate sequence of the key points of the human body. The key point extraction network is a sub-network of the convolutional neural network.
In an alternative embodiment, the keypoint detection module 810 is configured to:
and detecting human key points in each frame of image of the video to be processed.
In an alternative embodiment, the keypoint detection module 810 is configured to:
extracting key frame images from a video to be processed;
human key points are detected in the key frame images.
In an alternative embodiment, the keypoint detection module 810 is configured to:
when the human body key points are detected, the human body key points which cannot be detected are taken as the position information of the human body key points.
In an alternative embodiment, the three-dimensional coordinate data may be a three-dimensional coordinate sequence.
The specific details of each part in the above device have been described in detail in the method part embodiments, and thus are not described again.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product including program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device, for example, any one or more of the steps in fig. 2, fig. 4 or fig. 5 may be performed. The program product may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
An exemplary embodiment of the present disclosure further provides an electronic device, which may be the server or the terminal in the cloud. The electronic device is explained below with reference to fig. 9. It should be understood that the electronic device 900 shown in fig. 9 is only one example and should not bring any limitations to the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: at least one processing unit 910, at least one memory unit 920, and a bus 930 that couples various system components including the memory unit 920 and the processing unit 910.
Where the storage unit stores program code, which may be executed by the processing unit 910, to cause the processing unit 910 to perform the steps according to various exemplary embodiments of the present invention described in the above section "exemplary methods" of the present specification. For example, processing unit 910 may perform method steps, etc., as shown in fig. 2.
The storage unit 920 may include volatile memory units such as a random access memory unit (RAM)921 and/or a cache memory unit 922, and may further include a read only memory unit (ROM) 923.
Storage unit 920 may also include a program/utility 924 having a set (at least one) of program modules 925, such program modules 925 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The bus 930 may include a data bus, an address bus, and a control bus.
The electronic device 900 may also communicate with one or more external devices 1000 (e.g., keyboard, pointing device, bluetooth device, etc.), which may be through an input/output (I/O) interface 940. The electronic device 900 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through a network adapter 950. As shown, the network adapter 950 communicates with the other modules of the electronic device 900 over a bus 930. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the following claims.

Claims (16)

1. A video motion recognition method is characterized by comprising the following steps:
detecting human body key points in image frames of a video to be processed, and forming a two-dimensional coordinate sequence of the human body key points according to position information of the human body key points in each image frame;
performing three-dimensional reconstruction on the two-dimensional coordinate sequence by using a pre-trained three-dimensional reconstruction network to generate three-dimensional coordinate data of the human body key points;
processing the three-dimensional coordinate data by utilizing a pre-trained action recognition network to obtain an action recognition result of the video to be processed;
the three-dimensional reconstruction network and the action recognition network are both sub-networks in the same convolutional neural network; the convolutional neural network is trained by:
detecting human body key points in image frames of a sample video, and forming a two-dimensional coordinate sample sequence of the human body key points according to position information of the human body key points in each image frame of the sample video; the two-dimensional coordinate sample sequence is a two-dimensional coordinate sequence corresponding to the sample video;
inputting the two-dimensional coordinate sample sequence into the convolutional neural network to be trained, outputting a motion recognition result of the sample video, and extracting three-dimensional coordinate sample data of the human key points from an intermediate layer of the convolutional neural network; the three-dimensional coordinate sample data is three-dimensional coordinate data corresponding to the sample video;
determining a first loss function according to the action recognition result of the sample video and the action label of the sample video;
determining a second loss function according to the two-dimensional coordinate sample sequence and the three-dimensional coordinate sample data; the second loss function represents the deviation of the position relation information between different human key points in the two-dimensional coordinate sample sequence and the position relation information between different human key points in the three-dimensional coordinate sample data;
updating parameters of the convolutional neural network with the first loss function and the second loss function.
2. The method of claim 1, wherein the dimension of the two-dimensional coordinate sequence is L1 × J × 2 and the dimension of the three-dimensional coordinate data is L2 × J × 3, wherein L1 represents the number of image frames of the video to be processed, J represents the number of human body key points, L2 represents the number of channels of the three-dimensional coordinate data, and L2 is smaller than L1.
3. The method according to claim 1, wherein the three-dimensional reconstruction of the two-dimensional coordinate sequence using a pre-trained three-dimensional reconstruction network to generate three-dimensional coordinate data of the human body key points comprises:
and extracting feature data in a neighborhood from the two-dimensional coordinate sequence by using a convolution kernel in the three-dimensional reconstruction network, and processing to obtain the corresponding three-dimensional coordinate data.
4. The method of claim 3, wherein the convolution kernel comprises a dilated convolution kernel.
5. The method of claim 1, wherein in the convolutional neural network, a last layer of the three-dimensional reconstruction network connects a first layer of the motion recognition network.
6. The method of claim 1, wherein prior to detecting human keypoints in image frames of a sample video, the method further comprises:
normalizing image frames of the sample video to a preset size;
converting the action tag into a one-hot vector;
encapsulating the sequence of the image frames and the one-hot vector normalized to a preset size into a binary file.
7. The method of claim 1, wherein updating the parameters of the convolutional neural network with the first loss function and the second loss function comprises:
weighting the first loss function and the second loss function to obtain a comprehensive loss function;
and updating the parameters of the convolutional neural network by using the comprehensive loss function.
8. The method of claim 1, wherein said determining a second loss function from said sequence of two-dimensional coordinate samples and said three-dimensional coordinate sample data comprises:
obtaining two-dimensional joint distance data through the distance between preset human body key points in the two-dimensional coordinate sample sequence;
obtaining three-dimensional joint distance data according to the distance between the preset human body key points in the three-dimensional coordinate sample data;
determining the second loss function from the two-dimensional joint distance data and the three-dimensional joint distance data.
9. The method according to claim 1, wherein detecting the human body key points in image frames of the video to be processed and forming the two-dimensional coordinate sequence of the human body key points according to the position information of the human body key points in each image frame comprise:
extracting the position information of the human body key points from each image frame of the video to be processed by using a pre-trained key point extraction network, to obtain the two-dimensional coordinate sequence of the human body key points;
wherein the key point extraction network is a sub-network of the convolutional neural network.
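As an illustrative sketch of claim 9 only; the interface of the pre-trained key point extraction network is assumed (one image frame in, a (J, 2) coordinate tensor out).

    import torch

    def extract_2d_sequence(keypoint_net, frames):
        # Run the pre-trained key point extraction network on each image frame
        # of the video to be processed and stack the per-frame positions into
        # a two-dimensional coordinate sequence of shape (L1, J, 2).
        with torch.no_grad():
            per_frame = [keypoint_net(frame.unsqueeze(0)).squeeze(0) for frame in frames]
        return torch.stack(per_frame)
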
10. The method according to claim 1, wherein detecting the human body key points in image frames of the video to be processed comprises:
detecting the human body key points in every image frame of the video to be processed.
11. The method according to claim 1, wherein detecting the human body key points in image frames of the video to be processed comprises:
extracting key frame images from the video to be processed;
and detecting the human body key points in the key frame images.
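A minimal sketch of extracting key frame images as in claim 11, assuming OpenCV and a simple uniform-sampling strategy; the claim does not fix how key frames are chosen.

    import cv2

    def extract_key_frames(video_path, num_frames=16):
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        # Uniformly spaced frame indices stand in for a key-frame selection rule.
        for i in range(num_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        cap.release()
        return frames
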
12. The method according to any one of claims 1 to 11, wherein, when detecting the human body key points, the method further comprises:
using preset data as the position information of any human body key point that cannot be detected.
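An illustrative sketch of claim 12, where the preset position data is assumed to be a constant placeholder coordinate; the claim does not specify the preset value.

    import numpy as np

    def fill_missing_keypoints(detected, num_joints, preset=(0.0, 0.0)):
        # Start every key point at the preset position data, then overwrite the
        # entries for the human body key points that were actually detected.
        coords = np.tile(np.asarray(preset, dtype=np.float32), (num_joints, 1))
        for joint_index, xy in detected.items():
            coords[joint_index] = xy
        return coords
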
13. The method of any one of claims 1 to 11, wherein the three-dimensional coordinate data comprises a sequence of three-dimensional coordinates.
14. A video motion recognition apparatus, comprising:
a key point detection module, configured to detect human body key points in image frames of a video to be processed and to form a two-dimensional coordinate sequence of the human body key points according to position information of the human body key points in each image frame;
a three-dimensional reconstruction module, configured to perform three-dimensional reconstruction on the two-dimensional coordinate sequence by using a pre-trained three-dimensional reconstruction network to generate three-dimensional coordinate data of the human body key points; and
an action recognition module, configured to process the three-dimensional coordinate data by using a pre-trained action recognition network to obtain an action recognition result of the video to be processed;
wherein the three-dimensional reconstruction network and the action recognition network are both sub-networks of the same convolutional neural network; and the video motion recognition apparatus further comprises a network training module configured to perform:
detecting human body key points in image frames of a sample video, and forming a two-dimensional coordinate sample sequence of the human body key points according to position information of the human body key points in each image frame of the sample video; the two-dimensional coordinate sample sequence is a two-dimensional coordinate sequence corresponding to the sample video;
inputting the two-dimensional coordinate sample sequence into the convolutional neural network to be trained to output an action recognition result of the sample video, and extracting three-dimensional coordinate sample data of the human body key points from an intermediate layer of the convolutional neural network; the three-dimensional coordinate sample data is the three-dimensional coordinate data corresponding to the sample video;
determining a first loss function according to the action recognition result of the sample video and the action label of the sample video;
determining a second loss function according to the two-dimensional coordinate sample sequence and the three-dimensional coordinate sample data; wherein the second loss function represents a deviation between positional relation information among different human body key points in the two-dimensional coordinate sample sequence and positional relation information among different human body key points in the three-dimensional coordinate sample data;
updating parameters of the convolutional neural network with the first loss function and the second loss function.
15. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 13.
16. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1 to 13 via execution of the executable instructions.
CN202011055889.5A 2020-09-29 2020-09-29 Video motion recognition method and device, storage medium and electronic equipment Active CN112200041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011055889.5A CN112200041B (en) 2020-09-29 2020-09-29 Video motion recognition method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112200041A CN112200041A (en) 2021-01-08
CN112200041B (en) 2022-08-02

Family

ID=74008134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011055889.5A Active CN112200041B (en) 2020-09-29 2020-09-29 Video motion recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112200041B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486763A (en) * 2021-06-30 2021-10-08 上海商汤临港智能科技有限公司 Method, device, equipment and medium for identifying personnel conflict behaviors in vehicle cabin
CN113724393B (en) * 2021-08-12 2024-03-19 北京达佳互联信息技术有限公司 Three-dimensional reconstruction method, device, equipment and storage medium
CN113657301A (en) * 2021-08-20 2021-11-16 北京百度网讯科技有限公司 Action type identification method and device based on video stream and wearable device
CN115331309A (en) * 2022-08-19 2022-11-11 北京字跳网络技术有限公司 Method, apparatus, device and medium for recognizing human body action
CN117216313A (en) * 2023-09-13 2023-12-12 中关村科学城城市大脑股份有限公司 Attitude evaluation audio output method, attitude evaluation audio output device, electronic equipment and readable medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10289934B2 (en) * 2016-11-08 2019-05-14 Nec Corporation Landmark localization on objects in images using convolutional neural networks
CN108985259B (en) * 2018-08-03 2022-03-18 百度在线网络技术(北京)有限公司 Human body action recognition method and device
CN111126272B (en) * 2019-12-24 2020-11-10 腾讯科技(深圳)有限公司 Posture acquisition method, and training method and device of key point coordinate positioning model

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460338A (en) * 2018-02-02 2018-08-28 北京市商汤科技开发有限公司 Estimation method of human posture and device, electronic equipment, storage medium, program
KR102081854B1 (en) * 2019-08-01 2020-02-26 전자부품연구원 Method and apparatus for sign language or gesture recognition using 3D EDM
CN110544301A (en) * 2019-09-06 2019-12-06 广东工业大学 Three-dimensional human body action reconstruction system, method and action training system
CN110544302A (en) * 2019-09-06 2019-12-06 广东工业大学 Human body action reconstruction system and method based on multi-view vision and action training system
CN110874865A (en) * 2019-11-14 2020-03-10 腾讯科技(深圳)有限公司 Three-dimensional skeleton generation method and computer equipment
CN111079570A (en) * 2019-11-29 2020-04-28 北京奇艺世纪科技有限公司 Human body key point identification method and device and electronic equipment
CN111488824A (en) * 2020-04-09 2020-08-04 北京百度网讯科技有限公司 Motion prompting method and device, electronic equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"The Menpo Benchmark for Multi-pose 2D and 3D Facial Landmark Localisation and Tracking"; Jiankang Deng et al.; International Journal of Computer Vision; 2018-11-29; Vol. 127; pp. 599-624 *
"A CNN-based 3D Human Pose Estimation Method" (基于CNN的三维人体姿态估计方法); 肖澳文 et al.; Journal of Wuhan Institute of Technology (武汉工程大学学报); 2019-04-15; Vol. 41, No. 2; pp. 168-172 *
"Research and Implementation of a Human Action Recognition Algorithm Based on Skeleton Localization" (基于骨骼定位的人体动作识别算法研究与实现); 郝黎; China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑); 2017-03-15 (No. 3); p. I138-5309 *
"Research and Design of an Intelligent Upper-Limb Rehabilitation Training System" (智能上肢康复训练系统研究与设计); 王怀宇; China Master's Theses Full-text Database, Medicine and Health Sciences (中国优秀硕士学位论文全文数据库 医药卫生科技辑); 2020-06-15 (No. 6); p. E060-396 *

Also Published As

Publication number Publication date
CN112200041A (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN112200041B (en) Video motion recognition method and device, storage medium and electronic equipment
CN110599492B (en) Training method and device for image segmentation model, electronic equipment and storage medium
Chen et al. An edge traffic flow detection scheme based on deep learning in an intelligent transportation system
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN111050219A (en) Spatio-temporal memory network for locating target objects in video content
Ren et al. Deep Robust Single Image Depth Estimation Neural Network Using Scene Understanding.
US20230334890A1 (en) Pedestrian re-identification method and device
CN113095346A (en) Data labeling method and data labeling device
CN112487207A (en) Image multi-label classification method and device, computer equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN111666922A (en) Video matching method and device, computer equipment and storage medium
EP3796189A1 (en) Video retrieval method, and method and apparatus for generating video retrieval mapping relationship
CN114550070A (en) Video clip identification method, device, equipment and storage medium
CN112950640A (en) Video portrait segmentation method and device, electronic equipment and storage medium
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN113822254B (en) Model training method and related device
CN113033507B (en) Scene recognition method and device, computer equipment and storage medium
CN112149602B (en) Action counting method and device, electronic equipment and storage medium
CN115063713A (en) Training method of video generation model, video generation method and device, electronic equipment and readable storage medium
CN112949777B (en) Similar image determining method and device, electronic equipment and storage medium
CN114943937A (en) Pedestrian re-identification method and device, storage medium and electronic equipment
CN111488476B (en) Image pushing method, model training method and corresponding devices
CN115223245A (en) Method, system, equipment and storage medium for detecting and clustering behavior of tourists in scenic spot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant