CN113792712A - Action recognition method, device, equipment and storage medium


Info

Publication number
CN113792712A
CN113792712A (application number CN202111346086.XA)
Authority
CN
China
Prior art keywords
human body
body joint
frame image
video frame
space
Prior art date
Legal status
Pending
Application number
CN202111346086.XA
Other languages
Chinese (zh)
Inventor
闾凡兵
吴蕊
曹达
秦拯
Current Assignee
Changsha Hisense Intelligent System Research Institute Co ltd
Original Assignee
Changsha Hisense Intelligent System Research Institute Co ltd
Priority date
Filing date
Publication date
Application filed by Changsha Hisense Intelligent System Research Institute Co ltd filed Critical Changsha Hisense Intelligent System Research Institute Co ltd
Priority claimed from CN202111346086.XA
Publication of CN113792712A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The application discloses a method, a device, equipment and a storage medium for motion recognition, and relates to the technical field of computer vision. The method comprises the following steps: under the condition that the first video comprises two character objects, identifying human body joint points of the two character objects in a video frame image of the first video to obtain human body joint point position information of each character object in the two character objects in the video frame image; coding the position information of the human body joint point of each character object in the video frame image to obtain the space characteristic vector of the human body joint point of each character object in the video frame image; fusing the spatial feature vector of the human body joint point of each character object in the video frame image with the time sequence feature corresponding to the position information of the human body joint point to obtain the space-time fusion feature of the human body joint points of the two character objects; and inputting the space-time fusion characteristics into a classifier, and obtaining the interaction action corresponding to the space-time fusion characteristics through the classifier.

Description

Action recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a method, an apparatus, a device, and a storage medium for motion recognition.
Background
Human action recognition has a high application value, and for example, in man-machine interaction scenes such as smart homes and 3D games, there is a demand for human action recognition.
At present, motion recognition based on human body joint postures targets single-person action recognition, and there is no technical scheme for double-person interactive action recognition. For single-person action recognition, only the spatial features of the joint posture are recognized. However, it cannot be determined from the change of each person's spatial features alone whether two persons act simultaneously and thus produce a double-person interaction.
Disclosure of Invention
The embodiments of the application aim to provide a motion recognition method, a motion recognition device, motion recognition equipment and a storage medium, which can solve the technical problem that double-person interactive actions cannot be recognized in the prior art.
The technical scheme of the application is as follows:
in a first aspect, a motion recognition method is provided, including:
under the condition that the first video comprises two character objects, identifying human body joint points of the two character objects in a video frame image of the first video to obtain human body joint point position information of each character object in the two character objects in the video frame image;
coding the position information of the human body joint points of each character object in the video frame image to obtain the space characteristic vector of the human body joint points of each character object in the video frame image, wherein the space characteristic vector is used for representing the change characteristics of the motion of the character object in the space;
fusing the space characteristic vector of the human body joint point of each character object in the video frame image with the time sequence characteristic corresponding to the position information of the human body joint point to obtain the space-time fusion characteristic of the human body joint points of the two character objects, wherein the space-time fusion characteristic is used for representing the change characteristic of the space interaction action of the two character objects on the time sequence;
and inputting the space-time fusion characteristics into a classifier, and obtaining an interactive action corresponding to the space-time fusion characteristics through the classifier, wherein the interactive action is a matched action of two character objects.
In some embodiments, prior to identifying the human joint points of the two human objects in the video frame images of the first video, the method further comprises:
inputting the first video into a target detection model, and detecting a character object in a video frame image of the first video through the target detection model to obtain the character object in the first video;
under the condition that the first video comprises two character objects, outputting a region to be detected in the video frame image, wherein the region to be detected is a region where the two character objects are located;
identifying human body joint points of two character objects in a video frame image of a first video specifically comprises:
and identifying human body joint points of two human body objects in the area to be detected.
In some embodiments, identifying human joint points of two human objects in the region to be detected comprises:
and identifying human body joint points of two character objects in the area to be detected according to the posture estimation model.
In some embodiments, before fusing the spatial feature vector of the human body joint point of each human body object in the video frame image with the time-series feature corresponding to the position information of the human body joint point, the method further includes:
inputting the spatial feature vector of the human body joint point of each character object in the video frame image into a graph convolution network trained in advance, and modeling in a spatial position based on the spatial feature vector of the human body joint point of each character object in the video frame image through the graph convolution network to obtain a target spatial feature;
fusing the space characteristic vector of the human body joint point of each person object in the video frame image with the time sequence characteristic corresponding to the position information of the human body joint point, comprising the following steps:
and fusing the target space characteristics and the time sequence characteristics corresponding to the position information of the human body joint points.
In some embodiments, the graph convolution network models spatial positions of human joint points in the video frame image based on spatial feature vectors of each human object, including:
the graph convolution network constructs a spatial relationship graph based on the spatial feature vectors of human body joint points of each character object in the video frame image, wherein the nodes of the spatial relationship graph are human body joint points, and connecting lines among the human body joint points are edges of the spatial relationship graph.
In some embodiments, the fusing the spatial feature vector of the human body joint point of each person object in the video frame image with the time sequence feature corresponding to the position information of the human body joint point to obtain the time-space fusion feature of the human body joint points of the two person objects includes:
and sequentially inputting the space feature vectors of the human body joint points in the video frame images into a pre-trained recurrent neural network model based on the time sequence features corresponding to the position information of the human body joint points to obtain the space-time fusion features of the human body joint points of the two human body objects.
In some embodiments, before fusing the spatial feature vector of the human body joint point of each human body object in the video frame image with the time-series feature corresponding to the position information of the human body joint point, the method further includes:
and based on the human body joint points, applying a time cycle neural network to obtain time sequence characteristics corresponding to the position information of the human body joint points in the video frame image of the first video.
In some embodiments, the interaction comprises at least one of:
fanning, kicking, pushing, patting on the shoulder, pointing with a finger, hugging, giving something, touching a pocket, shaking hands, walking toward each other, walking away from each other.
In some embodiments, where the interaction is a handshake, the spatial feature vector of each person object represents that the person object has an outstretched motion in space;
correspondingly, the space-time fusion feature is used for representing the hand stretching actions of the two character objects, the two character objects are overlapped in time, and the palms of the two character objects are contacted in space;
obtaining the interactive action corresponding to the space-time fusion characteristic through a classifier, comprising:
determining whether the stretching actions of the two character objects are overlapped in time, whether palms of the two character objects are contacted in space and whether the contact time is greater than a preset threshold value through a classifier;
and under the conditions that the hand stretching actions of the two character objects are overlapped in time, contact exists in space, and the contact time is greater than a preset threshold value, determining that the interaction action of the two character objects in the first video is handshake.
In a second aspect, there is provided a motion recognition apparatus comprising:
the identification module is used for identifying human body joint points of two character objects in a video frame image of the first video under the condition that the first video is detected to comprise the two character objects, and obtaining the human body joint point position information of each character object in the two character objects in the video frame image;
the encoding module is used for encoding the position information of the human body joint points of each character object in the video frame image to obtain the spatial feature vector of the human body joint points of each character object in the video frame image, and the spatial feature vector is used for representing the change features of the actions of the character objects in the space;
the fusion module is used for fusing the spatial feature vector of the human body joint point of each character object in the video frame image with the time sequence feature corresponding to the position information of the human body joint point to obtain the space-time fusion feature of the human body joint points of the two character objects, and the space-time fusion feature is used for representing the change feature of the space interaction action of the two character objects on the time sequence;
and the classification module is used for inputting the space-time fusion characteristics into the classifier, and obtaining the interaction action corresponding to the space-time fusion characteristics through the classifier, wherein the interaction action is the matching action of two character objects.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
the motion identification method provided by the embodiment of the application identifies human body joint points of two character objects in a video frame image of a first video to obtain human body joint point position information of each character object in the two character objects in the video frame image, and codes the human body joint point position information to obtain a spatial feature vector of the human body joint point; fusing the space characteristic vector of the human body joint point and the time sequence characteristic corresponding to the position information of the human body joint point to obtain the space-time fusion characteristic of two character objects; and inputting the space-time fusion characteristics into a classifier, and determining the interaction actions corresponding to the space-time fusion characteristics of the two character objects. Because the space-time fusion characteristics of the two character objects comprise the space characteristics and the time sequence characteristics, whether the change occurs between the two people or not can be judged when the space characteristics change, and then the interaction action of the two people is identified and obtained by combining the interaction action preset in the classifier, so that the identification of the double actions is realized.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the application and are not to be construed as limiting the application.
FIG. 1 is a schematic view of a joint of a human body according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a hand joint provided by an embodiment of the present application;
fig. 3 is a schematic flowchart of a method for recognizing an action according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another motion recognition method provided in the embodiment of the present application;
FIG. 5 is a schematic diagram of a target detection result provided in an embodiment of the present application;
FIG. 6 is a diagram illustrating the detection result of a joint point according to an embodiment of the present application;
FIG. 7 is a diagram of a result of motion recognition provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a spatial location profile of a double action provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a motion recognition device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are intended to be illustrative only and are not intended to be limiting. It will be apparent to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present application by illustrating examples thereof.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples consistent with certain aspects of the present application, as detailed in the appended claims.
Based on the background technology, the motion recognition based on the human body joint gesture in the prior art is aimed at the motion recognition of a single person, and no technical scheme aiming at the double-person interactive motion recognition exists.
The object of human behavior recognition is to automatically analyze an ongoing behavior in an unknown video, and the behavior is composed of actions, so that human behavior recognition first accurately recognizes the actions. For the recognition of motion in a video, generally, given a piece of video, features are extracted and then correctly classified into several known motion classes based on the features.
Traditional video motion recognition based on optical flow and video flow has been gradually replaced by video motion recognition based on human body gestures. Compared with the motion recognition based on optical flow and video flow, the video motion recognition based on the human body gesture has the following advantages:
First, human skeleton data is lightweight. Second, it is robust to changes in viewpoint and appearance and to interference from the environment. Third, spatial information and adjacent joints are strongly correlated, so rich human body structure information can be obtained within a frame and temporally related information can be exploited between frames. However, current motion recognition based on human body postures targets single-person action recognition and does not consider double-person action recognition.
The inventor finds that the double-person motion recognition is more complex than the single-person motion recognition, and the difference between the double-person motion recognition and the single-person motion recognition is mainly expressed in the following two aspects:
First, spatial integrity: a double-person interactive action is a whole in space and requires the cooperation of two persons, for example shaking hands, hugging, and the like.
Second, temporal integrity: double-person interactive actions are also continuous in time, for example walking toward each other or walking away from each other. Unlike single-person actions, double-person interactive actions are more holistic and must occur simultaneously. Therefore, double-person interactive action recognition based on human body posture must consider the integrity of the two-person action in both space and time.
Based on the above findings, embodiments of the present application provide a motion recognition method, apparatus, device, and storage medium, which identify human joint points of two character objects in a video frame image of a first video, obtain human joint point position information of each of the two character objects in the video frame image, and encode the human joint point position information to obtain spatial feature vectors of the human joint points; fusing the space characteristic vector of the human body joint point and the time sequence characteristic corresponding to the position information of the human body joint point to obtain the space-time fusion characteristic of two character objects; and inputting the space-time fusion characteristics into a classifier, and determining the interaction actions corresponding to the space-time fusion characteristics of the two character objects. Because the space-time fusion characteristics of the two character objects comprise the space characteristics and the time sequence characteristics, whether the change occurs between the two people or not can be judged when the space characteristics change, and then the interaction action of the two people is identified and obtained by combining the interaction action preset in the classifier, so that the identification of the double actions is realized.
Before describing specific embodiments of the present application, related technical terms in the embodiments of the present application will be described.
The interactive action can be a double-person interactive action. The double-person interaction may specifically comprise 11 types of daily double-person interactions, namely fanning, kicking, pushing, patting on the shoulder, pointing with a finger, hugging, giving something, touching a pocket, shaking hands, walking toward each other, and walking away from each other. Among these, some actions necessarily involve contact between the two persons, such as hugging and shaking hands, while other interactive actions do not require contact, such as pointing with a finger or walking toward or away from each other.
The human body joint points can be joint points of a human body skeleton, and the joint points of the human body skeleton can comprise a head center skeleton point, shoulder joint points, elbow joint points, hand joint points, a spine point, crotch joint points, knee joint points, foot joint points, and the like. As shown in FIG. 1, the human body joint points may include the 15 joint points A-O, and those skilled in the art can increase or decrease the human body joint points as appropriate according to actual needs. For example, for the hand joint point E in FIG. 1, one skilled in the art can extend it to the joint point diagram shown in FIG. 2 according to actual needs. That is, the hand joint point E may be expanded to the 21 joint points shown as E01-E21 in FIG. 2.
In addition, as shown in fig. 2, the human joint point may include a key point that is not a human joint, for example, key points at the ends of fingers such as E05, E09, E13, E17, and E21 in fig. 2.
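As an illustrative sketch, the joint-point layout described above can be represented as a simple index list; the concrete labels below (A-O, E01-E21) are placeholders that mirror the figure captions, since the figures themselves are not reproduced here.

```python
# A minimal sketch of the joint-point layout, assuming the 15 body points A-O of
# Fig. 1 and the 21 hand points E01-E21 of Fig. 2; the concrete labels are
# placeholders taken from the figure captions, not a normative definition.
BODY_JOINTS = [chr(c) for c in range(ord("A"), ord("O") + 1)]   # 15 points A..O
HAND_JOINTS = [f"E{i:02d}" for i in range(1, 22)]               # 21 points E01..E21

# Expand the single hand joint point "E" into the 21-point hand skeleton.
EXTENDED_JOINTS = [j for j in BODY_JOINTS if j != "E"] + HAND_JOINTS
print(len(BODY_JOINTS), len(EXTENDED_JOINTS))                   # 15 35
```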
It should be understood that the motion recognition method provided by the embodiment of the present application is generally applied to a server or a terminal device having a data processing capability. The server may specifically be an application server or a Web server, and when specifically deployed, the server may be an independent server or a cluster server. The terminal device may be an electronic device with data processing capability, such as a computer, a mobile phone, and a tablet.
The following describes a method, an apparatus, a device, and a storage medium for recognizing an action according to an embodiment of the present application in detail with reference to the accompanying drawings.
Fig. 3 is a schematic flowchart illustrating a method for motion recognition according to an embodiment of the present application, where as shown in fig. 3, the method includes:
s310, under the condition that the first video comprises two character objects, identifying human body joint points of the two character objects in a video frame image of the first video to obtain human body joint point position information of each character object in the two character objects in the video frame image;
s320, coding the position information of the human body joint point of each person object in the video frame image to obtain the space characteristic vector of the human body joint point of each person object in the video frame image;
s330, fusing the spatial feature vector of the human body joint point of each person object in the video frame image with the time sequence feature corresponding to the position information of the human body joint point to obtain the space-time fusion feature of the human body joint points of the two person objects;
and S340, inputting the space-time fusion characteristics into a classifier, and obtaining the interaction action corresponding to the space-time fusion characteristics through the classifier.
The above steps are described in detail below, specifically as follows:
in the above step S310, the human joint point position information of each of the two human objects in the video frame image is identified and obtained. The position information here may be the relative position of the respective human joint points of each human object itself, and the relative position between two human objects.
The relative position of each human body joint point of each character object can be used to judge the individual action of that character object; the relative position between the two character objects may be used to assist in determining an interaction between them, such as the walking toward or away from each other mentioned above.
There may be multiple encoding modes in S320, each of which can yield the spatial feature vector of the human body joint points of each character object in the video frame image after encoding. For example, the human body joint points can be encoded using a common embedding method.
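As an illustrative sketch of the encoding in S320, the following minimal example assumes a single learned linear embedding with an output dimension of 64; neither the architecture nor the dimension is fixed by this embodiment.

```python
import torch
import torch.nn as nn

class JointEncoder(nn.Module):
    """Encode per-joint position information into a d-dimensional spatial
    feature vector. A sketch only: a single learned linear projection is
    assumed here, since the embedding architecture is not fixed."""
    def __init__(self, coord_dim: int = 3, feat_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(coord_dim, feat_dim)

    def forward(self, joints: torch.Tensor) -> torch.Tensor:
        # joints: (num_persons, num_joints, coord_dim) for one video frame
        return torch.relu(self.proj(joints))   # (num_persons, num_joints, feat_dim)

encoder = JointEncoder()
frame_joints = torch.rand(2, 15, 3)            # two persons, 15 joints, (x, y, z)
spatial_features = encoder(frame_joints)       # (2, 15, 64)
```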
In the above S330, the spatial feature vector of the human body joint points and the time sequence feature corresponding to the position information of the human body joint points are fused, so that the space-time fusion feature of the human body joint points of the two character objects can be obtained. The space-time fusion feature comprises both the spatial feature of the human body joint points and their time sequence feature. Thus, the classifier may determine the interaction of the two character objects based on the spatial features and the temporal features in S340.
Continuing the earlier example, the spatial characteristics of the two character objects show walking motions of the two character objects in the video; after combining the time sequence characteristics, it can be seen that the two character objects are walking within the same time period, and after the walking the spatial characteristics also show that the distance between the two character objects has become smaller, so in S340 it can be determined that the interaction between the two character objects is walking toward each other.
In addition, to make the spatial characteristics of the character objects in the first video easier to acquire, the first video may be a video captured with a depth camera.
A depth camera is also called a 3D camera. A picture (2D image) taken by an ordinary camera records all objects within the camera's view, but the recorded data does not contain the distance of the objects from the camera. With the data acquired by a depth camera, the distance between each point in the image and the camera can be accurately known; combined with the (x, y) coordinate of each pixel in the 2D image, the three-dimensional space coordinate of each pixel point can therefore be obtained.
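As an illustrative sketch of how the three-dimensional space coordinate of a pixel can be recovered from the depth value and the (x, y) pixel coordinate, the following example assumes a standard pinhole camera model; the intrinsic parameters fx, fy, cx, cy are placeholders, not calibration data from this application.

```python
def pixel_to_camera_xyz(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with its depth value to 3D camera
    coordinates under a standard pinhole model. fx, fy, cx, cy are the depth
    camera intrinsics; the values below are illustrative placeholders."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth

print(pixel_to_camera_xyz(640, 360, 2.5, fx=910.0, fy=910.0, cx=640.0, cy=360.0))
```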
The classifier in S340 may be any classifier capable of determining its corresponding interaction based on the spatio-temporal fusion features, and is not limited herein.
As one example, the classifier in the foregoing may be a softmax classifier. The classifier can determine the interactive action corresponding to the space-time fusion feature in a plurality of preset groups of interactive actions based on the space-time fusion feature.
Wherein, the preset multiple groups of interaction actions may include at least one of the following: fanning, kicking, pushing, patting on the shoulder, pointing with a finger, hugging, giving something, touching a pocket, shaking hands, walking toward each other, walking away from each other.
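As an illustrative sketch of such a classifier, the following example assumes a softmax head over a 128-dimensional space-time fusion feature; the feature dimension is an assumption rather than a value fixed by this embodiment, and the class names simply mirror the list above.

```python
import torch
import torch.nn as nn

INTERACTIONS = [
    "fanning", "kicking", "pushing", "patting on the shoulder",
    "pointing with a finger", "hugging", "giving something", "touching a pocket",
    "shaking hands", "walking toward each other", "walking away from each other",
]

class InteractionClassifier(nn.Module):
    """Softmax head over the preset interaction classes (a sketch; the
    128-dimensional fusion feature is an assumption)."""
    def __init__(self, feat_dim: int = 128, num_classes: int = len(INTERACTIONS)):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, fused_feature: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.fc(fused_feature), dim=-1)

probs = InteractionClassifier()(torch.rand(1, 128))
print(INTERACTIONS[int(probs.argmax(dim=-1))])
```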
The motion identification method provided by the embodiment of the application identifies human body joint points of two character objects in a video frame image of a first video to obtain human body joint point position information of each character object in the two character objects in the video frame image, and codes the human body joint point position information to obtain a spatial feature vector of the human body joint point; fusing the space characteristic vector of the human body joint point and the time sequence characteristic corresponding to the position information of the human body joint point to obtain the space-time fusion characteristic of two character objects; and inputting the space-time fusion characteristics into a classifier, and determining the interaction actions corresponding to the space-time fusion characteristics of the two character objects. Because the space-time fusion characteristics of the two character objects comprise the space characteristics and the time sequence characteristics, whether the change occurs between the two people or not can be judged when the space characteristics change, and then the interaction action of the two people is identified and obtained by combining the interaction action preset in the classifier, so that the identification of the double actions is realized.
In some embodiments, in order to improve the accuracy of identifying the human body joint point of the human body object, as shown in fig. 4, on the basis of the above embodiments, before S310, the motion identification method may further include:
s410, before human body joint points of two character objects in a video frame image of a first video are identified, the first video is input into a target detection model, the character objects in the video frame image of the first video are detected through the target detection model, and the character objects in the first video are obtained;
and S420, outputting a region to be detected in the video frame image under the condition that the first video comprises two human objects, wherein the region to be detected is a region where the two human objects are located.
Accordingly, on this basis, S310 in the above embodiment may specifically be that, in a case that it is detected that two human objects are included in the first video, human joint points of the two human objects in the region to be detected are identified, so as to obtain human joint point position information of each of the two human objects in the video frame image.
The target detection model in S410 may be a detection model capable of detecting a position of the target object in the video. For example, the target detection model may be a YOLO model.
The YOLO model is used for object detection, i.e. detecting an object in an image and identifying its position in the image; the YOLO model is faster than neural networks based on region proposals. Here, the YOLO model may detect the character objects in the video frame image and identify their positions in the image, so as to obtain the region to be detected in the video frame image.
The detection result of the YOLO model may be as shown in fig. 5, where a detection box enclosing the two character objects is output.
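As an illustrative sketch of how the detector output can be turned into the region to be detected, the following example starts from a hypothetical list of person boxes produced by a YOLO-style detector for one frame; the detector call itself is not shown and no specific YOLO API is assumed.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

def region_to_detect(person_boxes: List[Box]) -> Optional[Box]:
    """Return the region where the two character objects are located, or None.

    person_boxes is assumed to be the person-class output of a YOLO-style
    detector for one video frame; only the two-person case is handled,
    matching the method's precondition."""
    if len(person_boxes) != 2:
        return None                        # not a two-person frame
    (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = person_boxes
    # Union of the two person boxes = region to be detected.
    return (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2))

print(region_to_detect([(10, 20, 110, 220), (150, 30, 260, 230)]))
```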
In the embodiment of the application, the target detection model is adopted to detect the first video to obtain the region to be detected, and then only the human joint points are identified aiming at the region to be detected, so that the efficiency of identifying the human joint points of the person object can be improved, the region to be detected is identified, the number of interferents is less, and the accuracy of identification can also be improved.
In some embodiments, in order to improve efficiency of identifying human joint points, identifying human joint points of two human objects in a region to be detected may include:
and identifying human body joint points of two character objects in the area to be detected according to the posture estimation model.
The pose estimation model, which can map the human body pixels in RGB images and videos to the three-dimensional surface of the limb, involves many computer vision tasks such as object detection, pose estimation, segmentation, etc.
Here, the pose estimation model may be of many kinds, such as DensePose, OpenPose, Realtime Multi-Person Pose Estimation, AlphaPose, Human Body Pose Estimation, DeepPose, and the like.
As one example, the pose estimation model here may be the AlphaPose model. The AlphaPose model may recognize the human joint point sequence of each character object in the video frame image, and the recognition result may be as shown in fig. 6. Based on this, the subsequent encoding may be performed on the human joint point sequence.
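As an illustrative sketch of the human joint point sequence produced by pose estimation, the following example uses a hypothetical estimate_pose placeholder (it is not the AlphaPose API) and simply stacks the per-frame joint detections of the two character objects.

```python
import numpy as np

def estimate_pose(frame_region) -> np.ndarray:
    """Hypothetical stand-in for a pose-estimation call (e.g. a wrapper around
    a pose model): returns a (2, num_joints, 2) array of (x, y) joint
    positions for the two character objects in one frame region."""
    return np.random.rand(2, 15, 2)

def joint_sequence(frame_regions) -> np.ndarray:
    """Stack per-frame joint detections into the human joint point sequence
    (num_frames, 2 persons, num_joints, 2) consumed by the encoder."""
    return np.stack([estimate_pose(r) for r in frame_regions], axis=0)

seq = joint_sequence([None] * 30)   # 30 frames of a two-person clip
print(seq.shape)                    # (30, 2, 15, 2)
```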
The final detection result can be as shown in fig. 7, which can include the detection box output by the YOLO model, with the skeleton of each person displayed inside the box; above the box, an action label of the person can be displayed, followed by a confidence level and a tracking id, as indicated by the numbers in the figure.
In the embodiment of the application, the posture estimation model is adopted to automatically identify the human body joint points of the person object, so that the identification efficiency can be further improved, and the processing speed of motion identification is further improved.
In some embodiments, to improve the accuracy of motion recognition, on the basis of the foregoing embodiments, before S330, the motion recognition method may further include:
inputting the spatial feature vector of the human body joint point of each character object in the video frame image into a graph convolution network trained in advance, and modeling in a spatial position based on the spatial feature vector of the human body joint point of each character object in the video frame image through the graph convolution network to obtain a target spatial feature;
based on this, correspondingly fusing the spatial feature vector of the human body joint point of each human body object in the video frame image and the time sequence feature corresponding to the position information of the human body joint point in S330 may include:
and fusing the target space characteristics and the time sequence characteristics corresponding to the position information of the human body joint points.
In some embodiments, in order to further improve the accuracy of motion recognition, the graph convolution network models the spatial position of the human joint point in the video frame image based on the spatial feature vector of each human object, and the model comprises:
inputting the spatial feature vector of the human body joint point of each character object in the video frame image into a graph convolution network trained in advance, constructing a spatial relationship graph based on the spatial feature vector of the human body joint point of each character object in the video frame image through the graph convolution network to obtain target spatial features, wherein the nodes of the spatial relationship graph are human body joint points, and the connecting lines between the human body joint points are edges of the spatial relationship graph.
The graph convolution network may be a graph convolution neural network trained by a graph convolution network in the prior art.
As an example, in order to learn the spatial position relationship of the two persons from the above human joint point sequence (coordinate sequence), a relationship graph may be defined over the set of human joint point objects and then used to update the object features. Specifically, the joint points of the two persons are taken as the nodes of the spatial graph and the connecting lines between the joints as the edges of the spatial graph, so that the nodes carry relative spatial position relationships; a simple spatial position relationship graph is shown in fig. 8. Given the s joint points of each person in a video frame, each joint point is taken as a node of the graph, R_i^k ∈ R^(s×d) denotes the d-dimensional features of the joint points, where i = 1, 2, ..., L and k ∈ {1, 2} distinguishes the joint point features of the k-th person object, and A ∈ R^(s×s) denotes the correlation coefficient matrix of the s joint points. The correlation matrix is computed from the joint point features:

[Equations (1)-(2), rendered as images in the original: computation of the correlation matrix A from the joint point features]

where W_i, W_j ∈ R^(d×d') and b_i ∈ R^d, b_j ∈ R^d are all learnable parameters. A is then normalized so that the edges connecting the same node sum to 1:

[Equation (3), rendered as an image in the original: normalization of A]

When k1 = 1 and k2 = 2, the result represents the correlation between the joint points of person k1 and the joint points of person k2 and is denoted A'; when k1 = 1 and k2 = 1, it represents the spatial correlation among the joint points of person 1 and is denoted A_1'; when k1 = 2 and k2 = 2, it represents the spatial correlation among the joint points of person 2 and is denoted A_2'. A' is therefore the relative position relationship between the two persons, while A_1' and A_2' are the relative spatial position relationships among the joints of each person.
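As an illustrative sketch of the correlation matrix described above (the exact equations are rendered as images in the original), the following example assumes a bilinear form over two learned linear maps (W_i, b_i) and (W_j, b_j) and uses a softmax as one possible way to make the edges connected to the same node sum to 1; neither choice is fixed by this embodiment.

```python
import torch
import torch.nn as nn

class JointRelation(nn.Module):
    """Sketch of the correlation matrix A between s joint features of dim d.
    The original formula is an image; a bilinear form over two learned linear
    maps (W_i, b_i) and (W_j, b_j) is assumed here."""
    def __init__(self, d: int, d_prime: int):
        super().__init__()
        self.wi = nn.Linear(d, d_prime)   # W_i, b_i
        self.wj = nn.Linear(d, d_prime)   # W_j, b_j

    def forward(self, r1: torch.Tensor, r2: torch.Tensor) -> torch.Tensor:
        # r1, r2: (s, d) joint features of persons k1 and k2
        a = self.wi(r1) @ self.wj(r2).t()   # (s, s) raw correlations
        # Softmax makes each row sum to 1; the original normalization is an image.
        return torch.softmax(a, dim=-1)

rel = JointRelation(d=64, d_prime=64)
r1, r2 = torch.rand(15, 64), torch.rand(15, 64)
A_cross = rel(r1, r2)   # A': person-1 joints vs person-2 joints
A_self1 = rel(r1, r1)   # A_1': within person 1
```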
A graph convolution network (GCN) is then used to perform relational inference, updating the original object features R^(k1), R^(k2) to R_1', R_2':

[Equation, rendered as an image in the original: the GCN relational update]

where R' ∈ R^(s×d) is the object feature with enhanced spatial relationships and W_r ∈ R^(d×d) is a learnable parameter.
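As an illustrative sketch of the GCN relational update (the original equation is an image), the following example assumes the standard one-layer graph-convolution form R' = A'·R·W_r.

```python
import torch
import torch.nn as nn

class RelationGCN(nn.Module):
    """One graph-convolution step updating joint features with a relation
    matrix. Assumes the standard form R' = A' R W_r; the original equation
    is rendered as an image."""
    def __init__(self, d: int):
        super().__init__()
        self.wr = nn.Linear(d, d, bias=False)   # W_r in R^(d x d)

    def forward(self, a_prime: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # a_prime: (s, s) normalized relation matrix, r: (s, d) joint features
        return torch.relu(self.wr(a_prime @ r))  # R' in R^(s x d)

gcn = RelationGCN(d=64)
r1_updated = gcn(torch.rand(15, 15), torch.rand(15, 64))
```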
Likewise, from the above description, the corresponding formula can be obtained:

[Equation, rendered as an image in the original]

The features are then fused together, i.e.:

[Equation (4), rendered as an image in the original: simple fusion of the features]

which represents the features after simple fusion.

[Equation (5), rendered as an image in the original]

h_t and h_(t-1) are the hidden states of the LSTM at step t and step t-1, respectively, where:

[Equation (6), rendered as an image in the original: the LSTM update]

where W is a learnable parameter.
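As an illustrative sketch of the space-time fusion (equations (4)-(6) are rendered as images in the original), the following example assumes that the per-frame features of the two character objects are fused by simple concatenation and fed to an LSTM whose final hidden state h_t serves as the space-time fusion feature; the feature dimensions are placeholders.

```python
import torch
import torch.nn as nn

class SpatioTemporalFusion(nn.Module):
    """Sketch: per-frame features of both persons are fused (concatenation is
    assumed) and run through an LSTM; the final hidden state h_t serves as
    the space-time fusion feature."""
    def __init__(self, per_person_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(2 * per_person_dim, hidden_dim, batch_first=True)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # f1, f2: (batch, num_frames, per_person_dim) per-frame joint features
        fused = torch.cat([f1, f2], dim=-1)   # simple per-frame fusion
        _, (h_t, _) = self.lstm(fused)        # h_t: (1, batch, hidden_dim)
        return h_t.squeeze(0)                 # (batch, hidden_dim)

fusion = SpatioTemporalFusion()
st_feature = fusion(torch.rand(1, 30, 64), torch.rand(1, 30, 64))
print(st_feature.shape)   # torch.Size([1, 128])
```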
In addition, the graph convolution network described above may be trained in advance before use. During training, a cross-entropy loss function can be used as the objective function:

Loss = -(1/N) · Σ_i Σ_t y_it · log(P_it)    (7)

where M is the number of classes, N is the total number of samples, P_it is the probability that observation sample i belongs to class t, and y_it indicates whether the label of sample i is t; it is essentially an indicator function that equals 1 if the label of sample i is t and 0 otherwise:

y_it = 1 if the label of sample i is t, otherwise y_it = 0    (8)

[Equation (9), rendered as an image in the original]
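As an illustrative sketch of the training objective described above, the following example uses the standard multi-class cross-entropy over N samples and M interaction classes; PyTorch's CrossEntropyLoss combines the softmax producing P_it with the averaged -Σ y_it·log(P_it), and the dimensions are placeholders.

```python
import torch
import torch.nn as nn

# Sketch of the training objective around eq. (7): multi-class cross-entropy
# over N samples and M interaction classes.
M_CLASSES, N_SAMPLES, FEAT_DIM = 11, 8, 128

classifier = nn.Linear(FEAT_DIM, M_CLASSES)            # softmax head (logits)
features = torch.randn(N_SAMPLES, FEAT_DIM)            # space-time fusion features
labels = torch.randint(0, M_CLASSES, (N_SAMPLES,))     # ground-truth class indices
loss = nn.CrossEntropyLoss()(classifier(features), labels)
loss.backward()                                        # gradients for training
print(float(loss))
```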
in the embodiment of the application, the space feature vectors of the human body joint points are processed by adopting the graph convolution network to obtain the target space features, the space position feature degree of the human body joint points can be enhanced, the enhanced target space features are fused with the corresponding time sequence features, the obtained space-time fusion features are easier to judge the corresponding accurate double-person interaction action, and the action identification accuracy is improved.
In some embodiments, in order to better fuse the spatial feature and the time sequence feature of the human body joint point, the spatial feature vector of the human body joint point in the video frame image of each human body object is fused with the time sequence feature corresponding to the position information of the human body joint point, so as to obtain the time-space fusion feature of the human body joint points of the two human body objects, including:
and sequentially inputting the space feature vectors of the human body joint points in the video frame images into a pre-trained recurrent neural network model based on the time sequence features corresponding to the position information of the human body joint points to obtain the space-time fusion features of the human body joint points of the two human body objects.
In some embodiments, in order to obtain more accurate time sequence features, before fusing the spatial feature vector of the human body joint point of each human body object in the video frame image with the time sequence features corresponding to the position information of the human body joint point, the method further includes:
and based on the human body joint points, applying a time cycle neural network to obtain time sequence characteristics corresponding to the position information of the human body joint points in the video frame image of the first video.
In one specific example, the interaction introduced above may be a handshake, in which case the spatial feature vector of each human object may be used to represent that the human object has a hand stretching action in space; accordingly, the spatiotemporal fusion features are used to represent the reaching motion of two human objects, coinciding in time, and the palms of the two human objects being in contact in space.
In the foregoing embodiment, the step 340 of obtaining the interaction corresponding to the space-time fusion feature through the classifier may specifically include:
determining whether the stretching actions of the two character objects are overlapped in time, whether palms of the two character objects are contacted in space and whether the contact time is greater than a preset threshold value through a classifier;
and under the conditions that the hand stretching actions of the two character objects are overlapped in time, contact exists in space, and the contact time is greater than a preset threshold value, determining that the interaction action of the two character objects in the first video is handshake.
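As an illustrative sketch of this decision rule, the following example checks the temporal overlap of the two hand stretching actions, the spatial contact of the palms, and the contact duration; the palm-distance threshold and the frame-count threshold are illustrative assumptions, not values fixed by this embodiment.

```python
from typing import List, Tuple

def is_handshake(reach1: Tuple[int, int], reach2: Tuple[int, int],
                 palm_dist_px: List[float], contact_px: float = 20.0,
                 min_contact_frames: int = 10) -> bool:
    """Sketch of the handshake rule: the two reaching actions must overlap in
    time, the palms must be in contact in space, and the contact must last
    longer than a preset threshold. reach1/reach2 are (start, end) frame
    indices; palm_dist_px[i] is the palm-to-palm distance in frame i."""
    overlap_start = max(reach1[0], reach2[0])
    overlap_end = min(reach1[1], reach2[1])
    if overlap_start >= overlap_end:
        return False                                   # no temporal overlap
    contact = sum(1 for i in range(overlap_start, overlap_end)
                  if palm_dist_px[i] < contact_px)     # frames with spatial contact
    return contact > min_contact_frames

print(is_handshake((5, 60), (10, 70), [30.0] * 12 + [8.0] * 58))
```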
In addition, in order to further improve the accuracy of the recognition result, the recognition of the handshake action may further include recognition of some additional characteristic actions, such as a shaking action after the palms contact each other, a change situation of the body position, an action situation of another palm, and the like.
As analyzed above, double-person action recognition is more complex than single-person action recognition. In order to improve the precision of double-person action recognition, to mine useful information as fully as possible, and to recognize the more complex double-person actions, the spatial positions of the two persons need to be modeled in space; to better mine the spatial position information of the two persons, both their relative position and the relative spatial position of each of their joints need to be modeled. In time, the time sequence information of the two persons needs to be modeled, and a recurrent neural network handles time sequences well. Therefore, in space the feature vector of each character object is updated after the spatial position is modeled with the graph convolution network, and then the feature vectors of the two character objects are input into the recurrent neural network model to obtain the time sequence features.
Therefore, more accurate time sequence characteristics can be obtained by using the recurrent neural network, and further the spatial characteristics and the time sequence characteristics of the human body joint points can be better fused, so that the space-time fusion characteristics are more accurate, and the result of action identification is more accurate.
In the motion recognition method provided in the embodiment of the present application, the execution subject may also be a motion recognition device, or a control module for executing the motion recognition method in the motion recognition device. In the embodiment of the present application, an example in which a motion recognition apparatus executes a motion recognition method is taken as an example, and the motion recognition apparatus provided in the embodiment of the present application is described.
Based on the same inventive concept, the embodiment of the application also provides a motion recognition device.
Fig. 9 illustrates a motion recognition apparatus according to an embodiment of the present application, and as shown in fig. 9, the motion recognition apparatus 900 includes:
the identification module 910 is configured to identify human body joint points of two human body objects in a video frame image of a first video when it is detected that the first video includes the two human body objects, so as to obtain human body joint point position information of each of the two human body objects in the video frame image;
the encoding module 920 is configured to encode information of positions of human body joints of each human body object in the video frame image to obtain spatial feature vectors of the human body joints of each human body object in the video frame image;
a fusion module 930, configured to fuse the spatial feature vector of the human body joint point in the video frame image of each person object with the time sequence feature corresponding to the position information of the human body joint point, so as to obtain a time-space fusion feature of the human body joint points of the two person objects;
and the classification module 940 is configured to input the spatio-temporal fusion features into the classifier, and obtain an interaction corresponding to the spatio-temporal fusion features through the classifier.
In some embodiments, the motion recognition device 900 may further include:
the detection module can be used for inputting the first video into the target detection model before identifying human body joint points of two character objects in the video frame image of the first video, and detecting the character objects in the video frame image of the first video through the target detection model to obtain the character objects in the first video;
the output module can be used for outputting a region to be detected in a video frame image under the condition that the first video is detected to comprise two character objects, wherein the region to be detected is a region where the two character objects are located;
accordingly, the identifying module 910 may specifically identify human body joint points of two human body objects in the video frame image of the first video, including:
and identifying human body joint points of two human body objects in the area to be detected.
In some embodiments, the identifying module 910 may identify human joint points of two human objects in the region to be detected, which may include:
and identifying human body joint points of two character objects in the area to be detected according to the posture estimation model.
In some embodiments, the motion recognition device 900 may further include:
the modeling module can be used for inputting the spatial feature vector of the human body joint point of each character object in the video frame image into a graph convolution network trained in advance before fusing the spatial feature vector of the human body joint point of each character object in the video frame image with the time sequence feature corresponding to the position information of the human body joint point, and modeling at a spatial position based on the spatial feature vector of the human body joint point of each character object in the video frame image through the graph convolution network to obtain a target spatial feature;
correspondingly, the fusing module 930 fuses the spatial feature vector of the human body joint point of each person object in the video frame image and the time sequence feature corresponding to the position information of the human body joint point, which may specifically include:
and fusing the target space characteristics and the time sequence characteristics corresponding to the position information of the human body joint points.
In some embodiments, the modeling module may be specifically configured to input a spatial feature vector of a human body joint point of each person object in the video frame image into a graph convolution network trained in advance, construct a spatial relationship graph based on the spatial feature vector of the human body joint point of each person object in the video frame image through the graph convolution network, and obtain the target spatial feature, where a node of the spatial relationship graph is a human body joint point, and a connection line between human body joint points is an edge of the spatial relationship graph.
In some embodiments, the fusion module 930 may be specifically configured to sequentially input the spatial feature vectors of the human body joint points in the video frame image into a pre-trained recurrent neural network model based on the time sequence features corresponding to the position information of the human body joint points, so as to obtain the time-space fusion features of the human body joint points of the two human body objects.
In some embodiments, the motion recognition device 900 may further include:
the acquisition module may be configured to acquire, based on the human body joint point, a time-series characteristic corresponding to the position information of the human body joint point in the video frame image of the first video by applying a time-cycle neural network before fusing the spatial characteristic vector of the human body joint point in the video frame image of each person object with the time-series characteristic corresponding to the position information of the human body joint point.
In some embodiments, the interaction comprises at least one of:
fanning, kicking, pushing, patting on the shoulder, pointing with a finger, hugging, giving something, touching a pocket, shaking hands, walking toward each other, walking away from each other.
In some embodiments, where the interaction is a handshake, the spatial feature vector of each person object represents that the person object has an outstretched motion in space;
correspondingly, the space-time fusion feature is used for representing the hand stretching actions of the two character objects, the two character objects are overlapped in time, and the palms of the two character objects are contacted in space;
the classification module 940 may be specifically configured to perform the following actions:
inputting the spatio-temporal fusion features into a classifier;
determining whether the stretching actions of the two character objects are overlapped in time, whether palms of the two character objects are contacted in space and whether the contact time is greater than a preset threshold value through a classifier;
and under the conditions that the hand stretching actions of the two character objects are overlapped in time, contact exists in space, and the contact time is greater than a preset threshold value, determining that the interaction action of the two character objects in the first video is handshake.
The motion recognition device provided in the embodiment of the present application may be configured to execute the motion recognition method provided in each of the above method embodiments, and the implementation principle and the technical effect are similar, and for the sake of brevity, no further description is given here.
Based on the same inventive concept, the embodiment of the application also provides the electronic equipment.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 10, the electronic device may include a processor 1001 and a memory 1002 that stores computer programs or instructions.
Specifically, the processor 1001 may include a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or may be configured to implement one or more Integrated circuits of the embodiments of the present Application.
Memory 1002 may include mass storage for data or instructions. By way of example, and not limitation, memory 1002 may include a Hard Disk Drive (HDD), a floppy Disk Drive, flash memory, an optical Disk, a magneto-optical Disk, magnetic tape, or a Universal Serial Bus (USB) Drive or a combination of two or more of these. Memory 1002 may include removable or non-removable (or fixed) media, where appropriate. The memory 1002 may be internal or external to the integrated gateway disaster recovery device, where appropriate. In a particular embodiment, the memory 1002 is non-volatile solid-state memory. In a particular embodiment, the memory 1002 includes Read Only Memory (ROM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or flash memory or a combination of two or more of these.
The processor 1001 realizes any one of the motion recognition methods in the above-described embodiments by reading and executing computer program instructions stored in the memory 1002.
In one example, the electronic device may also include a communication interface 1003 and a bus 1010. As shown in fig. 10, the processor 1001, the memory 1002, and the communication interface 1003 are connected by a bus 1010 to complete communication therebetween.
The communication interface 1003 is mainly used to implement communication between modules, devices, units and/or devices in this embodiment.
The bus 1010 includes hardware, software, or both to couple the components of the electronic device to one another. By way of example, and not limitation, a bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or another suitable bus, or a combination of two or more of these. Bus 1010 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
The electronic device may execute the motion recognition method in the embodiment of the present application, so as to implement the motion recognition method and apparatus described in fig. 1 to fig. 2.
In addition, in combination with the motion recognition method in the foregoing embodiments, the embodiments of the present application may be implemented by providing a readable storage medium. The readable storage medium having stored thereon program instructions; the program instructions, when executed by a processor, implement any of the motion recognition methods in the above embodiments.
It is to be understood that the present application is not limited to the particular arrangements and instrumentality described above and shown in the attached drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present application are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications, and additions or change the order between the steps after comprehending the spirit of the present application.
The functional blocks shown in the above structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they may be, for example, electronic circuits, Application Specific Integrated Circuits (ASICs), suitable firmware, plug-ins, function cards, or the like. When implemented in software, the elements of the present application are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the Internet or an intranet.
It should also be noted that the exemplary embodiments mentioned in this application describe some methods or systems based on a series of steps or devices. However, the present application is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Aspects of the present application are described above in terms of flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such a processor may be, but is not limited to, a general purpose processor, a special purpose processor, an application specific processor, or a field programmable logic circuit. It will also be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware for performing the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing describes only specific embodiments of the present application. Those skilled in the art can clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the system, modules, and units described above, which are not repeated here. It should be understood that the scope of protection of the present application is not limited thereto; any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions shall fall within the scope of protection of the present application.

Claims (10)

1. A motion recognition method, comprising:
under the condition that it is detected that a first video includes two character objects, identifying human body joint points of the two character objects in a video frame image of the first video to obtain human body joint point position information of each of the two character objects in the video frame image;
coding the human body joint point position information of each character object in the video frame image to obtain a spatial feature vector of the human body joint points of each character object in the video frame image, wherein the spatial feature vector is used for representing change features of the motion of the character object in space;
fusing the spatial feature vector of the human body joint points of each character object in the video frame image with the time sequence feature corresponding to the human body joint point position information to obtain a space-time fusion feature of the human body joint points of the two character objects, wherein the space-time fusion feature is used for representing change features of the spatial interaction of the two character objects over the time sequence;
and inputting the space-time fusion feature into a classifier, and obtaining, through the classifier, an interactive action corresponding to the space-time fusion feature, wherein the interactive action is a matched action of the two character objects.
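For illustration only, the following is a minimal Python/NumPy sketch of the pipeline recited in claim 1. The callables pose_estimator, encoder, fusion, and classifier are hypothetical placeholders supplied by the caller; they are assumptions, not the claimed implementation.

```python
import numpy as np

def recognize_interaction(video_frames, pose_estimator, encoder, fusion, classifier):
    """Sketch of the claimed steps for a clip that contains two character objects."""
    joint_sequences = []
    for frame in video_frames:
        persons = pose_estimator(frame)      # assumed to return one joint-position array per person
        if len(persons) != 2:                # the method applies when exactly two persons are detected
            continue
        joint_sequences.append(persons)

    # Encode the per-frame joint position information into spatial feature vectors
    spatial_features = [encoder(persons) for persons in joint_sequences]

    # Fuse the spatial feature vectors with their temporal ordering into a space-time fusion feature
    space_time_feature = fusion(np.stack(spatial_features))

    # Classify the fused feature into an interaction label, e.g. "handshake"
    return classifier(space_time_feature)
```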
2. The motion recognition method of claim 1, wherein before the identifying human body joint points of the two character objects in the video frame image of the first video, the method further comprises:
inputting the first video into a target detection model, and detecting a character object in a video frame image of the first video through the target detection model to obtain the character object in the first video;
under the condition that two character objects are detected in the first video, outputting a region to be detected in the video frame image, wherein the region to be detected is a region where the two character objects are located;
the identifying human body joint points of two character objects in the video frame image of the first video specifically includes:
and identifying the human body joint points of the two character objects in the region to be detected.
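As a concrete illustration of the region to be detected in claim 2, the sketch below takes the two person boxes returned by a target detection model and crops the frame to their union. The [x1, y1, x2, y2] box format and the union-box strategy are assumptions; the claim only requires outputting the region where the two character objects are located.

```python
import numpy as np

def region_to_be_detected(person_boxes):
    """Union of the two detected person boxes [x1, y1, x2, y2]; joint recognition
    then runs only inside this region."""
    boxes = np.asarray(person_boxes, dtype=float)
    assert boxes.shape == (2, 4), "expects exactly two [x1, y1, x2, y2] boxes"
    return np.array([boxes[:, 0].min(), boxes[:, 1].min(),
                     boxes[:, 2].max(), boxes[:, 3].max()])

def crop_region(frame, region):
    """Crop an (H, W, C) frame to the region, clamped to the image bounds."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = np.clip(region, 0, [w, h, w, h]).astype(int)
    return frame[y1:y2, x1:x2]
```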
3. The motion recognition method according to claim 1, wherein the space-time fusion feature is given by the following formula:
(formula image not reproduced in this text)
wherein R1' and R2' are used for representing the spatial-relationship-enhanced object features of the two character objects, respectively;
(two formula images not reproduced in this text)
wherein A' is a value obtained by normalizing A; A1' is the spatial correlation between the joint points in the case of k1=1 and k2=1; A2' is the spatial correlation between the joint points in the case of k1=2 and k2=2; k1 represents one of the two character objects; k2 represents the other of the two character objects; R_k1 and R_k2 are the original object features of the two character objects, respectively; A is a correlation coefficient matrix of the joint points; Wi and Wj are learning parameters; Wr ∈ R^(d×d) is a learning parameter; R1'' is the feature obtained after R1' is updated by the graph convolution network; and R2'' is the feature obtained after R2' is updated by the graph convolution network.
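The formula images of claim 3 are not reproduced in this text, so the sketch below only follows the textual variable definitions: a joint correlation coefficient matrix A built with the learning parameters Wi and Wj, its normalization A', and relationship-enhanced object features obtained with Wr. The dot-product correlation and the softmax normalization are assumptions used to make the sketch runnable.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_relationship_enhance(R_k, Wi, Wj, Wr):
    """R_k: (num_joints, d) original object features of one character object.
    Returns the spatial-relationship-enhanced object features."""
    A = (R_k @ Wi) @ (R_k @ Wj).T     # correlation coefficient matrix A of the joint points
    A_norm = softmax(A, axis=-1)      # A': value obtained by normalizing A
    return A_norm @ R_k @ Wr          # enhanced features, later updated by the graph convolution network

# Usage with random tensors (the joint count J and feature dimension d are arbitrary here)
J, d = 17, 64
rng = np.random.default_rng(0)
R1, R2 = rng.normal(size=(J, d)), rng.normal(size=(J, d))
Wi, Wj, Wr = (rng.normal(size=(d, d)) for _ in range(3))
R1_enh = spatial_relationship_enhance(R1, Wi, Wj, Wr)
R2_enh = spatial_relationship_enhance(R2, Wi, Wj, Wr)
```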
4. The motion recognition method according to claim 1, wherein before the fusing the spatial feature vector of the human body joint points in the video frame image with the time sequence feature corresponding to the human body joint point position information, the method further comprises:
inputting the spatial feature vector of the human body joint points of each character object in the video frame image into a graph convolution network trained in advance, and performing modeling in terms of spatial position based on the spatial feature vector of the human body joint points of each character object in the video frame image through the graph convolution network, to obtain a target spatial feature;
the fusing the spatial feature vector of the human body joint points of each character object in the video frame image with the time sequence feature corresponding to the human body joint point position information comprises:
and fusing the target spatial feature with the time sequence feature corresponding to the human body joint point position information.
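A small sketch of the fusion step of claim 4 follows. The time sequence feature is computed here as the frame-to-frame displacement of the joint positions and the fusion operator is per-frame concatenation; both choices are assumptions, since the claim does not fix how the time sequence feature is derived or fused.

```python
import numpy as np

def time_sequence_feature(joint_positions):
    """joint_positions: (T, J, 2) joint coordinates over T frames.
    Uses frame-to-frame displacement as a simple time sequence feature."""
    disp = np.diff(joint_positions, axis=0, prepend=joint_positions[:1])
    return disp.reshape(disp.shape[0], -1)          # (T, J*2)

def fuse_space_time(target_spatial_feature, temporal_feature):
    """target_spatial_feature: (T, d1) output of the graph convolution network;
    temporal_feature: (T, d2). Concatenate per frame into the space-time fusion feature."""
    return np.concatenate([target_spatial_feature, temporal_feature], axis=-1)
```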
5. The motion recognition method according to claim 4, wherein the inputting the spatial feature vector of the human body joint points of each character object in the video frame image into the pre-trained graph convolution network, and modeling in terms of spatial position based on the spatial feature vector of the human body joint points through the graph convolution network to obtain the target spatial feature comprises:
inputting the spatial feature vector of the human body joint points of each character object in the video frame image into the graph convolution network trained in advance, and constructing a spatial relationship graph based on the spatial feature vector of the human body joint points of each character object in the video frame image through the graph convolution network, to obtain the target spatial feature, wherein the nodes of the spatial relationship graph are the human body joint points, and the connecting lines between the human body joint points are the edges of the spatial relationship graph.
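To make the spatial relationship graph of claim 5 concrete: human body joint points are the nodes and their connecting lines are the edges. The skeleton edge list and the single normalized-adjacency convolution step below are assumptions used for illustration only.

```python
import numpy as np

# Hypothetical skeleton topology (pairs of joint indices); the claim does not fix one.
SKELETON_EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6), (0, 7), (7, 8)]

def build_spatial_relationship_graph(num_joints, edges):
    """Adjacency matrix of the spatial relationship graph, with self-loops and row normalization."""
    A = np.eye(num_joints)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A / A.sum(axis=1, keepdims=True)

def graph_convolution(X, A_norm, W):
    """One graph convolution update ReLU(A_norm @ X @ W) over joint features X of shape (num_joints, d)."""
    return np.maximum(A_norm @ X @ W, 0.0)

# Usage sketch
A_norm = build_spatial_relationship_graph(9, SKELETON_EDGES)
X = np.random.default_rng(1).normal(size=(9, 64))
W = np.random.default_rng(2).normal(size=(64, 64))
target_spatial_feature = graph_convolution(X, A_norm, W)
```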
6. The motion recognition method of claim 1, wherein the graph convolution network is trained with a cross-entropy loss function:
Loss = -(1/N) · Σ_i Σ_t y_it · log(P_it),  i = 1, …, N,  t = 1, …, M
wherein M is the number of categories, N is the total number of samples, P_it is the probability that observation sample i belongs to category t, and y_it is a sign function; the sign function is used for indicating whether the label of sample i is t: if the label of sample i is t, the sign function is 1, and otherwise it is 0.
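A direct NumPy transcription of the cross-entropy loss of claim 6 is shown below; the one-hot matrix y plays the role of the sign function, and the small epsilon is an added implementation detail for numerical safety.

```python
import numpy as np

def cross_entropy_loss(P, y):
    """P: (N, M) predicted class probabilities; y: (N, M) one-hot sign function,
    where y[i, t] == 1 if the label of sample i is category t and 0 otherwise.
    Returns -(1/N) * sum_i sum_t y[i, t] * log(P[i, t])."""
    N = P.shape[0]
    return -(y * np.log(P + 1e-12)).sum() / N
```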
7. The motion recognition method according to claim 1, wherein in a case where the interactive action is a handshake, the spatial feature vector of each character object represents that the character object has a hand-stretching motion in space;
correspondingly, the space-time fusion feature is used for representing that the hand-stretching actions of the two character objects overlap in time and that the palms of the two character objects are in contact in space;
the obtaining of the interactive action corresponding to the space-time fusion feature through the classifier includes:
determining, through the classifier, whether the hand-stretching actions of the two character objects overlap in time, whether the palms of the two character objects are in contact in space, and whether the contact time is greater than a preset threshold;
and under the condition that the hand-stretching actions of the two character objects overlap in time, the palms are in contact in space, and the contact time is greater than the preset threshold, determining that the interactive action of the two character objects in the first video is a handshake.
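A rule-style sketch of the handshake decision of claim 7: the distance threshold, the frame-count threshold, and the use of 2-D palm coordinates are assumed values for illustration; the claim itself only requires temporal overlap of the hand-stretching actions, spatial contact of the palms, and a contact time above a preset threshold.

```python
import numpy as np

def is_handshake(palm_traj_1, palm_traj_2, contact_dist=20.0, min_contact_frames=5):
    """palm_traj_k: (T, 2) palm joint positions of character object k over T frames.
    Returns True when the palms stay within contact_dist (pixels) for more than
    min_contact_frames consecutive frames, i.e. the hand-stretching actions overlap
    in time and the spatial contact persists long enough."""
    dists = np.linalg.norm(np.asarray(palm_traj_1) - np.asarray(palm_traj_2), axis=1)
    contact = dists < contact_dist
    longest_run, run = 0, 0
    for c in contact:
        run = run + 1 if c else 0
        longest_run = max(longest_run, run)
    return longest_run > min_contact_frames
```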
8. A motion recognition device, comprising:
the identification module is used for identifying, under the condition that it is detected that a first video includes two character objects, human body joint points of the two character objects in a video frame image of the first video, and obtaining human body joint point position information of each of the two character objects in the video frame image;
the encoding module is used for encoding the human body joint point position information of each character object in the video frame image to obtain a spatial feature vector of the human body joint points of each character object in the video frame image, wherein the spatial feature vector is used for representing change features of the motion of the character object in space;
the fusion module is used for fusing the spatial feature vector of the human body joint points of each character object in the video frame image with the time sequence feature corresponding to the human body joint point position information to obtain a space-time fusion feature of the human body joint points of the two character objects, wherein the space-time fusion feature is used for representing change features of the spatial interaction of the two character objects over the time sequence;
and the classification module is used for inputting the space-time fusion feature into a classifier and obtaining, through the classifier, an interactive action corresponding to the space-time fusion feature, wherein the interactive action is a matched action of the two character objects.
9. An electronic device, comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the motion recognition method according to any one of claims 1 to 7.
10. A readable storage medium, characterized in that the readable storage medium stores a program or instructions which, when executed by a processor, implement the steps of the motion recognition method according to any one of claims 1 to 7.
CN202111346086.XA 2021-11-15 2021-11-15 Action recognition method, device, equipment and storage medium Pending CN113792712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111346086.XA CN113792712A (en) 2021-11-15 2021-11-15 Action recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111346086.XA CN113792712A (en) 2021-11-15 2021-11-15 Action recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113792712A true CN113792712A (en) 2021-12-14

Family

ID=78955187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111346086.XA Pending CN113792712A (en) 2021-11-15 2021-11-15 Action recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113792712A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210081782A1 (en) * 2019-09-16 2021-03-18 Honda Motor Co., Ltd. Action prediction
CN111353447A (en) * 2020-03-05 2020-06-30 辽宁石油化工大学 Human skeleton behavior identification method based on graph convolution network
CN112329562A (en) * 2020-10-23 2021-02-05 江苏大学 Human body interaction action recognition method based on skeleton features and slice recurrent neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHENG Keyang et al.: "融合时空图卷积的多人交互行为识别" (Multi-person Interactive Action Recognition Fusing Spatio-temporal Graph Convolution), 《中国图象图形学报》 (Journal of Image and Graphics) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963304A (en) * 2021-12-20 2022-01-21 山东建筑大学 Cross-modal video time sequence action positioning method and system based on time sequence-space diagram
CN114170688A (en) * 2022-02-11 2022-03-11 北京世纪好未来教育科技有限公司 Character interaction relation identification method and device and electronic equipment
CN114170688B (en) * 2022-02-11 2022-04-19 北京世纪好未来教育科技有限公司 Character interaction relation identification method and device and electronic equipment
CN117058767A (en) * 2023-10-12 2023-11-14 广州鼎飞航空科技有限公司 Training field monitoring method, training field monitoring equipment, storage medium and training field monitoring device
CN117058767B (en) * 2023-10-12 2024-02-09 广州鼎飞航空科技有限公司 Training field monitoring method, training field monitoring equipment, storage medium and training field monitoring device

Similar Documents

Publication Publication Date Title
CN108537136B (en) Pedestrian re-identification method based on attitude normalization image generation
Boulahia et al. Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition
CN107766839B (en) Motion recognition method and device based on 3D convolutional neural network
CN113792712A (en) Action recognition method, device, equipment and storage medium
CN110135249B (en) Human behavior identification method based on time attention mechanism and LSTM (least Square TM)
Wang et al. Learning multi-granularity temporal characteristics for face anti-spoofing
CN113326835B (en) Action detection method and device, terminal equipment and storage medium
CN111126346A (en) Face recognition method, training method and device of classification model and storage medium
CN111709311A (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN103902989A (en) Human body motion video recognition method based on non-negative matrix factorization
CN111881731A (en) Behavior recognition method, system, device and medium based on human skeleton
Liu et al. Attentive cross-modal fusion network for RGB-D saliency detection
CN113312973A (en) Method and system for extracting features of gesture recognition key points
WO2021103474A1 (en) Image processing method and apparatus, storage medium and electronic apparatus
CN114049581A (en) Weak supervision behavior positioning method and device based on action fragment sequencing
CN112801061A (en) Posture recognition method and system
Waheed et al. Exploiting Human Pose and Scene Information for Interaction Detection
WO2013078657A1 (en) A gesture recognition method, an apparatus and a computer program for the same
CN111783619A (en) Human body attribute identification method, device, equipment and storage medium
Ali et al. Deep Learning Algorithms for Human Fighting Action Recognition.
Adhikari et al. A Novel Machine Learning-Based Hand Gesture Recognition Using HCI on IoT Assisted Cloud Platform.
CN111985333B (en) Behavior detection method based on graph structure information interaction enhancement and electronic device
Kamiński et al. Human activity recognition using standard descriptors of MPEG CDVS
CN116958769A (en) Method and related device for detecting crossing behavior based on fusion characteristics
Nie et al. A child caring robot for the dangerous behavior detection based on the object recognition and human action recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211214