CN117011941A - Interactive behavior recognition method - Google Patents

Interactive behavior recognition method

Info

Publication number
CN117011941A
Authority
CN
China
Prior art keywords
target object
expression
interactive behavior
coordinate system
coordinates
Prior art date
Legal status
Pending
Application number
CN202310998576.0A
Other languages
Chinese (zh)
Inventor
刘星
Current Assignee
Shenzhen Institute of Information Technology
Original Assignee
Shenzhen Institute of Information Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Information Technology filed Critical Shenzhen Institute of Information Technology
Priority to CN202310998576.0A
Publication of CN117011941A
Status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Abstract

The application is applicable to the technical field of image data processing and provides an interactive behavior recognition method that aims to improve the accuracy with which a depth camera recognizes the interactive behavior of two parties. The method mainly comprises the following steps: determining a first target object and a second target object for interactive behavior recognition; establishing a new coordinate system by taking an interaction center point of the first target object and the second target object as an origin; calculating new coordinates of the first target object in the new coordinate system and new coordinates of the second target object in the new coordinate system; and identifying the specific interaction behavior of the first target object and the second target object according to the new coordinates of the first target object and the new coordinates of the second target object.

Description

Interactive behavior recognition method
Technical Field
The application belongs to the technical field of image data processing, and particularly relates to an interactive behavior recognition method.
Background
Human behavior recognition, as a behavioral recognition (Behavioral Recognition, BR) technology, can monitor and recognize human behaviors in order to predict or discover unexpected or dangerous events in advance. It is a challenging task with broad application prospects, such as intelligent video surveillance, human-computer interaction, automatic recognition and alarm, and public safety; human behavior recognition has therefore become a research hotspot in the related fields and has potential economic value.
At present, many schemes for human behavior recognition focus on performing behavior recognition on the person coordinates obtained in the original coordinate system of the depth camera. However, a camera device usually shoots a person from only one or a few camera angles, and the recognition and classification of the same action differ greatly between angles, so it is difficult to match the same action across views. This is not conducive to the classification of interactive behavior features and affects the effect of recognizing the interactive behavior of the persons.
Disclosure of Invention
The application aims to provide an interactive behavior recognition method that improves the accuracy with which the interactive behavior of two persons is recognized.
In a first aspect, the present application provides an interactive behavior recognition method, including:
determining a first target object and a second target object for interactive behavior recognition;
establishing a new coordinate system by taking an interaction center point of the first target object and the second target object as an origin;
calculating a new coordinate of the first target object corresponding to the first target object in the new coordinate system and a new coordinate of the second target object corresponding to the second target object in the new coordinate system;
and identifying the specific interaction behavior of the first target object and the second target object according to the new coordinates of the first target object and the new coordinates of the second target object.
In a second aspect, the present application provides an interactive behavior recognition system, comprising:
the determining unit is used for determining a first target object and a second target object for interactive behavior recognition;
the establishing unit is used for establishing a new coordinate system by taking the interaction center point of the first target object and the second target object as an origin;
the calculating unit is used for calculating a first target object new coordinate corresponding to the first target object in the new coordinate system and a second target object new coordinate corresponding to the second target object in the new coordinate system;
and the identification unit is used for identifying the specific interaction behavior of the first target object and the second target object according to the new coordinates of the first target object and the new coordinates of the second target object.
In a third aspect, the present application provides a computer device comprising:
processor, memory, bus, input/output interface, network interface;
the processor is connected with the memory, the input/output interface and the network interface through the bus;
the memory stores a program;
the processor, when executing the program stored in the memory, implements the interactive behavior recognition method according to any one of the foregoing first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the interactive behavior recognition method according to any one of the preceding first aspects.
In a fifth aspect, the present application provides a computer program product which, when executed on a computer, causes the computer to perform the interactive behavior recognition method according to any one of the preceding first aspects.
It can be seen from the above technical solutions that the embodiments of the application have the following advantages:
according to the interactive behavior recognition method, the first target object and the second target object for interactive behavior recognition are determined, so that both sides for interactive behavior recognition are obtained; and then, a new coordinate system is established by taking an interaction center point of the first target object and the second target object as an origin, a new coordinate of the first target object corresponding to the new coordinate system of the first target object and a new coordinate of the second target object corresponding to the new coordinate system of the second target object are calculated, and two parties for carrying out interaction actions are described by the new coordinate system.
Drawings
FIG. 1 is a flow chart of an interactive behavior recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of an interactive behavior recognition method according to another embodiment of the present application;
FIG. 3 is a schematic diagram illustrating the structure of an interactive behavior recognition system according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating an interactive behavior recognition system according to another embodiment of the present application;
FIG. 5 is a schematic diagram illustrating the structure of a computer device according to an embodiment of the present application;
FIG. 6 is a diagram of one embodiment of the present application of "same double interaction behavior" at camera angles of different depth cameras;
FIG. 7 is a schematic representation of one embodiment of the default human body model used for the human skeleton sequences of the "NTU RGB+D 60" dataset in the experiments of the present application;
FIG. 8 is a schematic diagram showing the effect of an embodiment of the present application for interactive behavior recognition using geometric features;
FIG. 9 is a schematic representation of the effect of one embodiment of the present application for individual B "kicking" individual A;
FIG. 10 is a schematic representation of the effect of one embodiment of the present application of individual B "pushing" individual A;
FIG. 11 is a schematic diagram showing the effect of one embodiment of the present application translating from an original coordinate system O-XYZ to a new coordinate system I-XYZ;
FIG. 12 is a flow chart of one embodiment of an expression S' of skeleton sequence coordinates representing individual weights, temporal weights, and spatial weights in a first target object and a second target object according to the present application;
FIG. 13 is a schematic of the confusion matrix for the 11 interaction behavior categories of the "NTU RGB+D 60" dataset used in the experiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that prior-art research on human behavior recognition based on RGB video has found that the success rate of human behavior recognition is greatly affected by factors such as illumination, scene and camera lens angle. This is because, under the influence of factors such as differences in lighting conditions, complexity of the background and diversity of viewing angles, it is difficult to accurately describe the skeleton-sequence-related information of a human body, so the classification of human behaviors is not accurate enough. However, with the development of technologies such as depth cameras (Depth Camera) and pose estimation, it is becoming easier and more accurate to acquire human-skeleton-sequence-related information from video captured by a depth camera in combination with a pose estimation algorithm. By processing the depth images acquired by the depth camera sensor frame by frame in combination with a pose estimation algorithm, a human skeleton consisting of a plurality of joints can be inferred, so that the recognition of human behaviors in a video can be turned into the description of the human-skeleton-sequence-related information at different moments. It should be noted that relatively mature depth cameras currently include, for example, the Kinect depth camera and the RealSense depth camera.
In videos shot by an actual depth camera, most scenes in which human behaviors need to be recognized involve interaction with another person; even if multiple people are present in the scene at the same time, the interaction produced by one person at a certain moment is usually directed at one other person. That is, the scene for human behavior recognition is usually the interaction between two people, also called double interaction. Therefore, behavior recognition research oriented to double interaction has great practical application value.
From the video shot by the depth camera, the human-skeleton-sequence-related information of the two target persons (such as the joint coordinates and joint names of the skeleton sequences) can be recognized and acquired; at the same time, the video contains more complex spatial relationships (expressed as the postures, relative distances and the like between the two target persons) and temporal relationships (the same quantities reflected across the different image frames of the video). One of the more critical problems is the numerical variability of the skeleton-sequence-related information of the same behavior at different viewing angles. As shown in fig. 6, the viewing-angle-1 scene, the viewing-angle-2 scene and the viewing-angle-3 scene show the same double interaction behavior in different scenes, namely one person kicking another person; it can be seen that, under the same original coordinate system O-XYZ, the coordinate values of the joints of the human skeleton sequences of the two target persons obtained by the depth cameras at different viewing angles differ greatly, making it difficult for a network model to classify and recognize them as the same interaction behavior.
Assume that the two persons engaged in the interaction behavior are recorded by a depth camera under a set original coordinate system O-XYZ. The two parties for interactive behavior recognition can first be determined and are called the first target object and the second target object, and the first target object and the second target object are recorded through the original coordinate system O-XYZ preset in the depth camera to obtain a video containing F frames of images. The three-dimensional coordinates $J_{A,i}^{n}$ of the first target object and the three-dimensional coordinates $J_{B,i}^{n}$ of the second target object contained in the images are then determined. Specifically,

$$J_{A,i}^{n}=\bigl(x_{A,i}^{n},\,y_{A,i}^{n},\,z_{A,i}^{n}\bigr),\qquad J_{B,i}^{n}=\bigl(x_{B,i}^{n},\,y_{B,i}^{n},\,z_{B,i}^{n}\bigr)$$

wherein n, i and F in the formula are positive integers greater than 0, n denotes a joint label and i denotes a frame index with 1 ≤ i ≤ F. In the three-dimensional coordinates $J_{A,i}^{n}$ of the first target object, "A" refers to the first target object, "n" refers to the joint labeled n of the first target object in the default human body model, and "i" refers to the i-th frame image of the video composed of the F frame images recorded by the depth camera; the set $\{J_{A,i}^{n}\}$ over all n and i can therefore comprehensively express the three-dimensional coordinates of the first target object. Similarly, in the three-dimensional coordinates $J_{B,i}^{n}$ of the second target object, "B" refers to the second target object, "n" refers to the joint labeled n of the second target object in the default human body model, and "i" refers to the i-th frame image of the video composed of the F frame images recorded by the depth camera; the set $\{J_{B,i}^{n}\}$ can comprehensively express the three-dimensional coordinates of the second target object.
It should be noted that, to reduce the complexity of identifying a person and the amount of calculation required for recognizing the interaction behavior, a simplified default human body model is generally used in place of the person's real skeleton joints as the basis for the interaction behavior recognition of this embodiment; for the labeling rule of the default human body model of this embodiment, please refer to fig. 7. Fig. 7 shows a default human body model that may be employed in this embodiment, which comprises 25 joint labels, wherein: joint label 1 indicates the Hip Center, joint label 2 the Spine, joint label 3 the Neck, joint label 4 the Head, joint label 5 the Left Shoulder, joint label 6 the Left Elbow, joint label 7 the Left Wrist, joint label 8 the Left Hand, joint label 9 the Right Shoulder, joint label 10 the Right Elbow, joint label 11 the Right Wrist, joint label 12 the Right Hand, joint label 13 the Left Hip, joint label 14 the Left Knee, joint label 15 the Left Ankle, joint label 16 the Left Foot, joint label 17 the Right Hip, joint label 18 the Right Knee, joint label 19 the Right Ankle, joint label 20 the Right Foot, joint label 21 the Shoulder Center, joint label 22 the Left Hand Tip, joint label 23 the Left Thumb, joint label 24 the Right Hand Tip, and joint label 25 the Right Thumb. As can be seen, in fig. 7 each person is simplified as a whole into a default human body model represented by 25 joint labels, so that recognizing the interaction behavior of the two parties can be regarded as recognizing a classification of joint combinations between two human body models; the coordinates of the positions corresponding to the 25 joint labels of each person can be calculated by the depth camera in the original coordinate system by means of a pose estimation algorithm.
Specifically, the depth camera can determine the skeleton sequence coordinates of the first target object in the original coordinate system (i.e. the set of all the joint coordinates of the human body model corresponding to the first target object over the video image frames) as $S_A$, and the skeleton sequence coordinates of the second target object in the original coordinate system (i.e. the set of all the joint coordinates of the human body model corresponding to the second target object over the video image frames) as $S_B$:

$$S_A=\bigl\{J_{A,i}^{n}\ \big|\ n=1,\dots,N;\ i=1,\dots,F\bigr\},\qquad S_B=\bigl\{J_{B,i}^{n}\ \big|\ n=1,\dots,N;\ i=1,\dots,F\bigr\}$$

$$S=(S_A,\,S_B)$$

where N is a positive integer greater than 0 (and, when the default human body model with only 25 joint labels described above is employed, N is also a positive integer less than or equal to 25), $J_{A,i}^{n}$ represents the three-dimensional coordinates of the n-th joint in the i-th frame image of the first target object, and $J_{B,i}^{n}$ represents the three-dimensional coordinates of the n-th joint in the i-th frame image of the second target object; S represents the combined skeleton sequence coordinate expression of the two parties to the interaction behavior (the first target object and the second target object).
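As an illustration only, the following minimal Python sketch shows one way the skeleton sequence coordinates $S_A$, $S_B$ and the combined expression S described above could be held in memory, assuming the 25-joint default human body model of fig. 7 and a clip of F frames; the array layout, function name and use of NumPy are assumptions for illustration and are not prescribed by the application.

```python
import numpy as np

N_JOINTS = 25   # joint labels in the default human body model of fig. 7
C = 3           # (x, y, z) coordinate dimension

def make_skeleton_sequence(num_frames: int) -> np.ndarray:
    """Container for one person's skeleton sequence, shape (F, N, C).

    S[i, n] holds the three-dimensional coordinates of joint n in frame i,
    expressed in the camera's original O-XYZ coordinate system.
    """
    return np.zeros((num_frames, N_JOINTS, C), dtype=np.float32)

# Example: a 300-frame clip containing both interaction parties
F = 300
S_A = make_skeleton_sequence(F)   # first target object
S_B = make_skeleton_sequence(F)   # second target object
S = (S_A, S_B)                    # combined expression S = (S_A, S_B)
```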
As can be seen from fig. 6, the coordinate values of the joints of the human skeleton sequences of the two target persons obtained by the depth cameras at different viewing angles differ greatly under the same original coordinate system O-XYZ; that is, for the same motion seen from different viewing angles, the skeleton sequence coordinates $S_A$ of the first target object and the skeleton sequence coordinates $S_B$ of the second target object differ greatly in value, so it is difficult to match the same action numerically, which is not conducive to the classification of features and affects the effect of behavior recognition. At present, many researchers studying interactive behavior recognition still obtain the skeleton coordinate sequence under the original coordinate axes and calculate interactive behavior features such as relative distance and relative position from it, which is not conducive to the classification and recognition of interactive behaviors.
Referring to fig. 1, an embodiment of an interactive behavior recognition method of the present application includes:
101. and determining a first target object and a second target object for interactive behavior recognition.
This step may be performed by the depth camera determining the two parties performing the interaction; the depth camera is, for example, the aforementioned Kinect depth camera, RealSense depth camera or the like, and the step is implemented with relatively mature prior art, which is not described in detail herein.
102. And establishing a new coordinate system by taking the interaction center point of the first target object and the second target object as an origin.
Specifically, for the interaction center point of the first target object and the second target object, this step may take the midpoint of the line between a certain torso joint of the first target object and the corresponding torso joint of the second target object as the interaction center point, the joint being, for example, the Hip Center with joint label 1, the Spine with joint label 2, or the Shoulder Center with joint label 21 in the default human body model. Specifically, using the Hip Center with joint label 1, this step can confirm the three-dimensional coordinates $J_I$ of the interaction center point I of the first target object and the second target object in the original coordinate system as follows:

$$J_I=\tfrac{1}{2}\bigl(J_{A,1}^{1}+J_{B,1}^{1}\bigr)$$

wherein $J_{A,1}^{1}$ represents the three-dimensional coordinates of the default 1st joint in the 1st frame image of the first target object, the 1st joint being the individual center joint of the first target object (i.e. the hip joint center with joint label 1); similarly, $J_{B,1}^{1}$ represents the three-dimensional coordinates of the default 1st joint in the 1st frame image of the second target object, the 1st joint being the individual center joint of the second target object (i.e. the hip joint center with joint label 1). This step then establishes a new coordinate system with the three-dimensional coordinates $J_I$ of the interaction center point I in the original coordinate system as the origin; that is, the skeleton sequence coordinates of the first target object and the second target object are translated from the original coordinate system O-XYZ to the new coordinate system I-XYZ for the subsequent correlation calculation, as shown in fig. 11.
103. And calculating a new coordinate of the first target object corresponding to the first target object in the new coordinate system, and a new coordinate of the second target object corresponding to the second target object in the new coordinate system.
Specifically, it is first necessary to determine that the skeleton sequence coordinates of the first target object in the original coordinate system are $S_A$ and the skeleton sequence coordinates of the second target object in the original coordinate system are $S_B$, as follows:

$$S_A=\bigl\{J_{A,i}^{n}\ \big|\ n=1,\dots,N;\ i=1,\dots,F\bigr\},\qquad S_B=\bigl\{J_{B,i}^{n}\ \big|\ n=1,\dots,N;\ i=1,\dots,F\bigr\}$$

where N is a positive integer greater than 0 (and, when the default human body model with only 25 joint labels described above is employed, N is also a positive integer less than or equal to 25), $J_{A,i}^{n}$ represents the three-dimensional coordinates of the n-th joint in the i-th frame image of the first target object, and $J_{B,i}^{n}$ represents the three-dimensional coordinates of the n-th joint in the i-th frame image of the second target object.

Then the skeleton sequence coordinates of the first target object in the new coordinate system are calculated as $\tilde{S}_A$, and the skeleton sequence coordinates of the second target object in the new coordinate system are calculated as $\tilde{S}_B$:

$$\tilde{S}_A=\bigl\{\tilde{J}_{A,i}^{\,n}\bigr\},\qquad \tilde{S}_B=\bigl\{\tilde{J}_{B,i}^{\,n}\bigr\},\qquad \tilde{J}_{A,i}^{\,n}=J_{A,i}^{n}-J_I,\qquad \tilde{J}_{B,i}^{\,n}=J_{B,i}^{n}-J_I$$

where $J_I$ is the three-dimensional coordinates of the interaction center point I in the original coordinate system obtained in step 102 above.
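The translation from O-XYZ to I-XYZ is then a simple subtraction of $J_I$ from every joint coordinate; the following sketch, continuing the assumed array layout above, illustrates it.

```python
import numpy as np

def to_new_coordinates(S: np.ndarray, J_I: np.ndarray) -> np.ndarray:
    """Translate a skeleton sequence of shape (F, N, 3) from the original
    coordinate system O-XYZ to the new coordinate system I-XYZ whose origin
    is the interaction center point J_I (broadcast over frames and joints)."""
    return S - J_I

# J_I = interaction_center(S_A, S_B)
# S_A_new = to_new_coordinates(S_A, J_I)   # skeleton of A in I-XYZ
# S_B_new = to_new_coordinates(S_B, J_I)   # skeleton of B in I-XYZ
```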
104. And identifying the specific interaction behavior of the first target object and the second target object according to the new coordinates of the first target object and the new coordinates of the second target object.
Specifically, in this step, the interaction behavior of the first target object and the second target object described in the new coordinate system in step 103 may be directly recognized through a preset neural network model, so that the specific interaction behavior of the first target object and the second target object is obtained directly, for example: "kicking", "pushing" and the like. Because the new coordinate system is centered on the interaction between the two parties, the same interaction behavior captured from different viewing angles yields similar coordinate descriptions, which reduces the recognition difficulty caused by viewing-angle differences and helps improve the accuracy with which the neural network model recognizes the interaction behavior.
The preset neural network model is a trained neural network model capable of identifying the interaction behavior types of the two parties; for example, the neural network model is one or more of VGG (Visual Geometry Group) networks, residual networks (ResNet) and the like. The trained neural network model can be stored in the memory of the depth camera of this embodiment so that it can be called by the neural network chip of the depth camera to quickly recognize the image frames of the video shot by the depth camera. For training of the neural network model, videos of people in public places (such as shopping malls, airports, amusement parks and stations) collected under legal authorization may be used as training samples; if necessary, the interaction behaviors of the two parties in the different training samples may be manually labeled and classified (for example, into communication behaviors, no-interaction behaviors, kicking behaviors, pushing behaviors and the like).
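Purely as an illustrative sketch of how step 104 might look in code: the application leaves the concrete classifier open (VGG, ResNet or the graph convolution model of the later embodiment), so the snippet below only shows feeding the new-coordinate skeletons of the two parties to some trained PyTorch model; the label list, batch layout and model interface are hypothetical assumptions.

```python
import numpy as np
import torch

# Hypothetical label set; the embodiment mentions categories such as
# kicking, pushing, communication and no interaction.
LABELS = ["kick", "push", "communication", "no interaction"]

def recognize_interaction(model: torch.nn.Module,
                          S_A_new: np.ndarray,
                          S_B_new: np.ndarray) -> str:
    """Classify one clip from the new-coordinate skeletons of both parties.

    S_A_new and S_B_new have shape (F, N, 3), already translated to I-XYZ.
    They are stacked into a batch of shape (1, 2, F, N, 3); the input
    layout expected by the trained network is an assumption here.
    """
    x = torch.from_numpy(np.stack([S_A_new, S_B_new])[None]).float()
    with torch.no_grad():
        logits = model(x)                  # shape (1, num_classes)
    return LABELS[int(logits.argmax(dim=-1))]
```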
Based on the description of the embodiment of fig. 1, in other embodiments, as shown in fig. 8, current recognition of the interaction behavior of the two interacting parties may describe the geometric features present in the interaction behavior using the two parties, including the skeleton sequence coordinates of the two parties in the interaction behavior and the relative distances between different parts of the two parties (such as the relative distances between corresponding joints, between hand joints, between torso joints, between the closest joints, and so on). These geometric features describe the relative geometric features between individuals and the geometric features of each individual, but do not describe the importance of an individual's participation in the interaction behavior, so the two individuals in a double interaction are given the same importance weight for the interaction behavior. This clearly ignores the fact that in many interactions the degree of participation of different individuals can differ significantly. Referring to fig. 9 and fig. 10, fig. 9 shows the right individual B "kicking" the left individual A, and fig. 10 shows the right individual B "pushing" the left individual A. In these scenes, individual B exhibits much more significant "kicking" and "pushing" action characteristics than individual A, so the importance of individual B's features for describing the interaction behavior is far higher than that of individual A, which accords with human cognition. Existing interactive behavior recognition technology that adopts the same weight for both individuals is therefore not conducive to a salient description of the interaction behavior features, and in turn not conducive to improving the interactive behavior recognition effect.
Referring to fig. 2, another embodiment of the interactive behavior recognition method of the present application includes:
201. and determining a first target object and a second target object for interactive behavior recognition.
The execution of this step is similar to step 101 in the embodiment of fig. 1, and the repetition is not repeated here.
202. And establishing a new coordinate system by taking the interaction center point of the first target object and the second target object as an origin.
The execution of this step is similar to step 102 in the embodiment of fig. 1, and the repetition is not repeated here.
203. And calculating a new coordinate of the first target object corresponding to the first target object in the new coordinate system, and a new coordinate of the second target object corresponding to the second target object in the new coordinate system.
The execution of this step is similar to step 103 in the embodiment of fig. 1, and the repetition is not repeated here.
204. Providing the first target object with a first weight and providing the second target object with a second weight, wherein the sum of the first weight and the second weight is 1.
This embodiment aims to solve the technical problem that, in the interactive behavior recognition technology, a scheme giving the individuals the same weight is not conducive to a salient description of the interaction behavior features, and to reflect in the specific recognition feature classification that "some specific interaction behaviors are specific actions performed by a specific person on another person", thereby helping to improve the interactive behavior recognition accuracy. For example, in fig. 9 and fig. 10, individual B exhibits much more significant "kicking" and "pushing" action characteristics than individual A, and the behaviors should be specifically recognized as "individual B kicks individual A" and "individual B pushes individual A", not merely as "the interaction between individual B and individual A is a kick" and "the interaction between individual B and individual A is a push".
To achieve the above object, this step provides the first target object with a first weight $W_A$ and provides the second target object with a second weight $W_B$, where the sum of the first weight $W_A$ and the second weight $W_B$ is 1, specifically:

$$W_A=\frac{L_A}{L_A+L_B},\qquad W_B=\frac{L_B}{L_A+L_B}$$
wherein the maximum position change amount $L_A$ corresponding to the first target object and the maximum position change amount $L_B$ corresponding to the second target object are as follows:

$$L_A=\max_{n,i}\Bigl(\mathrm{norm}\bigl(\tilde{J}_{A,i}^{\,n}-\tilde{J}_{A,1}^{\,n}\bigr)\Bigr),\qquad L_B=\max_{n,i}\Bigl(\mathrm{norm}\bigl(\tilde{J}_{B,i}^{\,n}-\tilde{J}_{B,1}^{\,n}\bigr)\Bigr)$$

where max() takes the maximum value and norm() takes the modulus of the vector, i.e. the maximum, over all joints and frames, of the displacement of an individual's joints relative to the first frame in the new coordinate system.
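A sketch of the individual weights $W_A$ and $W_B$ is given below; the definition of the maximum position change (here, the largest joint displacement relative to the first frame) follows the reconstruction above and is an assumption, as is the small guard against division by zero.

```python
import numpy as np

def max_position_change(S_new: np.ndarray) -> float:
    """L for one individual: the largest joint displacement (Euclidean norm)
    relative to the first frame, over all joints and frames.
    S_new has shape (F, N, 3) in the new coordinate system I-XYZ."""
    disp = S_new - S_new[0:1]                     # displacement from frame 1
    return float(np.linalg.norm(disp, axis=-1).max())

def individual_weights(S_A_new: np.ndarray, S_B_new: np.ndarray):
    L_A = max_position_change(S_A_new)
    L_B = max_position_change(S_B_new)
    total = max(L_A + L_B, 1e-8)                  # guard against two static skeletons
    W_A = L_A / total
    W_B = L_B / total                             # W_A + W_B = 1
    return W_A, W_B
```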
205. A first behavioral characteristic expression of the behavioral characteristic of the first target object is established, the first behavioral characteristic expression being equal to the first target object new coordinates multiplied by a first weight.
It should be noted that, in addition to the individual importance of the different individuals (the first target object and the second target object) in the interaction, this embodiment also considers the temporal importance of an individual's skeleton sequence coordinates across the different image frames of the video and the spatial importance of an individual's skeleton sequence coordinates within the same image frame of the video. In this embodiment, the individual importance in the interaction behavior, the temporal importance of the different individuals' skeleton sequences across different image frames of the video, and the spatial importance of the individual skeleton sequences within the image frames of the video are described in a unified manner.
Specifically, referring to fig. 12, according to the skeleton-sequence-related data of the two individuals A (assumed to be the first target object of this embodiment) and B (assumed to be the second target object of this embodiment) of the interaction behavior, the skeleton sequence coordinates $\tilde{S}_A$ of the first target object in the new coordinate system and the skeleton sequence coordinates $\tilde{S}_B$ of the second target object in the new coordinate system can each be formed into a three-dimensional tensor of size C×F×N, where C represents the dimension of the skeleton sequence coordinates, i.e. C=3; as in the previous embodiment, F represents the number of frames of the video recorded by the depth camera and is the dimension in which time lies; N represents the number of joints of individual A and/or individual B reflected in one frame of the video recorded by the depth camera and is the dimension in which space lies.
In order to embody the feature weights of the skeleton sequence coordinates of the different individuals in the time dimension and in the space dimension, this step pools the skeleton sequence coordinates $\tilde{S}_A$ of the first target object in the new coordinate system and the skeleton sequence coordinates $\tilde{S}_B$ of the second target object in the new coordinate system over the dimension N to obtain temporal weight features of size C×F×1, and pools them over the dimension F to obtain spatial weight features of size C×1×N.
A spatial weight feature expression $U_N^A$ of the first target object and/or a temporal weight feature expression $U_F^A$ of the first target object are established as follows:

$$U_N^A=\mathrm{pooling}(F)\bigl(\tilde{S}_A\bigr),\qquad U_F^A=\mathrm{pooling}(N)\bigl(\tilde{S}_A\bigr)$$

where $\mathrm{pooling}(F)(\cdot)$ denotes pooling the three-dimensional tensor of the first target object's skeleton sequence coordinates over the dimension F, and $\mathrm{pooling}(N)(\cdot)$ denotes pooling the three-dimensional tensor of the first target object's skeleton sequence coordinates over the dimension N.
That is, the spatial weight feature expression of the first target object, $U_N^A$, has size C×1×N, and the temporal weight feature expression of the first target object, $U_F^A$, has size C×F×1.
206. And establishing a second behavior characteristic expression of the behavior characteristic of the second target object, wherein the second behavior characteristic expression is equal to the new coordinates of the second target object multiplied by a second weight.
Similarly, following step 205, a spatial weight feature expression $U_N^B$ of the second target object and/or a temporal weight feature expression $U_F^B$ of the second target object are established as follows:

$$U_N^B=\mathrm{pooling}(F)\bigl(\tilde{S}_B\bigr),\qquad U_F^B=\mathrm{pooling}(N)\bigl(\tilde{S}_B\bigr)$$

where, in the same way, $\mathrm{pooling}(F)(\cdot)$ denotes pooling the three-dimensional tensor of the second target object's skeleton sequence coordinates over the dimension F, and $\mathrm{pooling}(N)(\cdot)$ denotes pooling it over the dimension N.
The spatial weight feature expression of the second target object, $U_N^B$, has size C×1×N, and the temporal weight feature expression of the second target object, $U_F^B$, has size C×F×1.
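The following sketch computes the per-individual temporal and spatial weight features; mean pooling is assumed, since the embodiment only specifies "pooling", and the rearrangement to a C×F×N tensor mirrors the description above.

```python
import numpy as np

def weight_features(S_new: np.ndarray):
    """Temporal and spatial weight features for one individual.

    S_new: skeleton sequence in the new coordinate system, shape (F, N, 3).
    It is rearranged to (C, F, N); mean pooling is assumed.
    """
    x = np.transpose(S_new, (2, 0, 1))        # (C, F, N)
    U_F = x.mean(axis=2, keepdims=True)       # pooling(N): temporal feature, (C, F, 1)
    U_N = x.mean(axis=1, keepdims=True)       # pooling(F): spatial feature, (C, 1, N)
    return U_F, U_N

# U_F_A, U_N_A = weight_features(S_A_new)     # first target object
# U_F_B, U_N_B = weight_features(S_B_new)     # second target object
```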
207. And establishing an interactive behavior expression of the first target object and the second target object, wherein the interactive behavior expression is equal to the first behavior characteristic expression plus the second behavior characteristic expression.
Specifically, a spatial weight feature expression $U_N$ of the interaction behavior skeleton sequence of the first target object and the second target object is established, and/or a temporal weight feature expression $U_F$ of the interaction behavior skeleton sequence of the first target object and the second target object is established, as follows:

$$U_N=W_A\,U_N^A+W_B\,U_N^B,\qquad U_F=W_A\,U_F^A+W_B\,U_F^B$$
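A one-function sketch of the combination above, under the assumption (read from steps 205 to 207) that the interaction-level expressions are the $W_A$/$W_B$-weighted sums of the per-individual features:

```python
def interaction_weight_expressions(U_N_A, U_F_A, U_N_B, U_F_B, W_A, W_B):
    """Spatial (U_N) and temporal (U_F) weight feature expressions of the
    interaction skeleton sequence, formed as the W_A/W_B-weighted sum of
    the two individuals' features."""
    U_N = W_A * U_N_A + W_B * U_N_B    # shape (C, 1, N)
    U_F = W_A * U_F_A + W_B * U_F_B    # shape (C, F, 1)
    return U_N, U_F
```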
208. the specific interaction behaviors of the first target object and the second target object are obtained by identifying the interaction behavior expression through the trained network model, and the trained network model can identify the specific interaction behaviors of the target object with remarkable interaction behaviors in the two target objects.
Specifically, the network model in this embodiment may be a multi-layer deep graph convolution network model. Before the trained network model identifies the interactive behavior expression of the first target object and the second target object, the spatio-temporal weight U of the interaction behavior skeleton sequence of the first target object and the second target object may first be calculated:

$$U=U_N\times U_F$$

and the spatio-temporal weight is then normalized:

$$\bar{U}=f_2\bigl(\mathrm{bn}\bigl(f_1(U)\bigr)\bigr)$$

wherein $f_1(\cdot)$ indicates passing through the fully connected layer $f_1$, $\mathrm{bn}(\cdot)$ represents the normalization layer bn, and $f_2(\cdot)$ indicates passing through the fully connected layer $f_2$. Then, an expression S' of the skeleton sequence coordinates that represents the individual weights, the temporal weight and the spatial weight in the first target object and the second target object is established:

$$S'=\bar{U}\odot\tilde{S},\qquad \tilde{S}=\bigl(W_A\,\tilde{S}_A,\ W_B\,\tilde{S}_B\bigr)$$

where $\odot$ denotes element-wise multiplication. The network model is trained by taking the expression S' of the skeleton sequence as the input feature of the network model, so that the network model realizes the extraction of the preset features and the classification of the preset interaction behaviors, and a trained network model is obtained. The interaction behavior expression can then be identified through the trained network model to obtain the specific interaction behavior of the first target object and the second target object, and the trained network model can identify the specific interaction behavior of the target object whose interaction behavior is significant among the two target objects. The training process of the network model is a mature technology and may be understood in combination with the relevant description of step 104 in the embodiment of fig. 1; repeated description is omitted here.
It should be noted that, regarding the extraction and classification of interactive behavior features by the multi-layer deep graph convolution network model, the skeleton-sequence-related data carrying the individual attention weights are taken as the input features of the multi-layer deep graph convolution network model; the deep features of the skeleton-sequence-related data are extracted by the multi-layer deep graph convolution network model, and the recognition result of the model is obtained through a global pooling layer and a fully connected layer, followed by a SoftMax layer. The multi-layer deep graph convolution network model comprises nine spatio-temporal graph convolution layers; each graph convolution layer comprises a spatial convolution and a temporal convolution, each temporal convolution and spatial convolution is followed by a BatchNorm layer and a ReLU layer, and a residual mechanism is applied to each layer.
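A compact sketch of one such spatio-temporal graph convolution layer is shown below; the adjacency handling, kernel size and channel sizes are assumptions, and the block is only meant to illustrate the spatial convolution / temporal convolution / BatchNorm / ReLU / residual structure described above.

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """One spatio-temporal graph convolution layer: a spatial convolution
    over the joint graph, then a temporal convolution over the frames,
    each followed by BatchNorm and ReLU, with a residual connection.
    Input and output tensors have shape (batch, C, F, N)."""

    def __init__(self, in_ch: int, out_ch: int, A: torch.Tensor, t_kernel: int = 9):
        super().__init__()
        self.register_buffer("A", A)            # (N, N) normalized joint adjacency
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.bn_s = nn.BatchNorm2d(out_ch)
        self.temporal = nn.Conv2d(out_ch, out_ch,
                                  kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.bn_t = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.residual = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        res = self.residual(x)
        # spatial graph convolution: mix joint features through the adjacency
        y = torch.einsum("bcfn,nm->bcfm", self.spatial(x), self.A)
        y = self.relu(self.bn_s(y))
        # temporal convolution along the frame axis
        y = self.relu(self.bn_t(self.temporal(y)))
        return self.relu(y + res)
```

Nine such blocks, followed by global pooling, a fully connected layer and a SoftMax layer, would give a network of the kind described above; this sketch is not the application's exact architecture.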
In order to verify the recognition rate of the network model, this embodiment carries out experiments on the interaction behaviors in the currently most widely used skeleton dataset, "NTU RGB+D 60". The dataset contains 10347 interaction skeleton sequences covering 11 interaction categories (shown in the confusion matrix of fig. 13), performed by 40 subjects, and the skeleton of each subject contains 25 joints (as shown in fig. 7). Three depth cameras at different positions and angles shoot the interaction behavior: depth camera 2 shoots the interaction from the front, depth camera 1 from 45 degrees to the right, and depth camera 3 from 45 degrees to the left. The dataset provides two evaluation protocols: Cross-View (CV) and Cross-Subject (CS) cross-validation. For the CS protocol, the subjects are divided equally into two parts of 20 subjects each, used for training and testing respectively, with 7319 training samples and 3028 testing samples. For the CV protocol, depth camera 1 is used for testing and depth cameras 2 and 3 are used for training, with 6889 training samples and 3458 testing samples.
In this embodiment, the multi-layer deep graph convolution network model is used to extract deep features from the skeleton-sequence-related information of the different interaction behaviors, and the extracted features are used for behavior classification to obtain the recognition rate. The multi-layer deep graph convolution network model comprises 9 spatio-temporal graph convolution layers; all experiments were implemented under the PyTorch framework on an NVIDIA GeForce P4000 GPU, and the experimental results in Table 1 below were obtained:
TABLE 1
Table 1 shows an experimental result of the interactive behavior recognition method of this embodiment. From the experimental results it can be seen that, after interactive behavior recognition is performed on the skeleton sequence $\tilde{S}$ converted to the interaction-center-point coordinates, the recognition rate under the CV verification protocol is 95.16%, which is 2.75% higher than the recognition effect obtained by the technical scheme based on the original skeleton sequence S, demonstrating the effectiveness of the strategy based on interaction-center-point conversion. In addition, after the individual weight, the temporal weight and the spatial weight are taken into account, the recognition rate under the CV verification protocol reaches 96.44%, which is 1.28% higher than the recognition effect of the skeleton sequence without the individual weight, the temporal weight and the spatial weight, further illustrating the importance and effectiveness of considering the individual weight, the temporal weight and the spatial weight.
Under the CV verification protocol, the confusion matrix of the 11 interaction behavior categories is shown in fig. 13.
A comparison of the interactive behavior recognition method of this embodiment with other interactive behavior recognition methods in the prior art shows that the method provided by this embodiment achieves a higher recognition rate than the other methods under the different verification protocols.
The comparison results are shown in table 2 below:
TABLE 2
The foregoing embodiment describes the interactive behavior recognition method of the present application, and the following describes the interactive behavior recognition system of the present application, referring to fig. 3, an embodiment of the interactive behavior recognition system of the present application includes:
a determining unit 301, configured to determine a first target object and a second target object that perform interactive behavior recognition;
a building unit 302, configured to build a new coordinate system with an interaction center point of the first target object and the second target object as an origin;
a calculating unit 303, configured to calculate a new coordinate of the first target object corresponding to the first target object in the new coordinate system, and a new coordinate of the second target object corresponding to the second target object in the new coordinate system;
the identifying unit 304 is configured to identify a specific interaction behavior of the first target object and the second target object according to the new coordinates of the first target object and the new coordinates of the second target object.
The operations performed by the interactive behavior recognition system of the present embodiment are similar to those performed in the foregoing embodiment of fig. 1, and will not be described herein.
Referring to fig. 4, another embodiment of the interactive behavior recognition system of the present application includes:
A determining unit 401, configured to determine a first target object and a second target object for performing interactive behavior recognition;
a building unit 402, configured to build a new coordinate system with an interaction center point of the first target object and the second target object as an origin;
a calculating unit 403, configured to calculate a new coordinate of the first target object corresponding to the first target object in the new coordinate system, and a new coordinate of the second target object corresponding to the second target object in the new coordinate system;
and the identifying unit 404 is configured to identify a specific interaction behavior of the first target object and the second target object according to the new coordinates of the first target object and the new coordinates of the second target object.
Optionally, when the establishing unit 402 establishes a new coordinate system with the interaction center point of the first target object and the second target object as an origin, the establishing unit is specifically configured to:
recording the first target object and the second target object in an original coordinate system preset by a camera device to obtain F frames of images;
determining the three-dimensional coordinates $J_{A,i}^{n}$ of the first target object and the three-dimensional coordinates $J_{B,i}^{n}$ of the second target object contained in the images;
Wherein n, i and F are positive integers greater than 0;
Confirming three-dimensional coordinates J of interaction center point I of the first target object and the second target object in the original coordinate system I
Wherein the saidRepresenting three-dimensional coordinates of a default 1 st joint in a 1 st frame image of the first target object, wherein the 1 st joint is an individual center joint of the first target object;
the saidRepresenting three-dimensional coordinates of a default 1 st joint in a 1 st frame image of the second target object, wherein the 1 st joint is an individual center joint of the second target object;
three-dimensional coordinates J in the original coordinate system with the interaction center point I I A new coordinate system is established as the origin.
Optionally, when calculating the new coordinates of the first target object corresponding to the new coordinate system and the new coordinates of the second target object corresponding to the new coordinate system, the calculating unit 403 is specifically configured to:
determine that the skeleton sequence coordinates of the first target object in the original coordinate system are $S_A$ and the skeleton sequence coordinates of the second target object in the original coordinate system are $S_B$:

$$S_A=\bigl\{J_{A,i}^{n}\ \big|\ n=1,\dots,N;\ i=1,\dots,F\bigr\},\qquad S_B=\bigl\{J_{B,i}^{n}\ \big|\ n=1,\dots,N;\ i=1,\dots,F\bigr\}$$

wherein N is a positive integer greater than 0, $J_{A,i}^{n}$ represents the three-dimensional coordinates of the n-th joint in the i-th frame image of the first target object, and $J_{B,i}^{n}$ represents the three-dimensional coordinates of the n-th joint in the i-th frame image of the second target object;
calculate the skeleton sequence coordinates of the first target object in the new coordinate system as $\tilde{S}_A$ and calculate the skeleton sequence coordinates of the second target object in the new coordinate system as $\tilde{S}_B$:

$$\tilde{S}_A=\bigl\{J_{A,i}^{n}-J_I\bigr\},\qquad \tilde{S}_B=\bigl\{J_{B,i}^{n}-J_I\bigr\}$$

wherein $J_I$ is the three-dimensional coordinates of the interaction center point I in the original coordinate system.
Optionally, when the identifying unit 404 identifies the interaction behavior of the first target object and the second target object according to the new coordinates of the first target object and the new coordinates of the second target object, the identifying unit is specifically configured to:
providing the first target object with a first weight $W_A$ and providing the second target object with a second weight $W_B$, the sum of the first weight $W_A$ and the second weight $W_B$ being 1;
establishing a first behavioral characteristic expression of the behavioral characteristic of the first target object, wherein the first behavioral characteristic expression is equal to the new coordinate of the first target object multiplied by the first weight;
establishing a second behavior feature expression of the behavior feature of the second target object, wherein the second behavior feature expression is equal to the new coordinates of the second target object multiplied by the second weight;
Establishing an interactive behavior expression of the first target object and the second target object, wherein the interactive behavior expression is equal to a first behavior characteristic expression plus the second behavior characteristic expression;
and identifying the interactive behavior expression through a trained network model to obtain the specific interactive behavior of the first target object and the second target object, wherein the trained network model can identify the specific interactive behavior of the target object with remarkable interactive behavior in the two target objects.
Optionally, the identifying unit 404 provides the first target object with the first weight $W_A$ and provides the second target object with the second weight $W_B$, the sum of the first weight $W_A$ and the second weight $W_B$ being 1, specifically:

$$W_A=\frac{L_A}{L_A+L_B},\qquad W_B=\frac{L_B}{L_A+L_B}$$

wherein the maximum position change amount $L_A$ corresponding to the first target object and the maximum position change amount $L_B$ corresponding to the second target object are as follows:

$$L_A=\max_{n,i}\Bigl(\mathrm{norm}\bigl(\tilde{J}_{A,i}^{\,n}-\tilde{J}_{A,1}^{\,n}\bigr)\Bigr),\qquad L_B=\max_{n,i}\Bigl(\mathrm{norm}\bigl(\tilde{J}_{B,i}^{\,n}-\tilde{J}_{B,1}^{\,n}\bigr)\Bigr)$$

wherein max() takes the maximum value and norm() takes the modulus of the vector.
Optionally, the system further comprises:
the establishing unit 402 is further configured to establish a spatial weight feature expression of the first target object: $U_N^A=\mathrm{pooling}(F)\bigl(\tilde{S}_A\bigr)$;
and/or,
the establishing unit 402 is further configured to establish a temporal weight feature expression of the first target object: $U_F^A=\mathrm{pooling}(N)\bigl(\tilde{S}_A\bigr)$;
Wherein the pooling (F) represents pooling in the dimension F of the three-dimensional vector;
the pooling (N) represents pooling in the dimension N of the three-dimensional vector.
Optionally, the system further comprises:
the establishing unit 402 is further configured to establish a spatial weight feature expression of the second target object: $U_N^B=\mathrm{pooling}(F)\bigl(\tilde{S}_B\bigr)$;
and/or,
the establishing unit 402 is further configured to establish a temporal weight feature expression of the second target object: $U_F^B=\mathrm{pooling}(N)\bigl(\tilde{S}_B\bigr)$;
Wherein the pooling (F) represents pooling in the dimension F of the three-dimensional vector;
the pooling (N) represents pooling in the dimension N of the three-dimensional vector.
Optionally, the establishing unit 402 establishes an interaction behavior expression of the first target object and the second target object, where the interaction behavior expression is equal to a first behavior feature expression plus the second behavior feature expression, and specifically includes:
establishing a spatial weight feature expression $U_N$ of the interaction behavior skeleton sequence of the first target object and the second target object: $U_N=W_A\,U_N^A+W_B\,U_N^B$;
and/or,
establishing a temporal weight feature expression $U_F$ of the interaction behavior skeleton sequence of the first target object and the second target object: $U_F=W_A\,U_F^A+W_B\,U_F^B$.
Optionally, the system further comprises:
the computing unit 403 is further configured to compute a spatiotemporal weight U of an interaction behavior skeleton sequence of the first target object and the second target object:
$$U=U_N\times U_F$$
a normalization unit 405, configured to normalize the space-time weights:
$$\bar{U}=f_2\bigl(\mathrm{bn}\bigl(f_1(U)\bigr)\bigr)$$

wherein $f_1(\cdot)$ indicates passing through the fully connected layer $f_1$, $\mathrm{bn}(\cdot)$ represents the normalization layer bn, and $f_2(\cdot)$ indicates passing through the fully connected layer $f_2$;
The establishing unit 402 is further configured to establish an expression S' of the skeleton sequence that represents the individual weights, the temporal weight and the spatial weight in the first target object and the second target object:

$$S'=\bar{U}\odot\tilde{S}$$

wherein $\tilde{S}=\bigl(W_A\,\tilde{S}_A,\ W_B\,\tilde{S}_B\bigr)$ and $\odot$ denotes element-wise multiplication.
Optionally, the system further comprises:
and the training unit 406 is configured to train the network model by using the expression S' of the skeleton sequence as an input feature of the network model, so that the network model realizes extraction of a preset feature and classification of a preset interaction behavior, and obtains the trained network model.
The operations performed by the interactive behavior recognition system of the present embodiment are similar to those performed in the foregoing embodiment of fig. 2, and will not be described herein.
Referring to fig. 5, an embodiment of a computer device according to an embodiment of the present application includes:
The computer device 500 may include one or more processors (central processing units, CPU) 501 and a memory 502, and one or more applications or data are stored in the memory 502, wherein the memory 502 is volatile storage or persistent storage. The program stored in the memory 502 may include one or more modules, and each module may include a series of instruction operations on the computer device. Furthermore, the processor 501 may be configured to communicate with the memory 502 and execute, on the computer device 500, the series of instruction operations in the memory 502. The computer device 500 may also include one or more wireless network interfaces 503, one or more input/output interfaces 504, and/or one or more operating systems, such as Windows Server, Mac OS, Unix, Linux, FreeBSD, etc. The processor 501 may perform the operations performed in the embodiments shown in fig. 1 and fig. 2, and detailed description thereof is omitted here.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units. The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM, random access memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, or alternatives falling within the spirit and principles of the application.

Claims (10)

1. An interactive behavior recognition method, comprising:
determining a first target object and a second target object for interactive behavior recognition;
establishing a new coordinate system by taking an interaction center point of the first target object and the second target object as an origin;
calculating new coordinates of the first target object in the new coordinate system, and new coordinates of the second target object in the new coordinate system;
and identifying the specific interaction behavior of the first target object and the second target object according to the new coordinates of the first target object and the new coordinates of the second target object.
2. The interactive behavior recognition method according to claim 1, wherein establishing a new coordinate system with an interaction center point of the first target object and the second target object as an origin comprises:
recording the first target object and the second target object in an original coordinate system preset by a depth camera to obtain F frames of images;
determining the three-dimensional coordinates of the first target object and the three-dimensional coordinates of the second target object contained in the images;
Wherein n, i and F are positive integers greater than 0;
confirming the three-dimensional coordinates J_I of the interaction center point I of the first target object and the second target object in the original coordinate system;
wherein J^A_{1,1} represents the three-dimensional coordinates of the default 1st joint in the 1st frame image of the first target object, and the 1st joint is the individual center joint of the first target object;
J^B_{1,1} represents the three-dimensional coordinates of the default 1st joint in the 1st frame image of the second target object, and the 1st joint is the individual center joint of the second target object;
and establishing a new coordinate system with the three-dimensional coordinates J_I of the interaction center point I in the original coordinate system as the origin.
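A minimal illustrative sketch of the computation described in claim 2 follows (NumPy, hypothetical function and variable names). The published formula for J_I is rendered as an image and is not reproduced in this text, so taking the midpoint of the two individual center joints in the first frame is an assumption.

```python
import numpy as np

def interaction_center(skel_a: np.ndarray, skel_b: np.ndarray) -> np.ndarray:
    """skel_a, skel_b: (F, N, 3) joint coordinates of the two target objects
    recorded in the original (depth-camera) coordinate system."""
    j_a_11 = skel_a[0, 0]           # 1st joint (individual center joint), 1st frame, first object
    j_b_11 = skel_b[0, 0]           # 1st joint (individual center joint), 1st frame, second object
    return (j_a_11 + j_b_11) / 2.0  # assumed midpoint -> coordinates J_I of interaction center point I
```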
3. The interactive behavior recognition method according to claim 2, wherein calculating the new coordinates of the first target object in the new coordinate system and the new coordinates of the second target object in the new coordinate system comprises:
determining that the skeleton sequence coordinates of the first target object in the original coordinate system are S_A, and that the skeleton sequence coordinates of the second target object in the original coordinate system are S_B;
wherein N is a positive integer greater than 0, J^A_{n,i} represents the three-dimensional coordinates of the nth joint in the ith frame image of the first target object, and J^B_{n,i} represents the three-dimensional coordinates of the nth joint in the ith frame image of the second target object;
and calculating the skeleton sequence coordinates of the first target object in the new coordinate system and the skeleton sequence coordinates of the second target object in the new coordinate system;
wherein J_I is the three-dimensional coordinates of the interaction center point I in the original coordinate system.
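The transformation formulas of claim 3 are published as images; a plausible sketch, assuming the new coordinate system is obtained by a pure translation of the origin to J_I (no rotation), is:

```python
import numpy as np

def to_interaction_frame(skel: np.ndarray, j_i: np.ndarray) -> np.ndarray:
    """skel: (F, N, 3) skeleton sequence in the original coordinate system;
    j_i: (3,) coordinates of the interaction center point I."""
    return skel - j_i  # broadcast subtraction over every frame and joint
```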
4. The interactive behavior recognition method according to claim 3, wherein recognizing the interactive behavior of the first target object and the second target object based on the new coordinates of the first target object and the new coordinates of the second target object comprises:
providing the first target object with a first weight W_A and providing the second target object with a second weight W_B, wherein the sum of the first weight W_A and the second weight W_B is 1;
establishing a first behavior feature expression of the behavior feature of the first target object, wherein the first behavior feature expression is equal to the new coordinates of the first target object multiplied by the first weight;
establishing a second behavior feature expression of the behavior feature of the second target object, wherein the second behavior feature expression is equal to the new coordinates of the second target object multiplied by the second weight;
establishing an interactive behavior expression of the first target object and the second target object, wherein the interactive behavior expression is equal to the first behavior feature expression plus the second behavior feature expression;
and identifying the interactive behavior expression through a trained network model to obtain the specific interactive behavior of the first target object and the second target object, wherein the trained network model is capable of identifying the specific interactive behavior of whichever of the two target objects exhibits the more significant interactive behavior.
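A minimal sketch of the weighting and combination steps of claim 4 (NumPy, hypothetical names); the classifier itself is left abstract, since the architecture is not specified in this text.

```python
import numpy as np

def interactive_expression(new_a: np.ndarray, new_b: np.ndarray,
                           w_a: float, w_b: float) -> np.ndarray:
    """new_a, new_b: (F, N, 3) skeleton sequences in the new coordinate system,
    with w_a + w_b == 1."""
    feat_a = w_a * new_a    # first behavior feature expression
    feat_b = w_b * new_b    # second behavior feature expression
    return feat_a + feat_b  # interactive behavior expression fed to the trained network model
```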
5. The interactive behavior recognition method according to claim 4, wherein providing the first target object with the first weight W_A and providing the second target object with the second weight W_B, the sum of the first weight W_A and the second weight W_B being 1, specifically comprises:
W_A = L_A / (L_A + L_B)
W_B = L_B / (L_A + L_B)
wherein the maximum position change amount expression L_A corresponding to the first target object and the maximum position change amount expression L_B corresponding to the second target object are as follows:
wherein max(·) represents taking a maximum value and norm(·) represents the modulus of a vector.
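The expressions for L_A and L_B are published as images, so the sketch below only assumes one natural reading: L is the largest Euclidean norm of any joint's frame-to-frame displacement. The function names are hypothetical.

```python
import numpy as np

def max_position_change(skel: np.ndarray) -> float:
    """skel: (F, N, 3) skeleton sequence; returns the assumed maximum position change L."""
    disp = np.diff(skel, axis=0)                       # (F-1, N, 3) frame-to-frame displacement
    return float(np.linalg.norm(disp, axis=-1).max())  # max over all joints and frames

def interaction_weights(skel_a: np.ndarray, skel_b: np.ndarray) -> tuple:
    l_a, l_b = max_position_change(skel_a), max_position_change(skel_b)
    w_a = l_a / (l_a + l_b)   # W_A = L_A / (L_A + L_B)
    return w_a, 1.0 - w_a     # W_B = L_B / (L_A + L_B)
```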
6. The interactive behavior recognition method of claim 5, wherein prior to establishing the first behavior feature expression of the behavior feature of the first target object, the method further comprises:
establishing a space weight characteristic expression of the first target object
and/or,
establishing a time weight characteristic expression of the first target object
Wherein the pooling (F) represents pooling in the dimension F of the three-dimensional vector;
the pooling (N) represents pooling in the dimension N of the three-dimensional vector.
7. The interactive behavior recognition method of claim 6, wherein prior to establishing a second behavioral characteristic expression of the behavioral characteristic of the second target object, the method further comprises:
establishing a space weight characteristic expression of the second target object
and/or,
establishing a time weight characteristic expression of the second target object
Wherein the pooling (F) represents pooling in the dimension F of the three-dimensional vector;
the pooling (N) represents pooling in the dimension N of the three-dimensional vector.
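For the pooling operations in claims 6 and 7, the following is a hedged sketch; the (F, N, 3) tensor layout and the use of mean pooling are assumptions, since the published expressions are images and are not reproduced in this text.

```python
import numpy as np

def spatial_weight_feature(skel: np.ndarray) -> np.ndarray:
    """pooling(F): pool over the frame dimension F -> per-joint feature, shape (N, 3)."""
    return skel.mean(axis=0)

def temporal_weight_feature(skel: np.ndarray) -> np.ndarray:
    """pooling(N): pool over the joint dimension N -> per-frame feature, shape (F, 3)."""
    return skel.mean(axis=1)
```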
8. The interactive behavior recognition method according to claim 7, wherein establishing the interactive behavior expression of the first target object and the second target object, the interactive behavior expression being equal to the first behavior feature expression plus the second behavior feature expression, comprises:
establishing a space weight characteristic expression U_N of the interactive behavior skeleton sequence of the first target object and the second target object
and/or,
establishing a time weight characteristic expression U_F of the interactive behavior skeleton sequence of the first target object and the second target object.
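The formulas for U_N and U_F in claim 8 are published as images; one hedged sketch, assuming the two objects' pooled features are combined with the same weights W_A and W_B used above, is:

```python
import numpy as np

def combined_weight_features(skel_a: np.ndarray, skel_b: np.ndarray,
                             w_a: float, w_b: float):
    """skel_a, skel_b: (F, N, 3) skeleton sequences in the new coordinate system."""
    u_n = w_a * skel_a.mean(axis=0) + w_b * skel_b.mean(axis=0)  # space weight U_N, shape (N, 3)
    u_f = w_a * skel_a.mean(axis=1) + w_b * skel_b.mean(axis=1)  # time weight U_F, shape (F, 3)
    return u_n, u_f
```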
9. The interactive behavior recognition method of claim 8, wherein prior to recognition of the interactive behavior expression by the trained network model, the method further comprises:
calculating a space-time weight U of an interactive behavior skeleton sequence of the first target object and the second target object:
U = U_N × U_F
normalizing the space-time weights:
wherein f_1(·) denotes processing by the fully connected layer f_1, bn(·) denotes the normalization layer bn, and f_2(·) denotes processing by the fully connected layer f_2;
establishing an expression S' of skeleton sequence coordinates reflecting the individual weights, the time weights, and the space weights of the first target object and the second target object:
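The formulas attached to claim 9 are published as images and do not survive in this text; one plausible reading, consistent with the symbols defined above and offered only as an assumption, is:

```latex
% Assumed reconstruction only; the original published formulas are images.
\begin{aligned}
U       &= U_N \times U_F,\\
\hat{U} &= f_2\bigl(\mathrm{bn}\bigl(f_1(U)\bigr)\bigr),\\
S'      &= \hat{U} \odot \bigl(W_A\,\tilde{S}_A + W_B\,\tilde{S}_B\bigr),
\end{aligned}
```

where \hat{U} is the normalized space-time weight, \odot denotes element-wise multiplication, and \tilde{S}_A, \tilde{S}_B denote the skeleton sequence coordinates of the two target objects in the new coordinate system (the hat and tilde notation is introduced here for illustration only).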
10. The interactive behavior recognition method of claim 9, wherein prior to recognition of the interactive behavior expression by the trained network model, the method further comprises:
training the network model by taking the expression S' of the skeleton sequence coordinates as the input feature of the network model, so that the network model can extract preset features and classify preset interactive behaviors, thereby obtaining the trained network model.
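For claim 10, the text does not disclose a concrete network architecture or training procedure, so the sketch below only illustrates training some classifier on the weighted skeleton expression S'; the two-layer MLP, the optimizer, and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

def build_model(num_frames: int, num_joints: int, num_classes: int) -> nn.Module:
    # Placeholder classifier over the weighted skeleton expression S'.
    return nn.Sequential(
        nn.Flatten(),                                  # (B, F, N, 3) -> (B, F*N*3)
        nn.Linear(num_frames * num_joints * 3, 256),
        nn.ReLU(),
        nn.Linear(256, num_classes),
    )

def train(model: nn.Module, loader, epochs: int = 10) -> nn.Module:
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for s_prime, label in loader:  # s_prime: batch of S' tensors; label: interaction class index
            loss = loss_fn(model(s_prime), label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```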
CN202310998576.0A 2023-08-09 2023-08-09 Interactive behavior recognition method Pending CN117011941A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310998576.0A CN117011941A (en) 2023-08-09 2023-08-09 Interactive behavior recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310998576.0A CN117011941A (en) 2023-08-09 2023-08-09 Interactive behavior recognition method

Publications (1)

Publication Number Publication Date
CN117011941A true CN117011941A (en) 2023-11-07

Family

ID=88570817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310998576.0A Pending CN117011941A (en) 2023-08-09 2023-08-09 Interactive behavior recognition method

Country Status (1)

Country Link
CN (1) CN117011941A (en)

Similar Documents

Publication Publication Date Title
Lim et al. Isolated sign language recognition using convolutional neural network hand modelling and hand energy image
Joo et al. Panoptic studio: A massively multiview system for social motion capture
CN110135249B (en) Human behavior identification method based on time attention mechanism and LSTM (least Square TM)
WO2020225562A1 (en) Processing captured images
Yu et al. A discriminative deep model with feature fusion and temporal attention for human action recognition
CN109359514B (en) DeskVR-oriented gesture tracking and recognition combined strategy method
Nie et al. Cross-view action recognition by cross-domain learning
Elhayek et al. Fully automatic multi-person human motion capture for vr applications
Li et al. Hmor: Hierarchical multi-person ordinal relations for monocular multi-person 3d pose estimation
CN113378649A (en) Identity, position and action recognition method, system, electronic equipment and storage medium
CN112906520A (en) Gesture coding-based action recognition method and device
Fei et al. Flow-pose Net: An effective two-stream network for fall detection
Wang et al. Micro-expression recognition based on 2D-3D CNN
CN113807287A (en) 3D structured light face recognition method
Rius et al. Action-specific motion prior for efficient Bayesian 3D human body tracking
Choi et al. Comparing strategies for 3D face recognition from a 3D sensor
CN117011941A (en) Interactive behavior recognition method
Shu et al. The research and implementation of human posture recognition algorithm via OpenPose
Im et al. Distributed Spatial Transformer for Object Tracking in Multi-Camera
Esmaeili et al. Automatic micro-expression recognition using LBP-SIPl and FR-CNN
Le 3-D human pose estimation in traditional martial art videos
Fransen et al. Integrating vision for human-robot interaction
Quan et al. Multi-view 3d human pose tracking based on evolutionary robot vision
Wang et al. 3D-2D spatiotemporal registration for sports motion analysis
Rousseau et al. Pattern Recognition, Computer Vision, and Image Processing. ICPR 2022 International Workshops and Challenges: Montreal, QC, Canada, August 21–25, 2022, Proceedings, Part I

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination