CN113158782B - Multi-person concurrent interaction behavior understanding method based on single-frame image - Google Patents

Multi-person concurrent interaction behavior understanding method based on single-frame image

Info

Publication number
CN113158782B
CN113158782B
Authority
CN
China
Prior art keywords
skeleton
frame
interaction
network
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110259862.6A
Other languages
Chinese (zh)
Other versions
CN113158782A (en)
Inventor
王振华
周瑾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110259862.6A priority Critical patent/CN113158782B/en
Publication of CN113158782A publication Critical patent/CN113158782A/en
Application granted granted Critical
Publication of CN113158782B publication Critical patent/CN113158782B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Abstract

A multi-person concurrent interaction behavior understanding method based on a single-frame image comprises the following steps: 1) inputting a picture and combining skeleton estimation with a multi-target tracking algorithm to obtain human skeleton data and regions of interest; 2) generating a skeleton component confidence map and a component affinity field from the human skeleton data and constructing an attention map; 3) defining a Resnet-Attention network based on human skeleton attention; 4) defining a dual-stream network for understanding multi-person interaction behaviors; 5) training the network parameters. The proposed algorithm uses the attention map to enhance the convolutional features of the RGB image and extracts two-person interaction features based on human skeleton data and a shift graph convolution network, thereby modeling the multi-person interaction behaviors in a single-frame image and obtaining an effective interaction behavior representation. The method is suitable for understanding concurrent interactions among multiple people in a single-frame image.

Description

Multi-person concurrent interaction behavior understanding method based on single-frame image
Technical Field
The invention belongs to the field of image understanding in computer vision, and relates to a multi-person interaction understanding method.
Background
To build new smart cities, protect personal safety and reduce property loss, the monitoring capability of cameras in public places needs to be improved, so that crowd behaviors in a surveillance scene can be automatically and accurately recognized and understood from video data, enabling intelligent computer-assisted analysis and real-time networked early warning of key events. Multi-person interaction understanding requires automatically recognizing, from videos or images, the interaction relationships and interaction behavior categories between people, such as "boxing", "kicking", "pushing" and "skimming". Existing techniques face two problems: the first is modality limitation, i.e., they usually rely on single-modality information, which is insufficient for understanding complex human interactions; the second is modality missing, i.e., in interactive scenes, regions of interest may be missing because of human body occlusion.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-person concurrent interaction behavior understanding method based on a single frame image, which can effectively identify the interaction relationship and interaction behavior category between every two persons in a multi-person scene.
The technical scheme adopted for solving the technical problems is as follows:
a multi-person concurrent interaction behavior understanding method based on a single frame image, the method comprising the steps of:
1) Inputting a picture, and combining skeleton estimation and a multi-target tracking algorithm to obtain human skeleton data and regions of interest;
2) Generating a skeleton component confidence map and a component affinity field using the human skeleton data, constructing an attention map;
3) Defining a Resnet-Attention network based on human skeleton Attention;
4) Defining a dual-stream network for multi-person interaction behavior understanding
Starting from multi-modal information and the attention mechanism, a dual-stream network model is proposed: the first stream is a Resnet-Attention network based on human skeleton attention and extracts enhanced RGB features; the second stream works on skeleton data and uses a shift graph convolution network, currently among the best-performing methods for behavior recognition, to extract accurate skeleton features.
Further, in step 1), the region of interest refers to a human body bounding box. Accurate computation of the region of interest is the basis for extracting interaction behavior features, and it is computed by combining a skeleton estimation algorithm with a multi-target tracking algorithm: AlphaPose extracts human skeletons from the original image and outputs the corresponding human body bounding boxes, called skeleton human boxes; meanwhile, FairMOT tracks the human bodies in the video, yielding a human body bounding box for each person in a given frame, called a tracking human box. The skeleton human box fits the actual human body closely, whereas the tracking human box often leaves the limbs outside the bounding box; on the other hand, human skeleton estimation may fail in complex scenes with severe occlusion or unusual human poses, while tracking human boxes are less frequently missing.
Further, the obtained human skeletons and human boxes need to be matched against the annotation data to obtain ordered human skeleton data and regions of interest. The ordered data comprise: human skeleton, skeleton human box, tracking human box, region of interest, interaction group index, interaction group action label, and single-person action label, computed by the following steps:
1.1) Extract human skeletons with the AlphaPose algorithm and output the skeleton human boxes;
1.2) Extract tracking human boxes with the FairMOT algorithm;
1.3) Compute the true action label and interaction group index of each bounding box from the skeleton human boxes and tracking human boxes obtained in 1.1) and 1.2) and the annotation data; the annotation data comprise the human boxes, the interaction group data and the action labels of the interaction targets, and are matched to the tracking boxes as follows: for any tracking box B, find the annotated bounding box Bmax with the largest intersection-over-union (IoU) with B; if Bmax exists and the corresponding IoU is larger than 0.5, Bmax is considered matched to B, and the action label and interaction group index of Bmax are assigned to the tracking box B;
1.4) Fuse the skeleton box and the tracking box to obtain a fused box; the fusion rules are as follows:
1.4.1) When both the skeleton box and the tracking box exist: compute the IoU ρ of the skeleton box and the tracking box; when ρ is larger than 0.3, take the smaller of the two bounding boxes as the fused box; otherwise, take the skeleton box as the fused box;
1.4.2) When the skeleton box exists and the tracking box is missing: take the skeleton box as the fused box;
1.4.3) When the skeleton box is missing and the tracking box exists: if ρ is larger than 0.3, take the tracking box as the fused box; otherwise, there is no fused box;
1.4.4) When both the skeleton box and the tracking box are missing: there is no fused box;
Subsequent model training and testing use the fused boxes.
Still further, in step 2), to enhance the RGB features, the invention adopts the data form of the skeleton feature maps in the OpenPose algorithm and splits the attention map into a component confidence map and a component affinity field;
Component confidence map C: for the skeleton sequence V = {v_i | i = 1, …, K} estimated by the AlphaPose algorithm, a confidence map is computed for each joint point, where v_i = (x_i, y_i) denotes one of the key skeleton points of the human skeleton. A confidence map is computed for each joint coordinate with a Gaussian blur, where σ is the Gaussian blur threshold and is set to 0.5; the confidence at position (x, y) of the feature map for the k-th skeleton point is:

C_k(x, y) = exp(−((x − x_k)² + (y − y_k)²) / σ²)    (1)
Component affinity field P: limb orientation is represented with component affinity fields. Key joint points are connected according to the natural connection pattern of the human body; concretely, an affinity field, called the component affinity field, is computed for the component between every two connected joint points. Let the coordinates of the starting joint point s of a component be (x_s, y_s) and the coordinates of the ending joint point e be (x_e, y_e). At every pixel of this component region, a 2D vector pointing from the starting joint point to the ending joint point is encoded, and the orthogonal distance threshold τ of the connection between the two joint points is set to 0.5. Then, for the channel where s is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

P_s(x, y) = ((x_e − x_s), (y_e − y_s)) / L_{e,s}    (2)
Above, L_{e,s} is the Euclidean distance between joint point e and joint point s. For the channel where e is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

P_e(x, y) = ((x_s − x_e), (y_s − y_e)) / L_{e,s}    (3)
The two-person skeleton considered here has K joint points and M components in total, yielding K channels of joint confidence maps {C_i | i = 1, …, K} and M channels of component affinity fields {P_j | j = 1, …, M}. The K channels of joint confidence maps are superimposed to obtain the component confidence map C; the M channels of component affinity fields are superimposed to obtain the component affinity field P; and the component confidence map and the component affinity field are added to obtain the component features F_c of size W_s × H_s, where W_s = 320 and H_s = 240. Finally, a 1 × 1 convolution is applied to F_c to output a skeleton attention map F_a of the same size.
In step 3), to extract image features, ResNet50 with its final fully-connected layer removed is taken as the backbone network. Since the feature maps output by the last two residual modules of ResNet50 have 2048 and 1024 channels respectively and easily cause overfitting on a small-scale interaction behavior dataset, 1 × 1 convolutions are used to reduce the numbers of output feature channels to 1024 and 512 respectively. To obtain a multi-scale behavior representation, the outputs of these two convolution stages of the backbone are used jointly to obtain the spatial pyramid features F_b.
The Resnet-Attention network uses the skeleton attention map F_a to enhance the image features F_b. F_a has size 240 × 320, and the two levels of the spatial pyramid features F_b have sizes 512 × 15 × 20 and 1024 × 8 × 10. To fuse the image features with the skeleton attention, the two pyramid levels are expanded to the same scale as the skeleton features by bilinear interpolation, and the expanded feature maps are stacked along the channel dimension, finally yielding a unified feature map of dimensions 1536 × 240 × 320;
To further enhance the feature map F_b, the Resnet-Attention network computes the Hadamard product of F_a and F_b to obtain the enhanced feature map F_action:
F_action = F_a ⊙ F_b    (4)
where ⊙ denotes the element-wise (Hadamard) product.
The main pipeline for extracting two-person interaction behavior features with the Resnet-Attention network, together with a result visualization, is shown in Fig. 1. The network input is an RGB picture and the human boxes. First, the backbone network extracts the RGB image features; then the component features are computed from the human skeletons, and the skeleton attention map is obtained through a 1 × 1 convolution and a Sigmoid activation function; the RGB features and the skeleton attention map are combined by a Hadamard product to obtain the enhanced features; finally, the two-person interaction features are extracted based on the enhanced features and the target bounding boxes.
Preferably, the two-person interaction features are extracted as follows: compute the minimum enclosing box MEB that contains the two-person bounding boxes; feed the MEB and the enhanced features into a RoIAlign module, which outputs the MEB region features F_inter of size 1536 × 5 × 5; next, flatten F_inter into a vector of length 38400 (1536 × 5 × 5); apply LayerNorm, ReLU, Dropout (0.9) and FC operations to the vector in sequence to obtain a feature vector of length 512; finally, compute the L-dimensional interaction behavior classification score F_rgb with an FC layer, where L denotes the number of interaction behavior categories.
Further, in step 4), the dual-stream network comprises two branches: the RGB feature extraction branch outputs the interaction behavior classification score F_rgb, and the shift graph convolution branch uses a customized Shift-GCN network to compute the score F_gcn. The original Shift-GCN requires the input to be a video-based skeleton sequence and outputs the action classification result for the whole video sequence; because Shift-GCN targets video behavior classification, which differs from the present task, it cannot directly compute the interaction behavior classification score F_gcn for an arbitrary pair of persons in a single-frame image. The shift graph convolution network is therefore modified to be compatible with single-frame data and to handle the two-person interaction behavior classification problem; the modifications include: removing the temporal shift convolution, and changing the single-person skeleton sequence input into a two-person single-frame skeleton input;
Given the two-person skeleton data S, where V is the number of joint points of the two-person skeleton and C = 2 is the number of channels (each coordinate component occupies one channel), the shift graph convolution network outputs the interaction behavior classification score vector F_gcn:
F_gcn = ShiftGCN(S)    (5)
Finally, the dual-stream network fuses F_rgb and F_gcn:
F_fused = F_rgb + F_gcn    (6).
still further, the method comprises the steps of:
5) Network parameter training
During network parameter training, the training configurations adopted for the skeleton attention (Resnet-Attention) network and for the shift graph convolution network are different and are described in two parts:
training uses a graphics card: GTX 1080Ti
For the training parameters of the skeleton attention (Resnet-Attention) network, refer to Table 1:
TABLE 1
For the training parameters of the shift graph convolution network, refer to Table 2:
TABLE 2
Because the number of people differs greatly across pictures, different training parameters are set for different numbers of people. When the number of interaction targets in a picture is N = {x | 2 ≤ x ≤ 5}, the above parameters are used for training. When the number of interaction targets in a picture is N = {x | 5 < x ≤ 15}, the larger number of people makes the number of interaction groups grow sharply, and the proportion of interaction groups without an action category is high; a maximum distance Max_dis is therefore set during training, the distance dis between the center points of two target boxes in the picture is computed, and when dis ≥ Max_dis the interaction group is judged to have no action-category interaction. Because the category proportions within the interaction groups are unbalanced, unbalanced class weights are added to the loss functions of the two networks, where 'OT' denotes the no-interaction category;
When the number of people N in a picture is in {2, 3, 4, 5}, the combination formula M = N(N − 1)/2 gives the number of interaction groups M ∈ {1, 2, …, 10}. The network input consists of the local images of the N persons and the M two-person region images of the picture; however, because the number of persons contained in different pictures is not fixed, the amounts of input data differ. A padding operation is therefore adopted so that every frame fed into the network model carries the same amount of data: the upper limit on the number of people is set to Max_num = 5, with the corresponding number of two-person groups Max_interaction_num = 10, and padding is applied when there are fewer people: the action categories of the padded data are all −1, and the human bounding boxes are all [0,0];
When the number of people N in a picture is in {5, 6, …, 15}, the combination formula gives the number of interaction groups M ∈ {10, 11, …, 105}; the upper limit on the number of people is set to Max_num = 15, with the corresponding number of two-person groups Max_interaction_num = 105. The true interaction groups then account for less than 5% of the total number of groups, so the proportion of true interaction groups is extremely unbalanced; on the other hand, the differences in the number of people would cause excessive padding: for example, for a scene containing 5 people, 95 groups would have to be padded. To solve these two problems, a maximum-distance elimination method is adopted: a distance threshold Max_dis is set, the Euclidean distance dis between the center points of two human bounding boxes in the image is computed, and when dis ≥ Max_dis the pair is judged to have no interaction behavior. The maximum-distance elimination reduces the number of two-person groups fed into the network model to within 36 groups; in addition, unbalanced class weights are added to the loss function used for training each model.
Here, OT represents any human behavior other than the normal interaction behaviors.
In summary, the method first derives regions of interest from a skeleton estimation algorithm and a multi-target tracking algorithm; the regions of interest are used to crop local region features from the feature maps. Second, a dual-stream network based on skeleton attention is proposed: one branch extracts the two-person RGB image features and enhances them with the human skeletons; the other branch uses the human skeleton data and a shift graph convolution network to extract two-person interaction features. The multi-person interaction behaviors in a single-frame image are thereby modeled, and a more effective interaction behavior representation is obtained.
The beneficial effects of the invention are mainly as follows: the skeleton attention map enhances the convolutional features of the RGB image, and two-person interaction features are extracted based on human skeleton data and a shift graph convolution network, so that the multi-person interaction behaviors in a single-frame image are modeled and an effective interaction behavior representation is obtained. The method is suitable for understanding concurrent interactions among multiple people in a single-frame image.
Drawings
Fig. 1 is a schematic diagram of a two-person interaction feature extraction based on skeletal attention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, a multi-person concurrent interaction behavior understanding method based on a single frame image includes the steps of:
1) Inputting a picture, and combining skeleton estimation and a multi-target tracking algorithm to obtain human skeleton data and regions of interest;
The region of interest refers to a human body bounding box. Accurate computation of the region of interest is the basis for extracting interaction behavior features, and it is computed by combining a skeleton estimation algorithm with a multi-target tracking algorithm: AlphaPose extracts human skeletons from the original image and outputs the corresponding human body bounding boxes, called skeleton human boxes; meanwhile, FairMOT tracks the human bodies in the video, yielding a human body bounding box for each person in a given frame, called a tracking human box. The skeleton human box fits the actual human body closely, whereas the tracking human box often leaves the limbs outside the bounding box; on the other hand, human skeleton estimation may fail in complex scenes with severe occlusion or unusual human poses, while tracking human boxes are less frequently missing.
Further, the obtained human skeletons and human boxes need to be matched against the annotation data to obtain ordered human skeleton data and regions of interest. The ordered data comprise: human skeleton, skeleton human box, tracking human box, region of interest, interaction group index, interaction group action label, and single-person action label, computed by the following steps:
1.1) Extract human skeletons with the AlphaPose algorithm and output the skeleton human boxes;
1.2) Extract tracking human boxes with the FairMOT algorithm;
1.3) Compute the true action label and interaction group index of each bounding box from the skeleton human boxes and tracking human boxes obtained in 1.1) and 1.2) and the annotation data; the annotation data comprise the human boxes, the interaction group data and the action labels of the interaction targets, and are matched to the tracking boxes as follows: for any tracking box B, find the annotated bounding box Bmax with the largest intersection-over-union (IoU) with B; if Bmax exists and the corresponding IoU is larger than 0.5, Bmax is considered matched to B, and the action label and interaction group index of Bmax are assigned to the tracking box B;
1.4) Fuse the skeleton box and the tracking box to obtain a fused box; the fusion rules, illustrated by the code sketch after this list, are as follows:
1.4.1) When both the skeleton box and the tracking box exist: compute the IoU ρ of the skeleton box and the tracking box; when ρ is larger than 0.3, take the smaller of the two bounding boxes as the fused box; otherwise, take the skeleton box as the fused box;
1.4.2) When the skeleton box exists and the tracking box is missing: take the skeleton box as the fused box;
1.4.3) When the skeleton box is missing and the tracking box exists: if ρ is larger than 0.3, take the tracking box as the fused box; otherwise, there is no fused box;
1.4.4) When both the skeleton box and the tracking box are missing: there is no fused box;
Subsequent model training and testing use the fused boxes.
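The box matching and box fusion logic above can be summarized by the following minimal Python sketch. It assumes boxes in [x1, y1, x2, y2] pixel format and uses the 0.5 and 0.3 thresholds from the text; all function names are illustrative, and because rule 1.4.3 refers to ρ although the skeleton box needed to compute it is absent, the sketch simply keeps the tracking box in that case.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_label(tracking_box, annotated_boxes, labels, iou_thresh=0.5):
    """Rule 1.3: assign the label of the annotated box with the largest IoU, if that IoU > 0.5."""
    if not annotated_boxes:
        return None
    ious = [iou(tracking_box, b) for b in annotated_boxes]
    k = int(np.argmax(ious))
    return labels[k] if ious[k] > iou_thresh else None

def fuse_boxes(skeleton_box, tracking_box, rho_thresh=0.3):
    """Rules 1.4.1-1.4.4: prefer the tighter skeleton box, fall back to the tracking box."""
    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])
    if skeleton_box is not None and tracking_box is not None:
        rho = iou(skeleton_box, tracking_box)
        if rho > rho_thresh:
            return min(skeleton_box, tracking_box, key=area)  # the smaller bounding box
        return skeleton_box
    if skeleton_box is not None:   # tracking box missing -> keep the skeleton box
        return skeleton_box
    if tracking_box is not None:   # skeleton box missing -> keep the tracking box (see lead-in)
        return tracking_box
    return None                    # both missing -> no fused box
```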
2) Generating a skeleton component confidence map and a component affinity field using the human skeleton data, constructing an attention map;
In order to enhance the RGB features, the invention adopts the data form of the skeleton feature maps in the OpenPose algorithm and splits the attention map into a component confidence map and a component affinity field;
Component confidence map C: for the skeleton sequence V = {v_i | i = 1, …, K} estimated by the AlphaPose algorithm, a confidence map is computed for each joint point, where v_i = (x_i, y_i) denotes one of the key skeleton points of the human skeleton. A confidence map is computed for each joint coordinate with a Gaussian blur, where σ is the Gaussian blur threshold and is set to 0.5; the confidence at position (x, y) of the feature map for the k-th skeleton point is:

C_k(x, y) = exp(−((x − x_k)² + (y − y_k)²) / σ²)    (1)
Component affinity field P: limb orientation is represented with component affinity fields. Key joint points are connected according to the natural connection pattern of the human body; concretely, an affinity field, called the component affinity field, is computed for the component between every two connected joint points. Let the coordinates of the starting joint point s of a component be (x_s, y_s) and the coordinates of the ending joint point e be (x_e, y_e). At every pixel of this component region, a 2D vector pointing from the starting joint point to the ending joint point is encoded, and the orthogonal distance threshold τ of the connection between the two joint points is set to 0.5. Then, for the channel where s is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

P_s(x, y) = ((x_e − x_s), (y_e − y_s)) / L_{e,s}    (2)
Above, L_{e,s} is the Euclidean distance between joint point e and joint point s. For the channel where e is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

P_e(x, y) = ((x_s − x_e), (y_s − y_e)) / L_{e,s}    (3)
The two-person skeleton considered here has K joint points and M components in total, yielding K channels of joint confidence maps {C_i | i = 1, …, K} and M channels of component affinity fields {P_j | j = 1, …, M}. The K channels of joint confidence maps are superimposed to obtain the component confidence map C; the M channels of component affinity fields are superimposed to obtain the component affinity field P; and the component confidence map and the component affinity field are added to obtain the component features F_c of size W_s × H_s, where W_s = 320 and H_s = 240. Finally, a 1 × 1 convolution is applied to F_c to output a skeleton attention map F_a of the same size.
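As a non-authoritative illustration, the following numpy sketch builds the component confidence map, the component affinity field, and their sum F_c for a toy limb at the 240 × 320 resolution used above, with σ = 0.5 and τ = 0.5. The Gaussian and unit-vector formulas follow the OpenPose-style construction assumed in equations (1)–(3); the 1 × 1 convolution and Sigmoid that produce F_a belong to the network and are not shown.

```python
import numpy as np

H, W = 240, 320                                   # H_s x W_s resolution of the attention map

def confidence_map(joint, sigma=0.5):
    """Gaussian bump centred on one joint point (x_k, y_k), cf. eq. (1)."""
    ys, xs = np.mgrid[0:H, 0:W]
    x, y = joint
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / sigma ** 2)

def affinity_field(start, end, tau=0.5):
    """Unit vector from `start` to `end` on pixels within orthogonal distance tau of the part."""
    s, e = np.asarray(start, float), np.asarray(end, float)
    length = np.linalg.norm(e - s) + 1e-9                     # L_{e,s}
    u = (e - s) / length                                      # component direction
    ys, xs = np.mgrid[0:H, 0:W]
    d = np.stack([xs - s[0], ys - s[1]], axis=-1)             # offsets from the start joint
    along = d @ u                                             # projection along the component
    ortho = np.abs(d[..., 0] * u[1] - d[..., 1] * u[0])       # orthogonal distance to the component
    field = np.zeros((H, W, 2))
    field[(along >= 0) & (along <= length) & (ortho <= tau)] = u
    return field

# Toy two-joint component: superimpose confidence maps and affinity fields, then add them (F_c).
C = confidence_map((160, 120)) + confidence_map((200, 140))
P = affinity_field((160, 120), (200, 140)).sum(axis=-1)       # collapse the 2 vector channels
F_c = C + P   # a 1x1 convolution and a Sigmoid then turn F_c into the attention map F_a
```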
3) Defining a Resnet-Attention network based on human skeleton Attention;
To extract image features, ResNet50 with its final fully-connected layer removed is taken as the backbone network. Since the feature maps output by the last two residual modules of ResNet50 have 2048 and 1024 channels respectively and easily cause overfitting on a small-scale interaction behavior dataset, 1 × 1 convolutions are used to reduce the numbers of output feature channels to 1024 and 512 respectively. To obtain a multi-scale behavior representation, the outputs of these two convolution stages of the backbone are used jointly to obtain the spatial pyramid features F_b.
The Resnet-Attention network uses the skeleton attention map F_a to enhance the image features F_b. F_a has size 240 × 320, and the two levels of the spatial pyramid features F_b have sizes 512 × 15 × 20 and 1024 × 8 × 10. To fuse the image features with the skeleton attention, the two pyramid levels are expanded to the same scale as the skeleton features by bilinear interpolation, and the expanded feature maps are stacked along the channel dimension, finally yielding a unified feature map of dimensions 1536 × 240 × 320;
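A hedged PyTorch sketch of this backbone is given below: ResNet50 without its classification layer, 1 × 1 convolutions reducing the last two stages to 1024 and 512 channels, bilinear upsampling of both maps to 240 × 320, and channel-wise concatenation into the 1536-channel pyramid F_b. The use of torchvision's resnet50 and the exact wiring are assumptions made for illustration, not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class PyramidBackbone(nn.Module):
    """ResNet50 trunk -> 1x1 channel reduction -> bilinear upsampling -> 1536-channel pyramid."""
    def __init__(self, out_size=(240, 320)):
        super().__init__()
        r = resnet50()                                        # the classification (FC) layer is unused
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                  r.layer1, r.layer2, r.layer3)
        self.layer4 = r.layer4
        self.reduce3 = nn.Conv2d(1024, 512, kernel_size=1)    # penultimate stage: 1024 -> 512
        self.reduce4 = nn.Conv2d(2048, 1024, kernel_size=1)   # last stage: 2048 -> 1024
        self.out_size = out_size

    def forward(self, x):
        c3 = self.stem(x)                                     # (B, 1024, 15, 20) for 240x320 input
        c4 = self.layer4(c3)                                  # (B, 2048, 8, 10)
        p3 = F.interpolate(self.reduce3(c3), self.out_size, mode="bilinear", align_corners=False)
        p4 = F.interpolate(self.reduce4(c4), self.out_size, mode="bilinear", align_corners=False)
        return torch.cat([p3, p4], dim=1)                     # F_b: (B, 1536, 240, 320)

F_b = PyramidBackbone()(torch.randn(1, 3, 240, 320))
```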
To further enhance the feature map F_b, the Resnet-Attention network computes the Hadamard product of F_a and F_b to obtain the enhanced feature map F_action:
F_action = F_a ⊙ F_b    (4)
where ⊙ denotes the element-wise (Hadamard) product.
The main pipeline for extracting two-person interaction behavior features with the Resnet-Attention network, together with a result visualization, is shown in Fig. 1. The network input is an RGB picture and the human boxes. First, the backbone network extracts the RGB image features; then the component features are computed from the human skeletons, and the skeleton attention map is obtained through a 1 × 1 convolution and a Sigmoid activation function; the RGB features and the skeleton attention map are combined by a Hadamard product to obtain the enhanced features; finally, the two-person interaction features are extracted based on the enhanced features and the target bounding boxes.
Preferably, the two-person interaction features are extracted as follows: compute the minimum enclosing box MEB that contains the two-person bounding boxes; feed the MEB and the enhanced features into a RoIAlign module, which outputs the MEB region features F_inter of size 1536 × 5 × 5; next, flatten F_inter into a vector of length 38400 (1536 × 5 × 5); apply LayerNorm, ReLU, Dropout (0.9) and FC operations to the vector in sequence to obtain a feature vector of length 512; finally, compute the L-dimensional interaction behavior classification score F_rgb with an FC layer, where L denotes the number of interaction behavior categories.
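The two-person feature head described above can be sketched in PyTorch as follows, under the assumption that the skeleton attention map is a single-channel map broadcast over the 1536 feature channels; the module and variable names are illustrative, not the patented code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class PairHead(nn.Module):
    """MEB RoIAlign -> flatten -> LayerNorm -> ReLU -> Dropout(0.9) -> FC(512) -> FC(L)."""
    def __init__(self, num_classes, channels=1536, pool=5):
        super().__init__()
        self.pool = pool
        self.norm = nn.LayerNorm(channels * pool * pool)
        self.drop = nn.Dropout(0.9)
        self.fc1 = nn.Linear(channels * pool * pool, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, f_action, box_a, box_b):
        # minimum enclosing box (MEB) of the two person boxes, for batch item 0
        x1, y1 = torch.min(box_a[0], box_b[0]), torch.min(box_a[1], box_b[1])
        x2, y2 = torch.max(box_a[2], box_b[2]), torch.max(box_a[3], box_b[3])
        meb = torch.stack([box_a.new_zeros(()), x1, y1, x2, y2])[None]  # (1, 5): batch idx + box
        f_inter = roi_align(f_action, meb, output_size=self.pool)       # (1, 1536, 5, 5)
        v = f_inter.flatten(1)                                          # (1, 38400)
        v = self.fc1(self.drop(F.relu(self.norm(v))))                   # (1, 512)
        return self.fc2(v)                                              # (1, L) interaction scores

f_b = torch.randn(1, 1536, 240, 320)                  # spatial pyramid features F_b
f_a = torch.sigmoid(torch.randn(1, 1, 240, 320))      # skeleton attention map F_a (broadcast)
f_action = f_a * f_b                                  # eq. (4): Hadamard enhancement
box_a = torch.tensor([50.0, 40.0, 150.0, 220.0])      # person boxes in pixel coordinates
box_b = torch.tensor([140.0, 60.0, 260.0, 230.0])
scores = PairHead(num_classes=6)(f_action, box_a, box_b)
```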
4) Defining a dual-stream network for multi-person interaction behavior understanding
Starting from multi-modal information and the attention mechanism, a dual-stream network model is proposed: the first stream is a Resnet-Attention network based on human skeleton attention and extracts enhanced RGB features; the second stream works on skeleton data and uses a shift graph convolution network, currently among the best-performing methods for behavior recognition, to extract accurate skeleton features.
In step 4), the dual-stream network comprises two branches: the RGB feature extraction branch outputs the interaction behavior classification score F_rgb, and the shift graph convolution branch uses a customized Shift-GCN network to compute the score F_gcn. The original Shift-GCN requires the input to be a video-based skeleton sequence and outputs the action classification result for the whole video sequence; because Shift-GCN targets video behavior classification, which differs from the present task, it cannot directly compute the interaction behavior classification score F_gcn for an arbitrary pair of persons in a single-frame image. The shift graph convolution network is therefore modified to be compatible with single-frame data and to handle the two-person interaction behavior classification problem; the modifications include: removing the temporal shift convolution, and changing the single-person skeleton sequence input into a two-person single-frame skeleton input;
Given the two-person skeleton data S, where V is the number of joint points of the two-person skeleton and C = 2 is the number of channels (each coordinate component occupies one channel), the shift graph convolution network outputs the interaction behavior classification score vector F_gcn:
F_gcn = ShiftGCN(S)    (5)
Finally, the dual-stream network fuses F_rgb and F_gcn:
F_fused = F_rgb + F_gcn    (6).
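A minimal sketch of the two-stream fusion in equations (5) and (6) is shown below. Since the customized single-frame Shift-GCN is not reproduced here, `SingleFrameShiftGCN` is only a stand-in interface (temporal shifts removed, two-person single-frame skeleton input, L-way score output), not the published Shift-GCN code.

```python
import torch
import torch.nn as nn

class SingleFrameShiftGCN(nn.Module):
    """Stand-in for the customized Shift-GCN: two-person single-frame skeleton in, L scores out.
    Input S has shape (B, C=2, V): x/y coordinate channels over the V joints of both persons."""
    def __init__(self, num_joints, num_classes, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # (B, 2*V)
            nn.Linear(2 * num_joints, hidden),   # the spatial shift graph convolutions would go here
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, s):
        return self.net(s)

V, L = 34, 6                          # e.g. 2 persons x 17 joints, L interaction categories
S = torch.randn(1, 2, V)              # two-person single-frame skeleton coordinates
F_gcn = SingleFrameShiftGCN(V, L)(S)  # eq. (5)
F_rgb = torch.randn(1, L)             # score from the Resnet-Attention branch
F_fused = F_rgb + F_gcn               # eq. (6): element-wise fusion of the two streams
```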
still further, the method comprises the steps of:
5) Network parameter training
During network parameter training, the training configurations adopted for the skeleton attention (Resnet-Attention) network and for the shift graph convolution network are different and are described in two parts:
training uses a graphics card: GTX 1080Ti
For the training parameters of the skeleton attention (Resnet-Attention) network, refer to Table 1:
TABLE 1
For the training parameters of the shift graph convolution network, refer to Table 2:
TABLE 2
Because the number of people differs greatly across pictures, different training parameters are set for different numbers of people. When the number of interaction targets in a picture is N = {x | 2 ≤ x ≤ 5}, the above parameters are used for training. When the number of interaction targets in a picture is N = {x | 5 < x ≤ 15}, the larger number of people makes the number of interaction groups grow sharply, and the proportion of interaction groups without an action category is high; a maximum distance Max_dis is therefore set during training, the distance dis between the center points of two target boxes in the picture is computed, and when dis ≥ Max_dis the interaction group is judged to have no action-category interaction. Because the category proportions within the interaction groups are unbalanced, unbalanced class weights are added to the loss functions of the two networks, where 'OT' denotes the no-interaction category;
When the number of people N in a picture is in {2, 3, 4, 5}, the combination formula M = N(N − 1)/2 gives the number of interaction groups M ∈ {1, 2, …, 10}. The network input consists of the local images of the N persons and the M two-person region images of the picture; however, because the number of persons contained in different pictures is not fixed, the amounts of input data differ. A padding operation is therefore adopted so that every frame fed into the network model carries the same amount of data: the upper limit on the number of people is set to Max_num = 5, with the corresponding number of two-person groups Max_interaction_num = 10, and padding is applied when there are fewer people: the action categories of the padded data are all −1, and the human bounding boxes are all [0,0];
When the number of people N in a picture is in {5, 6, …, 15}, the combination formula gives the number of interaction groups M ∈ {10, 11, …, 105}; the upper limit on the number of people is set to Max_num = 15, with the corresponding number of two-person groups Max_interaction_num = 105. The true interaction groups then account for less than 5% of the total number of groups, so the proportion of true interaction groups is extremely unbalanced; on the other hand, the differences in the number of people would cause excessive padding: for example, for a scene containing 5 people, 95 groups would have to be padded. To solve these two problems, a maximum-distance elimination method is adopted: a distance threshold Max_dis is set, the Euclidean distance dis between the center points of two human bounding boxes in the image is computed, and when dis ≥ Max_dis the pair is judged to have no interaction behavior. The maximum-distance elimination reduces the number of two-person groups fed into the network model to within 36 groups; in addition, unbalanced class weights are added to the loss function used for training each model.
Here, OT represents any human behavior other than the normal interaction behaviors.
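The pair generation, padding and maximum-distance elimination described above can be sketched as follows; the upper limits, the padding label −1 and the dummy box follow the text, while the helper names and the dictionary used for pair labels are illustrative assumptions.

```python
from itertools import combinations
import math

def pair_groups(boxes, labels, max_people, max_dis=None, pad_label=-1, pad_box=(0, 0, 0, 0)):
    """Enumerate the C(N,2) person pairs, drop far-apart pairs, and pad to a fixed group count.
    `boxes` are [x1, y1, x2, y2]; `labels` maps a pair index (i, j) to its interaction label."""
    max_pairs = math.comb(max_people, 2)           # 10 groups for 5 people, 105 for 15
    pairs = []
    for i, j in combinations(range(len(boxes)), 2):
        if max_dis is not None:
            ci = ((boxes[i][0] + boxes[i][2]) / 2, (boxes[i][1] + boxes[i][3]) / 2)
            cj = ((boxes[j][0] + boxes[j][2]) / 2, (boxes[j][1] + boxes[j][3]) / 2)
            if math.dist(ci, cj) >= max_dis:       # maximum-distance elimination: no interaction
                continue
        pairs.append((boxes[i], boxes[j], labels.get((i, j), pad_label)))
    while len(pairs) < max_pairs:                  # padding so every frame has the same data amount
        pairs.append((pad_box, pad_box, pad_label))
    return pairs[:max_pairs]

# Example: three people, one annotated interacting pair, distant pairs eliminated.
boxes = [(10, 10, 60, 120), (70, 15, 120, 130), (400, 20, 450, 140)]
groups = pair_groups(boxes, labels={(0, 1): 3}, max_people=5, max_dis=300)
```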
The embodiments described in this specification merely illustrate ways in which the inventive concept may be implemented. The scope of the present invention should not be construed as limited to the specific forms set forth in the embodiments; it also covers equivalent technical means that a person skilled in the art can conceive based on the inventive concept.

Claims (1)

1. A multi-person concurrent interaction behavior understanding method based on a single frame image is characterized by comprising the following steps:
1) Inputting a picture, and combining skeleton estimation and a multi-target tracking algorithm to obtain human skeleton data and regions of interest;
2) Generating a skeleton component confidence map and a component affinity field using the human skeleton data, constructing an attention map;
3) Defining a Resnet-Attention network based on human skeleton Attention;
4) Defining a dual-stream network for understanding multi-person interaction behaviors;
starting from multi-modal information and the attention mechanism, a dual-stream network model is proposed: the first stream is a Resnet-Attention network based on human skeleton attention and extracts enhanced RGB features; the second stream works on skeleton data and uses a shift graph convolution network, currently among the best-performing methods for behavior recognition, to extract accurate skeleton features;
in the step 1), the region of interest refers to a human body bounding box; accurate computation of the region of interest is the basis for extracting interaction behavior features, and it is computed by combining a skeleton estimation algorithm with a multi-target tracking algorithm: AlphaPose extracts human skeletons from the original image and outputs the corresponding human body bounding boxes, called skeleton human boxes; meanwhile, FairMOT tracks the human bodies in the video, yielding a human body bounding box for each person in a given frame, called a tracking human box; the skeleton human box fits the actual human body closely, whereas the tracking human box often leaves the limbs outside the bounding box; on the other hand, human skeleton estimation may fail in complex scenes with severe occlusion or unusual human poses, while tracking human boxes are less frequently missing;
in the step 2), the obtained human skeletons and human boxes need to be matched against the annotation data to obtain ordered human skeleton data and regions of interest, the ordered data comprising: human skeleton, skeleton human box, tracking human box, region of interest, interaction group index, interaction group action label, and single-person action label, computed by the following steps:
1.1) Extract human skeletons with the AlphaPose algorithm and output the skeleton human boxes;
1.2) Extract tracking human boxes with the FairMOT algorithm;
1.3) Compute the true action label and interaction group index of each bounding box from the skeleton human boxes and tracking human boxes obtained in 1.1) and 1.2) and the annotation data; the annotation data comprise the human boxes, the interaction group data and the action labels of the interaction targets, and are matched to the tracking boxes as follows: for any tracking box B, find the annotated bounding box Bmax with the largest intersection-over-union (IoU) with B; if Bmax exists and the corresponding IoU is larger than 0.5, Bmax is considered matched to B, and the action label and interaction group index of Bmax are assigned to the tracking box B;
1.4) Fuse the skeleton box and the tracking box to obtain a fused box; the fusion rules are as follows:
1.4.1) When both the skeleton box and the tracking box exist: compute the IoU ρ of the skeleton box and the tracking box; when ρ is larger than 0.3, take the smaller of the two bounding boxes as the fused box; otherwise, take the skeleton box as the fused box;
1.4.2) When the skeleton box exists and the tracking box is missing: take the skeleton box as the fused box;
1.4.3) When the skeleton box is missing and the tracking box exists: if ρ is larger than 0.3, take the tracking box as the fused box; otherwise, there is no fused box;
1.4.4) When both the skeleton box and the tracking box are missing: there is no fused box;
subsequent model training and testing use the fused boxes;
in the step 2), in order to enhance the RGB features, the data form of the skeleton feature maps in the OpenPose algorithm is adopted, and the attention map is split into a component confidence map and a component affinity field;
component confidence map C: for the skeleton sequence V = {v_i | i = 1, …, K} estimated by the AlphaPose algorithm, a confidence map is computed for each joint point, where v_i = (x_i, y_i) denotes one of the key skeleton points of the human skeleton; a confidence map is computed for each joint coordinate with a Gaussian blur, where σ is the Gaussian blur threshold; the confidence at position (x, y) of the feature map for the k-th skeleton point is:

C_k(x, y) = exp(−((x − x_k)² + (y − y_k)²) / σ²)    (1)
component affinity field P: limb orientation is represented with component affinity fields; key joint points are connected according to the natural connection pattern of the human body; concretely, an affinity field, called the component affinity field, is computed for the component between every two connected joint points; let the coordinates of the starting joint point s of a component be (x_s, y_s) and the coordinates of the ending joint point e be (x_e, y_e); at every pixel of this component region, a 2D vector pointing from the starting joint point to the ending joint point is encoded, and an orthogonal distance threshold τ is set for the connection between the two joint points; then, for the channel where s is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

P_s(x, y) = ((x_e − x_s), (y_e − y_s)) / L_{e,s}    (2)
above, L_{e,s} is the Euclidean distance between joint point e and joint point s; for the channel where e is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

P_e(x, y) = ((x_s − x_e), (y_s − y_e)) / L_{e,s}    (3)
the two-person skeleton considered here has K joint points and M components in total, yielding K channels of joint confidence maps {C_i | i = 1, …, K} and M channels of component affinity fields {P_j | j = 1, …, M}; the K channels of joint confidence maps are superimposed to obtain the component confidence map C; the M channels of component affinity fields are superimposed to obtain the component affinity field P; the component confidence map and the component affinity field are added to obtain the component features F_c; finally, a 1 × 1 convolution is applied to F_c to output a skeleton attention map F_a of the same size;
in the step 3), in order to extract image features, ResNet50 with its final fully-connected layer removed is taken as the backbone network; since the feature maps output by the last two residual modules of ResNet50 have 2048 and 1024 channels respectively and easily cause overfitting on a small-scale interaction behavior dataset, 1 × 1 convolutions are used to reduce the numbers of output feature channels to 1024 and 512 respectively; in order to obtain a multi-scale behavior representation, the outputs of these two convolution stages of the backbone are used jointly to obtain the spatial pyramid features F_b;
the Resnet-Attention network uses the skeleton attention map F_a to enhance the image features F_b; F_a has size 240 × 320, and the two levels of the spatial pyramid features F_b have sizes 512 × 15 × 20 and 1024 × 8 × 10 respectively; in order to fuse the image features with the skeleton attention map, the two pyramid levels are expanded to the same scale as the skeleton features by bilinear interpolation, the expanded feature maps are stacked along the channel dimension, and a unified feature map of dimensions 1536 × 240 × 320 is finally obtained;
to further enhance the feature map F_b, the Resnet-Attention network computes the Hadamard product of F_a and F_b to obtain the enhanced feature map F_action:
F_action = F_a ⊙ F_b    (4)
where ⊙ denotes the element-wise (Hadamard) product;
the network input is an RGB picture and the human boxes; first, the backbone network extracts the RGB image features; then the component features are computed from the human skeletons, and the skeleton attention map is obtained through a 1 × 1 convolution and a Sigmoid activation function; the RGB features and the skeleton attention map are combined by a Hadamard product to obtain the enhanced features; finally, the two-person interaction features are extracted based on the enhanced features and the target bounding boxes;
in the step 3), the two-person interaction features are extracted as follows: compute the minimum enclosing box MEB that contains the two-person bounding boxes; feed the MEB and the enhanced features into a RoIAlign module, which outputs the MEB region features F_inter of size 1536 × 5 × 5; next, flatten F_inter into a vector of length 38400 (1536 × 5 × 5); apply LayerNorm, ReLU, Dropout (0.9) and FC operations to the vector in sequence to obtain a feature vector of length 512; finally, compute the L-dimensional interaction behavior classification score F_rgb with an FC layer, where L denotes the number of interaction behavior categories;
in the step 4), the dual-stream network comprises two branches: the RGB feature extraction branch outputs the interaction behavior classification score F_rgb, and the shift graph convolution branch uses a customized Shift-GCN network to compute the score F_gcn; the original Shift-GCN requires the input to be a video-based skeleton sequence and outputs the action classification result for the whole video sequence; because Shift-GCN targets video behavior classification, which differs from the target task, it cannot directly compute the interaction behavior classification score F_gcn for an arbitrary pair of persons in a single-frame image; the shift graph convolution network is therefore modified to be compatible with single-frame data and to handle the two-person interaction behavior classification problem, the modifications including: removing the temporal shift convolution; changing the single-person skeleton sequence input into a two-person single-frame skeleton input;
given the two-person skeleton data S, where V is the number of joint points of the two-person skeleton and C = 2 is the number of channels (each coordinate component occupies one channel), the shift graph convolution network outputs the interaction behavior classification score vector F_gcn:
F_gcn = ShiftGCN(S)    (5)
finally, the dual-stream network fuses F_rgb and F_gcn:
F_fused = F_rgb + F_gcn    (6)
the method further comprises the steps of:
5) Network parameter training
during network parameter training, the training configurations adopted for the skeleton attention (Resnet-Attention) network and for the shift graph convolution network are different and are described in two parts:
training uses a graphics card: GTX 1080Ti
for the training parameters of the skeleton attention (Resnet-Attention) network, refer to Table 1:
TABLE 1
for the training parameters of the shift graph convolution network, refer to Table 2:
TABLE 2
because the number of people differs greatly across pictures, different training parameters are set for different numbers of people; when the number of interaction targets in a picture is N = {x | 2 ≤ x ≤ 5}, the above parameters are used for training; when the number of interaction targets in a picture is N = {x | 5 < x ≤ 15}, the larger number of people makes the number of interaction groups grow sharply, and the proportion of interaction groups without an action category is high; a maximum distance Max_dis is therefore set during training, the distance dis between the center points of two target boxes in the picture is computed, and when dis ≥ Max_dis the interaction group is judged to have no action-category interaction; because the category proportions within the interaction groups are unbalanced, unbalanced class weights are added to the loss functions of the two networks, where 'OT' denotes the no-interaction category;
the maximum-distance elimination method is adopted: a distance threshold Max_dis is set, the Euclidean distance dis between the center points of two human bounding boxes in the image is computed, and when dis ≥ Max_dis the pair is judged to have no interaction behavior; the maximum-distance elimination reduces the number of two-person groups fed into the network model to within 36 groups; in addition, unbalanced class weights are added to the loss function used for training each model;
here, OT represents any human behavior other than the normal interaction behaviors.
CN202110259862.6A 2021-03-10 2021-03-10 Multi-person concurrent interaction behavior understanding method based on single-frame image Active CN113158782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110259862.6A CN113158782B (en) 2021-03-10 2021-03-10 Multi-person concurrent interaction behavior understanding method based on single-frame image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110259862.6A CN113158782B (en) 2021-03-10 2021-03-10 Multi-person concurrent interaction behavior understanding method based on single-frame image

Publications (2)

Publication Number Publication Date
CN113158782A CN113158782A (en) 2021-07-23
CN113158782B true CN113158782B (en) 2024-03-26

Family

ID=76886824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110259862.6A Active CN113158782B (en) 2021-03-10 2021-03-10 Multi-person concurrent interaction behavior understanding method based on single-frame image

Country Status (1)

Country Link
CN (1) CN113158782B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797655B (en) * 2022-12-13 2023-11-07 南京恩博科技有限公司 Character interaction detection model, method, system and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160164A (en) * 2019-12-18 2020-05-15 上海交通大学 Action recognition method based on human body skeleton and image fusion
CN111582220A (en) * 2020-05-18 2020-08-25 中国科学院自动化研究所 Skeleton point behavior identification system based on shift diagram convolution neural network and identification method thereof
CN111985343A (en) * 2020-07-23 2020-11-24 深圳大学 Method for constructing behavior recognition deep network model and behavior recognition method

Also Published As

Publication number Publication date
CN113158782A (en) 2021-07-23

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant