CN113158782B - Multi-person concurrent interaction behavior understanding method based on single-frame image - Google Patents
- Publication number: CN113158782B (application CN202110259862.6A)
- Authority: CN (China)
- Prior art keywords: skeleton, frame, interaction, network, human
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
Abstract
A multi-person concurrent interaction behavior understanding method based on a single-frame image comprises the following steps: 1) inputting a picture, and combining skeleton estimation with a multi-target tracking algorithm to obtain human skeleton data and regions of interest; 2) generating a skeleton component confidence map and a part affinity field from the human skeleton data, and constructing an attention map; 3) defining a Resnet-Attention network based on human skeleton attention; 4) defining a dual-stream network for understanding multi-person interaction behaviors; 5) training the network parameters. The proposed algorithm uses attention maps to strengthen the convolutional features of RGB images, and extracts two-person interaction features from human skeleton data with a shift graph convolution network, thereby modeling multi-person interaction behavior in a single-frame image and obtaining an effective interaction behavior characterization. The method is suitable for multi-person concurrent interaction understanding in a single-frame image.
Description
Technical Field
The invention belongs to the field of image understanding in computer vision, and relates to a multi-person interaction understanding method.
Background
To build new smart cities, ensure personal safety and reduce property loss, the monitoring function of cameras in public places needs to be improved, so that crowd behaviors in a monitored scene can be automatically and accurately identified and understood from video data, enabling computer-aided analysis and real-time networked early warning of key events. To achieve multi-person interaction understanding, the interaction relationships and interaction behavior categories between persons, such as "boxing", "kicking", "pushing" and "skimming", must be identified automatically from videos or images. Existing technologies have two problems. The first is modality limitation: single-modality information is usually used, but it is insufficient for understanding complex human interactions. The second is modality missing: in interactive scenes, the region of interest may be missing because of human body occlusion.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a multi-person concurrent interaction behavior understanding method based on a single-frame image, which can effectively identify the pairwise interaction relationships and interaction behavior categories in a multi-person scene.
The technical solution adopted to solve the technical problem is as follows:
A multi-person concurrent interaction behavior understanding method based on a single-frame image, the method comprising the following steps:
1) Inputting a picture, and combining skeleton estimation and a multi-target tracking algorithm to obtain human skeleton data and regions of interest;
2) Generating a skeleton component confidence map and a part affinity field using the human skeleton data, and constructing an attention map;
3) Defining a Resnet-Attention network based on human skeleton attention;
4) Defining a dual-stream network for multi-person interaction behavior understanding
Starting from multi-modal information and an attention mechanism, a dual-stream network model is proposed: the first stream is a Resnet-Attention network based on human skeleton attention, which extracts enhanced RGB features; the second stream is based on skeleton data and uses a shift graph convolution network, currently the best-performing method for behavior recognition, to extract accurate skeleton features.
Further, in step 1), the region of interest refers to a human bounding box. Accurate computation of the region of interest is the basis for extracting interaction behavior features, so the region of interest is computed by combining a skeleton estimation algorithm with a multi-target tracking algorithm. The human skeletons are extracted from the original image with AlphaPose, and each is output as a human bounding box, called the skeleton human frame. Meanwhile, FairMOT tracks the human bodies in the video to obtain a bounding box for each person in a given frame, called the tracking human frame. The advantage of the skeleton human frame is that it fits the actual human body tightly, whereas the tracking human frame often leaves limbs outside the bounding box; conversely, for complex scenes with severe occlusion or abnormal human poses, human skeleton estimation may fail, while the tracking human frame is less often missing.
Further, the obtained human skeletons and human frames need to be matched against the labeling data to obtain ordered human skeleton data and regions of interest. The ordered data comprise: human skeleton, skeleton human frame, tracking human frame, region of interest, interaction group serial number, interaction group action label and single-person action label, computed by the following steps:
1.1) Extract the human skeletons with the AlphaPose algorithm and output the skeleton human frames;
1.2) Extract the tracking human frames with the FairMOT algorithm;
1.3) Compute the true action label and interaction group serial number of each bounding box from the skeleton human frames and tracking human frames obtained in 1.1) and 1.2) together with the labeling data, where the labeling data comprise the human boxes, the interaction group data and the action labels of the interaction targets. Match the labeling data with the tracking frames: for any tracking frame B, compute the labeled bounding box Bmax with the largest intersection-over-union with B; if Bmax exists and the corresponding intersection-over-union is larger than 0.5, Bmax is considered matched with B, and the action label and interaction group serial number corresponding to Bmax are assigned to the tracking frame B;
1.4) Fuse the skeleton frame and the tracking frame to obtain the fusion frame; the fusion rules are as follows:
1.4.1) When both the skeleton frame and the tracking frame exist: compute the intersection-over-union ρ of the skeleton frame and the tracking frame; when ρ is larger than 0.3, take the smaller of the two bounding boxes as the fusion frame; otherwise, take the skeleton frame as the fusion frame;
1.4.2) When the skeleton frame exists and the tracking frame is absent: take the skeleton frame as the fusion frame;
1.4.3) When the skeleton frame is absent and the tracking frame exists: if ρ is larger than 0.3, take the tracking frame as the fusion frame; otherwise, there is no fusion frame;
1.4.4) When both the skeleton frame and the tracking frame are absent: there is no fusion frame.
Subsequent model training and testing use the fusion frame.
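As a sketch of the matching and fusion procedure above (steps 1.3 and 1.4), the following Python functions implement the IoU test and the fusion rules. All box formats are assumed to be (x1, y1, x2, y2), and the ρ test of rule 1.4.3 is simplified to always keeping the tracking frame, since ρ cannot be evaluated against an absent skeleton frame:

```python
def _area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = _area(a) + _area(b) - inter
    return inter / union if union > 0 else 0.0

def match_label(track_box, labeled_boxes):
    """Step 1.3: index of the labeled box Bmax with the largest IoU with the
    tracking box B, provided that IoU exceeds 0.5; otherwise None."""
    best, best_iou = None, 0.0
    for i, lb in enumerate(labeled_boxes):
        v = iou(track_box, lb)
        if v > best_iou:
            best, best_iou = i, v
    return best if best_iou > 0.5 else None

def fuse_boxes(skel_box, track_box, rho_thresh=0.3):
    """Rules 1.4.1-1.4.4: prefer the tighter skeleton box. Rule 1.4.3 also
    gates on rho in the text, which cannot be evaluated against an absent
    skeleton box, so this sketch simply keeps the tracking box there."""
    if skel_box is not None and track_box is not None:       # rule 1.4.1
        rho = iou(skel_box, track_box)
        if rho > rho_thresh:
            return skel_box if _area(skel_box) <= _area(track_box) else track_box
        return skel_box
    if skel_box is not None:                                 # rule 1.4.2
        return skel_box
    if track_box is not None:                                # rule 1.4.3
        return track_box
    return None                                              # rule 1.4.4
```

The 0.5 matching threshold and 0.3 fusion threshold are the values stated in the text.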
Still further, in step 2), to enhance the RGB features, the invention adopts the data form of the skeleton feature maps in the OpenPose algorithm, splitting the attention map into a component confidence map and a part affinity field.
Component confidence map C: from the skeleton sequence V = {v_i | i = 1, …, K} estimated by the AlphaPose algorithm, a confidence map is computed for every joint point, where v_i = (x_i, y_i) denotes one of the key skeleton points of the human skeleton. For each joint coordinate, the confidence map is computed with a Gaussian kernel, where σ is the Gaussian blur threshold with value 0.5. The confidence at position (x, y) of the feature map for the k-th skeleton point is:
C_k(x, y) = exp(−‖(x, y) − (x_k, y_k)‖² / σ²)    (1)
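The per-joint Gaussian confidence map can be sketched as follows; the 240 × 320 map size and σ = 0.5 come from the text, while the function name and exact normalization are illustrative:

```python
import numpy as np

def joint_confidence_map(joint_xy, w=320, h=240, sigma=0.5):
    """One-channel Gaussian confidence map for a single skeleton point:
    C_k(x, y) = exp(-||(x, y) - v_k||^2 / sigma^2), peaking at the joint."""
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    d2 = (xs - joint_xy[0]) ** 2 + (ys - joint_xy[1]) ** 2
    return np.exp(-d2 / sigma ** 2)
```

With σ = 0.5 the peak decays very quickly, so each channel is essentially a sharp spike at its joint location.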
part affinity field P: the limb trend is represented by a component affinity field, key joint points are connected according to a natural connection mode of a human body, the implementation mode is that an affinity field is calculated for a component between every two connected joint points of the human body, the affinity field is called a component affinity field, and the coordinates of a starting joint point s of a certain component are set as (x) s ,y s ) The coordinates of the termination point e are (x e ,y e ) At each pixel of this joint region, a 2D vector code is designed from the start joint point to the end joint point, setting the orthogonal distance threshold τ of the two joint point connection to 0.5. Then for the channel in which s is located, the values of all pixels in the orthogonal threshold range connected to the two joint points are:
above L e,s Euclidean distance between the articulation point e and the articulation point s. For the channel where e is located, the values of all pixels in the orthogonal threshold range connected with the two joint points are as follows:
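A sketch of a part affinity field in the standard OpenPose style — the unit vector from s to e at every pixel within orthogonal distance τ of the limb segment, and zero elsewhere. The exact per-channel scalar values of equations (2) and (3) are not recoverable from the available text, so this follows the OpenPose convention the section cites:

```python
import numpy as np

def part_affinity_field(s_xy, e_xy, w=320, h=240, tau=0.5):
    """2D affinity field for one limb: the unit vector from s to e at every
    pixel lying on the limb (within orthogonal distance tau of the segment
    and between its endpoints along the axis); zero everywhere else."""
    s = np.asarray(s_xy, float)
    e = np.asarray(e_xy, float)
    length = np.linalg.norm(e - s)
    u = (e - s) / max(length, 1e-8)                 # unit vector along the limb
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    rel = np.stack([xs - s[0], ys - s[1]], axis=-1)  # (h, w, 2) offsets from s
    along = rel @ u                                  # projection onto limb axis
    ortho = np.abs(rel @ np.array([-u[1], u[0]]))    # orthogonal distance
    on_limb = (along >= 0) & (along <= length) & (ortho <= tau)
    paf = np.zeros((h, w, 2))
    paf[on_limb] = u
    return paf
```

The vector-valued field carries both the limb location and its direction, which is the property the attention map exploits.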
The currently considered two-person skeleton has K joint points and M parts in total, giving joint confidence maps {C_i | i = 1, …, K} of K channels and part affinity fields {P_j | j = 1, …, M} of M channels. The K channels of joint confidence maps are superimposed to obtain the component confidence map C; the M channels of part affinity fields are superimposed to obtain the part affinity field P. Adding the component confidence map and the part affinity field gives the component feature F_c of size W_s × H_s, where W_s = 320 and H_s = 240. Finally, a 1 × 1 convolution is applied to F_c to output a skeleton attention map F_a of the same size.
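The attention-map assembly can be sketched in NumPy; on a single summed channel the 1 × 1 convolution reduces to a learned scale and bias (a simplification), and the Sigmoid follows the network description later in the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skeleton_attention(conf_maps, paf_maps, w1=1.0, b1=0.0):
    """Sum the K joint-confidence channels into C and the M affinity
    channels into P, add them into the component feature F_c, then apply
    a 1 x 1 convolution (on one channel this is just a scale w1 and bias
    b1, a simplification) followed by a Sigmoid to obtain F_a."""
    C = conf_maps.sum(axis=0)
    P = paf_maps.sum(axis=0)
    F_c = C + P
    return sigmoid(w1 * F_c + b1)
```

The channel counts K and M depend on the skeleton layout; any (K, 240, 320) and (M, 240, 320) stacks produce a (240, 320) attention map.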
In step 3), to extract image features, ResNet50 with the final fully connected layer removed is taken as the backbone network. Since the feature maps output by the penultimate and antepenultimate modules of ResNet50 have 2048 and 1024 channels respectively, overfitting occurs easily on a small-scale interaction behavior dataset, so 1 × 1 convolutions reduce the numbers of output feature channels to 1024 and 512 respectively. To obtain a multi-scale behavior characterization, the outputs of the last two convolution stages of the backbone are used jointly to obtain a spatial pyramid feature F_b.
The Resnet-Attention network uses the skeleton attention map F_a to enhance the image feature F_b; F_a is 240 × 320 in size. To fuse the image features and the skeleton attention, the image features of the two spatial-pyramid levels are expanded by bilinear interpolation to the same scale as the skeleton features, and the expanded feature maps are stacked along the channel direction, finally giving a unified feature map of dimensions 1536 × 240 × 320.
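The bilinear expansion and channel stacking might look as follows; the spatial sizes of the backbone outputs are assumptions, and the channel counts are shrunk from 1024 + 512 = 1536 to 16 + 8 = 24 to keep the sketch small:

```python
import numpy as np

def bilinear_resize(fmap, out_h, out_w):
    """Bilinear interpolation of a (C, H, W) feature map to (C, out_h, out_w)."""
    c, h, w = fmap.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = fmap[:, y0][:, :, x0] * (1 - wx) + fmap[:, y0][:, :, x1] * wx
    bot = fmap[:, y1][:, :, x0] * (1 - wx) + fmap[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

# Expand both pyramid levels to the 240 x 320 skeleton scale and stack them
# along the channel axis; the real network stacks 1024 + 512 = 1536 channels.
f_hi = np.random.rand(16, 8, 10)    # deeper pyramid level (sizes assumed)
f_lo = np.random.rand(8, 15, 20)    # shallower pyramid level (sizes assumed)
F_b = np.concatenate([bilinear_resize(f_hi, 240, 320),
                      bilinear_resize(f_lo, 240, 320)], axis=0)
```

Stacking after interpolation keeps both scales' information in a single tensor that the attention map can multiply element-wise.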
To further enhance the feature map F_b, the Resnet-Attention network computes the Hadamard product of F_a and F_b to obtain an enhanced feature map F_action:
F_action = F_a ⊙ F_b    (4)
where ⊙ denotes the element-wise (Hadamard) product.
The main flow and result visualization of extracting two-person interaction behavior features with the Resnet-Attention network are shown in Fig. 1. The network input is an RGB picture and human bounding boxes. First, the backbone network extracts RGB image features; then the component features are computed from the human skeletons, and the skeleton attention map is obtained through a 1 × 1 convolution and a Sigmoid activation function; the RGB features and the skeleton attention map are combined by a Hadamard product to obtain the enhanced features; finally, the two-person interaction features are extracted from the enhanced features and the target bounding boxes.
Preferably, the two-person interaction features are extracted as follows: compute the minimum enclosing box MEB containing the two persons' bounding boxes; input the MEB and the enhanced features into a RoIAlign module, which outputs an MEB region feature F_inter of size 1536 × 5 × 5; next, flatten F_inter into a vector of length 38400 (1536 × 5 × 5); apply LayerNorm, ReLU, Dropout (0.9) and FC operations to this vector in turn to obtain a feature vector of length 512; finally, compute the interaction behavior classification score with an FC layer, where L denotes the number of interaction behavior categories.
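A minimal sketch of the geometry in this head — the minimum enclosing box of the two actors' boxes, with boxes assumed to be (x1, y1, x2, y2) — plus a sanity check on the flattened length stated above:

```python
def minimum_enclosing_box(box_a, box_b):
    """Minimum enclosing box (MEB) of the two persons' bounding boxes,
    each box given as (x1, y1, x2, y2)."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

# The RoIAlign output over the MEB is 1536 x 5 x 5, which flattens to the
# 38400-dimensional vector fed to LayerNorm -> ReLU -> Dropout(0.9) -> FC(512).
assert 1536 * 5 * 5 == 38400
```

The MEB covers both actors and the space between them, which is where the interaction evidence lies.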
Further, in step 4), the dual-stream network comprises two branches. The RGB feature extraction branch outputs an interaction behavior classification score F_rgb. The shift-graph-convolution branch computes a score F_gcn with a customized Shift-GCN network: the original Shift-GCN requires a video-based skeleton sequence as input and outputs an action classification for the whole sequence, so it cannot compute the interaction behavior classification score F_gcn of an arbitrary pair of persons in a single-frame image. Since Shift-GCN targets video behavior classification rather than the present task, the shift graph convolution network is modified to be compatible with single-frame data and to handle two-person interaction classification. The modifications are: removing the temporal shift convolution; changing the single-skeleton sequence input into a two-person single-frame skeleton input.
Given two-person skeleton data S, where v is the number of joint points of the two-person skeleton and c = 2 is the number of channels (each coordinate component occupies one channel), the shift graph convolution network outputs a score vector F_gcn for interaction behavior classification:
F_gcn = ShiftGCN(S)    (5)
Finally, the dual-stream network fuses F_rgb and F_gcn:
F_fused = F_rgb + F_gcn    (6)
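Equation (6) amounts to element-wise score addition; a minimal sketch with an assumed category count L = 7 and made-up branch scores:

```python
import numpy as np

# Element-wise score fusion of equation (6): each branch emits an
# L-dimensional classification score and the fused score is their sum.
# L = 7 and the score values below are illustrative, not from the patent.
F_rgb = np.array([0.1, 2.0, 0.3, 0.0, 0.5, 0.1, 0.2])
F_gcn = np.array([0.2, 1.5, 0.1, 0.3, 0.0, 0.4, 0.1])
F_fused = F_rgb + F_gcn
pred = int(np.argmax(F_fused))   # index of the predicted interaction category
```

Late fusion of this kind lets either branch veto or reinforce the other without any extra parameters.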
Still further, the method comprises the following step:
5) Network parameter training
In the network parameter training process, the skeleton attention (Resnet-Attention) network and the shift graph convolution network use different training sets, described in two parts below:
training uses a graphics card: GTX 1080Ti
Skeleton attention network parameter training, refer to Table 1:
TABLE 1
Shift graph convolution network parameter training, refer to Table 2:
TABLE 2
Because the number of people differs greatly between pictures, different training parameters are set for different crowd sizes. When the number of interaction targets in a picture is N = {x | 2 ≤ x ≤ 5}, the parameters above are used for training. When the number of interaction targets is N = {x | 5 < x ≤ 15}, the number of interaction groups grows sharply with the number of people and the proportion of interaction groups without an action category becomes high. A maximum distance Max_dis is therefore set during training: the distance dis between the center points of two target boxes is computed, and when dis ≥ Max_dis the interaction group is judged to have no action-category interaction. Because the category proportions within the interaction groups are unbalanced, unbalanced weights are added to the loss functions of both networks, where 'OT' is the no-interaction category.
when the number N of people in the picture is {2,3,4,5}, according to the arrangement and combinationThe interaction group number M ε {1,2, …,10}. The input data of the network are local images of N persons and M groups of double area images in the image, however, the input data amounts of the N persons and the M groups of double area images are different due to the fact that the number of persons contained in different images is not fixed; adopting a filling operation to make the data quantity of each frame of data input into the network model be the same, setting an upper limit of the number of people max_num=5, and corresponding double group number max_interaction_num=10, and filling when the number of people is insufficient: the action categories of the supplementary data are all-1, and the human body boundary boxes are all [0,0];
When the number of people N in the picture is in {5, 6, …, 15}, by combination the number of interaction groups is M = C(N, 2) ∈ {10, 11, …, 105}; the upper limit of the number of people is set to Max_num = 15, with a corresponding number of two-person groups C(15, 2) = 105. The proportion of actual interaction groups among all groups is below 5%, so the classes are extremely unbalanced. On the other hand, the difference in the number of people makes the padding excessive: for example, for a scene containing 5 people, the number of padded groups is 105 − 10 = 95. To solve these two problems, a maximum-distance elimination method is adopted: a distance threshold Max_dis is set, the Euclidean distance dis between the center points of two human bounding boxes in the image is computed, and when dis ≥ Max_dis the pair is judged to have no interaction behavior. The maximum-distance elimination reduces the number of two-person groups input to the network model to within 36 groups. In addition, an unbalanced class weight class_weight is added to the loss function of each model:
In the above equation, OT represents any other human behavior than normal interactive behavior.
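The maximum-distance elimination can be sketched as follows; the function and parameter names are illustrative, and boxes are assumed to be (x1, y1, x2, y2):

```python
import math

def box_center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def has_candidate_interaction(box_a, box_b, max_dis):
    """Maximum-distance elimination: a pair is kept as a candidate
    interaction group only if the Euclidean distance between the two
    box centers is below the threshold max_dis."""
    (ax, ay), (bx, by) = box_center(box_a), box_center(box_b)
    return math.hypot(ax - bx, ay - by) < max_dis

# Sanity check on the pair counts quoted in the text:
# 15 people give C(15, 2) = 105 candidate pairs before elimination.
assert math.comb(15, 2) == 105
```

Filtering pairs by center distance is what brings the 105 candidate groups down to the at-most-36 groups fed to the model.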
According to the method, first a way of computing regions of interest is presented based on a skeleton estimation algorithm and a multi-target tracking algorithm; the regions of interest are used to crop local region features from the feature map. Second, a dual-stream network based on skeleton attention is presented: one branch extracts the RGB image features of the two persons and enhances them with the human skeletons; the second branch uses human skeleton data and a shift graph convolution network to extract two-person interaction features. The multi-person interaction behavior of a single-frame image is thereby modeled, yielding a more effective interaction behavior characterization.
The beneficial effects of the invention are mainly: the convolutional features of the RGB image are enhanced with attention, and two-person interaction features are extracted from human skeleton data with a shift graph convolution network, realizing multi-person interaction behavior modeling for a single-frame image and obtaining an effective interaction behavior characterization. The method is suitable for multi-person concurrent interaction understanding in a single-frame image.
Drawings
Fig. 1 is a schematic diagram of a two-person interaction feature extraction based on skeletal attention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, a multi-person concurrent interaction behavior understanding method based on a single frame image includes the steps of:
1) Inputting a picture, and combining skeleton estimation and a multi-target tracking algorithm to obtain human skeleton data and regions of interest;
the region of interest refers to a human body boundary box, the accurate calculation of the region of interest is the basis for extracting interactive behavior characteristics, and the region of interest is calculated by combining a skeleton estimation algorithm and a multi-target tracking algorithm, wherein the human body skeleton is extracted from an original image by using alpha Pose and is output as a skeleton human body boundary box; meanwhile, a FairMOT is used for tracking human bodies in the video, so that a human body boundary frame of each person in a certain frame is obtained, and the frame is called a tracking human body frame; the skeleton human body frame has the advantages that the degree of fitting the skeleton human body frame to an actual human body is high, and the condition that four limbs are outside the boundary frame easily occurs when the human body frame is tracked; for complex scenes with severe occlusion or abnormal human body pose, human skeleton estimation may fail, compared with less tracking human body frame missing.
Further, the obtained human skeletons and human frames need to be matched against the labeling data to obtain ordered human skeleton data and regions of interest. The ordered data comprise: human skeleton, skeleton human frame, tracking human frame, region of interest, interaction group serial number, interaction group action label and single-person action label, computed by the following steps:
1.1) Extract the human skeletons with the AlphaPose algorithm and output the skeleton human frames;
1.2) Extract the tracking human frames with the FairMOT algorithm;
1.3) Compute the true action label and interaction group serial number of each bounding box from the skeleton human frames and tracking human frames obtained in 1.1) and 1.2) together with the labeling data, where the labeling data comprise the human boxes, the interaction group data and the action labels of the interaction targets. Match the labeling data with the tracking frames: for any tracking frame B, compute the labeled bounding box Bmax with the largest intersection-over-union with B; if Bmax exists and the corresponding intersection-over-union is larger than 0.5, Bmax is considered matched with B, and the action label and interaction group serial number corresponding to Bmax are assigned to the tracking frame B;
1.4) Fuse the skeleton frame and the tracking frame to obtain the fusion frame; the fusion rules are as follows:
1.4.1) When both the skeleton frame and the tracking frame exist: compute the intersection-over-union ρ of the skeleton frame and the tracking frame; when ρ is larger than 0.3, take the smaller of the two bounding boxes as the fusion frame; otherwise, take the skeleton frame as the fusion frame;
1.4.2) When the skeleton frame exists and the tracking frame is absent: take the skeleton frame as the fusion frame;
1.4.3) When the skeleton frame is absent and the tracking frame exists: if ρ is larger than 0.3, take the tracking frame as the fusion frame; otherwise, there is no fusion frame;
1.4.4) When both the skeleton frame and the tracking frame are absent: there is no fusion frame.
Subsequent model training and testing use the fusion frame.
2) Generating a skeleton component confidence map and a component affinity field using the human skeleton data, constructing an attention map;
To enhance the RGB features, the invention adopts the data form of the skeleton feature maps in the OpenPose algorithm, splitting the attention map into a component confidence map and a part affinity field.
Component confidence map C: from the skeleton sequence V = {v_i | i = 1, …, K} estimated by the AlphaPose algorithm, a confidence map is computed for every joint point, where v_i = (x_i, y_i) denotes one of the key skeleton points of the human skeleton. For each joint coordinate, the confidence map is computed with a Gaussian kernel, where σ is the Gaussian blur threshold with value 0.5. The confidence at position (x, y) of the feature map for the k-th skeleton point is:
C_k(x, y) = exp(−‖(x, y) − (x_k, y_k)‖² / σ²)    (1)
Part affinity field P: the limb orientation is represented with a part affinity field. The key joint points are connected according to the natural connectivity of the human body; concretely, an affinity field is computed for the part between every two connected joint points, called the part affinity field. Let the coordinates of a part's starting joint point s be (x_s, y_s) and the coordinates of the terminating joint point e be (x_e, y_e). At each pixel of this joint region, a 2D vector is encoded pointing from the starting joint point to the terminating joint point, with the orthogonal distance threshold τ of the connection of the two joint points set to 0.5. Then, for the channel where s is located, the values of all pixels within the orthogonal threshold range of the connection of the two joint points are given by equation (2), where L_{e,s} is the Euclidean distance between joint point e and joint point s; for the channel where e is located, the values of all pixels within the orthogonal threshold range of the connection of the two joint points are given by equation (3).
The currently considered two-person skeleton has K joint points and M parts in total, giving joint confidence maps {C_i | i = 1, …, K} of K channels and part affinity fields {P_j | j = 1, …, M} of M channels. The K channels of joint confidence maps are superimposed to obtain the component confidence map C; the M channels of part affinity fields are superimposed to obtain the part affinity field P. Adding the component confidence map and the part affinity field gives the component feature F_c of size W_s × H_s, where W_s = 320 and H_s = 240. Finally, a 1 × 1 convolution is applied to F_c to output a skeleton attention map F_a of the same size.
3) Defining a Resnet-Attention network based on human skeleton Attention;
To extract image features, ResNet50 with the final fully connected layer removed is taken as the backbone network. Since the feature maps output by the penultimate and antepenultimate modules of ResNet50 have 2048 and 1024 channels respectively, overfitting occurs easily on a small-scale interaction behavior dataset, so 1 × 1 convolutions reduce the numbers of output feature channels to 1024 and 512 respectively. To obtain a multi-scale behavior characterization, the outputs of the last two convolution stages of the backbone are used jointly to obtain a spatial pyramid feature F_b.
The Resnet-Attention network uses the skeleton attention map F_a to enhance the image feature F_b; F_a is 240 × 320 in size. To fuse the image features and the skeleton attention, the image features of the two spatial-pyramid levels are expanded by bilinear interpolation to the same scale as the skeleton features, and the expanded feature maps are stacked along the channel direction, finally giving a unified feature map of dimensions 1536 × 240 × 320.
To further enhance the feature map $F_b$, the Resnet-Attention network computes the Hadamard product of $F_a$ and $F_b$ to obtain the enhanced feature map $F_{action}$:

$$F_{action} = F_a \odot F_b \tag{4}$$

Here, $\odot$ denotes the element-wise (Hadamard) product.
The main flow and result visualization of extracting two-person interaction features with the Resnet-Attention network are shown in figure 1. The network input is an RGB picture and human body frames. First, the backbone network extracts RGB image features; then, component features are computed from the human skeleton, and the skeleton attention map is obtained through a 1×1 convolution and a Sigmoid activation function; the RGB features and the skeleton attention map are combined by Hadamard product to obtain enhanced features; finally, two-person interaction features are extracted based on the enhanced features and the target bounding boxes.
Preferably, the two-person interaction features are extracted as follows: compute the minimum enclosing box MEB containing both person bounding boxes; input the MEB and the enhanced features into a RoIAlign module, which outputs an MEB region feature $F_{inter}$ of size 1536×5×5; next, flatten $F_{inter}$ into a vector of length 38400 (1536×5×5); apply LayerNorm, ReLU, Dropout (0.9) and FC operations in sequence to obtain a feature vector of length 512; finally, compute the interaction behavior classification score with an FC layer, where L is the number of interaction behavior categories.
4) Defining a double-flow network for multi-person interaction behavior understanding
Starting from multi-modal information and the attention mechanism, a double-flow network model is proposed. The first stream is the Resnet-Attention network based on human skeleton attention, which extracts enhanced RGB features; the second stream is based on skeleton data and uses a shift graph convolution network, currently the best-performing method for behavior recognition, to extract accurate skeleton features.
In said step 4), the double-flow network comprises two branches: the RGB feature extraction branch outputs the interaction behavior classification score $F_{rgb}$, and the shift graph convolution branch uses a customized Shift-GCN network to compute the score $F_{gcn}$. The original Shift-GCN requires the input to be a video-based skeleton sequence and outputs an action classification for the whole video sequence; since Shift-GCN targets video behavior classification, which differs from the target task, it cannot compute the interaction behavior classification score $F_{gcn}$ for an arbitrary pair of persons in a single frame. The shift graph convolution network is therefore modified to be compatible with single-frame data and to handle two-person interaction classification. The modifications are: removing the temporal-domain shift convolution, and changing the single-skeleton-sequence input into a two-person single-frame skeleton input.
Given two-person skeleton data $S \in \mathbb{R}^{c \times v}$, where $v$ is the number of joint points of the two-person skeleton and $c = 2$ is the number of channels (each coordinate component occupies one channel), the shift graph convolution network outputs the score vector $F_{gcn}$ for interaction behavior classification:
$$F_{gcn} = \mathrm{ShiftGCN}(S) \tag{5}$$
Finally, the double-flow network fuses $F_{rgb}$ and $F_{gcn}$:
$$F_{fused} = F_{rgb} + F_{gcn} \tag{6}$$
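The late fusion of equation (6) amounts to element-wise score addition; a minimal sketch (the class count L and the branch scores are placeholders):

```python
import torch

L = 8                               # assumed number of interaction categories
f_rgb = torch.randn(1, L)           # score from the Resnet-Attention branch
f_gcn = torch.randn(1, L)           # score from the modified Shift-GCN branch

f_fused = f_rgb + f_gcn             # Eq. (6): element-wise score addition
pred = f_fused.argmax(dim=1)        # predicted class for the person pair
```

Because both branches output scores over the same L categories, addition requires no learned fusion parameters.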
still further, the method comprises the steps of:
5) Network parameter training
During network parameter training, different training sets are used for the skeleton attention component network and the shift graph convolution network; they are described in two parts:
training uses a graphics card: GTX 1080Ti
Skeleton attention component network parameter training, see Table 1:
TABLE 1
Shift graph convolution network parameter training, see Table 2:
TABLE 2
Because the number of people differs greatly across pictures, different training parameters are set for different crowd sizes. When the number of interaction targets in a picture is $N = \{x \mid 2 \le x \le 5\}$, the above parameters are used for training. When $N = \{x \mid 5 < x \le 15\}$, the increased number of people causes the number of interaction groups to grow sharply, and no-action interaction groups account for a high proportion. A maximum distance Max_dis is therefore set during training: the distance dis between the centre points of two target frames in the picture is computed, and when dis ≥ Max_dis the interaction group is judged to be of the no-action type. Because the category proportions within the interaction groups are unbalanced, unbalanced weights are added to the loss functions of both networks, where 'OT' denotes the no-interaction category.
when the number N of people in the picture is {2,3,4,5}, according to the arrangement and combinationThe interaction group number M ε {1,2, …,10}. The input data of the network are local images of N persons and M groups of double area images in the image, however, the input data amounts of the N persons and the M groups of double area images are different due to the fact that the number of persons contained in different images is not fixed; adopting a filling operation to make the data quantity of each frame of data input into the network model be the same, setting an upper limit of the number of people max_num=5, and corresponding double group number max_interaction_num=10, and filling when the number of people is insufficient: the action categories of the supplementary data are all-1, and the human body boundary boxes are all [0,0];
When the number of people N is in {5, 6, …, 15}, by combination $M = \binom{N}{2}$ the number of interaction groups is $M \in \{10, 11, \dots, 105\}$; with the upper limit Max_num = 15, the corresponding number of two-person groups is Max_interaction_num = $\binom{15}{2} = 105$. The proportion of actual interaction groups among all groups falls below 5%, so the actual interaction groups are extremely unbalanced; on the other hand, the difference in crowd size causes excessive data padding: for example, for a scene containing 5 people, 105 − 10 = 95 groups must be padded. To solve these two problems, a maximum distance elimination method is adopted: a distance threshold Max_dis is set, the Euclidean distance dis between the centre points of two human body bounding boxes in the image is computed, and when dis ≥ Max_dis the pair is judged to have no interaction behavior. The maximum distance elimination method reduces the number of two-person groups input to the network model to within 36. In addition, an unbalanced class weight is added to the loss function of each model's training:
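The pair generation, maximum-distance elimination and padding described above can be sketched as follows (the concrete threshold value and the −1 sentinel placement are assumptions):

```python
from itertools import combinations
import math

MAX_DIS = 200.0        # hypothetical value for the Max_dis threshold
MAX_PAIRS = 36         # cap on two-person groups fed to the network

def center(box):
    """Centre point of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def make_pairs(boxes):
    """Keep pairs whose centre distance is below Max_dis, pad to a fixed size."""
    pairs = []
    for i, j in combinations(range(len(boxes)), 2):
        (ax, ay), (bx, by) = center(boxes[i]), center(boxes[j])
        if math.hypot(ax - bx, ay - by) < MAX_DIS:
            pairs.append((i, j))
    pairs = pairs[:MAX_PAIRS]
    pairs += [(-1, -1)] * (MAX_PAIRS - len(pairs))   # -1 = no-action sentinel
    return pairs
```

Distant pairs are discarded before padding, so the fixed-size batch mostly contains plausible interaction candidates rather than sentinel entries.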
In the above equation, OT represents any other human behavior than normal interactive behavior.
The embodiments described in this specification merely illustrate ways in which the inventive concept may be implemented. The scope of the present invention should not be construed as limited to the specific forms set forth in the embodiments; it also covers equivalents that would occur to those skilled in the art on the basis of the inventive concept.
Claims (1)
1. A multi-person concurrent interaction behavior understanding method based on a single frame image is characterized by comprising the following steps:
1) Inputting a picture, and combining skeleton estimation and a multi-target tracking algorithm to obtain human skeleton data and an interested region;
2) Generating a skeleton component confidence map and a component affinity field using the human skeleton data, constructing an attention map;
3) Defining a Resnet-Attention network based on human skeleton Attention;
4) Defining a double-flow network for understanding multi-person interaction behaviors;
starting from multi-modal information and the attention mechanism, a double-flow network model is proposed: the first stream is the Resnet-Attention network based on human skeleton attention, which extracts enhanced RGB features; the second stream is based on skeleton data and uses a shift graph convolution network, currently the best-performing method for behavior recognition, to extract accurate skeleton features;
in said step 1), the region of interest refers to a human body bounding box; accurate calculation of the region of interest is the basis for extracting interaction behavior features, and it is computed by combining a skeleton estimation algorithm with a multi-target tracking algorithm: AlphaPose extracts the human skeleton from the original image and outputs a human body bounding box, called the skeleton human frame; meanwhile, FairMOT tracks the human bodies in the video to obtain a bounding box for each person in a given frame, called the tracking human frame; the advantage of the skeleton human frame is that it fits the actual human body closely, whereas the tracking human frame easily leaves the limbs outside the bounding box; for complex scenes with severe occlusion or abnormal human pose, human skeleton estimation may fail, and in comparison the tracking human frame suffers fewer missing detections;
in said step 2), the obtained human skeletons and human frames are matched against the labeling data to obtain ordered human skeleton data and regions of interest; the ordered data comprise: human skeleton, skeleton human frame, tracking human frame, region of interest, interaction group serial number, interaction group action label and single-person action label, computed by the following steps:
1.1) Extract the human skeleton using the AlphaPose algorithm, and output the skeleton human frame;
1.2) Extract the tracking human frame using the FairMOT algorithm;
1.3) Using the skeleton human frame, tracking human frame and labeling data obtained in 1.1) and 1.2), compute the true action label and interaction group serial number of each bounding box; the labeling data comprise the human body frame, interaction group data and action label of the interaction target; the labeling data are matched with the tracking frames as follows: for any tracking frame B, find the labeled bounding box Bmax with the largest intersection-over-union with B; if Bmax exists and the corresponding intersection-over-union is greater than 0.5, Bmax is considered to match B, and the action label and interaction group serial number corresponding to Bmax are assigned to the tracking frame B;
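Step 1.3) is an intersection-over-union matching; a small sketch under assumed data structures (the dict keys are illustrative, not from the patent):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_label(track_box, labeled):
    """Assign the action label and group number of the best-overlapping labeled box."""
    best = max(labeled, key=lambda it: iou(track_box, it["box"]), default=None)
    if best is not None and iou(track_box, best["box"]) > 0.5:
        return best["action"], best["group"]
    return None, None
```

A tracking frame that overlaps no labeled box above the 0.5 threshold simply receives no label, mirroring the "if Bmax exists" condition.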
1.4) Fuse the skeleton frame and the tracking frame to obtain the fusion frame, according to the following rules:
1.4.1) When both the skeleton frame and the tracking frame are present: compute the intersection-over-union ρ of the skeleton frame and the tracking frame; when ρ is greater than 0.3, take the smaller of the two bounding boxes as the fusion frame; otherwise, take the skeleton frame as the fusion frame;
1.4.2) When the skeleton frame is present and the tracking frame is absent: take the skeleton frame as the fusion frame;
1.4.3) When the skeleton frame is absent and the tracking frame is present: if ρ is greater than 0.3, take the tracking frame as the fusion frame; otherwise, there is no fusion frame;
1.4.4) When both the skeleton frame and the tracking frame are absent: there is no fusion frame;
subsequent model training and testing use the fusion frame;
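Rules 1.4.1)-1.4.4) can be sketched as one function (note the patent invokes ρ in rule 1.4.3 even though the skeleton frame is absent there; this sketch simply takes the tracking frame in that case, which is a stated assumption):

```python
def _iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    u = ((a[2] - a[0]) * (a[3] - a[1])
         + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / u if u > 0 else 0.0

def fuse_boxes(skel_box, track_box, rho=0.3):
    """Fusion rules: prefer the tighter box when the two overlap enough."""
    if skel_box and track_box:                       # rule 1.4.1
        if _iou(skel_box, track_box) > rho:
            area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
            return min((skel_box, track_box), key=area)
        return skel_box
    if skel_box:                                     # rule 1.4.2
        return skel_box
    if track_box:                                    # rule 1.4.3 (simplified)
        return track_box
    return None                                      # rule 1.4.4
```

Preferring the smaller overlapping box reflects the stated observation that skeleton-derived boxes fit the body more tightly than tracking boxes.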
in said step 2), to enhance the RGB features, the attention map adopts the data form of the skeleton feature maps in the OpenPose algorithm and is divided into a component confidence map and a part affinity field;
component confidence map C: from the skeleton sequence $V = \{v_i \mid i = 1, \dots, K\}$ estimated by the AlphaPose algorithm, a confidence map is computed for each joint point, where $v_i = (x_i, y_i)$ denotes one of the key skeleton points of the human skeleton; a Gaussian-blur confidence map is computed for each joint coordinate, with σ the Gaussian blur threshold, and the confidence corresponding to position (x, y) for the k-th skeleton point on the feature map is:

$$C_k(x, y) = \exp\!\left(-\frac{(x - x_k)^2 + (y - y_k)^2}{\sigma^2}\right) \tag{1}$$
part affinity field P: limb orientation is represented by a part affinity field; key joint points are connected according to the natural connection pattern of the human body, by computing an affinity field, called a part affinity field, for the part between every two connected joint points; let the coordinates of the starting joint point $s$ of a part be $(x_s, y_s)$ and those of the terminating joint point $e$ be $(x_e, y_e)$; at each pixel of the part region, a 2D vector pointing from the starting joint point to the terminating joint point is encoded, and an orthogonal distance threshold τ for the connection of the two joint points is set; for the channel where $s$ is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

$$P_s = \frac{(x_e - x_s,\; y_e - y_s)}{L_{e,s}} \tag{2}$$
Above, $L_{e,s}$ is the Euclidean distance between joint point $e$ and joint point $s$. For the channel where $e$ is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

$$P_e = \frac{(x_s - x_e,\; y_s - y_e)}{L_{e,s}} \tag{3}$$
The currently considered two-person skeleton has K joint points and M parts in total, yielding joint confidence maps $\{C_i \mid i = 1, \dots, K\}$ over K channels and part affinity fields $\{P_j \mid j = 1, \dots, M\}$ over M channels. The K joint confidence maps are superimposed to obtain the component confidence map $C$; the M part affinity fields are superimposed to obtain the part affinity field $P$. Adding the component confidence map and the part affinity field gives the component feature $F_c$. Finally, a 1×1 convolution is applied to $F_c$ to output a skeleton attention map $F_a$ of the same size;
In said step 3), to extract image features, ResNet50 with the tail fully-connected layer removed is used as the backbone network; since the feature maps output by the last and penultimate modules of ResNet50 have 2048 and 1024 channels respectively, overfitting easily occurs on a small-scale interaction-behavior dataset, so 1×1 convolutions reduce the output feature channels to 1024 and 512 respectively; to obtain a multi-scale behavior characterization, the outputs of the penultimate and antepenultimate convolution layers of the backbone are used jointly to obtain the spatial pyramid feature $F_b$;
The Resnet-Attention network uses the skeleton attention map $F_a$ to enhance the image feature $F_b$; $F_a$ has size 240×320, and the two levels of the spatial pyramid feature $F_b$ have sizes 512×15×20 and 1024×8×10 respectively; to fuse the image features with the skeleton attention map, the two pyramid levels are expanded by bilinear interpolation to the same scale as the skeleton features, the expanded feature maps are stacked along the channel direction, and a unified feature map of dimensions 1536×240×320 is finally obtained;
To further enhance the feature map $F_b$, the Resnet-Attention network computes the Hadamard product of $F_a$ and $F_b$ to obtain the enhanced feature map $F_{action}$:

$$F_{action} = F_a \odot F_b \tag{4}$$

Here, $\odot$ denotes the element-wise (Hadamard) product;
The network input is an RGB picture and human body frames. First, the backbone network extracts RGB image features; then, component features are computed from the human skeleton, and the skeleton attention map is obtained through a 1×1 convolution and a Sigmoid activation function; the RGB features and the skeleton attention map are combined by Hadamard product to obtain enhanced features; finally, two-person interaction features are extracted based on the enhanced features and the target bounding boxes;
in said step 3), the two-person interaction features are extracted as follows: compute the minimum enclosing box MEB containing both person bounding boxes; input the MEB and the enhanced features into a RoIAlign module, which outputs an MEB region feature $F_{inter}$ of size 1536×5×5; next, flatten $F_{inter}$ into a vector of length 38400 (1536×5×5); apply LayerNorm, ReLU, Dropout (0.9) and FC operations in sequence to obtain a feature vector of length 512; finally, compute the interaction behavior classification score with an FC layer, where L is the number of interaction behavior categories;
in said step 4), the double-flow network comprises two branches: the RGB feature extraction branch outputs the interaction behavior classification score $F_{rgb}$, and the shift graph convolution branch uses a customized Shift-GCN network to compute the score $F_{gcn}$; the original Shift-GCN requires the input to be a video-based skeleton sequence and outputs an action classification for the whole video sequence; since Shift-GCN targets video behavior classification, which differs from the target task, it cannot compute the interaction behavior classification score $F_{gcn}$ for an arbitrary pair of persons in a single frame; the shift graph convolution network is therefore modified to be compatible with single-frame data and to handle two-person interaction classification, the modifications being: removing the temporal-domain shift convolution, and changing the single-skeleton-sequence input into a two-person single-frame skeleton input;
Given two-person skeleton data $S \in \mathbb{R}^{c \times v}$, where $v$ is the number of joint points of the two-person skeleton and $c = 2$ is the number of channels (each coordinate component occupies one channel), the shift graph convolution network outputs the score vector $F_{gcn}$ for interaction behavior classification:
$$F_{gcn} = \mathrm{ShiftGCN}(S) \tag{5}$$
Finally, the double-flow network fuses $F_{rgb}$ and $F_{gcn}$:
$$F_{fused} = F_{rgb} + F_{gcn} \tag{6}$$
the method further comprises the steps of:
5) Network parameter training
During network parameter training, different training sets are used for the skeleton attention component network and the shift graph convolution network; they are described in two parts:
training uses a graphics card: GTX 1080Ti
Skeleton attention component network parameter training, see Table 1:
TABLE 1
Shift graph convolution network parameter training, see Table 2:
TABLE 2
Because the number of people differs greatly across pictures, different training parameters are set for different crowd sizes. When the number of interaction targets in a picture is $N = \{x \mid 2 \le x \le 5\}$, the above parameters are used for training. When $N = \{x \mid 5 < x \le 15\}$, the increased number of people causes the number of interaction groups to grow sharply, and no-action interaction groups account for a high proportion. A maximum distance Max_dis is therefore set during training: the distance dis between the centre points of two target frames in the picture is computed, and when dis ≥ Max_dis the interaction group is judged to be of the no-action type. Because the category proportions within the interaction groups are unbalanced, unbalanced weights are added to the loss functions of both networks, where 'OT' denotes the no-interaction category;
The maximum distance elimination method is adopted: a distance threshold Max_dis is set, the Euclidean distance dis between the centre points of two human body bounding boxes in the image is computed, and when dis ≥ Max_dis the pair is judged to have no interaction behavior; the maximum distance elimination method reduces the number of two-person groups input to the network model to within 36; in addition, an unbalanced class weight is added to the loss function of each model's training:
In the above equation, OT represents any other human behavior than normal interactive behavior.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110259862.6A CN113158782B (en) | 2021-03-10 | 2021-03-10 | Multi-person concurrent interaction behavior understanding method based on single-frame image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113158782A CN113158782A (en) | 2021-07-23 |
CN113158782B true CN113158782B (en) | 2024-03-26 |
Family
ID=76886824
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115797655B (en) * | 2022-12-13 | 2023-11-07 | 南京恩博科技有限公司 | Character interaction detection model, method, system and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111160164A (en) * | 2019-12-18 | 2020-05-15 | 上海交通大学 | Action recognition method based on human body skeleton and image fusion |
CN111582220A (en) * | 2020-05-18 | 2020-08-25 | 中国科学院自动化研究所 | Skeleton point behavior identification system based on shift diagram convolution neural network and identification method thereof |
CN111985343A (en) * | 2020-07-23 | 2020-11-24 | 深圳大学 | Method for constructing behavior recognition deep network model and behavior recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||