CN113158782B - Multi-person concurrent interaction behavior understanding method based on single-frame image - Google Patents

Multi-person concurrent interaction behavior understanding method based on single-frame image

Info

Publication number
CN113158782B
CN113158782B
Authority
CN
China
Prior art keywords
skeleton
frame
interaction
network
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110259862.6A
Other languages
Chinese (zh)
Other versions
CN113158782A (en)
Inventor
王振华
周瑾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110259862.6A priority Critical patent/CN113158782B/en
Publication of CN113158782A publication Critical patent/CN113158782A/en
Application granted granted Critical
Publication of CN113158782B publication Critical patent/CN113158782B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Abstract

A multi-person concurrent interaction behavior understanding method based on a single-frame image comprises the following steps: 1) inputting a picture and combining skeleton estimation with a multi-target tracking algorithm to obtain human skeleton data and regions of interest; 2) generating a skeleton component confidence map and a component affinity field from the human skeleton data and constructing an attention map; 3) defining a Resnet-Attention network based on human skeleton attention; 4) defining a dual-stream network for understanding multi-person interaction behaviors; 5) training the network parameters. The proposed algorithm uses the attention map to enhance the convolutional features of the RGB image and extracts two-person interaction features based on human skeleton data and a shift graph convolution network, thereby modeling the multi-person interaction behaviors in a single-frame image and obtaining an effective interaction behavior representation. The method is suitable for understanding concurrent interactions among multiple people in a single-frame image.

Description

Multi-person concurrent interaction behavior understanding method based on single-frame image
Technical Field
The invention belongs to the field of image understanding in computer vision, and relates to a multi-person interaction understanding method.
Background
To build new smart cities, protect personal safety and reduce property loss, the monitoring capability of cameras in public places needs to be improved, so that crowd behaviors in a surveillance scene can be automatically and accurately recognized and understood from video data, enabling intelligent computer-assisted analysis and real-time networked early warning of key events. Multi-person interaction understanding requires automatically recognizing, from videos or images, the interaction relationships and interaction behavior categories between people, such as "boxing", "kicking", "pushing" and "skimming". Existing techniques face two problems: the first is modality limitation, i.e., they usually rely on single-modality information, which is insufficient for understanding complex human interactions; the second is modality missing, i.e., in interactive scenes, regions of interest may be missing because of human body occlusion.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-person concurrent interaction behavior understanding method based on a single frame image, which can effectively identify the interaction relationship and interaction behavior category between every two persons in a multi-person scene.
The technical scheme adopted for solving the technical problems is as follows:
a multi-person concurrent interaction behavior understanding method based on a single frame image, the method comprising the steps of:
1) Inputting a picture, and combining skeleton estimation and a multi-target tracking algorithm to obtain human skeleton data and regions of interest;
2) Generating a skeleton component confidence map and a component affinity field using the human skeleton data, constructing an attention map;
3) Defining a Resnet-Attention network based on human skeleton Attention;
4) Defining a dual-stream network for multi-person interaction behavior understanding
Starting from multi-modal information and the attention mechanism, a dual-stream network model is proposed: the first stream is a Resnet-Attention network based on human skeleton attention and extracts enhanced RGB features; the second stream works on skeleton data and uses a shift graph convolution network, currently among the best-performing methods for behavior recognition, to extract accurate skeleton features.
Further, in step 1), the region of interest refers to a human body bounding box. Accurate computation of the region of interest is the basis for extracting interaction behavior features, and it is computed by combining a skeleton estimation algorithm with a multi-target tracking algorithm: AlphaPose extracts human skeletons from the original image and outputs the corresponding human body bounding boxes, called skeleton human boxes; meanwhile, FairMOT tracks the human bodies in the video, yielding a human body bounding box for each person in a given frame, called a tracking human box. The skeleton human box fits the actual human body closely, whereas the tracking human box often leaves the limbs outside the bounding box; on the other hand, human skeleton estimation may fail in complex scenes with severe occlusion or unusual human poses, while tracking human boxes are less frequently missing.
Further, the obtained human skeletons and human boxes need to be matched against the annotation data to obtain ordered human skeleton data and regions of interest. The ordered data comprise: human skeleton, skeleton human box, tracking human box, region of interest, interaction group index, interaction group action label, and single-person action label, computed by the following steps:
1.1) Extract human skeletons with the AlphaPose algorithm and output the skeleton human boxes;
1.2) Extract tracking human boxes with the FairMOT algorithm;
1.3) Compute the true action label and interaction group index of each bounding box from the skeleton human boxes and tracking human boxes obtained in 1.1) and 1.2) and the annotation data; the annotation data comprise the human boxes, the interaction group data and the action labels of the interaction targets, and are matched to the tracking boxes as follows: for any tracking box B, find the annotated bounding box Bmax with the largest intersection-over-union (IoU) with B; if Bmax exists and the corresponding IoU is larger than 0.5, Bmax is considered matched to B, and the action label and interaction group index of Bmax are assigned to the tracking box B;
1.4) Fuse the skeleton box and the tracking box to obtain a fused box; the fusion rules are as follows:
1.4.1) When both the skeleton box and the tracking box exist: compute the IoU ρ of the skeleton box and the tracking box; when ρ is larger than 0.3, take the smaller of the two bounding boxes as the fused box; otherwise, take the skeleton box as the fused box;
1.4.2) When the skeleton box exists and the tracking box is missing: take the skeleton box as the fused box;
1.4.3) When the skeleton box is missing and the tracking box exists: if ρ is larger than 0.3, take the tracking box as the fused box; otherwise, there is no fused box;
1.4.4) When both the skeleton box and the tracking box are missing: there is no fused box;
Subsequent model training and testing use the fused boxes.
Still further, in step 2), to enhance the RGB features, the invention adopts the data form of the skeleton feature maps in the OpenPose algorithm and splits the attention map into a component confidence map and a component affinity field;
Component confidence map C: for the skeleton sequence V = {v_i | i = 1, …, K} estimated by the AlphaPose algorithm, a confidence map is computed for each joint point, where v_i = (x_i, y_i) denotes one of the key skeleton points of the human skeleton. A confidence map is computed for each joint coordinate with a Gaussian blur, where σ is the Gaussian blur threshold and is set to 0.5; the confidence at position (x, y) of the feature map for the k-th skeleton point is:

C_k(x, y) = exp(−((x − x_k)² + (y − y_k)²) / σ²)    (1)
Component affinity field P: limb orientation is represented with component affinity fields. Key joint points are connected according to the natural connection pattern of the human body; concretely, an affinity field, called the component affinity field, is computed for the component between every two connected joint points. Let the coordinates of the starting joint point s of a component be (x_s, y_s) and the coordinates of the ending joint point e be (x_e, y_e). At every pixel of this component region, a 2D vector pointing from the starting joint point to the ending joint point is encoded, and the orthogonal distance threshold τ of the connection between the two joint points is set to 0.5. Then, for the channel where s is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

P_s(x, y) = ((x_e − x_s), (y_e − y_s)) / L_{e,s}    (2)
Above, L_{e,s} is the Euclidean distance between joint point e and joint point s. For the channel where e is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

P_e(x, y) = ((x_s − x_e), (y_s − y_e)) / L_{e,s}    (3)
The two-person skeleton considered here has K joint points and M components in total, yielding K channels of joint confidence maps {C_i | i = 1, …, K} and M channels of component affinity fields {P_j | j = 1, …, M}. The K channels of joint confidence maps are superimposed to obtain the component confidence map C; the M channels of component affinity fields are superimposed to obtain the component affinity field P; and the component confidence map and the component affinity field are added to obtain the component features F_c of size W_s × H_s, where W_s = 320 and H_s = 240. Finally, a 1 × 1 convolution is applied to F_c to output a skeleton attention map F_a of the same size.
In step 3), to extract image features, ResNet50 with its final fully-connected layer removed is taken as the backbone network. Since the feature maps output by the last two residual modules of ResNet50 have 2048 and 1024 channels respectively and easily cause overfitting on a small-scale interaction behavior dataset, 1 × 1 convolutions are used to reduce the numbers of output feature channels to 1024 and 512 respectively. To obtain a multi-scale behavior representation, the outputs of these two convolution stages of the backbone are used jointly to obtain the spatial pyramid features F_b.
The Resnet-Attention network uses the skeleton attention map F_a to enhance the image features F_b. F_a has size 240 × 320, and the two levels of the spatial pyramid features F_b have sizes 512 × 15 × 20 and 1024 × 8 × 10. To fuse the image features with the skeleton attention, the two pyramid levels are expanded to the same scale as the skeleton features by bilinear interpolation, and the expanded feature maps are stacked along the channel dimension, finally yielding a unified feature map of dimensions 1536 × 240 × 320;
To further enhance the feature map F_b, the Resnet-Attention network computes the Hadamard product of F_a and F_b to obtain the enhanced feature map F_action:
F_action = F_a ⊙ F_b    (4)
where ⊙ denotes the element-wise (Hadamard) product.
The main pipeline for extracting two-person interaction behavior features with the Resnet-Attention network, together with a result visualization, is shown in Fig. 1. The network input is an RGB picture and the human boxes. First, the backbone network extracts the RGB image features; then the component features are computed from the human skeletons, and the skeleton attention map is obtained through a 1 × 1 convolution and a Sigmoid activation function; the RGB features and the skeleton attention map are combined by a Hadamard product to obtain the enhanced features; finally, the two-person interaction features are extracted based on the enhanced features and the target bounding boxes.
Preferably, the two-person interaction features are extracted as follows: compute the minimum enclosing box MEB that contains the two-person bounding boxes; feed the MEB and the enhanced features into a RoIAlign module, which outputs the MEB region features F_inter of size 1536 × 5 × 5; next, flatten F_inter into a vector of length 38400 (1536 × 5 × 5); apply LayerNorm, ReLU, Dropout (0.9) and FC operations to the vector in sequence to obtain a feature vector of length 512; finally, compute the L-dimensional interaction behavior classification score F_rgb with an FC layer, where L denotes the number of interaction behavior categories.
Further, in step 4), the dual-stream network comprises two branches: the RGB feature extraction branch outputs the interaction behavior classification score F_rgb, and the shift graph convolution branch uses a customized Shift-GCN network to compute the score F_gcn. The original Shift-GCN requires the input to be a video-based skeleton sequence and outputs the action classification result for the whole video sequence; because Shift-GCN targets video behavior classification, which differs from the present task, it cannot directly compute the interaction behavior classification score F_gcn for an arbitrary pair of persons in a single-frame image. The shift graph convolution network is therefore modified to be compatible with single-frame data and to handle the two-person interaction behavior classification problem; the modifications include: removing the temporal shift convolution, and changing the single-person skeleton sequence input into a two-person single-frame skeleton input;
Given the two-person skeleton data S, where V is the number of joint points of the two-person skeleton and C = 2 is the number of channels (each coordinate component occupies one channel), the shift graph convolution network outputs the interaction behavior classification score vector F_gcn:
F_gcn = ShiftGCN(S)    (5)
Finally, the dual-stream network fuses F_rgb and F_gcn:
F_fused = F_rgb + F_gcn    (6).
still further, the method comprises the steps of:
5) Network parameter training
During network parameter training, the training configurations adopted for the skeleton attention (Resnet-Attention) network and for the shift graph convolution network are different and are described in two parts:
training uses a graphics card: GTX 1080Ti
For the training parameters of the skeleton attention (Resnet-Attention) network, refer to Table 1:
TABLE 1
For the training parameters of the shift graph convolution network, refer to Table 2:
TABLE 2
Because the number of people differs greatly across pictures, different training parameters are set for different numbers of people. When the number of interaction targets in a picture is N = {x | 2 ≤ x ≤ 5}, the above parameters are used for training. When the number of interaction targets in a picture is N = {x | 5 < x ≤ 15}, the larger number of people makes the number of interaction groups grow sharply, and the proportion of interaction groups without an action category is high; a maximum distance Max_dis is therefore set during training, the distance dis between the center points of two target boxes in the picture is computed, and when dis ≥ Max_dis the interaction group is judged to have no action-category interaction. Because the category proportions within the interaction groups are unbalanced, unbalanced class weights are added to the loss functions of the two networks, where 'OT' denotes the no-interaction category;
When the number of people N in a picture is in {2, 3, 4, 5}, the combination formula M = N(N − 1)/2 gives the number of interaction groups M ∈ {1, 2, …, 10}. The network input consists of the local images of the N persons and the M two-person region images of the picture; however, because the number of persons contained in different pictures is not fixed, the amounts of input data differ. A padding operation is therefore adopted so that every frame fed into the network model carries the same amount of data: the upper limit on the number of people is set to Max_num = 5, with the corresponding number of two-person groups Max_interaction_num = 10, and padding is applied when there are fewer people: the action categories of the padded data are all −1, and the human bounding boxes are all [0,0];
When the number of people N in a picture is in {5, 6, …, 15}, the combination formula gives the number of interaction groups M ∈ {10, 11, …, 105}; the upper limit on the number of people is set to Max_num = 15, with the corresponding number of two-person groups Max_interaction_num = 105. The true interaction groups then account for less than 5% of the total number of groups, so the proportion of true interaction groups is extremely unbalanced; on the other hand, the differences in the number of people would cause excessive padding: for example, for a scene containing 5 people, 95 groups would have to be padded. To solve these two problems, a maximum-distance elimination method is adopted: a distance threshold Max_dis is set, the Euclidean distance dis between the center points of two human bounding boxes in the image is computed, and when dis ≥ Max_dis the pair is judged to have no interaction behavior. The maximum-distance elimination reduces the number of two-person groups fed into the network model to within 36 groups; in addition, unbalanced class weights are added to the loss function used for training each model.
Here, OT represents any human behavior other than the normal interaction behaviors.
In summary, the method first derives regions of interest from a skeleton estimation algorithm and a multi-target tracking algorithm; the regions of interest are used to crop local region features from the feature maps. Second, a dual-stream network based on skeleton attention is proposed: one branch extracts the two-person RGB image features and enhances them with the human skeletons; the other branch uses the human skeleton data and a shift graph convolution network to extract two-person interaction features. The multi-person interaction behaviors in a single-frame image are thereby modeled, and a more effective interaction behavior representation is obtained.
The beneficial effects of the invention are mainly as follows: the skeleton attention map enhances the convolutional features of the RGB image, and two-person interaction features are extracted based on human skeleton data and a shift graph convolution network, so that the multi-person interaction behaviors in a single-frame image are modeled and an effective interaction behavior representation is obtained. The method is suitable for understanding concurrent interactions among multiple people in a single-frame image.
Drawings
Fig. 1 is a schematic diagram of a two-person interaction feature extraction based on skeletal attention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, a multi-person concurrent interaction behavior understanding method based on a single frame image includes the steps of:
1) Inputting a picture, and combining skeleton estimation and a multi-target tracking algorithm to obtain human skeleton data and regions of interest;
The region of interest refers to a human body bounding box. Accurate computation of the region of interest is the basis for extracting interaction behavior features, and it is computed by combining a skeleton estimation algorithm with a multi-target tracking algorithm: AlphaPose extracts human skeletons from the original image and outputs the corresponding human body bounding boxes, called skeleton human boxes; meanwhile, FairMOT tracks the human bodies in the video, yielding a human body bounding box for each person in a given frame, called a tracking human box. The skeleton human box fits the actual human body closely, whereas the tracking human box often leaves the limbs outside the bounding box; on the other hand, human skeleton estimation may fail in complex scenes with severe occlusion or unusual human poses, while tracking human boxes are less frequently missing.
Further, the obtained human skeletons and human boxes need to be matched against the annotation data to obtain ordered human skeleton data and regions of interest. The ordered data comprise: human skeleton, skeleton human box, tracking human box, region of interest, interaction group index, interaction group action label, and single-person action label, computed by the following steps:
1.1) Extract human skeletons with the AlphaPose algorithm and output the skeleton human boxes;
1.2) Extract tracking human boxes with the FairMOT algorithm;
1.3) Compute the true action label and interaction group index of each bounding box from the skeleton human boxes and tracking human boxes obtained in 1.1) and 1.2) and the annotation data; the annotation data comprise the human boxes, the interaction group data and the action labels of the interaction targets, and are matched to the tracking boxes as follows: for any tracking box B, find the annotated bounding box Bmax with the largest intersection-over-union (IoU) with B; if Bmax exists and the corresponding IoU is larger than 0.5, Bmax is considered matched to B, and the action label and interaction group index of Bmax are assigned to the tracking box B;
1.4) Fuse the skeleton box and the tracking box to obtain a fused box; the fusion rules, illustrated by the code sketch after this list, are as follows:
1.4.1) When both the skeleton box and the tracking box exist: compute the IoU ρ of the skeleton box and the tracking box; when ρ is larger than 0.3, take the smaller of the two bounding boxes as the fused box; otherwise, take the skeleton box as the fused box;
1.4.2) When the skeleton box exists and the tracking box is missing: take the skeleton box as the fused box;
1.4.3) When the skeleton box is missing and the tracking box exists: if ρ is larger than 0.3, take the tracking box as the fused box; otherwise, there is no fused box;
1.4.4) When both the skeleton box and the tracking box are missing: there is no fused box;
Subsequent model training and testing use the fused boxes.
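The box matching and box fusion logic above can be summarized by the following minimal Python sketch. It assumes boxes in [x1, y1, x2, y2] pixel format and uses the 0.5 and 0.3 thresholds from the text; all function names are illustrative, and because rule 1.4.3 refers to ρ although the skeleton box needed to compute it is absent, the sketch simply keeps the tracking box in that case.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_label(tracking_box, annotated_boxes, labels, iou_thresh=0.5):
    """Rule 1.3: assign the label of the annotated box with the largest IoU, if that IoU > 0.5."""
    if not annotated_boxes:
        return None
    ious = [iou(tracking_box, b) for b in annotated_boxes]
    k = int(np.argmax(ious))
    return labels[k] if ious[k] > iou_thresh else None

def fuse_boxes(skeleton_box, tracking_box, rho_thresh=0.3):
    """Rules 1.4.1-1.4.4: prefer the tighter skeleton box, fall back to the tracking box."""
    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])
    if skeleton_box is not None and tracking_box is not None:
        rho = iou(skeleton_box, tracking_box)
        if rho > rho_thresh:
            return min(skeleton_box, tracking_box, key=area)  # the smaller bounding box
        return skeleton_box
    if skeleton_box is not None:   # tracking box missing -> keep the skeleton box
        return skeleton_box
    if tracking_box is not None:   # skeleton box missing -> keep the tracking box (see lead-in)
        return tracking_box
    return None                    # both missing -> no fused box
```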
2) Generating a skeleton component confidence map and a component affinity field using the human skeleton data, constructing an attention map;
In order to enhance the RGB features, the invention adopts the data form of the skeleton feature maps in the OpenPose algorithm and splits the attention map into a component confidence map and a component affinity field;
Component confidence map C: for the skeleton sequence V = {v_i | i = 1, …, K} estimated by the AlphaPose algorithm, a confidence map is computed for each joint point, where v_i = (x_i, y_i) denotes one of the key skeleton points of the human skeleton. A confidence map is computed for each joint coordinate with a Gaussian blur, where σ is the Gaussian blur threshold and is set to 0.5; the confidence at position (x, y) of the feature map for the k-th skeleton point is:

C_k(x, y) = exp(−((x − x_k)² + (y − y_k)²) / σ²)    (1)
Component affinity field P: limb orientation is represented with component affinity fields. Key joint points are connected according to the natural connection pattern of the human body; concretely, an affinity field, called the component affinity field, is computed for the component between every two connected joint points. Let the coordinates of the starting joint point s of a component be (x_s, y_s) and the coordinates of the ending joint point e be (x_e, y_e). At every pixel of this component region, a 2D vector pointing from the starting joint point to the ending joint point is encoded, and the orthogonal distance threshold τ of the connection between the two joint points is set to 0.5. Then, for the channel where s is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

P_s(x, y) = ((x_e − x_s), (y_e − y_s)) / L_{e,s}    (2)
Above, L_{e,s} is the Euclidean distance between joint point e and joint point s. For the channel where e is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

P_e(x, y) = ((x_s − x_e), (y_s − y_e)) / L_{e,s}    (3)
The two-person skeleton considered here has K joint points and M components in total, yielding K channels of joint confidence maps {C_i | i = 1, …, K} and M channels of component affinity fields {P_j | j = 1, …, M}. The K channels of joint confidence maps are superimposed to obtain the component confidence map C; the M channels of component affinity fields are superimposed to obtain the component affinity field P; and the component confidence map and the component affinity field are added to obtain the component features F_c of size W_s × H_s, where W_s = 320 and H_s = 240. Finally, a 1 × 1 convolution is applied to F_c to output a skeleton attention map F_a of the same size.
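As a non-authoritative illustration, the following numpy sketch builds the component confidence map, the component affinity field, and their sum F_c for a toy limb at the 240 × 320 resolution used above, with σ = 0.5 and τ = 0.5. The Gaussian and unit-vector formulas follow the OpenPose-style construction assumed in equations (1)–(3); the 1 × 1 convolution and Sigmoid that produce F_a belong to the network and are not shown.

```python
import numpy as np

H, W = 240, 320                                   # H_s x W_s resolution of the attention map

def confidence_map(joint, sigma=0.5):
    """Gaussian bump centred on one joint point (x_k, y_k), cf. eq. (1)."""
    ys, xs = np.mgrid[0:H, 0:W]
    x, y = joint
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / sigma ** 2)

def affinity_field(start, end, tau=0.5):
    """Unit vector from `start` to `end` on pixels within orthogonal distance tau of the part."""
    s, e = np.asarray(start, float), np.asarray(end, float)
    length = np.linalg.norm(e - s) + 1e-9                     # L_{e,s}
    u = (e - s) / length                                      # component direction
    ys, xs = np.mgrid[0:H, 0:W]
    d = np.stack([xs - s[0], ys - s[1]], axis=-1)             # offsets from the start joint
    along = d @ u                                             # projection along the component
    ortho = np.abs(d[..., 0] * u[1] - d[..., 1] * u[0])       # orthogonal distance to the component
    field = np.zeros((H, W, 2))
    field[(along >= 0) & (along <= length) & (ortho <= tau)] = u
    return field

# Toy two-joint component: superimpose confidence maps and affinity fields, then add them (F_c).
C = confidence_map((160, 120)) + confidence_map((200, 140))
P = affinity_field((160, 120), (200, 140)).sum(axis=-1)       # collapse the 2 vector channels
F_c = C + P   # a 1x1 convolution and a Sigmoid then turn F_c into the attention map F_a
```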
3) Defining a Resnet-Attention network based on human skeleton Attention;
To extract image features, ResNet50 with its final fully-connected layer removed is taken as the backbone network. Since the feature maps output by the last two residual modules of ResNet50 have 2048 and 1024 channels respectively and easily cause overfitting on a small-scale interaction behavior dataset, 1 × 1 convolutions are used to reduce the numbers of output feature channels to 1024 and 512 respectively. To obtain a multi-scale behavior representation, the outputs of these two convolution stages of the backbone are used jointly to obtain the spatial pyramid features F_b.
The Resnet-Attention network uses the skeleton attention map F_a to enhance the image features F_b. F_a has size 240 × 320, and the two levels of the spatial pyramid features F_b have sizes 512 × 15 × 20 and 1024 × 8 × 10. To fuse the image features with the skeleton attention, the two pyramid levels are expanded to the same scale as the skeleton features by bilinear interpolation, and the expanded feature maps are stacked along the channel dimension, finally yielding a unified feature map of dimensions 1536 × 240 × 320;
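A hedged PyTorch sketch of this backbone is given below: ResNet50 without its classification layer, 1 × 1 convolutions reducing the last two stages to 1024 and 512 channels, bilinear upsampling of both maps to 240 × 320, and channel-wise concatenation into the 1536-channel pyramid F_b. The use of torchvision's resnet50 and the exact wiring are assumptions made for illustration, not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class PyramidBackbone(nn.Module):
    """ResNet50 trunk -> 1x1 channel reduction -> bilinear upsampling -> 1536-channel pyramid."""
    def __init__(self, out_size=(240, 320)):
        super().__init__()
        r = resnet50()                                        # the classification (FC) layer is unused
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                  r.layer1, r.layer2, r.layer3)
        self.layer4 = r.layer4
        self.reduce3 = nn.Conv2d(1024, 512, kernel_size=1)    # penultimate stage: 1024 -> 512
        self.reduce4 = nn.Conv2d(2048, 1024, kernel_size=1)   # last stage: 2048 -> 1024
        self.out_size = out_size

    def forward(self, x):
        c3 = self.stem(x)                                     # (B, 1024, 15, 20) for 240x320 input
        c4 = self.layer4(c3)                                  # (B, 2048, 8, 10)
        p3 = F.interpolate(self.reduce3(c3), self.out_size, mode="bilinear", align_corners=False)
        p4 = F.interpolate(self.reduce4(c4), self.out_size, mode="bilinear", align_corners=False)
        return torch.cat([p3, p4], dim=1)                     # F_b: (B, 1536, 240, 320)

F_b = PyramidBackbone()(torch.randn(1, 3, 240, 320))
```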
To further enhance the feature map F_b, the Resnet-Attention network computes the Hadamard product of F_a and F_b to obtain the enhanced feature map F_action:
F_action = F_a ⊙ F_b    (4)
where ⊙ denotes the element-wise (Hadamard) product.
The main pipeline for extracting two-person interaction behavior features with the Resnet-Attention network, together with a result visualization, is shown in Fig. 1. The network input is an RGB picture and the human boxes. First, the backbone network extracts the RGB image features; then the component features are computed from the human skeletons, and the skeleton attention map is obtained through a 1 × 1 convolution and a Sigmoid activation function; the RGB features and the skeleton attention map are combined by a Hadamard product to obtain the enhanced features; finally, the two-person interaction features are extracted based on the enhanced features and the target bounding boxes.
Preferably, the two-person interaction features are extracted as follows: compute the minimum enclosing box MEB that contains the two-person bounding boxes; feed the MEB and the enhanced features into a RoIAlign module, which outputs the MEB region features F_inter of size 1536 × 5 × 5; next, flatten F_inter into a vector of length 38400 (1536 × 5 × 5); apply LayerNorm, ReLU, Dropout (0.9) and FC operations to the vector in sequence to obtain a feature vector of length 512; finally, compute the L-dimensional interaction behavior classification score F_rgb with an FC layer, where L denotes the number of interaction behavior categories.
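The two-person feature head described above can be sketched in PyTorch as follows, under the assumption that the skeleton attention map is a single-channel map broadcast over the 1536 feature channels; the module and variable names are illustrative, not the patented code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class PairHead(nn.Module):
    """MEB RoIAlign -> flatten -> LayerNorm -> ReLU -> Dropout(0.9) -> FC(512) -> FC(L)."""
    def __init__(self, num_classes, channels=1536, pool=5):
        super().__init__()
        self.pool = pool
        self.norm = nn.LayerNorm(channels * pool * pool)
        self.drop = nn.Dropout(0.9)
        self.fc1 = nn.Linear(channels * pool * pool, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, f_action, box_a, box_b):
        # minimum enclosing box (MEB) of the two person boxes, for batch item 0
        x1, y1 = torch.min(box_a[0], box_b[0]), torch.min(box_a[1], box_b[1])
        x2, y2 = torch.max(box_a[2], box_b[2]), torch.max(box_a[3], box_b[3])
        meb = torch.stack([box_a.new_zeros(()), x1, y1, x2, y2])[None]  # (1, 5): batch idx + box
        f_inter = roi_align(f_action, meb, output_size=self.pool)       # (1, 1536, 5, 5)
        v = f_inter.flatten(1)                                          # (1, 38400)
        v = self.fc1(self.drop(F.relu(self.norm(v))))                   # (1, 512)
        return self.fc2(v)                                              # (1, L) interaction scores

f_b = torch.randn(1, 1536, 240, 320)                  # spatial pyramid features F_b
f_a = torch.sigmoid(torch.randn(1, 1, 240, 320))      # skeleton attention map F_a (broadcast)
f_action = f_a * f_b                                  # eq. (4): Hadamard enhancement
box_a = torch.tensor([50.0, 40.0, 150.0, 220.0])      # person boxes in pixel coordinates
box_b = torch.tensor([140.0, 60.0, 260.0, 230.0])
scores = PairHead(num_classes=6)(f_action, box_a, box_b)
```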
4) Defining a dual-stream network for multi-person interaction behavior understanding
Starting from multi-modal information and the attention mechanism, a dual-stream network model is proposed: the first stream is a Resnet-Attention network based on human skeleton attention and extracts enhanced RGB features; the second stream works on skeleton data and uses a shift graph convolution network, currently among the best-performing methods for behavior recognition, to extract accurate skeleton features.
In step 4), the dual-stream network comprises two branches: the RGB feature extraction branch outputs the interaction behavior classification score F_rgb, and the shift graph convolution branch uses a customized Shift-GCN network to compute the score F_gcn. The original Shift-GCN requires the input to be a video-based skeleton sequence and outputs the action classification result for the whole video sequence; because Shift-GCN targets video behavior classification, which differs from the present task, it cannot directly compute the interaction behavior classification score F_gcn for an arbitrary pair of persons in a single-frame image. The shift graph convolution network is therefore modified to be compatible with single-frame data and to handle the two-person interaction behavior classification problem; the modifications include: removing the temporal shift convolution, and changing the single-person skeleton sequence input into a two-person single-frame skeleton input;
Given the two-person skeleton data S, where V is the number of joint points of the two-person skeleton and C = 2 is the number of channels (each coordinate component occupies one channel), the shift graph convolution network outputs the interaction behavior classification score vector F_gcn:
F_gcn = ShiftGCN(S)    (5)
Finally, the dual-stream network fuses F_rgb and F_gcn:
F_fused = F_rgb + F_gcn    (6).
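A minimal sketch of the two-stream fusion in equations (5) and (6) is shown below. Since the customized single-frame Shift-GCN is not reproduced here, `SingleFrameShiftGCN` is only a stand-in interface (temporal shifts removed, two-person single-frame skeleton input, L-way score output), not the published Shift-GCN code.

```python
import torch
import torch.nn as nn

class SingleFrameShiftGCN(nn.Module):
    """Stand-in for the customized Shift-GCN: two-person single-frame skeleton in, L scores out.
    Input S has shape (B, C=2, V): x/y coordinate channels over the V joints of both persons."""
    def __init__(self, num_joints, num_classes, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # (B, 2*V)
            nn.Linear(2 * num_joints, hidden),   # the spatial shift graph convolutions would go here
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, s):
        return self.net(s)

V, L = 34, 6                          # e.g. 2 persons x 17 joints, L interaction categories
S = torch.randn(1, 2, V)              # two-person single-frame skeleton coordinates
F_gcn = SingleFrameShiftGCN(V, L)(S)  # eq. (5)
F_rgb = torch.randn(1, L)             # score from the Resnet-Attention branch
F_fused = F_rgb + F_gcn               # eq. (6): element-wise fusion of the two streams
```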
still further, the method comprises the steps of:
5) Network parameter training
During network parameter training, the training configurations adopted for the skeleton attention (Resnet-Attention) network and for the shift graph convolution network are different and are described in two parts:
training uses a graphics card: GTX 1080Ti
For the training parameters of the skeleton attention (Resnet-Attention) network, refer to Table 1:
TABLE 1
For the training parameters of the shift graph convolution network, refer to Table 2:
TABLE 2
Because the number of people differs greatly across pictures, different training parameters are set for different numbers of people. When the number of interaction targets in a picture is N = {x | 2 ≤ x ≤ 5}, the above parameters are used for training. When the number of interaction targets in a picture is N = {x | 5 < x ≤ 15}, the larger number of people makes the number of interaction groups grow sharply, and the proportion of interaction groups without an action category is high; a maximum distance Max_dis is therefore set during training, the distance dis between the center points of two target boxes in the picture is computed, and when dis ≥ Max_dis the interaction group is judged to have no action-category interaction. Because the category proportions within the interaction groups are unbalanced, unbalanced class weights are added to the loss functions of the two networks, where 'OT' denotes the no-interaction category;
When the number of people N in a picture is in {2, 3, 4, 5}, the combination formula M = N(N − 1)/2 gives the number of interaction groups M ∈ {1, 2, …, 10}. The network input consists of the local images of the N persons and the M two-person region images of the picture; however, because the number of persons contained in different pictures is not fixed, the amounts of input data differ. A padding operation is therefore adopted so that every frame fed into the network model carries the same amount of data: the upper limit on the number of people is set to Max_num = 5, with the corresponding number of two-person groups Max_interaction_num = 10, and padding is applied when there are fewer people: the action categories of the padded data are all −1, and the human bounding boxes are all [0,0];
When the number of people N in a picture is in {5, 6, …, 15}, the combination formula gives the number of interaction groups M ∈ {10, 11, …, 105}; the upper limit on the number of people is set to Max_num = 15, with the corresponding number of two-person groups Max_interaction_num = 105. The true interaction groups then account for less than 5% of the total number of groups, so the proportion of true interaction groups is extremely unbalanced; on the other hand, the differences in the number of people would cause excessive padding: for example, for a scene containing 5 people, 95 groups would have to be padded. To solve these two problems, a maximum-distance elimination method is adopted: a distance threshold Max_dis is set, the Euclidean distance dis between the center points of two human bounding boxes in the image is computed, and when dis ≥ Max_dis the pair is judged to have no interaction behavior. The maximum-distance elimination reduces the number of two-person groups fed into the network model to within 36 groups; in addition, unbalanced class weights are added to the loss function used for training each model.
Here, OT represents any human behavior other than the normal interaction behaviors.
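The pair generation, padding and maximum-distance elimination described above can be sketched as follows; the upper limits, the padding label −1 and the dummy box follow the text, while the helper names and the dictionary used for pair labels are illustrative assumptions.

```python
from itertools import combinations
import math

def pair_groups(boxes, labels, max_people, max_dis=None, pad_label=-1, pad_box=(0, 0, 0, 0)):
    """Enumerate the C(N,2) person pairs, drop far-apart pairs, and pad to a fixed group count.
    `boxes` are [x1, y1, x2, y2]; `labels` maps a pair index (i, j) to its interaction label."""
    max_pairs = math.comb(max_people, 2)           # 10 groups for 5 people, 105 for 15
    pairs = []
    for i, j in combinations(range(len(boxes)), 2):
        if max_dis is not None:
            ci = ((boxes[i][0] + boxes[i][2]) / 2, (boxes[i][1] + boxes[i][3]) / 2)
            cj = ((boxes[j][0] + boxes[j][2]) / 2, (boxes[j][1] + boxes[j][3]) / 2)
            if math.dist(ci, cj) >= max_dis:       # maximum-distance elimination: no interaction
                continue
        pairs.append((boxes[i], boxes[j], labels.get((i, j), pad_label)))
    while len(pairs) < max_pairs:                  # padding so every frame has the same data amount
        pairs.append((pad_box, pad_box, pad_label))
    return pairs[:max_pairs]

# Example: three people, one annotated interacting pair, distant pairs eliminated.
boxes = [(10, 10, 60, 120), (70, 15, 120, 130), (400, 20, 450, 140)]
groups = pair_groups(boxes, labels={(0, 1): 3}, max_people=5, max_dis=300)
```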
The embodiments described in this specification merely illustrate ways in which the inventive concept may be implemented. The scope of the present invention should not be construed as limited to the specific forms set forth in the embodiments; it also covers equivalent technical means that a person skilled in the art can conceive based on the inventive concept.

Claims (1)

1. A multi-person concurrent interaction behavior understanding method based on a single frame image is characterized by comprising the following steps:
1) Inputting a picture, and combining skeleton estimation and a multi-target tracking algorithm to obtain human skeleton data and regions of interest;
2) Generating a skeleton component confidence map and a component affinity field using the human skeleton data, constructing an attention map;
3) Defining a Resnet-Attention network based on human skeleton Attention;
4) Defining a dual-stream network for understanding multi-person interaction behaviors;
starting from multi-modal information and the attention mechanism, a dual-stream network model is proposed: the first stream is a Resnet-Attention network based on human skeleton attention and extracts enhanced RGB features; the second stream works on skeleton data and uses a shift graph convolution network, currently among the best-performing methods for behavior recognition, to extract accurate skeleton features;
in the step 1), the region of interest refers to a human body bounding box; accurate computation of the region of interest is the basis for extracting interaction behavior features, and it is computed by combining a skeleton estimation algorithm with a multi-target tracking algorithm: AlphaPose extracts human skeletons from the original image and outputs the corresponding human body bounding boxes, called skeleton human boxes; meanwhile, FairMOT tracks the human bodies in the video, yielding a human body bounding box for each person in a given frame, called a tracking human box; the skeleton human box fits the actual human body closely, whereas the tracking human box often leaves the limbs outside the bounding box; on the other hand, human skeleton estimation may fail in complex scenes with severe occlusion or unusual human poses, while tracking human boxes are less frequently missing;
in the step 2), the obtained human skeletons and human boxes need to be matched against the annotation data to obtain ordered human skeleton data and regions of interest, the ordered data comprising: human skeleton, skeleton human box, tracking human box, region of interest, interaction group index, interaction group action label, and single-person action label, computed by the following steps:
1.1) Extract human skeletons with the AlphaPose algorithm and output the skeleton human boxes;
1.2) Extract tracking human boxes with the FairMOT algorithm;
1.3) Compute the true action label and interaction group index of each bounding box from the skeleton human boxes and tracking human boxes obtained in 1.1) and 1.2) and the annotation data; the annotation data comprise the human boxes, the interaction group data and the action labels of the interaction targets, and are matched to the tracking boxes as follows: for any tracking box B, find the annotated bounding box Bmax with the largest intersection-over-union (IoU) with B; if Bmax exists and the corresponding IoU is larger than 0.5, Bmax is considered matched to B, and the action label and interaction group index of Bmax are assigned to the tracking box B;
1.4) Fuse the skeleton box and the tracking box to obtain a fused box; the fusion rules are as follows:
1.4.1) When both the skeleton box and the tracking box exist: compute the IoU ρ of the skeleton box and the tracking box; when ρ is larger than 0.3, take the smaller of the two bounding boxes as the fused box; otherwise, take the skeleton box as the fused box;
1.4.2) When the skeleton box exists and the tracking box is missing: take the skeleton box as the fused box;
1.4.3) When the skeleton box is missing and the tracking box exists: if ρ is larger than 0.3, take the tracking box as the fused box; otherwise, there is no fused box;
1.4.4) When both the skeleton box and the tracking box are missing: there is no fused box;
subsequent model training and testing use the fused boxes;
in the step 2), in order to enhance the RGB features, the data form of the skeleton feature maps in the OpenPose algorithm is adopted, and the attention map is split into a component confidence map and a component affinity field;
component confidence map C: for the skeleton sequence V = {v_i | i = 1, …, K} estimated by the AlphaPose algorithm, a confidence map is computed for each joint point, where v_i = (x_i, y_i) denotes one of the key skeleton points of the human skeleton; a confidence map is computed for each joint coordinate with a Gaussian blur, where σ is the Gaussian blur threshold; the confidence at position (x, y) of the feature map for the k-th skeleton point is:

C_k(x, y) = exp(−((x − x_k)² + (y − y_k)²) / σ²)    (1)
component affinity field P: limb orientation is represented with component affinity fields; key joint points are connected according to the natural connection pattern of the human body; concretely, an affinity field, called the component affinity field, is computed for the component between every two connected joint points; let the coordinates of the starting joint point s of a component be (x_s, y_s) and the coordinates of the ending joint point e be (x_e, y_e); at every pixel of this component region, a 2D vector pointing from the starting joint point to the ending joint point is encoded, and an orthogonal distance threshold τ is set for the connection between the two joint points; then, for the channel where s is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

P_s(x, y) = ((x_e − x_s), (y_e − y_s)) / L_{e,s}    (2)
above, L_{e,s} is the Euclidean distance between joint point e and joint point s; for the channel where e is located, the value of every pixel within the orthogonal threshold range of the connection between the two joint points is:

P_e(x, y) = ((x_s − x_e), (y_s − y_e)) / L_{e,s}    (3)
the two-person skeleton considered here has K joint points and M components in total, yielding K channels of joint confidence maps {C_i | i = 1, …, K} and M channels of component affinity fields {P_j | j = 1, …, M}; the K channels of joint confidence maps are superimposed to obtain the component confidence map C; the M channels of component affinity fields are superimposed to obtain the component affinity field P; the component confidence map and the component affinity field are added to obtain the component features F_c; finally, a 1 × 1 convolution is applied to F_c to output a skeleton attention map F_a of the same size;
in the step 3), in order to extract image features, ResNet50 with its final fully-connected layer removed is taken as the backbone network; since the feature maps output by the last two residual modules of ResNet50 have 2048 and 1024 channels respectively and easily cause overfitting on a small-scale interaction behavior dataset, 1 × 1 convolutions are used to reduce the numbers of output feature channels to 1024 and 512 respectively; in order to obtain a multi-scale behavior representation, the outputs of these two convolution stages of the backbone are used jointly to obtain the spatial pyramid features F_b;
the Resnet-Attention network uses the skeleton attention map F_a to enhance the image features F_b; F_a has size 240 × 320, and the two levels of the spatial pyramid features F_b have sizes 512 × 15 × 20 and 1024 × 8 × 10 respectively; in order to fuse the image features with the skeleton attention map, the two pyramid levels are expanded to the same scale as the skeleton features by bilinear interpolation, the expanded feature maps are stacked along the channel dimension, and a unified feature map of dimensions 1536 × 240 × 320 is finally obtained;
to further enhance the feature map F_b, the Resnet-Attention network computes the Hadamard product of F_a and F_b to obtain the enhanced feature map F_action:
F_action = F_a ⊙ F_b    (4)
where ⊙ denotes the element-wise (Hadamard) product;
the network input is an RGB picture and the human boxes; first, the backbone network extracts the RGB image features; then the component features are computed from the human skeletons, and the skeleton attention map is obtained through a 1 × 1 convolution and a Sigmoid activation function; the RGB features and the skeleton attention map are combined by a Hadamard product to obtain the enhanced features; finally, the two-person interaction features are extracted based on the enhanced features and the target bounding boxes;
in the step 3), the two-person interaction features are extracted as follows: compute the minimum enclosing box MEB that contains the two-person bounding boxes; feed the MEB and the enhanced features into a RoIAlign module, which outputs the MEB region features F_inter of size 1536 × 5 × 5; next, flatten F_inter into a vector of length 38400 (1536 × 5 × 5); apply LayerNorm, ReLU, Dropout (0.9) and FC operations to the vector in sequence to obtain a feature vector of length 512; finally, compute the L-dimensional interaction behavior classification score F_rgb with an FC layer, where L denotes the number of interaction behavior categories;
in the step 4), the dual-stream network comprises two branches: the RGB feature extraction branch outputs the interaction behavior classification score F_rgb, and the shift graph convolution branch uses a customized Shift-GCN network to compute the score F_gcn; the original Shift-GCN requires the input to be a video-based skeleton sequence and outputs the action classification result for the whole video sequence; because Shift-GCN targets video behavior classification, which differs from the target task, it cannot directly compute the interaction behavior classification score F_gcn for an arbitrary pair of persons in a single-frame image; the shift graph convolution network is therefore modified to be compatible with single-frame data and to handle the two-person interaction behavior classification problem, the modifications including: removing the temporal shift convolution; changing the single-person skeleton sequence input into a two-person single-frame skeleton input;
given the two-person skeleton data S, where V is the number of joint points of the two-person skeleton and C = 2 is the number of channels (each coordinate component occupies one channel), the shift graph convolution network outputs the interaction behavior classification score vector F_gcn:
F_gcn = ShiftGCN(S)    (5)
finally, the dual-stream network fuses F_rgb and F_gcn:
F_fused = F_rgb + F_gcn    (6)
the method further comprises the steps of:
5) Network parameter training
during network parameter training, the training configurations adopted for the skeleton attention (Resnet-Attention) network and for the shift graph convolution network are different and are described in two parts:
training uses a graphics card: GTX 1080Ti
for the training parameters of the skeleton attention (Resnet-Attention) network, refer to Table 1:
TABLE 1
for the training parameters of the shift graph convolution network, refer to Table 2:
TABLE 2
because the number of people differs greatly across pictures, different training parameters are set for different numbers of people; when the number of interaction targets in a picture is N = {x | 2 ≤ x ≤ 5}, the above parameters are used for training; when the number of interaction targets in a picture is N = {x | 5 < x ≤ 15}, the larger number of people makes the number of interaction groups grow sharply, and the proportion of interaction groups without an action category is high; a maximum distance Max_dis is therefore set during training, the distance dis between the center points of two target boxes in the picture is computed, and when dis ≥ Max_dis the interaction group is judged to have no action-category interaction; because the category proportions within the interaction groups are unbalanced, unbalanced class weights are added to the loss functions of the two networks, where 'OT' denotes the no-interaction category;
the maximum-distance elimination method is adopted: a distance threshold Max_dis is set, the Euclidean distance dis between the center points of two human bounding boxes in the image is computed, and when dis ≥ Max_dis the pair is judged to have no interaction behavior; the maximum-distance elimination reduces the number of two-person groups fed into the network model to within 36 groups; in addition, unbalanced class weights are added to the loss function used for training each model;
here, OT represents any human behavior other than the normal interaction behaviors.
CN202110259862.6A 2021-03-10 2021-03-10 Multi-person concurrent interaction behavior understanding method based on single-frame image Active CN113158782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110259862.6A CN113158782B (en) 2021-03-10 2021-03-10 Multi-person concurrent interaction behavior understanding method based on single-frame image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110259862.6A CN113158782B (en) 2021-03-10 2021-03-10 Multi-person concurrent interaction behavior understanding method based on single-frame image

Publications (2)

Publication Number Publication Date
CN113158782A CN113158782A (en) 2021-07-23
CN113158782B true CN113158782B (en) 2024-03-26

Family

ID=76886824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110259862.6A Active CN113158782B (en) 2021-03-10 2021-03-10 Multi-person concurrent interaction behavior understanding method based on single-frame image

Country Status (1)

Country Link
CN (1) CN113158782B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797655B (en) * 2022-12-13 2023-11-07 南京恩博科技有限公司 Character interaction detection model, method, system and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160164A (en) * 2019-12-18 2020-05-15 上海交通大学 Action recognition method based on human body skeleton and image fusion
CN111582220A (en) * 2020-05-18 2020-08-25 中国科学院自动化研究所 Skeleton point behavior identification system based on shift diagram convolution neural network and identification method thereof
CN111985343A (en) * 2020-07-23 2020-11-24 深圳大学 Method for constructing behavior recognition deep network model and behavior recognition method

Also Published As

Publication number Publication date
CN113158782A (en) 2021-07-23

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant