CN113378676A - Method for detecting human-object interaction in images based on multi-feature fusion - Google Patents

Method for detecting human-object interaction in images based on multi-feature fusion

Info

Publication number
CN113378676A
CN113378676A (application CN202110608515.XA)
Authority
CN
China
Prior art keywords
interaction
character
feature
convolution
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110608515.XA
Other languages
Chinese (zh)
Inventor
马世伟
汪畅
孙金玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202110608515.XA priority Critical patent/CN113378676A/en
Publication of CN113378676A publication Critical patent/CN113378676A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses a method for detecting human-object interaction in images based on multi-feature fusion. A target detection algorithm first detects all instance information in the image, including human position information and object position and category information; this information is then fed into a trained human-object interaction recognition network to detect the interaction behaviors between human-object pairs in the image to be detected. Building on the global spatial configuration of the interaction captured by the human pose, the invention focuses on the informative region where the person and the object intersect, learns finer local features, and increases the probability of matching correct human-object interaction pairs. A short-term memory selection module effectively screens and exploits the information of the person, the object, and their background region, and the fusion of these multiple features improves the accuracy of human-object interaction detection.

Description

Method for detecting human-object interaction in images based on multi-feature fusion
Technical Field
The invention belongs to the technical field of detecting and understanding visual relationships in images with computer vision, and particularly relates to a method for detecting human-object interaction in images based on multi-feature fusion.
Background
Human-Object Interaction (HOI) detection in images aims to automatically locate the interacting people and objects in an input picture using computer vision and to identify the category of interaction between each <human, object> pair, so that a machine can automatically understand the image content. HOI detection is a core technology for automatically understanding deep visual relationships and realizing high-level artificial intelligence through computer vision, and can be widely applied in fields such as intelligent robotics, security monitoring, information retrieval, and human-computer interaction.
Most existing HOI detection methods start from the result of target detection, exhaustively pair all detected people and objects in the image, and estimate the interaction between them from the appearance features of the person and the object and the spatial features between the <human, object> pair. This inference relying only on instance-level features remains insufficient for relatively complex interaction classes, resulting in poor overall detection accuracy. First, because of the lack of detail cues, it is difficult to determine from an instance-level representation whether a person is related to a given object instance, which easily leads to erroneous associations between people and non-interacting objects. In addition, when fine-grained interaction types are distinguished only by similar instance-level features, the intrinsic relations among the features are not exploited effectively, and complex cases cannot be judged accurately.
Disclosure of Invention
In order to solve the problems in the prior art, the invention aims to overcome the above defects and to provide a method for detecting human-object interaction in images based on multi-feature fusion. On the basis of capturing the global spatial configuration of the interaction with the human pose, the method focuses on the informative region where the person and the object intersect in the image, learns finer local features through a multi-branch neural network, increases the probability of correctly matching <human, object> interaction pairs, effectively screens and exploits the information of the person, the object, and their background region with a short-term memory selection module, and realizes human-object interaction detection through the fusion of these features.
To achieve this purpose, the invention adopts the following inventive concept:
First, all instance information in the picture, including human position information and object position and category information, is detected with a target detection algorithm; this information is then fed into a trained human-object interaction recognition network, and the interaction behaviors between human-object pairs in the picture to be detected are detected. The interaction recognition network adopts a multi-branch neural network structure comprising a pairwise branch, an intersection branch, and a short-term memory selection branch, and the network learns multiple features of the <human, object> instances in the picture.
According to the inventive concept, the invention adopts the following technical scheme:
a method for detecting human interaction in an image based on multi-feature fusion comprises the following operation steps:
step 1: inputting an original picture;
step 2: detecting a target;
and step 3: constructing a figure interaction identification network;
and 4, step 4: detecting the character interaction behavior of the picture to be detected;
in the step 2, after all example information including human body position information and object position and category information in the picture is detected by using a target detection algorithm, inputting a trained character interaction behavior recognition network, and detecting interaction behaviors between character pairs in the picture to be detected;
in the step 3, the character interaction identification network adopts a multi-branch neural network structure, which comprises paired branches, intersection branches and short-term memory selection branches, and the network performs learning training on the < human-object > example in the picture on various characteristics.
Preferably, in step 2, the target detection proceeds as follows:
A trained target detector performs target detection on the input picture to obtain a candidate box b_h and confidence s_h for the person and a candidate box b_o and confidence s_o for the object, where the subscript h denotes the human body and o denotes the object.
Preferably, in step 3, constructing the human-object interaction recognition network comprises the following steps:
1) extracting convolution features of the whole picture:
The original input picture is passed through the classical residual network ResNet-50 for convolutional feature extraction to obtain the global convolution feature map F of the whole picture, which, together with the human position b_h and object position b_o from the target detection result, serves as the input of the human-object interaction detection network;
2) constructing the pairwise branch:
A two-channel binary map B_{h,o} is generated from the given human and object bounding boxes and fed into a shallow convolutional neural network comprising two convolutional layers and two pooling layers; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 64 and 32 respectively, and both pooling layers are max pooling. The resulting position feature map is flattened into the position feature vector f_sp, where the subscript sp denotes the relative spatial position of the person and the object. This vector is fed into a fully connected classifier with a sigmoid activation to obtain the classification score S^a_sp of the position-feature branch for each interaction class a, where a is the corresponding interaction category in {1, ..., A} and A is the number of all interaction categories;
3) constructing the intersection branch:
First, the bounding-box coordinates b_inter of the human-object intersection are obtained from the positions of the person and the object, where the subscript inter denotes the intersection of the person and the object. ROI Pooling is used to crop the convolution features of the intersection region from the global convolution feature map F, the features are refined with a residual block Res, and the human-object intersection region feature f_inter is obtained after a global average pooling layer GAP. Meanwhile, the human key-point detection result of the picture is encoded: within the minimum enclosing rectangle of each human-object pair, the different joint points are connected with line segments of different gray values according to the skeleton model provided by the COCO dataset, so as to characterize the different parts of the body (the COCO dataset is a large public dataset produced by Microsoft and suitable for a variety of computer vision tasks). The pixel values of the remaining area in the rectangle are set to 0, and the rectangle is resized to a fixed 64 x 64 size to obtain the pose feature map. The pose feature f_pose is then extracted with two convolution-pooling layers, where the subscript pose denotes the human pose; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 32 and 16 respectively, and both pooling layers are max pooling. The intersection region feature f_inter and the pose feature f_pose are concatenated and fused through two fully connected layers to obtain f_inter-pose, which is fed into a fully connected classifier with a sigmoid function to obtain the A-dimensional classification score S^a_inter of the intersection branch;
4) constructing the short-term memory selection branch:
First, according to the human position coordinates b_h, ROI Pooling is performed on the global convolution feature map F to extract the human region features; the features are then refined with a residual block Res and passed through global average pooling GAP to obtain the pooled human feature vector f_h.
According to the object position coordinates b_o, ROI Pooling is performed on the global convolution feature map F to extract the object region features; the features are refined with a residual block Res and passed through global average pooling GAP to obtain the pooled object visual feature vector f_o^vis, where the superscript vis denotes the visual feature. Publicly available Word2vec vectors pre-trained on the Google News dataset are selected as the object semantic features, so that a 300-dimensional semantic feature vector f_o^sem can be extracted for each object class label, where the superscript sem denotes the semantic feature. The object semantic feature vector f_o^sem and the visual feature f_o^vis are concatenated and passed through a fully connected layer, finally yielding the 1024-dimensional object feature vector f_o. The publicly available Word2vec vectors are preferably trained on a corpus of roughly 100 billion words; the Google News dataset is a dataset produced by Google.
For the visual feature of the common region, the minimum enclosing rectangle, i.e. the union region b_union of the two bounding boxes, is first computed from the bounding boxes of the person and the object, where the subscript union denotes the union of the person and the object. Using the common-region bounding-box coordinates, ROI Pooling on the convolution feature map normalizes the region to a fixed 7 x 7 size, and a 2048-dimensional visual feature vector f_union^vis is then extracted through a residual block and global average pooling. This vector is concatenated with the position feature vector f_sp output by the pairwise branch and fed into a fully connected layer to obtain the 1024-dimensional fused common-region feature f_union.
Finally, the human feature f_h, the object feature f_o, and the human-object common-region feature f_union are fed into the short-term memory selection module, which consists of two Gated Recurrent Unit (GRU) cells. The common-region feature f_union serves as the initial state of the short-term memory module, the input of the first GRU cell is the human representation f_h, and the input of the second cell is the object representation f_o. The output state of the short-term memory selection module gives the representation f_hoi, which is passed through a fully connected classifier with a sigmoid function to obtain the classification score S^a_STMS of the short-term memory selection branch;
5) training the human-object interaction recognition network:
The three branches together form the complete human-object interaction recognition network. Samples in the training set are used as input to the network, the sum of the cross-entropy loss functions of the three branches is computed, and the network parameters are updated by gradient descent until the maximum number of optimization iterations is reached, at which point training stops and the trained human-object interaction recognition network is obtained.
Preferably, in step 4, the detection of human-object interaction behavior in the picture to be detected proceeds as follows:
First, target detection is applied to the picture to be detected to obtain the position and category information of the people and objects, and all of this information is fed into the trained human-object interaction recognition network for judgment. A classify-then-fuse feature fusion scheme is adopted: each branch extracts its own features and performs detection and classification, and the classification scores of all branches are then fused to obtain the final human-object interaction detection result. For each human-object pair (b_h, b_o), the final human-object interaction detection score S^a_{h,o} is obtained by combining s_h and s_o, the confidences of the human and the object in the target detection result, with the branch scores s^a_k, where s^a_k is the probability score vector of belonging to each category in the A-class interaction classification task and k indexes the different branches.
Compared with the prior art, the invention has the following obvious and prominent substantive features and remarkable advantages:
1. The invention improves the accuracy of human-object interaction detection by fully fusing multiple features. It attends to the human pose information, fuses it effectively with the intersection-region features, and adds detail information on top of the captured global spatial configuration, so that the network learns more local detail, establishes the association between the person and the object, makes the features more discriminative, increases the probability of matching correct human-object interaction pairs, and improves the overall classification accuracy;
2. The invention fully screens the features of the person, the object, and their common region through the short-term memory selection module, effectively exploits background information, and further improves the average precision of human-object interaction detection.
Drawings
FIG. 1 is a flowchart of the method for detecting human-object interaction in an image based on multi-feature fusion according to the present invention.
FIG. 2 is a schematic structural diagram of the human-object interaction recognition network according to the present invention.
FIG. 3 is a schematic diagram of the 17-joint human skeleton model obtained by human key-point detection.
Detailed Description
The above-described scheme is further illustrated below with reference to specific embodiments, which are detailed as follows:
Embodiment 1:
in this embodiment, referring to fig. 1, a method for detecting human interaction in an image based on multi-feature fusion includes the following steps:
step 1: inputting an original picture;
step 2: detecting a target;
and step 3: constructing a figure interaction identification network;
and 4, step 4: detecting the character interaction behavior of the picture to be detected;
in the step 2, after all example information including human body position information and object position and category information in the picture is detected by using a target detection algorithm, inputting a trained character interaction behavior recognition network, and detecting interaction behaviors between character pairs in the picture to be detected;
in the step 3, the character interaction identification network adopts a multi-branch neural network structure, which comprises paired branches, intersection branches and short-term memory selection branches, and the network performs learning training on the < human-object > example in the picture on various characteristics.
According to the method and the device, on the basis of capturing the global spatial configuration of the interaction relation by using the pose, effective information provided by the intersection region of the people and the object is focused, more precise local features are learned, the probability of correct people interaction on matching is increased, the people, the object and background region information are effectively screened and utilized by means of the short-term memory selection module, and the precision of people interaction detection is improved by means of fusion of various features.
Embodiment 2:
This embodiment is substantially the same as the first embodiment, with the following characteristics:
In this embodiment, in step 2, the target detection proceeds as follows:
A trained target detector performs target detection on the input picture to obtain a candidate box b_h and confidence s_h for the person and a candidate box b_o and confidence s_o for the object, where the subscript h denotes the human body and o denotes the object.
In this embodiment, in step 3, constructing the human-object interaction recognition network includes the following steps:
1) extracting convolution features of the whole picture:
The original input picture is passed through the classical residual network ResNet-50 for convolutional feature extraction to obtain the global convolution feature map F of the whole picture, which, together with the human position b_h and object position b_o from the target detection result, serves as the input of the human-object interaction detection network;
2) constructing the pairwise branch:
A two-channel binary map B_{h,o} is generated from the given human and object bounding boxes and fed into a shallow convolutional neural network comprising two convolutional layers and two pooling layers; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 64 and 32 respectively, and both pooling layers are max pooling. The resulting position feature map is flattened into the position feature vector f_sp, where the subscript sp denotes the relative spatial position of the person and the object. This vector is fed into a fully connected classifier with a sigmoid activation to obtain the classification score S^a_sp of the position-feature branch for each interaction class a, where a is the corresponding interaction category in {1, ..., A} and A is the number of all interaction categories;
3) constructing the intersection branch:
First, the bounding-box coordinates b_inter of the human-object intersection are obtained from the positions of the person and the object, where the subscript inter denotes the intersection of the person and the object. ROI Pooling on the region of interest is used to crop the convolution features of the intersection region from the global convolution feature map F, the features are refined with a residual block Res, and the human-object intersection region feature f_inter is obtained through a global average pooling layer GAP. Meanwhile, the human key-point detection result of the picture is encoded: within the minimum enclosing rectangle of each human-object pair, the different joint points are connected with line segments of different gray values according to the skeleton model provided by the COCO dataset, so as to characterize the different parts of the body; the COCO dataset is a large public dataset produced by Microsoft and suitable for a variety of computer vision tasks. The pixel values of the remaining area in the rectangle are set to 0, and the rectangle is resized to a fixed 64 x 64 size to obtain the pose feature map. The pose feature f_pose is then extracted with two convolution-pooling layers, where the subscript pose denotes the human pose; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 32 and 16 respectively, and both pooling layers are max pooling. The intersection region feature f_inter and the pose feature f_pose are concatenated and fused through two fully connected layers to obtain f_inter-pose, which is fed into a fully connected classifier with a sigmoid function to obtain the A-dimensional classification score S^a_inter of the intersection branch;
4) constructing the short-term memory selection branch:
First, according to the human position coordinates b_h, ROI Pooling is performed on the global convolution feature map F to extract the human region features; the features are then refined with a residual block Res and passed through global average pooling GAP to obtain the pooled human feature vector f_h.
According to the object position coordinates b_o, ROI Pooling is performed on the global convolution feature map F to extract the object region features; the features are refined with a residual block Res and passed through global average pooling GAP to obtain the pooled object visual feature vector f_o^vis, where the superscript vis denotes the visual feature. Publicly available Word2vec vectors pre-trained on the Google News dataset (a corpus of roughly 100 billion words) are selected as the object semantic features, so that a 300-dimensional semantic feature vector f_o^sem can be extracted for each object class label, where the superscript sem denotes the semantic feature. The object semantic feature vector f_o^sem and the visual feature f_o^vis are concatenated and passed through a fully connected layer, finally yielding the 1024-dimensional object feature vector f_o.
For the visual feature of the common region, the minimum enclosing rectangle, i.e. the union region b_union of the two bounding boxes, is first computed from the bounding boxes of the person and the object, where the subscript union denotes the union of the person and the object. Using the common-region bounding-box coordinates, ROI Pooling on the convolution feature map normalizes the region to a fixed 7 x 7 size, and a 2048-dimensional visual feature vector f_union^vis is then extracted through a residual block and global average pooling. This vector is concatenated with the position feature vector f_sp output by the pairwise branch and fed into a fully connected layer to obtain the 1024-dimensional fused common-region feature f_union.
Finally, the human feature f_h, the object feature f_o, and the human-object common-region feature f_union are fed into the short-term memory selection module, which consists of two Gated Recurrent Unit (GRU) cells. The common-region feature f_union serves as the initial state of the short-term memory module, the input of the first GRU cell is the human representation f_h, and the input of the second cell is the object representation f_o. The output state of the short-term memory selection module gives the representation f_hoi, which is passed through a fully connected classifier with a sigmoid function to obtain the classification score S^a_STMS of the short-term memory selection branch;
5) training the human-object interaction recognition network:
The three branches together form the complete human-object interaction recognition network. Samples in the training set are used as input to the network, the sum of the cross-entropy loss functions of the three branches is computed, and the network parameters are updated by gradient descent until the maximum number of optimization iterations is reached, at which point training stops and the trained human-object interaction recognition network is obtained.
In this embodiment, in step 4, the detection of human-object interaction behavior in the picture to be detected proceeds as follows:
First, target detection is applied to the picture to be detected to obtain the position and category information of the people and objects, and all of this information is fed into the trained human-object interaction recognition network for judgment. A classify-then-fuse feature fusion scheme is adopted: each branch extracts its own features and performs detection and classification, and the classification scores of all branches are then fused to obtain the final human-object interaction detection result. For each human-object pair (b_h, b_o), the final human-object interaction detection score S^a_{h,o} is obtained by combining s_h and s_o, the confidences of the human and the object in the target detection result, with the branch scores s^a_k, where s^a_k is the probability score vector of belonging to each category in the A-class interaction classification task and k indexes the different branches.
In this method for detecting human-object interaction in images based on multi-feature fusion, after all instance information in the picture, including human position information and object position and category information, is detected with a target detection algorithm, it is fed into a trained human-object interaction recognition network, and the interaction behaviors between human-object pairs in the picture to be detected are detected. On the basis of capturing the global spatial configuration of the interaction with the pose, the method focuses on the informative region where the person and the object intersect, learns finer local features, increases the probability of matching correct human-object interaction pairs, effectively screens and exploits the information of the person, the object, and their background region with the short-term memory selection module, and improves the accuracy of human-object interaction detection by fusing multiple features.
Embodiment 3:
This embodiment is substantially the same as the above embodiments, with the following characteristics:
In this embodiment, as shown in FIG. 1, the method for detecting human-object interaction in an image based on multi-feature fusion specifically includes the following steps:
step 1: and performing target detection on the picture to acquire all instance information including human body position information and object position and category information, and forming < human-object > instances by people and objects to perform human interaction detection on an input human interaction identification network.
Step 2: and constructing a character interactive recognition network, and learning various characteristics of < human-object > examples in the picture by adopting a multi-branch neural network structure, wherein the characteristics comprise paired branches, intersection branches and short-term memory selection branches, and each branch is used for extracting different characteristic information to detect the interactive behavior between character pairs. The steps of implementing the human interactive identification network according to the present invention are further described with reference to fig. 2.
1) Performing convolution feature extraction on the original input picture by using ResNet-50 to obtain a global convolution feature map F of the whole picture and a human body position b of a target detection resulthObject position boThe characters are used as the input of a character interaction identification network together;
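A minimal sketch of this feature extraction step follows. Truncating ResNet-50 after its third residual stage, so that a conv5-style residual block applied later to the pooled regions yields the 2048-dimensional vectors mentioned below, is an assumption rather than a detail fixed by the text.

```python
# Sketch of the global convolution feature map F (assumption: ResNet-50 truncated after
# its third residual stage; torchvision >= 0.13 assumed for the weights argument).
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet50(weights="DEFAULT")
backbone = nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,   # output: (N, 1024, H/16, W/16)
)

def global_feature_map(images):
    """images: (N, 3, H, W) normalized batch -> global convolution feature map F."""
    return backbone(images)
```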
2) Constructing the pairwise branch, specifically:
2a) First, a two-channel binary map is generated from the given human and object bounding boxes: the first channel has value 1 at the pixels enclosed by the human bounding box and 0 elsewhere; the second channel has value 1 at the pixels enclosed by the object bounding box and 0 elsewhere. A square concentric with the rectangle enclosing the person and object instances is then used to crop out the position information of the person and the object, discarding the remaining invalid information; the side length of the cropping square equals the longer side of the enclosing rectangle. Finally, the square two-channel binary map is resized to a fixed 64 x 64 size, yielding the binary map B_{h,o}.
2b) The binary map B_{h,o} is fed into a shallow convolutional neural network comprising two convolutional layers and two pooling layers; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 64 and 32 respectively, and both pooling layers are max pooling. The resulting position feature map is flattened into the position feature vector f_sp, which is then fed into a fully connected classifier with a sigmoid function to obtain the classification score S^a_sp of the position-feature branch for each interaction class a, where a is the corresponding interaction category and A is the number of all interaction categories.
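The pairwise branch of 2a) and 2b) can be sketched as follows; the padding choice and the resulting flattened feature dimension are assumptions.

```python
# Sketch of the pairwise branch: two-channel binary map B_{h,o} -> two 5x5 conv layers
# (64 and 32 kernels) with max pooling -> flatten to f_sp -> sigmoid classifier over A classes.
import torch
import torch.nn as nn

class PairwiseBranch(nn.Module):
    def __init__(self, num_classes, map_size=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * (map_size // 4) ** 2, num_classes)

    def forward(self, binary_map):                     # binary_map: (N, 2, 64, 64)
        f_sp = self.features(binary_map).flatten(1)    # position feature vector f_sp
        s_sp = torch.sigmoid(self.classifier(f_sp))    # per-class scores S^a_sp
        return f_sp, s_sp
```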
3) Constructing the intersection branch, specifically:
3a) First, from the human bounding box b_h = (x_min^h, y_min^h, x_max^h, y_max^h) and the object bounding box b_o = (x_min^o, y_min^o, x_max^o, y_max^o), the coordinates b_inter = (x_min^inter, y_min^inter, x_max^inter, y_max^inter) of the human-object intersection box are obtained, where area(·) denotes the area of a region and the intersection box is taken when area(b_h ∩ b_o) > 0, with:
x_min^inter = max(x_min^h, x_min^o)
y_min^inter = max(y_min^h, y_min^o)
x_max^inter = min(x_max^h, x_max^o)
y_max^inter = min(y_max^h, y_max^o)
The convolution features of the intersection region are cropped from the global convolution feature map F with ROI Pooling, refined with a residual block Res, and passed through global average pooling GAP to obtain the human-object intersection region feature f_inter, expressed as:
f_inter = GAP(Res_inter(ROI(F, b_inter)))
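A sketch of 3a) is given below. torchvision's roi_pool stands in for the ROI Pooling operation; the 7 x 7 output size, the feature stride of 16, and the reuse of a ResNet conv5-style block as Res_inter are assumptions.

```python
# Sketch of 3a): intersection box b_inter and f_inter = GAP(Res_inter(ROI(F, b_inter))).
import torch
import torch.nn as nn
import torchvision

def intersection_box(b_h, b_o):
    """b_h, b_o: (x_min, y_min, x_max, y_max) tensors; returns b_inter, or None if disjoint."""
    x1, y1 = torch.max(b_h[0], b_o[0]), torch.max(b_h[1], b_o[1])
    x2, y2 = torch.min(b_h[2], b_o[2]), torch.min(b_h[3], b_o[3])
    if x2 <= x1 or y2 <= y1:                          # area(b_h ∩ b_o) = 0
        return None
    return torch.stack([x1, y1, x2, y2])

class RoiRegionFeature(nn.Module):
    """f = GAP(Res(ROI(F, b))): ROI Pooling, residual block, global average pooling."""
    def __init__(self, res_block, spatial_scale=1.0 / 16):   # stride 16 is an assumption
        super().__init__()
        self.res = res_block                                  # e.g. a copy of ResNet-50 layer4
        self.scale = spatial_scale

    def forward(self, feat_map, boxes):                       # boxes: (K, 5) = (batch_idx, x1, y1, x2, y2)
        roi = torchvision.ops.roi_pool(feat_map, boxes, output_size=7, spatial_scale=self.scale)
        return self.res(roi).mean(dim=(2, 3))                 # pooled region feature, e.g. (K, 2048)
```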
3b) The human key-point detection result is encoded. A human key-point detection network performs pose estimation on the picture to obtain the human key-point vector P = {p_1, p_2, ..., p_K}, where p_k = (x_k, y_k) is the coordinate of the k-th human key point and K = 17 is the total number of extracted key points. The seventeen joint points of the pose estimation result are encoded: within the minimum enclosing rectangle of each human-object pair, the different joint points are connected, following the skeleton model provided by the COCO dataset as shown in FIG. 3, with line segments whose gray values range from 0.15 to 0.95 and represent different parts of the body; for example, a line with gray value 0.5 connects the left elbow and the left wrist and represents the left forearm. The pixel values of the remaining area in the rectangle are set to 0, and the rectangle is resized to a fixed 64 x 64 scale to obtain the pose feature map. Finally, the pose feature f_pose is extracted with two convolution-pooling layers; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 32 and 16 respectively, and both pooling layers are max pooling.
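The pose map of 3b) can be rasterized as in the following sketch; the COCO skeleton edge list and the particular gray value assigned to each limb are assumptions within the stated 0.15 to 0.95 range.

```python
# Sketch of 3b): rasterizing the 17 COCO key points of one person into a 64 x 64 pose map.
# The skeleton edge list and the gray value per limb are assumptions within 0.15-0.95.
import numpy as np
import cv2

SKELETON = [(5, 7), (7, 9), (6, 8), (8, 10), (11, 13), (13, 15), (12, 14), (14, 16),
            (5, 6), (11, 12), (5, 11), (6, 12), (0, 5), (0, 6)]   # assumed COCO joint pairs

def pose_map(keypoints, pair_box, size=64):
    """keypoints: (17, 2) array of (x, y); pair_box: minimum enclosing rectangle (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = pair_box
    w, h = max(x2 - x1, 1), max(y2 - y1, 1)
    canvas = np.zeros((size, size), dtype=np.float32)             # background pixels stay 0
    grays = np.linspace(0.15, 0.95, num=len(SKELETON))            # one gray value per limb
    for (a, b), g in zip(SKELETON, grays):
        pa = (int((keypoints[a, 0] - x1) / w * size), int((keypoints[a, 1] - y1) / h * size))
        pb = (int((keypoints[b, 0] - x1) / w * size), int((keypoints[b, 1] - y1) / h * size))
        cv2.line(canvas, pa, pb, color=float(g), thickness=1)
    return canvas
```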
3c) The intersection region feature f_inter and the pose feature f_pose are concatenated and fused through two fully connected layers to obtain f_inter-pose, expressed as:
f_inter-pose = W_inter(f_inter ⊕ f_pose)
where ⊕ denotes the hard connection (concatenation) of feature vectors and W_inter is a projection matrix. The fused feature f_inter-pose is passed through a fully connected classifier with a sigmoid function to obtain the A-dimensional classification score S^a_inter of the intersection branch.
4) Constructing the short-term memory selection branch, specifically:
4a) According to the human position coordinates b_h, ROI Pooling is performed on the global convolution feature map F to extract the human region features, which are refined with a residual block Res and passed through global average pooling GAP to obtain the pooled human feature vector f_h:
f_h = GAP(Res_h(ROI(F, b_h)))
4b) According to the object position coordinates b_o, ROI Pooling is performed on the global convolution feature map F to extract the object region features, which are refined with a residual block Res and passed through global average pooling GAP to obtain the pooled object visual feature vector f_o^vis:
f_o^vis = GAP(Res_o(ROI(F, b_o)))
Object semantic features are extracted with publicly available Word2vec vectors pre-trained on the Google News dataset, yielding a 300-dimensional semantic feature vector f_o^sem for each object class label. The object semantic feature vector f_o^sem and the visual feature f_o^vis are concatenated and passed through a fully connected layer, finally yielding the 1024-dimensional object feature vector f_o.
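A sketch of the object feature fusion in 4b) follows; loading the GoogleNews Word2vec vectors through gensim and the ReLU after the fully connected layer are assumptions.

```python
# Sketch of 4b): fusing the 2048-d pooled visual feature with a 300-d Word2vec class
# embedding into the 1024-d object feature f_o.
import torch
import torch.nn as nn
# from gensim.models import KeyedVectors
# w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
# f_o_sem = torch.tensor(w2v["bicycle"])          # 300-d semantic vector for the class label

class ObjectFeature(nn.Module):
    def __init__(self, vis_dim=2048, sem_dim=300, out_dim=1024):
        super().__init__()
        self.fc = nn.Linear(vis_dim + sem_dim, out_dim)

    def forward(self, f_o_vis, f_o_sem):
        # f_o_vis: (N, 2048) from ROI Pooling + Res + GAP; f_o_sem: (N, 300) Word2vec label vector
        return torch.relu(self.fc(torch.cat([f_o_vis, f_o_sem], dim=1)))   # f_o: (N, 1024)
```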
4c) For the visual feature of the common region, the minimum enclosing rectangle, i.e. the union region b_union of the two bounding boxes, is first computed from the bounding boxes of the person and the object. ROI Pooling with the common-region bounding-box coordinates b_union normalizes the region on the convolution feature map to a fixed 7 x 7 size, and a 2048-dimensional visual feature vector f_union^vis is then extracted through a residual block and global average pooling:
f_union^vis = GAP(Res_union(ROI(F, b_union)))
This vector is concatenated with the position feature vector f_sp output by the pairwise branch and fed into a fully connected layer to obtain the 1024-dimensional fused common-region feature f_union:
f_union = W_union(f_union^vis ⊕ f_sp)
where ⊕ denotes the hard connection (concatenation) of feature vectors and W_union is a projection matrix.
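The common-region feature of 4c) can be sketched as follows; the dimension assumed for f_sp is an illustrative choice, and f_union_vis is produced by the same ROI Pooling plus residual block plus GAP pipeline sketched for 3a).

```python
# Sketch of 4c): union box b_union and f_union = W_union(f_union_vis ⊕ f_sp).
import torch
import torch.nn as nn

def union_box(b_h, b_o):
    """Minimum enclosing rectangle (union region) of the person and object boxes."""
    return torch.stack([torch.min(b_h[0], b_o[0]), torch.min(b_h[1], b_o[1]),
                        torch.max(b_h[2], b_o[2]), torch.max(b_h[3], b_o[3])])

class UnionFusion(nn.Module):
    def __init__(self, vis_dim=2048, sp_dim=8192, out_dim=1024):   # sp_dim is an assumption
        super().__init__()
        self.fc = nn.Linear(vis_dim + sp_dim, out_dim)             # projection W_union

    def forward(self, f_union_vis, f_sp):
        return self.fc(torch.cat([f_union_vis, f_sp], dim=1))      # f_union: (N, 1024)
```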
4d) The extracted human feature f_h, object feature f_o, and human-object common-region feature f_union are fed into the short-term memory selection module for feature fusion. The short-term memory selection module consists of two GRU cells, whose parameters are updated as follows:
z_t = σ(W_z x_t + U_z h_{t-1} + b_z)
r_t = σ(W_r x_t + U_r h_{t-1} + b_r)
h̃_t = tanh(W_h x_t + U_h(r_t ⊙ h_{t-1}) + b_h)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
where x_t is the input of the t-th GRU cell, t = 1, 2, and the GRU network accumulates and updates information in the hidden state h_t. z_t is the update-gate signal, with weight matrices W_z, U_z and bias vector b_z; σ(·) is the sigmoid activation function. h̃_t denotes the candidate state, with weight matrices W_h, U_h and bias vector b_h; ⊙ denotes the element-wise product. r_t is the reset-gate signal, with weight matrices W_r, U_r and bias vector b_r. The initial state of the short-term memory module is the human-object common-region feature, h_0 = f_union; the input of the first GRU cell is the human representation, x_1 = f_h, and the input of the second cell is the object representation, x_2 = f_o. The final output state h_2 serves as the HOI representation f_hoi, which is passed through a fully connected layer whose output dimension equals the number of HOI categories A and a sigmoid function to obtain the classification score S^a_STMS.
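The short-term memory selection module of 4d) can be sketched with two GRU cells as below; the feature dimensions used (2048 for f_h, 1024 for f_o and f_union) are assumptions consistent with the text.

```python
# Sketch of 4d): two GRU cells with h_0 = f_union, x_1 = f_h, x_2 = f_o and f_hoi = h_2.
import torch
import torch.nn as nn

class ShortTermMemorySelection(nn.Module):
    def __init__(self, num_classes, human_dim=2048, object_dim=1024, hidden_dim=1024):
        super().__init__()
        self.gru1 = nn.GRUCell(human_dim, hidden_dim)     # first cell: human representation
        self.gru2 = nn.GRUCell(object_dim, hidden_dim)    # second cell: object representation
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, f_h, f_o, f_union):
        h1 = self.gru1(f_h, f_union)                      # h_0 = f_union, x_1 = f_h
        f_hoi = self.gru2(f_o, h1)                        # x_2 = f_o, output state h_2
        s_stms = torch.sigmoid(self.classifier(f_hoi))    # per-class scores S^a_STMS
        return f_hoi, s_stms
```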
5) Training the human-object interaction recognition network. The goal of neural network training is to minimize the loss function between the true labels and the predicted labels. For all samples in the training set, the cross-entropy loss between the prediction y_i of each branch and the true label a_i is computed as:
loss = -Σ_i [a_i · log(y_i) + (1 - a_i) · log(1 - y_i)]
The loss function of the whole interaction recognition network is the sum of the branch losses, comprising the pairwise-branch classification loss loss_sp, the intersection-branch classification loss loss_inter, and the short-term-memory-selection-branch classification loss loss_STMS:
Loss_hoi = λ1 · loss_STMS + λ2 · loss_sp + loss_inter
To balance the differences in the contribution of the different branches, the branch losses are weighted with different factors, λ1 = 2 and λ2 = 0.5. During training, the network parameters are updated with stochastic gradient descent with momentum until the maximum number of optimization iterations is reached, at which point training terminates and the finally trained human-object interaction recognition network is obtained.
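A sketch of the training objective and of an SGD-with-momentum update step for 5) follows; the learning rate and momentum values are assumptions.

```python
# Sketch of 5): weighted multi-branch binary cross-entropy loss and one SGD-with-momentum
# update step (learning rate and momentum values are assumptions).
import torch
import torch.nn as nn

bce = nn.BCELoss()   # branch outputs have already passed through a sigmoid

def hoi_loss(s_sp, s_inter, s_stms, labels, lambda1=2.0, lambda2=0.5):
    """labels: (N, A) multi-hot ground-truth interaction labels."""
    return lambda1 * bce(s_stms, labels) + lambda2 * bce(s_sp, labels) + bce(s_inter, labels)

# optimizer = torch.optim.SGD(network.parameters(), lr=1e-3, momentum=0.9)
# loss = hoi_loss(s_sp, s_inter, s_stms, labels)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```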
And step 3: and (3) detecting the character interaction behavior in the picture to be detected, firstly detecting the target of the picture to be detected in the step (1) to obtain the position category information of the character and the object, and then sending all the information into a trained character interaction identification network for judgment. The invention adopts a characteristic fusion mode of classifying first and then fusing, namely, each branch respectively extracts characteristics and detects and classifies the characteristics, and finally, the classification result scores of all the branches are fused to obtain the final human interaction behavior detection result. Finally for each character pair (b)h,bo) The final score calculation formula of the human interaction detection is as follows:
Figure BDA0003094564100000113
wherein s ish,soAs the confidence of the human body and the object in the target detection result,
Figure BDA0003094564100000114
to belong to each category of probability score vectors in the category a interactive behavior classification task,
Figure BDA0003094564100000115
denotes the different substreams.
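The exact fusion formula is given above only in symbolic form; the sketch below assumes a simple multiplicative combination of the detection confidences with the per-class branch scores, purely for illustration.

```python
# Illustrative fusion of the final score for one pair (b_h, b_o); the multiplicative
# combination below is an assumption, not the formula fixed by the invention.
import torch

def pair_score(s_h, s_o, s_sp, s_inter, s_stms):
    """s_h, s_o: detection confidences; s_sp, s_inter, s_stms: (A,) per-class branch scores."""
    return s_h * s_o * s_sp * s_inter * s_stms   # (A,) final scores S^a_{h,o}
```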
In this embodiment, the accuracy of human-object interaction detection is improved by fully fusing multiple features. The human pose information is attended to and fused effectively with the intersection-region features, and detail information is added on top of the captured global spatial configuration, so that the network learns more local detail, establishes the association between the person and the object, makes the features more discriminative, increases the probability of matching correct human-object interaction pairs, and improves the overall classification accuracy. Meanwhile, the short-term memory selection module fully screens the features of the person, the object, and their common region, effectively exploits background information, and further improves the average precision of human-object interaction detection.
The embodiments of the present invention have been described with reference to the accompanying drawings, but the invention is not limited to these embodiments, and various changes and modifications can be made according to the purpose of the invention. Any change, modification, substitution, combination, or simplification made according to the spirit and principle of the technical solution of the invention shall be an equivalent substitution, as long as it meets the purpose of the invention and does not depart from the technical principle and inventive concept of the invention, and shall fall within the protection scope of the invention.

Claims (4)

1. A method for detecting human-object interaction in an image based on multi-feature fusion, characterized by the following operation steps:
Step 1: inputting an original picture;
Step 2: target detection;
Step 3: constructing a human-object interaction recognition network;
Step 4: detecting the human-object interaction behavior in the picture to be detected;
wherein in step 2, after all instance information in the picture, including human position information and object position and category information, is detected with a target detection algorithm, it is fed into a trained human-object interaction recognition network, and the interaction behaviors between human-object pairs in the picture to be detected are detected;
and in step 3, the human-object interaction recognition network adopts a multi-branch neural network structure comprising a pairwise branch, an intersection branch, and a short-term memory selection branch, and the network is trained to learn multiple features of the <human, object> instances in the picture.
2. The human-object interaction detection method based on multi-feature fusion as claimed in claim 1, wherein in step 2 the target detection process is as follows:
a trained target detector performs target detection on the input picture to obtain a candidate box b_h and confidence s_h for the person and a candidate box b_o and confidence s_o for the object, where the subscript h denotes the human body and o denotes the object.
3. The human-object interaction detection method based on multi-feature fusion as claimed in claim 1, wherein in step 3 constructing the human-object interaction recognition network comprises the following steps:
1) extracting convolution features of the whole picture:
the original input picture is passed through the classical residual network ResNet-50 for convolutional feature extraction to obtain the global convolution feature map F of the whole picture, which, together with the human position b_h and object position b_o from the target detection result, serves as the input of the human-object interaction detection network;
2) constructing the pairwise branch:
a two-channel binary map B_{h,o} is generated from the given human and object bounding boxes and fed into a shallow convolutional neural network comprising two convolutional layers and two pooling layers; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 64 and 32 respectively, and both pooling layers are max pooling; the resulting position feature map is flattened into the position feature vector f_sp, where the subscript sp denotes the relative spatial position of the person and the object; this vector is fed into a fully connected classifier with a sigmoid activation to obtain the classification score S^a_sp of the position-feature branch for each interaction class a, where a is the corresponding interaction category in {1, ..., A} and A is the number of all interaction categories;
3) constructing the intersection branch:
first, the bounding-box coordinates b_inter of the human-object intersection are obtained from the positions of the person and the object, where the subscript inter denotes the intersection of the person and the object; ROI Pooling is used to crop the convolution features of the intersection region from the global convolution feature map F, the features are refined with a residual block Res, and the human-object intersection region feature f_inter is obtained after a global average pooling layer GAP; meanwhile, the human key-point detection result of the picture is encoded: within the minimum enclosing rectangle of each human-object pair, the different joint points are connected with line segments of different gray values according to the skeleton model provided by the COCO dataset, so as to characterize the different parts of the body, the COCO dataset being a large public dataset produced by Microsoft and suitable for a variety of computer vision tasks; the pixel values of the remaining area in the rectangle are set to 0, and the rectangle is resized to a fixed 64 x 64 size to obtain the pose feature map; the pose feature f_pose is then extracted with two convolution-pooling layers, where the subscript pose denotes the human pose; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 32 and 16 respectively, and both pooling layers are max pooling; the intersection region feature f_inter and the pose feature f_pose are concatenated and fused through two fully connected layers to obtain f_inter-pose, which is fed into a fully connected classifier with a sigmoid function to obtain the A-dimensional classification score S^a_inter of the intersection branch;
4) constructing the short-term memory selection branch:
first, according to the human position coordinates b_h, ROI Pooling is performed on the global convolution feature map F to extract the human region features, which are refined with a residual block Res and passed through global average pooling GAP to obtain the pooled human feature vector f_h;
according to the object position coordinates b_o, ROI Pooling is performed on the global convolution feature map F to extract the object region features, which are refined with a residual block Res and passed through global average pooling GAP to obtain the pooled object visual feature vector f_o^vis, where the superscript vis denotes the visual feature; publicly available Word2vec vectors pre-trained on the Google News dataset are selected as the object semantic features, so that a 300-dimensional semantic feature vector f_o^sem can be extracted for each object class label, where the superscript sem denotes the semantic feature; the object semantic feature vector f_o^sem and the visual feature f_o^vis are concatenated and passed through a fully connected layer, finally yielding the 1024-dimensional object feature vector f_o;
for the visual feature of the common region, the minimum enclosing rectangle, i.e. the union region b_union of the two bounding boxes, is first computed from the bounding boxes of the person and the object, where the subscript union denotes the union of the person and the object; ROI Pooling with the common-region bounding-box coordinates normalizes the region on the convolution feature map to a fixed 7 x 7 size, and a 2048-dimensional visual feature vector f_union^vis is then extracted through a residual block and global average pooling; this vector is concatenated with the position feature vector f_sp output by the pairwise branch and fed into a fully connected layer to obtain the 1024-dimensional fused common-region feature f_union;
finally, the human feature f_h, the object feature f_o, and the human-object common-region feature f_union are fed into the short-term memory selection module, which consists of two Gated Recurrent Unit (GRU) cells; the common-region feature f_union serves as the initial state of the short-term memory module, the input of the first GRU cell is the human representation f_h, and the input of the second cell is the object representation f_o; the output state of the short-term memory selection module gives the representation f_hoi, which is passed through a fully connected classifier with a sigmoid function to obtain the classification score S^a_STMS of the short-term memory selection branch;
5) training the human-object interaction recognition network:
the three branches together form the complete human-object interaction recognition network; samples in the training set are used as input to the network, the sum of the cross-entropy loss functions of the three branches is computed, and the network parameters are updated by gradient descent until the maximum number of optimization iterations is reached, at which point training stops and the trained human-object interaction recognition network is obtained.
4. The human-object interaction detection method based on multi-feature fusion as claimed in claim 1, wherein in step 4 the detection of human-object interaction behavior in the picture to be detected proceeds as follows:
first, target detection is applied to the picture to be detected to obtain the position and category information of the people and objects, and all of this information is fed into the trained human-object interaction recognition network for judgment; a classify-then-fuse feature fusion scheme is adopted: each branch extracts its own features and performs detection and classification, and the classification scores of all branches are then fused to obtain the final human-object interaction detection result; for each human-object pair (b_h, b_o), the final human-object interaction detection score S^a_{h,o} is obtained by combining s_h and s_o, the confidences of the human and the object in the target detection result, with the branch scores s^a_k, where s^a_k is the probability score vector of belonging to each category in the A-class interaction classification task and k indexes the different branches.
CN202110608515.XA 2021-06-01 2021-06-01 Method for detecting human-object interaction in an image based on multi-feature fusion Pending CN113378676A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110608515.XA CN113378676A (en) 2021-06-01 2021-06-01 Method for detecting figure interaction in image based on multi-feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110608515.XA CN113378676A (en) 2021-06-01 2021-06-01 Method for detecting figure interaction in image based on multi-feature fusion

Publications (1)

Publication Number Publication Date
CN113378676A true CN113378676A (en) 2021-09-10

Family

ID=77575206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110608515.XA Pending CN113378676A (en) 2021-06-01 2021-06-01 Method for detecting figure interaction in image based on multi-feature fusion

Country Status (1)

Country Link
CN (1) CN113378676A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523421A (en) * 2020-04-14 2020-08-11 上海交通大学 Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
CN112149616A (en) * 2020-10-13 2020-12-29 西安电子科技大学 Figure interaction behavior recognition method based on dynamic information
CN112784736A (en) * 2021-01-21 2021-05-11 西安理工大学 Multi-mode feature fusion character interaction behavior recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG C, ET AL.: "An improved human-object interaction detection method based on short-term memory selection network", 《2020 INTERNATIONAL CONFERENCE ON IMAGE, VIDEO PROCESSING AND ARTIFICIAL INTELLIGENCE》, pages 1 - 7 *
WANG C, ET AL.: "Multi-stream network for human-object interaction detection", 《INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL》, pages 1 - 16 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114004985A (en) * 2021-10-29 2022-02-01 北京百度网讯科技有限公司 Human interaction detection method, neural network and training method, device and medium thereof
CN114004985B (en) * 2021-10-29 2023-10-31 北京百度网讯科技有限公司 Character interaction detection method, neural network, training method, training equipment and training medium thereof
CN114170623A (en) * 2021-11-15 2022-03-11 华侨大学 Human interaction detection equipment and method and device thereof, and readable storage medium
CN114170688A (en) * 2022-02-11 2022-03-11 北京世纪好未来教育科技有限公司 Character interaction relation identification method and device and electronic equipment
CN114170688B (en) * 2022-02-11 2022-04-19 北京世纪好未来教育科技有限公司 Character interaction relation identification method and device and electronic equipment
CN114550223A (en) * 2022-04-25 2022-05-27 中国科学院自动化研究所 Person interaction detection method and device and electronic equipment
CN115147817A (en) * 2022-06-17 2022-10-04 淮阴工学院 Posture-guided driver distraction behavior recognition method of instance-aware network
CN115147817B (en) * 2022-06-17 2023-06-20 淮阴工学院 Driver distraction behavior recognition method of instance perception network guided by gestures
CN115170817A (en) * 2022-07-21 2022-10-11 广州大学 Figure interaction detection method based on three-dimensional figure-figure grid topology enhancement
CN115170817B (en) * 2022-07-21 2023-04-28 广州大学 Character interaction detection method based on three-dimensional human-object grid topology enhancement
CN115063640A (en) * 2022-08-15 2022-09-16 阿里巴巴(中国)有限公司 Interaction detection method, and pre-training method and device of interaction detection model
CN116662587A (en) * 2023-07-31 2023-08-29 华侨大学 Character interaction detection method, device and equipment based on query generator
CN116662587B (en) * 2023-07-31 2023-10-03 华侨大学 Character interaction detection method, device and equipment based on query generator
CN117953589A (en) * 2024-03-27 2024-04-30 武汉工程大学 Interactive action detection method, system, equipment and medium

Similar Documents

Publication Publication Date Title
CN113378676A (en) Method for detecting figure interaction in image based on multi-feature fusion
Boulahia et al. Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition
CN110728209B (en) Gesture recognition method and device, electronic equipment and storage medium
Baccouche et al. Sequential deep learning for human action recognition
Akmeliawati et al. Real-time Malaysian sign language translation using colour segmentation and neural network
CN110633632A (en) Weak supervision combined target detection and semantic segmentation method based on loop guidance
Chen et al. Research on recognition of fly species based on improved RetinaNet and CBAM
Wang et al. Towards realistic predictors
CN113673510B (en) Target detection method combining feature point and anchor frame joint prediction and regression
CN107133569A (en) The many granularity mask methods of monitor video based on extensive Multi-label learning
CN112036276B (en) Artificial intelligent video question-answering method
CN106909938B (en) Visual angle independence behavior identification method based on deep learning network
CN103605972A (en) Non-restricted environment face verification method based on block depth neural network
CN111160350A (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
Potdar et al. A convolutional neural network based live object recognition system as blind aid
CN106548194B (en) The construction method and localization method of two dimensional image human joint points location model
CN109033321B (en) Image and natural language feature extraction and keyword-based language indication image segmentation method
Patel et al. Hand gesture recognition system using convolutional neural networks
Defriani et al. Recognition of Regional Traditional House in Indonesia Using Convolutional Neural Network (CNN) Method
CN112749738B (en) Zero sample object detection method for performing superclass reasoning by fusing context
Adewopo et al. Baby physical safety monitoring in smart home using action recognition system
CN115798055B (en) Violent behavior detection method based on cornersort tracking algorithm
KR20010050988A (en) Scale and Rotation Invariant Intelligent Face Detection
CN117011274A (en) Automatic glass bottle detection system and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination