CN113378676A - Method for detecting human-object interaction in images based on multi-feature fusion - Google Patents

Method for detecting human-object interaction in images based on multi-feature fusion

Info

Publication number
CN113378676A
CN113378676A (application CN202110608515.XA)
Authority
CN
China
Prior art keywords
interaction
character
feature
convolution
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110608515.XA
Other languages
Chinese (zh)
Inventor
马世伟
汪畅
孙金玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202110608515.XA priority Critical patent/CN113378676A/en
Publication of CN113378676A publication Critical patent/CN113378676A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses a method for detecting human-object interaction in images based on multi-feature fusion. A target detection algorithm first detects all instance information in the image, including human position information and object position and category information; this information is then fed into a trained human-object interaction recognition network to detect the interaction behaviors between human-object pairs in the image to be detected. Building on the global spatial configuration of the interaction captured by the human pose, the invention focuses on the informative region where the person and the object intersect, learns finer local features, and increases the probability of matching correct human-object interaction pairs. A short-term memory selection module effectively screens and exploits the information of the person, the object, and their background region, and the fusion of these multiple features improves the accuracy of human-object interaction detection.

Description

Method for detecting human-object interaction in images based on multi-feature fusion
Technical Field
The invention belongs to the technical field of detecting and understanding visual relationships in images with computer vision, and particularly relates to a method for detecting human-object interaction in images based on multi-feature fusion.
Background
Human-Object Interaction (HOI) detection in images aims to automatically locate the interacting people and objects in an input picture using computer vision and to identify the category of interaction between each <human, object> pair, so that a machine can automatically understand the image content. HOI detection is a core technology for automatically understanding deep visual relationships and realizing high-level artificial intelligence through computer vision, and can be widely applied in fields such as intelligent robotics, security monitoring, information retrieval, and human-computer interaction.
Most existing HOI detection methods start from the result of target detection, exhaustively pair all detected people and objects in the image, and estimate the interaction between them from the appearance features of the person and the object and the spatial features between the <human, object> pair. This inference relying only on instance-level features remains insufficient for relatively complex interaction classes, resulting in poor overall detection accuracy. First, because of the lack of detail cues, it is difficult to determine from an instance-level representation whether a person is related to a given object instance, which easily leads to erroneous associations between people and non-interacting objects. In addition, when fine-grained interaction types are distinguished only by similar instance-level features, the intrinsic relations among the features are not exploited effectively, and complex cases cannot be judged accurately.
Disclosure of Invention
In order to solve the problems in the prior art, the invention aims to overcome the above defects and to provide a method for detecting human-object interaction in images based on multi-feature fusion. On the basis of capturing the global spatial configuration of the interaction with the human pose, the method focuses on the informative region where the person and the object intersect in the image, learns finer local features through a multi-branch neural network, increases the probability of correctly matching <human, object> interaction pairs, effectively screens and exploits the information of the person, the object, and their background region with a short-term memory selection module, and realizes human-object interaction detection through the fusion of these features.
To achieve this purpose, the invention adopts the following inventive concept:
First, all instance information in the picture, including human position information and object position and category information, is detected with a target detection algorithm; this information is then fed into a trained human-object interaction recognition network, and the interaction behaviors between human-object pairs in the picture to be detected are detected. The interaction recognition network adopts a multi-branch neural network structure comprising a pairwise branch, an intersection branch, and a short-term memory selection branch, and the network learns multiple features of the <human, object> instances in the picture.
According to the inventive concept, the invention adopts the following technical scheme:
a method for detecting human interaction in an image based on multi-feature fusion comprises the following operation steps:
step 1: inputting an original picture;
step 2: detecting a target;
and step 3: constructing a figure interaction identification network;
and 4, step 4: detecting the character interaction behavior of the picture to be detected;
in the step 2, after all example information including human body position information and object position and category information in the picture is detected by using a target detection algorithm, inputting a trained character interaction behavior recognition network, and detecting interaction behaviors between character pairs in the picture to be detected;
in the step 3, the character interaction identification network adopts a multi-branch neural network structure, which comprises paired branches, intersection branches and short-term memory selection branches, and the network performs learning training on the < human-object > example in the picture on various characteristics.
Preferably, in step 2, the target detection proceeds as follows:
A trained target detector performs target detection on the input picture to obtain a candidate box b_h and confidence s_h for the person and a candidate box b_o and confidence s_o for the object, where the subscript h denotes the human body and o denotes the object.
Preferably, in step 3, constructing the human-object interaction recognition network comprises the following steps:
1) extracting convolution features of the whole picture:
The original input picture is passed through the classical residual network ResNet-50 for convolutional feature extraction to obtain the global convolution feature map F of the whole picture, which, together with the human position b_h and object position b_o from the target detection result, serves as the input of the human-object interaction detection network;
2) constructing the pairwise branch:
A two-channel binary map B_{h,o} is generated from the given human and object bounding boxes and fed into a shallow convolutional neural network comprising two convolutional layers and two pooling layers; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 64 and 32 respectively, and both pooling layers are max pooling. The resulting position feature map is flattened into the position feature vector f_sp, where the subscript sp denotes the relative spatial position of the person and the object. This vector is fed into a fully connected classifier with a sigmoid activation to obtain the classification score S^a_sp of the position-feature branch for each interaction class a, where a is the corresponding interaction category in {1, ..., A} and A is the number of all interaction categories;
3) constructing the intersection branch:
First, the bounding-box coordinates b_inter of the human-object intersection are obtained from the positions of the person and the object, where the subscript inter denotes the intersection of the person and the object. ROI Pooling is used to crop the convolution features of the intersection region from the global convolution feature map F, the features are refined with a residual block Res, and the human-object intersection region feature f_inter is obtained after a global average pooling layer GAP. Meanwhile, the human key-point detection result of the picture is encoded: within the minimum enclosing rectangle of each human-object pair, the different joint points are connected with line segments of different gray values according to the skeleton model provided by the COCO dataset, so as to characterize the different parts of the body (the COCO dataset is a large public dataset produced by Microsoft and suitable for a variety of computer vision tasks). The pixel values of the remaining area in the rectangle are set to 0, and the rectangle is resized to a fixed 64 x 64 size to obtain the pose feature map. The pose feature f_pose is then extracted with two convolution-pooling layers, where the subscript pose denotes the human pose; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 32 and 16 respectively, and both pooling layers are max pooling. The intersection region feature f_inter and the pose feature f_pose are concatenated and fused through two fully connected layers to obtain f_inter-pose, which is fed into a fully connected classifier with a sigmoid function to obtain the A-dimensional classification score S^a_inter of the intersection branch;
4) constructing the short-term memory selection branch:
First, according to the human position coordinates b_h, ROI Pooling is performed on the global convolution feature map F to extract the human region features; the features are then refined with a residual block Res and passed through global average pooling GAP to obtain the pooled human feature vector f_h.
According to the object position coordinates b_o, ROI Pooling is performed on the global convolution feature map F to extract the object region features; the features are refined with a residual block Res and passed through global average pooling GAP to obtain the pooled object visual feature vector f_o^vis, where the superscript vis denotes the visual feature. Publicly available Word2vec vectors pre-trained on the Google News dataset are selected as the object semantic features, so that a 300-dimensional semantic feature vector f_o^sem can be extracted for each object class label, where the superscript sem denotes the semantic feature. The object semantic feature vector f_o^sem and the visual feature f_o^vis are concatenated and passed through a fully connected layer, finally yielding the 1024-dimensional object feature vector f_o. The publicly available Word2vec vectors are preferably trained on a corpus of roughly 100 billion words; the Google News dataset is a dataset produced by Google.
For the visual feature of the common region, the minimum enclosing rectangle, i.e. the union region b_union of the two bounding boxes, is first computed from the bounding boxes of the person and the object, where the subscript union denotes the union of the person and the object. Using the common-region bounding-box coordinates, ROI Pooling on the convolution feature map normalizes the region to a fixed 7 x 7 size, and a 2048-dimensional visual feature vector f_union^vis is then extracted through a residual block and global average pooling. This vector is concatenated with the position feature vector f_sp output by the pairwise branch and fed into a fully connected layer to obtain the 1024-dimensional fused common-region feature f_union.
Finally, the human feature f_h, the object feature f_o, and the human-object common-region feature f_union are fed into the short-term memory selection module, which consists of two Gated Recurrent Unit (GRU) cells. The common-region feature f_union serves as the initial state of the short-term memory module, the input of the first GRU cell is the human representation f_h, and the input of the second cell is the object representation f_o. The output state of the short-term memory selection module gives the representation f_hoi, which is passed through a fully connected classifier with a sigmoid function to obtain the classification score S^a_STMS of the short-term memory selection branch;
5) training the human-object interaction recognition network:
The three branches together form the complete human-object interaction recognition network. Samples in the training set are used as input to the network, the sum of the cross-entropy loss functions of the three branches is computed, and the network parameters are updated by gradient descent until the maximum number of optimization iterations is reached, at which point training stops and the trained human-object interaction recognition network is obtained.
Preferably, in step 4, the detection of human-object interaction behavior in the picture to be detected proceeds as follows:
First, target detection is applied to the picture to be detected to obtain the position and category information of the people and objects, and all of this information is fed into the trained human-object interaction recognition network for judgment. A classify-then-fuse feature fusion scheme is adopted: each branch extracts its own features and performs detection and classification, and the classification scores of all branches are then fused to obtain the final human-object interaction detection result. For each human-object pair (b_h, b_o), the final human-object interaction detection score S^a_{h,o} is obtained by combining s_h and s_o, the confidences of the human and the object in the target detection result, with the branch scores s^a_k, where s^a_k is the probability score vector of belonging to each category in the A-class interaction classification task and k indexes the different branches.
Compared with the prior art, the invention has the following obvious and prominent substantive features and remarkable advantages:
1. The invention improves the accuracy of human-object interaction detection by fully fusing multiple features. It attends to the human pose information, fuses it effectively with the intersection-region features, and adds detail information on top of the captured global spatial configuration, so that the network learns more local detail, establishes the association between the person and the object, makes the features more discriminative, increases the probability of matching correct human-object interaction pairs, and improves the overall classification accuracy;
2. The invention fully screens the features of the person, the object, and their common region through the short-term memory selection module, effectively exploits background information, and further improves the average precision of human-object interaction detection.
Drawings
FIG. 1 is a flowchart of the method for detecting human-object interaction in an image based on multi-feature fusion according to the present invention.
FIG. 2 is a schematic structural diagram of the human-object interaction recognition network according to the present invention.
FIG. 3 is a schematic diagram of the 17-joint human skeleton model obtained by human key-point detection.
Detailed Description
The above-described scheme is further illustrated below with reference to specific embodiments, which are detailed as follows:
Embodiment 1:
in this embodiment, referring to fig. 1, a method for detecting human interaction in an image based on multi-feature fusion includes the following steps:
step 1: inputting an original picture;
step 2: detecting a target;
and step 3: constructing a figure interaction identification network;
and 4, step 4: detecting the character interaction behavior of the picture to be detected;
in the step 2, after all example information including human body position information and object position and category information in the picture is detected by using a target detection algorithm, inputting a trained character interaction behavior recognition network, and detecting interaction behaviors between character pairs in the picture to be detected;
in the step 3, the character interaction identification network adopts a multi-branch neural network structure, which comprises paired branches, intersection branches and short-term memory selection branches, and the network performs learning training on the < human-object > example in the picture on various characteristics.
According to the method and the device, on the basis of capturing the global spatial configuration of the interaction relation by using the pose, effective information provided by the intersection region of the people and the object is focused, more precise local features are learned, the probability of correct people interaction on matching is increased, the people, the object and background region information are effectively screened and utilized by means of the short-term memory selection module, and the precision of people interaction detection is improved by means of fusion of various features.
Embodiment 2:
This embodiment is substantially the same as the first embodiment, with the following characteristics:
In this embodiment, in step 2, the target detection proceeds as follows:
A trained target detector performs target detection on the input picture to obtain a candidate box b_h and confidence s_h for the person and a candidate box b_o and confidence s_o for the object, where the subscript h denotes the human body and o denotes the object.
In this embodiment, in step 3, constructing the human-object interaction recognition network includes the following steps:
1) extracting convolution features of the whole picture:
The original input picture is passed through the classical residual network ResNet-50 for convolutional feature extraction to obtain the global convolution feature map F of the whole picture, which, together with the human position b_h and object position b_o from the target detection result, serves as the input of the human-object interaction detection network;
2) constructing the pairwise branch:
A two-channel binary map B_{h,o} is generated from the given human and object bounding boxes and fed into a shallow convolutional neural network comprising two convolutional layers and two pooling layers; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 64 and 32 respectively, and both pooling layers are max pooling. The resulting position feature map is flattened into the position feature vector f_sp, where the subscript sp denotes the relative spatial position of the person and the object. This vector is fed into a fully connected classifier with a sigmoid activation to obtain the classification score S^a_sp of the position-feature branch for each interaction class a, where a is the corresponding interaction category in {1, ..., A} and A is the number of all interaction categories;
3) constructing the intersection branch:
First, the bounding-box coordinates b_inter of the human-object intersection are obtained from the positions of the person and the object, where the subscript inter denotes the intersection of the person and the object. ROI Pooling on the region of interest is used to crop the convolution features of the intersection region from the global convolution feature map F, the features are refined with a residual block Res, and the human-object intersection region feature f_inter is obtained through a global average pooling layer GAP. Meanwhile, the human key-point detection result of the picture is encoded: within the minimum enclosing rectangle of each human-object pair, the different joint points are connected with line segments of different gray values according to the skeleton model provided by the COCO dataset, so as to characterize the different parts of the body; the COCO dataset is a large public dataset produced by Microsoft and suitable for a variety of computer vision tasks. The pixel values of the remaining area in the rectangle are set to 0, and the rectangle is resized to a fixed 64 x 64 size to obtain the pose feature map. The pose feature f_pose is then extracted with two convolution-pooling layers, where the subscript pose denotes the human pose; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 32 and 16 respectively, and both pooling layers are max pooling. The intersection region feature f_inter and the pose feature f_pose are concatenated and fused through two fully connected layers to obtain f_inter-pose, which is fed into a fully connected classifier with a sigmoid function to obtain the A-dimensional classification score S^a_inter of the intersection branch;
4) constructing the short-term memory selection branch:
First, according to the human position coordinates b_h, ROI Pooling is performed on the global convolution feature map F to extract the human region features; the features are then refined with a residual block Res and passed through global average pooling GAP to obtain the pooled human feature vector f_h.
According to the object position coordinates b_o, ROI Pooling is performed on the global convolution feature map F to extract the object region features; the features are refined with a residual block Res and passed through global average pooling GAP to obtain the pooled object visual feature vector f_o^vis, where the superscript vis denotes the visual feature. Publicly available Word2vec vectors pre-trained on the Google News dataset (a corpus of roughly 100 billion words) are selected as the object semantic features, so that a 300-dimensional semantic feature vector f_o^sem can be extracted for each object class label, where the superscript sem denotes the semantic feature. The object semantic feature vector f_o^sem and the visual feature f_o^vis are concatenated and passed through a fully connected layer, finally yielding the 1024-dimensional object feature vector f_o.
For the visual feature of the common region, the minimum enclosing rectangle, i.e. the union region b_union of the two bounding boxes, is first computed from the bounding boxes of the person and the object, where the subscript union denotes the union of the person and the object. Using the common-region bounding-box coordinates, ROI Pooling on the convolution feature map normalizes the region to a fixed 7 x 7 size, and a 2048-dimensional visual feature vector f_union^vis is then extracted through a residual block and global average pooling. This vector is concatenated with the position feature vector f_sp output by the pairwise branch and fed into a fully connected layer to obtain the 1024-dimensional fused common-region feature f_union.
Finally, the human feature f_h, the object feature f_o, and the human-object common-region feature f_union are fed into the short-term memory selection module, which consists of two Gated Recurrent Unit (GRU) cells. The common-region feature f_union serves as the initial state of the short-term memory module, the input of the first GRU cell is the human representation f_h, and the input of the second cell is the object representation f_o. The output state of the short-term memory selection module gives the representation f_hoi, which is passed through a fully connected classifier with a sigmoid function to obtain the classification score S^a_STMS of the short-term memory selection branch;
5) training the human-object interaction recognition network:
The three branches together form the complete human-object interaction recognition network. Samples in the training set are used as input to the network, the sum of the cross-entropy loss functions of the three branches is computed, and the network parameters are updated by gradient descent until the maximum number of optimization iterations is reached, at which point training stops and the trained human-object interaction recognition network is obtained.
In this embodiment, in step 4, the detection of human-object interaction behavior in the picture to be detected proceeds as follows:
First, target detection is applied to the picture to be detected to obtain the position and category information of the people and objects, and all of this information is fed into the trained human-object interaction recognition network for judgment. A classify-then-fuse feature fusion scheme is adopted: each branch extracts its own features and performs detection and classification, and the classification scores of all branches are then fused to obtain the final human-object interaction detection result. For each human-object pair (b_h, b_o), the final human-object interaction detection score S^a_{h,o} is obtained by combining s_h and s_o, the confidences of the human and the object in the target detection result, with the branch scores s^a_k, where s^a_k is the probability score vector of belonging to each category in the A-class interaction classification task and k indexes the different branches.
In this method for detecting human-object interaction in images based on multi-feature fusion, after all instance information in the picture, including human position information and object position and category information, is detected with a target detection algorithm, it is fed into a trained human-object interaction recognition network, and the interaction behaviors between human-object pairs in the picture to be detected are detected. On the basis of capturing the global spatial configuration of the interaction with the pose, the method focuses on the informative region where the person and the object intersect, learns finer local features, increases the probability of matching correct human-object interaction pairs, effectively screens and exploits the information of the person, the object, and their background region with the short-term memory selection module, and improves the accuracy of human-object interaction detection by fusing multiple features.
Embodiment 3:
This embodiment is substantially the same as the above embodiments, with the following characteristics:
In this embodiment, as shown in FIG. 1, the method for detecting human-object interaction in an image based on multi-feature fusion specifically includes the following steps:
step 1: and performing target detection on the picture to acquire all instance information including human body position information and object position and category information, and forming < human-object > instances by people and objects to perform human interaction detection on an input human interaction identification network.
Step 2: and constructing a character interactive recognition network, and learning various characteristics of < human-object > examples in the picture by adopting a multi-branch neural network structure, wherein the characteristics comprise paired branches, intersection branches and short-term memory selection branches, and each branch is used for extracting different characteristic information to detect the interactive behavior between character pairs. The steps of implementing the human interactive identification network according to the present invention are further described with reference to fig. 2.
1) Performing convolution feature extraction on the original input picture by using ResNet-50 to obtain a global convolution feature map F of the whole picture and a human body position b of a target detection resulthObject position boThe characters are used as the input of a character interaction identification network together;
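A minimal sketch of this feature extraction step follows. Truncating ResNet-50 after its third residual stage, so that a conv5-style residual block applied later to the pooled regions yields the 2048-dimensional vectors mentioned below, is an assumption rather than a detail fixed by the text.

```python
# Sketch of the global convolution feature map F (assumption: ResNet-50 truncated after
# its third residual stage; torchvision >= 0.13 assumed for the weights argument).
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet50(weights="DEFAULT")
backbone = nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,   # output: (N, 1024, H/16, W/16)
)

def global_feature_map(images):
    """images: (N, 3, H, W) normalized batch -> global convolution feature map F."""
    return backbone(images)
```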
2) Constructing the pairwise branch, specifically:
2a) First, a two-channel binary map is generated from the given human and object bounding boxes: the first channel has value 1 at the pixels enclosed by the human bounding box and 0 elsewhere; the second channel has value 1 at the pixels enclosed by the object bounding box and 0 elsewhere. A square concentric with the rectangle enclosing the person and object instances is then used to crop out the position information of the person and the object, discarding the remaining invalid information; the side length of the cropping square equals the longer side of the enclosing rectangle. Finally, the square two-channel binary map is resized to a fixed 64 x 64 size, yielding the binary map B_{h,o}.
2b) The binary map B_{h,o} is fed into a shallow convolutional neural network comprising two convolutional layers and two pooling layers; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 64 and 32 respectively, and both pooling layers are max pooling. The resulting position feature map is flattened into the position feature vector f_sp, which is then fed into a fully connected classifier with a sigmoid function to obtain the classification score S^a_sp of the position-feature branch for each interaction class a, where a is the corresponding interaction category and A is the number of all interaction categories.
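The pairwise branch of 2a) and 2b) can be sketched as follows; the padding choice and the resulting flattened feature dimension are assumptions.

```python
# Sketch of the pairwise branch: two-channel binary map B_{h,o} -> two 5x5 conv layers
# (64 and 32 kernels) with max pooling -> flatten to f_sp -> sigmoid classifier over A classes.
import torch
import torch.nn as nn

class PairwiseBranch(nn.Module):
    def __init__(self, num_classes, map_size=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * (map_size // 4) ** 2, num_classes)

    def forward(self, binary_map):                     # binary_map: (N, 2, 64, 64)
        f_sp = self.features(binary_map).flatten(1)    # position feature vector f_sp
        s_sp = torch.sigmoid(self.classifier(f_sp))    # per-class scores S^a_sp
        return f_sp, s_sp
```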
3) Constructing the intersection branch, specifically:
3a) First, from the human bounding box b_h = (x_min^h, y_min^h, x_max^h, y_max^h) and the object bounding box b_o = (x_min^o, y_min^o, x_max^o, y_max^o), the coordinates b_inter = (x_min^inter, y_min^inter, x_max^inter, y_max^inter) of the human-object intersection box are obtained, where area(·) denotes the area of a region and the intersection box is taken when area(b_h ∩ b_o) > 0, with:
x_min^inter = max(x_min^h, x_min^o)
y_min^inter = max(y_min^h, y_min^o)
x_max^inter = min(x_max^h, x_max^o)
y_max^inter = min(y_max^h, y_max^o)
The convolution features of the intersection region are cropped from the global convolution feature map F with ROI Pooling, refined with a residual block Res, and passed through global average pooling GAP to obtain the human-object intersection region feature f_inter, expressed as:
f_inter = GAP(Res_inter(ROI(F, b_inter)))
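A sketch of 3a) is given below. torchvision's roi_pool stands in for the ROI Pooling operation; the 7 x 7 output size, the feature stride of 16, and the reuse of a ResNet conv5-style block as Res_inter are assumptions.

```python
# Sketch of 3a): intersection box b_inter and f_inter = GAP(Res_inter(ROI(F, b_inter))).
import torch
import torch.nn as nn
import torchvision

def intersection_box(b_h, b_o):
    """b_h, b_o: (x_min, y_min, x_max, y_max) tensors; returns b_inter, or None if disjoint."""
    x1, y1 = torch.max(b_h[0], b_o[0]), torch.max(b_h[1], b_o[1])
    x2, y2 = torch.min(b_h[2], b_o[2]), torch.min(b_h[3], b_o[3])
    if x2 <= x1 or y2 <= y1:                          # area(b_h ∩ b_o) = 0
        return None
    return torch.stack([x1, y1, x2, y2])

class RoiRegionFeature(nn.Module):
    """f = GAP(Res(ROI(F, b))): ROI Pooling, residual block, global average pooling."""
    def __init__(self, res_block, spatial_scale=1.0 / 16):   # stride 16 is an assumption
        super().__init__()
        self.res = res_block                                  # e.g. a copy of ResNet-50 layer4
        self.scale = spatial_scale

    def forward(self, feat_map, boxes):                       # boxes: (K, 5) = (batch_idx, x1, y1, x2, y2)
        roi = torchvision.ops.roi_pool(feat_map, boxes, output_size=7, spatial_scale=self.scale)
        return self.res(roi).mean(dim=(2, 3))                 # pooled region feature, e.g. (K, 2048)
```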
3b) The human key-point detection result is encoded. A human key-point detection network performs pose estimation on the picture to obtain the human key-point vector P = {p_1, p_2, ..., p_K}, where p_k = (x_k, y_k) is the coordinate of the k-th human key point and K = 17 is the total number of extracted key points. The seventeen joint points of the pose estimation result are encoded: within the minimum enclosing rectangle of each human-object pair, the different joint points are connected, following the skeleton model provided by the COCO dataset as shown in FIG. 3, with line segments whose gray values range from 0.15 to 0.95 and represent different parts of the body; for example, a line with gray value 0.5 connects the left elbow and the left wrist and represents the left forearm. The pixel values of the remaining area in the rectangle are set to 0, and the rectangle is resized to a fixed 64 x 64 scale to obtain the pose feature map. Finally, the pose feature f_pose is extracted with two convolution-pooling layers; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 32 and 16 respectively, and both pooling layers are max pooling.
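The pose map of 3b) can be rasterized as in the following sketch; the COCO skeleton edge list and the particular gray value assigned to each limb are assumptions within the stated 0.15 to 0.95 range.

```python
# Sketch of 3b): rasterizing the 17 COCO key points of one person into a 64 x 64 pose map.
# The skeleton edge list and the gray value per limb are assumptions within 0.15-0.95.
import numpy as np
import cv2

SKELETON = [(5, 7), (7, 9), (6, 8), (8, 10), (11, 13), (13, 15), (12, 14), (14, 16),
            (5, 6), (11, 12), (5, 11), (6, 12), (0, 5), (0, 6)]   # assumed COCO joint pairs

def pose_map(keypoints, pair_box, size=64):
    """keypoints: (17, 2) array of (x, y); pair_box: minimum enclosing rectangle (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = pair_box
    w, h = max(x2 - x1, 1), max(y2 - y1, 1)
    canvas = np.zeros((size, size), dtype=np.float32)             # background pixels stay 0
    grays = np.linspace(0.15, 0.95, num=len(SKELETON))            # one gray value per limb
    for (a, b), g in zip(SKELETON, grays):
        pa = (int((keypoints[a, 0] - x1) / w * size), int((keypoints[a, 1] - y1) / h * size))
        pb = (int((keypoints[b, 0] - x1) / w * size), int((keypoints[b, 1] - y1) / h * size))
        cv2.line(canvas, pa, pb, color=float(g), thickness=1)
    return canvas
```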
3c) The intersection region feature f_inter and the pose feature f_pose are concatenated and fused through two fully connected layers to obtain f_inter-pose, expressed as:
f_inter-pose = W_inter(f_inter ⊕ f_pose)
where ⊕ denotes the hard connection (concatenation) of feature vectors and W_inter is a projection matrix. The fused feature f_inter-pose is passed through a fully connected classifier with a sigmoid function to obtain the A-dimensional classification score S^a_inter of the intersection branch.
4) Constructing the short-term memory selection branch, specifically:
4a) According to the human position coordinates b_h, ROI Pooling is performed on the global convolution feature map F to extract the human region features, which are refined with a residual block Res and passed through global average pooling GAP to obtain the pooled human feature vector f_h:
f_h = GAP(Res_h(ROI(F, b_h)))
4b) According to the object position coordinates b_o, ROI Pooling is performed on the global convolution feature map F to extract the object region features, which are refined with a residual block Res and passed through global average pooling GAP to obtain the pooled object visual feature vector f_o^vis:
f_o^vis = GAP(Res_o(ROI(F, b_o)))
Object semantic features are extracted with publicly available Word2vec vectors pre-trained on the Google News dataset, yielding a 300-dimensional semantic feature vector f_o^sem for each object class label. The object semantic feature vector f_o^sem and the visual feature f_o^vis are concatenated and passed through a fully connected layer, finally yielding the 1024-dimensional object feature vector f_o.
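A sketch of the object feature fusion in 4b) follows; loading the GoogleNews Word2vec vectors through gensim and the ReLU after the fully connected layer are assumptions.

```python
# Sketch of 4b): fusing the 2048-d pooled visual feature with a 300-d Word2vec class
# embedding into the 1024-d object feature f_o.
import torch
import torch.nn as nn
# from gensim.models import KeyedVectors
# w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
# f_o_sem = torch.tensor(w2v["bicycle"])          # 300-d semantic vector for the class label

class ObjectFeature(nn.Module):
    def __init__(self, vis_dim=2048, sem_dim=300, out_dim=1024):
        super().__init__()
        self.fc = nn.Linear(vis_dim + sem_dim, out_dim)

    def forward(self, f_o_vis, f_o_sem):
        # f_o_vis: (N, 2048) from ROI Pooling + Res + GAP; f_o_sem: (N, 300) Word2vec label vector
        return torch.relu(self.fc(torch.cat([f_o_vis, f_o_sem], dim=1)))   # f_o: (N, 1024)
```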
4c) For the visual feature of the common region, the minimum enclosing rectangle, i.e. the union region b_union of the two bounding boxes, is first computed from the bounding boxes of the person and the object. ROI Pooling with the common-region bounding-box coordinates b_union normalizes the region on the convolution feature map to a fixed 7 x 7 size, and a 2048-dimensional visual feature vector f_union^vis is then extracted through a residual block and global average pooling:
f_union^vis = GAP(Res_union(ROI(F, b_union)))
This vector is concatenated with the position feature vector f_sp output by the pairwise branch and fed into a fully connected layer to obtain the 1024-dimensional fused common-region feature f_union:
f_union = W_union(f_union^vis ⊕ f_sp)
where ⊕ denotes the hard connection (concatenation) of feature vectors and W_union is a projection matrix.
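The common-region feature of 4c) can be sketched as follows; the dimension assumed for f_sp is an illustrative choice, and f_union_vis is produced by the same ROI Pooling plus residual block plus GAP pipeline sketched for 3a).

```python
# Sketch of 4c): union box b_union and f_union = W_union(f_union_vis ⊕ f_sp).
import torch
import torch.nn as nn

def union_box(b_h, b_o):
    """Minimum enclosing rectangle (union region) of the person and object boxes."""
    return torch.stack([torch.min(b_h[0], b_o[0]), torch.min(b_h[1], b_o[1]),
                        torch.max(b_h[2], b_o[2]), torch.max(b_h[3], b_o[3])])

class UnionFusion(nn.Module):
    def __init__(self, vis_dim=2048, sp_dim=8192, out_dim=1024):   # sp_dim is an assumption
        super().__init__()
        self.fc = nn.Linear(vis_dim + sp_dim, out_dim)             # projection W_union

    def forward(self, f_union_vis, f_sp):
        return self.fc(torch.cat([f_union_vis, f_sp], dim=1))      # f_union: (N, 1024)
```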
4d) The extracted human feature f_h, object feature f_o, and human-object common-region feature f_union are fed into the short-term memory selection module for feature fusion. The short-term memory selection module consists of two GRU cells, whose parameters are updated as follows:
z_t = σ(W_z x_t + U_z h_{t-1} + b_z)
r_t = σ(W_r x_t + U_r h_{t-1} + b_r)
h̃_t = tanh(W_h x_t + U_h(r_t ⊙ h_{t-1}) + b_h)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
where x_t is the input of the t-th GRU cell, t = 1, 2, and the GRU network accumulates and updates information in the hidden state h_t. z_t is the update-gate signal, with weight matrices W_z, U_z and bias vector b_z; σ(·) is the sigmoid activation function. h̃_t denotes the candidate state, with weight matrices W_h, U_h and bias vector b_h; ⊙ denotes the element-wise product. r_t is the reset-gate signal, with weight matrices W_r, U_r and bias vector b_r. The initial state of the short-term memory module is the human-object common-region feature, h_0 = f_union; the input of the first GRU cell is the human representation, x_1 = f_h, and the input of the second cell is the object representation, x_2 = f_o. The final output state h_2 serves as the HOI representation f_hoi, which is passed through a fully connected layer whose output dimension equals the number of HOI categories A and a sigmoid function to obtain the classification score S^a_STMS.
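The short-term memory selection module of 4d) can be sketched with two GRU cells as below; the feature dimensions used (2048 for f_h, 1024 for f_o and f_union) are assumptions consistent with the text.

```python
# Sketch of 4d): two GRU cells with h_0 = f_union, x_1 = f_h, x_2 = f_o and f_hoi = h_2.
import torch
import torch.nn as nn

class ShortTermMemorySelection(nn.Module):
    def __init__(self, num_classes, human_dim=2048, object_dim=1024, hidden_dim=1024):
        super().__init__()
        self.gru1 = nn.GRUCell(human_dim, hidden_dim)     # first cell: human representation
        self.gru2 = nn.GRUCell(object_dim, hidden_dim)    # second cell: object representation
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, f_h, f_o, f_union):
        h1 = self.gru1(f_h, f_union)                      # h_0 = f_union, x_1 = f_h
        f_hoi = self.gru2(f_o, h1)                        # x_2 = f_o, output state h_2
        s_stms = torch.sigmoid(self.classifier(f_hoi))    # per-class scores S^a_STMS
        return f_hoi, s_stms
```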
5) Training the human-object interaction recognition network. The goal of neural network training is to minimize the loss function between the true labels and the predicted labels. For all samples in the training set, the cross-entropy loss between the prediction y_i of each branch and the true label a_i is computed as:
loss = -Σ_i [a_i · log(y_i) + (1 - a_i) · log(1 - y_i)]
The loss function of the whole interaction recognition network is the sum of the branch losses, comprising the pairwise-branch classification loss loss_sp, the intersection-branch classification loss loss_inter, and the short-term-memory-selection-branch classification loss loss_STMS:
Loss_hoi = λ1 · loss_STMS + λ2 · loss_sp + loss_inter
To balance the differences in the contribution of the different branches, the branch losses are weighted with different factors, λ1 = 2 and λ2 = 0.5. During training, the network parameters are updated with stochastic gradient descent with momentum until the maximum number of optimization iterations is reached, at which point training terminates and the finally trained human-object interaction recognition network is obtained.
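A sketch of the training objective and of an SGD-with-momentum update step for 5) follows; the learning rate and momentum values are assumptions.

```python
# Sketch of 5): weighted multi-branch binary cross-entropy loss and one SGD-with-momentum
# update step (learning rate and momentum values are assumptions).
import torch
import torch.nn as nn

bce = nn.BCELoss()   # branch outputs have already passed through a sigmoid

def hoi_loss(s_sp, s_inter, s_stms, labels, lambda1=2.0, lambda2=0.5):
    """labels: (N, A) multi-hot ground-truth interaction labels."""
    return lambda1 * bce(s_stms, labels) + lambda2 * bce(s_sp, labels) + bce(s_inter, labels)

# optimizer = torch.optim.SGD(network.parameters(), lr=1e-3, momentum=0.9)
# loss = hoi_loss(s_sp, s_inter, s_stms, labels)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```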
And step 3: and (3) detecting the character interaction behavior in the picture to be detected, firstly detecting the target of the picture to be detected in the step (1) to obtain the position category information of the character and the object, and then sending all the information into a trained character interaction identification network for judgment. The invention adopts a characteristic fusion mode of classifying first and then fusing, namely, each branch respectively extracts characteristics and detects and classifies the characteristics, and finally, the classification result scores of all the branches are fused to obtain the final human interaction behavior detection result. Finally for each character pair (b)h,bo) The final score calculation formula of the human interaction detection is as follows:
Figure BDA0003094564100000113
wherein s ish,soAs the confidence of the human body and the object in the target detection result,
Figure BDA0003094564100000114
to belong to each category of probability score vectors in the category a interactive behavior classification task,
Figure BDA0003094564100000115
denotes the different substreams.
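The exact fusion formula is given above only in symbolic form; the sketch below assumes a simple multiplicative combination of the detection confidences with the per-class branch scores, purely for illustration.

```python
# Illustrative fusion of the final score for one pair (b_h, b_o); the multiplicative
# combination below is an assumption, not the formula fixed by the invention.
import torch

def pair_score(s_h, s_o, s_sp, s_inter, s_stms):
    """s_h, s_o: detection confidences; s_sp, s_inter, s_stms: (A,) per-class branch scores."""
    return s_h * s_o * s_sp * s_inter * s_stms   # (A,) final scores S^a_{h,o}
```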
In this embodiment, the accuracy of human-object interaction detection is improved by fully fusing multiple features. The human pose information is attended to and fused effectively with the intersection-region features, and detail information is added on top of the captured global spatial configuration, so that the network learns more local detail, establishes the association between the person and the object, makes the features more discriminative, increases the probability of matching correct human-object interaction pairs, and improves the overall classification accuracy. Meanwhile, the short-term memory selection module fully screens the features of the person, the object, and their common region, effectively exploits background information, and further improves the average precision of human-object interaction detection.
The embodiments of the present invention have been described with reference to the accompanying drawings, but the invention is not limited to these embodiments, and various changes and modifications can be made according to the purpose of the invention. Any change, modification, substitution, combination, or simplification made according to the spirit and principle of the technical solution of the invention shall be an equivalent substitution, as long as it meets the purpose of the invention and does not depart from the technical principle and inventive concept of the invention, and shall fall within the protection scope of the invention.

Claims (4)

1. A method for detecting human-object interaction in an image based on multi-feature fusion, characterized by the following operation steps:
Step 1: inputting an original picture;
Step 2: target detection;
Step 3: constructing a human-object interaction recognition network;
Step 4: detecting the human-object interaction behavior in the picture to be detected;
wherein in step 2, after all instance information in the picture, including human position information and object position and category information, is detected with a target detection algorithm, it is fed into a trained human-object interaction recognition network, and the interaction behaviors between human-object pairs in the picture to be detected are detected;
and in step 3, the human-object interaction recognition network adopts a multi-branch neural network structure comprising a pairwise branch, an intersection branch, and a short-term memory selection branch, and the network is trained to learn multiple features of the <human, object> instances in the picture.
2. The human-object interaction detection method based on multi-feature fusion as claimed in claim 1, wherein in step 2 the target detection process is as follows:
a trained target detector performs target detection on the input picture to obtain a candidate box b_h and confidence s_h for the person and a candidate box b_o and confidence s_o for the object, where the subscript h denotes the human body and o denotes the object.
3. The human-object interaction detection method based on multi-feature fusion as claimed in claim 1, wherein in step 3 constructing the human-object interaction recognition network comprises the following steps:
1) extracting convolution features of the whole picture:
the original input picture is passed through the classical residual network ResNet-50 for convolutional feature extraction to obtain the global convolution feature map F of the whole picture, which, together with the human position b_h and object position b_o from the target detection result, serves as the input of the human-object interaction detection network;
2) constructing the pairwise branch:
a two-channel binary map B_{h,o} is generated from the given human and object bounding boxes and fed into a shallow convolutional neural network comprising two convolutional layers and two pooling layers; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 64 and 32 respectively, and both pooling layers are max pooling; the resulting position feature map is flattened into the position feature vector f_sp, where the subscript sp denotes the relative spatial position of the person and the object; this vector is fed into a fully connected classifier with a sigmoid activation to obtain the classification score S^a_sp of the position-feature branch for each interaction class a, where a is the corresponding interaction category in {1, ..., A} and A is the number of all interaction categories;
3) constructing the intersection branch:
first, the bounding-box coordinates b_inter of the human-object intersection are obtained from the positions of the person and the object, where the subscript inter denotes the intersection of the person and the object; ROI Pooling is used to crop the convolution features of the intersection region from the global convolution feature map F, the features are refined with a residual block Res, and the human-object intersection region feature f_inter is obtained after a global average pooling layer GAP; meanwhile, the human key-point detection result of the picture is encoded: within the minimum enclosing rectangle of each human-object pair, the different joint points are connected with line segments of different gray values according to the skeleton model provided by the COCO dataset, so as to characterize the different parts of the body, the COCO dataset being a large public dataset produced by Microsoft and suitable for a variety of computer vision tasks; the pixel values of the remaining area in the rectangle are set to 0, and the rectangle is resized to a fixed 64 x 64 size to obtain the pose feature map; the pose feature f_pose is then extracted with two convolution-pooling layers, where the subscript pose denotes the human pose; the convolution kernels of both convolutional layers are 5 x 5, their numbers are 32 and 16 respectively, and both pooling layers are max pooling; the intersection region feature f_inter and the pose feature f_pose are concatenated and fused through two fully connected layers to obtain f_inter-pose, which is fed into a fully connected classifier with a sigmoid function to obtain the A-dimensional classification score S^a_inter of the intersection branch;
4) constructing the short-term memory selection branch:
first, according to the human position coordinates b_h, ROI Pooling is performed on the global convolution feature map F to extract the human region features, which are refined with a residual block Res and passed through global average pooling GAP to obtain the pooled human feature vector f_h;
according to the object position coordinates b_o, ROI Pooling is performed on the global convolution feature map F to extract the object region features, which are refined with a residual block Res and passed through global average pooling GAP to obtain the pooled object visual feature vector f_o^vis, where the superscript vis denotes the visual feature; publicly available Word2vec vectors pre-trained on the Google News dataset are selected as the object semantic features, so that a 300-dimensional semantic feature vector f_o^sem can be extracted for each object class label, where the superscript sem denotes the semantic feature; the object semantic feature vector f_o^sem and the visual feature f_o^vis are concatenated and passed through a fully connected layer, finally yielding the 1024-dimensional object feature vector f_o;
for the visual feature of the common region, the minimum enclosing rectangle, i.e. the union region b_union of the two bounding boxes, is first computed from the bounding boxes of the person and the object, where the subscript union denotes the union of the person and the object; ROI Pooling with the common-region bounding-box coordinates normalizes the region on the convolution feature map to a fixed 7 x 7 size, and a 2048-dimensional visual feature vector f_union^vis is then extracted through a residual block and global average pooling; this vector is concatenated with the position feature vector f_sp output by the pairwise branch and fed into a fully connected layer to obtain the 1024-dimensional fused common-region feature f_union;
finally, the human feature f_h, the object feature f_o, and the human-object common-region feature f_union are fed into the short-term memory selection module, which consists of two Gated Recurrent Unit (GRU) cells; the common-region feature f_union serves as the initial state of the short-term memory module, the input of the first GRU cell is the human representation f_h, and the input of the second cell is the object representation f_o; the output state of the short-term memory selection module gives the representation f_hoi, which is passed through a fully connected classifier with a sigmoid function to obtain the classification score S^a_STMS of the short-term memory selection branch;
5) training the human-object interaction recognition network:
the three branches together form the complete human-object interaction recognition network; samples in the training set are used as input to the network, the sum of the cross-entropy loss functions of the three branches is computed, and the network parameters are updated by gradient descent until the maximum number of optimization iterations is reached, at which point training stops and the trained human-object interaction recognition network is obtained.
4. The human-object interaction detection method based on multi-feature fusion as claimed in claim 1, wherein in step 4 the detection of human-object interaction behavior in the picture to be detected proceeds as follows:
first, target detection is applied to the picture to be detected to obtain the position and category information of the people and objects, and all of this information is fed into the trained human-object interaction recognition network for judgment; a classify-then-fuse feature fusion scheme is adopted: each branch extracts its own features and performs detection and classification, and the classification scores of all branches are then fused to obtain the final human-object interaction detection result; for each human-object pair (b_h, b_o), the final human-object interaction detection score S^a_{h,o} is obtained by combining s_h and s_o, the confidences of the human and the object in the target detection result, with the branch scores s^a_k, where s^a_k is the probability score vector of belonging to each category in the A-class interaction classification task and k indexes the different branches.
CN202110608515.XA 2021-06-01 2021-06-01 Method for detecting human-object interaction in an image based on multi-feature fusion Pending CN113378676A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110608515.XA CN113378676A (en) 2021-06-01 2021-06-01 Method for detecting figure interaction in image based on multi-feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110608515.XA CN113378676A (en) 2021-06-01 2021-06-01 Method for detecting figure interaction in image based on multi-feature fusion

Publications (1)

Publication Number Publication Date
CN113378676A true CN113378676A (en) 2021-09-10

Family

ID=77575206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110608515.XA Pending CN113378676A (en) 2021-06-01 2021-06-01 Method for detecting figure interaction in image based on multi-feature fusion

Country Status (1)

Country Link
CN (1) CN113378676A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523421A (en) * 2020-04-14 2020-08-11 上海交通大学 Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
CN112149616A (en) * 2020-10-13 2020-12-29 西安电子科技大学 Figure interaction behavior recognition method based on dynamic information
CN112784736A (en) * 2021-01-21 2021-05-11 西安理工大学 Multi-mode feature fusion character interaction behavior recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG C, ET AL.: "An improved human-object interaction detection method based on short-term memory selection network", 《2020 INTERNATIONAL CONFERENCE ON IMAGE, VIDEO PROCESSING AND ARTIFICIAL INTELLIGENCE》, pages 1 - 7 *
WANG C, ET AL.: "Multi-stream network for human-object interaction detection", 《INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL》, pages 1 - 16 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114004985A (en) * 2021-10-29 2022-02-01 北京百度网讯科技有限公司 Human interaction detection method, neural network and training method, device and medium thereof
CN114004985B (en) * 2021-10-29 2023-10-31 北京百度网讯科技有限公司 Character interaction detection method, neural network, training method, training equipment and training medium thereof
CN114170623A (en) * 2021-11-15 2022-03-11 华侨大学 Human interaction detection equipment and method and device thereof, and readable storage medium
CN114170688A (en) * 2022-02-11 2022-03-11 北京世纪好未来教育科技有限公司 Character interaction relation identification method and device and electronic equipment
CN114170688B (en) * 2022-02-11 2022-04-19 北京世纪好未来教育科技有限公司 Character interaction relation identification method and device and electronic equipment
CN114550223A (en) * 2022-04-25 2022-05-27 中国科学院自动化研究所 Person interaction detection method and device and electronic equipment
CN115147817A (en) * 2022-06-17 2022-10-04 淮阴工学院 Posture-guided driver distraction behavior recognition method of instance-aware network
CN115147817B (en) * 2022-06-17 2023-06-20 淮阴工学院 Driver distraction behavior recognition method of instance perception network guided by gestures
CN115170817A (en) * 2022-07-21 2022-10-11 广州大学 Figure interaction detection method based on three-dimensional figure-figure grid topology enhancement
CN115170817B (en) * 2022-07-21 2023-04-28 广州大学 Character interaction detection method based on three-dimensional human-object grid topology enhancement
CN115063640A (en) * 2022-08-15 2022-09-16 阿里巴巴(中国)有限公司 Interaction detection method, and pre-training method and device of interaction detection model
CN116662587A (en) * 2023-07-31 2023-08-29 华侨大学 Character interaction detection method, device and equipment based on query generator
CN116662587B (en) * 2023-07-31 2023-10-03 华侨大学 Character interaction detection method, device and equipment based on query generator
CN117953589A (en) * 2024-03-27 2024-04-30 武汉工程大学 Interactive action detection method, system, equipment and medium

Similar Documents

Publication Publication Date Title
CN113378676A (en) Method for detecting figure interaction in image based on multi-feature fusion
Boulahia et al. Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition
CN110728209B (en) Gesture recognition method and device, electronic equipment and storage medium
Baccouche et al. Sequential deep learning for human action recognition
Akmeliawati et al. Real-time Malaysian sign language translation using colour segmentation and neural network
CN110633632A (en) Weak supervision combined target detection and semantic segmentation method based on loop guidance
Chen et al. Research on recognition of fly species based on improved RetinaNet and CBAM
Wang et al. Towards realistic predictors
CN113673510B (en) Target detection method combining feature point and anchor frame joint prediction and regression
CN107133569A (en) The many granularity mask methods of monitor video based on extensive Multi-label learning
CN112036276B (en) Artificial intelligent video question-answering method
CN106909938B (en) Visual angle independence behavior identification method based on deep learning network
CN103605972A (en) Non-restricted environment face verification method based on block depth neural network
CN111160350A (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
Potdar et al. A convolutional neural network based live object recognition system as blind aid
CN106548194B (en) The construction method and localization method of two dimensional image human joint points location model
CN109033321B (en) Image and natural language feature extraction and keyword-based language indication image segmentation method
Patel et al. Hand gesture recognition system using convolutional neural networks
Defriani et al. Recognition of Regional Traditional House in Indonesia Using Convolutional Neural Network (CNN) Method
CN112749738B (en) Zero sample object detection method for performing superclass reasoning by fusing context
Adewopo et al. Baby physical safety monitoring in smart home using action recognition system
CN115798055B (en) Violent behavior detection method based on cornersort tracking algorithm
KR20010050988A (en) Scale and Rotation Invariant Intelligent Face Detection
CN117011274A (en) Automatic glass bottle detection system and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination