CN113887468B - Single-view human-object interaction identification method of three-stage network framework - Google Patents

Single-view human-object interaction identification method of three-stage network framework

Info

Publication number
CN113887468B
CN113887468B (application CN202111200063.8A, published as CN113887468A)
Authority
CN
China
Prior art keywords
image
character
stage network
stage
position information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111200063.8A
Other languages
Chinese (zh)
Other versions
CN113887468A (en)
Inventor
田锋
王耀智
张吉仲
南方
洪振鑫
吴砚泽
郑庆华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202111200063.8A priority Critical patent/CN113887468B/en
Publication of CN113887468A publication Critical patent/CN113887468A/en
Application granted granted Critical
Publication of CN113887468B publication Critical patent/CN113887468B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-view human-object interaction identification method of a three-stage network framework, belonging to the field of Human-Object Interaction (HOI). The three-stage framework addresses the problems of person-object combination explosion, low pixel counts and occlusion that existing HOI recognition algorithms face in MPMO (multi-person, multi-object) scenes, and balances recognition efficiency and accuracy. Compared with classical HOI methods such as iCAN and QPIC, the invention improves mAP by 0.21 on the constructed real multi-person multi-object classroom scene dataset and greatly improves execution efficiency; compared with the classical Faster R-CNN model, the average accuracy of object target detection is improved by 0.301. The invention accurately captures the context information and interaction-object information around a person without introducing complex background information, and effectively alleviates the person-object combination explosion problem.

Description

Single-view human-object interaction identification method of three-stage network framework
Technical Field
The invention belongs to the field of Human-Object Interaction (HOI), and particularly relates to a single-view human-object interaction identification method of a three-stage network framework.
Background
Extracting the behaviors of persons in a picture and the interaction relations between persons and objects by means of artificial intelligence is a hot topic in computer vision. Compared with traditional behavior-recognition methods, the HOI recognition task detects and recognizes the interaction mode between each person and the surrounding objects, and mainly targets scenes with more complex actions (more action types, more complex background information, and so on). All persons/objects in the scene are first recognized by an object detection method; the interaction type between them is then obtained using the appearance information and relative position information of the related persons/objects in the scene, and a triplet of the form <person, verb, object> is output. The forward reasoning from object detection to the final triplet is called HOI.
Most HOI recognition methods at the present stage are studied under simple-scene conditions, i.e., an image contains only 1-2 persons/objects, and the influence of multi-person multi-object scenes on HOI recognition is ignored. In a multi-person multi-object scene, people are usually dense and the types and number of objects are large, so there is a great deal of occlusion among people and between people and objects in the picture, the image resolution of persons and objects is low, and the person-object combination relations explode. When existing methods are applied to a real multi-person multi-object scene, the following problems arise. First: in the object detection stage, if the pre-trained object detection model cannot accurately detect all person/object targets in the image due to occlusion, or even complete hiding, or image blur, person-object combinations with a real HOI relationship may be missed, so the performance of HOI recognition degrades. Second: in the interactive action recognition stage, since the model takes person-object combinations as input, an increase in the number of persons and objects in one image causes the number of person-object combinations to grow exponentially; with the same model batch size, the forward-reasoning time of the model grows as the number of persons/objects in an image increases. Third: traditional HOI algorithms (taking iCAN as an example) extract the appearance features of persons and objects with a convolution-before-segmentation approach, but in a multi-person multi-object scene dataset a single person or object occupies few pixels and a small proportion of the whole image, so the convolutional layers capture too much redundant surrounding background information, introducing noise and reducing the recognition accuracy of the algorithm.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a single-view human-object interaction identification method of a three-stage network framework.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
a recognition method of single-view man-object interaction of a three-stage network frame, wherein the three-stage network frame comprises a first-stage network frame, a second-stage network frame and a third-stage network frame; the first-stage network framework comprises a ResNet model, a self-attention mechanism module, a joint module and a full-connection layer; the second-stage network framework comprises a pooling layer, a joint module and a full-connection layer; the third-stage network framework comprises a ResNet model, a joint module and a full connection layer; the identification method comprises the steps of training a network and using the trained network to carry out identification, wherein the training network comprises the following steps:
1) Identifying character position information, object type information and position information thereof in the picture through a Faster R-CNN model, and outputting a character image according to the character position information by the Faster R-CNN model;
the character image sequentially passes through the ResNet model and the self-attention mechanism module of the first-stage network framework, which output the appearance feature f_s of the character image;
the character position information is down-sampled multiple times to obtain single-channel binary-matrix character position information f_l; the joint module of the first-stage network framework combines the f_s and the f_l, the obtained joint features are input into the full-connection layer of the first-stage network framework, and the full-connection layer performs a preliminary classification of the character behaviors to obtain a multi-label action classification prediction result;
2) Expanding the character position information to obtain position information of a local expansion area and a corresponding character local expansion area image, and inputting the position information into a pooling layer of a second-stage network frame to obtain corresponding characteristics;
the combination module of the second-stage network framework combines the characteristics with the combined characteristics, and then inputs the characteristics and the combined characteristics into a full-connection layer of the second-stage network framework, and the full-connection layer outputs the position thermodynamic diagram of the most focused object of each person to obtain the position thermodynamic diagram of the interaction object;
3) Inputting object position information and a person local expansion area image into a ResNet network of a third-stage network frame, outputting image characteristics of an expansion area by the ResNet network, combining the image characteristics of the expansion area with the interactive object position thermodynamic diagram by a joint module of the third-stage network frame, and outputting corrected person behavior categories through a full-connection layer of the third-stage network frame to obtain a multi-label action classification prediction correction result;
in the training process, the Margin-Loss between the multi-label action classification prediction correction result and the multi-label action classification prediction result output by the first-stage network framework is calculated, the margin threshold is dynamically adjusted using Focal-Loss, and the Faster R-CNN model and the three-stage network framework are optimized by back-propagation;
4) Repeating the steps 1) -3) until the Margin-Loss is stable, and finishing training to obtain a trained fast R-CNN model and a three-stage network frame;
identifying using the trained network includes the following operations:
inputting a single classroom person-view picture into the trained Faster R-CNN model, inputting the character information and object information output by the Faster R-CNN model into the trained three-stage network framework, and outputting the behavior and interaction-object result of each character by the three-stage network framework.
Further, the specific process of identifying by using the trained network is as follows:
The single classroom person-view picture is input into the Faster R-CNN target detection model, which detects the persons in the picture and outputs a human body coordinate frame b_o for each person; a character part image and the corresponding LER image are cropped according to the character coordinate frame b_o;
the original picture is input into the ResNet and self-attention mechanism module of the first-stage network framework to calculate the image appearance feature f_s; the character position information and character image information obtained by the Faster R-CNN model are input into successive pooling layers to obtain the character position feature f_l, and the two are fused into the joint feature f_h = [f_s, f_l];
the character position information is input into the pooling layer of the second-stage network framework, the position features of the character-specific information in the LER image are extracted, and the extracted position features are fused with f_h; the interaction-object position thermodynamic diagram is generated through the full-connection layer and binarization processing is performed;
the LER region image is input into the ResNet model of the third-stage network framework, the LER region image features are extracted, the thermodynamic diagram is multiplied with the LER region features, the final fusion feature f_all is calculated, and the final behavior category is output through the full-connection layer and the activation function.
Furthermore, the convolution-layer parameters used in extracting features from the character original image information and the LER image information are shared, which reduces the number of parameters in model training and improves the model training speed.
Further, the specific operation of obtaining the joint characteristics of the characters in the first-stage network frame in the step 1) is as follows:
extracting the appearance feature f_s of the human body part image of the character using the ResNet model as the backbone network, and capturing regions of interest favorable for classification using the self-attention mechanism module in iCAN;
representing the position information f_l of the character in the classroom using a binary matrix, in which the value at the character's position is 1 and the values at the remaining positions are 0, the binary matrix being down-sampled by successive pooling layers;
fusing the f_s and the f_l to obtain the joint features of the character.
Further, the specific operation of generating the thermodynamic diagram of the position of the interactive object in the second stage network frame in the step 2) is as follows:
performing interaction-object localization within the LER image expanded from the local region of the character image;
extracting character position features from the character position information in the LER image through the pooling layer of the second-stage network framework and fusing them with f_h, adjusting the thermodynamic diagram to 14 × 14 through a 196-dimensional full-connection layer in the second-stage network, and performing binarization processing on the thermodynamic diagram through an activation function to obtain the interaction-object position thermodynamic diagram.
Further, the LER image is formed by expanding the length and the width of the original local human-body image outwards by α times, where α is a predefined expansion-factor hyper-parameter; the new coordinates of the LER image, [X_min', Y_min', X_max', Y_max'], are calculated from the original box coordinates, the expansion factor α, and the image width W and height H, where X_min, Y_min, X_max, Y_max denote the upper-left and lower-right corner coordinates of the position box of the character part image in the original picture.
Further, when calculating the LER region position coordinates, the calculated result is compared with the image edges to avoid exceeding the image boundary.
Further, in step 3), the specific operations for classifying the person actions are:
inputting the image of the local expansion area of the character and the position information of the object into a ResNet model in a third-stage network, modeling the relative position space relation of the image and the object, combining the output relative position space relation with the interaction object position thermodynamic diagram, jointly carrying out HOI identification, and outputting a final behavior result and object information.
Further, in step 3), Margin-Loss is used to ensure classification accuracy and thereby correct the recognition result of step 1), while Focal-Loss is used to dynamically adjust the margin threshold according to the training difficulty of each sample; the final loss is computed from the behavior-recognition score of step 1), the current person-object interaction behavior recognition score, the hyper-parameter margins m_pos and m_neg for positive and negative samples, and a term representing the learning difficulty of each sample.
Further, the final fusion feature f_all is expressed as:
f_all = [f_h, f_o, f_sp]
where f_h denotes the character appearance feature, f_o denotes the interaction-object feature, and f_sp denotes the relative spatial relationship between the character and the interaction-object position.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a single-view human-object interaction identification method of a three-stage network framework, which solves a series of problems of human-object combination explosion, low pixels, shielding and the like in an MPMO scene of the existing HOI identification algorithm and combines the capabilities of identification efficiency and accuracy. Compared with the classical HOI method such as iCAN and QPIC, the invention improves mAP by 0.21 in the constructed real multi-person multi-object classroom scene data set, and greatly improves the execution efficiency; compared with the classical Faster R-CNN model, the average accuracy of target detection of an object is improved by 0.301. The LER concept provided by the invention accurately captures the context information and the interactive object information around the task on the premise of not introducing complex background information, and effectively solves the problem of human-object combined explosion; compared with classical HOI algorithm iCAN and QPIC, the model has the advantages of greatly improving the overall accuracy and the execution efficiency.
Drawings
Fig. 1 is an overall flow chart of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention is described in further detail below with reference to the attached drawing figures:
referring to fig. 1, fig. 1 is a flowchart showing a three-Stage network frame according to the present invention, wherein the upper left input is the result of recognition by the fast R-CNN model, including character position information, object category information, object position information, and character image in the picture, stage 1 represents a first-Stage network model, and the Stage inputs the character image to the res net model of the Stage and the self-attention mechanism module acquires the appearance feature f of the character image s Simultaneously, the character position information is downsampled for a plurality of times to obtain single-channel binary matrix character position information f l Combining the two features to form a combined feature, inputting the combined feature into a full-connection layer network at the current stage, and outputting a primary classification result of the character behaviors; stage 2 represents a second Stage network model, wherein the Stage expands character position information to obtain position information of a local expansion area and a corresponding character local expansion area image, the position information is input into a pooling layer of the Stage to obtain corresponding characteristics, the characteristics are combined with the combined characteristics of the first Stage network, the characteristics are input into a fully-connected layer of the layer, and finally an interactive object position thermodynamic diagram is output; stage 3 represents a third Stage network model, the Stage inputs the partial expansion area image of the person and the object position information of the second Stage into the ResNet network of the Stage, extracts image features, combines the image features with the interaction object position thermodynamic diagram of the second Stage, and finally inputs the image features and the interaction object position thermodynamic diagram of the second Stage into the full network frame of the third StageAnd the connection layer outputs the corrected person action classification result. The invention mainly comprises the following steps:
model training part:
step1: images including people and object objects are collected and annotated.
The annotation format follows that of the HICO-DET dataset, with modifications. Each sample in the dataset contains the storage address of the original picture, the character action labels, the character body position coordinates (not normalized), the character face position coordinates (not normalized), the interaction-object position coordinates (not normalized) and the position of the character in the classroom:
for each dataset sample, the label form of the character action is as follows:
[Action1,Action2,Action3,...]
i.e., a single character may perform multiple actions simultaneously.
The information of the character body position, the face body position and the interactive object position is determined by marking the coordinates of two points of the upper left corner and the lower right corner of the rectangular area, and the form is as follows:
[X min ,Y min ,X max ,Y max ]
where the upper-left and lower-right points correspond to [X_min, Y_min] and [X_max, Y_max] respectively.
The positions of the characters in the classroom are used for representing seat information in the classroom environment, and the basic form is as follows:
[Position1,Position2]
where Position1 indicates the row of the seats in the classroom, and Position2 indicates the column of the seats in the classroom.
When training the HOI detection model, all sample information in the dataset is used, and the training set and the test set are divided at a ratio of 6:1.
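For illustration, a minimal Python sketch of one annotation record in this format is given below; the field names and concrete values are assumptions for readability, since the patent specifies the content of each field but not exact key names:

# Hypothetical annotation record following the format described above
# (keys and values are illustrative, not taken from the MPMOCS dataset).
sample = {
    "image_path": "images/classroom_0001.jpg",
    "actions": ["Action1", "Action3"],        # a single character may perform several actions
    "body_box": [412, 230, 598, 655],         # [X_min, Y_min, X_max, Y_max], un-normalized pixels
    "face_box": [470, 240, 540, 320],
    "object_box": [430, 520, 610, 640],       # interaction-object position box
    "seat_position": [3, 5],                  # [Position1, Position2] = [row, column] in the classroom
}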
Step2: and performing a behavior recognition task based on the traditional convolutional neural network.
The model at this stage consists of a traditional behavior-recognition framework. It extracts the appearance feature f_s of character images and the single-channel binary-matrix character position information f_l to form joint features, and at the same time performs a preliminary classification of the character behaviors. The information is obtained as follows:
The appearance information contained in a character image is quite rich, and the input of the character behavior-recognition module depends only on human-body-frame detection. ResNet50 is used as the backbone network to extract the appearance feature f_s of the character's human body part image, and the Self-Attention module in iCAN is then used to capture regions of interest that favor classification.
Different positions of a character in the classroom lead to different camera shooting angles, which can deform the character image and affect the behavior-recognition result. The invention therefore uses a binary matrix to represent the position information f_l of a character in the classroom: the value at the character's position is 1 and the values at the remaining positions are 0, and the binary matrix is down-sampled by successive MaxPooling layers.
The behavior-recognition feature f_h = [f_s, f_l] combines the two types of features; a confidence score for each action class a is then computed with a full-connection structure and a Sigmoid activation function.
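A minimal PyTorch-style sketch of this first stage follows; the ResNet-50 backbone, the MaxPooling of a binary seat matrix and the Sigmoid multi-label output mirror the description above, while the plain linear projection standing in for the iCAN self-attention module, the 32 × 32 seat grid and all layer sizes are assumptions:

import torch
import torch.nn as nn
import torchvision

class StageOne(nn.Module):
    """Preliminary multi-label action classification (hedged sketch, not the exact patent layers)."""
    def __init__(self, num_actions, pos_grid=32):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # global-pooled 2048-d appearance feature
        # The patent uses the iCAN self-attention module here; a linear projection stands in for it.
        self.attention = nn.Linear(2048, 1024)
        self.pos_pool = nn.MaxPool2d(kernel_size=4)        # down-sampling of the binary seat matrix
        pos_dim = (pos_grid // 4) ** 2
        self.classifier = nn.Linear(1024 + pos_dim, num_actions)

    def forward(self, person_img, pos_matrix):
        f_s = self.attention(self.backbone(person_img).flatten(1))   # appearance feature f_s
        f_l = self.pos_pool(pos_matrix).flatten(1)                   # binary position feature f_l
        f_h = torch.cat([f_s, f_l], dim=1)                           # joint feature f_h = [f_s, f_l]
        return torch.sigmoid(self.classifier(f_h)), f_h              # per-action confidence scores and f_h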
Step3: an interactive object position thermodynamic diagram is generated.
Through supervised training, the network outputs the possible position of the object that each person pays most attention to. Using the character feature f_h obtained in step 2 and the character position information within the local expansion region, the interaction-object position thermodynamic diagram is estimated as follows:
(1) Interaction-object localization is performed within the LER image expanded from the local region of the character image. The LER image is formed by expanding the length and the width of the original local human-body image outwards by α times, where α is a predefined expansion-factor hyper-parameter; the new coordinates of the LER image, [X_min', Y_min', X_max', Y_max'], are calculated from the original box coordinates, the expansion factor α, and the image width W and height H.
Considering that the character subject may lie at the edge of the classroom image, the calculated LER region coordinates are compared with the image edges so that the region does not exceed the image boundary.
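A small sketch of this expand-and-clamp step follows; the exact expansion formula appears only as an equation image in the original publication, so the symmetric form below (each side grown by α times the box size and then clamped to the image) is an assumption:

def expand_ler_box(box, alpha, img_w, img_h):
    """Expand a person box [X_min, Y_min, X_max, Y_max] into an LER box (assumed symmetric form)."""
    x_min, y_min, x_max, y_max = box
    w, h = x_max - x_min, y_max - y_min
    x_min_e = max(0, x_min - alpha * w)
    y_min_e = max(0, y_min - alpha * h)
    x_max_e = min(img_w, x_max + alpha * w)
    y_max_e = min(img_h, y_max + alpha * h)
    return [x_min_e, y_min_e, x_max_e, y_max_e]

# Example with alpha = 0.5 on a 1920 x 1080 frame
ler_box = expand_ler_box([412, 230, 598, 655], alpha=0.5, img_w=1920, img_h=1080)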
(2) Generating object position thermodynamic diagrams
According to the character position information in the LER image, character position features are extracted and fused with f_h from step 2; the fused feature is passed through a 196-dimensional full-connection layer and reshaped to a 14 × 14 thermodynamic diagram, which is binarized with a Sigmoid function to generate the object position thermodynamic diagram.
A thermodynamic diagram, rather than box coordinates, is chosen as the object position representation mainly for two reasons: first, the thermodynamic-diagram output is used in step 4, where it can be multiplied directly with the image feature map as an attention map; second, a coordinate box is sharp whereas a thermodynamic diagram is smooth, which is friendlier to gradient propagation.
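A matching sketch of this second stage is given below; the 196-dimensional full-connection layer and the 14 × 14 Sigmoid heatmap follow the description above, whereas the pooling of the person-position mask and the input dimensions are assumptions:

import torch
import torch.nn as nn

class StageTwo(nn.Module):
    """Interaction-object localization head (hedged sketch)."""
    def __init__(self, joint_dim=1088, pos_dim=64):
        super().__init__()
        self.pos_pool = nn.MaxPool2d(kernel_size=4)      # pool the person-position mask within the LER image
        self.fc = nn.Linear(joint_dim + pos_dim, 196)    # 196 = 14 x 14, as described above

    def forward(self, f_h, ler_pos_mask):
        pos_feat = self.pos_pool(ler_pos_mask).flatten(1)
        fused = torch.cat([f_h, pos_feat], dim=1)
        # the Sigmoid output serves as the (soft) binarized 14 x 14 thermodynamic diagram
        return torch.sigmoid(self.fc(fused)).view(-1, 1, 14, 14)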
Step4: and referring to Margin Loss in fine-grained classification, introducing position and appearance characteristic information of an object to assist in executing final interactive behavior recognition, and realizing self-correction of a first-stage recognition result.
The process uses the output thermodynamic diagram in step3 as an input for determining the position of the interactive object to assist in the classification task of the HOI behavior, and simultaneously improves the accuracy of HOI classification by utilizing the original appearance information of the character and the interactive object.
This step uses the character appearance features obtained in step 2 together with the object appearance features, models their relative spatial position relation, and performs HOI recognition jointly. The final fusion feature is expressed as:
f_all = [f_h, f_o, f_sp]
where f_h denotes the character appearance feature extracted in step 2, f_o denotes the interaction-object feature extracted at this stage, and f_sp denotes the relative spatial relationship between the character and the interaction-object position.
For the appearance features of the interaction object, the interaction-object localization output of step 3 is used as a thermodynamic diagram and multiplied with the convolved LER image feature map, with the calculation formula:
f_o = MaxPooling(f_LER ⊙ loc_o)
where f_LER denotes the convolution feature map of the LER region and loc_o denotes the interaction-object localization thermodynamic diagram output in step 3. Only the pixels corresponding to the interaction-object position in the thermodynamic diagram have values greater than 0, so MaxPooling extracts the object appearance information while removing the other redundant features.
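A sketch of the third stage, combining the masked pooling above with the fusion feature f_all = [f_h, f_o, f_sp], is shown below; the 8-value encoding of the relative person/object geometry and all layer sizes are assumptions, and the LER backbone is written as its own module for brevity although the patent shares its convolution parameters with the person branch:

import torch
import torch.nn as nn
import torchvision

class StageThree(nn.Module):
    """Corrected HOI classification from the LER image and the object heatmap (hedged sketch)."""
    def __init__(self, num_actions, joint_dim=1088, sp_dim=64):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.ler_conv = nn.Sequential(*list(backbone.children())[:-2])   # (B, 2048, 14, 14) for 448 x 448 input
        self.sp_fc = nn.Linear(8, sp_dim)             # person box + object box coordinates (assumed encoding)
        self.classifier = nn.Linear(joint_dim + 2048 + sp_dim, num_actions)

    def forward(self, ler_img, loc_o, f_h, person_box, object_box):
        f_ler = self.ler_conv(ler_img)                                   # convolution feature map of the LER region
        f_o = torch.amax(f_ler * loc_o, dim=(2, 3))                      # f_o = MaxPooling(f_LER ⊙ loc_o)
        f_sp = self.sp_fc(torch.cat([person_box, object_box], dim=1))    # relative spatial feature f_sp
        f_all = torch.cat([f_h, f_o, f_sp], dim=1)                       # f_all = [f_h, f_o, f_sp]
        return torch.sigmoid(self.classifier(f_all))                     # corrected multi-label action scores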
Considering that the character original image and the LER image differ little and contain highly similar information, the convolution-layer parameters used to extract features from the two are shared, which reduces the number of parameters in model training and improves the training speed.
Margin-Loss, as used in fine-grained classification problems, is adopted to ensure classification accuracy and further correct the recognition result of step 2, while Focal-Loss is used to dynamically adjust the margin threshold according to the training difficulty of each sample. The final loss is computed from the behavior-recognition score of step 2, the current person-object interaction behavior recognition score, the hyper-parameter margins m_pos and m_neg for positive and negative samples, and a term representing the learning difficulty of each sample.
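The exact loss is published only as an equation image; the sketch below merely encodes the idea described above (a margin between the earlier behavior-recognition score and the final score, with m_pos and m_neg for positive and negative samples, plus a focal-style difficulty weight) and is not the patented formula:

import torch

def margin_focal_loss(score_first, score_final, labels, m_pos=0.2, m_neg=0.2, gamma=2.0):
    """Illustrative margin loss with a focal-style difficulty weight (assumed form)."""
    # positive labels: push the final score above the first-stage score by m_pos
    pos_term = labels * torch.clamp(score_first - score_final + m_pos, min=0)
    # negative labels: push the final score below the first-stage score by m_neg
    neg_term = (1 - labels) * torch.clamp(score_final - score_first + m_neg, min=0)
    difficulty = torch.abs(labels - score_final) ** gamma   # harder samples receive larger weights
    return (difficulty * (pos_term + neg_term)).mean()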
Model use part:
step1: for inputting a single picture, a target detection model (here, the fast R-CNN model) is used to detect the individuals in the picture, and the human body coordinate frame b of each person is returned o
Step2: according to the coordinate frame b of the person o Intercepting a character part image and a corresponding LER image;
step3: inputting original character image information and intercepting character image information, and calculating image appearance characteristics f s And character position information f l From this, a joint feature f is derived h =[f s ,f l ];
Step4: calculating the position information of specific information of the person in the LER image, extracting the position characteristics through Pooling, and combining the position characteristics with f h Fusing, calculating to generate 14 multiplied by 14 thermodynamic diagrams, and performing binarization processing;
step5: feature extraction is carried out on the LER region image, thermodynamic diagram is multiplied by LER region features, and final fusion features f are calculated all And calculating the final behavior category through the full connection layer and the activation function.
Examples
On the constructed real multi-person multi-object classroom scene dataset MPMOCS (described in Tables 1 and 2), mAP is improved by 0.21 compared with classical HOI methods such as iCAN and QPIC, and the execution efficiency is greatly improved; compared with the classical Faster R-CNN model, the average accuracy of object target detection is improved by 0.301. The experimental results are shown in Tables 3 and 4.
Table 1 comparison of MPMOCS with HOI public dataset
Table 2 MPMOCS dataset partitioning
TABLE 3 HOI identification accuracy contrast
TABLE 4 object positioning accuracy contrast
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. The method for identifying single-view human-object interaction of the three-stage network framework, characterized in that the three-stage network framework comprises a first-stage network framework, a second-stage network framework and a third-stage network framework; the first-stage network framework comprises a ResNet model, a self-attention mechanism module, a joint module and a full-connection layer; the second-stage network framework comprises a pooling layer, a joint module and a full-connection layer; the third-stage network framework comprises a ResNet model, a joint module and a full-connection layer; the identification method comprises training a network and using the trained network for identification, wherein training the network comprises the following steps:
1) Identifying character position information, object type information and position information thereof in the picture through a Faster R-CNN model, and outputting a character image according to the character position information by the Faster R-CNN model;
the character image sequentially passes through the ResNet model and the self-attention mechanism module of the first-stage network framework, which output the appearance feature f_s of the character image;
the character position information is down-sampled multiple times to obtain single-channel binary-matrix character position information f_l; the joint module of the first-stage network framework combines the f_s and the f_l, the obtained joint features are input into the full-connection layer of the first-stage network framework, and the full-connection layer performs a preliminary classification of the character behaviors to obtain a multi-label action classification prediction result;
2) Expanding the character position information to obtain position information of a local expansion area and a corresponding character local expansion area image, and inputting the position information into a pooling layer of a second-stage network frame to obtain corresponding characteristics;
the combination module of the second-stage network framework combines the characteristics with the combined characteristics, and then inputs the characteristics and the combined characteristics into a full-connection layer of the second-stage network framework, and the full-connection layer outputs the position thermodynamic diagram of the most focused object of each person to obtain the position thermodynamic diagram of the interaction object;
3) Inputting object position information and a person local expansion area image into a ResNet network of a third-stage network frame, outputting image characteristics of an expansion area by the ResNet network, combining the image characteristics of the expansion area with the interactive object position thermodynamic diagram by a joint module of the third-stage network frame, and outputting corrected person behavior categories through a full-connection layer of the third-stage network frame to obtain a multi-label action classification prediction correction result;
in the training process, the Margin-Loss between the multi-label action classification prediction correction result and the multi-label action classification prediction result output by the first-stage network framework is calculated, the margin threshold is dynamically adjusted using Focal-Loss, and the Faster R-CNN model and the three-stage network framework are optimized by back-propagation;
4) Repeating the steps 1) -3) until the Margin-Loss is stable, and finishing training to obtain a trained fast R-CNN model and a three-stage network frame;
identifying using the trained network includes the following operations:
inputting a single classroom person-view picture into the trained Faster R-CNN model, inputting the character information and object information output by the Faster R-CNN model into the trained three-stage network framework, and outputting the behavior and interaction-object result of each character by the three-stage network framework.
2. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 1, wherein the specific process of identifying by using the trained network is as follows:
inputting a single classroom person-view picture into the Faster R-CNN target detection model, wherein the Faster R-CNN target detection model detects the persons in the picture and outputs a human body coordinate frame b_o of each person, and cropping a character part image and a corresponding LER image according to the character coordinate frame b_o;
inputting the original picture into the ResNet and self-attention mechanism module of the first-stage network framework to calculate the image appearance feature f_s, inputting the character position information and character image information obtained by the Faster R-CNN model into successive pooling layers to obtain the character position feature f_l, and fusing them to obtain the joint feature f_h = [f_s, f_l];
inputting the character position information into the pooling layer of the second-stage network framework, extracting the position features of the character-specific information in the LER image, fusing the extracted position features with f_h, generating the interaction-object position thermodynamic diagram through the full-connection layer, and performing binarization processing;
inputting the LER region image into the ResNet model of the third-stage network framework, extracting the LER region image features, multiplying the thermodynamic diagram with the LER region features, calculating the final fusion feature f_all, and outputting the final behavior category through the full-connection layer and the activation function.
3. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 2, wherein the convolution-layer parameters used in extracting features from the character original image information and the LER image information are shared, thereby reducing the number of parameters in model training and improving the model training speed.
4. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 1, wherein the specific operation of obtaining the joint features of the character in the first-stage network framework in step 1) is as follows:
extracting the appearance feature f_s of the human body part image of the character using the ResNet model as the backbone network, and capturing regions of interest favorable for classification using the self-attention mechanism module in iCAN;
representing the position information f_l of the character in the classroom using a binary matrix, in which the value at the character's position is 1 and the values at the remaining positions are 0, the binary matrix being down-sampled by successive pooling layers;
fusing the f_s and the f_l to obtain the joint features of the character.
5. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 1, wherein the specific operation of generating the interaction-object position thermodynamic diagram in the second-stage network framework in step 2) is as follows:
performing interaction-object localization within the LER image expanded from the local region of the character image;
extracting character position features from the character position information in the LER image through the pooling layer of the second-stage network framework and fusing them with f_h, adjusting the thermodynamic diagram to 14 × 14 through a 196-dimensional full-connection layer in the second-stage network, and performing binarization processing on the thermodynamic diagram through an activation function to obtain the interaction-object position thermodynamic diagram.
6. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 5, wherein the LER image is formed by expanding the length and the width of the original local human-body image outwards by α times, α being a predefined expansion-factor hyper-parameter; the new coordinates of the LER image, [X_min', Y_min', X_max', Y_max'], are calculated from the original box coordinates, the expansion factor α, and the image width W and height H, where X_min, Y_min, X_max, Y_max denote the upper-left and lower-right corner coordinates of the position box of the character part image in the original picture.
7. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 5, wherein the calculated LER region position coordinates are compared with the image edges to avoid exceeding the image boundary.
8. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 1, wherein in step 3), the specific operation for classifying the character actions is as follows:
inputting the image of the local expansion area of the character and the position information of the object into a ResNet model in a third-stage network, modeling the relative position space relation of the image and the object, combining the output relative position space relation with the interaction object position thermodynamic diagram, jointly carrying out HOI identification, and outputting a final behavior result and object information.
9. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 1, wherein in step 3), Margin-Loss is used to ensure classification accuracy and realize correction of the recognition result of step 1), while Focal-Loss is used to dynamically adjust the margin threshold according to the training difficulty of each sample; the final loss is computed from the behavior-recognition score of step 1), the current person-object interaction behavior recognition score, the hyper-parameter margins m_pos and m_neg for positive and negative samples, and a term representing the learning difficulty of each sample.
10. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 1, characterized in that the final fusion feature f_all is expressed as:
f_all = [f_h, f_o, f_sp]
where f_h denotes the character appearance feature extracted in step 2, f_o denotes the interaction-object feature extracted at this stage, and f_sp denotes the relative spatial relationship between the character and the interaction-object position.
CN202111200063.8A 2021-10-14 2021-10-14 Single-view human-object interaction identification method of three-stage network framework Active CN113887468B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111200063.8A CN113887468B (en) 2021-10-14 2021-10-14 Single-view human-object interaction identification method of three-stage network framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111200063.8A CN113887468B (en) 2021-10-14 2021-10-14 Single-view human-object interaction identification method of three-stage network framework

Publications (2)

Publication Number Publication Date
CN113887468A CN113887468A (en) 2022-01-04
CN113887468B true CN113887468B (en) 2023-06-16

Family

ID=79002912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111200063.8A Active CN113887468B (en) 2021-10-14 2021-10-14 Single-view human-object interaction identification method of three-stage network framework

Country Status (1)

Country Link
CN (1) CN113887468B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973684B (en) * 2022-07-25 2022-10-14 深圳联和智慧科技有限公司 Fixed-point monitoring method and system for construction site

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914622A (en) * 2020-06-16 2020-11-10 北京工业大学 Character interaction detection method based on deep learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10019655B2 (en) * 2016-08-31 2018-07-10 Adobe Systems Incorporated Deep-learning network architecture for object detection
CN110163059B (en) * 2018-10-30 2022-08-23 腾讯科技(深圳)有限公司 Multi-person posture recognition method and device and electronic equipment
CN111814661B (en) * 2020-07-07 2024-02-09 西安电子科技大学 Human body behavior recognition method based on residual error-circulating neural network
CN111931703B (en) * 2020-09-14 2021-01-05 中国科学院自动化研究所 Object detection method based on human-object interaction weak supervision label
CN113449801B (en) * 2021-07-08 2023-05-02 西安交通大学 Image character behavior description generation method based on multi-level image context coding and decoding

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914622A (en) * 2020-06-16 2020-11-10 北京工业大学 Character interaction detection method based on deep learning

Also Published As

Publication number Publication date
CN113887468A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN109344693B (en) Deep learning-based face multi-region fusion expression recognition method
CN108334848B (en) Tiny face recognition method based on generation countermeasure network
Yang et al. Driver yawning detection based on subtle facial action recognition
Pitaloka et al. Enhancing CNN with preprocessing stage in automatic emotion recognition
CN108062525B (en) Deep learning hand detection method based on hand region prediction
JP5675229B2 (en) Image processing apparatus and image processing method
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
WO2017079521A1 (en) Cascaded neural network with scale dependent pooling for object detection
CN107808376B (en) Hand raising detection method based on deep learning
KR20160096460A (en) Recognition system based on deep learning including a plurality of classfier and control method thereof
US11853892B2 (en) Learning to segment via cut-and-paste
CN111428664B (en) Computer vision real-time multi-person gesture estimation method based on deep learning technology
CN109063626B (en) Dynamic face recognition method and device
CN113297956B (en) Gesture recognition method and system based on vision
CN113435319B (en) Classification method combining multi-target tracking and pedestrian angle recognition
CN112084952B (en) Video point location tracking method based on self-supervision training
CN113591719A (en) Method and device for detecting text with any shape in natural scene and training method
Diaz et al. Detecting dynamic objects with multi-view background subtraction
CN113887468B (en) Single-view human-object interaction identification method of three-stage network framework
CN116434311A (en) Facial expression recognition method and system based on mixed domain consistency constraint
CN110414430B (en) Pedestrian re-identification method and device based on multi-proportion fusion
CN111582057B (en) Face verification method based on local receptive field
CN112446292A (en) 2D image salient target detection method and system
CN116682178A (en) Multi-person gesture detection method in dense scene
CN114663910A (en) Multi-mode learning state analysis system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant