CN115909405A - Character interaction detection method based on YOLOv5 - Google Patents


Info

Publication number
CN115909405A
Authority
CN
China
Prior art keywords
interaction
interactive
detection
frame
people
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211512924.0A
Other languages
Chinese (zh)
Inventor
叶海波
张诗凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202211512924.0A
Publication of CN115909405A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a character interaction detection method, i.e. a human-object interaction (HOI) detection method, based on YOLOv5. The HOI detection task aims to detect the people and objects in an image that are in an interaction relationship, together with their interaction actions. Previous research has paid little attention to detection speed, yet faster HOI detection is desirable for scenes with strict real-time requirements. Inspired by the fast object detection algorithm YOLO, a fast object detector is applied to HOI detection and YOLOHOI is proposed. A method for generating interaction boxes is designed and a dual target detection structure is provided, in which the interaction region is treated as a special object; the model thus gains the ability to detect this special object, and the interaction relationship is detected while object detection is performed, achieving fast detection.

Description

Character interaction detection method based on YOLOv5
Technical Field
The invention relates to the technical field of computer vision and human-object interaction detection, and in particular to a character interaction detection method based on YOLOv5.
Background
The human-object interaction detection task aims to detect the people and objects in an image that have an interaction relationship, together with the corresponding interaction actions, and finally outputs <human, object, action> triplets. Many scenarios require real-time HOI detection; in autonomous driving, for example, identifying HOI relationships helps a model recognize and analyze the relationships between people and objects in the scene, and real-time performance is critical. We therefore study how to perform fast HOI detection.
Most previous HOI detection research adopts two-stage algorithms, which execute object detection and interaction classification serially: the people and objects obtained from object detection are combined pairwise and fed into a separate interaction classification network for action classification. Because of this two-stage structure, these methods are very time consuming. To overcome this structural deficiency, some work has investigated parallel HOI detection methods, called one-stage algorithms, to increase detection speed.
One-stage methods are more efficient because they perform object detection and interaction detection in parallel. They require matching rules that define interaction detection; for example, PPDM and IPNet introduce the concept of interaction points to match interactions, and UnionDet uses the union box of a person and an object for detection. Our aim is to detect HOI in scenes with strict real-time requirements. Although many one-stage methods have achieved significant improvements in detection accuracy, they do not pay much attention to detection speed. We therefore design YOLOHOI to perform HOI detection as fast as possible.
Disclosure of Invention
The red boxes in FIG. 3 are interaction regions between a person and an object. A human observer can recognize the interaction from the content of the red box alone; for example, from the red box in the right-hand image, one can easily judge the interaction "person drinking water" just from the cup, face and hands. If such red boxes could be obtained before the object detection result is available, a dual target detection model could carry out object detection and interaction classification simultaneously, achieving real-time detection.
The invention discloses a character interaction detection method based on YOLOv5, which comprises the following steps:
step 1: inputting an original picture and extracting picture characteristics;
step 2: the double target detection branches respectively detect an object example and an interaction frame;
and step 3: the target detection branch is responsible for detecting people and objects;
and 4, step 4: the interaction detection branch is responsible for detecting the interaction frame;
and 5: carrying out HOI relation pairing;
in the step 2, the dual target detection branch respectively detects the object instance and the interaction frame, and the method comprises the following steps:
the method is different from a general solution idea of the character interaction relation detection problem, and treats the core problem as a double target detection task, wherein the double detection comprises basic target detection of people and objects and special target detection of an interaction area. The interactive region between each person-object pair is concerned, the complex many-to-many relation detection is solved, the interactive region is regarded as a special target, the target detection result is obtained, meanwhile, the interactive detection result is also obtained, and the effect of rapidly detecting the person interactive relation is achieved. On the basis of the original YOLOv5 model structure, an interactive frame detection head is added for detecting an interactive frame. The basic target detection and the interactive area detection share a main feature extraction network. To summarize, YOLOv5 has two detection branches, one branch being responsible for detecting the base target and the other branch being responsible for detecting the interaction region.
In step 3, the object detection branch is responsible for detecting people and objects, as follows:
The branch outputs a confidence for whether an object exists, the four parameters required to determine the object bounding box, and a score for each object category.
During training, the interaction region between a person and an object, i.e. the interaction box, is calculated and fitted by a formula from the positions of the person and the object. The interaction box is treated as a special target and fed into the object detection network for training, giving the model the ability to predict interaction regions.
An interaction box generation formula was obtained through a series of tests. The formula places the center of the generated interaction box on the line connecting the centers of the person and the object, and covers as much as possible of the region that is useful for classifying the interaction.
The main considerations when designing this formula are: 1) the center of the interaction box lies on the line connecting the centers of the person and the object; 2) when the object is much smaller than the person, the interaction box fully covers the object and partially covers the person; 3) when the object is much larger than the person, the opposite holds.
In step 4, the interaction detection branch is responsible for detecting interaction boxes, as follows:
Given the known bounding boxes of people and objects, a set of interaction boxes is obtained from the formula designed in step 3; these interaction boxes, together with their corresponding interaction categories, are fed into the YOLOv5 model structure as special targets so that it detects interaction boxes. The branch outputs a confidence for whether an interaction relationship exists, the four parameters required to determine the interaction box, and a score for each interaction action.
In step 5, HOI relationship pairing is performed, as follows:
The people and objects in the basic object detection result are combined pairwise and passed through the interaction box generation formula to obtain candidate interaction boxes. The intersection over union (IoU) between every candidate interaction box and each predicted interaction box is computed; for each predicted interaction box, the candidate with the largest IoU is kept, the person and object that generated that candidate are considered to be in an interaction relationship, and the action category of the predicted interaction box is assigned to them, yielding the interaction relationship prediction.
Drawings
Fig. 1 is a schematic diagram.
Fig. 2 is a diagram illustrating a conventional two-stage human interaction detection method.
FIG. 3 shows person and object bounding boxes together with the interaction boxes.
FIG. 4 is a diagram of a model framework of the present invention.
Detailed Description
The character interaction detection method based on YOLOv5 provided by the invention comprises five steps overall:
step 1: inputting an original picture and extracting characteristics;
step 2: the double target detection branches respectively detect an object example and an interaction frame;
and step 3: the target detection branch is responsible for detecting people and objects;
and 4, step 4: the interaction detection branch is responsible for detecting the interaction frame;
and 5: carrying out HOI relation pairing;
in step 3, the target detection branch is responsible for detecting people and objects, and the method comprises the following steps:
setting the channel dimension of the extracted feature vector to be (nc + 5) × 3, wherein nc represents the number of the types of the objects, namely 80;5=4+1:4 represents four parameters required for determining the frame of the object, namely a center point coordinate parameter and a width and height parameter, and 1 represents a confidence level for detecting whether the object exists or not. The detection process follows the method of YOLOv 5.
During training, the interaction region between a person and an object, i.e. the interaction box, is calculated and fitted by a formula from the positions of the person and the object. The interaction box is treated as a special target and fed into the object detection network for training, giving the model the ability to predict interaction regions.
An interaction box generation formula was obtained through a series of tests. The formula places the center of the generated interaction box on the line connecting the centers of the person and the object, and covers as much as possible of the region that is useful for classifying the interaction.
The main considerations when designing this formula are: 1) the center of the interaction box lies on the line connecting the centers of the person and the object; 2) when the object is much smaller than the person, the interaction box fully covers the object and partially covers the person; 3) when the object is much larger than the person, the opposite holds. Based on these considerations, the formula is designed as follows:
Two of the formulas appear only as equation images in the original filing (Figure BSA0000290086450000021 and Figure BSA0000290086450000031); the equations given in the text are:

x_a = Ratio_L * x_o + (1 - Ratio_L) * x_h
y_a = Ratio_L * y_o + (1 - Ratio_L) * y_h
w_a = Ratio_S * min(w_h, w_o) + (1 - Ratio_S) * max(w_h, w_o)
h_a = Ratio_S * min(h_h, h_o) + (1 - Ratio_S) * max(h_h, h_o)

where (w_h, h_h) and (w_o, h_o) are the width and height of the person box and the object box, (x_h, y_h) and (x_o, y_o) are the center points of the person and the object, and (x_a, y_a, w_a, h_a) describes the interaction box. Ratio_L is the scale factor for the center-point coordinates and Ratio_S is the area scale factor.
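A direct transcription of these equations into Python is sketched below; since the defining formulas of Ratio_L and Ratio_S appear only as equation images in the filing, the sketch takes them as plain parameters in [0, 1].

```python
def interaction_box(human_box, object_box, ratio_l, ratio_s):
    """Generate an interaction box from a person box and an object box.

    Boxes are (x_center, y_center, width, height). ratio_l and ratio_s stand for
    Ratio_L and Ratio_S; their defining formulas are given only as figures in the
    filing, so here they are passed in as plain numbers in [0, 1].
    """
    x_h, y_h, w_h, h_h = human_box
    x_o, y_o, w_o, h_o = object_box

    # Center of the interaction box lies on the line joining the two centers.
    x_a = ratio_l * x_o + (1 - ratio_l) * x_h
    y_a = ratio_l * y_o + (1 - ratio_l) * y_h

    # Size interpolates between the smaller and the larger of the two boxes.
    w_a = ratio_s * min(w_h, w_o) + (1 - ratio_s) * max(w_h, w_o)
    h_a = ratio_s * min(h_h, h_o) + (1 - ratio_s) * max(h_h, h_o)
    return (x_a, y_a, w_a, h_a)
```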
In step 4, the interaction detection branch is responsible for detecting interaction boxes, as follows:
From the positions of the people and objects, the position of the interaction box is calculated by the formula in step 3, the corresponding interaction category is assigned to it, and it is input into the interaction detection branch as a special target.
An interaction box detection head is added in the yaml configuration file, and the channel dimension of the extracted feature vector is set to (nh + 5) × 3, where nh is the number of interaction action types, namely 117, and 5 = 4 + 1: the 4 are the parameters required to determine the interaction box, i.e. the center-point coordinates and the width and height, and the 1 is the confidence for whether an interaction relationship exists. The detection process follows the YOLOv5 method.
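For illustration, the following sketch writes one such special target as a YOLO-style label row; the "class x y w h" label format and the normalization by image size are assumptions borrowed from the usual YOLOv5 label convention (the filing only states that interaction boxes and their action categories are fed to the model as special targets), and the action id is arbitrary. It reuses the interaction_box sketch above.

```python
def interaction_label_row(human_box, object_box, action_id, img_w, img_h,
                          ratio_l, ratio_s):
    """Build one YOLO-style label row "action_id x y w h" for an interaction box.

    Coordinates are normalized by the image size, as in YOLOv5 label files.
    This label-file convention is an assumption made for illustration only.
    """
    x_a, y_a, w_a, h_a = interaction_box(human_box, object_box, ratio_l, ratio_s)
    return f"{action_id} {x_a / img_w:.6f} {y_a / img_h:.6f} {w_a / img_w:.6f} {h_a / img_h:.6f}"

# Example: a person box, a cup box, and a hypothetical action id for "drink".
row = interaction_label_row((300, 260, 120, 300), (360, 200, 40, 50),
                            action_id=37, img_w=512, img_h=512,
                            ratio_l=0.5, ratio_s=0.5)
print(row)
```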
In step 5, HOI relationship pairing is performed, as follows:
pairwise people and objects in the basic target detection result are processed according to the interaction frame generation formula to obtain the interaction frame judge _ bbox to be judged ai (x ai ,y ai ,w ai ,h ai ) Wherein i is 1-M, and M is the number of the pairwise combination and pairing of the human and the object. Predicting the interaction Box as Presct _ bbox aj (x aj ,y aj ,w aj ,h aj ) Where j belongs to 1-N, N being the number of predicted interaction boxes. Calculating all interaction frames to be judged and predicting interactionIntersection ratio between frames IoU, i.e.:
Figure BSA0000290086450000032
where area (predict _ bbox) represents a region of the prediction box, and area (judge _ bbox) represents a region of the prediction box. Interactive frame prediction _ bbox for each prediction aj Selecting the largest interactive frame to be judged, namely judge _ bbox, which is reserved for IoU ai And considering that the interaction relationship exists between the people and the object of the interaction frame to be judged generated by the formula in the step 2, and endowing the predicted interaction frame with the action type of the interaction frame to obtain an interaction relationship predicted value.
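A minimal sketch of this pairing step is given below, assuming boxes in (center x, center y, width, height) format and reusing the interaction_box sketch above; a real implementation would additionally threshold confidences before matching.

```python
def iou(box_a, box_b):
    """IoU of two (x_center, y_center, w, h) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def pair_hoi(person_boxes, object_boxes, predicted_interactions, ratio_l, ratio_s):
    """Match predicted interaction boxes back to (person, object) pairs.

    predicted_interactions is a list of (box, action_id). Every person-object pair
    is turned into a candidate interaction box with interaction_box() (earlier
    sketch); each predicted interaction box keeps the candidate with the largest
    IoU, and its action is assigned to that pair.
    """
    candidates = [(p, o, interaction_box(p, o, ratio_l, ratio_s))
                  for p in person_boxes for o in object_boxes]
    triplets = []
    for pred_box, action_id in predicted_interactions:
        person, obj, _ = max(candidates, key=lambda c: iou(c[2], pred_box))
        triplets.append((person, obj, action_id))  # <human, object, action>
    return triplets
```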
In summary, we want to detect HOI relationships faster to accommodate time-critical scenarios. Given the fast-detection advantage of the YOLO object detection algorithm, we apply it to HOI detection and propose our YOLOHOI model. This solution differs from the general HOI detection approach in that it treats the core problem of HOI detection as a dual target detection task and focuses on detecting the interaction region of each person-object pair. We treat the interaction region as a special object, so the model gains the ability to detect this special object. In addition to the model, we also design a formula for generating interaction boxes. To demonstrate that YOLOHOI is effective, we performed experiments at three network scales, training on the HOI-A dataset with a cosine annealing learning rate. Detection accuracy improves as the number of convolution layers increases, but the improvement becomes less pronounced at larger network scales. The highest mAP is 57.58% with an average inference time of only 21.67 ms; the speed exceeds existing models and the accuracy exceeds some traditional two-stage models, but a gap to the most accurate models remains, since the YOLO algorithm is not the strongest in detection accuracy.
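The filing specifies only that training uses a cosine annealing learning rate; as an illustration, such a schedule can be set up in PyTorch as follows, where the optimizer settings, epoch count and minimum learning rate are placeholders rather than values from the filing.

```python
import torch

# Placeholder parameters; in practice these would be the YOLOHOI model weights.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.937)

# Cosine annealing: the learning rate decays from lr to eta_min over T_max epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-4)

for epoch in range(300):
    # ... one training epoch over HOI-A would go here ...
    optimizer.step()
    scheduler.step()
```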
TABLE 1: Training results for the three different network sizes (table provided as an image in the original publication).
TABLE 2: Performance comparison with other models (table provided as an image in the original publication).

Claims (5)

1. A character interaction detection method based on YOLOv5, characterized by comprising the following operation steps:
step 1: inputting an original picture and extracting picture characteristics;
step 2: the double target detection branches respectively detect an object example and an interaction frame;
and step 3: the target detection branch outputs object related information;
and 4, step 4: the interaction detection branch outputs the relevant information of the interaction frame;
and 5: HOI relationship pairing is carried out through two branch results;
in step 1, the input picture has three channels, the input picture size is 512, and the dual-branch target detection shares the features extracted by the backbone network.
2. The character interaction detection method based on YOLOv5 as claimed in claim 1, wherein in step 2 the dual target detection branches detect object instances and interaction boxes respectively, specifically as follows:
unlike the usual approach to the human-object interaction detection problem, the core problem is treated as a dual target detection task, the dual detection comprising basic object detection of people and objects and special target detection of the interaction region; by focusing on the interaction region between each person-object pair, the complex many-to-many relationship detection is resolved, the interaction region is treated as a special target, the interaction detection result is obtained at the same time as the object detection result, and fast detection of the interaction relationship is achieved.
3. The character interaction detection method based on YOLOv5 as claimed in claim 1, wherein in step 3 the object detection branch is responsible for detecting people and objects, specifically as follows:
the YOLOv5 model structure is used to detect and output a confidence for whether an object exists, the four parameters required to determine the object bounding box, and a score for each object category;
during training, the interaction region between a person and an object, i.e. the interaction box, is calculated and fitted from the positions of the person and the object by the formula designed in the invention; the interaction box is treated as a special target and input into the object detection network for training, so that the model has the ability to predict interaction regions;
an interaction box generation formula is obtained through a series of tests; the formula places the center of the generated interaction box on the line connecting the centers of the person and the object and covers as much as possible of the region useful for classifying the interaction action; the main considerations in designing the formula are: 1) the center of the interaction box lies on the line connecting the centers of the person and the object; 2) when the object is much smaller than the person, the interaction box fully covers the object and partially covers the person; 3) when the object is much larger than the person, the opposite holds; the formula is designed as follows:
two of the formulas appear only as equation images in the original filing (Figure FSA0000290086440000011 and Figure FSA0000290086440000012); the equations given in the text are:

x_a = Ratio_L * x_o + (1 - Ratio_L) * x_h
y_a = Ratio_L * y_o + (1 - Ratio_L) * y_h
w_a = Ratio_S * min(w_h, w_o) + (1 - Ratio_S) * max(w_h, w_o)
h_a = Ratio_S * min(h_h, h_o) + (1 - Ratio_S) * max(h_h, h_o)

where (w_h, h_h) and (w_o, h_o) are the width and height of the person box and the object box, (x_h, y_h) and (x_o, y_o) are the center points of the person and the object, and (x_a, y_a, w_a, h_a) describes the interaction box; Ratio_L is the scale factor for the center-point coordinates and Ratio_S is the area scale factor.
4. The character interaction detection method based on YOLOv5 as claimed in claim 1, wherein in step 4 the interaction detection branch detects interaction boxes, specifically as follows:
given the known bounding boxes of people and objects, a set of interaction boxes is obtained from the formula designed in step 3; these interaction boxes and their corresponding interaction categories are input into the YOLOv5 model structure as special targets, the interaction boxes are detected, and a confidence for whether an interaction relationship exists, the four parameters required to determine the interaction box, and a score for each interaction action are output.
5. The character interaction detection method based on YOLOv5 as claimed in claim 1, wherein in step 5 HOI relationship pairing is performed, specifically as follows:
the people and objects in the basic object detection result are combined pairwise and passed through the interaction box generation formula to obtain candidate interaction boxes; the intersection over union (IoU) between every candidate interaction box and each predicted interaction box is computed; for each predicted interaction box, the candidate with the largest IoU is kept, the person and object that generated that candidate are considered to be in an interaction relationship, and the action category of the predicted interaction box is assigned to them, yielding the interaction relationship prediction.
CN202211512924.0A 2022-11-29 2022-11-29 Character interaction detection method based on YOLOv5 Pending CN115909405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211512924.0A CN115909405A (en) 2022-11-29 2022-11-29 Character interaction detection method based on YOLOv5

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211512924.0A CN115909405A (en) 2022-11-29 2022-11-29 Character interaction detection method based on YOLOv5

Publications (1)

Publication Number Publication Date
CN115909405A true CN115909405A (en) 2023-04-04

Family

ID=86474308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211512924.0A Pending CN115909405A (en) 2022-11-29 2022-11-29 Character interaction detection method based on YOLOv5

Country Status (1)

Country Link
CN (1) CN115909405A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116883916A (en) * 2023-09-08 2023-10-13 深圳市国硕宏电子有限公司 Conference abnormal behavior detection method and system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination