CN113887468B - Single-view human-object interaction identification method of three-stage network framework - Google Patents

Single-view human-object interaction identification method of three-stage network framework

Info

Publication number
CN113887468B
CN113887468B (application CN202111200063.8A, published as CN113887468A)
Authority
CN
China
Prior art keywords
image
character
stage network
stage
position information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111200063.8A
Other languages
Chinese (zh)
Other versions
CN113887468A (en)
Inventor
田锋
王耀智
张吉仲
南方
洪振鑫
吴砚泽
郑庆华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202111200063.8A priority Critical patent/CN113887468B/en
Publication of CN113887468A publication Critical patent/CN113887468A/en
Application granted granted Critical
Publication of CN113887468B publication Critical patent/CN113887468B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-view human-object interaction identification method of a three-stage network framework, belonging to the field of Human-Object Interaction (HOI). The three-stage framework addresses the problems of person-object combination explosion, low pixel counts and occlusion that existing HOI recognition algorithms face in MPMO (multi-person, multi-object) scenes, and balances recognition efficiency and accuracy. Compared with classical HOI methods such as iCAN and QPIC, the invention improves mAP by 0.21 on the constructed real multi-person multi-object classroom scene dataset and greatly improves execution efficiency; compared with the classical Faster R-CNN model, the average accuracy of object target detection is improved by 0.301. The invention accurately captures the context information and interaction-object information around a person without introducing complex background information, and effectively alleviates the person-object combination explosion problem.

Description

Single-view human-object interaction identification method of three-stage network framework
Technical Field
The invention belongs to the field of Human-Object Interaction (HOI), and particularly relates to a single-view human-object interaction identification method of a three-stage network framework.
Background
Extracting the behaviors of persons in a picture and the interaction relations between persons and objects by means of artificial intelligence is a hot topic in computer vision. Compared with traditional behavior-recognition methods, the HOI recognition task detects and recognizes the interaction mode between each person and the surrounding objects, and mainly targets scenes with more complex actions (more action types, more complex background information, and so on). All persons/objects in the scene are first recognized by an object detection method; the interaction type between them is then obtained using the appearance information and relative position information of the related persons/objects in the scene, and a triplet of the form <person, verb, object> is output. The forward reasoning from object detection to the final triplet is called HOI.
Most HOI recognition methods at the present stage are studied under simple-scene conditions, i.e., an image contains only 1-2 persons/objects, and the influence of multi-person multi-object scenes on HOI recognition is ignored. In a multi-person multi-object scene, people are usually dense and the types and number of objects are large, so there is a great deal of occlusion among people and between people and objects in the picture, the image resolution of persons and objects is low, and the person-object combination relations explode. When existing methods are applied to a real multi-person multi-object scene, the following problems arise. First: in the object detection stage, if the pre-trained object detection model cannot accurately detect all person/object targets in the image due to occlusion, or even complete hiding, or image blur, person-object combinations with a real HOI relationship may be missed, so the performance of HOI recognition degrades. Second: in the interactive action recognition stage, since the model takes person-object combinations as input, an increase in the number of persons and objects in one image causes the number of person-object combinations to grow exponentially; with the same model batch size, the forward-reasoning time of the model grows as the number of persons/objects in an image increases. Third: traditional HOI algorithms (taking iCAN as an example) extract the appearance features of persons and objects with a convolution-before-segmentation approach, but in a multi-person multi-object scene dataset a single person or object occupies few pixels and a small proportion of the whole image, so the convolutional layers capture too much redundant surrounding background information, introducing noise and reducing the recognition accuracy of the algorithm.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a single-view human-object interaction identification method of a three-stage network framework.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
a recognition method of single-view man-object interaction of a three-stage network frame, wherein the three-stage network frame comprises a first-stage network frame, a second-stage network frame and a third-stage network frame; the first-stage network framework comprises a ResNet model, a self-attention mechanism module, a joint module and a full-connection layer; the second-stage network framework comprises a pooling layer, a joint module and a full-connection layer; the third-stage network framework comprises a ResNet model, a joint module and a full connection layer; the identification method comprises the steps of training a network and using the trained network to carry out identification, wherein the training network comprises the following steps:
1) Identifying character position information, object type information and position information thereof in the picture through a Faster R-CNN model, and outputting a character image according to the character position information by the Faster R-CNN model;
the character image sequentially passes through the ResNet model and the self-attention mechanism module of the first-stage network framework, which output the appearance feature f_s of the character image;
the character position information is down-sampled multiple times to obtain single-channel binary-matrix character position information f_l; the joint module of the first-stage network framework combines the f_s and the f_l, the obtained joint features are input into the full-connection layer of the first-stage network framework, and the full-connection layer performs a preliminary classification of the character behaviors to obtain a multi-label action classification prediction result;
2) Expanding the character position information to obtain position information of a local expansion area and a corresponding character local expansion area image, and inputting the position information into a pooling layer of a second-stage network frame to obtain corresponding characteristics;
the combination module of the second-stage network framework combines the characteristics with the combined characteristics, and then inputs the characteristics and the combined characteristics into a full-connection layer of the second-stage network framework, and the full-connection layer outputs the position thermodynamic diagram of the most focused object of each person to obtain the position thermodynamic diagram of the interaction object;
3) Inputting object position information and a person local expansion area image into a ResNet network of a third-stage network frame, outputting image characteristics of an expansion area by the ResNet network, combining the image characteristics of the expansion area with the interactive object position thermodynamic diagram by a joint module of the third-stage network frame, and outputting corrected person behavior categories through a full-connection layer of the third-stage network frame to obtain a multi-label action classification prediction correction result;
in the training process, the Margin-Loss between the multi-label action classification prediction correction result and the multi-label action classification prediction result output by the first-stage network framework is calculated, the margin threshold is dynamically adjusted using Focal-Loss, and the Faster R-CNN model and the three-stage network framework are optimized by back-propagation;
4) Repeating the steps 1) -3) until the Margin-Loss is stable, and finishing training to obtain a trained fast R-CNN model and a three-stage network frame;
identifying using the trained network includes the following operations:
inputting a single classroom person-view picture into the trained Faster R-CNN model, inputting the character information and object information output by the Faster R-CNN model into the trained three-stage network framework, and outputting the behavior and interaction-object result of each character by the three-stage network framework.
Further, the specific process of identifying by using the trained network is as follows:
The single classroom person-view picture is input into the Faster R-CNN target detection model, which detects the persons in the picture and outputs a human body coordinate frame b_o for each person; a character part image and the corresponding LER image are cropped according to the character coordinate frame b_o;
the original picture is input into the ResNet and self-attention mechanism module of the first-stage network framework to calculate the image appearance feature f_s; the character position information and character image information obtained by the Faster R-CNN model are input into successive pooling layers to obtain the character position feature f_l, and the two are fused into the joint feature f_h = [f_s, f_l];
the character position information is input into the pooling layer of the second-stage network framework, the position features of the character-specific information in the LER image are extracted, and the extracted position features are fused with f_h; the interaction-object position thermodynamic diagram is generated through the full-connection layer and binarization processing is performed;
the LER region image is input into the ResNet model of the third-stage network framework, the LER region image features are extracted, the thermodynamic diagram is multiplied with the LER region features, the final fusion feature f_all is calculated, and the final behavior category is output through the full-connection layer and the activation function.
Furthermore, the convolution-layer parameters used in extracting features from the character original image information and the LER image information are shared, which reduces the number of parameters in model training and improves the model training speed.
Further, the specific operation of obtaining the joint characteristics of the characters in the first-stage network frame in the step 1) is as follows:
extracting the appearance feature f_s of the human body part image of the character using the ResNet model as the backbone network, and capturing regions of interest favorable for classification using the self-attention mechanism module in iCAN;
representing the position information f_l of the character in the classroom using a binary matrix, in which the value at the character's position is 1 and the values at the remaining positions are 0, the binary matrix being down-sampled by successive pooling layers;
fusing the f_s and the f_l to obtain the joint features of the character.
Further, the specific operation of generating the thermodynamic diagram of the position of the interactive object in the second stage network frame in the step 2) is as follows:
performing interaction-object localization within the LER image expanded from the local region of the character image;
extracting character position features from the character position information in the LER image through the pooling layer of the second-stage network framework and fusing them with f_h, adjusting the thermodynamic diagram to 14 × 14 through a 196-dimensional full-connection layer in the second-stage network, and performing binarization processing on the thermodynamic diagram through an activation function to obtain the interaction-object position thermodynamic diagram.
Further, the LER image is formed by expanding the length and the width of the original local human-body image outwards by α times, where α is a predefined expansion-factor hyper-parameter; the new coordinates of the LER image, [X_min', Y_min', X_max', Y_max'], are calculated from the original box coordinates, the expansion factor α, and the image width W and height H, where X_min, Y_min, X_max, Y_max denote the upper-left and lower-right corner coordinates of the position box of the character part image in the original picture.
Further, when calculating the LER region position coordinates, the calculated result is compared with the image edges to avoid exceeding the image boundary.
Further, in step 3), the specific operations for classifying the person actions are:
inputting the image of the local expansion area of the character and the position information of the object into a ResNet model in a third-stage network, modeling the relative position space relation of the image and the object, combining the output relative position space relation with the interaction object position thermodynamic diagram, jointly carrying out HOI identification, and outputting a final behavior result and object information.
Further, in step 3), Margin-Loss is used to ensure classification accuracy and thereby correct the recognition result of step 1), while Focal-Loss is used to dynamically adjust the margin threshold according to the training difficulty of each sample; the final loss is computed from the behavior-recognition score of step 1), the current person-object interaction behavior recognition score, the hyper-parameter margins m_pos and m_neg for positive and negative samples, and a term representing the learning difficulty of each sample.
Further, the final fusion feature f_all is expressed as:
f_all = [f_h, f_o, f_sp]
where f_h denotes the character appearance feature, f_o denotes the interaction-object feature, and f_sp denotes the relative spatial relationship between the character and the interaction-object position.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a single-view human-object interaction identification method of a three-stage network framework, which solves a series of problems of human-object combination explosion, low pixels, shielding and the like in an MPMO scene of the existing HOI identification algorithm and combines the capabilities of identification efficiency and accuracy. Compared with the classical HOI method such as iCAN and QPIC, the invention improves mAP by 0.21 in the constructed real multi-person multi-object classroom scene data set, and greatly improves the execution efficiency; compared with the classical Faster R-CNN model, the average accuracy of target detection of an object is improved by 0.301. The LER concept provided by the invention accurately captures the context information and the interactive object information around the task on the premise of not introducing complex background information, and effectively solves the problem of human-object combined explosion; compared with classical HOI algorithm iCAN and QPIC, the model has the advantages of greatly improving the overall accuracy and the execution efficiency.
Drawings
Fig. 1 is an overall flow chart of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention is described in further detail below with reference to the attached drawing figures:
referring to fig. 1, fig. 1 is a flowchart showing a three-Stage network frame according to the present invention, wherein the upper left input is the result of recognition by the fast R-CNN model, including character position information, object category information, object position information, and character image in the picture, stage 1 represents a first-Stage network model, and the Stage inputs the character image to the res net model of the Stage and the self-attention mechanism module acquires the appearance feature f of the character image s Simultaneously, the character position information is downsampled for a plurality of times to obtain single-channel binary matrix character position information f l Combining the two features to form a combined feature, inputting the combined feature into a full-connection layer network at the current stage, and outputting a primary classification result of the character behaviors; stage 2 represents a second Stage network model, wherein the Stage expands character position information to obtain position information of a local expansion area and a corresponding character local expansion area image, the position information is input into a pooling layer of the Stage to obtain corresponding characteristics, the characteristics are combined with the combined characteristics of the first Stage network, the characteristics are input into a fully-connected layer of the layer, and finally an interactive object position thermodynamic diagram is output; stage 3 represents a third Stage network model, the Stage inputs the partial expansion area image of the person and the object position information of the second Stage into the ResNet network of the Stage, extracts image features, combines the image features with the interaction object position thermodynamic diagram of the second Stage, and finally inputs the image features and the interaction object position thermodynamic diagram of the second Stage into the full network frame of the third StageAnd the connection layer outputs the corrected person action classification result. The invention mainly comprises the following steps:
model training part:
step1: images including people and object objects are collected and annotated.
The annotation format follows that of the HICO-DET dataset, with modifications. Each sample in the dataset contains the storage address of the original picture, the character action labels, the character body position coordinates (not normalized), the character face position coordinates (not normalized), the interaction-object position coordinates (not normalized) and the position of the character in the classroom:
for each dataset sample, the label form of the character action is as follows:
[Action1,Action2,Action3,...]
i.e., a single character may perform multiple actions simultaneously.
The information of the character body position, the face body position and the interactive object position is determined by marking the coordinates of two points of the upper left corner and the lower right corner of the rectangular area, and the form is as follows:
[X min ,Y min ,X max ,Y max ]
where the upper-left and lower-right points correspond to [X_min, Y_min] and [X_max, Y_max] respectively.
The positions of the characters in the classroom are used for representing seat information in the classroom environment, and the basic form is as follows:
[Position1,Position2]
where Position1 indicates the row of the seats in the classroom, and Position2 indicates the column of the seats in the classroom.
When training the HOI detection model, all sample information in the dataset is used, and the training set and the test set are divided at a ratio of 6:1.
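For illustration, a minimal Python sketch of one annotation record in this format is given below; the field names and concrete values are assumptions for readability, since the patent specifies the content of each field but not exact key names:

# Hypothetical annotation record following the format described above
# (keys and values are illustrative, not taken from the MPMOCS dataset).
sample = {
    "image_path": "images/classroom_0001.jpg",
    "actions": ["Action1", "Action3"],        # a single character may perform several actions
    "body_box": [412, 230, 598, 655],         # [X_min, Y_min, X_max, Y_max], un-normalized pixels
    "face_box": [470, 240, 540, 320],
    "object_box": [430, 520, 610, 640],       # interaction-object position box
    "seat_position": [3, 5],                  # [Position1, Position2] = [row, column] in the classroom
}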
Step2: and performing a behavior recognition task based on the traditional convolutional neural network.
The model at this stage consists of a traditional behavior-recognition framework. It extracts the appearance feature f_s of character images and the single-channel binary-matrix character position information f_l to form joint features, and at the same time performs a preliminary classification of the character behaviors. The information is obtained as follows:
The appearance information contained in a character image is quite rich, and the input of the character behavior-recognition module depends only on human-body-frame detection. ResNet50 is used as the backbone network to extract the appearance feature f_s of the character's human body part image, and the Self-Attention module in iCAN is then used to capture regions of interest that favor classification.
Different positions of a character in the classroom lead to different camera shooting angles, which can deform the character image and affect the behavior-recognition result. The invention therefore uses a binary matrix to represent the position information f_l of a character in the classroom: the value at the character's position is 1 and the values at the remaining positions are 0, and the binary matrix is down-sampled by successive MaxPooling layers.
The behavior-recognition feature f_h = [f_s, f_l] combines the two types of features; a confidence score for each action class a is then computed with a full-connection structure and a Sigmoid activation function.
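A minimal PyTorch-style sketch of this first stage follows; the ResNet-50 backbone, the MaxPooling of a binary seat matrix and the Sigmoid multi-label output mirror the description above, while the plain linear projection standing in for the iCAN self-attention module, the 32 × 32 seat grid and all layer sizes are assumptions:

import torch
import torch.nn as nn
import torchvision

class StageOne(nn.Module):
    """Preliminary multi-label action classification (hedged sketch, not the exact patent layers)."""
    def __init__(self, num_actions, pos_grid=32):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # global-pooled 2048-d appearance feature
        # The patent uses the iCAN self-attention module here; a linear projection stands in for it.
        self.attention = nn.Linear(2048, 1024)
        self.pos_pool = nn.MaxPool2d(kernel_size=4)        # down-sampling of the binary seat matrix
        pos_dim = (pos_grid // 4) ** 2
        self.classifier = nn.Linear(1024 + pos_dim, num_actions)

    def forward(self, person_img, pos_matrix):
        f_s = self.attention(self.backbone(person_img).flatten(1))   # appearance feature f_s
        f_l = self.pos_pool(pos_matrix).flatten(1)                   # binary position feature f_l
        f_h = torch.cat([f_s, f_l], dim=1)                           # joint feature f_h = [f_s, f_l]
        return torch.sigmoid(self.classifier(f_h)), f_h              # per-action confidence scores and f_h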
Step3: an interactive object position thermodynamic diagram is generated.
Through supervised training, the network outputs the possible position of the object that each person pays most attention to. Using the character feature f_h obtained in step 2 and the character position information within the local expansion region, the interaction-object position thermodynamic diagram is estimated as follows:
(1) Interaction-object localization is performed within the LER image expanded from the local region of the character image. The LER image is formed by expanding the length and the width of the original local human-body image outwards by α times, where α is a predefined expansion-factor hyper-parameter; the new coordinates of the LER image, [X_min', Y_min', X_max', Y_max'], are calculated from the original box coordinates, the expansion factor α, and the image width W and height H.
Considering that the character subject may lie at the edge of the classroom image, the calculated LER region coordinates are compared with the image edges so that the region does not exceed the image boundary.
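A small sketch of this expand-and-clamp step follows; the exact expansion formula appears only as an equation image in the original publication, so the symmetric form below (each side grown by α times the box size and then clamped to the image) is an assumption:

def expand_ler_box(box, alpha, img_w, img_h):
    """Expand a person box [X_min, Y_min, X_max, Y_max] into an LER box (assumed symmetric form)."""
    x_min, y_min, x_max, y_max = box
    w, h = x_max - x_min, y_max - y_min
    x_min_e = max(0, x_min - alpha * w)
    y_min_e = max(0, y_min - alpha * h)
    x_max_e = min(img_w, x_max + alpha * w)
    y_max_e = min(img_h, y_max + alpha * h)
    return [x_min_e, y_min_e, x_max_e, y_max_e]

# Example with alpha = 0.5 on a 1920 x 1080 frame
ler_box = expand_ler_box([412, 230, 598, 655], alpha=0.5, img_w=1920, img_h=1080)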
(2) Generating object position thermodynamic diagrams
According to the character position information in the LER image, character position features are extracted and fused with f_h from step 2; the fused feature is passed through a 196-dimensional full-connection layer and reshaped to a 14 × 14 thermodynamic diagram, which is binarized with a Sigmoid function to generate the object position thermodynamic diagram.
A thermodynamic diagram, rather than box coordinates, is chosen as the object position representation mainly for two reasons: first, the thermodynamic-diagram output is used in step 4, where it can be multiplied directly with the image feature map as an attention map; second, a coordinate box is sharp whereas a thermodynamic diagram is smooth, which is friendlier to gradient propagation.
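A matching sketch of this second stage is given below; the 196-dimensional full-connection layer and the 14 × 14 Sigmoid heatmap follow the description above, whereas the pooling of the person-position mask and the input dimensions are assumptions:

import torch
import torch.nn as nn

class StageTwo(nn.Module):
    """Interaction-object localization head (hedged sketch)."""
    def __init__(self, joint_dim=1088, pos_dim=64):
        super().__init__()
        self.pos_pool = nn.MaxPool2d(kernel_size=4)      # pool the person-position mask within the LER image
        self.fc = nn.Linear(joint_dim + pos_dim, 196)    # 196 = 14 x 14, as described above

    def forward(self, f_h, ler_pos_mask):
        pos_feat = self.pos_pool(ler_pos_mask).flatten(1)
        fused = torch.cat([f_h, pos_feat], dim=1)
        # the Sigmoid output serves as the (soft) binarized 14 x 14 thermodynamic diagram
        return torch.sigmoid(self.fc(fused)).view(-1, 1, 14, 14)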
Step4: and referring to Margin Loss in fine-grained classification, introducing position and appearance characteristic information of an object to assist in executing final interactive behavior recognition, and realizing self-correction of a first-stage recognition result.
The process uses the output thermodynamic diagram in step3 as an input for determining the position of the interactive object to assist in the classification task of the HOI behavior, and simultaneously improves the accuracy of HOI classification by utilizing the original appearance information of the character and the interactive object.
This step uses the character appearance features obtained in step 2 together with the object appearance features, models their relative spatial position relation, and performs HOI recognition jointly. The final fusion feature is expressed as:
f_all = [f_h, f_o, f_sp]
where f_h denotes the character appearance feature extracted in step 2, f_o denotes the interaction-object feature extracted at this stage, and f_sp denotes the relative spatial relationship between the character and the interaction-object position.
For the appearance features of the interaction object, the interaction-object localization output of step 3 is used as a thermodynamic diagram and multiplied with the convolved LER image feature map, with the calculation formula:
f_o = MaxPooling(f_LER ⊙ loc_o)
where f_LER denotes the convolution feature map of the LER region and loc_o denotes the interaction-object localization thermodynamic diagram output in step 3. Only the pixels corresponding to the interaction-object position in the thermodynamic diagram have values greater than 0, so MaxPooling extracts the object appearance information while removing the other redundant features.
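A sketch of the third stage, combining the masked pooling above with the fusion feature f_all = [f_h, f_o, f_sp], is shown below; the 8-value encoding of the relative person/object geometry and all layer sizes are assumptions, and the LER backbone is written as its own module for brevity although the patent shares its convolution parameters with the person branch:

import torch
import torch.nn as nn
import torchvision

class StageThree(nn.Module):
    """Corrected HOI classification from the LER image and the object heatmap (hedged sketch)."""
    def __init__(self, num_actions, joint_dim=1088, sp_dim=64):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.ler_conv = nn.Sequential(*list(backbone.children())[:-2])   # (B, 2048, 14, 14) for 448 x 448 input
        self.sp_fc = nn.Linear(8, sp_dim)             # person box + object box coordinates (assumed encoding)
        self.classifier = nn.Linear(joint_dim + 2048 + sp_dim, num_actions)

    def forward(self, ler_img, loc_o, f_h, person_box, object_box):
        f_ler = self.ler_conv(ler_img)                                   # convolution feature map of the LER region
        f_o = torch.amax(f_ler * loc_o, dim=(2, 3))                      # f_o = MaxPooling(f_LER ⊙ loc_o)
        f_sp = self.sp_fc(torch.cat([person_box, object_box], dim=1))    # relative spatial feature f_sp
        f_all = torch.cat([f_h, f_o, f_sp], dim=1)                       # f_all = [f_h, f_o, f_sp]
        return torch.sigmoid(self.classifier(f_all))                     # corrected multi-label action scores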
Considering that the character original image and the LER image differ little and contain highly similar information, the convolution-layer parameters used to extract features from the two are shared, which reduces the number of parameters in model training and improves the training speed.
Margin-Loss, as used in fine-grained classification problems, is adopted to ensure classification accuracy and further correct the recognition result of step 2, while Focal-Loss is used to dynamically adjust the margin threshold according to the training difficulty of each sample. The final loss is computed from the behavior-recognition score of step 2, the current person-object interaction behavior recognition score, the hyper-parameter margins m_pos and m_neg for positive and negative samples, and a term representing the learning difficulty of each sample.
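The exact loss is published only as an equation image; the sketch below merely encodes the idea described above (a margin between the earlier behavior-recognition score and the final score, with m_pos and m_neg for positive and negative samples, plus a focal-style difficulty weight) and is not the patented formula:

import torch

def margin_focal_loss(score_first, score_final, labels, m_pos=0.2, m_neg=0.2, gamma=2.0):
    """Illustrative margin loss with a focal-style difficulty weight (assumed form)."""
    # positive labels: push the final score above the first-stage score by m_pos
    pos_term = labels * torch.clamp(score_first - score_final + m_pos, min=0)
    # negative labels: push the final score below the first-stage score by m_neg
    neg_term = (1 - labels) * torch.clamp(score_final - score_first + m_neg, min=0)
    difficulty = torch.abs(labels - score_final) ** gamma   # harder samples receive larger weights
    return (difficulty * (pos_term + neg_term)).mean()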
Model use part:
step1: for inputting a single picture, a target detection model (here, the fast R-CNN model) is used to detect the individuals in the picture, and the human body coordinate frame b of each person is returned o
Step2: according to the coordinate frame b of the person o Intercepting a character part image and a corresponding LER image;
step3: inputting original character image information and intercepting character image information, and calculating image appearance characteristics f s And character position information f l From this, a joint feature f is derived h =[f s ,f l ];
Step4: calculating the position information of specific information of the person in the LER image, extracting the position characteristics through Pooling, and combining the position characteristics with f h Fusing, calculating to generate 14 multiplied by 14 thermodynamic diagrams, and performing binarization processing;
step5: feature extraction is carried out on the LER region image, thermodynamic diagram is multiplied by LER region features, and final fusion features f are calculated all And calculating the final behavior category through the full connection layer and the activation function.
Examples
On the constructed real multi-person multi-object classroom scene dataset MPMOCS (described in Tables 1 and 2), mAP is improved by 0.21 compared with classical HOI methods such as iCAN and QPIC, and the execution efficiency is greatly improved; compared with the classical Faster R-CNN model, the average accuracy of object target detection is improved by 0.301. The experimental results are shown in Tables 3 and 4.
Table 1 comparison of MPMOCS with HOI public dataset
Table 2 MPMOCS dataset partitioning
TABLE 3 HOI identification accuracy contrast
TABLE 4 object positioning accuracy contrast
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. The method for identifying single-view human-object interaction of the three-stage network framework, characterized in that the three-stage network framework comprises a first-stage network framework, a second-stage network framework and a third-stage network framework; the first-stage network framework comprises a ResNet model, a self-attention mechanism module, a joint module and a full-connection layer; the second-stage network framework comprises a pooling layer, a joint module and a full-connection layer; the third-stage network framework comprises a ResNet model, a joint module and a full-connection layer; the identification method comprises training a network and using the trained network for identification, wherein training the network comprises the following steps:
1) Identifying character position information, object type information and position information thereof in the picture through a Faster R-CNN model, and outputting a character image according to the character position information by the Faster R-CNN model;
the character image sequentially passes through the ResNet model and the self-attention mechanism module of the first-stage network framework, which output the appearance feature f_s of the character image;
the character position information is down-sampled multiple times to obtain single-channel binary-matrix character position information f_l; the joint module of the first-stage network framework combines the f_s and the f_l, the obtained joint features are input into the full-connection layer of the first-stage network framework, and the full-connection layer performs a preliminary classification of the character behaviors to obtain a multi-label action classification prediction result;
2) Expanding the character position information to obtain position information of a local expansion area and a corresponding character local expansion area image, and inputting the position information into a pooling layer of a second-stage network frame to obtain corresponding characteristics;
the combination module of the second-stage network framework combines the characteristics with the combined characteristics, and then inputs the characteristics and the combined characteristics into a full-connection layer of the second-stage network framework, and the full-connection layer outputs the position thermodynamic diagram of the most focused object of each person to obtain the position thermodynamic diagram of the interaction object;
3) Inputting object position information and a person local expansion area image into a ResNet network of a third-stage network frame, outputting image characteristics of an expansion area by the ResNet network, combining the image characteristics of the expansion area with the interactive object position thermodynamic diagram by a joint module of the third-stage network frame, and outputting corrected person behavior categories through a full-connection layer of the third-stage network frame to obtain a multi-label action classification prediction correction result;
in the training process, the Margin-Loss between the multi-label action classification prediction correction result and the multi-label action classification prediction result output by the first-stage network framework is calculated, the margin threshold is dynamically adjusted using Focal-Loss, and the Faster R-CNN model and the three-stage network framework are optimized by back-propagation;
4) Repeating the steps 1) -3) until the Margin-Loss is stable, and finishing training to obtain a trained fast R-CNN model and a three-stage network frame;
identifying using the trained network includes the following operations:
inputting a single classroom person-view picture into the trained Faster R-CNN model, inputting the character information and object information output by the Faster R-CNN model into the trained three-stage network framework, and outputting the behavior and interaction-object result of each character by the three-stage network framework.
2. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 1, wherein the specific process of identifying by using the trained network is as follows:
inputting a single classroom person-view picture into the Faster R-CNN target detection model, wherein the Faster R-CNN target detection model detects the persons in the picture and outputs a human body coordinate frame b_o of each person, and cropping a character part image and a corresponding LER image according to the character coordinate frame b_o;
inputting the original picture into the ResNet and self-attention mechanism module of the first-stage network framework to calculate the image appearance feature f_s, inputting the character position information and character image information obtained by the Faster R-CNN model into successive pooling layers to obtain the character position feature f_l, and fusing them to obtain the joint feature f_h = [f_s, f_l];
inputting the character position information into the pooling layer of the second-stage network framework, extracting the position features of the character-specific information in the LER image, fusing the extracted position features with f_h, generating the interaction-object position thermodynamic diagram through the full-connection layer, and performing binarization processing;
inputting the LER region image into the ResNet model of the third-stage network framework, extracting the LER region image features, multiplying the thermodynamic diagram with the LER region features, calculating the final fusion feature f_all, and outputting the final behavior category through the full-connection layer and the activation function.
3. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 2, wherein the convolution-layer parameters used in extracting features from the character original image information and the LER image information are shared, thereby reducing the number of parameters in model training and improving the model training speed.
4. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 1, wherein the specific operation of obtaining the joint features of the character in the first-stage network framework in step 1) is as follows:
extracting the appearance feature f_s of the human body part image of the character using the ResNet model as the backbone network, and capturing regions of interest favorable for classification using the self-attention mechanism module in iCAN;
representing the position information f_l of the character in the classroom using a binary matrix, in which the value at the character's position is 1 and the values at the remaining positions are 0, the binary matrix being down-sampled by successive pooling layers;
fusing the f_s and the f_l to obtain the joint features of the character.
5. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 1, wherein the specific operation of generating the interaction-object position thermodynamic diagram in the second-stage network framework in step 2) is as follows:
performing interaction-object localization within the LER image expanded from the local region of the character image;
extracting character position features from the character position information in the LER image through the pooling layer of the second-stage network framework and fusing them with f_h, adjusting the thermodynamic diagram to 14 × 14 through a 196-dimensional full-connection layer in the second-stage network, and performing binarization processing on the thermodynamic diagram through an activation function to obtain the interaction-object position thermodynamic diagram.
6. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 5, wherein the LER image is formed by expanding the length and the width of the original local human-body image outwards by α times, α being a predefined expansion-factor hyper-parameter; the new coordinates of the LER image, [X_min', Y_min', X_max', Y_max'], are calculated from the original box coordinates, the expansion factor α, and the image width W and height H, where X_min, Y_min, X_max, Y_max denote the upper-left and lower-right corner coordinates of the position box of the character part image in the original picture.
7. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 5, wherein the calculated LER region position coordinates are compared with the image edges to avoid exceeding the image boundary.
8. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 1, wherein in step 3), the specific operation for classifying the character actions is as follows:
inputting the image of the local expansion area of the character and the position information of the object into a ResNet model in a third-stage network, modeling the relative position space relation of the image and the object, combining the output relative position space relation with the interaction object position thermodynamic diagram, jointly carrying out HOI identification, and outputting a final behavior result and object information.
9. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 1, wherein in step 3), Margin-Loss is used to ensure classification accuracy and realize correction of the recognition result of step 1), while Focal-Loss is used to dynamically adjust the margin threshold according to the training difficulty of each sample; the final loss is computed from the behavior-recognition score of step 1), the current person-object interaction behavior recognition score, the hyper-parameter margins m_pos and m_neg for positive and negative samples, and a term representing the learning difficulty of each sample.
10. The method for identifying single-view human-object interaction of the three-stage network framework according to claim 1, characterized in that the final fusion feature f_all is expressed as:
f_all = [f_h, f_o, f_sp]
where f_h denotes the character appearance feature extracted in step 2, f_o denotes the interaction-object feature extracted at this stage, and f_sp denotes the relative spatial relationship between the character and the interaction-object position.
CN202111200063.8A 2021-10-14 2021-10-14 Single-view human-object interaction identification method of three-stage network framework Active CN113887468B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111200063.8A CN113887468B (en) 2021-10-14 2021-10-14 Single-view human-object interaction identification method of three-stage network framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111200063.8A CN113887468B (en) 2021-10-14 2021-10-14 Single-view human-object interaction identification method of three-stage network framework

Publications (2)

Publication Number Publication Date
CN113887468A CN113887468A (en) 2022-01-04
CN113887468B true CN113887468B (en) 2023-06-16

Family

ID=79002912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111200063.8A Active CN113887468B (en) 2021-10-14 2021-10-14 Single-view human-object interaction identification method of three-stage network framework

Country Status (1)

Country Link
CN (1) CN113887468B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973684B (en) * 2022-07-25 2022-10-14 深圳联和智慧科技有限公司 Fixed-point monitoring method and system for construction site

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914622A (en) * 2020-06-16 2020-11-10 北京工业大学 Character interaction detection method based on deep learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10019655B2 (en) * 2016-08-31 2018-07-10 Adobe Systems Incorporated Deep-learning network architecture for object detection
CN110163059B (en) * 2018-10-30 2022-08-23 腾讯科技(深圳)有限公司 Multi-person posture recognition method and device and electronic equipment
CN111814661B (en) * 2020-07-07 2024-02-09 西安电子科技大学 Human body behavior recognition method based on residual error-circulating neural network
CN111931703B (en) * 2020-09-14 2021-01-05 中国科学院自动化研究所 Object detection method based on human-object interaction weak supervision label
CN113449801B (en) * 2021-07-08 2023-05-02 西安交通大学 Image character behavior description generation method based on multi-level image context coding and decoding

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914622A (en) * 2020-06-16 2020-11-10 北京工业大学 Character interaction detection method based on deep learning

Also Published As

Publication number Publication date
CN113887468A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN109344693B (en) Deep learning-based face multi-region fusion expression recognition method
CN108334848B (en) Tiny face recognition method based on generation countermeasure network
Yang et al. Driver yawning detection based on subtle facial action recognition
Pitaloka et al. Enhancing CNN with preprocessing stage in automatic emotion recognition
CN108062525B (en) Deep learning hand detection method based on hand region prediction
JP5675229B2 (en) Image processing apparatus and image processing method
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
WO2017079521A1 (en) Cascaded neural network with scale dependent pooling for object detection
CN107808376B (en) Hand raising detection method based on deep learning
KR20160096460A (en) Recognition system based on deep learning including a plurality of classfier and control method thereof
US11853892B2 (en) Learning to segment via cut-and-paste
CN111428664B (en) Computer vision real-time multi-person gesture estimation method based on deep learning technology
CN109063626B (en) Dynamic face recognition method and device
CN113297956B (en) Gesture recognition method and system based on vision
CN113435319B (en) Classification method combining multi-target tracking and pedestrian angle recognition
CN112084952B (en) Video point location tracking method based on self-supervision training
CN113591719A (en) Method and device for detecting text with any shape in natural scene and training method
Diaz et al. Detecting dynamic objects with multi-view background subtraction
CN113887468B (en) Single-view human-object interaction identification method of three-stage network framework
CN116434311A (en) Facial expression recognition method and system based on mixed domain consistency constraint
CN110414430B (en) Pedestrian re-identification method and device based on multi-proportion fusion
CN111582057B (en) Face verification method based on local receptive field
CN112446292A (en) 2D image salient target detection method and system
CN116682178A (en) Multi-person gesture detection method in dense scene
CN114663910A (en) Multi-mode learning state analysis system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant