CN114973333A - Human interaction detection method, human interaction detection device, human interaction detection equipment and storage medium - Google Patents

Human interaction detection method, human interaction detection device, human interaction detection equipment and storage medium

Info

Publication number
CN114973333A
CN114973333A (application CN202210828498.5A)
Authority
CN
China
Prior art keywords
decoder
loss function
interaction
human
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210828498.5A
Other languages
Chinese (zh)
Other versions
CN114973333B (en)
Inventor
周德森
王健
孙昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210828498.5A priority Critical patent/CN114973333B/en
Publication of CN114973333A publication Critical patent/CN114973333A/en
Application granted granted Critical
Publication of CN114973333B publication Critical patent/CN114973333B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a human interaction detection method, apparatus, device, storage medium, and program product, which relate to the technical field of artificial intelligence, specifically to the technical fields of image processing, computer vision, deep learning, and the like, and which can be applied to scenes such as smart cities. One embodiment of the method comprises: extracting global features of an image to be detected; inputting the global features into a pre-trained base decoder to obtain a plurality of candidate triples; inputting the candidate triples into a pre-trained object detection decoder and a pre-trained interaction decoder, respectively, to obtain a plurality of pairs of human-object interaction information and a plurality of pieces of interaction action information; and correspondingly combining the pairs of human-object interaction information and the pieces of interaction action information to obtain a plurality of <human, object, action> triples. By introducing a base decoder, this embodiment eliminates matching errors and models the joint distribution of the detection and interaction tasks.

Description

Human interaction detection method, human interaction detection device, human interaction detection equipment and storage medium
Technical Field
The present disclosure relates to the technical field of artificial intelligence, and specifically to the technical fields of image processing, computer vision, deep learning, and the like, and can be applied to scenes such as smart cities.
Background
Human-object interaction detection locates all interacting people and objects in an image, together with their action relationships. It is widely applied in the field of video surveillance, where it can be used to classify and monitor human behavior.
Current human-object interaction detection methods mainly comprise two-stage methods and one-stage methods. Two-stage methods adopt a detect-then-classify strategy, whereas one-stage methods directly predict the <human, object, action> triplet in a single pass.
Disclosure of Invention
The embodiment of the disclosure provides a person interaction detection method, a person interaction detection device, a person interaction detection apparatus, a storage medium and a program product.
In a first aspect, an embodiment of the present disclosure provides a human interaction detection method, including: extracting global features of an image to be detected; inputting the global features into a pre-trained base decoder to obtain a plurality of candidate triples; inputting the candidate triples into a pre-trained object detection decoder and a pre-trained interaction decoder, respectively, to obtain a plurality of pairs of human-object interaction information and a plurality of pieces of interaction action information; and correspondingly combining the pairs of human-object interaction information and the pieces of interaction action information to obtain a plurality of <human, object, action> triples.
In a second aspect, an embodiment of the present disclosure provides a human interaction detection apparatus, including: an extraction module configured to extract global features of an image to be detected; a first decoding module configured to input the global features to a pre-trained base decoder to obtain a plurality of candidate triples; a second decoding module configured to input the candidate triples to a pre-trained object detection decoder and a pre-trained interaction decoder, respectively, to obtain a plurality of pairs of human-object interaction information and a plurality of pieces of interaction action information; and a combining module configured to correspondingly combine the pairs of human-object interaction information and the pieces of interaction action information to obtain a plurality of <human, object, action> triples.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.
In a fourth aspect, the disclosed embodiments propose a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described in any one of the implementations of the first aspect.
In a fifth aspect, the present disclosure provides a computer program product including a computer program, which when executed by a processor implements the method as described in any implementation manner of the first aspect.
According to the human interaction detection method provided by the embodiments of the present disclosure, a base decoder is introduced to link the object detection decoder and the interaction decoder, so that a separate matching process is no longer needed. Meanwhile, the triplet representations produced by the base decoder can model the joint distribution of the detection task and the interaction task.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects, and advantages of the disclosure will become apparent from a reading of the following detailed description of non-limiting embodiments which proceeds with reference to the accompanying drawings. The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of one embodiment of a human interaction detection method according to the present disclosure;
FIG. 2 is a flow diagram of yet another embodiment of a human interaction detection method in accordance with the present disclosure;
FIG. 3 is a schematic diagram of a human interaction detection model;
FIG. 4 is a schematic structural diagram of one embodiment of a human interaction detection apparatus according to the present disclosure;
fig. 5 is a block diagram of an electronic device for implementing a human interaction detection method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows a flow 100 of one embodiment of a human interaction detection method according to the present disclosure. The human interaction detection method comprises the following steps:
step 101, extracting global features of an image to be detected.
In the present embodiment, the execution subject of the human interaction detection method may extract the global features of the image to be detected. The image to be detected can be any image which needs to detect human bodies, objects and action relations thereof. The global features may be overall attributes of the image to be detected, including but not limited to color features, texture features, shape features, and the like.
Step 102, inputting the global features into a pre-trained basic decoder to obtain a plurality of candidate triples.
In this embodiment, the execution subject may input the global features to a pre-trained base decoder to obtain a plurality of candidate triples.
The base decoder can decode the input global features into triplet representations. Specifically, the base decoder may perform feature extraction by using a plurality of triplet queries to obtain a plurality of candidate triples, where one triplet query corresponds to one candidate triplet. A candidate triplet may be a <human, object, action> triplet.
In some embodiments, the base decoder may include a first preset number of decoder layers, each of which may include an attention layer, a self-attention layer, a forward layer, and the like, for performing one decoding operation.
It should be noted that the number of decoder layers included in the base decoder can be set as needed. In general, the more complex the application scenario, the greater the number of decoder layers the base decoder comprises. In the present embodiment, the number of decoder layers may be set to 2, for example. The number of triplet queries of the base decoder may also be set as required, and in the present embodiment, the number of triplet queries may be set to 100, for example.
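As an illustration, the sketch below builds such a base decoder from standard PyTorch Transformer decoder layers, using the example values above (2 decoder layers, 100 triplet queries). The class name, dimensions, and the use of nn.TransformerDecoderLayer are assumptions made for the sketch, not the exact implementation of the embodiments.

```python
import torch
import torch.nn as nn

class BaseDecoder(nn.Module):
    """Decodes global image features into coarse candidate triplet representations."""

    def __init__(self, d_model=256, num_layers=2, num_queries=100, nhead=8):
        super().__init__()
        # Learnable triplet queries: each query yields one candidate <human, object, action> triplet.
        self.triplet_queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, batch_first=True)
        self.layers = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, global_features):
        # global_features: (batch, num_tokens, d_model) produced by the image encoder
        batch = global_features.size(0)
        queries = self.triplet_queries.weight.unsqueeze(0).expand(batch, -1, -1)
        # Each output embedding is one coarse candidate triplet representation.
        return self.layers(queries, global_features)
```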
Step 103, respectively inputting the candidate triples into a pre-trained object detection decoder and a pre-trained interaction decoder to obtain a plurality of pairs of human-object interaction information and a plurality of pieces of interaction action information.
In this embodiment, the execution subject may input the plurality of candidate triples to a pre-trained object detection decoder and a pre-trained interaction decoder, respectively, to obtain a plurality of pairs of human-object interaction information and a plurality of pieces of interaction action information.
The object detection decoder can decode the input candidate triples into human-object interaction information, and the interaction decoder can decode them into interaction action information. Specifically, the object detection decoder may perform its lookup using a set of queries, each of which detects a human-object interaction pair rather than an independent object or human body. Similarly, the interaction decoder may perform its lookup using another set of queries, each of which detects an interaction action. The human-object interaction information may include the position of the human body, the position of the object, the object category, and the like. The interaction action information may include the category of the interaction action.
In some embodiments, the object detection decoder may include a second preset number of decoder layers, the interactive decoder may include a third preset number of decoder layers, each of which may include an interactive attention layer, a self-attention layer, a forward layer, and the like, for performing one decoding operation.
It should be noted that the number of decoder layers included in the object detection decoder and the interactive decoder may be set as needed. In general, the more complex the application scenario, the greater the number of decoder layers the object detection decoder and the interactive decoder comprise. In the present embodiment, the number of decoder layers of the object detection decoder may be set to 4, for example, and the number of decoder layers of the interactive decoder may also be set to 4, for example.
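The two downstream decoders can be sketched in the same way; only the layer count (4 each in this example) and the prediction heads attached on top differ. The head sizes below (80 object classes, 117 action classes) are illustrative values, not taken from the embodiments.

```python
import torch.nn as nn

class RefinementDecoder(nn.Module):
    """Shared sketch for the object detection decoder and the interaction decoder."""

    def __init__(self, d_model=256, num_layers=4, nhead=8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, batch_first=True)
        self.layers = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, candidate_triplets, global_features):
        # The candidate triples from the base decoder serve as the query initialization.
        return self.layers(candidate_triplets, global_features)

d_model, num_object_classes, num_action_classes = 256, 80, 117  # illustrative sizes
object_detection_decoder = RefinementDecoder(num_layers=4)
interaction_decoder = RefinementDecoder(num_layers=4)
human_box_head = nn.Linear(d_model, 4)                           # position of the human body
object_box_head = nn.Linear(d_model, 4)                          # position of the object
object_class_head = nn.Linear(d_model, num_object_classes + 1)   # object category (+ "no object")
action_class_head = nn.Linear(d_model, num_action_classes)       # interaction action category
```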
Step 104, correspondingly combining the pairs of human-object interaction information and the pieces of interaction action information to obtain a plurality of <human, object, action> triples.
In this embodiment, the execution subject may correspondingly combine the pairs of human-object interaction information and the pieces of interaction action information to obtain a plurality of <human, object, action> triples.
In general, the human-object interaction information and the interaction action information that come from the same candidate triplet are combined to generate one <human, object, action> triplet.
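A minimal sketch of this index-wise combination follows, using hypothetical plain data containers in place of the real decoder outputs:

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class HumanObjectPair:           # one output of the object detection decoder
    human_box: Sequence[float]
    object_box: Sequence[float]
    object_class: int

@dataclass
class HOITriplet:                # one final <human, object, action> triplet
    human_box: Sequence[float]
    object_box: Sequence[float]
    object_class: int
    action_class: int

def combine(pairs: List[HumanObjectPair], actions: List[int]) -> List[HOITriplet]:
    # The i-th pair and the i-th action both come from the i-th candidate triplet,
    # so they are combined positionally; no extra matching step is needed.
    return [HOITriplet(p.human_box, p.object_box, p.object_class, a)
            for p, a in zip(pairs, actions)]
```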
Because the human body and object detected by the object detection decoder and the interaction action detected by the interaction decoder both correspond to the same candidate triplet of the base decoder, the human-object interaction information and the interaction action information are already matched; no additional matching process is required, and errors caused by matching are eliminated.
The human interaction detection method provided by this embodiment of the present disclosure thus solves the problems of matching errors and joint distribution. Its core is to introduce a base decoder that links the object detection decoder with the interaction decoder, so that the matching process is omitted. Meanwhile, the triplet representations produced by the base decoder can model the joint distribution of the detection task and the interaction task.
With continued reference to fig. 2, a flow 200 of yet another embodiment of a human interaction detection method in accordance with the present disclosure is shown. The human interaction detection method comprises the following steps:
step 201, inputting an image to be detected into a residual error network to obtain pixel characteristics of the image to be detected.
In this embodiment, the main executing body of the human interaction detection method may input the image to be detected to the residual error network, so as to obtain the pixel characteristics of the image to be detected.
Here, the extraction of the image pixel features is realized by using a Residual Network, and the Residual Network may be selected from a ResNet (Residual Network) 50, a ResNet101, or the like. The pixel characteristics can be the attributes of the pixel points of the image to be detected, and are usually expressed in a matrix form.
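For example, the pixel features could be extracted with a torchvision ResNet-50 backbone (assuming torchvision 0.13 or later; ResNet-101 would be used the same way):

```python
import torch
import torchvision

# ResNet-50 with the average pooling and classification layers removed,
# keeping only the convolutional stages that produce the feature map.
backbone = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 512, 512)      # image to be detected: (batch, channels, height, width)
pixel_features = backbone(image)         # (1, 2048, 16, 16) feature map in matrix form
```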
Step 202, inputting the pixel characteristics to an image encoder to obtain global characteristics.
In this embodiment, the execution subject may input the pixel feature to an image encoder, so as to obtain a global feature.
In general, a Transformer encoder can further encode the image pixel features to obtain the global features. Specifically, the Transformer encoder models the tokens formed from the image pixel features with a self-attention mechanism and obtains a global representation of the image. A token is usually represented as a vector and is obtained by flattening the image pixel features.
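A sketch of this encoding step, in which the pixel feature map is projected, flattened into tokens, and passed through a standard Transformer encoder; positional encodings are omitted for brevity, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

d_model = 256
# Project the backbone channels (2048 for ResNet-50) down to the model width.
input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, dim_feedforward=2048, batch_first=True)
image_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

pixel_features = torch.randn(1, 2048, 16, 16)                    # from the residual network
tokens = input_proj(pixel_features).flatten(2).transpose(1, 2)   # (batch, H*W tokens, d_model)
global_features = image_encoder(tokens)                          # global representation via self-attention
```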
Step 203, performing feature extraction on the global features by using the plurality of triplet queries of the base decoder to obtain a plurality of candidate triples.
In this embodiment, the execution subject may perform feature extraction on the global features by using the plurality of triplet queries of the base decoder to obtain a plurality of candidate triples.
Wherein the base decoder can decode the input global features into a coarse triplet representation. Specifically, the base decoder may perform feature extraction by using a plurality of triple queries (queries) to obtain a plurality of candidate triples, where one triple query corresponds to one candidate triple. The candidate triplet may be a coarse < human, object, action > triplet.
Each triplet query of the base decoder decodes one candidate triplet. To strengthen the feature representation, an auxiliary loss function is used for supervision and is applied to the output of each decoder layer of the base decoder. Specifically, the base decoder may be trained with a human detection frame loss function, an object detection frame loss function, and an action classification loss function. In some embodiments, a weighted sum of these three loss functions is computed to obtain the overall loss function of the base decoder, and the base decoder is trained based on this overall loss function. The human detection frame loss function characterizes the difference between the predicted human bounding box and the real human bounding box; for example, it may be obtained by computing a weighted sum of the absolute distance and the intersection-over-union of the predicted and real human bounding boxes. The object detection frame loss function characterizes the difference between the predicted object bounding box and the real object bounding box; for example, it may be obtained by computing a weighted sum of the absolute distance and the intersection-over-union of the predicted and real object bounding boxes, combined with the cross-entropy loss over the object categories. The action classification loss function characterizes the difference between the predicted action category and the real action category; for example, it may be obtained by computing a focal loss between the predicted and real action categories.
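The three loss terms could be assembled as in the sketch below, which uses L1 plus a generalized-IoU term for the boxes, cross-entropy for the object category, and torchvision's sigmoid focal loss for the actions; the loss weights and the exact IoU variant are assumptions, and all function names are hypothetical.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss, sigmoid_focal_loss

def box_loss(pred_boxes, gt_boxes, w_l1=5.0, w_iou=2.0):
    # Weighted sum of the absolute (L1) distance and an IoU-based term;
    # boxes are expected in (x1, y1, x2, y2) format.
    l1 = F.l1_loss(pred_boxes, gt_boxes)
    iou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    return w_l1 * l1 + w_iou * iou

def human_detection_frame_loss(pred_human_boxes, gt_human_boxes):
    return box_loss(pred_human_boxes, gt_human_boxes)

def object_detection_frame_loss(pred_object_boxes, gt_object_boxes, pred_obj_logits, gt_obj_labels):
    # Box regression combined with cross-entropy over the object categories.
    return box_loss(pred_object_boxes, gt_object_boxes) + F.cross_entropy(pred_obj_logits, gt_obj_labels)

def action_classification_loss(pred_action_logits, gt_action_targets):
    # Focal loss between the predicted and real (possibly multi-label) action categories.
    return sigmoid_focal_loss(pred_action_logits, gt_action_targets, reduction="mean")

def base_decoder_loss(outputs, targets, w_h=1.0, w_o=1.0, w_a=1.0):
    # Overall loss of the base decoder: weighted sum of the three terms,
    # applied to the output of each decoder layer.
    return (w_h * human_detection_frame_loss(outputs["human_boxes"], targets["human_boxes"])
            + w_o * object_detection_frame_loss(outputs["object_boxes"], targets["object_boxes"],
                                                outputs["object_logits"], targets["object_labels"])
            + w_a * action_classification_loss(outputs["action_logits"], targets["action_targets"]))
```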
Step 204, using the plurality of candidate triples as the initialization features of the object detection decoder and predicting the positions of the human bodies and objects of the candidate triples as well as the object categories.
In this embodiment, the execution subject may use the plurality of candidate triples as the initialization features of the object detection decoder to predict the positions of the human bodies and objects of the candidate triples and the object categories.
The object detection decoder decodes the input coarse candidate triples into refined positions of the human body and the object and into object categories. Specifically, the object detection decoder may perform its lookup using a set of queries, each of which detects a human-object interaction pair rather than an independent object or human body. Initializing its features with the candidate triples realizes the coarse-to-fine process.
In the object detection decoder, the auxiliary loss function is also adopted for supervision, and the auxiliary loss function is applied to each decoder layer of the object detection decoder. Specifically, the object detection decoder may be obtained by training using a human detection frame loss function and an object detection frame loss function. In some embodiments, a weighted sum of the human detection frame loss function and the object detection frame loss function is calculated to obtain a total loss function of the object detection decoder, and the object detection decoder is trained based on the total loss function of the object detection decoder.
Step 205, using the multiple candidate triples as the initialization feature of the interactive decoder, and predicting the interactive action categories of the multiple candidate triples.
In this embodiment, the execution subject may use a plurality of candidate triples as an initialization feature of the interactive decoder to predict the interactive action categories of the plurality of candidate triples.
The interaction decoder decodes the input coarse candidate triples into refined interaction action categories. Specifically, the interaction decoder may perform its lookup using another set of queries, each of which detects an interaction action. Initializing its features with the candidate triples realizes the coarse-to-fine process.
In the interaction decoder, an auxiliary loss function is likewise used for supervision and is applied to each decoder layer of the interaction decoder. Specifically, the interaction decoder may be trained with an action classification loss function.
It should be noted that, during model optimization, since the object detection decoder and the interaction decoder correspond to the same candidate triples in the base decoder, the outputs of the two decoders are first combined into new triples and then matched against the annotations via Hungarian matching.
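A simplified sketch of this matching step with SciPy's Hungarian algorithm; the cost terms are deliberately reduced placeholders (a real cost would mirror the loss functions above), and the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions_to_annotations(pred_boxes, pred_actions, gt_boxes, gt_actions):
    """pred_boxes / pred_actions are the per-query outputs of the object detection decoder
    and the interaction decoder, already combined into new triples by query index;
    gt_boxes / gt_actions are the annotated triples. Boxes are NumPy arrays of shape (N, 4)."""
    cost = np.zeros((len(pred_boxes), len(gt_boxes)))
    for i in range(len(pred_boxes)):
        for j in range(len(gt_boxes)):
            box_cost = np.abs(pred_boxes[i] - gt_boxes[j]).sum()   # simplified L1 box cost
            action_cost = float(pred_actions[i] != gt_actions[j])  # simplified action mismatch cost
            cost[i, j] = box_cost + action_cost
    rows, cols = linear_sum_assignment(cost)  # Hungarian matching: prediction i matched to annotation j
    return list(zip(rows.tolist(), cols.tolist()))
```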
Step 206, correspondingly combining the pairs of human-object interaction information and the pieces of interaction action information to obtain a plurality of <human, object, action> triples.
In this embodiment, the execution subject may correspondingly combine the pairs of human-object interaction information and the pieces of interaction action information to obtain a plurality of <human, object, action> triples.
Generally, the human-object interaction information and the interaction action information from the same candidate triplet are combined to generate a refined <human, object, action> triplet, completing the coarse-to-fine process.
As can be seen from fig. 2, compared with the embodiment corresponding to fig. 1, the flow 200 of the human interaction detection method in this embodiment highlights the decoding steps. The scheme described in this embodiment therefore provides a decoupled human interaction detection method that is optimized from coarse to fine, where the coarse-to-fine process is realized by feature initialization.
For ease of understanding, fig. 3 shows a schematic structural diagram of the human interaction detection model. As shown in fig. 3, the human interaction detection model includes a residual network 301, an image encoder 302, a base decoder 303, an object detection decoder 304, and an interaction decoder 305. The image is input to the residual network 301 to obtain pixel features. The pixel features are input to the image encoder 302 to obtain global features. The global features are input to the base decoder 303 to obtain coarse candidate triples. The candidate triples are then fed to the object detection decoder 304 and the interaction decoder 305 as initialization features: the object detection decoder 304 outputs the refined human body position, object position, and object category, while the interaction decoder 305 outputs the refined interaction action category.
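Tying the modules of fig. 3 together, a minimal forward pass could look as follows; all submodules refer to the hypothetical sketches above, and applying a sigmoid to the box heads is one common way to obtain normalized coordinates rather than a detail taken from the embodiments.

```python
import torch.nn as nn

class HOIDetectionModel(nn.Module):
    """Residual network -> image encoder -> base decoder -> two parallel decoders."""

    def __init__(self, backbone, image_encoder, base_decoder,
                 object_detection_decoder, interaction_decoder, heads: nn.ModuleDict):
        super().__init__()
        self.backbone = backbone                                   # 301: residual network
        self.image_encoder = image_encoder                         # 302: assumed to include projection,
                                                                   #      flattening, and the Transformer encoder
        self.base_decoder = base_decoder                           # 303: coarse candidate triples
        self.object_detection_decoder = object_detection_decoder   # 304: refined detection
        self.interaction_decoder = interaction_decoder             # 305: refined interaction
        self.heads = heads  # "human_box", "object_box", "object_class", "action_class"

    def forward(self, images):
        pixel_features = self.backbone(images)
        global_features = self.image_encoder(pixel_features)
        candidates = self.base_decoder(global_features)
        det = self.object_detection_decoder(candidates, global_features)
        inter = self.interaction_decoder(candidates, global_features)
        return {
            "human_boxes": self.heads["human_box"](det).sigmoid(),
            "object_boxes": self.heads["object_box"](det).sigmoid(),
            "object_logits": self.heads["object_class"](det),
            "action_logits": self.heads["action_class"](inter),
        }
```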
With further reference to fig. 4, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a human interaction detection apparatus, which corresponds to the method embodiment shown in fig. 1, and which is particularly applicable to various electronic devices.
As shown in fig. 4, the human interaction detection apparatus 400 of this embodiment may include: an extraction module 401, a first decoding module 402, a second decoding module 403 and a combining module 404. The extraction module 401 is configured to extract global features of an image to be detected; the first decoding module 402 is configured to input the global features to a pre-trained base decoder to obtain a plurality of candidate triples; the second decoding module 403 is configured to input the candidate triples to a pre-trained object detection decoder and a pre-trained interaction decoder, respectively, to obtain a plurality of pairs of human-object interaction information and a plurality of pieces of interaction action information; and the combining module 404 is configured to correspondingly combine the pairs of human-object interaction information and the pieces of interaction action information to obtain a plurality of <human, object, action> triples.
In this embodiment, for the specific processing and technical effects of the extraction module 401, the first decoding module 402, the second decoding module 403 and the combining module 404 of the human interaction detection apparatus 400, reference may be made to the related descriptions of steps 101 to 104 in the embodiment corresponding to fig. 1, which are not repeated here.
In some optional implementations of this embodiment, the base decoder includes a first preset number of decoder layers, the object detection decoder includes a second preset number of decoder layers, and the interactive decoder includes a third preset number of decoder layers, each decoder layer including an interactive attention layer, a self-attention layer, and a forward layer.
In some optional implementation manners of this embodiment, the basic decoder is obtained by training a human detection frame loss function, an object detection frame loss function, and an action classification loss function, the object detection decoder is obtained by training a human detection frame loss function and an object detection frame loss function, and the interactive decoder is obtained by training an action classification loss function, where the human detection frame loss function is used to represent a difference between a predicted human body bounding box and a real human body bounding box, the object detection frame loss function is used to represent a difference between the predicted object bounding box and the real object bounding box, and the action classification loss function is used to represent a difference between a predicted action category and a real action category.
In some optional implementation manners of this embodiment, the human detection frame loss function is obtained by calculating a weighted sum of absolute distances and intersection ratios of the predicted human body bounding box and the real human body bounding box, the object detection frame loss function is obtained by calculating a weighted sum of absolute distances and intersection ratios of the predicted object bounding box and the real object bounding box, and then combining cross entropy losses of the object classes, and the action classification loss function is obtained by calculating a focus loss of the predicted action class and the real action class.
In some optional implementations of this embodiment, the first decoding module 402 is further configured to: and performing feature extraction on the global features by utilizing a plurality of triple queries of the basic decoder to obtain a plurality of candidate triples.
In some optional implementations of this embodiment, the second decoding module 403 is further configured to: using the multiple candidate triples as the initialization characteristics of the object detection decoder, and predicting the positions and object types of the human bodies and the objects of the multiple candidate triples; and predicting the interactive action categories of the candidate triples by taking the candidate triples as the initialization characteristics of the interactive decoder.
In some optional implementations of this embodiment, the extraction module 401 is further configured to: inputting an image to be detected into a residual error network to obtain pixel characteristics of the image to be detected; and inputting the pixel characteristics to an image encoder to obtain global characteristics.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the apparatus 500 comprises a computing unit 501 which may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM)502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 501 executes the respective methods and processes described above, such as the human interaction detection method. For example, in some embodiments, the human interaction detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the human interaction detection method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the human interaction detection method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuits, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in this disclosure may be performed in parallel or sequentially or in a different order, as long as the desired results of the technical solutions provided by this disclosure can be achieved, and are not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A human interaction detection method comprises the following steps:
extracting global features of an image to be detected;
inputting the global features into a pre-trained basic decoder to obtain a plurality of candidate triples;
respectively inputting the candidate triples into a pre-trained object detection decoder and a pre-trained interactive decoder to obtain a plurality of pairs of human body object interaction information and a plurality of interaction action information;
and correspondingly combining the multiple pairs of human body object interaction information and the multiple interaction action information to obtain multiple human body object action triples.
2. The method of claim 1, wherein the base decoder comprises a first preset number of decoder layers, the object detection decoder comprises a second preset number of decoder layers, the interactive decoder comprises a third preset number of decoder layers, each decoder layer comprising an interactive attention layer, a self-attention layer, and a forward layer.
3. The method according to claim 1 or 2, wherein the base decoder is obtained by training with a human detection frame loss function, an object detection frame loss function and an action classification loss function, the object detection decoder is obtained by training with a human detection frame loss function and an object detection frame loss function, and the interactive decoder is obtained by training with an action classification loss function, wherein the human detection frame loss function is used for representing the difference between a predicted human body bounding box and a real human body bounding box, the object detection frame loss function is used for representing the difference between the predicted object bounding box and the real object bounding box, and the action classification loss function is used for representing the difference between a predicted action category and the real action category.
4. The method of claim 3, wherein the human detection frame loss function is obtained by calculating a weighted sum of absolute distance and cross-over ratio of the predicted human bounding box and the real human bounding box, the object detection frame loss function is obtained by calculating a weighted sum of absolute distance and cross-over ratio of the predicted object bounding box and the real object bounding box, and combining cross-entropy loss of object classes, and the motion classification loss function is obtained by calculating focus loss of the predicted motion classes and the real motion classes.
5. The method of any of claims 1-4, wherein the inputting the global features to a base decoder, resulting in a plurality of candidate triples, comprises:
and performing feature extraction on the global features by utilizing the multiple triple queries of the basic decoder to obtain the multiple candidate triples.
6. The method according to any one of claims 1-5, wherein the inputting the plurality of candidate triples to an object detection decoder and an interaction decoder, respectively, results in a plurality of pairs of human object interaction information and a plurality of interaction information, comprises:
using the multiple candidate triples as the initialization features of the object detection decoder, and predicting positions and object types of human bodies and objects of the multiple candidate triples;
and predicting the interaction action categories of the candidate triples by taking the candidate triples as the initialization characteristics of the interaction decoder.
7. The method according to any one of claims 1-6, wherein said extracting global features of the image to be detected comprises:
inputting the image to be detected into a residual error network to obtain the pixel characteristics of the image to be detected;
and inputting the pixel characteristics to an image encoder to obtain the global characteristics.
8. A human interaction detection apparatus, comprising:
the extraction module is configured to extract the global features of the image to be detected;
a first decoding module configured to input the global features to a pre-trained base decoder, resulting in a plurality of candidate triples;
the second decoding module is configured to input the candidate triples into a pre-trained object detection decoder and a pre-trained interactive decoder respectively to obtain a plurality of pairs of human body interaction information and a plurality of interaction action information;
and the combination module is configured to correspondingly combine the plurality of pairs of human body object interaction information and the plurality of interaction action information to obtain a plurality of human body object action triples.
9. The apparatus of claim 8, wherein the base decoder comprises a first preset number of decoder layers, the object detection decoder comprises a second preset number of decoder layers, the interactive decoder comprises a third preset number of decoder layers, each decoder layer comprising an interactive attention layer, a self-attention layer, and a forward layer.
10. The apparatus according to claim 8 or 9, wherein the base decoder is obtained by training with a human detection frame loss function, an object detection frame loss function, and an action classification loss function, the object detection decoder is obtained by training with a human detection frame loss function and an object detection frame loss function, and the interactive decoder is obtained by training with an action classification loss function, wherein the human detection frame loss function is used for representing a difference between a predicted human body bounding box and a real human body bounding box, the object detection frame loss function is used for representing a difference between the predicted object bounding box and the real object bounding box, and the action classification loss function is used for representing a difference between a predicted action category and a real action category.
11. The apparatus of claim 10, wherein the human detection frame loss function is obtained by calculating a weighted sum of absolute distance and cross-over ratio of the predicted human bounding box and the real human bounding box, the object detection frame loss function is obtained by calculating a weighted sum of absolute distance and cross-over ratio of the predicted object bounding box and the real object bounding box, and combining cross-entropy loss of object classes, and the motion classification loss function is obtained by calculating focus loss of the predicted motion classes and the real motion classes.
12. The apparatus of any of claims 8-11, wherein the first decoding module is further configured to:
and performing feature extraction on the global features by utilizing the multiple triple queries of the basic decoder to obtain the multiple candidate triples.
13. The apparatus of any of claims 8-12, wherein the second decoding module is further configured to:
using the multiple candidate triples as the initialization features of the object detection decoder, and predicting positions and object types of human bodies and objects of the multiple candidate triples;
and predicting the interaction action categories of the candidate triples by taking the candidate triples as the initialization characteristics of the interaction decoder.
14. The apparatus of any one of claims 8-13, wherein the extraction module is further configured to:
inputting the image to be detected into a residual error network to obtain the pixel characteristics of the image to be detected;
and inputting the pixel characteristics to an image encoder to obtain the global characteristics.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202210828498.5A 2022-07-13 2022-07-13 Character interaction detection method, device, equipment and storage medium Active CN114973333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210828498.5A CN114973333B (en) 2022-07-13 2022-07-13 Character interaction detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210828498.5A CN114973333B (en) 2022-07-13 2022-07-13 Character interaction detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114973333A true CN114973333A (en) 2022-08-30
CN114973333B CN114973333B (en) 2023-07-25

Family

ID=82968582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210828498.5A Active CN114973333B (en) 2022-07-13 2022-07-13 Character interaction detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114973333B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953589A (en) * 2024-03-27 2024-04-30 武汉工程大学 Interactive action detection method, system, equipment and medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020186221A1 (en) * 2001-06-05 2002-12-12 Reactrix Systems, Inc. Interactive video display system
US20200074665A1 (en) * 2018-09-03 2020-03-05 Baidu Online Network Technology (Beijing) Co., Ltd. Object detection method, device, apparatus and computer-readable storage medium
US20200193152A1 (en) * 2018-12-14 2020-06-18 Microsoft Technology Licensing, Llc Human Pose Estimation
US20210397828A1 (en) * 2020-06-18 2021-12-23 Institute Of Automation, Chinese Academy Of Sciences Bi-directional interaction network (binet)-based person search method, system, and apparatus
CN114005178A (en) * 2021-10-29 2022-02-01 北京百度网讯科技有限公司 Human interaction detection method, neural network and training method, device and medium thereof
CN114004985A (en) * 2021-10-29 2022-02-01 北京百度网讯科技有限公司 Human interaction detection method, neural network and training method, device and medium thereof
CN114550223A (en) * 2022-04-25 2022-05-27 中国科学院自动化研究所 Person interaction detection method and device and electronic equipment
CN114663915A (en) * 2022-03-04 2022-06-24 西安交通大学 Image human-object interaction positioning method and system based on Transformer model
CN114721509A (en) * 2022-03-08 2022-07-08 五邑大学 Human body action recognition-based human-computer interaction method and system

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020186221A1 (en) * 2001-06-05 2002-12-12 Reactrix Systems, Inc. Interactive video display system
US20200074665A1 (en) * 2018-09-03 2020-03-05 Baidu Online Network Technology (Beijing) Co., Ltd. Object detection method, device, apparatus and computer-readable storage medium
US20200193152A1 (en) * 2018-12-14 2020-06-18 Microsoft Technology Licensing, Llc Human Pose Estimation
US20210397828A1 (en) * 2020-06-18 2021-12-23 Institute Of Automation, Chinese Academy Of Sciences Bi-directional interaction network (binet)-based person search method, system, and apparatus
CN114005178A (en) * 2021-10-29 2022-02-01 北京百度网讯科技有限公司 Human interaction detection method, neural network and training method, device and medium thereof
CN114004985A (en) * 2021-10-29 2022-02-01 北京百度网讯科技有限公司 Human interaction detection method, neural network and training method, device and medium thereof
CN114663915A (en) * 2022-03-04 2022-06-24 西安交通大学 Image human-object interaction positioning method and system based on Transformer model
CN114721509A (en) * 2022-03-08 2022-07-08 五邑大学 Human body action recognition-based human-computer interaction method and system
CN114550223A (en) * 2022-04-25 2022-05-27 中国科学院自动化研究所 Person interaction detection method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姚义; 王诗珂; 陈希豪; 林宇翩: "基于深度学习的结构化图像标注研究" [Research on structured image annotation based on deep learning], 电脑知识与技术 [Computer Knowledge and Technology], no. 33

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953589A (en) * 2024-03-27 2024-04-30 武汉工程大学 Interactive action detection method, system, equipment and medium

Also Published As

Publication number Publication date
CN114973333B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN113657390B (en) Training method of text detection model and text detection method, device and equipment
CN113222916B (en) Method, apparatus, device and medium for detecting image using object detection model
CN113780098A (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN114863437B (en) Text recognition method and device, electronic equipment and storage medium
CN113360700A (en) Method, device, equipment and medium for training image-text retrieval model and image-text retrieval
CN116152833B (en) Training method of form restoration model based on image and form restoration method
CN113901909A (en) Video-based target detection method and device, electronic equipment and storage medium
CN112749300A (en) Method, apparatus, device, storage medium and program product for video classification
CN114821063A (en) Semantic segmentation model generation method and device and image processing method
CN113869205A (en) Object detection method and device, electronic equipment and storage medium
CN113887615A (en) Image processing method, apparatus, device and medium
CN116363459A (en) Target detection method, model training method, device, electronic equipment and medium
CN114511743B (en) Detection model training, target detection method, device, equipment, medium and product
CN114220163B (en) Human body posture estimation method and device, electronic equipment and storage medium
CN115359308A (en) Model training method, apparatus, device, storage medium, and program for identifying difficult cases
CN114581732A (en) Image processing and model training method, device, equipment and storage medium
CN114495101A (en) Text detection method, and training method and device of text detection network
CN114973333B (en) Character interaction detection method, device, equipment and storage medium
CN113360683A (en) Method for training cross-modal retrieval model and cross-modal retrieval method and device
CN115097941B (en) Character interaction detection method, device, equipment and storage medium
CN114419327B (en) Image detection method and training method and device of image detection model
CN112560848B (en) Training method and device for POI (Point of interest) pre-training model and electronic equipment
CN113806541A (en) Emotion classification method and emotion classification model training method and device
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN114330576A (en) Model processing method and device, and image recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant