WO2023273334A1

WO2023273334A1 - Behavior recognition method and apparatus, and electronic device, computer-readable storage medium, computer program and computer program product

Info

Publication number: WO2023273334A1
Application number: PCT/CN2022/074120
Authority: WO
Inventors: 王浩然; 纪德益
Original assignee: 上海商汤智能科技有限公司
Priority date: 2021-07-02
Filing date: 2022-01-26
Publication date: 2023-01-05
Also published as: CN113469056A

Abstract

Disclosed in the embodiments of the present disclosure are a recognition method and apparatus, and an electronic device, a computer-readable storage medium, a computer program and a computer program product. The method comprises: performing detection on each image to be subjected to detection, so as to obtain features of a plurality of objects, and encoding the features to obtain multi-dimensional features corresponding to the plurality of objects on a one-to-one basis; on the basis of some of the features of each object in each group of objects, determining spatial results of at least two categories of objects in each group of objects, and an action result of each object; on the basis of the multi-dimensional features, determining relationship interaction features of each group of objects, and where it is determined, according to the relationship interaction features, that the objects in each group of objects are associated with each other, determining target results of each group of objects on the basis of the spatial results and the action results, so as to obtain at least one target result; and on the basis of the at least one target result, determining object behavior in each image to be subjected to detection.

Description

Behavior recognition method, device, electronic device, computer readable storage medium, computer program and computer program product

Cross References to Related Applications

This disclosure is based on the Chinese patent application with the application number 202110750749.8, the application date is July 02, 2021, and the application name is "behavior recognition method, device, electronic equipment and computer-readable storage medium", and requires the Chinese patent application Priority, the entire content of the Chinese patent application is hereby incorporated by reference into this disclosure.

technical field

The present disclosure relates to the technical field of computer vision, and in particular to a behavior recognition method, device, electronic equipment, computer-readable storage medium, computer program and computer program product.

Background technique

Human interaction behavior detection is an important task for understanding how people and objects interact. Human-object interaction (HOI) behavior detection aims to localize and classify triplets of human, object, and human-object relationship from an input image. Detecting human-object interactions can enable well-designed algorithms to generate better descriptions of scenes.

However, when the related technology is used to detect the interaction behavior of characters, the detection efficiency and accuracy are low, resulting in poor detection effect and low detection efficiency of the interaction behavior of the characters.

Contents of the invention

Embodiments of the present disclosure provide a behavior recognition method, device, electronic equipment, computer-readable storage medium, computer program and computer program product, which can improve the recognition accuracy and recognition efficiency of character interaction behaviors.

The technical scheme of the embodiment of the present disclosure is realized in this way:

An embodiment of the present disclosure provides a behavior recognition method, including: detecting each image to be detected to obtain features of multiple objects, encoding the features, and obtaining multi-dimensional features corresponding to the multiple objects respectively; Based on some of the features of each group member object of each group of objects, determine the spatial results of at least two types of objects in each group of objects, and the action results of each of the group member objects, wherein the Each group of objects at least includes: objects classified as objects and objects classified as people among the plurality of objects; based on the multidimensional features, determine the relationship interaction characteristics of each group of objects, and interact according to the relationship The feature is that when it is determined that the group member objects in each group of objects are related to each other, based on the space result and the action result, determine the target result of each group of objects, and obtain at least one of the group members The target result; based on at least one target result, determine the object behavior in each image to be detected.

An embodiment of the present disclosure provides a behavior recognition device, including: an encoding part, configured to detect each image to be detected to obtain the features of multiple objects, encode the features, and obtain the features corresponding to the multiple objects respectively. A corresponding multi-dimensional feature; the result determination part is configured to determine the spatial results of at least two types of objects of each group of objects based on some features of the features of each group member object of each group of objects, and each The action results of the group member objects, wherein each group of objects at least includes: objects classified as objects and objects classified as people among the plurality of objects; The relationship interaction features of the objects, and in the case of determining that the member objects in each group of objects are related to each other according to the relationship interaction features, based on the spatial result and the action result, determine the The target result of each group of objects is obtained to obtain at least one target result; the behavior determination part is configured to determine the object behavior in each image to be detected based on the at least one target result.

In the above device, the result determining part is further configured to generate a fully connected graph corresponding to the multiple objects based on the multi-dimensional features corresponding to the multiple objects respectively; The multi-dimensional features corresponding to the objects one-to-one, and the fully connected graph, perform graph convolution processing to obtain the updated multi-dimensional features corresponding to each of the objects; according to each of the objects in each group The updated multi-dimensional features of each group member object are obtained to obtain the relationship interaction feature of each group of objects.

In the above device, the result determination part is further configured to classify each group of objects according to the relationship interaction feature, and obtain the interaction result of each group of objects; when the interaction result is greater than or equal to In the case of the first preset score threshold, it is determined that the group member objects in each group of objects are related to each other.

In the above device, the result determining part is further configured to update the multi-dimensional feature of each group member object based on the relationship interaction feature of each group of objects and preset parameters, to obtain The refined features of each group member object, and based on the refined features, determine the graph interaction features of each group of objects; based on the graph interaction features, classify each group of objects to obtain Graph relationship results; based on the spatial results, the action results, the interaction results, the graph relationship results, and the confidence results obtained when performing the detection on each of the team member objects, determine each of the The target outcome for a set of objects.

In the above device, the target result is a target value; the behavior determination part is further configured to, according to at least one of the target values, select from a plurality of associated object groups corresponding to at least one of the target values Finding an associated object group corresponding to the highest target value, and identifying behaviors among the group member objects in the associated object group.

In the above device, the fully connected graph is represented by an adjacency matrix, and each data in the adjacency matrix represents the degree of association between two corresponding objects; the result determining part is further configured to The multi-dimensional features corresponding to each of the objects one-to-one, iterate the multi-dimensional features of each of the objects through a graph neural network, and obtain the updated multi-dimensional features corresponding to each of the objects .

In the above device, the two objects include: a first object and a second object; the result determination part is further configured to determine the multidimensional features of the first object and the multidimensional features of the second object The similarity between features; based on the positional features of the first object in each of the images to be detected and the positional features of the second object in each of the images to be detected, determine the first A distance between the object and the second object; based on the similarity and the distance, determine the degree of association between the first object and the second object.

In the above device, the result determining part is further configured to, based on the update parameter, the adjacency matrix, the first weight parameter corresponding to the number of iterations, and the multi-dimensional features corresponding to each of the objects, for each Iteratively updating the multi-dimensional features of each of the objects, and when the number of iterations reaches the first preset number of times, the features generated after the first preset number of times are used as the update of each of the objects. The subsequent multidimensional features.

In the above device, the preset parameters include: a second weight parameter and the number of iterations; the result determination part is further configured to, based on the second weight parameter and the relationship interaction feature of each group of objects, Iteratively updating the multi-dimensional features of each of the group member objects, and when the number of iterations reaches a second preset number of times, using the features generated after the second preset number of times as each of the group members The refinement characteristics of the employee object.

In the above device, the detection includes: image detection and word vector detection; the encoding part is further configured to encode the positional features corresponding to the plurality of objects respectively to obtain the first feature of each object ; The visual features corresponding to the plurality of objects are encoded to obtain the second feature of each of the objects; the position feature and the visual feature are image detection for each of the images to be detected Obtained; The word vector feature corresponding to each of the multiple objects is encoded to obtain the third feature of each of the objects; the word vector feature is the category information of each of the objects, and the word obtained by vector detection; the category information is obtained by performing image detection on each of the images to be detected; according to the first feature, the second feature and the third feature, a One corresponding to the multi-dimensional feature; wherein, the dimensions of the first feature, the second feature and the third feature are the same.

In the above device, the encoding part is further configured to perform dimension transformation processing on the visual features corresponding to the plurality of objects respectively, so as to obtain the dimensionally transformed visual features of each of the objects; The dimensionally transformed visual features are encoded to obtain the second features of each of the objects.

In the above device, the partial features include: the positional features and visual features of each of the team member objects; the positional features and the visual features are obtained by performing image detection on each of the images to be detected; the The result determination part is further configured to determine the image area of each of the group member objects in each of the images to be detected based on the position characteristics of each of the group member objects of each group of objects; According to the image area of each of the group member objects, the image area corresponding to each group of objects is obtained, and the image area corresponding to each group of objects is encoded to obtain two-dimensional feature data; The two-dimensional feature data and the visual features of each member object are subjected to feature processing, corresponding to the processed two-dimensional feature data and the processed visual features; according to the processed two-dimensional feature data , classify each group of objects, obtain the spatial result of each group of objects, and classify each of the group member objects according to the processed visual features, and obtain each of the The result of the described action on the member object.

In the above device, the device further includes a detection part configured to perform image detection on each of the images to be detected, and obtain the positional features, visual features, confidence results of each detected target, and Corresponding category information: taking the target whose confidence result is greater than or equal to a second preset score threshold as the detected object, and obtaining the position feature and the visual feature corresponding to the plurality of objects respectively, and the category information; performing word vector detection on the category information of each object to obtain the word vector features of each object.

An embodiment of the present disclosure provides an electronic device, including: a memory configured to store an executable computer program; a processor configured to implement the above behavior recognition method when executing the executable computer program stored in the memory.

An embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored for causing a processor to execute the above-mentioned behavior recognition method.

An embodiment of the present disclosure provides a computer program, including computer readable codes. When the computer readable codes run in an electronic device, a processor in the electronic device executes the method for implementing the above behavior recognition method. step.

An embodiment of the present disclosure provides a computer program product, including computer program instructions, which enable a computer to execute the steps of the above-mentioned behavior recognition method.

The behavior recognition method, device, electronic equipment, computer-readable storage medium, computer program, and computer program product provided by the embodiments of the present disclosure obtain the features of multiple objects by detecting each image to be detected, and encode the obtained features , to obtain the multi-dimensional features corresponding to each object; based on some features in the features of each group member object of each group of objects, determine the spatial results of at least two types of objects in each group of objects, and the spatial results of each group member object The action result, wherein each group of objects at least includes: objects classified as objects and objects classified as people among the multiple objects; then, based on the multi-dimensional features corresponding to the multiple objects respectively, the relationship of each group of objects is determined Interaction features, and in the case of determining that the member objects in each group of objects are related to each other based on the relationship interaction characteristics, based on the spatial result and the action result, determine the target result of each group of objects, so as to obtain at least one target result ; Finally, based on the obtained at least one target result, determine the object behavior in the image to be detected. Since the embodiment of the present disclosure first determines whether the group member objects in each group of objects are related to each other, and then uses the group of group member objects related to each other to determine the object behavior in the image to be detected, so the relationship between group member objects is filtered out. The groups are not related to each other, so that when determining the behavior of the object in the image to be detected, the factors that interfere with the determination result are reduced, and at the same time, the amount of data required for calculation is reduced, thereby improving the recognition accuracy when recognizing the interaction behavior of people degree and recognition efficiency.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Description of drawings

The accompanying drawings here are incorporated into the description and constitute a part of the present description. These drawings show embodiments consistent with the present disclosure, and are used together with the description to explain the technical solution of the present disclosure.

FIG. 1A is a schematic diagram of an exemplary image to be detected provided by an embodiment of the present disclosure;

FIG. 1B is an exemplary schematic diagram of another image to be detected provided by an embodiment of the present disclosure;

FIG. 2 is an optional schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure;

FIG. 3 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure;

FIG. 4 is an optional schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure;

FIG. 5 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure;

FIG. 6 is an optional schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure;

FIG. 7 is an optional schematic flow chart of a behavior recognition method provided by an embodiment of the present disclosure;

FIG. 8 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure;

FIG. 9 is an optional schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure;

FIG. 10 is a schematic partial flow diagram of an exemplary method of identifying an object's behavior in an image to be detected using a behavior recognition method provided by an embodiment of the present disclosure;

FIG. 11 is a schematic structural diagram of an identification device provided by an embodiment of the present disclosure;

Fig. 12 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.

detailed description

In order to make the purpose, technical solutions and advantages of the present disclosure clearer, the present disclosure will be further described in detail below in conjunction with the accompanying drawings. All other embodiments obtained under the premise of creative labor belong to the protection scope of the present disclosure.

Human interaction behavior detection is an important task for understanding how people and objects interact. Human-object interaction detection aims to localize and classify triplets of people, objects, and human-object relationships based on input images. Detecting human-object interactions can enable well-designed algorithms to generate better descriptions of scenes. For example, FIG. 1A is a schematic diagram of an exemplary image to be detected provided by an embodiment of the present disclosure. As shown in FIG. 1A, two objects, a human and an elephant, are detected from the image, and each object is Using the annotation box annotation, by detecting the interaction between people and objects, a better description generated for the behavior in this image should be "man riding an elephant" rather than "man and elephant". At present, in related technologies, this task is regarded as a one-stage classification problem. For example, for a picture, first detect all the people and objects in the picture, and then perform Classification, so as to predict the interaction behavior and score of each pair of people and objects, and finally judge the interaction behavior contained in a picture through the score threshold. However, this method of directly predicting all combinations cannot remove negative sample pairs, which is prone to misjudgment. For example, FIG. 1B is an exemplary schematic diagram of another image to be detected provided by an embodiment of the present disclosure. As shown in FIG. 1B , people, tables, and teacups are all detected, and each object is marked with a label frame. As shown in Figure 1B, the person and the teacup are a pair of negative samples. That is to say, although the person and the teacup are not in contact, there is still a high probability that they will be predicted as drinking when the person and the teacup are paired. Tea behavior, thereby affecting the accuracy of the final prediction results.

Based on this, an embodiment of the present disclosure provides a behavior recognition method, which can reduce negative sample pairs, thereby improving recognition accuracy and recognition efficiency of human interaction behaviors. The behavior recognition method provided by the embodiment of the present disclosure is applied to an electronic device. The following describes exemplary applications of the electronic equipment provided by the embodiments of the present disclosure. The electronic equipment provided by the embodiments of the present disclosure can be implemented as AR (Augmented Reality) glasses, notebook computers, tablet computers, desktop computers, set-top boxes, mobile devices (for example, mobile Various types of user terminals (hereinafter referred to as terminals) such as telephones, portable music players, personal digital assistants, dedicated messaging devices, portable game devices, etc., can also be implemented as servers.

Next, an exemplary application when the electronic device is implemented as a terminal will be described. FIG. 2 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure, which will be described in conjunction with the steps shown in FIG. 2 .

S101. Detect each image to be detected to obtain features of multiple objects, and encode the features to obtain multi-dimensional features corresponding to the multiple objects respectively.

In the disclosed embodiment, the terminal can first detect each image to be detected to obtain the features of each object, and then encode the features of each object to obtain the number of objects in the image to be detected. , the multidimensional features of each object. It should be noted that the multiple objects may be all objects in the image to be detected, or may be some objects in the image to be detected.

In some embodiments of the present disclosure, the terminal may obtain the feature of each of the multiple objects by performing image detection and word vector detection on the image to be detected by itself. The feature of each object can be the position feature, visual feature and word vector feature of the object, the feature composed of these three features; where the position feature can be the coordinates of the label box of the object in the image to be detected, and the visual feature It may be a region of interest (Region of Interest, RoI) pooled feature map corresponding to the coordinates of the label box, and the word vector feature may be a word vector corresponding to the category information of the object.

Exemplarily, for an image to be detected, the terminal can first use the Faster R-CNN model to perform image detection on the image to be detected, obtain the positional features and visual features of each object, and obtain the category information of each object (for example, people, trees, etc.), and also obtain the confidence (confidence result) corresponding to the category information, and then use the word vector and text classification model (for example, fastText model) to perform word vector detection on the category information, and obtain each The word vector features corresponding to the category information of an object.

In the embodiment of the present disclosure, the image to be detected may be an image for any scene, for example, the image to be detected may be a collected image of a customer shopping in a store, or an image of a certain scenic spot collected, etc., the implementation of the present disclosure Examples are not limited to this.

S102. Based on some of the features of each group member object of each group of objects, determine the spatial results of at least two types of objects in each group of objects, and the action results of each group member object; wherein, each group of objects Include at least: an object whose category is an object and an object whose category is a person among the plurality of objects.

In the embodiment of the present disclosure, after obtaining multiple objects in the image to be detected, the terminal can group the multiple objects to obtain multiple groups of objects, where each group of objects includes at least objects classified as objects and objects whose category is person; and, at least one group member object is different between any two groups of objects. After obtaining multiple groups of objects, for each group of objects, the terminal can determine the space result between the group member objects in the group of objects according to some features of the characteristics of each group member object in the group of objects, and Determine the action result of each team member object.

In some embodiments, each group of objects may include two types of objects: people and objects, or each group of objects may also include three types of objects: people, objects and animals.

Exemplarily, when the multiple objects are 3 objects, and each group of objects includes two types of objects, people and objects, and the 3 objects include: people, object 1, and object 2, the terminal may combine these 3 objects Objects are divided into two groups: person-object 1, person-object 2; obviously, there is a group member object between these two groups of objects (object 1 is different from object 2); after obtaining these two groups of objects, for person- For object 1, the terminal determines the result of the space between the person and object 1 according to some of the characteristics of the person and object 1 in the group of objects, and determines the result of the action of the person and the result of the action of object 1; for the person- For the object 2, the terminal determines the result of the space between the person and the object 2 according to some of the characteristics of the person and the object 2 in the group of objects, and determines the result of the action of the person and the result of the action of the object 2 respectively.

It should be noted that the spatial result and the action result can be classification score values, and the terminal can obtain the spatial result and the action result through the fully connected layer.

S103. Based on the multi-dimensional features, determine the relationship interaction features of each group of objects, and in the case of determining that the group member objects in each group of objects are related to each other according to the relationship interaction features, based on the spatial result and the action result, Determining target outcomes for each group of objects to obtain at least one target outcome.

In the embodiment of the present disclosure, after obtaining the multi-dimensional features of each of the multiple objects, the terminal can determine the relationship interaction features corresponding to each group of objects according to the multi-dimensional features corresponding to the multiple objects respectively, for For each group of objects, the terminal can determine whether the member objects in the group of objects are related according to the relationship interaction characteristics of the group of objects, and when it is determined that the group member objects of the group of objects are related, based on the group The space results between the group member objects in the object, and the action results of each group member object, and then determine the target result corresponding to the group object, so that there are one or more group members in multiple groups of objects In the case that objects are associated with each other (hereinafter, a group in which member objects in a group are associated with each other is referred to as an associated object group), at least one target result can be correspondingly obtained. For example, in the case that there are 3 groups of objects, and there are 2 associated object groups in the 3 groups of objects, two target results corresponding to the 2 associated object groups can be obtained.

It can be understood that, when it is determined that the member objects of a group of objects are not associated with each other, the group of objects is not a group of related objects, and there is no target result; that is, the embodiments of the present disclosure filter by determining the target result The unrelated object groups between the group member objects are removed; in this way, the interference factors can be reduced when the object behavior in the image to be detected is subsequently determined, and at the same time, the amount of data required for calculation is reduced; The object groups associated with each other, the recognition accuracy and recognition efficiency when recognizing the interaction behavior of characters.

S104. Based on at least one target result, determine the object behavior in each image to be detected.

In the embodiment of the present disclosure, when the terminal obtains at least one target result, it can determine the object in the image to be detected according to the at least one target result and at least one associated object group corresponding to the at least one target result Behavior. Exemplarily, the object behavior in the image to be detected may be the behavior between a person and an object, for example, for the image to be detected in Figure 1A, the obtained object behavior may be "a man riding an elephant", and for example , for the image to be detected in FIG. 1B , the obtained object behavior can be "many people sitting at the dining table".

In some embodiments, the target result is a target value; the terminal may select an associated object corresponding to the highest target value from a plurality of associated object groups corresponding to at least one target value according to at least one target value group, and recognize the behavior among group member objects in the selected group of associated objects.

Here, when the terminal obtains at least one target value, it can sort the at least one target value, select the highest target value according to the sorting result, and use an associated object group corresponding to the highest target value as an identification Target, so as to identify the behavior actions among the group member objects in this associated object group. It should be noted that, the embodiment of the present disclosure may adopt the recognition model in the related art to recognize the behaviors among the member objects in this associated object group, and the embodiment of the present disclosure does not limit the recognition model here.

In some embodiments of the present disclosure, determining the relational interaction features of each group of objects based on the multi-dimensional features in S103 above may be implemented through S1031-S1033, which will be described in conjunction with the steps shown in FIG. 3 .

S1031. Generate fully connected graphs corresponding to the multiple objects based on the multi-dimensional features corresponding to the multiple objects respectively.

In the embodiment of the present disclosure, for multiple objects in the image to be detected, the terminal may generate full Connection Diagram. The fully connected graph can be represented by an adjacency matrix, and each data in the adjacency matrix represents the degree of association between two corresponding objects, and the adjacency matrix can represent the association between any two objects among multiple objects Spend.

Exemplarily, the adjacency matrix can be represented by the following formula (1):

A ^f ∈ R ^N×N ＝{(f _i )|i＝1,...,N}…………(1)

Among them, A ^f represents the adjacency matrix, i represents the i-th object (or, can also be called a node), f _i represents the multi-dimensional feature of the i-th object, and N represents the total number of multiple objects.

S1032. Perform graph convolution processing on the multi-dimensional features corresponding to each object and the fully connected graph to obtain updated multi-dimensional features corresponding to each object.

In the embodiment of the present disclosure, when the terminal obtains the fully connected graph corresponding to the multiple objects in the image to be detected, it can map the multidimensional features and the fully connected graph of each object in the multiple objects. Product operation, through which the updated multi-dimensional features of each object are obtained.

Exemplarily, the terminal can input the multi-dimensional features of each object and the adjacency matrix used to represent the fully connected graph into a graph neural network (Graph Convolutional Network, GCN), perform graph convolution operations through the GCN network, and Output the updated multidimensional features for each object.

In some embodiments, the above S1032 can be implemented in the following manner: based on the adjacency matrix and the one-to-one correspondence between each object's multi-dimensional features, the graph neural network is used to iterate the multi-dimensional features of each object to obtain the one-to-one correspondence with each object. The corresponding updated multi-dimensional features; wherein, the fully connected graph is represented by an adjacency matrix, and each data in the adjacency matrix represents the degree of association between the corresponding two objects.

In some embodiments, the above-mentioned two objects include: a first object and a second object; the degree of association between the two objects can be determined through S201-S203, which will be described in conjunction with the steps shown in FIG. 4 .

S201. Determine the similarity between the multidimensional features of the first object and the multidimensional features of the second object.

In this embodiment of the present disclosure, the terminal may determine the similarity between the first object and the second object according to the multi-dimensional features of the first object and the multi-dimensional features of the second object, for example, dot product similarity or cosine similarity, etc. .

Exemplarily, in the case where the similarity is a dot product similarity, the similarity between the first object and the second object can be represented by the following formula (2):

F _se (f _i ,f _j )=(f _i ) ^T f _i (2)

Among them, F _se (f _i , f _j ) represents the dot product similarity between the i-th object (the first object) and the j-th object (the second object), where i and j are any of 1 to N Integer, and i is not equal to j, f _i represents the multidimensional feature of the i-th object, and f _j represents the multi-dimensional feature of the j-th object.

S202. Determine the distance between the first object and the second object based on the position features of the first object in each image to be detected and the position features of the second object in each image to be detected.

In the embodiment of the present disclosure, in the case of detecting the image to be detected where the first object and the second object are located, the positional characteristics of the first object in the image to be detected, and the position characteristics of the second object in the image to be detected can be obtained. The terminal may determine the distance between the first object and the second object according to the location characteristics of the first object and the location characteristics of the second object.

Exemplarily, the location feature is the coordinates of the callout frame (for example, the coordinates of the center point of the callout frame, or the coordinates of the upper left corner point and the lower right corner point of the callout frame, etc.), and the terminal may use the callout frame coordinates of the first object and the second object The coordinates of the label frame of , calculate the distance between the first object and the second object. For example, the distance between the first object and the second object can be represented by the following formula (3):

Among them, D(b _i , b _j ) represents the coordinate distance between the i-th object and the j-th object calculated by the coordinates of the label box, and F _dist (f _i , f _j ) represents the distance between the i-th object and the j-th object The distance between j objects.

S203. Based on the similarity and the distance, determine the degree of association between the first object and the second object.

In the embodiment of the present disclosure, after determining the similarity and distance between the first object and the second object, the terminal may calculate the degree of association between the first object and the second object according to the similarity and distance.

In some embodiments, the degree of association between the first object and the second object can be calculated by the following formula (4):

in,

Indicates the degree of association between the i-th object and the j-th object,

is a value between 0 and 1; N represents the total number of multiple objects, f _j represents the multidimensional feature of the jth object, f _i represents the multidimensional feature of the ith object, exp(.) represents the base e exponential function.

For the above S1032, the terminal can input the adjacency matrix and the multi-dimensional features of all objects into the multi-layer graph neural network, and iteratively update the multi-dimensional features of each object through the multi-layer graph neural network, so as to obtain each The updated multidimensional feature of the object.

In some embodiments, the terminal may iteratively update the multi-dimensional features corresponding to each object based on the update parameters, the adjacency matrix, the first weight parameter corresponding to the number of iterations, and the multi-dimensional features of all objects, and when the number of iterations reaches In the case of the first preset number of times, the features generated after the first preset number of times are used as the updated multi-dimensional features corresponding to each object.

Here, the update parameter may be an activation function, the first weight parameter corresponding to the number of iterations may be a learnable weight matrix corresponding to each layer of the graph neural network, and the number of iterations may be determined according to the number of layers of the graph neural network. For example, in the case where the graph neural network is a 2-layer graph neural network, each layer corresponds to a learnable weight, and the number of iterations can be determined to be 2; that is, for the first layer of the graph neural network and In other words, the input is the adjacency matrix and the multidimensional features of each object, and the output is the multidimensional features of each object after the first iteration; for the second layer of the graph neural network, the input is the adjacency matrix and each object The multidimensional feature after the first iteration, the output is the multidimensional feature after the second iteration of each object, and the multidimensional feature after the second iteration of each object is the update of each object obtained after the iteration The subsequent multidimensional features.

According to the above, the process of iterating the multi-dimensional features of each object using each layer of the graph neural network can be expressed by the following formula (5):

g ^(l+1) = σ(A×g ^l ×W ^l ) (5)

Among them, A represents the adjacency matrix. g ^l ∈ R ^N×d represents the iterated multidimensional feature of each object output by the l-th layer, g ^(l+1) represents the iterated multi-dimensional feature of each object output by the l+1 layer, g ⁰ ∈ f represents the feature of each object in layer 0, that is, represents the multidimensional feature of each object. W ^l ∈ R ^d×d represents the learnable weight matrix of layer l, d is the size of the input and output features; σ(.) represents the activation function, for example, it can be a linear rectification function (Rectified Linear Unit, ReLU). According to the above formula (5), it can be seen that the input of the l+1th layer is the output of the lth layer.

In some embodiments, l is 1, that is to say, a two-layer graph neural network can be used to iteratively update the multidimensional features of each object; in this way, the update efficiency of the multidimensional features of each object can be improved, thereby having It is beneficial to improve the recognition efficiency of character interaction behavior.

S1033. According to the updated multi-dimensional features of each member object in each group of objects, obtain the relationship interaction features of each group of objects.

In the embodiment of the present disclosure, for each group of objects, when the terminal obtains the updated multi-dimensional features corresponding to each group member in the group of objects, it can use the updated Multi-dimensional features determine the relationship interaction features of the group of objects.

In some embodiments, for each group of objects, the terminal may superimpose the updated multi-dimensional features of the group member objects on the channel dimension, and use the superimposed features as the relationship interaction features of the group of objects.

In some embodiments of the present disclosure, according to the relationship interaction feature in S103 above, it is determined that the group member objects in each group of objects are related to each other, which can be realized through S1034-S1035, which will be performed by the steps shown in FIG. 5 illustrate.

S1034. Classify each group of objects according to the relationship interaction feature, and obtain an interaction result of each group of objects.

In the embodiment of the present disclosure, for each group of objects, when the terminal obtains the relational interaction features of the group of objects, it can input the relational interaction features of the group of objects into the fully connected layer, and the group of objects through the fully connected layer The interaction classification is performed, and the obtained interaction classification score of the group of objects is used as the interaction result of the group of objects.

Exemplarily, the interaction result of each group of objects can be represented by the following formula (6):

in,

Represents the interaction result of each group of objects, _Win represents the learning weight of the fully connected layer, σ(.) represents the activation function,

Represents the relational interaction features of each set of objects.

S1035. If the interaction result is greater than or equal to the first preset score threshold, determine that the group member objects in each group of objects are related to each other.

In the embodiment of the present disclosure, for each group of objects, when the terminal obtains the interaction result of the group of objects, it can compare the interaction result with the first preset score threshold, and when the interaction result is greater than or equal to the first In the case of preset score thresholds, it is determined that the group member objects in the group objects are related to each other.

It should be noted that the first preset score threshold may be set according to actual needs, and the embodiment of the present disclosure does not limit the value of the first preset score threshold.

In some embodiments of the present disclosure, the determination of the target result of each group of objects based on the spatial result and the action result in S103 above may go through S1036-S1038, which will be described with the steps shown in FIG. 6 .

S1036. Based on the relationship interaction features of each group of objects and preset parameters, update the multi-dimensional features of each group member object, obtain the refined features of each group member object, and determine each group based on the refined features. Graph interaction features for objects.

In the embodiments of the present disclosure, for each group of objects, the terminal can update the multi-dimensional features of each group member object in the group of objects according to the relationship interaction features and preset parameters of the group of objects, so as to obtain According to the refined characteristics of each group member object in the group object, the graph interaction characteristics of the group object are determined. In some embodiments, the terminal may superimpose the refined features of all member objects in the group of objects in the channel dimension, so as to obtain the graph interaction features of the group of objects.

In some embodiments, the preset parameters include: the second weight parameter and the number of iterations; the above-mentioned S1036 based on the relationship interaction features of each group of objects, and the preset parameters to update the multidimensional features of each group member object, Obtaining the refined features of each group member object can be achieved in the following way: Based on the second weight parameter and the relationship interaction feature of each group object, iteratively update the multi-dimensional features of each group member object, and in the number of iterations When the second preset number of times is reached, the features generated after the second preset number of times are used as the refined features of each group member object.

Here, for each group of objects, the terminal can iteratively update the multi-dimensional features of each group member object in the group of objects according to the second weight parameter and the relationship interaction feature of the group of objects, for example, in the process of the first iteration , the multidimensional feature of each member object in the group object is used as input, and the multidimensional feature of each member object after the first iteration is obtained after iteration. In the process of the second iteration, for each member For the object, the multidimensional feature after the first iteration is used as the input of the second iteration, and iterates in this way until the number of iterations reaches the second preset number of times, and the corresponding number of iterations after the second preset number of times The multidimensional features of are used as the refined features of each group member object.

Exemplarily, the process of generating the refined features of each group member object can be expressed by the following formula (7):

in,

Represents the relational interaction features of each group of objects,

represents the index function,

Represents the interaction result of each group of objects, μ _s represents the first preset score threshold; α represents the second weight parameter (weighting parameter), N represents the total number of multiple objects, f _i ^(t) represents the i-th object Refinement features, f _i ^(t-1) represents the features of the i-th object input when obtaining the refinement features of the i-th object, f _j ^(t-1) represents the input when obtaining the refinement features of the i-th object t represents the number of iterations; in the case of t=1, f _i ^(t-1) represents the multidimensional feature of the i-th object, and f _j ^(t-1) represents the j-th object’s multidimensional features.

It should be noted that the second preset number of times may be set according to actual needs, which is not limited in this embodiment of the present disclosure.

Exemplarily, the second preset number of times may be 2. In this way, the efficiency of obtaining the refined features of each team member object can be improved, which is beneficial to improve the recognition efficiency of human interaction behavior.

S1037. Based on the graph interaction feature, classify each group of objects to obtain a graph relationship result.

In the embodiment of the present disclosure, for each group of objects, when the terminal obtains the graph interaction features of the group of objects, it may classify the graph relationship of the group of objects according to the graph interaction features to obtain a graph relationship result.

In some embodiments, the terminal can input the graph interaction features of the group of objects into the fully connected layer, and classify the graph relationship of the group of objects through the fully connected layer, so as to obtain the graph relationship classification score, and the obtained graph A relation classification score, as a result of graph relations for this set of objects.

Exemplarily, according to the graph interaction characteristics of each group of objects, the terminal obtains the process of the graph relationship result of the group of objects, which can be expressed by the following formula (8):

in,

Represents the graph relationship result of each group of objects,

Represents the graph interaction features of each group of objects, W _a represents the learning weight of the fully connected layer, and σ(.) represents the activation function.

S1038 Determine target results for each group of objects based on the spatial results, action results, interaction results, graph relationship results, and confidence results obtained when detecting each group member object.

In the embodiment of the present disclosure, for each group of objects, the terminal can obtain the spatial results of the group of objects, the action results of each group member object, the interaction results and graph relationship results of the group of objects, and in the above steps The confidence result obtained when detecting each group member object is used to obtain the target result of the group object.

In some embodiments, for each group of objects, the terminal may determine the first product value between the confidence results of all group member objects; determine the second product value between the action results of all group member objects; determine the The third product between the first product value, the second product value, the spatial result and the graph relationship result; determine the index value between the interaction result and the first preset score threshold; and, combine the third product value with the index value The product between is used as the target result for this set of objects.

Exemplarily, according to the spatial result, action result, interaction result, graph relationship result, and the confidence result of each group member object, the process of determining the target result of each group object can be expressed by the following formula (9):

in,

Indicates the target result, s _h or s _o represents the confidence result of the group member object, wherein, s _h represents the confidence result of the object classified as a person, and s _o represents the confidence result of the object classified as an object;

or

Indicates the action result of the group member object, where,

represents the result of an action for an object of class Person,

Indicates the action result of an object of class Object;

represents the product between the graph relational result and the spatial result,

represents the interaction result, μ _s represents the first preset score threshold,

Represents an indicator function.

In some embodiments, the encoding of features in S101 above to obtain multi-dimensional features corresponding to multiple objects respectively may be implemented through S1011-S1014, which will be described below in conjunction with the steps in FIG. 7 .

S1011. Encode the location features corresponding to the multiple objects respectively to obtain the first feature of each object; the detection includes: image detection and word vector detection.

S1012. Encode the visual features corresponding to the plurality of objects one by one to obtain the second feature of each object; the position feature and visual feature are obtained by performing image detection on each image to be detected.

S1013. Encode the word vector features corresponding to a plurality of objects respectively to obtain the third feature of each object; the word vector feature is obtained by performing word vector detection on the category information of each object; the category information is It is obtained by performing image detection on each image to be detected.

S1014. According to the first feature, the second feature and the third feature, obtain multi-dimensional features corresponding to the multiple objects respectively; wherein, the dimensions of the first feature, the second feature and the third feature are the same.

In the embodiment of the present disclosure, after the terminal obtains the position feature, visual feature and word vector feature of each object obtained by performing image detection and word vector detection on an image to be detected, these three features can be encoded separately to the same feature space, so as to obtain the first feature, the second feature and the third feature with the same dimension.

Exemplarily, the location feature of an object may be the coordinates of the label box of the object in the image to be detected, the visual feature may be the RoI pooled feature map corresponding to the coordinates of the label box, and the word vector feature may be is the word vector corresponding to the category information of the object.

In some embodiments of the present disclosure, the detection of each image to be detected in the above S101 to obtain the features of multiple objects may be implemented through S401-S403, which will be described below in conjunction with the steps in FIG. 8 .

S401. Perform image detection on each image to be detected, and obtain the position feature, visual feature, confidence result, and category information corresponding to the confidence result of each detected target.

S402. Taking the target whose confidence result is greater than or equal to the second preset score threshold as the detected object, and obtaining positional features, visual features, and category information corresponding to the plurality of objects respectively.

S403. Perform word vector detection on category information of each object to obtain word vector features of each object.

In the embodiment of the present disclosure, for each image to be detected, the terminal can obtain the position feature, visual feature, confidence result, and category information corresponding to the confidence result of each target in the image to be detected through image detection , after that, the terminal can compare the confidence result of each target with the second preset score threshold, and according to the comparison result, remove the target whose confidence result is less than the second preset score threshold, and keep the confidence result greater than or equal to the second preset The goal of the score threshold, using all the retained goals as the above-mentioned multiple objects, so as to obtain the location features, visual features, confidence results corresponding to each object, and category information corresponding to the confidence results; and, after obtaining each object In the case of the category information of each object, the terminal may also perform word vector detection on the category information of each object to obtain the word vector feature corresponding to each object.

Here, each target whose confidence result is greater than or equal to the second preset score threshold is used as each object for subsequent behavior recognition of the image to be detected. In this way, the gap between the person and the object in the image to be detected can be reduced. Interference factors in the identification of interactive behaviors are beneficial to improve the recognition accuracy when identifying the interactive behaviors between people and objects in the image to be detected.

It should be noted that the second preset score threshold may be set according to actual needs, which is not limited in this embodiment of the present disclosure.

In some embodiments, in the case of encoding, the terminal can use a multilayer perceptron (Multilayer Perceptron, MLP) to encode the location features, visual features and word vector features of each object, so as to obtain the corresponding The first, second, and third features of the same dimensionality.

Exemplarily, the first feature, the second feature and the third feature may all be 256-dimensional features. After obtaining the 256-dimensional first feature, second feature, and third feature of each object, the terminal can superimpose the first feature, second feature, and third feature on the channel dimension to obtain the corresponding 768 Dimensional multidimensional features.

In some embodiments, the above S1012 can be implemented through S301-S302:

S301. Perform dimension transformation processing on the visual features corresponding to the multiple objects respectively, to obtain the dimensionally transformed visual features of each object.

S302. Encode the dimensionally transformed visual features to obtain a second feature of each object.

In the embodiment of the present disclosure, since the visual features of each object are two-dimensional features, before encoding, for the visual features of each object, the terminal can perform dimension transformation (Reshap) on the visual features to obtain the dimension The transformed one-dimensional visual feature is encoded, and the second feature of each object is obtained by encoding the dimensionally transformed one-dimensional visual feature.

In some embodiments of the present disclosure, based on some features of the features of each group member object in each group of objects in the above S102, determine the spatial results of at least two types of objects in each group of objects, and each group member The action result of the object can be realized through S1021-S1024, which will be described in conjunction with the steps in FIG. 9 .

S1021. Based on the position feature of each team member object of each group of objects, determine the image area of each team member object in each image to be detected; some features include: the position feature and visual feature of each team member object; The location features and visual features are obtained by image detection for each image to be detected.

In the embodiment of the present disclosure, for each group member object in each group of objects, the terminal may determine the corresponding image area of the group member object from the corresponding image to be detected according to the position characteristics of the group member object.

Exemplarily, in the case where the position feature is the coordinates of a label frame, and a team member object is a motorcycle, the terminal may, according to the coordinates of the label frame of the motorcycle in the image to be detected, The image area of the car is intercepted to obtain the image area of the motorcycle.

S1022. Obtain an image area corresponding to each group of objects according to the image area of each group member object, and encode the image area corresponding to each group of objects to obtain two-dimensional feature data.

In the embodiment of the present disclosure, for each group of objects, when the terminal obtains the image area of each group member object, it can stitch the image areas of all group member objects to obtain the image area of the group object, and The image area of this group of objects is encoded; in the process of encoding, in the person channel, the value of the image area of the person is 1, and the value of other areas is 0; in the object channel, the value of the image area of the object is 1, The values of other regions are 0, thus obtaining the two-dimensional characteristic data of this group of objects.

For example, in the case that a group of objects includes a person and a motorcycle, the terminal may concatenate the image area of the person and the image area of the motorcycle obtained in S1021, so as to obtain the image area of the group of objects of the person-motorcycle, and So that in the human channel, the value of the image area of the person is 1, and the value of other areas is 0; in the motorcycle channel, the value of the image area of the motorcycle is 1, and the value of other areas is 0, thus obtaining the human- The two-dimensional feature data of the motorcycle group of objects.

S1023. Perform feature processing on the two-dimensional feature data and the visual features of each team member object, correspondingly obtain the processed two-dimensional feature data and the processed visual features.

In the embodiment of the present disclosure, for each group of objects, the terminal may separately perform feature processing on the two-dimensional feature data and visual features after obtaining the two-dimensional feature data of the group of objects and the visual features of each group member object. processing, so as to obtain the processed two-dimensional feature data and the processed visual features respectively.

Exemplarily, the terminal can first perform feature extraction on two-dimensional feature data through a convolutional neural network (Convolutional Neural Networks, CNN, referred to as CNN Block) to obtain the first sub-feature; through a residual network (Residual Block, referred to as Res Block) extracts the visual features of each group member object to obtain the second sub-feature; then performs global average pooling (Global average Pooling, GAP) on the first sub-feature and the second sub-feature respectively, and the corresponding The two-dimensional feature data and the processed visual features.

Exemplarily, the processed two-dimensional feature data and the processed visual features can be represented by the following formulas (10), (11) and (12):

f _h =GAP(Res(RoI(F,b _h ))) (11)

f _o ＝GAP(Res(RoI(F,b _o ))) (12)

Among them, F represents the feature map after the ROI pooling of the image to be detected, f _h or f _o is the processed visual feature of each group member object, where f _h represents the processed visual feature of the object of the category, f _o represents the processed visual features of objects classified as objects;

Represents the processed two-dimensional feature data, for example, f _{h, o} represents the processed two-dimensional feature data of the group of objects of people-objects when a group of objects includes two objects of the category of people and objects; F _{h, o} represent the image area corresponding to each group of objects, b _h or b _o represent the location characteristics of each group member object, where b _h represents the location characteristics of objects classified as people, and b _o represents objects classified as objects The location features of ; RoI(F,b _h ) or RoI(F,b _o ) represent the visual features of each team member object, where RoI(F,b _h ) represents the visual features of objects classified as people, RoI(F , b _o ) represent the visual features of objects classified as objects.

S1024. Classify each group of objects according to the processed two-dimensional feature data to obtain the spatial result of each group of objects, and classify each group member object according to the processed visual features to obtain each group member The object's action result.

In the embodiment of the present disclosure, when the terminal obtains the processed two-dimensional feature data and the processed visual features of each group member object, it can perform spatial processing on the group object according to the processed two-dimensional feature data. Classify to obtain the spatial result corresponding to the group object; and according to the processed visual features of each group member object, perform action classification on the group member object to obtain the action result of the group member object.

In some embodiments, the terminal can input the processed two-dimensional feature data into a fully connected layer, classify the group of objects through the fully connected layer, obtain a spatial classification score, and use the spatial classification score as the group object and the terminal can input the processed visual features of each group member object into another fully connected layer, classify the group member object through the fully connected layer, obtain an action classification score, and classify the action Score as action result.

Exemplarily, the terminal classifies each group of objects according to the processed two-dimensional feature data, obtains the spatial result of each group of objects, and classifies each group member according to the processed visual features of each group member object Objects are classified to obtain the action results of each team member object, which can be expressed by the following formulas (13), (14) and (15):

in,

represents the spatial result for each group of objects,

or

Indicates the action result of each group member object, where,

Indicates the action result of the group member object whose category is person,

Indicates the action result of the member object whose category is an object; W _h indicates the learning weight of the fully connected layer corresponding to the member object whose category is human, and W _o indicates the learning weight of the fully connected layer corresponding to the member object whose category is an object , W _h,o represents the learning weight of the fully connected layer corresponding to each group of objects.

The technical solution of the present disclosure will be described below in conjunction with a specific application scenario; FIG. 10 is a partial flow diagram of an example of using a behavior recognition method to identify an object behavior in an image to be detected provided by an embodiment of the disclosure.

As shown in Figure 10, the terminal performs target detection and word vector detection on an image to be detected I, and obtains the position feature, confidence result, and word vector feature of each object in the image to be detected, for example, as shown in Figure 10 It shows that in the case of detecting motorcycles and helmets, the detector can be used to detect the word vector of the motorcycle to obtain the word vector features corresponding to the motorcycle, and the detector can be used to detect the word vector of the helmet to obtain the helmet corresponding and, according to the location feature of each object, the terminal may perform feature extraction on the ROI pooled image obtained in the image detection process of the image to be detected to obtain the visual feature of each object.

On the one hand, after the terminal obtains the location feature, word vector feature and visual feature of each object, the terminal can use the semantic encoding module to encode the location feature and word vector feature respectively, and obtain the first feature and the third feature , at the same time, perform dimension transformation processing (Reshap) on the visual features of each object, and use MLP to encode the visual features after dimension transformation processing to obtain the second feature with the same dimension as the first feature and the third feature, And the first feature, the second feature and the third feature are superimposed in the channel dimension to obtain the multi-dimensional feature corresponding to each object in the image I to be detected. According to the multi-dimensional features corresponding to all objects (ie multiple objects) of the image I to be detected respectively, a fully connected graph corresponding to all objects is generated, and the fully connected graph is characterized by an adjacency matrix (not shown in FIG. 10 ), the adjacency matrix and the multi-dimensional features corresponding to all objects are taken as the input of the GCN network, and the updated multi-dimensional features of each object are obtained through the graph convolution processing of the GCN network; according to the updated multi-dimensional features of each object The multi-dimensional features of each group of objects are obtained, and the relationship interaction features of each group of objects are input into the fully connected layer (FCs), and the group of objects is classified to obtain the interaction results of each group of objects.

Wherein, grouping all objects in the image I to be detected can obtain multiple groups of objects. According to the interaction result of each group of objects, the terminal retains each group of objects whose corresponding interaction result is greater than or equal to the first preset score threshold, and obtains a plurality of related object groups that are related to each other among group member objects; and, according to The relationship interaction features corresponding to each associated object group, as well as the preset parameters, update the multidimensional features of each member object in the associated object group, and obtain the refined features of each member object (the update process can be shown in Fig. 10), for each associated object group, the terminal superimposes the refined features of all member objects in the associated object group on the channel dimension to obtain the graph interaction feature of the associated object group (Fig. 10), and input the graph interaction feature into the fully connected layer for classification, and obtain the graph relationship result of the associated object group

On the other hand, the terminal obtains the image area of each object according to the location characteristics of each object, and stitches the image areas of the member objects of each group of objects to obtain the image area of each group of objects, and The image region of the group object is encoded to obtain two-dimensional feature data. Afterwards, for each group of objects, the visual features of the group member objects (for example, object 1 and object 2 in Figure 1) in the group of objects are respectively input into the residual network for feature extraction, and different second subclasses are obtained. feature (not shown in Figure 10), and, for the two-dimensional feature data of each group of objects, the two-dimensional feature data is input into the convolutional neural network for feature extraction to obtain the first sub-feature (not shown in Figure 10 output), and perform global average pooling on the first sub-feature and each second sub-feature respectively, to obtain the processed visual features of each group member object, for example, the processed visual features of object 1 in Figure 10 , and the processed visual features of object 2, and obtain the processed two-dimensional feature data of the group of objects. After that, the processed visual features of object 1, the processed visual features of object 2, and the processed two-dimensional feature data of this group of objects are input into different fully connected layers for classification, respectively, and object 1 result of the action

Object 2's action result

and the spatial results for a set of objects consisting of Object 1 and Object 2

Finally, the terminal can calculate the target result of each associated object group by substituting all the results obtained in the above process into the above formula (9), and then according to the obtained target results of all associated object groups, the highest target result A corresponding associated object group can identify the interaction behavior of the person in the image I to be detected.

The present disclosure also provides a behavior recognition device, and FIG. 11 is a schematic structural diagram of the behavior recognition device provided by an embodiment of the disclosure; as shown in FIG. The image is detected to obtain the features of multiple objects, and the features are encoded to obtain multi-dimensional features corresponding to the multiple objects respectively; the result determining part 20 is configured to be based on each group member of each group of objects Part of the features in the object features, determine the spatial results of at least two types of objects in each group of objects, and the action results of each of the group member objects, wherein each group of objects includes at least: the multiple Among the objects, the object category is an object, and the object category is a person; based on the multidimensional feature, determine the relationship interaction feature of each group of objects, and determine the relationship interaction feature of each group of objects according to the relationship interaction feature In the case that the group member objects are related to each other, based on the spatial result and the action result, determine the target result of each group of objects, and obtain at least one of the target results; the behavior determination part 30 is It is configured to determine the object behavior in each of the images to be detected based on at least one of the target results.

In some embodiments of the present disclosure, the result determining part 20 is further configured to generate a fully connected graph corresponding to the multiple objects based on the multi-dimensional features corresponding to the multiple objects respectively ; By performing graph convolution processing on the multi-dimensional features corresponding to each of the objects and the fully connected graph, the updated multi-dimensional features corresponding to each of the objects are obtained; according to the The updated multi-dimensional feature of each member object in each group of objects obtains the relationship interaction feature of each group of objects.

In some embodiments of the present disclosure, the result determining part 20 is further configured to classify each group of objects according to the relationship interaction feature, and obtain the interaction result of each group of objects; If the interaction result is greater than or equal to a first preset score threshold, it is determined that the group member objects in each group of objects are related to each other.

In some embodiments of the present disclosure, the result determination part 20 is further configured to, based on the relationship interaction features of each group of objects and preset parameters, for each of the group member objects update the multi-dimensional features to obtain the refined features of each of the group member objects, and based on the refined features, determine the graph interaction features of each group of objects; based on the graph interaction features, for each Classifying group objects to obtain graph relationship results; based on the spatial results, the action results, the interaction results, the graph relationship results, and the confidence obtained when performing the detection on each of the group member objects As a result, said target outcome for said each set of subjects is determined.

In some embodiments of the present disclosure, the target result is a target value; the behavior determination part 30 is further configured to, according to at least one of the target values, select from a plurality of one-to-one corresponding to the at least one target value In the group of associated objects, a group of associated objects corresponding to the highest target value is selected, and the behavior among the member objects in the group of associated objects is identified.

In some embodiments of the present disclosure, the fully connected graph is represented by an adjacency matrix, and each data in the adjacency matrix represents the degree of association between two corresponding objects; the result determining part 20 is further configured For the one-to-one correspondence between the adjacency matrix and each of the objects, the multidimensional features of each of the objects are iterated through a graph neural network to obtain a one-to-one correspondence with each of the objects. The updated multidimensional features of .

In some embodiments of the present disclosure, the two objects include: a first object and a second object; the result determining part 20 is further configured to determine the multi-dimensional feature and the second object of the first object The similarity between the multi-dimensional features of the two objects; based on the positional features of the first object in each of the images to be detected, and the positional features of the second object in each of the images to be detected , determining a distance between the first object and the second object; and determining the degree of association between the first object and the second object based on the similarity and the distance.

In some embodiments of the present disclosure, the result determination part 20 is further configured to be based on the update parameter, the adjacency matrix, the first weight parameter corresponding to the number of iterations, and the one-to-one correspondence between each of the objects The multi-dimensional features are iteratively updated for each of the multi-dimensional features of the object, and when the number of iterations reaches the first preset number of times, the features generated after the first preset number of times are used as each of the multi-dimensional features The updated multidimensional feature of the object.

In some embodiments of the present disclosure, the preset parameters include: a second weight parameter and the number of iterations; the result determining part 20 is further configured to The relational interaction features are iteratively updated for the multi-dimensional features of each of the team member objects, and when the number of iterations reaches a second preset number of times, the features generated after the second preset number of times, as the refinement feature for each of the group member objects.

In some embodiments of the present disclosure, the detection includes: image detection and word vector detection; the encoding part 10 is further configured to encode the position features corresponding to the plurality of objects respectively, to obtain each The first feature of an object; the visual features corresponding to the multiple objects are encoded to obtain the second feature of each of the objects; the position feature and the visual feature are for each of the described objects. The image to be detected is obtained by image detection; the word vector features corresponding to the plurality of objects are encoded to obtain the third feature of each of the objects; the word vector features are for each of the objects The category information is obtained by performing word vector detection; the category information is obtained by performing image detection on each of the images to be detected; according to the first feature, the second feature and the third feature, get the same as the The plurality of objects correspond to the multi-dimensional features respectively; wherein, the dimensions of the first feature, the second feature and the third feature are the same.

In some embodiments of the present disclosure, the encoding part 10 is further configured to perform dimension transformation processing on the visual features corresponding to the plurality of objects respectively, to obtain the dimension transformation of each of the objects. Visual features: encoding the visual features after dimension transformation to obtain the second features of each of the objects.

In some embodiments of the present disclosure, the partial features include: positional features and visual features of each of the team member objects; the positional features and the visual features are image detection performed on each image to be detected Obtained; the result determining part 20 is further configured to determine each of the group member objects on each sheet to be detected based on the position characteristics of each of the group member objects of each group of objects The image area in the image; according to the image area of each of the group member objects, the image area corresponding to each group of objects is obtained, and the image area corresponding to each group of objects is encoded to obtain a two-dimensional feature data; performing feature processing on the two-dimensional feature data and the visual features of each of the team member objects, corresponding to the processed two-dimensional feature data and the processed visual feature; according to the processing classify each group of objects, obtain the spatial result of each group of objects, and classify each group member object according to the processed visual features , to obtain the action result of each member object.

In some embodiments of the present disclosure, the device further includes a detection part configured to perform image detection on each of the images to be detected, and obtain the positional features, visual features, and confidence results of each detected target, and Category information corresponding to the confidence result; taking the target whose confidence result is greater than or equal to a second preset score threshold as the detected object, and obtaining the positional features, location features, and The visual features, and the category information; word vector detection is performed on the category information of each object to obtain the word vector features of each object.

In the embodiments of the present disclosure and other embodiments, a "part" may be a part of a circuit, a part of a processor, a part of a program or software, etc., of course it may also be a unit, a module or a non-modular one.

An embodiment of the present disclosure also provides an electronic device. FIG. 12 is a schematic structural diagram of a virtual label display device provided by an embodiment of the present disclosure. As shown in FIG. 12 , it includes: a memory 22 and a processor 23, wherein the memory 22 and the processor 23 is connected through the bus 21; the memory 22 is configured to store an executable computer program; the processor 23 is configured to execute the executable computer program stored in the memory 22 to implement the method provided by the embodiment of the present disclosure, for example, the present disclosure The behavior recognition method provided by the embodiment.

The embodiment of the present disclosure provides a computer-readable storage medium storing a computer program for causing the processor 23 to implement the method provided in the embodiment of the present disclosure, for example, the behavior recognition method provided in the embodiment of the present disclosure.

In some embodiments of the present disclosure, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; Various devices in any combination.

A computer readable storage medium may also be a tangible device that holds and stores instructions for use by an instruction execution device, and may be a volatile storage medium or a nonvolatile storage medium. A computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: USB flash drives, magnetic disks, optical disks, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable Read Only Memory (EPROM or Flash), Static Random Access Reader (ROM), Portable Compact Disk Read Only Memory (CD-ROM), Digital Versatile Disk (DVD), Memory Stick, Floppy Disk, Memory Encoding Device, Examples include punched cards with instructions stored thereon, or recessed-in-groove structures, and any suitable combination of the foregoing. As used herein, a computer-readable storage medium is not to be construed as a transient signal per se, such as a radio wave or other freely propagating battery wave, a battery wave propagating through a waveguide or other media medium (e.g., a pulse of light through a fiber optic cable), or Electrical signals transmitted through wires.

In some embodiments of the present disclosure, computer program instructions may take the form of programs, software, software modules, scripts, or codes written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages , and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other part suitable for use in a computing environment.

As an example, computer program instructions may, but do not necessarily correspond to files in a file system, may be stored as part of files that hold other programs or data, for example, in Hyper Text Markup Language (HTML) documents in one or more scripts, in a single file dedicated to the program in question, or in multiple cooperating files (for example, files that store one or more modules, subroutines, or sections of code).

As an example, computer program instructions can be deployed to be executed on one computing device, or on multiple computing devices at one site, or distributed across multiple sites and interconnected by a communication network. to execute.

The above descriptions are merely examples of the present disclosure, and are not intended to limit the protection scope of the present disclosure. Any modifications, equivalent replacements and improvements made within the spirit and scope of the present disclosure are included in the protection scope of the present disclosure.

Industrial Applicability

The embodiment of the present disclosure discloses an identification method, device, electronic equipment, computer-readable storage medium, computer program and computer program product. The method includes: detecting each image to be detected to obtain the features of multiple objects, encoding the features, and obtaining multi-dimensional features corresponding to the multiple objects respectively, based on the part of the features of each object in each group of objects Features, determine the spatial results of at least two types of objects in each group of objects, and the action results of each object; based on multidimensional features, determine the relationship interaction characteristics of each group of objects, and determine the objects in each group of objects based on the relationship interaction characteristics In the case of mutual correlation, the target result of each group of objects is determined based on the spatial result and the action result, and at least one target result is obtained; based on at least one target result, the object behavior in each image to be detected is determined. Through the present disclosure, it is possible to improve the recognition accuracy and recognition efficiency when recognizing human interaction behaviors.

Claims

A behavior recognition method, comprising:

Detecting each image to be detected to obtain features of multiple objects, encoding the features to obtain multi-dimensional features corresponding to the multiple objects respectively;

Based on some of the features of each group member object of each group of objects, determine the spatial results of at least two types of objects in each group of objects, and the action results of each of the group member objects, wherein the Each group of objects at least includes: objects classified as objects and objects classified as people among the plurality of objects;

Based on the multi-dimensional features, determine the relationship interaction features of each group of objects, and in the case of determining that the group member objects in each group of objects are related to each other according to the relationship interaction features, based on The spatial result and the action result determine the target result of each group of objects to obtain at least one of the target results;

Based on at least one of the target results, an object behavior in each of the images to be detected is determined.
The method according to claim 1, wherein said determining the relational interaction features of each group of objects based on said multi-dimensional features comprises:

generating fully connected graphs corresponding to the multiple objects based on the multi-dimensional features corresponding to the multiple objects respectively;

By performing graph convolution processing on the multi-dimensional features corresponding to each of the objects and the fully connected graph, an updated multi-dimensional feature corresponding to each of the objects is obtained;

According to the updated multi-dimensional feature of each member object in each group of objects, the relationship interaction feature of each group of objects is obtained.
The method according to claim 1 or 2, wherein, according to the relationship interaction feature, determining that the group member objects in each group of objects are related to each other includes:

classify each group of objects according to the relationship interaction features, and obtain the interaction results of each group of objects;

If the interaction result is greater than or equal to a first preset score threshold, it is determined that the group member objects in each group of objects are related to each other.
The method according to claim 3, wherein said determining a target result for each group of objects based on said spatial result and said action result comprises:

Based on the relationship interaction features of each group of objects and preset parameters, the multidimensional features of each of the group member objects are updated to obtain the refined features of each of the group member objects, and based on The refinement feature determines the graph interaction feature of each group of objects;

Classifying each group of objects based on the graph interaction features to obtain a graph relationship result;

Based on the spatial result, the action result, the interaction result, the graph relationship result, and the confidence result obtained when the detection is performed on each of the group member objects, determine the value of each group of objects The target result.
The method according to claim 1, wherein the target result is a target value; said determining the object behavior in each image to be detected based on at least one target result comprises:

According to at least one of the target values, from a plurality of associated object groups corresponding to at least one of the target values one by one, select an associated object group corresponding to the highest target value, and identify the associated object group Behavior between the group member objects in the .
The method according to claim 2, wherein the fully connected graph is represented by an adjacency matrix, and each data in the adjacency matrix represents the degree of association between two corresponding objects;

The multi-dimensional features corresponding to each of the objects one-to-one and the fully connected graph are subjected to graph convolution processing to obtain updated multi-dimensional features corresponding to each of the objects one-to-one, including:

Based on the adjacency matrix and the multi-dimensional features corresponding to each of the objects, the multi-dimensional features of each of the objects are iterated through a graph neural network to obtain a one-to-one correspondence with each of the objects. Updated multidimensional features.
The method according to claim 6, wherein the two objects include: a first object and a second object; the method for determining the degree of association between the two objects includes:

determining a degree of similarity between the multidimensional feature of the first object and the multidimensional feature of the second object;

Determining the first object and the second object based on the positional features of the first object in each of the images to be detected and the positional features of the second object in each of the images to be detected the distance between;

Based on the similarity and the distance, the degree of association between the first object and the second object is determined.
The method according to claim 6 or 7, wherein, based on the adjacency matrix and the one-to-one correspondence between the multidimensional features of each of the objects, the multidimensional features of each of the objects are processed through a graph neural network. Features are iterated to obtain updated multidimensional features corresponding to each of the objects, including:

Based on the update parameter, the adjacency matrix, the first weight parameter corresponding to the number of iterations, and the multidimensional features corresponding to each of the objects, iteratively update the multidimensional features of each of the objects, and When the number of iterations reaches a first preset number of times, the features generated after the first preset number of times are used as the updated multi-dimensional features of each of the objects.
The method according to claim 4, wherein the preset parameters include: a second weight parameter and the number of iterations; the interaction characteristics based on the relationship of each group of objects, and preset parameters, for each The multi-dimensional features of the group member objects are updated to obtain the refined features of each group member object, including:

Based on the second weight parameter and the relationship interaction feature of each group of objects, iteratively update the multi-dimensional features of each of the group member objects, and when the number of iterations reaches a second preset number of times Next, the features generated after the second preset number of times are used as the refined features of each group member object.
The method according to claim 1, wherein said detection comprises: image detection and word vector detection; said encoding said features to obtain multi-dimensional features respectively one-to-one corresponding to said plurality of objects comprises:

Encoding the positional features corresponding to the plurality of objects respectively to obtain the first feature of each object;

Encoding the visual features corresponding to each of the plurality of objects to obtain the second feature of each of the objects; the position feature and the visual feature are obtained by image detection of each of the images to be detected of;

Encoding the word vector features corresponding to the plurality of objects respectively to obtain the third feature of each of the objects; the word vector features are obtained by performing word vector detection on the category information of each of the objects The category information is obtained by performing image detection on each of the images to be detected;

According to the first feature, the second feature and the third feature, obtain the multi-dimensional features corresponding to the plurality of objects respectively; wherein, the first feature, the second feature and the The dimensions of the third feature are the same.
The method according to claim 10, wherein said encoding the visual features corresponding to said plurality of objects respectively to obtain the second feature of each said object comprises:

performing dimension transformation processing on the visual features corresponding to the plurality of objects respectively, so as to obtain the visual features after dimension transformation of each of the objects;

Encoding the dimension-transformed visual features to obtain the second features of each of the objects.
The method according to claim 1, 10 or 11, wherein the partial features include: positional features and visual features of each of the group member objects; the positional features and the visual features are for each of the The image to be detected is obtained by image detection;

Determining the spatial results of at least two types of objects in each group of objects based on some of the features of each group member object in each group of objects, and the action results of each of the group member objects, including:

Based on the positional features of each of the group member objects of each group of objects, determine the image area of each of the group member objects in each of the images to be detected;

Obtaining an image area corresponding to each group of objects according to the image area of each group member object, and encoding the image area corresponding to each group of objects to obtain two-dimensional feature data;

performing feature processing on the two-dimensional feature data and the visual features of each member object, correspondingly obtaining the processed two-dimensional feature data and the processed visual features;

Classify each group of objects according to the processed two-dimensional feature data, obtain the spatial result of each group of objects, and classify each group according to the processed visual features Classify the member objects to obtain the action result of each member object.
The method according to claim 1, wherein the detection of each image to be detected to obtain the characteristics of a plurality of objects includes:

Performing image detection on each of the images to be detected to obtain the detected positional features, visual features, confidence results, and category information corresponding to the confidence results;

Taking the target whose confidence result is greater than or equal to the second preset score threshold as the detected object, and obtaining the position feature, the visual feature, and the category information corresponding to the plurality of objects respectively ;

The word vector detection is performed on the category information of each object to obtain the word vector features of each object.
A behavior recognition device, comprising:

The encoding part is configured to detect each image to be detected to obtain the features of multiple objects, and encode the features to obtain multi-dimensional features corresponding to the multiple objects respectively;

The result determination part is configured to determine the spatial results of at least two types of objects of each group of objects based on some features of the features of each group member object of each group of objects, and each of the group member objects The action result, wherein each group of objects at least includes: objects classified as objects and objects classified as people among the plurality of objects; based on the multidimensional features, the relationship interaction features of each group of objects are determined, And in the case of determining that the member objects in each group of objects are related to each other according to the relationship interaction feature, based on the spatial result and the action result, determine the relationship between each group of objects target outcomes, obtaining at least one of said target outcomes;

The behavior determination part is configured to determine the behavior of the object in each of the images to be detected based on at least one of the target results.
An electronic device comprising:

a memory configured to store an executable computer program;

A processor configured to implement the method according to any one of claims 1 to 13 when executing the executable computer program stored in the memory.
A computer-readable storage medium on which is stored a computer program for causing a processor to implement the method according to any one of claims 1 to 13 when executed.
A computer program, comprising computer-readable codes, when the computer-readable codes run in an electronic device, a processor in the electronic device executes the program to implement any one of claims 1 to 13 steps of the method.
A computer program product comprising computer program instructions for causing a computer to perform the steps of the method as claimed in any one of claims 1 to 13.