WO2023273334A1 - Behavior recognition method and apparatus, and electronic device, computer-readable storage medium, computer program and computer program product - Google Patents


Info

Publication number
WO2023273334A1
Authority
WO
WIPO (PCT)
Prior art keywords
objects
features
group
feature
result
Application number
PCT/CN2022/074120
Other languages
French (fr)
Chinese (zh)
Inventor
王浩然
纪德益
Original Assignee
上海商汤智能科技有限公司
Application filed by 上海商汤智能科技有限公司
Publication of WO2023273334A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Definitions

  • The present disclosure relates to the technical field of computer vision, and in particular to a behavior recognition method and apparatus, an electronic device, a computer-readable storage medium, a computer program, and a computer program product.
  • Human-object interaction behavior detection is an important task for understanding how people and objects interact.
  • Human-object interaction (HOI) behavior detection aims to localize and classify triplets of human, object, and human-object relationship from an input image. Detecting human-object interactions can enable well-designed algorithms to generate better descriptions of scenes.
  • Embodiments of the present disclosure provide a behavior recognition method and apparatus, an electronic device, a computer-readable storage medium, a computer program, and a computer program product, which can improve the recognition accuracy and recognition efficiency of human-object interaction behaviors.
  • An embodiment of the present disclosure provides a behavior recognition method, including: detecting each image to be detected to obtain features of multiple objects, and encoding the features to obtain multi-dimensional features corresponding to the multiple objects respectively; determining, based on partial features among the features of each group member object of each group of objects, the spatial result of at least two categories of objects in each group of objects and the action result of each group member object, where each group of objects includes at least an object whose category is object and an object whose category is person among the multiple objects; determining the relationship interaction feature of each group of objects based on the multi-dimensional features and, when the relationship interaction feature indicates that the group member objects in each group of objects are associated with each other, determining the target result of each group of objects based on the spatial result and the action result, so as to obtain at least one target result; and determining, based on the at least one target result, the object behavior in each image to be detected.
  • An embodiment of the present disclosure provides a behavior recognition apparatus, including: an encoding part, configured to detect each image to be detected to obtain features of multiple objects, and encode the features to obtain multi-dimensional features corresponding to the multiple objects respectively; a result determination part, configured to determine, based on partial features among the features of each group member object of each group of objects, the spatial result of at least two categories of objects in each group of objects and the action result of each group member object, where each group of objects includes at least an object whose category is object and an object whose category is person among the multiple objects, and further configured to determine the relationship interaction feature of each group of objects based on the multi-dimensional features and, when the relationship interaction feature indicates that the group member objects in each group of objects are associated with each other, determine the target result of each group of objects based on the spatial result and the action result, so as to obtain at least one target result; and a behavior determination part, configured to determine, based on the at least one target result, the object behavior in each image to be detected.
  • The result determination part is further configured to generate a fully connected graph corresponding to the multiple objects based on the multi-dimensional features corresponding to the multiple objects respectively; perform graph convolution processing based on the multi-dimensional features corresponding one-to-one to the objects and the fully connected graph, to obtain the updated multi-dimensional feature corresponding to each object; and obtain the relationship interaction feature of each group of objects according to the updated multi-dimensional features of the group member objects in each group.
  • The result determination part is further configured to classify each group of objects according to the relationship interaction feature to obtain the interaction result of each group of objects, and to determine that the group member objects in each group of objects are associated with each other when the interaction result is greater than or equal to the first preset score threshold.
  • The result determination part is further configured to: update the multi-dimensional feature of each group member object based on the relationship interaction feature of each group of objects and preset parameters, to obtain the refined feature of each group member object; determine the graph interaction feature of each group of objects based on the refined features; classify each group of objects based on the graph interaction feature to obtain a graph relationship result; and determine the target result of each group of objects based on the spatial result, the action result, the interaction result, the graph relationship result, and the confidence results obtained when the detection is performed on the group member objects.
  • In a case where the target result is a target value, the behavior determination part is further configured to select, according to at least one target value, the associated object group corresponding to the highest target value from the associated object groups corresponding to the at least one target value, and to recognize the behavior among the group member objects in that associated object group.
  • The fully connected graph is represented by an adjacency matrix, and each entry of the adjacency matrix represents the degree of association between the two corresponding objects; the result determination part is further configured to iterate the multi-dimensional feature of each object through a graph neural network, based on the adjacency matrix and the multi-dimensional features corresponding one-to-one to the objects, to obtain the updated multi-dimensional feature corresponding to each object.
  • The two objects include a first object and a second object; the result determination part is further configured to: determine the similarity between the multi-dimensional feature of the first object and the multi-dimensional feature of the second object; determine the distance between the first object and the second object based on the position feature of the first object and the position feature of the second object in each image to be detected; and determine the degree of association between the first object and the second object based on the similarity and the distance.
  • The result determination part is further configured to iteratively update the multi-dimensional feature of each object based on the update parameter, the adjacency matrix, the first weight parameter corresponding to the number of iterations, and the multi-dimensional features corresponding to the objects, and, when the number of iterations reaches the first preset number of times, to use the features generated after the first preset number of times as the updated multi-dimensional features of the objects.
  • The preset parameters include a second weight parameter and a number of iterations; the result determination part is further configured to iteratively update the multi-dimensional feature of each group member object based on the second weight parameter and the relationship interaction feature of each group of objects, and, when the number of iterations reaches the second preset number of times, to use the features generated after the second preset number of times as the refined feature of each group member object.
  • The detection includes image detection and word vector detection; the encoding part is further configured to: encode the position features corresponding to the multiple objects respectively to obtain the first feature of each object; encode the visual features corresponding to the multiple objects respectively to obtain the second feature of each object, where the position features and the visual features are obtained by performing image detection on each image to be detected; encode the word vector feature corresponding to each of the multiple objects to obtain the third feature of each object, where the word vector feature is obtained by performing word vector detection on the category information of each object, and the category information is obtained by performing image detection on each image to be detected; and obtain, according to the first feature, the second feature, and the third feature, the multi-dimensional feature corresponding to each object, where the first feature, the second feature, and the third feature have the same dimension.
  • The encoding part is further configured to perform dimension transformation processing on the visual features corresponding to the multiple objects respectively to obtain the dimension-transformed visual feature of each object, and to encode the dimension-transformed visual features to obtain the second feature of each object.
  • The partial features include the position feature and the visual feature of each group member object, both obtained by performing image detection on each image to be detected; the result determination part is further configured to: determine the image area of each group member object in each image to be detected based on the position feature of each group member object of each group of objects; obtain the image area corresponding to each group of objects from the image areas of its group member objects, and encode the image area corresponding to each group of objects to obtain two-dimensional feature data; perform feature processing on the two-dimensional feature data and on the visual feature of each group member object to obtain, correspondingly, the processed two-dimensional feature data and the processed visual features; classify each group of objects according to the processed two-dimensional feature data to obtain the spatial result of each group of objects; and classify each group member object according to the processed visual features to obtain the action result of each group member object.
  • The apparatus further includes a detection part, configured to: perform image detection on each image to be detected to obtain the position feature, visual feature, confidence result, and corresponding category information of each detected target; take the targets whose confidence results are greater than or equal to a second preset score threshold as the detected objects, thereby obtaining the position features, visual features, and category information corresponding to the multiple objects respectively; and perform word vector detection on the category information of each object to obtain the word vector feature of each object.
  • An embodiment of the present disclosure provides an electronic device, including: a memory configured to store an executable computer program; a processor configured to implement the above behavior recognition method when executing the executable computer program stored in the memory.
  • An embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored for causing a processor to execute the above-mentioned behavior recognition method.
  • An embodiment of the present disclosure provides a computer program, including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the steps of the above behavior recognition method.
  • An embodiment of the present disclosure provides a computer program product, including computer program instructions, which enable a computer to execute the steps of the above-mentioned behavior recognition method.
  • With the behavior recognition method and apparatus, electronic device, computer-readable storage medium, computer program, and computer program product provided by the embodiments of the present disclosure, the features of multiple objects are obtained by detecting each image to be detected, and the obtained features are encoded to obtain the multi-dimensional features corresponding to the objects respectively; based on partial features among the features of each group member object of each group of objects, the spatial result of at least two categories of objects in each group of objects and the action result of each group member object are determined, where each group of objects includes at least an object whose category is object and an object whose category is person among the multiple objects; then, based on the multi-dimensional features corresponding to the multiple objects respectively, the relationship interaction feature of each group of objects is determined, and, when the relationship interaction feature indicates that the group member objects in a group are associated with each other, the target result of the group is determined based on the spatial result and the action result, so as to obtain at least one target result; finally, the object behavior in the image to be detected is determined based on the obtained at least one target result.
  • In this way, the embodiments of the present disclosure first determine whether the group member objects in each group of objects are associated with each other, and then determine the object behavior in the image to be detected using only the groups whose member objects are associated with each other; the groups whose member objects are not associated with each other are thereby filtered out, so that, when determining the object behavior in the image to be detected, the factors interfering with the determination result are reduced and, at the same time, the amount of data required for computation is reduced, improving both the recognition accuracy and the recognition efficiency of human-object interaction behavior recognition.
  • FIG. 1A is a schematic diagram of an exemplary image to be detected provided by an embodiment of the present disclosure;
  • FIG. 1B is a schematic diagram of another exemplary image to be detected provided by an embodiment of the present disclosure;
  • FIG. 2 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure;
  • FIG. 3 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure;
  • FIG. 4 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure;
  • FIG. 5 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure;
  • FIG. 6 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure;
  • FIG. 7 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure;
  • FIG. 8 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure;
  • FIG. 9 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure;
  • FIG. 10 is a schematic partial flowchart of an exemplary process of recognizing the object behavior in an image to be detected by using the behavior recognition method provided by an embodiment of the present disclosure;
  • FIG. 11 is a schematic structural diagram of a behavior recognition apparatus provided by an embodiment of the present disclosure;
  • FIG. 12 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • FIG. 1A is a schematic diagram of an exemplary image to be detected provided by an embodiment of the present disclosure.
  • As shown in FIG. 1A, two objects, a person and an elephant, are detected from the image, and each object is annotated with an annotation box. By detecting the interaction between the person and the object, a better description generated for the behavior in this image should be "man riding an elephant" rather than "man and elephant".
  • In some implementations, this task is regarded as a one-stage classification problem.
  • FIG. 1B is an exemplary schematic diagram of another image to be detected provided by an embodiment of the present disclosure.
  • As shown in FIG. 1B, people, tables, and teacups are all detected, and each object is annotated with an annotation box. Here, the person and the teacup form a pair of negative samples; that is to say, although the person and the teacup are not in contact, there is still a high probability that, when the person and the teacup are paired, the pair will be predicted as a tea-drinking behavior, thereby affecting the accuracy of the final prediction result.
  • To this end, an embodiment of the present disclosure provides a behavior recognition method, which can reduce negative sample pairs, thereby improving the recognition accuracy and recognition efficiency of human-object interaction behaviors.
  • the behavior recognition method provided by the embodiment of the present disclosure is applied to an electronic device.
  • The following describes exemplary applications of the electronic device provided by the embodiments of the present disclosure. The electronic device provided by the embodiments of the present disclosure can be implemented as various types of user terminals (hereinafter referred to as terminals), such as AR (Augmented Reality) glasses, notebook computers, tablet computers, desktop computers, set-top boxes, and mobile devices (for example, mobile telephones, portable music players, personal digital assistants, dedicated messaging devices, and portable game devices), and can also be implemented as a server.
  • FIG. 2 is a schematic flowchart of an optional behavior recognition method provided by an embodiment of the present disclosure, which will be described in conjunction with the steps shown in FIG. 2 .
  • In S101, the terminal can first detect each image to be detected to obtain the features of each object, and then encode the features of each object to obtain the multi-dimensional features of each of the multiple objects in the image to be detected. It should be noted that the multiple objects may be all of the objects in the image to be detected, or may be some of the objects in the image to be detected.
  • Here, the terminal may obtain the feature of each of the multiple objects by performing image detection and word vector detection on the image to be detected.
  • The feature of each object can be a feature composed of the position feature, the visual feature, and the word vector feature of the object, where the position feature can be the coordinates of the annotation box of the object in the image to be detected, the visual feature can be the region of interest (RoI) pooled feature map corresponding to the coordinates of the annotation box, and the word vector feature can be the word vector corresponding to the category information of the object.
  • In implementation, the terminal can first use the Faster R-CNN model to perform image detection on the image to be detected, obtaining the position feature and the visual feature of each object, the category information of each object (for example, person, tree, etc.), and the confidence (confidence result) corresponding to the category information; the terminal can then use a word vector and text classification model (for example, the fastText model) to perform word vector detection on the category information, obtaining the word vector feature corresponding to the category information of each object.
  • The image to be detected may be an image of any scene; for example, it may be a collected image of a customer shopping in a store, or a collected image of a certain scenic spot, and the embodiments of the present disclosure are not limited thereto.
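  • As a concrete illustration of this detection step, the following sketch pairs a torchvision Faster R-CNN detector with a fastText word-vector lookup; the pretrained weights, the score threshold, the partial COCO label mapping, and the fastText model file are illustrative assumptions rather than the disclosure's exact configuration.

```python
import torch
import torchvision

# Pretrained detector; "weights=" is the current torchvision argument
# (older versions use pretrained=True instead).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

# Partial COCO id-to-name mapping, for illustration only.
COCO_NAMES = {1: "person", 4: "motorcycle", 22: "elephant"}

def detect_objects(image, score_threshold=0.5):
    """image: FloatTensor [3, H, W] scaled to [0, 1]."""
    with torch.no_grad():
        output = detector([image])[0]
    keep = output["scores"] >= score_threshold          # second preset score threshold
    return {
        "boxes": output["boxes"][keep],                 # position features (annotation-box coordinates)
        "scores": output["scores"][keep],               # confidence results
        "labels": [COCO_NAMES.get(int(l), "object")     # category information
                   for l in output["labels"][keep]],
    }

# Word vector detection on the category information could then use fastText:
#   ft = fasttext.load_model("cc.en.300.bin")   # hypothetical model file
#   word_vec = ft.get_word_vector("person")     # 300-d word vector feature
```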
  • In S102, based on partial features among the features of each group member object of each group of objects, the spatial results of at least two categories of objects in each group of objects and the action result of each group member object are determined; each group of objects includes at least an object whose category is object and an object whose category is person among the multiple objects.
  • In the embodiment of the present disclosure, the terminal can group the multiple objects to obtain multiple groups of objects, where each group of objects includes at least an object whose category is object and an object whose category is person, and any two groups of objects differ in at least one group member object.
  • For each group of objects, the terminal can determine the spatial result between the group member objects in the group according to partial features among the features of each group member object in the group, and determine the action result of each group member object.
  • Here, each group of objects may include two categories of objects, person and object, or each group of objects may include three categories of objects, person, object, and animal.
  • For example, when a person, an object 1, and an object 2 are detected, the terminal may divide these 3 objects into two groups, person-object 1 and person-object 2; obviously, the two groups differ in one group member object (object 1 is different from object 2). After obtaining the two groups, for person-object 1, the terminal determines the spatial result between the person and object 1 according to partial features among the features of the person and object 1 in the group, and determines the action result of the person and the action result of object 1 respectively; for person-object 2, the terminal determines the spatial result between the person and object 2 according to partial features among the features of the person and object 2 in the group, and determines the action result of the person and the action result of object 2 respectively.
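  • A minimal sketch of this grouping, assuming each detection carries a category string (an illustrative representation, not the disclosure's data structure): every object whose category is person is paired with every object of another category, so any two groups differ in at least one group member object.

```python
def make_object_groups(labels):
    """labels: list of category strings, one per detected object."""
    persons = [i for i, c in enumerate(labels) if c == "person"]
    others = [i for i, c in enumerate(labels) if c != "person"]
    # One group per person-object pair, e.g. person-object 1, person-object 2.
    return [(h, o) for h in persons for o in others]

# ["person", "motorcycle", "helmet"] -> [(0, 1), (0, 2)]
```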
  • In the embodiment of the present disclosure, the spatial result and the action result may be classification score values, and the terminal may obtain the spatial result and the action result through fully connected layers.
  • In S103, the terminal can determine the relationship interaction feature corresponding to each group of objects according to the multi-dimensional features corresponding to the multiple objects respectively. For each group of objects, the terminal can determine, according to the relationship interaction feature of the group, whether the group member objects in the group are associated with each other; when it is determined that they are, the terminal determines the target result corresponding to the group based on the spatial result between the group member objects in the group and the action result of each group member object. In this way, when one or more of the multiple groups of objects have group member objects that are associated with each other (such a group is hereinafter referred to as an associated object group), at least one target result can be obtained correspondingly; for example, when there are 3 groups of objects of which 2 are associated object groups, two target results corresponding to the 2 associated object groups are obtained.
  • When it is determined that the group member objects of a group of objects are not associated with each other, the group is not an associated object group and has no target result; that is, the embodiments of the present disclosure filter out, through the determination of the target result, the groups whose member objects are not associated with each other. In this way, the interference factors are reduced when the object behavior in the image to be detected is subsequently determined, and the amount of data required for computation is reduced at the same time, which improves the recognition accuracy and recognition efficiency when recognizing human-object interaction behavior.
  • When the terminal obtains at least one target result, it can determine the object behavior in the image to be detected according to the at least one target result and the at least one associated object group corresponding to the at least one target result.
  • Here, the object behavior in the image to be detected may be the behavior between a person and an object; for example, for the image to be detected in FIG. 1A, the obtained object behavior may be "a man riding an elephant", and, for the image to be detected in FIG. 1B, the obtained object behavior may be "many people sitting at the dining table".
  • In some embodiments, the target result is a target value, and the terminal may select, according to at least one target value, the associated object group corresponding to the highest target value from the associated object groups corresponding to the at least one target value, and recognize the behavior among the group member objects in the selected associated object group. In implementation, when the terminal obtains at least one target value, it can sort the at least one target value, select the highest target value according to the sorting result, and take the associated object group corresponding to the highest target value as the recognition target, so as to recognize the behavior actions among the group member objects in this associated object group.
  • Here, the embodiments of the present disclosure may adopt a recognition model in the related art to recognize the behaviors among the group member objects in this associated object group, and the embodiments of the present disclosure do not limit the recognition model.
  • determining the relational interaction features of each group of objects based on the multi-dimensional features in S103 above may be implemented through S1031-S1033, which will be described in conjunction with the steps shown in FIG. 3 .
  • In S1031, the terminal may generate the fully connected graph corresponding to the multiple objects based on the multi-dimensional features corresponding to the multiple objects respectively.
  • Here, the fully connected graph can be represented by an adjacency matrix, and each entry of the adjacency matrix represents the degree of association between the two corresponding objects; thus the adjacency matrix can represent the degree of association between any two of the multiple objects.
  • The adjacency matrix can be represented by the following formula (1), where A_f denotes the adjacency matrix, i denotes the i-th object (which can also be called a node), f_i denotes the multi-dimensional feature of the i-th object, and N denotes the total number of the multiple objects.
  • In S1032, when the terminal obtains the fully connected graph corresponding to the multiple objects in the image to be detected, it can perform a graph convolution operation on the multi-dimensional feature of each of the multiple objects and the fully connected graph, and obtain the updated multi-dimensional feature of each object through this operation.
  • In implementation, the terminal can input the multi-dimensional feature of each object and the adjacency matrix used to represent the fully connected graph into a graph convolutional network (GCN), perform the graph convolution operation through the GCN, and output the updated multi-dimensional feature of each object.
  • In some embodiments, the above S1032 can be implemented in the following manner: based on the adjacency matrix and the multi-dimensional features corresponding one-to-one to the objects, iterate the multi-dimensional feature of each object through the graph neural network to obtain the updated multi-dimensional feature corresponding to each object; the fully connected graph is represented by the adjacency matrix, and each entry of the adjacency matrix represents the degree of association between the two corresponding objects.
  • In some embodiments, the above two objects include a first object and a second object, and the degree of association between the two objects can be determined through S201 to S203, which will be described in conjunction with the steps shown in FIG. 4.
  • In S201, the terminal may determine the similarity between the first object and the second object according to the multi-dimensional feature of the first object and the multi-dimensional feature of the second object, for example, the dot-product similarity or the cosine similarity.
  • The similarity between the first object and the second object can be represented by the following formula (2), where F_se(f_i, f_j) denotes the dot-product similarity between the i-th object (the first object) and the j-th object (the second object), i and j are any integers from 1 to N with i not equal to j, f_i denotes the multi-dimensional feature of the i-th object, and f_j denotes the multi-dimensional feature of the j-th object.
  • In S202, the position feature of the first object in each image to be detected and the position feature of the second object in each image to be detected can be obtained, and the terminal may determine the distance between the first object and the second object according to the position feature of the first object and the position feature of the second object. For example, when the position feature is the coordinates of the annotation box (for example, the coordinates of the center point of the annotation box, or the coordinates of its upper-left and lower-right corner points), the terminal may calculate the distance between the first object and the second object from the annotation-box coordinates of the first object and the annotation-box coordinates of the second object.
  • The distance between the first object and the second object can be represented by the following formula (3), where D(b_i, b_j) denotes the coordinate distance between the i-th object and the j-th object calculated from the annotation-box coordinates, and F_dist(f_i, f_j) denotes the distance between the i-th object and the j-th object.
  • In S203, the terminal may calculate the degree of association between the first object and the second object according to the similarity and the distance.
  • The degree of association between the first object and the second object can be calculated by the following formula (4), where N denotes the total number of the multiple objects, f_j denotes the multi-dimensional feature of the j-th object, f_i denotes the multi-dimensional feature of the i-th object, and exp(·) denotes the exponential function with base e.
  • In implementation, the terminal can input the adjacency matrix and the multi-dimensional features of all objects into a multi-layer graph neural network, and iteratively update the multi-dimensional feature of each object through the multi-layer graph neural network, so as to obtain the updated multi-dimensional feature of each object.
  • In some embodiments, the terminal may iteratively update the multi-dimensional feature corresponding to each object based on the update parameter, the adjacency matrix, the first weight parameter corresponding to the number of iterations, and the multi-dimensional features of all objects, and, when the number of iterations reaches the first preset number of times, use the features generated after the first preset number of times as the updated multi-dimensional features corresponding to the objects.
  • the update parameter may be an activation function
  • the first weight parameter corresponding to the number of iterations may be a learnable weight matrix corresponding to each layer of the graph neural network, and the number of iterations may be determined according to the number of layers of the graph neural network.
  • For example, when the graph neural network has two layers, each layer corresponds to a learnable weight, and the number of iterations can be determined to be 2. That is, for the first layer of the graph neural network, the input is the adjacency matrix and the multi-dimensional feature of each object, and the output is the multi-dimensional feature of each object after the first iteration; for the second layer, the input is the adjacency matrix and the multi-dimensional feature of each object after the first iteration, and the output is the multi-dimensional feature of each object after the second iteration; the multi-dimensional feature after the second iteration is the updated multi-dimensional feature of each object obtained after the iterations.
  • g^(l+1) = σ(A · g^(l) · W^(l))  (5)
  • where A denotes the adjacency matrix; g^(l) ∈ R^(N×d) denotes the iterated multi-dimensional features of the objects output by the l-th layer; g^(l+1) denotes the iterated multi-dimensional features of the objects output by the (l+1)-th layer; g^(0) = f denotes the features of the objects at layer 0, that is, the multi-dimensional features of the objects; W^(l) ∈ R^(d×d) denotes the learnable weight matrix of the l-th layer; d is the size of the input and output features; and σ(·) denotes the activation function, for example, a rectified linear unit (ReLU). It can be seen from formula (5) that the input of the (l+1)-th layer is the output of the l-th layer.
  • In the embodiment of the present disclosure, l may be 1; that is, a two-layer graph neural network can be used to iteratively update the multi-dimensional feature of each object. In this way, the update efficiency of the multi-dimensional features of the objects can be improved, which is beneficial to improving the recognition efficiency of human-object interaction behavior.
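  • A minimal sketch of formula (5) with a two-layer graph network: each layer multiplies by the adjacency matrix A and a learnable weight matrix W^(l) (the first weight parameter) and applies ReLU as the activation (the update parameter); the two passes correspond to a first preset number of times of 2.

```python
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    def __init__(self, d):
        super().__init__()
        # W^(0), W^(1): learnable d x d weight matrices, one per layer.
        self.layers = nn.ModuleList([nn.Linear(d, d, bias=False) for _ in range(2)])

    def forward(self, a, g):
        """a: [N, N] adjacency matrix; g: [N, d] multi-dimensional features g^(0)."""
        for w_l in self.layers:
            g = torch.relu(a @ w_l(g))   # g^(l+1) = sigma(A . g^(l) . W^(l))
        return g                         # updated multi-dimensional features
```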
  • In S1033, when the terminal obtains the updated multi-dimensional feature corresponding to each group member object in a group of objects, it can determine the relationship interaction feature of the group of objects using the updated multi-dimensional features.
  • In implementation, the terminal may superimpose the updated multi-dimensional features of the group member objects in the channel dimension, and use the superimposed feature as the relationship interaction feature of the group of objects.
  • When the terminal obtains the relationship interaction feature of the group of objects, it can input the relationship interaction feature into a fully connected layer, perform interaction classification on the group of objects through the fully connected layer, and use the obtained interaction classification score as the interaction result of the group of objects.
  • When the terminal obtains the interaction result of the group of objects, it can compare the interaction result with the first preset score threshold and, when the interaction result is greater than or equal to the first preset score threshold, determine that the group member objects in the group of objects are associated with each other.
  • the first preset score threshold may be set according to actual needs, and the embodiment of the present disclosure does not limit the value of the first preset score threshold.
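  • A sketch of this association check, assuming 768-dimensional updated features and a single sigmoid-activated fully connected layer; the layer shape and the threshold value are illustrative.

```python
import torch
import torch.nn as nn

interaction_fc = nn.Sequential(nn.Linear(2 * 768, 1), nn.Sigmoid())

def associated_groups(updated, groups, theta_s=0.5):
    """updated: [N, 768] updated multi-dimensional features; groups: (person, object) index pairs."""
    kept = []
    for h, o in groups:
        rel = torch.cat([updated[h], updated[o]], dim=-1)   # relationship interaction feature
        score = interaction_fc(rel)                         # interaction result
        if score.item() >= theta_s:                         # first preset score threshold
            kept.append((h, o))                             # associated object group
    return kept
```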
  • the determination of the target result of each group of objects based on the spatial result and the action result in S103 above may go through S1036-S1038, which will be described with the steps shown in FIG. 6 .
  • In S1036, the terminal can update the multi-dimensional feature of each group member object in the group of objects according to the relationship interaction feature of the group and the preset parameters, so as to obtain the refined feature of each group member object, and determine the graph interaction feature of the group of objects according to the refined features.
  • the terminal may superimpose the refined features of all member objects in the group of objects in the channel dimension, so as to obtain the graph interaction features of the group of objects.
  • In some embodiments, the preset parameters include the second weight parameter and the number of iterations; updating the multi-dimensional feature of each group member object based on the relationship interaction feature of each group of objects and the preset parameters in S1036 above, to obtain the refined feature of each group member object, can be achieved in the following way: iteratively update the multi-dimensional feature of each group member object based on the second weight parameter and the relationship interaction feature of each group of objects and, when the number of iterations reaches the second preset number of times, use the features generated after the second preset number of times as the refined features of the group member objects.
  • In implementation, the terminal can iteratively update the multi-dimensional feature of each group member object in the group according to the second weight parameter and the relationship interaction feature of the group. For example, in the first iteration, the multi-dimensional feature of each group member object is used as input, and the multi-dimensional feature of each group member object after the first iteration is obtained; the multi-dimensional feature after the first iteration is then used as the input of the second iteration, and the process continues in this way until the number of iterations reaches the second preset number of times, after which the multi-dimensional features corresponding to the second preset number of times are used as the refined features of the group member objects.
  • In the above formula, the first three symbols denote the relationship interaction feature of each group of objects, the indicator function, and the interaction result of each group of objects, respectively; θ_s denotes the first preset score threshold; γ denotes the second weight parameter (a weighting parameter); N denotes the total number of the multiple objects; f_i(t) denotes the refined feature of the i-th object; f_i(t-1) denotes the feature of the i-th object input when obtaining the refined feature of the i-th object; f_j(t-1) denotes the feature of the j-th object input when obtaining the refined feature of the i-th object; and t denotes the number of iterations. When t = 1, f_i(t-1) denotes the multi-dimensional feature of the i-th object, and f_j(t-1) denotes the multi-dimensional feature of the j-th object.
  • the second preset number of times may be set according to actual needs, which is not limited in this embodiment of the present disclosure.
  • For example, the second preset number of times may be 2. In this way, the efficiency of obtaining the refined features of each group member object can be improved, which is beneficial to improving the recognition efficiency of human-object interaction behavior.
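  • Because the symbols of the refinement formulas above are only partially legible in this text, the following is a hedged sketch of the iteration: when the interaction result passes the threshold, each member feature receives a γ-weighted update for the second preset number of times (2 here); the form of the message term and the per-object gating are assumptions.

```python
import torch

def refine_features(f, messages, interactions, gamma=0.1, theta_s=0.5, t_max=2):
    """f: [N, d] member features; messages: [N, d] per-object aggregate of the
    relationship interaction features (assumed form); interactions: [N] interaction results."""
    gate = (interactions >= theta_s).float().unsqueeze(-1)  # indicator function
    for _ in range(t_max):                                  # second preset number of times
        f = f + gamma * gate * messages                     # gamma: second weight parameter
    return f                                                # refined features f_i(t)
```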
  • In S1037, when the terminal obtains the graph interaction feature of the group of objects, it may classify the graph relationship of the group of objects according to the graph interaction feature to obtain the graph relationship result.
  • In implementation, the terminal can input the graph interaction feature of the group of objects into a fully connected layer, classify the graph relationship of the group through the fully connected layer to obtain a graph relationship classification score, and use the obtained graph relationship classification score as the graph relationship result of the group of objects.
  • The process by which the terminal obtains the graph relationship result of the group of objects can be expressed by the following formula (8):
  • S1038 Determine target results for each group of objects based on the spatial results, action results, interaction results, graph relationship results, and confidence results obtained when detecting each group member object.
  • In implementation, the terminal can use the spatial result of the group of objects, the action result of each group member object, the interaction result and the graph relationship result of the group, and the confidence result obtained when detecting each group member object in the above steps, to obtain the target result of the group of objects.
  • For example, the terminal may determine a first product value of the confidence results of all group member objects; determine a second product value of the action results of all group member objects; determine a third product value of the first product value, the second product value, the spatial result, and the graph relationship result; determine an indicator value from the interaction result and the first preset score threshold; and use the product of the third product value and the indicator value as the target result of the group of objects.
  • Here, s_h or s_o denotes the confidence result of a group member object, where s_h denotes the confidence result of the object whose category is person and s_o denotes the confidence result of the object whose category is object; the corresponding action-result symbols denote the action result of the object whose category is person and the action result of the object whose category is object, respectively; a further term denotes the product of the graph relationship result and the spatial result; another symbol denotes the interaction result; θ_s denotes the first preset score threshold; and 𝟙(·) denotes the indicator function.
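  • The combination just described can be transcribed directly, with an illustrative threshold value:

```python
def target_result(s_h, s_o, a_h, a_o, spatial, graph_rel, interaction, theta_s=0.5):
    """Target result of one group: confidences x actions x spatial x graph
    relationship, gated by the interaction result reaching the threshold."""
    first = s_h * s_o          # first product value: member confidence results
    second = a_h * a_o         # second product value: member action results
    third = first * second * spatial * graph_rel
    indicator = 1.0 if interaction >= theta_s else 0.0
    return third * indicator
```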
  • the encoding of features in S101 above to obtain multi-dimensional features corresponding to multiple objects respectively may be implemented through S1011-S1014, which will be described below in conjunction with the steps in FIG. 7 .
  • S1011 Encode the location features corresponding to the multiple objects respectively to obtain the first feature of each object; the detection includes: image detection and word vector detection.
  • S1012. Encode the visual features corresponding to the multiple objects respectively to obtain the second feature of each object.
  • S1013. Encode the word vector features corresponding to the multiple objects respectively to obtain the third feature of each object.
  • S1014. According to the first feature, the second feature, and the third feature, obtain the multi-dimensional features corresponding to the multiple objects respectively, where the first feature, the second feature, and the third feature have the same dimension.
  • In implementation, these three features can be encoded separately into the same feature space, so as to obtain the first feature, the second feature, and the third feature with the same dimension.
  • Here, the position feature of an object may be the coordinates of the annotation box of the object in the image to be detected, the visual feature may be the RoI pooled feature map corresponding to the coordinates of the annotation box, and the word vector feature may be the word vector corresponding to the category information of the object.
  • the detection of each image to be detected in the above S101 to obtain the features of multiple objects may be implemented through S401-S403, which will be described below in conjunction with the steps in FIG. 8 .
  • S401. Perform image detection on each image to be detected to obtain the position feature, visual feature, confidence result, and corresponding category information of each detected target.
  • S402. Take the targets whose confidence results are greater than or equal to the second preset score threshold as the detected objects, and obtain the position features, visual features, and category information corresponding to the multiple objects respectively.
  • S403. Perform word vector detection on the category information of each object to obtain the word vector feature of each object.
  • In implementation, the terminal can obtain, through image detection, the position feature, visual feature, confidence result, and category information corresponding to the confidence result of each target in the image to be detected. The terminal can then compare the confidence result of each target with the second preset score threshold and, according to the comparison result, remove the targets whose confidence results are less than the second preset score threshold and keep the targets whose confidence results are greater than or equal to it, using all of the retained targets as the above multiple objects; in this way, the position feature, visual feature, confidence result, and category information corresponding to each object are obtained. After obtaining the category information of each object, the terminal may also perform word vector detection on the category information to obtain the word vector feature corresponding to each object.
  • In the embodiment of the present disclosure, each target whose confidence result is greater than or equal to the second preset score threshold is used as an object for the subsequent behavior recognition of the image to be detected. In this way, the interference factors in recognizing the interaction behaviors between people and objects in the image to be detected can be reduced, which is beneficial to improving the recognition accuracy when recognizing those interaction behaviors.
  • the second preset score threshold may be set according to actual needs, which is not limited in this embodiment of the present disclosure.
  • In the case of encoding, the terminal can use a multilayer perceptron (MLP) to encode the position feature, the visual feature, and the word vector feature of each object, so as to obtain the corresponding first feature, second feature, and third feature of the same dimension.
  • the first feature, the second feature and the third feature may all be 256-dimensional features.
  • In implementation, the terminal can superimpose the first feature, the second feature, and the third feature in the channel dimension to obtain the corresponding 768-dimensional multi-dimensional feature.
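  • A sketch of this encoding, assuming single-hidden-layer MLPs and illustrative input sizes (4-dimensional box coordinates, a 7 x 7 x 256 RoI feature map, and a 300-dimensional word vector); only the 256- and 768-dimensional outputs come from the text.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim=256, hidden=512):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

pos_enc = mlp(4)             # position feature -> first feature (256-d)
vis_enc = mlp(7 * 7 * 256)   # dimension-transformed visual feature -> second feature
word_enc = mlp(300)          # word vector feature -> third feature

def encode_object(box, roi_feat, word_vec):
    f1 = pos_enc(box)                        # first feature
    f2 = vis_enc(roi_feat.reshape(-1))       # reshape = dimension transformation
    f3 = word_enc(word_vec)                  # third feature
    return torch.cat([f1, f2, f3], dim=-1)   # 768-d multi-dimensional feature
```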
  • the above S1012 can be implemented through S301-S302:
  • In S301, the terminal can perform dimension transformation (reshape) on the visual features to obtain the dimension-transformed one-dimensional visual feature of each object; in S302, the dimension-transformed one-dimensional visual feature is encoded, and the second feature of each object is obtained by the encoding.
  • In some embodiments, determining, based on each group member object of each group of objects in the above S102, the spatial results of at least two categories of objects in each group of objects and the action result of each group member object can be realized through S1021 to S1024, which will be described in conjunction with the steps in FIG. 9.
  • S1021. Determine, based on the position feature of each group member object of each group of objects, the image area of each group member object in each image to be detected; the partial features include the position feature and the visual feature of each group member object, and the position features and the visual features are obtained by performing image detection on each image to be detected.
  • In implementation, for each group member object, the terminal may determine the corresponding image area of the group member object from the corresponding image to be detected according to the position feature of the group member object.
  • For example, for a motorcycle in the image to be detected, the terminal may crop the image to be detected according to the coordinates of the annotation box of the motorcycle, so as to obtain the image area of the motorcycle.
  • In S1022, when the terminal obtains the image area of each group member object, it can splice the image areas of all group member objects to obtain the image area of the group of objects, and encode the image area of the group; in the encoding process, in the person channel, the value of the person's image area is 1 and the value of the other areas is 0, and, in the object channel, the value of the object's image area is 1 and the value of the other areas is 0, thus obtaining the two-dimensional feature data of the group of objects.
  • For example, the terminal may splice the image area of the person and the image area of the motorcycle obtained in S1021, so as to obtain the image area of the person-motorcycle group of objects, and set, in the person channel, the value of the person's image area to 1 and that of the other areas to 0, and, in the motorcycle channel, the value of the motorcycle's image area to 1 and that of the other areas to 0, thus obtaining the two-dimensional feature data of the person-motorcycle group of objects.
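  • A sketch of this two-channel encoding over the spliced image area of a person-object pair; the 64 x 64 map resolution is an illustrative assumption.

```python
import torch

def spatial_map(person_box, object_box, size=64):
    """Boxes as (x1, y1, x2, y2) in image coordinates."""
    x1 = min(person_box[0], object_box[0]); y1 = min(person_box[1], object_box[1])
    x2 = max(person_box[2], object_box[2]); y2 = max(person_box[3], object_box[3])
    w, h = x2 - x1, y2 - y1                  # spliced image area of the group
    m = torch.zeros(2, size, size)           # channel 0: person, channel 1: object
    for c, (bx1, by1, bx2, by2) in enumerate((person_box, object_box)):
        u1 = int((bx1 - x1) / w * size); v1 = int((by1 - y1) / h * size)
        u2 = int((bx2 - x1) / w * size); v2 = int((by2 - y1) / h * size)
        m[c, v1:v2, u1:u2] = 1.0             # box region = 1, elsewhere 0
    return m                                 # two-dimensional feature data
```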
  • In S1023, after obtaining the two-dimensional feature data of the group of objects and the visual feature of each group member object, the terminal may separately perform feature processing on the two-dimensional feature data and on the visual features, so as to obtain the processed two-dimensional feature data and the processed visual features respectively.
  • In implementation, the terminal can first perform feature extraction on the two-dimensional feature data through a convolutional neural network (CNN) block to obtain the first sub-feature, and extract the visual feature of each group member object through a residual block (Res Block) to obtain the second sub-feature; it then performs global average pooling (GAP) on the first sub-feature and the second sub-feature respectively, obtaining, correspondingly, the processed two-dimensional feature data and the processed visual features.
  • the processed two-dimensional feature data and the processed visual features can be represented by the following formulas (10), (11) and (12):
  • Here, F denotes the feature map of the image to be detected after RoI pooling; f_h or f_o denotes the processed visual feature of a group member object, where f_h denotes the processed visual feature of the object whose category is person and f_o denotes the processed visual feature of the object whose category is object; f_{h,o} denotes the processed two-dimensional feature data, for example, the processed two-dimensional feature data of a person-object group when a group of objects includes two objects of the categories person and object; F_{h,o} denotes the image area corresponding to each group of objects; b_h or b_o denotes the position feature of a group member object, where b_h denotes the position feature of the object whose category is person and b_o denotes the position feature of the object whose category is object; and RoI(F, b_h) or RoI(F, b_o) denotes the visual feature of a group member object, where RoI(F, b_h) denotes the visual feature of the object whose category is person and RoI(F, b_o) denotes the visual feature of the object whose category is object.
  • In S1024, when the terminal obtains the processed two-dimensional feature data and the processed visual feature of each group member object, it can spatially classify the group of objects according to the processed two-dimensional feature data to obtain the spatial result corresponding to the group, and classify the actions of each group member object according to its processed visual feature to obtain the action result of the group member object.
  • In implementation, the terminal can input the processed two-dimensional feature data into a fully connected layer, classify the group of objects through the fully connected layer to obtain a spatial classification score, and use the spatial classification score as the spatial result of the group of objects; the terminal can also input the processed visual feature of each group member object into another fully connected layer, classify the group member object through that fully connected layer to obtain an action classification score, and use the action classification score as the action result.
  • The terminal classifying each group of objects according to the processed two-dimensional feature data to obtain the spatial result of each group, and classifying each group member object according to its processed visual feature to obtain the action result of each group member object, can be expressed by the following formulas (13), (14), and (15):
  • Here, W_h denotes the learnable weight of the fully connected layer corresponding to the group member object whose category is person; W_o denotes the learnable weight of the fully connected layer corresponding to the group member object whose category is object; and W_{h,o} denotes the learnable weight of the fully connected layer corresponding to each group of objects.
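  • A sketch of the two classification streams as fully connected classifiers over the pooled features; the channel counts and the 117-class action space (the verb count used by the HICO-DET benchmark) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# CNN block on the two-dimensional feature data, then global average pooling.
cnn_block = nn.Sequential(
    nn.Conv2d(2, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # GAP -> processed 2-D feature data
)
spatial_fc = nn.Linear(64, 1)       # W_{h,o}: spatial result of the group
action_fc = nn.Linear(256, 117)     # W_h / W_o: action result per member

def classify(pair_map, member_visual):
    """pair_map: [2, 64, 64] two-dimensional feature data;
    member_visual: [256] processed visual feature of one group member."""
    s_spatial = torch.sigmoid(spatial_fc(cnn_block(pair_map.unsqueeze(0))))
    s_action = torch.sigmoid(action_fc(member_visual))
    return s_spatial, s_action
```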
  • FIG. 10 is a partial flow diagram of an example of using a behavior recognition method to identify an object behavior in an image to be detected provided by an embodiment of the disclosure.
  • As shown in FIG. 10, the terminal performs target detection and word vector detection on an image to be detected I, and obtains the position feature, confidence result, and word vector feature of each object in the image to be detected; for example, a detector can be used to detect the word vector of the motorcycle to obtain the word vector feature corresponding to the motorcycle, and to detect the word vector of the helmet to obtain the word vector feature corresponding to the helmet.
  • The terminal may perform feature extraction on the ROI-pooled image obtained during the image detection of the image to be detected, to obtain the visual feature of each object.
  • The terminal can use the semantic encoding module to encode the location feature and the word vector feature respectively, obtaining the first feature and the third feature; at the same time, it performs dimension transformation processing (Reshape) on the visual feature of each object and uses an MLP to encode the reshaped visual feature, obtaining a second feature with the same dimension as the first and third features. The first, second, and third features are then stacked in the channel dimension to obtain the multi-dimensional feature corresponding to each object in the image I to be detected.
  • A fully connected graph corresponding to all objects is generated and characterized by an adjacency matrix (not shown in FIG. 10). The adjacency matrix and the multi-dimensional features corresponding to all objects are taken as the input of the GCN, and the updated multi-dimensional feature of each object is obtained through the graph convolution processing of the GCN. According to the updated multi-dimensional features of the member objects in each group, the relationship interaction feature of each group of objects is obtained; the relationship interaction feature of each group is input into fully connected layers (FCs) to classify the group and obtain the interaction result of each group of objects.
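A minimal sketch of this graph step is given below, assuming a single-layer A·X·W graph convolution and assuming the relation interaction feature of a pair is the concatenation of its two updated member features; neither assumption is fixed by the text above.

```python
# Minimal GCN sketch: N object features plus an N x N adjacency matrix go
# through one graph convolution, then a pair's relation interaction feature
# is classified by a fully connected layer to get its interaction result.
import torch
import torch.nn as nn

N, D = 4, 512                                 # assumed object count / feature size
X = torch.randn(N, D)                         # multi-dimensional features
A = torch.softmax(torch.randn(N, N), dim=1)   # adjacency (association degrees)

W_gcn = nn.Linear(D, D, bias=False)
X_updated = torch.relu(W_gcn(A @ X))          # updated multi-dimensional features

# Relation interaction feature of one group (person i, object j), assumed
# to be the concatenation of the two updated member features.
i, j = 0, 1
rel_feat = torch.cat([X_updated[i], X_updated[j]], dim=-1)
interaction_head = nn.Linear(2 * D, 1)
interaction_result = torch.sigmoid(interaction_head(rel_feat))
```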
  • Grouping all objects in the image I to be detected yields multiple groups of objects.
  • The terminal retains each group of objects whose interaction result is greater than or equal to the first preset score threshold, obtaining multiple associated object groups whose member objects are related to each other. According to the relationship interaction feature of each associated object group and the preset parameters, it updates the multi-dimensional feature of each member object in the group to obtain the refined feature of each member object (the update process is shown in Fig. 10). For each associated object group, the terminal stacks the refined features of all member objects in the group along the channel dimension to obtain the graph interaction feature of the group (Fig. 10), inputs the graph interaction feature into a fully connected layer for classification, and obtains the graph relationship result of the associated object group.
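The following sketch shows the filtering and graph-relationship step under stated assumptions: the threshold value, the feature shapes, and the concatenation-based channel stacking are placeholders, not values from the patent.

```python
# Sketch: keep only pairs whose interaction result reaches the first preset
# score threshold, stack the refined member features along the channel
# dimension, and classify the result to get the graph relationship result.
import torch
import torch.nn as nn

FIRST_SCORE_THRESHOLD = 0.5   # assumed value of the first preset threshold
D = 512

pairs = [((0, 1), 0.83), ((0, 2), 0.21)]   # (member indices, interaction result)
refined = torch.randn(3, D)                # refined features of all objects

graph_head = nn.Linear(2 * D, 1)
for (h, o), score in pairs:
    if score < FIRST_SCORE_THRESHOLD:
        continue                            # not an associated object group
    graph_feat = torch.cat([refined[h], refined[o]], dim=-1)  # channel stack
    graph_result = torch.sigmoid(graph_head(graph_feat))
```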
  • The terminal obtains the image area of each object according to its location feature, stitches the image areas of the member objects of each group of objects to obtain the image area of each group of objects, and encodes the image area of the group to obtain two-dimensional feature data.
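The patent does not fix how the pair's joint image region is encoded into two-dimensional feature data; a common choice in human-object interaction work, sketched here purely as an assumption, is a two-channel binary map over the union box of the two members.

```python
# Hypothetical two-channel binary encoding of a person box and an object
# box, normalized to their union box. Map resolution is a placeholder.
import numpy as np

def spatial_map(box_h, box_o, size=64):
    """box_* = (x1, y1, x2, y2); returns a 2 x size x size binary map."""
    x1 = min(box_h[0], box_o[0]); y1 = min(box_h[1], box_o[1])
    x2 = max(box_h[2], box_o[2]); y2 = max(box_h[3], box_o[3])
    w, h = max(x2 - x1, 1e-6), max(y2 - y1, 1e-6)
    out = np.zeros((2, size, size), dtype=np.float32)
    for c, (bx1, by1, bx2, by2) in enumerate((box_h, box_o)):
        cx1 = int((bx1 - x1) / w * size); cy1 = int((by1 - y1) / h * size)
        cx2 = int((bx2 - x1) / w * size); cy2 = int((by2 - y1) / h * size)
        out[c, cy1:cy2, cx1:cx2] = 1.0    # fill the member's box region
    return out

example = spatial_map((10, 20, 60, 120), (50, 80, 140, 160))
```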
  • The visual features of the group member objects (for example, object 1 and object 2 in Figure 10) are respectively input into the residual network for feature extraction, yielding different second sub-features; the two-dimensional feature data is input into the convolutional neural network for feature extraction to obtain the first sub-feature (not shown in Figure 10). Global average pooling is then performed on the first sub-feature and on each second sub-feature, producing the processed visual features of each group member object (for example, the processed visual features of object 1 and of object 2 in Figure 10) and the processed two-dimensional feature data of the group of objects.
  • The processed visual features of object 1, the processed visual features of object 2, and the processed two-dimensional feature data of this group of objects are input into different fully connected layers for classification, yielding object 1's action result, object 2's action result, and the spatial result of the group of objects consisting of object 1 and object 2.
  • The terminal can calculate the target result of each associated object group by substituting all the results obtained in the above process into formula (9); then, according to the target results of all associated object groups, the associated object group corresponding to the highest target result identifies the interaction behavior of the person in the image I to be detected.
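Formula (9) is not reproduced in this extraction; a plausible reading, taken here as an assumption, is that the target value of each associated object group combines all of its scores multiplicatively, and the group with the highest target value is kept.

```python
# Hypothetical combination of the scores listed above into a single target
# value per associated object group, followed by selecting the best group.
def target_value(conf_h, conf_o, s_action_h, s_action_o, s_spatial,
                 s_interaction, s_graph):
    # Product form is an assumption; formula (9) may weight terms differently.
    return (conf_h * conf_o * s_action_h * s_action_o
            * s_spatial * s_interaction * s_graph)

groups = {
    ("person", "motorcycle"): target_value(0.98, 0.91, 0.8, 0.7, 0.9, 0.83, 0.77),
    ("person", "helmet"): target_value(0.98, 0.88, 0.4, 0.3, 0.6, 0.55, 0.41),
}
best_group = max(groups, key=groups.get)   # behavior is read from this group
```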
  • FIG. 11 is a schematic structural diagram of the behavior recognition device provided by an embodiment of the present disclosure. As shown in FIG. 11, the device includes an encoding part 10, configured to detect each image to be detected to obtain the features of multiple objects, and to encode the features to obtain multi-dimensional features corresponding to the multiple objects respectively.
  • The result determining part 20 is further configured to generate a fully connected graph corresponding to the multiple objects based on the multi-dimensional features corresponding to the multiple objects respectively; to perform graph convolution processing on the multi-dimensional features corresponding to each object and the fully connected graph, obtaining the updated multi-dimensional feature corresponding to each object; and to obtain the relationship interaction feature of each group of objects according to the updated multi-dimensional features of the member objects in each group.
  • the result determining part 20 is further configured to classify each group of objects according to the relationship interaction feature, and obtain the interaction result of each group of objects; If the interaction result is greater than or equal to a first preset score threshold, it is determined that the group member objects in each group of objects are related to each other.
  • The result determination part 20 is further configured to update the multi-dimensional feature of each group member object based on the relationship interaction features of each group of objects and preset parameters, obtaining the refined feature of each group member object, and to determine the graph interaction feature of each group of objects based on the refined features; to classify each group of objects based on the graph interaction features, obtaining graph relationship results; and to determine the target result of each group of objects based on the spatial results, the action results, the interaction results, the graph relationship results, and the confidence results obtained when the detection is performed on each group member object.
  • the target result is a target value
  • The behavior determination part 30 is further configured to select, according to at least one target value and from multiple associated object groups corresponding one-to-one to the at least one target value, the associated object group corresponding to the highest target value, and to identify the behavior among the member objects in that associated object group.
  • The fully connected graph is represented by an adjacency matrix, and each datum in the adjacency matrix represents the degree of association between the two corresponding objects; the result determining part 20 is further configured to iterate the multi-dimensional feature of each object through a graph neural network, based on the adjacency matrix and the multi-dimensional features corresponding one-to-one to each object, to obtain the updated multi-dimensional feature corresponding one-to-one to each object.
  • The two objects include a first object and a second object; the result determining part 20 is further configured to determine the similarity between the multi-dimensional feature of the first object and the multi-dimensional feature of the second object; to determine the distance between the first object and the second object based on the position feature of the first object in each image to be detected and the position feature of the second object in each image to be detected; and to determine the degree of association between the first object and the second object based on the similarity and the distance.
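A sketch of one adjacency entry follows. The combination rule is an assumption: the text states only that the degree of association depends on the feature similarity and on the distance between the two objects, not how the two are combined.

```python
# Hypothetical association degree between two objects from their
# multi-dimensional features (cosine similarity) and box-center distance.
import numpy as np

def association_degree(feat_1, feat_2, center_1, center_2):
    sim = np.dot(feat_1, feat_2) / (
        np.linalg.norm(feat_1) * np.linalg.norm(feat_2) + 1e-8)
    dist = np.linalg.norm(np.asarray(center_1) - np.asarray(center_2))
    # Higher similarity and smaller distance -> larger association degree.
    return sim / (1.0 + dist)
```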
  • The result determination part 20 is further configured to iteratively update the multi-dimensional feature of each object based on the update parameter, the adjacency matrix, the first weight parameter corresponding to the number of iterations, and the multi-dimensional features corresponding one-to-one to each object; when the number of iterations reaches a first preset number, the features generated after the first preset number of iterations are taken as the updated multi-dimensional feature of each object.
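The exact recurrence is not given; the residual form below is one reasonable instantiation and should be read as an assumption, with a per-iteration weight standing in for the first weight parameter and a scalar standing in for the update parameter.

```python
# Sketch of the iterative update with a first preset number of iterations.
import torch
import torch.nn as nn

T1 = 3                      # first preset number of iterations (assumed)
N, D = 4, 512
A = torch.softmax(torch.randn(N, N), dim=1)   # adjacency matrix
X = torch.randn(N, D)                          # multi-dimensional features
W_t = nn.ModuleList([nn.Linear(D, D, bias=False) for _ in range(T1)])
alpha = 0.5                 # update parameter (assumed to be a scalar)

for t in range(T1):
    X = X + alpha * torch.relu(W_t[t](A @ X))  # weight for iteration t
X_updated = X               # updated multi-dimensional features
```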
  • The preset parameters include a second weight parameter and the number of iterations; the result determining part 20 is further configured to iteratively update the multi-dimensional feature of each group member object based on the second weight parameter and the relationship interaction feature of each group of objects; when the number of iterations reaches a second preset number, the features generated after the second preset number of iterations are taken as the refined feature of each group member object.
  • The detection includes image detection and word vector detection; the encoding part 10 is further configured to encode the position features corresponding one-to-one to the multiple objects, obtaining the first feature of each object; to encode the visual features corresponding one-to-one to the multiple objects, obtaining the second feature of each object, the position features and the visual features being obtained by performing image detection on each image to be detected; to encode the word vector features corresponding one-to-one to the multiple objects, obtaining the third feature of each object, the word vector features being obtained by performing word vector detection on the category information of each object, and the category information being obtained by performing image detection on each image to be detected; and to obtain, according to the first, second, and third features, the multi-dimensional features corresponding one-to-one to the multiple objects, where the first, second, and third features have the same dimension.
  • The encoding part 10 is further configured to perform dimension transformation processing on the visual features corresponding one-to-one to the multiple objects, obtaining the dimension-transformed visual feature of each object, and to encode the dimension-transformed visual features, obtaining the second feature of each object.
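The encoding path can be sketched as follows, assuming a 4-coordinate box for the position feature, a 7x7x256 RoI map for the visual feature, a 300-dimensional word vector, and an MLP that maps everything to one common dimension; all of these sizes are placeholders.

```python
# Hypothetical encoders producing the first, second, and third features
# with one shared dimension, then stacking them on a channel axis.
import torch
import torch.nn as nn

D_OUT = 256
pos_enc = nn.Linear(4, D_OUT)                     # first feature (box coords)
vis_enc = nn.Sequential(nn.Linear(7 * 7 * 256, D_OUT), nn.ReLU(),
                        nn.Linear(D_OUT, D_OUT))  # second feature (MLP)
word_enc = nn.Linear(300, D_OUT)                  # third feature (word vector)

box = torch.randn(1, 4)
roi = torch.randn(1, 256, 7, 7).reshape(1, -1)    # dimension transformation
wv = torch.randn(1, 300)

f1, f2, f3 = pos_enc(box), vis_enc(roi), word_enc(wv)
multi_dim = torch.stack([f1, f2, f3], dim=1)      # (1, 3, D_OUT) channel stack
```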
  • The partial features include the position feature and the visual feature of each group member object, the position features and the visual features being obtained by performing image detection on each image to be detected; the result determining part 20 is further configured to determine the image area of each group member object in each image to be detected based on the position feature of each group member object of each group of objects; to obtain the image area corresponding to each group of objects according to the image area of each group member object, and to encode the image area corresponding to each group of objects, obtaining two-dimensional feature data; to perform feature processing on the two-dimensional feature data and on the visual feature of each group member object, correspondingly obtaining processed two-dimensional feature data and processed visual features; and to classify each group of objects according to the processed two-dimensional feature data, obtaining the spatial result of each group of objects, and to classify each group member object according to the processed visual features, obtaining the action result of each group member object.
  • The device further includes a detection part, configured to perform image detection on each image to be detected, obtaining the position feature, visual feature, and confidence result of each detected target, as well as the category information corresponding to the confidence result; to take the targets whose confidence results are greater than or equal to a second preset score threshold as the detected objects, obtaining the position features, visual features, and category information corresponding one-to-one to the multiple objects; and to perform word vector detection on the category information of each object, obtaining the word vector feature of each object.
  • A "part" may be a part of a circuit, a part of a processor, a part of a program or software, and so on; it may also be a unit, a module, or non-modular.
  • FIG. 12 is a schematic structural diagram of the electronic device provided by an embodiment of the present disclosure. As shown in FIG. 12, the electronic device includes a memory 22 and a processor 23, connected through a bus 21; the memory 22 is configured to store an executable computer program, and the processor 23 is configured to execute the executable computer program stored in the memory 22 to implement the method provided by the embodiments of the present disclosure, for example, the behavior recognition method provided by the embodiments of the present disclosure.
  • An embodiment of the present disclosure provides a computer-readable storage medium storing a computer program for causing a processor to implement the method provided in the embodiments of the present disclosure, for example, the behavior recognition method provided in the embodiments of the present disclosure.
  • An embodiment of the present disclosure provides a computer program, including computer readable codes; when the computer readable codes run in an electronic device, a processor in the electronic device executes the steps for implementing the above behavior recognition method.
  • An embodiment of the present disclosure provides a computer program product, including computer program instructions, which enable a computer to execute the steps of the above-mentioned behavior recognition method.
  • The computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or CD-ROM, or may be any device including one of, or any combination of, the above memories.
  • a computer readable storage medium may also be a tangible device that holds and stores instructions for use by an instruction execution device, and may be a volatile storage medium or a nonvolatile storage medium.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Examples include USB flash drives, magnetic disks, optical discs, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing.
  • A computer-readable storage medium is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a pulse of light through a fiber-optic cable), or an electrical signal transmitted through a wire.
  • Computer program instructions may take the form of programs, software, software modules, scripts, or code written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • Computer program instructions may, but need not, correspond to files in a file system; they may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (for example, files that store one or more modules, subroutines, or sections of code).
  • Computer program instructions can be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
  • The embodiments of the present disclosure disclose a behavior recognition method, device, electronic equipment, computer-readable storage medium, computer program, and computer program product.
  • The method includes: detecting each image to be detected to obtain the features of multiple objects, and encoding the features to obtain multi-dimensional features corresponding to the multiple objects respectively; determining, based on some of the features of each member object of each group of objects, the spatial results of at least two categories of objects in each group and the action result of each member object; determining, based on the multi-dimensional features, the relationship interaction features of each group of objects, and, where the relationship interaction features show that the objects in a group are associated with each other, determining the target result of that group based on the spatial results and the action results, obtaining at least one target result; and determining, based on the at least one target result, the object behavior in each image to be detected.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed in the embodiments of the present disclosure are a recognition method and apparatus, and an electronic device, a computer-readable storage medium, a computer program and a computer program product. The method comprises: performing detection on each image to be subjected to detection, so as to obtain features of a plurality of objects, and encoding the features to obtain multi-dimensional features corresponding to the plurality of objects on a one-to-one basis; on the basis of some of the features of each object in each group of objects, determining spatial results of at least two categories of objects in each group of objects, and an action result of each object; on the basis of the multi-dimensional features, determining relationship interaction features of each group of objects, and where it is determined, according to the relationship interaction features, that the objects in each group of objects are associated with each other, determining target results of each group of objects on the basis of the spatial results and the action results, so as to obtain at least one target result; and on the basis of the at least one target result, determining object behavior in each image to be subjected to detection.

Description

Behavior recognition method, device, electronic device, computer-readable storage medium, computer program and computer program product

Cross-Reference to Related Applications

The present disclosure is based on, and claims priority to, the Chinese patent application with application number 202110750749.8, filed on July 2, 2021 and entitled "Behavior recognition method, device, electronic equipment and computer-readable storage medium"; the entire content of that Chinese patent application is hereby incorporated into the present disclosure by reference.
Technical Field

The present disclosure relates to the technical field of computer vision, and in particular to a behavior recognition method, device, electronic equipment, computer-readable storage medium, computer program and computer program product.

Background

Human-object interaction behavior detection is an important task for understanding how people and objects interact. Human-object interaction (HOI) behavior detection aims to localize and classify triplets of human, object, and human-object relationship from an input image. Detecting human-object interactions enables well-designed algorithms to generate better descriptions of scenes.

However, when related-art techniques are used to detect human-object interaction behaviors, the detection efficiency and accuracy are low, resulting in poor detection performance and low detection efficiency for such behaviors.
Summary

Embodiments of the present disclosure provide a behavior recognition method, device, electronic equipment, computer-readable storage medium, computer program and computer program product, which can improve the recognition accuracy and recognition efficiency of human-object interaction behaviors.

The technical solutions of the embodiments of the present disclosure are implemented as follows:

An embodiment of the present disclosure provides a behavior recognition method, including: detecting each image to be detected to obtain the features of multiple objects, and encoding the features to obtain multi-dimensional features corresponding one-to-one to the multiple objects; determining, based on some of the features of each group member object of each group of objects, the spatial results of at least two categories of objects in each group of objects and the action result of each group member object, where each group of objects at least includes, among the multiple objects, an object whose category is object and an object whose category is person; determining, based on the multi-dimensional features, the relationship interaction features of each group of objects, and, where it is determined according to the relationship interaction features that the group member objects in each group of objects are associated with each other, determining the target result of each group of objects based on the spatial results and the action results, obtaining at least one target result; and determining, based on the at least one target result, the object behavior in each image to be detected.
An embodiment of the present disclosure provides a behavior recognition device, including: an encoding part, configured to detect each image to be detected to obtain the features of multiple objects, and to encode the features to obtain multi-dimensional features corresponding one-to-one to the multiple objects; a result determination part, configured to determine, based on some of the features of each group member object of each group of objects, the spatial results of at least two categories of objects in each group of objects and the action result of each group member object, where each group of objects at least includes, among the multiple objects, an object whose category is object and an object whose category is person, to determine, based on the multi-dimensional features, the relationship interaction features of each group of objects, and, where it is determined according to the relationship interaction features that the group member objects in each group of objects are associated with each other, to determine the target result of each group of objects based on the spatial results and the action results, obtaining at least one target result; and a behavior determination part, configured to determine, based on the at least one target result, the object behavior in each image to be detected.

In the above device, the result determination part is further configured to generate a fully connected graph corresponding to the multiple objects based on the multi-dimensional features corresponding one-to-one to the multiple objects; to perform graph convolution processing on the multi-dimensional features corresponding one-to-one to each object and the fully connected graph, obtaining the updated multi-dimensional feature corresponding one-to-one to each object; and to obtain the relationship interaction feature of each group of objects according to the updated multi-dimensional features of the member objects in each group.

In the above device, the result determination part is further configured to classify each group of objects according to the relationship interaction features to obtain the interaction result of each group of objects, and, where the interaction result is greater than or equal to a first preset score threshold, to determine that the group member objects in each group of objects are associated with each other.

In the above device, the result determination part is further configured to update the multi-dimensional feature of each group member object based on the relationship interaction features of each group of objects and preset parameters, obtaining the refined feature of each group member object, and to determine the graph interaction feature of each group of objects based on the refined features; to classify each group of objects based on the graph interaction features, obtaining graph relationship results; and to determine the target result of each group of objects based on the spatial results, the action results, the interaction results, the graph relationship results, and the confidence results obtained when the detection is performed on each group member object.

In the above device, the target result is a target value; the behavior determination part is further configured to select, according to at least one target value and from multiple associated object groups corresponding one-to-one to the at least one target value, the associated object group corresponding to the highest target value, and to identify the behavior among the group member objects in that associated object group.
In the above device, the fully connected graph is represented by an adjacency matrix, and each datum in the adjacency matrix represents the degree of association between the two corresponding objects; the result determination part is further configured to iterate the multi-dimensional feature of each object through a graph neural network, based on the adjacency matrix and the multi-dimensional features corresponding one-to-one to each object, to obtain the updated multi-dimensional feature corresponding one-to-one to each object.

In the above device, the two objects include a first object and a second object; the result determination part is further configured to determine the similarity between the multi-dimensional feature of the first object and the multi-dimensional feature of the second object; to determine the distance between the first object and the second object based on the position feature of the first object in each image to be detected and the position feature of the second object in each image to be detected; and to determine the degree of association between the first object and the second object based on the similarity and the distance.

In the above device, the result determination part is further configured to iteratively update the multi-dimensional feature of each object based on an update parameter, the adjacency matrix, a first weight parameter corresponding to the number of iterations, and the multi-dimensional features corresponding one-to-one to each object, and, where the number of iterations reaches a first preset number, to take the features generated after the first preset number of iterations as the updated multi-dimensional feature of each object.

In the above device, the preset parameters include a second weight parameter and the number of iterations; the result determination part is further configured to iteratively update the multi-dimensional feature of each group member object based on the second weight parameter and the relationship interaction feature of each group of objects, and, where the number of iterations reaches a second preset number, to take the features generated after the second preset number of iterations as the refined feature of each group member object.
In the above device, the detection includes image detection and word vector detection; the encoding part is further configured to encode the position features corresponding one-to-one to the multiple objects, obtaining the first feature of each object; to encode the visual features corresponding one-to-one to the multiple objects, obtaining the second feature of each object, the position features and the visual features being obtained by performing image detection on each image to be detected; to encode the word vector features corresponding one-to-one to the multiple objects, obtaining the third feature of each object, the word vector features being obtained by performing word vector detection on the category information of each object, and the category information being obtained by performing image detection on each image to be detected; and to obtain, according to the first, second, and third features, the multi-dimensional features corresponding one-to-one to the multiple objects, where the first, second, and third features have the same dimension.

In the above device, the encoding part is further configured to perform dimension transformation processing on the visual features corresponding one-to-one to the multiple objects, obtaining the dimension-transformed visual feature of each object, and to encode the dimension-transformed visual features, obtaining the second feature of each object.

In the above device, the partial features include the position feature and the visual feature of each group member object, the position features and the visual features being obtained by performing image detection on each image to be detected; the result determination part is further configured to determine the image area of each group member object in each image to be detected based on the position feature of each group member object of each group of objects; to obtain the image area corresponding to each group of objects according to the image area of each group member object, and to encode the image area corresponding to each group of objects, obtaining two-dimensional feature data; to perform feature processing on the two-dimensional feature data and on the visual feature of each group member object, correspondingly obtaining processed two-dimensional feature data and processed visual features; and to classify each group of objects according to the processed two-dimensional feature data, obtaining the spatial result of each group of objects, and to classify each group member object according to the processed visual features, obtaining the action result of each group member object.

In the above device, the device further includes a detection part, configured to perform image detection on each image to be detected, obtaining the position feature, visual feature, and confidence result of each detected target, as well as the category information corresponding to the confidence result; to take the targets whose confidence results are greater than or equal to a second preset score threshold as the detected objects, obtaining the position features, visual features, and category information corresponding one-to-one to the multiple objects; and to perform word vector detection on the category information of each object, obtaining the word vector feature of each object.
An embodiment of the present disclosure provides an electronic device, including: a memory, configured to store an executable computer program; and a processor, configured to implement the above behavior recognition method when executing the executable computer program stored in the memory.

An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, for causing a processor to implement the above behavior recognition method when executed.

An embodiment of the present disclosure provides a computer program, including computer readable codes; when the computer readable codes run in an electronic device, a processor in the electronic device executes the steps for implementing the above behavior recognition method.

An embodiment of the present disclosure provides a computer program product, including computer program instructions that cause a computer to execute the steps of the above behavior recognition method.

The behavior recognition method, device, electronic equipment, computer-readable storage medium, computer program, and computer program product provided by the embodiments of the present disclosure detect each image to be detected to obtain the features of multiple objects, and encode the obtained features to obtain the multi-dimensional feature corresponding to each object; based on some of the features of each group member object of each group of objects, the spatial results of at least two categories of objects in each group and the action result of each group member object are determined, where each group of objects at least includes, among the multiple objects, an object whose category is object and an object whose category is person; then, based on the multi-dimensional features corresponding one-to-one to the multiple objects, the relationship interaction features of each group of objects are determined, and, where the relationship interaction features show that the group member objects in a group are associated with each other, the target result of each such group is determined based on the spatial results and the action results, so as to obtain at least one target result; finally, based on the at least one obtained target result, the object behavior in the image to be detected is determined. Since the embodiments of the present disclosure first determine whether the group member objects in each group are associated with each other, and then use only the groups whose member objects are associated to determine the object behavior in the image to be detected, the groups whose member objects are not associated are filtered out; when the object behavior in the image to be detected is determined, the factors interfering with the determination result are reduced, and at the same time the amount of data to be calculated is reduced, thereby improving the recognition accuracy and efficiency when recognizing human-object interaction behaviors.

It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.
Brief Description of the Drawings

The accompanying drawings are incorporated into and constitute a part of this specification; they illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the technical solutions of the present disclosure.

FIG. 1A is a schematic diagram of an exemplary image to be detected provided by an embodiment of the present disclosure;

FIG. 1B is a schematic diagram of another exemplary image to be detected provided by an embodiment of the present disclosure;

FIG. 2 is an optional schematic flowchart of the behavior recognition method provided by an embodiment of the present disclosure;

FIG. 3 is an optional schematic flowchart of the behavior recognition method provided by an embodiment of the present disclosure;

FIG. 4 is an optional schematic flowchart of the behavior recognition method provided by an embodiment of the present disclosure;

FIG. 5 is an optional schematic flowchart of the behavior recognition method provided by an embodiment of the present disclosure;

FIG. 6 is an optional schematic flowchart of the behavior recognition method provided by an embodiment of the present disclosure;

FIG. 7 is an optional schematic flowchart of the behavior recognition method provided by an embodiment of the present disclosure;

FIG. 8 is an optional schematic flowchart of the behavior recognition method provided by an embodiment of the present disclosure;

FIG. 9 is an optional schematic flowchart of the behavior recognition method provided by an embodiment of the present disclosure;

FIG. 10 is a schematic diagram of part of an exemplary flow of using the behavior recognition method to identify object behavior in an image to be detected, provided by an embodiment of the present disclosure;

FIG. 11 is a schematic structural diagram of the recognition device provided by an embodiment of the present disclosure;

FIG. 12 is a schematic structural diagram of the electronic device provided by an embodiment of the present disclosure.
Detailed Description

In order to make the objectives, technical solutions, and advantages of the present disclosure clearer, the present disclosure is described in further detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limiting the present disclosure; all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.

Human-object interaction behavior detection is an important task for understanding how people and objects interact. It aims to localize and classify triplets of human, object, and human-object relationship from an input image, and detecting human-object interactions enables well-designed algorithms to generate better descriptions of scenes. For example, FIG. 1A is a schematic diagram of an exemplary image to be detected provided by an embodiment of the present disclosure. As shown in FIG. 1A, two objects, a person and an elephant, are detected from the image, and each object is marked with an annotation box; by detecting the interaction between the person and the object, a better description generated for the behavior in this image should be "a man riding an elephant" rather than "a man and an elephant". At present, related technologies treat this task as a one-stage classification problem: for a picture, all people and objects in the picture are first detected, then every person-object combination is classified so as to predict the interaction behavior and score of each pair, and finally the interaction behaviors contained in the picture are judged through a score threshold. However, directly predicting all combinations in this way cannot remove negative sample pairs and easily causes misjudgment. For example, FIG. 1B is a schematic diagram of another exemplary image to be detected provided by an embodiment of the present disclosure. As shown in FIG. 1B, the person, the table, and the teacup are all detected, and each object is marked with an annotation box; the person and the teacup form a negative sample pair, that is, although the person and the teacup are not in contact, once they are combined into a pair there is still a high probability of predicting the pair as a tea-drinking behavior, which affects the accuracy of the final prediction result.

On this basis, an embodiment of the present disclosure provides a behavior recognition method that can reduce negative sample pairs, thereby improving the recognition accuracy and efficiency for human-object interaction behaviors. The behavior recognition method provided by the embodiments of the present disclosure is applied to an electronic device. Exemplary applications of the electronic device provided by the embodiments of the present disclosure are described below: the electronic device may be implemented as various types of user terminals (hereinafter referred to as terminals), such as AR (Augmented Reality) glasses, notebook computers, tablet computers, desktop computers, set-top boxes, and mobile devices (for example, mobile phones, portable music players, personal digital assistants, dedicated messaging devices, portable game devices), and may also be implemented as a server.

An exemplary application in which the electronic device is implemented as a terminal is described below. FIG. 2 is an optional schematic flowchart of the behavior recognition method provided by an embodiment of the present disclosure, and the description proceeds with the steps shown in FIG. 2.
S101: Detect each image to be detected to obtain the features of multiple objects, and encode the features to obtain multi-dimensional features corresponding one-to-one to the multiple objects.

In the disclosed embodiments, the terminal may first detect each image to be detected to obtain the features of each object, and then encode the features of each object, thereby obtaining the multi-dimensional feature of each of the multiple objects present in the image to be detected. It should be noted that the multiple objects may be all objects in the image to be detected, or some of the objects in the image to be detected.

In some embodiments of the present disclosure, the terminal may itself perform image detection and word vector detection on the image to be detected to obtain the feature of each of the multiple objects. The feature of each object may be composed of the object's position feature, visual feature, and word vector feature, where the position feature may be the coordinates of the object's annotation box in the image to be detected, the visual feature may be the region-of-interest (RoI) pooled feature map corresponding to the coordinates of the annotation box, and the word vector feature may be the word vector corresponding to the object's category information.

Exemplarily, for an image to be detected, the terminal may first use a Faster R-CNN model to perform image detection on the image, obtaining the position feature and visual feature of each object, the category information of each object (for example, person, tree, etc.), and the confidence (confidence result) corresponding to the category information; it may then use a word vector and text classification model (for example, a fastText model) to perform word vector detection on the category information, obtaining the word vector feature corresponding to the category information of each object.
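A rough sketch of this detection stage with off-the-shelf tools is given below; the patent names Faster R-CNN and fastText but not these exact packages or calls, so treat the snippet as illustrative rather than as the disclosed implementation.

```python
# Hypothetical detection stage: a torchvision Faster R-CNN supplies position
# features (boxes), confidence results (scores), and category information
# (labels); word vectors for category names would come from a fastText model.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = torch.rand(3, 480, 640)          # stand-in for the image I
with torch.no_grad():
    det = detector([image])[0]
boxes = det["boxes"]     # position features (annotation-box coordinates)
scores = det["scores"]   # confidence results
labels = det["labels"]   # category information

# Word-vector detection on each category name (fastText-style lookup):
# import fasttext
# wv_model = fasttext.load_model("cc.en.300.bin")   # assumed model file
# word_vec = wv_model.get_word_vector("motorcycle")
```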
In the embodiments of the present disclosure, the image to be detected may be an image of any scene; for example, it may be a captured image of customers shopping in a store, or a captured image of a scenic spot, which is not limited by the embodiments of the present disclosure.
S102: Based on some of the features of each group member object of each group of objects, determine the spatial results of at least two categories of objects in each group of objects and the action result of each group member object, where each group of objects at least includes, among the multiple objects, an object whose category is object and an object whose category is person.

In the embodiments of the present disclosure, after obtaining the multiple objects in the image to be detected, the terminal may group the multiple objects to obtain multiple groups of objects, where each group of objects at least includes an object whose category is object and an object whose category is person, and any two groups of objects differ in at least one group member object. After obtaining the multiple groups of objects, for each group, the terminal may determine the spatial result between the group member objects of the group, as well as the action result of each group member object, according to some of the features of each group member object in the group.

In some embodiments, each group of objects may include two categories of objects, people and objects; alternatively, each group of objects may include three categories of objects: people, objects, and animals.

Exemplarily, where the multiple objects are three objects, each group includes the two categories of person and object, and the three objects are a person, object 1, and object 2, the terminal may divide the three objects into two groups: person-object 1 and person-object 2; obviously, one group member object differs between the two groups (object 1 differs from object 2). After obtaining the two groups, for person-object 1, the terminal determines the spatial result between the person and object 1 according to some of the features of the person and object 1 in the group, and separately determines the action result of the person and the action result of object 1; for person-object 2, the terminal determines the spatial result between the person and object 2 according to some of the features of the person and object 2 in the group, and separately determines the action result of the person and the action result of object 2.
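For the two-member case, the grouping step reduces to pairing every detected person with every detected object, as the small sketch below illustrates; the labels are placeholders for the detected objects.

```python
# Sketch of the grouping step for the two-member case: one group per
# person-object combination.
from itertools import product

people = ["person"]
objects = ["object 1", "object 2"]
groups = list(product(people, objects))
# -> [("person", "object 1"), ("person", "object 2")]
```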
It should be noted that the spatial result and the action result may be classification score values, and the terminal may obtain the spatial result and the action result through fully connected layers.
S103: Based on the multi-dimensional features, determine the relationship interaction features of each group of objects, and, where it is determined according to the relationship interaction features that the group member objects in each group of objects are associated with each other, determine the target result of each group of objects based on the spatial results and the action results, obtaining at least one target result.

In the embodiments of the present disclosure, after obtaining the multi-dimensional feature of each of the multiple objects, the terminal may determine the relationship interaction feature corresponding to each group of objects according to the multi-dimensional features corresponding one-to-one to the multiple objects. For each group of objects, the terminal may determine, according to the group's relationship interaction feature, whether the group member objects of the group are associated; and, where the group member objects of the group are determined to be associated, determine the target result corresponding to the group based on the spatial result between the group member objects and the action result of each group member object. In this way, where one or more of the multiple groups have group member objects that are associated with each other (hereinafter, a group whose member objects are associated with each other is called an associated object group), at least one target result can be correspondingly obtained. For example, where there are three groups of objects and two of them are associated object groups, two target results corresponding one-to-one to the two associated object groups can be obtained.

It can be understood that, where the group member objects of a group are determined not to be associated, the group is not an associated object group and has no target result; that is, by determining target results, the embodiments of the present disclosure filter out the groups whose member objects are not associated. This reduces interfering factors when subsequently determining the object behavior in the image to be detected and, at the same time, reduces the amount of data to be calculated, thereby improving the recognition accuracy and efficiency when recognizing human-object interaction behaviors based on the groups whose member objects are associated.
S104、基于至少一个目标结果,确定每张待检测图像中的对象行为。S104. Based on at least one target result, determine the object behavior in each image to be detected.
In the embodiment of the present disclosure, when the terminal obtains at least one target result, it may determine the object behavior in the image to be detected according to the at least one target result and the at least one associated object group corresponding to it. Exemplarily, the object behavior in the image to be detected may be a behavior between a person and an object; for example, for the image to be detected in FIG. 1A, the obtained object behavior may be "a man riding an elephant", and for the image to be detected in FIG. 1B, the obtained object behavior may be "several people sitting at a dining table".
In some embodiments, the target result is a target value; according to the at least one target value, the terminal may select, from the multiple associated object groups in one-to-one correspondence with the at least one target value, the associated object group corresponding to the highest target value, and recognize the behavior among the member objects of the selected associated object group.
Here, when the terminal obtains at least one target value, it may sort the target values, select the highest one according to the sorting result, and take the associated object group corresponding to the highest target value as the recognition target, so as to recognize the behavior among the member objects of that associated object group. It should be noted that the embodiments of the present disclosure may adopt a recognition model from the related art to recognize the behavior among the member objects of the associated object group, and the recognition model is not limited here.
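Exemplarily, this selection step may be sketched in Python as follows; this is a minimal sketch, not the disclosed implementation, and the group labels and scores are illustrative.

# Hypothetical sketch: pick the associated object group with the highest target value.
# `groups` is a list of (group, target_value) pairs produced by the preceding steps.
def select_top_group(groups):
    # Sort by target value in descending order is equivalent to taking the max.
    return max(groups, key=lambda pair: pair[1])[0] if groups else None

groups = [("person-object1", 0.42), ("person-object2", 0.87)]
print(select_top_group(groups))  # -> "person-object2"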
在本公开的一些实施例中,上述S103中的基于多维特征,确定每一组对象的关系交互特征,可以通过S1031-S1033实现,将结合图3示出的步骤进行说明。In some embodiments of the present disclosure, determining the relational interaction features of each group of objects based on the multi-dimensional features in S103 above may be implemented through S1031-S1033, which will be described in conjunction with the steps shown in FIG. 3 .
S1031、基于与多个对象分别一一对应的多维特征,生成与多个对象所对应的全连接图。S1031. Generate fully connected graphs corresponding to the multiple objects based on the multi-dimensional features corresponding to the multiple objects respectively.
In the embodiment of the present disclosure, for the multiple objects in the image to be detected, the terminal may generate the fully connected graph corresponding to those objects according to the multi-dimensional feature corresponding to each of them. The fully connected graph may be represented by an adjacency matrix, in which each entry represents the degree of association between the corresponding two objects; the adjacency matrix can thus represent the degree of association between any two of the multiple objects.
示例性地,该邻接矩阵可以采用下述公式(1)表示:Exemplarily, the adjacency matrix can be represented by the following formula (1):
$A_f \in \mathbb{R}^{N \times N} = \{(f_i) \mid i = 1, \ldots, N\}$    (1)
where $A_f$ denotes the adjacency matrix, $i$ denotes the $i$-th object (which may also be called a node), $f_i$ denotes the multi-dimensional feature of the $i$-th object, and $N$ denotes the total number of the multiple objects.
S1032、通过对每个对象一一对应的多维特征,以及全连接图,进行图卷积处理,得到与每个对象一一对应的更新后的多维特征。S1032. Perform graph convolution processing on the multi-dimensional features corresponding to each object and the fully connected graph to obtain updated multi-dimensional features corresponding to each object.
In the embodiment of the present disclosure, having obtained the fully connected graph corresponding to the multiple objects in the image to be detected, the terminal may perform a graph convolution operation on the multi-dimensional feature of each object and the fully connected graph, and obtain the updated multi-dimensional feature of each object through that operation.
Exemplarily, the terminal may input the multi-dimensional feature of each object, together with the adjacency matrix representing the fully connected graph, into a graph convolutional network (GCN), perform the graph convolution operation through the GCN, and output the updated multi-dimensional feature of each object.
In some embodiments, the above S1032 may be implemented as follows: based on the adjacency matrix and the multi-dimensional feature in one-to-one correspondence with each object, iterate the multi-dimensional feature of each object through a graph neural network to obtain the updated multi-dimensional feature in one-to-one correspondence with each object; here, the fully connected graph is represented by the adjacency matrix, and each entry of the adjacency matrix represents the degree of association between the corresponding two objects.
在一些实施例中,上述的两个对象包括:第一对象和第二对象;可以通过S201-S203来确定两个对象之间的关联度,将结合图4示出的步骤进行说明。In some embodiments, the above-mentioned two objects include: a first object and a second object; the degree of association between the two objects can be determined through S201-S203, which will be described in conjunction with the steps shown in FIG. 4 .
S201、确定第一对象的多维特征和第二对象的多维特征之间的相似度。S201. Determine the similarity between the multidimensional features of the first object and the multidimensional features of the second object.
在本公开实施例中,终端可以根据第一对象的多维特征和第二对象的多维特征,确定出第一对象与第二对象之间的相似度,例如,点积相似度或余弦相似度等。In this embodiment of the present disclosure, the terminal may determine the similarity between the first object and the second object according to the multi-dimensional features of the first object and the multi-dimensional features of the second object, for example, dot product similarity or cosine similarity, etc. .
示例性地,在相似度为点积相似度的情况下,第一对象和第二对象之间的相似度可以采用下述公式(2)表示:Exemplarily, in the case where the similarity is a dot product similarity, the similarity between the first object and the second object can be represented by the following formula (2):
$F_{se}(f_i, f_j) = (f_i)^{T} f_j$    (2)
where $F_{se}(f_i, f_j)$ denotes the dot-product similarity between the $i$-th object (the first object) and the $j$-th object (the second object), $i$ and $j$ are arbitrary integers from 1 to $N$ with $i \neq j$, $f_i$ denotes the multi-dimensional feature of the $i$-th object, and $f_j$ denotes the multi-dimensional feature of the $j$-th object.
S202、基于第一对象在每张待检测图像中的位置特征,以及第二对象在每张待检测图像中的位置特征,确定第一对象与第二对象之间的距离。S202. Determine the distance between the first object and the second object based on the position features of the first object in each image to be detected and the position features of the second object in each image to be detected.
In the embodiment of the present disclosure, when the image to be detected in which the first object and the second object are located is detected, the position feature of the first object and the position feature of the second object in the image to be detected can be obtained, and the terminal may determine the distance between the first object and the second object according to these two position features.
Exemplarily, the position feature is the coordinates of a bounding box (for example, the coordinates of the center point of the bounding box, or the coordinates of its upper-left and lower-right corner points); the terminal may calculate the distance between the first object and the second object according to the bounding-box coordinates of the first object and those of the second object. For example, the distance between the first object and the second object can be expressed by the following formula (3):
[Formula (3) appears only as an image in the source; its exact form is not recoverable here.]
where $D(b_i, b_j)$ denotes the coordinate distance between the $i$-th object and the $j$-th object calculated from the bounding-box coordinates, and $F_{dist}(f_i, f_j)$ denotes the distance between the $i$-th object and the $j$-th object.
S203、基于相似度和距离,确定第一对象和第二对象之间的关联度。S203. Based on the similarity and the distance, determine the degree of association between the first object and the second object.
本公开实施例中,终端在确定出第一对象与第二对象之间的相似度和距离的情况下,可以根据相似度和距离再计算出第一对象与第二对象之间的关联度。In the embodiment of the present disclosure, after determining the similarity and distance between the first object and the second object, the terminal may calculate the degree of association between the first object and the second object according to the similarity and distance.
在一些实施例中,可以通过下述公式(4)计算第一对象与第二对象之间的关联度:In some embodiments, the degree of association between the first object and the second object can be calculated by the following formula (4):
$A_f^{(i,j)} = \dfrac{\exp\big(F_{se}(f_i, f_j)\, F_{dist}(f_i, f_j)\big)}{\sum_{j=1}^{N} \exp\big(F_{se}(f_i, f_j)\, F_{dist}(f_i, f_j)\big)}$    (4)
where $A_f^{(i,j)}$ denotes the degree of association between the $i$-th object and the $j$-th object and is a value between 0 and 1; $N$ denotes the total number of the multiple objects, $f_j$ denotes the multi-dimensional feature of the $j$-th object, $f_i$ denotes the multi-dimensional feature of the $i$-th object, and $\exp(\cdot)$ denotes the exponential function with base $e$.
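Exemplarily, the construction of the adjacency matrix may be sketched in Python as follows. This is a minimal sketch rather than the disclosed implementation: the softmax form follows the reconstruction of formula (4) above, and since formula (3) is given only as an image, the monotone-decreasing mapping 1/(1+D) used for the distance term is an assumption; names and dimensions are illustrative.

import numpy as np

def adjacency(features, boxes):
    # features: (N, d) multi-dimensional features; boxes: (N, 2) box center coordinates.
    sim = features @ features.T                                      # dot-product similarity F_se (formula 2)
    dist = np.linalg.norm(boxes[:, None, :] - boxes[None, :, :], axis=-1)  # coordinate distance D(b_i, b_j)
    f_dist = 1.0 / (1.0 + dist)           # assumed distance term: closer objects get a larger value
    logits = sim * f_dist
    logits = logits - logits.max(axis=1, keepdims=True)              # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)                          # rows sum to 1, entries in (0, 1)

A = adjacency(np.random.randn(4, 768), np.random.rand(4, 2))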
For the above S1032, the terminal may input the adjacency matrix and the multi-dimensional features of all objects into a multi-layer graph neural network, and iteratively update the multi-dimensional feature of each object through that network, thereby obtaining the updated multi-dimensional feature of each object.
In some embodiments, the terminal may iteratively update the multi-dimensional feature corresponding to each object based on an update parameter, the adjacency matrix, a first weight parameter corresponding to the number of iterations, and the multi-dimensional features of all objects, and, when the number of iterations reaches a first preset number, take the features generated after the first preset number of iterations as the updated multi-dimensional feature corresponding to each object.
Here, the update parameter may be an activation function, the first weight parameter corresponding to the number of iterations may be a learnable weight matrix corresponding to each layer of the graph neural network, and the number of iterations may be determined according to the number of layers of the graph neural network. For example, when the graph neural network has two layers, each layer corresponds to one learnable weight, and the number of iterations can be determined to be 2; that is, for the first layer, the input is the adjacency matrix and the multi-dimensional feature of each object, and the output is the multi-dimensional feature of each object after the first iteration; for the second layer, the input is the adjacency matrix and the multi-dimensional feature of each object after the first iteration, and the output is the multi-dimensional feature of each object after the second iteration, which is the updated multi-dimensional feature of each object obtained when the iterations end.
根据上述可知,采用图神经网络的每一层对每个对象的多维特征进行迭代的过程,可以采用下述公式(5)表示:According to the above, the process of iterating the multi-dimensional features of each object using each layer of the graph neural network can be expressed by the following formula (5):
$g^{(l+1)} = \sigma\big(A \times g^{(l)} \times W^{(l)}\big)$    (5)
where $A$ denotes the adjacency matrix; $g^{(l)} \in \mathbb{R}^{N \times d}$ denotes the iterated multi-dimensional features of each object output by the $l$-th layer, $g^{(l+1)}$ denotes those output by the $(l+1)$-th layer, and $g^{(0)} \in f$ denotes the features of each object at layer 0, i.e. the multi-dimensional feature of each object; $W^{(l)} \in \mathbb{R}^{d \times d}$ denotes the learnable weight matrix of the $l$-th layer, where $d$ is the size of the input and output features; $\sigma(\cdot)$ denotes the activation function, for example a Rectified Linear Unit (ReLU). As formula (5) shows, the input of the $(l+1)$-th layer is the output of the $l$-th layer.
In some embodiments, $l$ is 1; that is, a two-layer graph neural network may be used to iteratively update the multi-dimensional feature of each object. This improves the efficiency of updating the multi-dimensional features of each object, and is thus beneficial to improving the efficiency of recognizing human-object interaction behaviors.
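Exemplarily, the two-layer update of formula (5) may be sketched as follows (Python/PyTorch); the random initialization, feature size and softmax-normalized adjacency are illustrative assumptions, not part of the disclosed method.

import torch

N, d = 4, 768
A = torch.softmax(torch.randn(N, N), dim=1)        # adjacency matrix A (as in formula 4)
g = torch.randn(N, d)                              # g^(0): the multi-dimensional features
W = [torch.randn(d, d) * 0.01 for _ in range(2)]   # W^(l): one learnable weight matrix per layer

for l in range(2):                                 # two layers, i.e. two iterations
    g = torch.relu(A @ g @ W[l])                   # g^(l+1) = sigma(A x g^(l) x W^(l))
updated_features = g                               # updated multi-dimensional features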
S1033、根据每一组对象中每个组员对象的更新后的多维特征,得到每一组对象的关系交互特征。S1033. According to the updated multi-dimensional features of each member object in each group of objects, obtain the relationship interaction features of each group of objects.
In the embodiment of the present disclosure, for each group of objects, when the terminal obtains the updated multi-dimensional feature corresponding to each member of the group, it may determine the relationship interaction feature of the group according to the updated multi-dimensional features of all member objects in the group.
在一些实施例中,对于每一组对象,终端可以将组员对象的更新后的多维特征,在通道维度上进行叠加,并将叠加后的特征作为该组对象的关系交互特征。In some embodiments, for each group of objects, the terminal may superimpose the updated multi-dimensional features of the group member objects on the channel dimension, and use the superimposed features as the relationship interaction features of the group of objects.
In some embodiments of the present disclosure, determining, according to the relationship interaction feature, that the member objects in each group of objects are associated with each other in the above S103 may be implemented through S1034-S1035, which will be described with the steps shown in FIG. 5.
S1034、根据关系交互特征,对每一组对象进行分类,得到每一组对象的交互结果。S1034. Classify each group of objects according to the relationship interaction feature, and obtain an interaction result of each group of objects.
In the embodiment of the present disclosure, for each group of objects, when the terminal obtains the relationship interaction feature of the group, it may input that feature into a fully connected layer, perform interactiveness classification on the group through the fully connected layer, and take the obtained interaction classification score of the group as the interaction result of the group.
示例性地,每一组对象的交互结果可以采用下述公式(6)表示:Exemplarily, the interaction result of each group of objects can be represented by the following formula (6):
$s^{in} = \sigma\big(W_{in}\, f^{rel}\big)$    (6)
where $s^{in}$ denotes the interaction result of each group of objects, $W_{in}$ denotes the learning weight of the fully connected layer, $\sigma(\cdot)$ denotes the activation function, and $f^{rel}$ denotes the relationship interaction feature of each group of objects.
S1035、在交互结果大于或等于第一预设分数阈值的情况下,确定每一组对象中的组员对象之间相互关联。S1035. If the interaction result is greater than or equal to the first preset score threshold, determine that the group member objects in each group of objects are related to each other.
In the embodiment of the present disclosure, for each group of objects, when the terminal obtains the interaction result of the group, it may compare the interaction result with a first preset score threshold, and determine that the member objects in the group are associated with each other when the interaction result is greater than or equal to the first preset score threshold.
需要说明的是,第一预设分数阈值可以根据实际需要设置,本公开实施例对第一预设分数阈值的取值不作限定。It should be noted that the first preset score threshold may be set according to actual needs, and the embodiment of the present disclosure does not limit the value of the first preset score threshold.
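Exemplarily, the interactiveness classification and thresholding of S1034-S1035 may be sketched as follows; this is a minimal sketch assuming a sigmoid activation, a relation feature formed by stacking two members' features, and an illustrative threshold value, none of which are prescribed by the text.

import torch

d_rel = 2 * 768                          # assumed: two members' 768-d features stacked on the channel dim
fc_in = torch.nn.Linear(d_rel, 1)        # carries the learning weight W_in of formula (6)
f_rel = torch.randn(1, d_rel)            # relationship interaction feature of one group
s_in = torch.sigmoid(fc_in(f_rel))       # interaction result s_in (formula 6)
mu_s = 0.5                               # first preset score threshold (value not fixed by the text)
is_associated = bool((s_in >= mu_s).item())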
In some embodiments of the present disclosure, determining the target result of each group of objects based on the spatial result and the action result in the above S103 may be implemented through S1036-S1038, which will be described with the steps shown in FIG. 6.
S1036、基于每一组对象的关系交互特征,以及预设参数,对每个组员对象的多维特征进行更新,得到每个组员对象的细化特征,并基于细化特征,确定每一组对象的图交互特征。S1036. Based on the relationship interaction features of each group of objects and preset parameters, update the multi-dimensional features of each group member object, obtain the refined features of each group member object, and determine each group based on the refined features. Graph interaction features for objects.
In the embodiments of the present disclosure, for each group of objects, the terminal may further update the multi-dimensional feature of each member object in the group according to the relationship interaction feature of the group and preset parameters, thereby obtaining the refined feature of each member object, and determine the graph interaction feature of the group according to the refined features of the member objects. In some embodiments, the terminal may stack the refined features of all member objects in the group along the channel dimension to obtain the graph interaction feature of the group.
In some embodiments, the preset parameters include a second weight parameter and a number of iterations; updating the multi-dimensional feature of each member object based on the relationship interaction feature of each group of objects and the preset parameters in the above S1036 to obtain the refined feature of each member object may be implemented as follows: iteratively update the multi-dimensional feature of each member object based on the second weight parameter and the relationship interaction feature of the group, and, when the number of iterations reaches a second preset number, take the features generated after the second preset number of iterations as the refined feature of each member object.
Here, for each group of objects, the terminal may iteratively update the multi-dimensional feature of each member object according to the second weight parameter and the relationship interaction feature of the group. For example, in the first iteration, the multi-dimensional feature of each member object is taken as the input, and the feature of each member object after the first iteration is obtained; in the second iteration, for each member object, the feature after the first iteration is taken as the input of the second iteration; the loop continues in this way until the number of iterations reaches the second preset number, at which point the iterated features corresponding to the second preset number are taken as the refined feature of each member object.
示例性地,生成每个组员对象的细化特征的过程,可以采用下述公式(7)表示:Exemplarily, the process of generating the refined features of each group member object can be expressed by the following formula (7):
$f_i^{(t)} = f_i^{(t-1)} + \alpha \sum_{j=1}^{N} \mathbb{1}\big(s^{in} \geq \mu_s\big)\, f_j^{(t-1)}$    (7)
where the indicator is evaluated on the interaction result obtained from the relationship interaction feature of each group of objects (formula (6)); $\mathbb{1}(\cdot)$ denotes the indicator function, $s^{in}$ denotes the interaction result of each group of objects, and $\mu_s$ denotes the first preset score threshold; $\alpha$ denotes the second weight parameter (a weighting parameter); $N$ denotes the total number of the multiple objects; $f_i^{(t)}$ denotes the refined feature of the $i$-th object; $f_i^{(t-1)}$ and $f_j^{(t-1)}$ denote the features of the $i$-th and $j$-th objects input when obtaining the refined feature of the $i$-th object; $t$ denotes the number of iterations; when $t = 1$, $f_i^{(t-1)}$ denotes the multi-dimensional feature of the $i$-th object and $f_j^{(t-1)}$ denotes the multi-dimensional feature of the $j$-th object.
需要说明的是,第二预设次数可以根据实际需要进行设定,本公开实施例对此不作限定。It should be noted that the second preset number of times may be set according to actual needs, which is not limited in this embodiment of the present disclosure.
示例性地,第二预设次数可以为2,如此,可以提高得到每个组员对象的细化特征的效率,从而有利于提高对人物交互行为的识别效率。Exemplarily, the second preset number of times may be 2. In this way, the efficiency of obtaining the refined features of each team member object can be improved, which is beneficial to improve the recognition efficiency of human interaction behavior.
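Exemplarily, the refinement loop may be sketched as follows. Since formula (7) is given as an image in the source, the aggregation rule below follows the reconstruction above and is therefore an assumption; names and values are illustrative.

import torch

def refine(features, s_in, mu_s=0.5, alpha=0.1, T=2):
    # features: (N, d) member features; s_in: (N, N) pairwise interaction results.
    f = features.clone()
    gate = (s_in >= mu_s).float()        # indicator 1(s_in >= mu_s)
    for _ in range(T):                   # T: the second preset number of iterations (e.g. 2)
        f = f + alpha * (gate @ f)       # f_i^(t) = f_i^(t-1) + alpha * sum_j 1(...) * f_j^(t-1)
    return f

refined = refine(torch.randn(3, 768), torch.rand(3, 3))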
S1037、基于图交互特征,对每一组对象进行分类,得到图关系结果。S1037. Based on the graph interaction feature, classify each group of objects to obtain a graph relationship result.
在本公开实施例中,对于每一组对象,终端在得到该组对象的图交互特征的情况下,可以根据图交互特征,对该组对象的图关系进行分类,得到图关系结果。In the embodiment of the present disclosure, for each group of objects, when the terminal obtains the graph interaction features of the group of objects, it may classify the graph relationship of the group of objects according to the graph interaction features to obtain a graph relationship result.
In some embodiments, the terminal may input the graph interaction feature of the group into a fully connected layer, classify the graph relationship of the group through that layer to obtain a graph-relationship classification score, and take the obtained score as the graph relationship result of the group.
Exemplarily, the process by which the terminal obtains the graph relationship result of each group of objects according to its graph interaction feature can be expressed by the following formula (8):
$s^{g} = \sigma\big(W_{a}\, f^{g}\big)$    (8)
where $s^{g}$ denotes the graph relationship result of each group of objects, $f^{g}$ denotes the graph interaction feature of each group of objects, $W_a$ denotes the learning weight of the fully connected layer, and $\sigma(\cdot)$ denotes the activation function.
S1038基于空间结果、动作结果、交互结果、图关系结果,以及对每个组员对象进行检测时所得到的置信结果,确定每一组对象的目标结果。S1038 Determine target results for each group of objects based on the spatial results, action results, interaction results, graph relationship results, and confidence results obtained when detecting each group member object.
In the embodiment of the present disclosure, for each group of objects, the terminal may obtain the target result of the group according to the obtained spatial result of the group, the action result of each member object, the interaction result and graph relationship result of the group, and the confidence result obtained when each member object was detected in the above steps.
In some embodiments, for each group of objects, the terminal may determine a first product value of the confidence results of all member objects; determine a second product value of the action results of all member objects; determine a third product value of the first product value, the second product value, the spatial result and the graph relationship result; determine an indicator value of the interaction result against the first preset score threshold; and take the product of the third product value and the indicator value as the target result of the group.
示例性地,根据空间结果、动作结果、交互结果、图关系结果,以及每个组员对象的置信结果,确定出每一组对象的目标结果的过程,可以采用下述公式(9)表示:Exemplarily, according to the spatial result, action result, interaction result, graph relationship result, and the confidence result of each group member object, the process of determining the target result of each group object can be expressed by the following formula (9):
$S_{h,o} = s_h \cdot s_o \cdot s^{a}_{h} \cdot s^{a}_{o} \cdot s^{g} \cdot s^{sp} \cdot \mathbb{1}\big(s^{in} \geq \mu_s\big)$    (9)
where $S_{h,o}$ denotes the target result; $s_h$ or $s_o$ denotes the confidence result of a member object, $s_h$ being the confidence result of the object whose category is person and $s_o$ that of the object whose category is object; $s^{a}_{h}$ or $s^{a}_{o}$ denotes the action result of a member object, $s^{a}_{h}$ being the action result of the object whose category is person and $s^{a}_{o}$ that of the object whose category is object; $s^{g} \cdot s^{sp}$ denotes the product of the graph relationship result and the spatial result; $s^{in}$ denotes the interaction result, $\mu_s$ denotes the first preset score threshold, and $\mathbb{1}(\cdot)$ denotes the indicator function.
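Exemplarily, the combination in formula (9) may be sketched in plain Python; the scores and the threshold value below are illustrative.

def target_score(s_h, s_o, a_h, a_o, s_sp, s_g, s_in, mu_s=0.5):
    # Product of the confidence results, the action results, the spatial result and the
    # graph relationship result, gated by the indicator on the interaction result (formula 9).
    indicator = 1.0 if s_in >= mu_s else 0.0
    return s_h * s_o * a_h * a_o * s_sp * s_g * indicator

print(target_score(0.9, 0.8, 0.7, 0.6, 0.9, 0.8, 0.75))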
在一些实施例中,上述S101中的对特征进行编码,得到与多个对象分别一一对应的多维特征,可以通过S1011-S1014实现,以下将结合图7中的步骤进行说明。In some embodiments, the encoding of features in S101 above to obtain multi-dimensional features corresponding to multiple objects respectively may be implemented through S1011-S1014, which will be described below in conjunction with the steps in FIG. 7 .
S1011、将与多个对象分别一一对应的位置特征进行编码,得到每个对象的第一特征;检测包括:图像检测和词向量检测。S1011. Encode the location features corresponding to the multiple objects respectively to obtain the first feature of each object; the detection includes: image detection and word vector detection.
S1012、将与多个对象分别一一对应的视觉特征进行编码,得到每个对象的第二特征;位置特征和视觉特征是对每张待检测图像进行图像检测得到的。S1012. Encode the visual features corresponding to the plurality of objects one by one to obtain the second feature of each object; the position feature and visual feature are obtained by performing image detection on each image to be detected.
S1013. Encode the word-vector features in one-to-one correspondence with the multiple objects to obtain the third feature of each object; the word-vector feature is obtained by performing word-vector detection on the category information of each object, and the category information is obtained by performing image detection on each image to be detected.
S1014、根据第一特征、第二特征和第三特征,得到与多个对象分别一一对应的多维特征;其中,第一特征、第二特征和第三特征的维度相同。S1014. According to the first feature, the second feature and the third feature, obtain multi-dimensional features corresponding to the multiple objects respectively; wherein, the dimensions of the first feature, the second feature and the third feature are the same.
在本公开实施例中,终端在得到对一张待检测图像进行图像检测和词向量检测后所得到的每个对象的位置特征、视觉特征和词向量特征之后,可以将这三个特征分别编码至同一特征空间中,从而对应得到维度相同的第一特征、第二特征和第三特征。In the embodiment of the present disclosure, after the terminal obtains the position feature, visual feature and word vector feature of each object obtained by performing image detection and word vector detection on an image to be detected, these three features can be encoded separately to the same feature space, so as to obtain the first feature, the second feature and the third feature with the same dimension.
Exemplarily, the position feature of an object may be the coordinates of its bounding box in the image to be detected, the visual feature may be the RoI-pooled feature map corresponding to those bounding-box coordinates, and the word-vector feature may be the word vector corresponding to the category information of the object.
在本公开的一些实施例中,在上述S101中的对每张待检测图像进行检测得到多个对象的特征,可以通过S401-S403实现,以下将结合图8中的步骤进行说明。In some embodiments of the present disclosure, the detection of each image to be detected in the above S101 to obtain the features of multiple objects may be implemented through S401-S403, which will be described below in conjunction with the steps in FIG. 8 .
S401、对每张待检测图像进行图像检测,得到检测出的每个目标的位置特征、视觉特征、置信结果,以及与置信结果对应的类别信息。S401. Perform image detection on each image to be detected, and obtain the position feature, visual feature, confidence result, and category information corresponding to the confidence result of each detected target.
S402、将置信结果大于或等于第二预设分数阈值的目标,作为检测出的对象,得到与多个对象分别一一对应的位置特征、视觉特征,以及类别信息。S402. Taking the target whose confidence result is greater than or equal to the second preset score threshold as the detected object, and obtaining positional features, visual features, and category information corresponding to the plurality of objects respectively.
S403、对每个对象的类别信息进行词向量检测,得到每个对象的词向量特征。S403. Perform word vector detection on category information of each object to obtain word vector features of each object.
In the embodiments of the present disclosure, for each image to be detected, the terminal may obtain, through image detection, the position feature, visual feature, confidence result, and the category information corresponding to the confidence result of each target in the image. The terminal may then compare the confidence result of each target with a second preset score threshold, remove the targets whose confidence result is less than the threshold according to the comparison, and keep the targets whose confidence result is greater than or equal to it, taking all retained targets as the above-mentioned multiple objects; the position feature, visual feature, confidence result and corresponding category information of each object are thereby obtained. Furthermore, having obtained the category information of each object, the terminal may perform word-vector detection on it to obtain the word-vector feature corresponding to each object.
Here, each target whose confidence result is greater than or equal to the second preset score threshold is taken as an object for the subsequent behavior recognition of the image to be detected; this reduces the interference factors in recognizing the interaction behavior between people and objects in the image, and is beneficial to improving the recognition accuracy of that recognition.
需要说明的是,第二预设分数阈值可以根据实际需要设置,本公开实施例对此不作限定。It should be noted that the second preset score threshold may be set according to actual needs, which is not limited in this embodiment of the present disclosure.
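Exemplarily, the confidence filtering of S402 may be sketched as follows; the field names and the threshold value are illustrative assumptions.

def filter_detections(detections, threshold):
    # Keep only the targets whose confidence result meets the second preset score threshold.
    return [d for d in detections if d["score"] >= threshold]

dets = [{"label": "person", "score": 0.92}, {"label": "kite", "score": 0.31}]
print(filter_detections(dets, 0.5))  # -> only the person remains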
In some embodiments, when encoding, the terminal may use multilayer perceptrons (MLP) to encode the position feature, visual feature and word-vector feature of each object respectively, so as to obtain the first feature, the second feature and the third feature of the same dimension for each object.
Exemplarily, the first feature, the second feature and the third feature may all be 256-dimensional. After obtaining the three 256-dimensional features of each object, the terminal may stack the first, second and third features along the channel dimension, thereby obtaining the corresponding 768-dimensional multi-dimensional feature of the object.
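Exemplarily, the encoding and channel-wise stacking may be sketched as follows (Python/PyTorch); the text fixes only the 256-dimensional and 768-dimensional sizes, so the input dimensions of the position, visual and word-vector features below are illustrative assumptions.

import torch

class SemanticEncoder(torch.nn.Module):
    # Hypothetical sketch: three MLPs embed the position, visual and word-vector
    # features into the same 256-d space; concatenation yields the 768-d feature.
    def __init__(self, d_pos=4, d_vis=49 * 256, d_word=300, d_out=256):
        super().__init__()
        self.mlp_pos = torch.nn.Linear(d_pos, d_out)
        self.mlp_vis = torch.nn.Linear(d_vis, d_out)
        self.mlp_word = torch.nn.Linear(d_word, d_out)

    def forward(self, pos, vis, word):
        vis = vis.reshape(vis.shape[0], -1)      # reshape the 2-D visual feature to 1-D (S301)
        parts = [self.mlp_pos(pos), self.mlp_vis(vis), self.mlp_word(word)]
        return torch.cat(parts, dim=-1)          # 3 x 256 = 768-d multi-dimensional feature

enc = SemanticEncoder()
f = enc(torch.randn(2, 4), torch.randn(2, 49, 256), torch.randn(2, 300))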
在一些实施例中,上述S1012可以通过S301-S302实现:In some embodiments, the above S1012 can be implemented through S301-S302:
S301、将与多个对象分别一一对应的视觉特征,进行维度变换处理,得到每个对象的维度变换后的视觉特征。S301. Perform dimension transformation processing on the visual features corresponding to the multiple objects respectively, to obtain the dimensionally transformed visual features of each object.
S302、对维度变换后的视觉特征进行编码,得到每个对象的第二特征。S302. Encode the dimensionally transformed visual features to obtain a second feature of each object.
In the embodiment of the present disclosure, since the visual feature of each object is two-dimensional, before encoding, the terminal may perform a dimension transformation (reshape) on the visual feature of each object to obtain a one-dimensional visual feature, and encode the reshaped one-dimensional visual feature to obtain the second feature of each object.
在本公开的一些实施例中,上述S102中的基于每一组对象的每个组员对象的特征中的部分特征,确定每一组对象的至少两类对象的空间结果,以及每个组员对象的动作结果,可以通过S1021-S1024实现,将结合图9中的步骤进行说明。In some embodiments of the present disclosure, based on some features of the features of each group member object in each group of objects in the above S102, determine the spatial results of at least two types of objects in each group of objects, and each group member The action result of the object can be realized through S1021-S1024, which will be described in conjunction with the steps in FIG. 9 .
S1021、基于每一组对象的每个组员对象的位置特征,确定每个组员对象在每张待检测图像中的图像区域;部分特征包括:每个组员对象的位置特征和视觉特征;位置特征和视觉特征是对每张待检测图像进行图像检测得到的。S1021. Based on the position feature of each team member object of each group of objects, determine the image area of each team member object in each image to be detected; some features include: the position feature and visual feature of each team member object; The location features and visual features are obtained by image detection for each image to be detected.
在本公开实施例中,对于每一组对象中的每个组员对象,终端可以根据该组员对象的位置特征,从对应的待检测图像中确定该组员对象对应的图像区域。In the embodiment of the present disclosure, for each group member object in each group of objects, the terminal may determine the corresponding image area of the group member object from the corresponding image to be detected according to the position characteristics of the group member object.
Exemplarily, when the position feature is the coordinates of a bounding box and one member object is a motorcycle, the terminal may crop out the image region of the motorcycle marked by the bounding box according to the coordinates of that box in the image to be detected, thereby obtaining the image region of the motorcycle.
S1022、根据每个组员对象的图像区域,得到每一组对象对应的图像区域,并对每一组对象对应的图像区域进行编码,得到二维特征数据。S1022. Obtain an image area corresponding to each group of objects according to the image area of each group member object, and encode the image area corresponding to each group of objects to obtain two-dimensional feature data.
In the embodiment of the present disclosure, for each group of objects, when the terminal obtains the image region of each member object, it may stitch the image regions of all member objects to obtain the image region of the group, and encode that image region. During encoding, in the person channel, the value of the person's image region is 1 and that of other regions is 0; in the object channel, the value of the object's image region is 1 and that of other regions is 0; the two-dimensional feature data of the group is thereby obtained.
For example, when a group of objects contains a person and a motorcycle, the terminal may stitch the image region of the person and the image region of the motorcycle obtained in S1021 to obtain the image region of the person-motorcycle group, such that in the person channel the value of the person's image region is 1 and that of other regions is 0, and in the motorcycle channel the value of the motorcycle's image region is 1 and that of other regions is 0, thereby obtaining the two-dimensional feature data of the person-motorcycle group.
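Exemplarily, the two-channel binary encoding may be sketched as follows; the 64x64 grid size is an illustrative assumption, and the boxes are taken as already scaled to that grid.

import numpy as np

def spatial_map(box_h, box_o, size=64):
    # Two-channel binary encoding: channel 0 marks the person region with 1,
    # channel 1 marks the object region with 1; all other positions are 0.
    m = np.zeros((2, size, size), dtype=np.float32)
    for ch, (x1, y1, x2, y2) in enumerate([box_h, box_o]):
        m[ch, y1:y2, x1:x2] = 1.0
    return m

m = spatial_map((5, 5, 30, 60), (20, 40, 50, 62))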
S1023、对二维特征数据,以及每个组员对象的视觉特征,分别进行特征处理,对应得到处理后的二维特征数据和处理后的视觉特征。S1023. Perform feature processing on the two-dimensional feature data and the visual features of each team member object, correspondingly obtain the processed two-dimensional feature data and the processed visual features.
In the embodiment of the present disclosure, for each group of objects, having obtained the two-dimensional feature data of the group and the visual feature of each member object, the terminal may perform feature processing on the two-dimensional feature data and on the visual features separately, thereby obtaining the processed two-dimensional feature data and the processed visual features.
Exemplarily, the terminal may first perform feature extraction on the two-dimensional feature data through a convolutional neural network (CNN block) to obtain a first sub-feature, and perform feature extraction on the visual feature of each member object through a residual network (Res block) to obtain a second sub-feature; it may then apply global average pooling (GAP) to the first sub-feature and the second sub-feature respectively, correspondingly obtaining the processed two-dimensional feature data and the processed visual features.
示例性地,处理后的二维特征数据,以及处理后的视觉特征,可以分别通过下述公式(10)、(11)和(12)所示:Exemplarily, the processed two-dimensional feature data and the processed visual features can be represented by the following formulas (10), (11) and (12):
$f_{h,o} = \mathrm{GAP}\big(\mathrm{CNN}(F_{h,o})\big)$    (10)
$f_h = \mathrm{GAP}\big(\mathrm{Res}(\mathrm{RoI}(F, b_h))\big)$    (11)
$f_o = \mathrm{GAP}\big(\mathrm{Res}(\mathrm{RoI}(F, b_o))\big)$    (12)
where $F$ denotes the RoI-pooled feature map of the image to be detected; $f_h$ or $f_o$ is the processed visual feature of a member object, $f_h$ being that of the object whose category is person and $f_o$ that of the object whose category is object; $f_{h,o}$ denotes the processed two-dimensional feature data, e.g. the processed two-dimensional feature data of a person-object group when the group contains one object of category person and one of category object; $F_{h,o}$ denotes the image region corresponding to each group of objects; $b_h$ or $b_o$ denotes the position feature of a member object, $b_h$ being that of the object whose category is person and $b_o$ that of the object whose category is object; $\mathrm{RoI}(F, b_h)$ or $\mathrm{RoI}(F, b_o)$ denotes the visual feature of a member object, $\mathrm{RoI}(F, b_h)$ being that of the object whose category is person and $\mathrm{RoI}(F, b_o)$ that of the object whose category is object.
S1024、根据处理后的二维特征数据,对每一组对象进行分类,得到每一组对象的空间结果,以及根据处理后的视觉特征,对每个组员对象进行分类,得到每个组员对象的动作结果。S1024. Classify each group of objects according to the processed two-dimensional feature data to obtain the spatial result of each group of objects, and classify each group member object according to the processed visual features to obtain each group member The object's action result.
In the embodiment of the present disclosure, when the terminal obtains the processed two-dimensional feature data and the processed visual feature of each member object, it may perform spatial classification on the group according to the processed two-dimensional feature data to obtain the spatial result corresponding to the group, and perform action classification on each member object according to its processed visual feature to obtain the action result of that member object.
In some embodiments, the terminal may input the processed two-dimensional feature data into one fully connected layer, classify the group through that layer to obtain a spatial classification score, and take the spatial classification score as the spatial result of the group; and the terminal may input the processed visual feature of each member object into another fully connected layer, classify the member object through that layer to obtain an action classification score, and take the action classification score as the action result.
示例性地,终端根据处理后的二维特征数据,对每一组对象进行分类,得到每一组对象的空间结果,以及根据每个组员对象的处理后的视觉特征,对每个组员对象进行分类,得到每个组员对象的动作结果,可以通过下述公式(13)、(14)和(15)分别表示:Exemplarily, the terminal classifies each group of objects according to the processed two-dimensional feature data, obtains the spatial result of each group of objects, and classifies each group member according to the processed visual features of each group member object Objects are classified to obtain the action results of each team member object, which can be expressed by the following formulas (13), (14) and (15):
$s^{sp} = \sigma\big(W_{h,o}\, f_{h,o}\big)$    (13)
$s^{a}_{h} = \sigma\big(W_{h}\, f_{h}\big)$    (14)
$s^{a}_{o} = \sigma\big(W_{o}\, f_{o}\big)$    (15)
where $s^{sp}$ denotes the spatial result of each group of objects; $s^{a}_{h}$ or $s^{a}_{o}$ denotes the action result of a member object, $s^{a}_{h}$ being the action result of the member object whose category is person and $s^{a}_{o}$ that of the member object whose category is object; $W_h$ denotes the learning weight of the fully connected layer corresponding to the member object whose category is person, $W_o$ denotes that of the fully connected layer corresponding to the member object whose category is object, and $W_{h,o}$ denotes that of the fully connected layer corresponding to each group of objects.
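Exemplarily, the classification heads of formulas (13)-(15) may be sketched as follows; the sigmoid activation, single-logit outputs and 768-d feature size are illustrative assumptions.

import torch

d = 768
fc_sp = torch.nn.Linear(d, 1)   # W_h,o: spatial classifier for the group
fc_h = torch.nn.Linear(d, 1)    # W_h: action classifier for the person member
fc_o = torch.nn.Linear(d, 1)    # W_o: action classifier for the object member

f_ho, f_h, f_o = torch.randn(1, d), torch.randn(1, d), torch.randn(1, d)
s_sp = torch.sigmoid(fc_sp(f_ho))   # spatial result (formula 13)
a_h = torch.sigmoid(fc_h(f_h))      # person action result (formula 14)
a_o = torch.sigmoid(fc_o(f_o))      # object action result (formula 15)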
以下将结合一个具体的应用场景对本公开的技术方案进行描述;图10是本公开实施例提供的示例性地采用行为识别方法识别一张待检测图像中的对象行为的部分流程示意图。The technical solution of the present disclosure will be described below in conjunction with a specific application scenario; FIG. 10 is a partial flow diagram of an example of using a behavior recognition method to identify an object behavior in an image to be detected provided by an embodiment of the disclosure.
As shown in FIG. 10, the terminal performs target detection and word-vector detection on an image to be detected I, obtaining the position feature, confidence result and word-vector feature of each object in the image; for example, as shown in FIG. 10, when a motorcycle and a helmet are detected, a detector may perform word-vector detection on the motorcycle to obtain its word-vector feature, and on the helmet to obtain its word-vector feature. In addition, according to the position feature of each object, the terminal may crop features from the RoI-pooled image obtained during the image detection of the image to be detected, so as to obtain the visual feature of each object.
On the one hand, after obtaining the position feature, word-vector feature and visual feature of each object, the terminal may, through a semantic encoding module, encode the position feature and the word-vector feature respectively with MLPs to obtain the first feature and the third feature; at the same time, it performs a dimension transformation (reshape) on the visual feature of each object and likewise encodes the reshaped visual feature with an MLP to obtain a second feature of the same dimension as the first and third features, and stacks the first, second and third features along the channel dimension to obtain the multi-dimensional feature corresponding to each object in the image to be detected I. According to the multi-dimensional features in one-to-one correspondence with all objects (i.e. the multiple objects) of the image I, the terminal generates the fully connected graph corresponding to all objects and represents it by an adjacency matrix (not shown in FIG. 10); the adjacency matrix and the multi-dimensional features of all objects are taken as the input of the GCN, and the updated multi-dimensional feature of each object is obtained through the graph convolution processing of the GCN. According to the updated multi-dimensional feature of each object, the relationship interaction feature of each group of objects is obtained and input into fully connected layers (FCs) to classify each group, yielding its interaction result $s^{in}$; here, grouping all objects in the image I yields multiple groups of objects. According to the interaction result of each group, the terminal keeps each group whose interaction result is greater than or equal to the first preset score threshold, obtaining multiple associated object groups whose member objects are associated with each other. Then, according to the relationship interaction feature corresponding to each associated object group and the preset parameters, the terminal updates the multi-dimensional feature of each member object in the group to obtain its refined feature (this update may be represented by the message-passing process in FIG. 10); for each associated object group, the terminal stacks the refined features of all member objects along the channel dimension to obtain the graph interaction feature of the group (not shown in FIG. 10), and inputs that feature into a fully connected layer for classification to obtain the graph relationship result $s^{g}$ of the group.
On the other hand, the terminal obtains the image region of each object according to its position feature, stitches the image regions of the member objects of each group to obtain the image region of the group, and encodes that image region to obtain the two-dimensional feature data. Then, for each group of objects, the visual features of its member objects (for example, object 1 and object 2 in FIG. 10) are input into residual networks for feature extraction, yielding different second sub-features (not shown in FIG. 10); for the two-dimensional feature data of each group, the data is input into a convolutional neural network for feature extraction, yielding the first sub-feature (not shown in FIG. 10). Global average pooling is applied to the first sub-feature and to each second sub-feature, giving the processed visual feature of each member object (for example, the processed visual features of object 1 and object 2 in FIG. 10) and the processed two-dimensional feature data of the group. The processed visual feature of object 1, the processed visual feature of object 2, and the processed two-dimensional feature data of the group are then input into different fully connected layers for classification, yielding the action result $s^{a}_{1}$ of object 1, the action result $s^{a}_{2}$ of object 2, and the spatial result $s^{sp}$ of the group consisting of object 1 and object 2. Finally, the terminal substitutes all results obtained in the above process into formula (9) to calculate the target result of each associated object group; from the associated object group corresponding to the highest of the obtained target results, the human-object interaction behavior in the image to be detected I can be recognized.
The present disclosure further provides a behavior recognition apparatus. FIG. 11 is a schematic structural diagram of the behavior recognition apparatus provided by an embodiment of the present disclosure; as shown in FIG. 11, the behavior recognition apparatus 1 includes: an encoding part 10, configured to detect each image to be detected to obtain features of multiple objects, and encode the features to obtain multi-dimensional features in one-to-one correspondence with the multiple objects; a result determining part 20, configured to determine, based on some of the features of each member object of each group of objects, the spatial result of the at least two categories of objects in each group and the action result of each member object, where each group of objects contains at least an object whose category is object and an object whose category is person among the multiple objects; to determine the relationship interaction feature of each group of objects based on the multi-dimensional features; and, when it is determined according to the relationship interaction feature that the member objects in each group of objects are associated with each other, to determine the target result of each group of objects based on the spatial result and the action result, obtaining at least one target result; and a behavior determining part 30, configured to determine the object behavior in each image to be detected based on the at least one target result.
In some embodiments of the present disclosure, the result determining part 20 is further configured to generate a fully connected graph corresponding to the multiple objects, based on the multi-dimensional features in one-to-one correspondence with the multiple objects; perform graph convolution processing on the multi-dimensional feature corresponding to each object and on the fully connected graph, to obtain an updated multi-dimensional feature in one-to-one correspondence with each object; and obtain the relationship interaction feature of each group of objects from the updated multi-dimensional features of the group member objects in the group.
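In outline, the graph convolution described here could look like the following sketch; the single shared weight matrix, the ReLU activation, and the concatenation used to form a group's relationship interaction feature are assumptions rather than the disclosure's exact scheme.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)

    def forward(self, feats, adj):
        # feats: (N, dim) multi-dimensional features of the N detected objects
        # adj:   (N, N) adjacency matrix of the fully connected graph
        # each node aggregates the features of all nodes, weighted by adjacency
        return torch.relu(self.weight(adj @ feats))

def relation_feature(updated_feats, i, j):
    # relationship interaction feature of a group, here formed by concatenating
    # the updated features of its two member objects (an assumed fusion)
    return torch.cat([updated_feats[i], updated_feats[j]], dim=-1)
```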
In some embodiments of the present disclosure, the result determining part 20 is further configured to classify each group of objects according to the relationship interaction feature, to obtain an interaction result of the group; and to determine, in a case where the interaction result is greater than or equal to a first preset score threshold, that the group member objects in the group are associated with each other.
In some embodiments of the present disclosure, the result determining part 20 is further configured to update the multi-dimensional feature of each group member object based on the relationship interaction feature of each group of objects and on preset parameters, obtaining a refined feature of each group member object, and determine a graph interaction feature of each group of objects based on the refined features; classify each group of objects based on the graph interaction feature, to obtain a graph relationship result; and determine the target result of each group of objects based on the spatial result, the action result, the interaction result, the graph relationship result, and the confidence results obtained when performing the detection on each group member object.
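For intuition only, one plausible way to combine the spatial, action, graph relationship, interaction, and confidence results into a single target result is a product of scores, as sketched below; the actual combination is given by formula (9) of the description and may well differ.

```python
import torch

def target_result(s_spatial, s_act_h, s_act_o, s_graph, s_interact,
                  conf_h, conf_o):
    # s_spatial, s_act_h, s_act_o, s_graph: (NUM_ACTIONS,) class-wise scores
    # s_interact: scalar interaction result of the group
    # conf_h, conf_o: detection confidence results of the person and the object
    per_action = s_spatial * (s_act_h + s_act_o) * s_graph
    return conf_h * conf_o * s_interact * per_action.max()
```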
In some embodiments of the present disclosure, the target result is a target value; the behavior determining part 30 is further configured to select, according to the at least one target value, the associated object group corresponding to the highest target value from multiple associated object groups in one-to-one correspondence with the at least one target value, and identify the behavior between the group member objects in the selected associated object group.
In some embodiments of the present disclosure, the fully connected graph is represented by an adjacency matrix, where each entry of the adjacency matrix represents the degree of association between the two corresponding objects; the result determining part 20 is further configured to iterate the multi-dimensional feature of each object through a graph neural network, based on the adjacency matrix and on the multi-dimensional feature corresponding to each object, to obtain the updated multi-dimensional feature in one-to-one correspondence with each object.
In some embodiments of the present disclosure, the two objects include a first object and a second object; the result determining part 20 is further configured to determine the similarity between the multi-dimensional feature of the first object and the multi-dimensional feature of the second object; determine the distance between the first object and the second object, based on the position feature of the first object in each image to be detected and the position feature of the second object in each image to be detected; and determine the degree of association between the first object and the second object based on the similarity and the distance.
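A minimal sketch of how an adjacency entry could be derived from feature similarity and spatial distance is shown below; the choice of cosine similarity, the exponential distance decay, and the parameter sigma are assumptions.

```python
import torch
import torch.nn.functional as F

def association_degree(feat_i, feat_j, center_i, center_j, sigma=1.0):
    # feat_*: (D,) multi-dimensional features; center_*: (2,) box centres
    sim = F.cosine_similarity(feat_i, feat_j, dim=0)      # feature similarity
    dist = torch.linalg.vector_norm(center_i - center_j)  # spatial distance
    # more similar and closer pairs receive a higher degree of association
    return sim * torch.exp(-dist / sigma)
```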
In some embodiments of the present disclosure, the result determining part 20 is further configured to iteratively update the multi-dimensional feature of each object based on an update parameter, the adjacency matrix, a first weight parameter corresponding to the iteration number, and the multi-dimensional feature corresponding to each object, and, in a case where the number of iterations reaches a first preset number, take the features generated after the first preset number of iterations as the updated multi-dimensional feature of each object.
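This iterative update might be sketched as follows, with one weight matrix per iteration and a scalar standing in for the update parameter; the iteration count, feature dimension, and blending scheme are all illustrative.

```python
import torch
import torch.nn as nn

T, D = 2, 256  # first preset number of iterations and feature dimension (assumed)
iter_weights = nn.ModuleList(nn.Linear(D, D, bias=False) for _ in range(T))

def iterate_features(feats, adj, alpha=0.5):
    # feats: (N, D); adj: (N, N); alpha stands in for the update parameter
    for t in range(T):
        feats = (1 - alpha) * feats + alpha * torch.relu(iter_weights[t](adj @ feats))
    # the features generated after the first preset number of iterations are
    # taken as the updated multi-dimensional features
    return feats
```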
In some embodiments of the present disclosure, the preset parameters include a second weight parameter and a number of iterations; the result determining part 20 is further configured to iteratively update the multi-dimensional feature of each group member object based on the second weight parameter and on the relationship interaction feature of each group of objects, and, in a case where the number of iterations reaches a second preset number, take the features generated after the second preset number of iterations as the refined feature of each group member object.
In some embodiments of the present disclosure, the detection includes image detection and word vector detection; the encoding part 10 is further configured to encode the position features in one-to-one correspondence with the multiple objects, to obtain a first feature of each object; encode the visual features in one-to-one correspondence with the multiple objects, to obtain a second feature of each object, where the position features and the visual features are obtained by performing image detection on each image to be detected; encode the word vector features in one-to-one correspondence with the multiple objects, to obtain a third feature of each object, where the word vector features are obtained by performing word vector detection on the category information of each object, and the category information is obtained by performing image detection on each image to be detected; and obtain the multi-dimensional features in one-to-one correspondence with the multiple objects from the first feature, the second feature, and the third feature, where the first feature, the second feature, and the third feature have the same dimension.
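The three encodings to a common dimension could be sketched as below; the input sizes (a 4-value bounding box, a 2048-dimensional visual feature, a 300-dimensional word vector) and the additive fusion are assumptions for illustration.

```python
import torch
import torch.nn as nn

D = 256  # common dimension of the first, second, and third features (assumed)

pos_encoder = nn.Linear(4, D)     # position feature, e.g. a box (x0, y0, x1, y1)
vis_encoder = nn.Linear(2048, D)  # visual feature from the detector backbone
word_encoder = nn.Linear(300, D)  # word vector feature of the category information

def encode_object(box, vis, wordvec):
    f1 = pos_encoder(box)        # first feature
    f2 = vis_encoder(vis)        # second feature
    f3 = word_encoder(wordvec)   # third feature
    # because the three features share dimension D they can be fused, here by sum
    return f1 + f2 + f3
```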
In some embodiments of the present disclosure, the encoding part 10 is further configured to perform dimension transformation on the visual features in one-to-one correspondence with the multiple objects, to obtain a dimension-transformed visual feature of each object, and encode the dimension-transformed visual features to obtain the second feature of each object.
In some embodiments of the present disclosure, the partial features include the position feature and the visual feature of each group member object, where the position feature and the visual feature are obtained by performing image detection on each image to be detected; the result determining part 20 is further configured to determine, based on the position feature of each group member object of each group of objects, the image region of each group member object in each image to be detected; obtain the image region corresponding to each group of objects from the image regions of the group member objects, and encode the image region corresponding to each group of objects to obtain two-dimensional feature data; perform feature processing on the two-dimensional feature data and on the visual feature of each group member object, respectively, correspondingly obtaining processed two-dimensional feature data and processed visual features; and classify each group of objects according to the processed two-dimensional feature data, to obtain the spatial result of the group, and classify each group member object according to the processed visual features, to obtain the action result of each group member object.
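One common way to encode a pair's image regions as two-dimensional feature data is a two-channel binary map over the pair's union box, one channel per member object, as sketched below; the 64x64 resolution and the binary encoding are assumptions rather than the disclosure's exact scheme.

```python
import torch

def spatial_map(box_1, box_2, size=64):
    # box_*: (x0, y0, x1, y1) image regions of the two group member objects;
    # the output is a two-channel binary map over their union box
    x0 = min(box_1[0], box_2[0]); y0 = min(box_1[1], box_2[1])
    x1 = max(box_1[2], box_2[2]); y1 = max(box_1[3], box_2[3])
    sx, sy = size / (x1 - x0), size / (y1 - y0)
    out = torch.zeros(2, size, size)
    for c, (bx0, by0, bx1, by1) in enumerate((box_1, box_2)):
        u0, v0 = int((bx0 - x0) * sx), int((by0 - y0) * sy)
        u1, v1 = int((bx1 - x0) * sx), int((by1 - y0) * sy)
        out[c, v0:max(v1, v0 + 1), u0:max(u1, u0 + 1)] = 1.0
    return out
```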
In some embodiments of the present disclosure, the apparatus further includes a detection part, configured to perform image detection on each image to be detected, to obtain the position feature, visual feature, and confidence result of each detected target, as well as category information corresponding to the confidence result; take the targets whose confidence results are greater than or equal to a second preset score threshold as the detected objects, obtaining the position features, the visual features, and the category information in one-to-one correspondence with the multiple objects; and perform word vector detection on the category information of each object, to obtain the word vector feature of each object.
In the embodiments of the present disclosure and in other embodiments, a "part" may be part of a circuit, part of a processor, part of a program or software, and the like; it may also be a unit, and may be a module or be non-modular.
An embodiment of the present disclosure further provides an electronic device. FIG. 12 is a schematic structural diagram of the electronic device provided by an embodiment of the present disclosure. As shown in FIG. 12, the electronic device includes a memory 22 and a processor 23 connected through a bus 21. The memory 22 is configured to store an executable computer program, and the processor 23 is configured to implement, when executing the executable computer program stored in the memory 22, the method provided by the embodiments of the present disclosure, for example, the behavior recognition method provided by the embodiments of the present disclosure.
An embodiment of the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by the processor 23, implements the method provided by the embodiments of the present disclosure, for example, the behavior recognition method provided by the embodiments of the present disclosure.
An embodiment of the present disclosure provides a computer program including computer-readable code which, when run in an electronic device, causes a processor in the electronic device to perform the steps for implementing the above behavior recognition method.
An embodiment of the present disclosure provides a computer program product including computer program instructions which cause a computer to perform the steps of the above behavior recognition method.
In some embodiments of the present disclosure, the computer-readable storage medium may be a memory such as an FRAM, a ROM, a PROM, an EPROM, an EEPROM, a flash memory, a magnetic surface memory, an optical disc, or a CD-ROM; it may also be any device including one of, or any combination of, the above memories.
The computer-readable storage medium may also be a tangible device that holds and stores instructions for use by an instruction execution device, and may be a volatile or non-volatile storage medium. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a USB flash drive, a magnetic disk, an optical disc, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punched card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
In some embodiments of the present disclosure, the computer program instructions may take the form of a program, software, software module, script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
As an example, the computer program instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example, in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, subroutines, or portions of code).
As an example, the computer program instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The above descriptions are merely embodiments of the present disclosure and are not intended to limit the protection scope of the present disclosure. Any modification, equivalent replacement, or improvement made within the spirit and scope of the present disclosure shall fall within the protection scope of the present disclosure.
Industrial Applicability
The embodiments of the present disclosure disclose a behavior recognition method and apparatus, an electronic device, a computer-readable storage medium, a computer program, and a computer program product. The method includes: detecting each image to be detected to obtain features of multiple objects, and encoding the features to obtain multi-dimensional features in one-to-one correspondence with the multiple objects; determining, based on some of the features of each member object of each group of objects, the spatial result of at least two categories of objects in each group of objects and the action result of each member object; determining the relationship interaction feature of each group of objects based on the multi-dimensional features, and, in a case where it is determined from the relationship interaction feature that the member objects in each group of objects are associated with each other, determining the target result of each group of objects based on the spatial result and the action result, obtaining at least one target result; and determining the object behavior in each image to be detected based on the at least one target result. The present disclosure can improve the recognition accuracy and recognition efficiency when recognizing human-object interaction behaviors.

Claims (18)

  1. A behavior recognition method, comprising:
    detecting each image to be detected to obtain features of multiple objects, and encoding the features to obtain multi-dimensional features in one-to-one correspondence with the multiple objects;
    determining, based on some of the features of each group member object of each group of objects, a spatial result of at least two categories of objects in each group of objects and an action result of each group member object, wherein each group of objects comprises at least, from among the multiple objects, an object whose category is object and an object whose category is person;
    determining a relationship interaction feature of each group of objects based on the multi-dimensional features, and, in a case where it is determined from the relationship interaction feature that the group member objects in each group of objects are associated with each other, determining a target result of each group of objects based on the spatial result and the action result, to obtain at least one target result; and
    determining an object behavior in each image to be detected based on the at least one target result.
  2. The method according to claim 1, wherein determining the relationship interaction feature of each group of objects based on the multi-dimensional features comprises:
    generating a fully connected graph corresponding to the multiple objects, based on the multi-dimensional features in one-to-one correspondence with the multiple objects;
    performing graph convolution processing on the multi-dimensional feature corresponding to each object and on the fully connected graph, to obtain an updated multi-dimensional feature in one-to-one correspondence with each object; and
    obtaining the relationship interaction feature of each group of objects from the updated multi-dimensional features of the group member objects in each group of objects.
  3. The method according to claim 1 or 2, wherein determining, from the relationship interaction feature, that the group member objects in each group of objects are associated with each other comprises:
    classifying each group of objects according to the relationship interaction feature, to obtain an interaction result of each group of objects; and
    determining, in a case where the interaction result is greater than or equal to a first preset score threshold, that the group member objects in each group of objects are associated with each other.
  4. The method according to claim 3, wherein determining the target result of each group of objects based on the spatial result and the action result comprises:
    updating the multi-dimensional feature of each group member object based on the relationship interaction feature of each group of objects and on preset parameters, to obtain a refined feature of each group member object, and determining a graph interaction feature of each group of objects based on the refined features;
    classifying each group of objects based on the graph interaction feature, to obtain a graph relationship result; and
    determining the target result of each group of objects based on the spatial result, the action result, the interaction result, the graph relationship result, and the confidence results obtained when performing the detection on each group member object.
  5. The method according to claim 1, wherein the target result is a target value, and determining the object behavior in each image to be detected based on the at least one target result comprises:
    selecting, according to the at least one target value, the associated object group corresponding to the highest target value from multiple associated object groups in one-to-one correspondence with the at least one target value, and identifying the behavior between the group member objects in the selected associated object group.
  6. The method according to claim 2, wherein the fully connected graph is represented by an adjacency matrix, and each entry of the adjacency matrix represents a degree of association between the two corresponding objects; and
    performing graph convolution processing on the multi-dimensional feature corresponding to each object and on the fully connected graph, to obtain the updated multi-dimensional feature in one-to-one correspondence with each object, comprises:
    iterating the multi-dimensional feature of each object through a graph neural network, based on the adjacency matrix and on the multi-dimensional feature corresponding to each object, to obtain the updated multi-dimensional feature in one-to-one correspondence with each object.
  7. The method according to claim 6, wherein the two objects comprise a first object and a second object, and determining the degree of association between the two objects comprises:
    determining a similarity between the multi-dimensional feature of the first object and the multi-dimensional feature of the second object;
    determining a distance between the first object and the second object, based on a position feature of the first object in each image to be detected and a position feature of the second object in each image to be detected; and
    determining the degree of association between the first object and the second object based on the similarity and the distance.
  8. The method according to claim 6 or 7, wherein iterating the multi-dimensional feature of each object through the graph neural network, based on the adjacency matrix and on the multi-dimensional feature corresponding to each object, to obtain the updated multi-dimensional feature in one-to-one correspondence with each object, comprises:
    iteratively updating the multi-dimensional feature of each object based on an update parameter, the adjacency matrix, a first weight parameter corresponding to the iteration number, and the multi-dimensional feature corresponding to each object, and, in a case where the number of iterations reaches a first preset number, taking the features generated after the first preset number of iterations as the updated multi-dimensional feature of each object.
  9. The method according to claim 4, wherein the preset parameters comprise a second weight parameter and a number of iterations, and updating the multi-dimensional feature of each group member object based on the relationship interaction feature of each group of objects and on the preset parameters, to obtain the refined feature of each group member object, comprises:
    iteratively updating the multi-dimensional feature of each group member object based on the second weight parameter and on the relationship interaction feature of each group of objects, and, in a case where the number of iterations reaches a second preset number, taking the features generated after the second preset number of iterations as the refined feature of each group member object.
  10. The method according to claim 1, wherein the detection comprises image detection and word vector detection, and encoding the features to obtain the multi-dimensional features in one-to-one correspondence with the multiple objects comprises:
    encoding the position features in one-to-one correspondence with the multiple objects, to obtain a first feature of each object;
    encoding the visual features in one-to-one correspondence with the multiple objects, to obtain a second feature of each object, the position features and the visual features being obtained by performing image detection on each image to be detected;
    encoding the word vector features in one-to-one correspondence with the multiple objects, to obtain a third feature of each object, the word vector features being obtained by performing word vector detection on the category information of each object, and the category information being obtained by performing image detection on each image to be detected; and
    obtaining the multi-dimensional features in one-to-one correspondence with the multiple objects from the first feature, the second feature, and the third feature, wherein the first feature, the second feature, and the third feature have the same dimension.
  11. The method according to claim 10, wherein encoding the visual features in one-to-one correspondence with the multiple objects, to obtain the second feature of each object, comprises:
    performing dimension transformation on the visual features in one-to-one correspondence with the multiple objects, to obtain a dimension-transformed visual feature of each object; and
    encoding the dimension-transformed visual features to obtain the second feature of each object.
  12. The method according to claim 1, 10 or 11, wherein the partial features comprise a position feature and a visual feature of each group member object, the position feature and the visual feature being obtained by performing image detection on each image to be detected; and
    determining, based on some of the features of each group member object of each group of objects, the spatial result of at least two categories of objects in each group of objects and the action result of each group member object comprises:
    determining, based on the position feature of each group member object of each group of objects, an image region of each group member object in each image to be detected;
    obtaining an image region corresponding to each group of objects from the image regions of the group member objects, and encoding the image region corresponding to each group of objects to obtain two-dimensional feature data;
    performing feature processing on the two-dimensional feature data and on the visual feature of each group member object, respectively, to correspondingly obtain processed two-dimensional feature data and processed visual features; and
    classifying each group of objects according to the processed two-dimensional feature data, to obtain the spatial result of each group of objects, and classifying each group member object according to the processed visual features, to obtain the action result of each group member object.
  13. The method according to claim 1, wherein detecting each image to be detected to obtain the features of multiple objects comprises:
    performing image detection on each image to be detected, to obtain a position feature, a visual feature, and a confidence result of each detected target, and category information corresponding to the confidence result;
    taking targets whose confidence results are greater than or equal to a second preset score threshold as the detected objects, to obtain the position features, the visual features, and the category information in one-to-one correspondence with the multiple objects; and
    performing word vector detection on the category information of each object, to obtain a word vector feature of each object.
  14. A behavior recognition apparatus, comprising:
    an encoding part, configured to detect each image to be detected to obtain features of multiple objects, and encode the features to obtain multi-dimensional features in one-to-one correspondence with the multiple objects;
    a result determining part, configured to determine, based on some of the features of each group member object of each group of objects, a spatial result of at least two categories of objects in each group of objects and an action result of each group member object, wherein each group of objects comprises at least, from among the multiple objects, an object whose category is object and an object whose category is person; determine a relationship interaction feature of each group of objects based on the multi-dimensional features; and, in a case where it is determined from the relationship interaction feature that the group member objects in each group of objects are associated with each other, determine a target result of each group of objects based on the spatial result and the action result, to obtain at least one target result; and
    a behavior determining part, configured to determine an object behavior in each image to be detected based on the at least one target result.
  15. An electronic device, comprising:
    a memory, configured to store an executable computer program; and
    a processor, configured to implement the method according to any one of claims 1 to 13 when executing the executable computer program stored in the memory.
  16. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed, causes a processor to implement the method according to any one of claims 1 to 13.
  17. A computer program, comprising computer-readable code which, when run in an electronic device, causes a processor in the electronic device to perform the steps for implementing the method according to any one of claims 1 to 13.
  18. A computer program product, comprising computer program instructions which cause a computer to perform the steps of the method according to any one of claims 1 to 13.
PCT/CN2022/074120 2021-07-02 2022-01-26 Behavior recognition method and apparatus, and electronic device, computer-readable storage medium, computer program and computer program product WO2023273334A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110750749.8A CN113469056A (en) 2021-07-02 2021-07-02 Behavior recognition method and device, electronic equipment and computer readable storage medium
CN202110750749.8 2021-07-02

Publications (1)

Publication Number Publication Date
WO2023273334A1 true WO2023273334A1 (en) 2023-01-05

Family

ID=77877487

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074120 WO2023273334A1 (en) 2021-07-02 2022-01-26 Behavior recognition method and apparatus, and electronic device, computer-readable storage medium, computer program and computer program product

Country Status (2)

Country Link
CN (1) CN113469056A (en)
WO (1) WO2023273334A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469056A (en) * 2021-07-02 2021-10-01 上海商汤智能科技有限公司 Behavior recognition method and device, electronic equipment and computer readable storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10366166B2 (en) * 2017-09-07 2019-07-30 Baidu Usa Llc Deep compositional frameworks for human-like language acquisition in virtual environments
CN108289177B (en) * 2018-02-13 2020-10-16 北京旷视科技有限公司 Information interaction method, device and system
CN110413819B (en) * 2019-07-12 2022-03-29 深兰科技(上海)有限公司 Method and device for acquiring picture description information
CN112219224B (en) * 2019-12-30 2024-04-26 商汤国际私人有限公司 Image processing method and device, electronic equipment and storage medium
CN111914622B (en) * 2020-06-16 2024-03-26 北京工业大学 Character interaction detection method based on deep learning
CN111931002B (en) * 2020-06-30 2024-08-13 华为技术有限公司 Matching method and related equipment
CN111881854A (en) * 2020-07-31 2020-11-03 上海商汤临港智能科技有限公司 Action recognition method and device, computer equipment and storage medium
CN111949131B (en) * 2020-08-17 2023-04-25 陈涛 Eye movement interaction method, system and equipment based on eye movement tracking technology
CN111967399A (en) * 2020-08-19 2020-11-20 辽宁科技大学 Improved fast RCNN behavior identification method
CN112580442B (en) * 2020-12-02 2022-08-09 河海大学 Behavior identification method based on multi-dimensional pyramid hierarchical model
CN112906484B (en) * 2021-01-25 2023-05-12 北京市商汤科技开发有限公司 Video frame processing method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190286892A1 (en) * 2018-03-13 2019-09-19 Adobe Inc. Interaction Detection Model for Identifying Human-Object Interactions in Image Content
CN112232357A (en) * 2019-07-15 2021-01-15 北京京东尚科信息技术有限公司 Image processing method, image processing device, computer-readable storage medium and electronic equipment
CN111797705A (en) * 2020-06-11 2020-10-20 同济大学 Action recognition method based on character relation modeling
CN112861848A (en) * 2020-12-18 2021-05-28 上海交通大学 Visual relation detection method and system based on known action conditions
CN113469056A (en) * 2021-07-02 2021-10-01 上海商汤智能科技有限公司 Behavior recognition method and device, electronic equipment and computer readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Deep Learning - HOI Character Interaction Algorithm: ICAN_sakura Sakura's Blog-CSDN Blog_Character Animal Interaction Episode ll0", 1 March 2019 (2019-03-01), XP093019687, Retrieved from the Internet <URL:https://blog.csdn.net/Sakura55/article/details/87800747> [retrieved on 20230201] *
GAO CHEN, ZOU YULIANG, HUANG JIA-BIN, VIRGINIA VIRGINIA TECH: "INSTANCE-CENTRIC ATTENTION NETWORK iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection", 30 August 2018 (2018-08-30), XP055957091, Retrieved from the Internet <URL:https://arxiv.org/pdf/1808.10437.pdf> [retrieved on 20220901] *
WANG HAORAN; JIAO LICHENG; LIU FANG; LI LINGLING; LIU XU; JI DEYI; GAN WEIHAO: "IPGN: Interactiveness Proposal Graph Network for Human-Object Interaction Detection", IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE, USA, vol. 30, 16 July 2021 (2021-07-16), USA, pages 6583 - 6593, XP011867591, ISSN: 1057-7149, DOI: 10.1109/TIP.2021.3096333 *

Also Published As

Publication number Publication date
CN113469056A (en) 2021-10-01

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE