WO2021073311A1 - Image recognition method and apparatus, computer-readable storage medium and chip - Google Patents

Image recognition method and apparatus, computer-readable storage medium and chip

Info

Publication number
WO2021073311A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
frame
person
processed
characters
Prior art date
Application number
PCT/CN2020/113788
Other languages
French (fr)
Chinese (zh)
Inventor
严锐
谢凌曦
田奇
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021073311A1 publication Critical patent/WO2021073311A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an image recognition method, device, computer readable storage medium and chip.
  • Computer vision is an integral part of intelligent/autonomous systems in a wide range of application fields, such as manufacturing, inspection, document analysis, medical diagnosis, and military applications. It studies how to use cameras/video cameras and computers to obtain the data and information about a subject that we need. Figuratively speaking, it gives the computer eyes (a camera/camcorder) and a brain (algorithms) so that it can replace the human eye in identifying, tracking, and measuring targets, enabling the computer to perceive the environment. Because perception can be regarded as extracting information from sensory signals, computer vision can also be regarded as the science of making artificial systems "perceive" from images or multi-dimensional data.
  • In general, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses the computer in place of the brain to process and interpret that input information.
  • the ultimate research goal of computer vision is to enable computers to observe and understand the world through vision like humans, and have the ability to adapt to the environment autonomously.
  • Action recognition is an important research topic in the field of computer vision.
  • the computer can understand the content of the video through motion recognition.
  • Motion recognition technology can be widely used in public place monitoring, human-computer interaction and other fields.
  • Feature extraction is a key link in the process of action recognition. Only based on accurate features, can action recognition be effectively performed.
  • In group action recognition, the temporal relationship of each person's actions across the video and the relationship between the actions of different persons both affect the accuracy of group action recognition.
  • LSTM long short-term memory
  • The interactive action features of each person can be calculated, so that the action features of each person can be determined according to those interactive action features, and the actions of the multiple persons can then be inferred from the action features of each person.
  • Interactive action features are used to express the correlation between persons' actions.
  • the present application provides an image recognition method, device, computer readable storage medium, and chip to better recognize group actions of multiple people in an image to be processed.
  • In a first aspect, an image recognition method is provided, which includes: extracting image features of an image to be processed, where the image to be processed includes multiple frames of images; determining the temporal feature of each of multiple persons in each frame of the multi-frame image; determining the spatial feature of each of the multiple persons in each frame of the multi-frame image; determining the action feature of each of the multiple persons in each frame of the multi-frame image; and recognizing the group action of the multiple persons in the image to be processed based on the action features of each of the multiple persons in each frame of the multi-frame image.
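  • As an illustration only, the following Python sketch outlines the flow of the method in the first aspect; every helper (extract_image_feature, temporal_feature, spatial_feature, fuse, classify_group_action) is a hypothetical stand-in, not part of the application itself.

```python
def recognize_group_action(frames, persons, extract_image_feature,
                           temporal_feature, spatial_feature,
                           fuse, classify_group_action):
    """Illustrative flow only: frames is a list of T images, persons a list of
    K tracked persons; all helper callables are hypothetical stand-ins."""
    T, K = len(frames), len(persons)

    # 1. Image feature of every person in every frame.
    feats = {(i, j): extract_image_feature(frames[i], persons[j])
             for i in range(T) for j in range(K)}

    action_feats = {}
    for i in range(T):
        for j in range(K):
            # 2. Temporal feature: person j's action in frame i vs. the same
            #    person's actions in the other frames.
            t_feat = temporal_feature(feats, i, j)
            # 3. Spatial feature: person j's action in frame i vs. the other
            #    persons' actions in the same frame.
            s_feat = spatial_feature(feats, i, j)
            # 4. Action feature: fuse temporal, spatial and image features.
            action_feats[(i, j)] = fuse(t_feat, s_feat, feats[(i, j)])

    # 5. Group action of the multiple persons from the per-person features.
    return classify_group_action(action_feats)
```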
  • the group actions of multiple characters in the image to be processed may be a certain sport or activity.
  • the group actions of multiple characters in the image to be processed may be basketball, volleyball, football or dancing, etc. .
  • the image to be processed includes multiple people, and the image features of the image to be processed include image features of the multiple people in each of the multiple frames of the image to be processed.
  • the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image received by the image recognition device from another device, Alternatively, the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
  • The above-mentioned image to be processed may be consecutive multiple frames of images in a video, or multiple frames of images selected from a video according to a preset rule.
  • the multiple characters may include only humans, or only animals, or both humans and animals.
  • the person in the image can be identified to determine the person's bounding box.
  • the image in each bounding box corresponds to a person in the image.
  • Feature extraction can be performed on the image in each bounding box to obtain the image feature of each person.
  • The bone nodes of the person in the bounding box corresponding to each person can be identified first, and the image feature vector of the person can then be extracted according to the bone nodes of each person, so that the extracted image features more accurately reflect the person's actions, which improves the accuracy of the extracted image features.
  • the bone nodes in the bounding box can be connected according to the structure of the person to obtain a connected image, and then the image feature vector is extracted on the connected image.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
  • the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
  • The temporal association between a person's actions at different moments can be determined from the similarity between the image feature vectors of that person's actions in different frames of images, and the person's temporal features are then obtained.
  • the multi-frame images in the image to be processed are specifically T frames, and i is a positive integer less than or equal to T
  • the i-th frame of image represents the images in the corresponding order in the T frame image
  • the j-th character represents the characters in the corresponding order among the K characters
  • both i and j are positive integers.
  • The temporal feature of the j-th person in the i-th frame of the image to be processed is determined based on the similarity between the image feature of the j-th person in the i-th frame and the image features of the j-th person in the other frames of the multi-frame image.
  • The temporal feature of the j-th person in the i-th frame is used to indicate the relationship between the action of the j-th person in the i-th frame and the actions of the j-th person in the other frames of the above-mentioned multi-frame image.
  • the similarity between the corresponding image features of a certain person in the two frames of images can reflect the degree of dependence of the person's actions on time.
  • the spatial correlation between the actions of different characters in the frame of image is determined by the similarity between the image characteristics of different characters in the same frame of image.
  • The spatial feature of the j-th of the multiple persons in the i-th frame of the above-mentioned multi-frame image to be processed is determined based on the similarity between the image feature of the j-th person in the i-th frame and the image features of the persons other than the j-th person in the i-th frame. That is to say, the spatial feature of the j-th person in the i-th frame can be determined based on the similarity between the image feature of the j-th person in the i-th frame and the image features of the other persons in the i-th frame.
  • The spatial feature of the j-th person in the i-th frame is used to represent the relationship between the action of the j-th person in the i-th frame and the actions of the persons other than the j-th person in the i-th frame.
  • The similarity between the image feature vector of the j-th person in the i-th frame and the image feature vectors of the other persons can reflect the degree to which the action of the j-th person depends on the actions of the other persons. That is to say, the higher the similarity of the image feature vectors corresponding to two persons, the closer the association between their actions; conversely, the lower the similarity of the image feature vectors corresponding to two persons, the weaker the association between their actions.
  • The similarities used for the above-mentioned temporal features and spatial features can be calculated using the Minkowski distance (such as the Euclidean distance or the Manhattan distance), cosine similarity, the Chebyshev distance, the Hamming distance, and so on.
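  • As a hedged example of the similarity measures mentioned above, the following sketch computes cosine similarity and the Minkowski distance between two image feature vectors with NumPy; the feature dimensionality and the random vectors are arbitrary.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Higher value -> the two actions are more strongly associated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def minkowski_distance(a: np.ndarray, b: np.ndarray, p: int = 2) -> float:
    # p=2 gives the Euclidean distance, p=1 the Manhattan distance.
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Example: similarity between one person's features in two different frames
# (temporal) or between two persons' features in the same frame (spatial).
f1 = np.random.rand(256)
f2 = np.random.rand(256)
print(cosine_similarity(f1, f2), minkowski_distance(f1, f2))
```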
  • the spatial correlation between actions of different characters and the temporal correlation between actions of the same character can provide important clues to the categories of multi-person scenes in the image. Therefore, in the image recognition process of the present application, by comprehensively considering the spatial association relationship between different person actions and the time association relationship between the same person actions, the accuracy of recognition can be effectively improved.
  • The temporal feature, spatial feature, and image feature corresponding to a person in a frame of image can be fused to obtain that person's action feature in the frame of image.
  • a combined fusion method can be used for fusion.
  • the feature corresponding to a person in a frame of image is merged to obtain the action feature of the person in the frame of image.
  • the features to be fused may be added directly or weighted.
  • cascade and channel fusion can be used for fusion.
  • the dimensions of the features to be fused may be directly spliced, or spliced after being multiplied by a certain coefficient, that is, a weight value.
  • a pooling layer may be used to process the above-mentioned multiple features, so as to realize the fusion of the above-mentioned multiple features.
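  • The fusion options listed above might look as follows in NumPy; the weights and dimensions are illustrative assumptions rather than values from the application.

```python
import numpy as np

def fuse_add(features, weights=None):
    # Combined fusion: add the features directly or as a weighted sum.
    feats = np.stack(features)                       # shape (n, d)
    if weights is None:
        return feats.sum(axis=0)
    return (np.asarray(weights)[:, None] * feats).sum(axis=0)

def fuse_concat(features, weights=None):
    # Cascade/channel fusion: splice the feature dimensions, optionally
    # multiplying each feature by a coefficient (weight value) first.
    if weights is not None:
        features = [w * f for w, f in zip(weights, features)]
    return np.concatenate(features)

def fuse_pool(features):
    # Pooling-based fusion over the feature set (here: max pooling).
    return np.stack(features).max(axis=0)

temporal, spatial, image = (np.random.rand(256) for _ in range(3))
action_feature = fuse_add([temporal, spatial, image], weights=[0.3, 0.3, 0.4])
```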
  • the identification of multiple characters in the image to be processed is based on the action characteristics of each of the multiple characters in each frame of the image to be processed.
  • the action characteristics of each of the multiple characters in the image to be processed can be classified in each frame of the image to obtain the actions of each person, and determine the group actions of multiple characters accordingly.
  • The action feature of each of the multiple persons in each frame of the image to be processed can be input into the classification module to obtain a classification result of each person's action feature, that is, the action of each person; then, the action performed by the largest number of persons is taken as the group action of the multiple persons.
  • a certain person can be selected from a plurality of people, and the action characteristics of the person in each frame of the image can be input into the classification module to obtain the classification result of the action characteristics of the person, that is, the action of the person, and then the above The obtained action of the character is used as a group action of multiple characters in the image to be processed.
  • the identification of multiple characters in the image to be processed is based on the action characteristics of each of the multiple characters in each frame of the image to be processed.
  • the action features of multiple people in each frame of image can also be merged to obtain the action feature of the frame of image, and then the action feature of each frame of image is classified to obtain the action of each frame of image, and based on this Determine the group actions of multiple characters in the image to be processed.
  • The action features of the multiple persons in each frame of image can be fused to obtain the action feature of that frame of image, and the action feature of each frame of image can then be input into the classification module to obtain the action classification result of each frame of image; the classification result corresponding to the largest number of frames in the image to be processed among the output categories of the classification module is taken as the group action of the multiple persons in the image to be processed.
  • the action features of multiple people in each frame of image can be fused to obtain the action feature of the frame of image, and then the action feature of each frame of image obtained above can be averaged to obtain the average of each frame of image Action feature, and then input the average action feature of each frame of image to the classification module, and then the classification result corresponding to the average action feature of each frame of image is regarded as the group action of multiple people in the image to be processed.
  • a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
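  • The three frame-level strategies described above (majority vote over per-frame classifications, classifying the averaged action feature, and classifying one selected frame) could be sketched as follows, where `classifier` is a stand-in for the classification module and is assumed to return a hashable class label.

```python
import numpy as np
from collections import Counter

def group_action_by_vote(frame_action_feats, classifier):
    # Classify every frame, then take the category with the most frames.
    results = [classifier(f) for f in frame_action_feats]
    return Counter(results).most_common(1)[0][0]

def group_action_by_average(frame_action_feats, classifier):
    # Average the per-frame action features, then classify once.
    return classifier(np.mean(np.stack(frame_action_feats), axis=0))

def group_action_by_single_frame(frame_action_feats, classifier, index=0):
    # Classify one selected frame and use its result for the whole clip.
    return classifier(frame_action_feats[index])
```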
  • Tag information of the image to be processed is generated according to the group action, and the tag information is used to indicate the group action of the multiple persons in the image to be processed.
  • the foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
  • the key person in the image to be processed is determined according to the group actions.
  • the above method can be used to detect key persons in a video image, for example.
  • The video contains several persons, most of whom are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
  • For example, the player who holds the ball has the greatest impact on everyone present, including players, referees, and spectators, and also contributes the most to the group action. Therefore, the player who holds the ball can be identified as the key person. Identifying the key person helps people watching the video understand what is going on and what is about to happen in the game.
  • In a second aspect, an image recognition method is provided, which includes: extracting image features of an image to be processed; determining the spatial features of multiple persons in each frame of the image to be processed; determining the action features of the multiple persons in each frame of the image to be processed; and recognizing the group action of the multiple persons in the image to be processed based on the action features of the multiple persons in each frame of the image to be processed.
  • the action features of the multiple people in the image to be processed are obtained by fusing the spatial features of the multiple people in the image to be processed and the image features in the image to be processed.
  • the above-mentioned image to be processed may be one frame of image, or may be multiple frames of continuous or non-continuous images.
  • the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image received by the image recognition device from another device, Alternatively, the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
  • the above-mentioned image to be processed may be one frame of image or continuous multiple frames of image in a piece of video, or one or multiple frames of image selected according to preset rules in a piece of video according to a preset.
  • the multiple characters may include only humans, or only animals, or both humans and animals.
  • the person in the image can be identified to determine the bounding box of the person.
  • the image in each bounding box corresponds to a person in the image.
  • Feature extraction is performed on the image in each bounding box to obtain the image feature of each person.
  • the bone node of the person in the bounding box corresponding to each person can be identified first, and then the image features of the person can be extracted according to the bone node of each person, so that the extracted image features more accurately reflect the person The action to improve the accuracy of the extracted image features.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
  • a locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located can be masked to obtain the partially visible image.
  • the spatial correlation between the actions of different characters in the same frame of image is determined by the similarity between the image characteristics of different characters in the same frame of image.
  • The spatial feature of the j-th of the multiple persons in the i-th frame of the above-mentioned multi-frame image to be processed is determined based on the similarity between the image feature of the j-th person in the i-th frame and the image features of the other persons. That is to say, the spatial feature of the j-th person in the i-th frame can be determined according to the similarity between the image feature of the j-th person in the i-th frame and the image features of the other persons.
  • The spatial feature of the j-th person in the i-th frame is used to represent the relationship between the action of the j-th person in the i-th frame and the actions of the persons other than the j-th person in the i-th frame.
  • the similarity between the image feature vector of the j-th person in the i-th frame image and the image feature vector of other people in the i-th frame image except for the j-th person can reflect the similarity of the j-th person in the i-th frame image The degree of dependence on the actions of other characters. That is to say, the higher the similarity of the image feature vectors corresponding to the two characters, the closer the association between the two actions; conversely, the lower the similarity, the weaker the association between the actions of the two characters.
  • the similarity between the aforementioned spatial features can be calculated by Minkowski distance (such as Euclidean distance, Manhattan distance), cosine similarity, Chebyshev distance, Hamming distance, and the like.
  • the spatial feature and image feature corresponding to the person in a frame of image can be fused to obtain the action feature of the person in the frame of image.
  • a combined fusion method can be used for fusion.
  • the feature corresponding to a person in a frame of image is merged to obtain the action feature of the person in the frame of image.
  • the features to be fused may be added directly, or weighted.
  • cascade and channel fusion can be used for fusion.
  • the dimensions of the features to be fused may be directly spliced, or spliced after being multiplied by a certain coefficient, that is, a weight value.
  • a pooling layer may be used to process the above-mentioned multiple features, so as to realize the fusion of the above-mentioned multiple features.
  • When recognizing the group action of the multiple persons in the image to be processed according to the action features of the multiple persons in each frame of the image to be processed, the action feature of each of the multiple persons in each frame of the image can be classified to obtain the action of each person, and the group action of the multiple persons is determined accordingly.
  • The action feature of each of the multiple persons in each frame of the image to be processed can be input into the classification module to obtain a classification result of each person's action feature, that is, the action of each person; then, the action performed by the largest number of persons is taken as the group action of the multiple persons.
  • a certain person can be selected from a plurality of people, and the action characteristics of the person in each frame of the image can be input into the classification module to obtain the classification result of the action characteristics of the person, that is, the action of the person, and then the above The obtained action of the character is used as a group action of multiple characters in the image to be processed.
  • In the second aspect, when recognizing the group action of the multiple persons in the image to be processed according to the action features of the multiple persons in each frame of the image to be processed, the action features of the multiple persons in each frame of image may also be fused to obtain the action feature of that frame of image; the action feature of each frame of image is then classified to obtain the action of each frame of image, and the group action of the multiple persons in the image to be processed is determined accordingly.
  • The action features of the multiple persons in each frame of image can be fused to obtain the action feature of that frame of image, and the action feature of each frame of image can then be input into the classification module to obtain the action classification result of each frame of image; the classification result corresponding to the largest number of frames in the image to be processed among the output categories of the classification module is taken as the group action of the multiple persons in the image to be processed.
  • the action features of multiple people in each frame of image can be fused to obtain the action feature of the frame of image, and then the action feature of each frame of image obtained above can be averaged to obtain the average of each frame of image Action feature, and then input the average action feature of each frame of image to the classification module, and then the classification result corresponding to the average action feature of each frame of image is regarded as the group action of multiple people in the image to be processed.
  • a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
  • Tag information of the image to be processed is generated according to the group action, and the tag information is used to indicate the group action of the multiple persons in the image to be processed.
  • the foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
  • the key person in the image to be processed is determined according to the group actions.
  • the above method can be used to detect key persons in a video image, for example.
  • The video contains several persons, most of whom are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
  • For example, the player who holds the ball has the greatest impact on everyone present, including players, referees, and spectators, and contributes the most to the group action. Therefore, the player who holds the ball can be identified as the key person. Identifying the key person helps people watching the video understand what is going on and what is about to happen in the game.
  • In a third aspect, an image recognition method is provided, which includes: extracting image features of an image to be processed; determining the dependencies between different persons in the image to be processed and the dependencies between the actions of the same person at different times; fusing the image features with the spatio-temporal feature vectors to obtain the action feature of each frame of the image to be processed; and performing classification prediction on the action feature of each frame of image to determine the group action category of the image to be processed.
  • In this way, the complex reasoning process of group action recognition is completed; when determining the group action of multiple persons, not only the temporal features of the multiple persons but also their spatial features are taken into consideration.
  • By fusing the temporal features and spatial features of the multiple persons, the group action of the multiple persons can be determined better and more accurately.
  • Target tracking can be performed on each person to determine the bounding box of each person in each frame of image, where the image in each bounding box corresponds to one person; feature extraction is then performed on the image in each of the above-mentioned bounding boxes to obtain the image feature of each person.
  • the image features can also be extracted by identifying the bone nodes of the person, so as to reduce the influence of the redundant information of the image during the feature extraction process and improve the accuracy of feature extraction.
  • a convolutional network can be used to extract image features based on bone nodes.
  • the bone nodes in the bounding box may be connected according to the structure of the person to obtain a connected image, and then image feature vector extraction is performed on the connected image.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors, and then image feature extraction is performed on the processed image.
  • the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
  • The person's action masking matrix can be calculated according to the person's image and bone nodes.
  • Each point in the masking matrix corresponds to a pixel.
  • the value in the square area with the bone point as the center and side length l is set to 1, and the values in other positions are set to 0.
  • the RGB color mode can be used for masking.
  • The RGB color model assigns an intensity value in the range of 0 to 255 to each of the R, G, and B components of every pixel in the image.
  • the masking matrix is used to mask the original character action pictures to obtain a partially visible image.
  • In this way, the square area of side length l around each skeleton node is retained, and the other areas are masked.
  • Using locally visible images for image feature extraction can reduce the redundant information in the bounding box, allows image features to be extracted based on the structural information of the person, and enhances the representation of the person's actions in the image features.
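  • A minimal sketch of the masking step, assuming the skeleton-node pixel coordinates and the side length l are given: a square of side l around each node keeps its pixels, everything else is set to zero.

```python
import numpy as np

def skeleton_mask(height, width, nodes, l):
    """Build a masking matrix: 1 inside the l x l square centered on each
    skeleton node (given as (row, col) pixel coordinates), 0 elsewhere."""
    mask = np.zeros((height, width), dtype=np.float32)
    half = l // 2
    for y, x in nodes:
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        x0, x1 = max(0, x - half), min(width, x + half + 1)
        mask[y0:y1, x0:x1] = 1.0
    return mask

def partially_visible_image(crop_rgb, nodes, l):
    # crop_rgb: HxWx3 image inside the person's bounding box.
    mask = skeleton_mask(crop_rgb.shape[0], crop_rgb.shape[1], nodes, l)
    return crop_rgb * mask[:, :, None]   # masked regions become 0
```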
  • the cross interaction module is used to determine the temporal correlation of the body posture of the characters in the multi-frame images, and/or Determine the spatial correlation of the body postures of the characters in the multi-frame images.
  • the above-mentioned cross interaction module is used to realize the interaction of features to establish a feature interaction model
  • the feature interaction model is used to represent the association relationship of the body posture of the character in time and/or space.
  • the spatial dependence between the body postures of different characters in the same frame of image can be determined.
  • the spatial dependence is used to indicate the dependence of the body posture of a character on the body posture of other characters in a certain frame of image, that is, the spatial dependence between the actions of the characters.
  • the spatial dependency can be expressed by the spatial feature vector.
  • the time dependence between the body postures of the same person at different times can be determined.
  • the time dependence may also be referred to as timing dependence, which is used to indicate the dependence of the body posture of the character in a certain frame of image on the body posture of the character in other video frames, that is, the inherent temporal dependence of an action.
  • the time dependence can be expressed by the time series feature vector.
  • the spatio-temporal feature vector of the k-th person can be calculated according to the spatial feature vector and the time-series feature vector of the k-th person in the image to be processed.
  • The image feature of the k-th person at time t is fused with the spatio-temporal feature vector to obtain the person feature vector of the k-th person at time t; alternatively, the image feature and the spatio-temporal feature vector are connected residually to obtain the person feature vector.
  • According to the person feature vector of each of the K persons, the set of person feature vectors of the K persons at time t is determined, and maximum pooling is performed on this set of person feature vectors to obtain the action feature vector.
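  • Assuming the per-person image features and spatio-temporal feature vectors at time t are already available as (K, d) tensors, this step might be sketched as a residual addition followed by maximum pooling over the K persons; the shapes are illustrative.

```python
import torch

def frame_action_feature(image_feats_t, st_feats_t):
    """image_feats_t, st_feats_t: tensors of shape (K, d) for the K persons
    at time t. Returns the action feature vector of shape (d,)."""
    # Residual connection between image features and spatio-temporal features.
    person_feats = image_feats_t + st_feats_t           # (K, d)
    # Maximum pooling over the set of person feature vectors.
    action_feat, _ = person_feats.max(dim=0)             # (d,)
    return action_feat
```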
  • the classification result of the group action can be obtained in different ways.
  • the action feature vector at time t is input to the classification module to obtain the classification result of the frame image.
  • the classification result of the image feature vector at any time t by the classification module may be used as the classification result of the group action in the T frame image.
  • the classification result of the group action in the T frame image can also be understood as the classification result of the group action of the person in the T frame image, or the classification result of the T frame image.
  • the action feature vectors of the T frame images are respectively input to the classification module to obtain the classification result of each frame of image.
  • the classification result of the T frame image can belong to one or more categories.
  • The category that corresponds to the largest number of frames among the T frames of images in the output categories of the classification module can be used as the classification result of the group action in the T frames of images.
  • Each element of the average feature vector is the average value of the corresponding element in the feature vector representations of the T frames of images.
  • the average feature vector can be input to the classification module to obtain the classification result of the group action in the T frame image.
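  • One way to realize this averaging and classification, with illustrative sizes and a linear layer standing in for the classification module:

```python
import torch
import torch.nn as nn

T, d, num_classes = 10, 512, 8          # illustrative sizes
frame_action_feats = torch.randn(T, d)  # action feature vector of each frame
classifier = nn.Linear(d, num_classes)  # stand-in classification module

# Each element of the average feature vector is the mean of the corresponding
# element over the T frames.
avg_feat = frame_action_feats.mean(dim=0)
group_action_logits = classifier(avg_feat)
predicted_group_action = group_action_logits.argmax().item()
```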
  • The above method can complete the complex reasoning process of group action recognition: the image features of multiple frames of images are determined; the temporal and spatial features are determined according to the interdependence between different persons in the images and between the actions of the same person at different times; these features are fused with the above-mentioned image features to obtain the action feature of each frame of image; and the group action of the multiple frames of images is then inferred by classifying the action feature of each frame of image.
  • In a fourth aspect, an image recognition device is provided, which has the function of implementing the method in any one of the first to third aspects or any of their possible implementation manners.
  • the image recognition device includes a unit that implements the method in any one of the first aspect to the third aspect.
  • In a fifth aspect, a neural network training device is provided, which has a unit for implementing the method in any one of the first aspect to the third aspect.
  • In a sixth aspect, an image recognition device is provided, which includes: a memory for storing a program; and a processor for executing the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to execute the method in any one of the foregoing first aspect to the third aspect.
  • In a seventh aspect, a neural network training device is provided, which includes: a memory for storing a program; and a processor for executing the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to execute the method in any one of the foregoing first aspect to the third aspect.
  • In an eighth aspect, an electronic device is provided, which includes the image recognition device in the fourth aspect or the sixth aspect.
  • the electronic device in the above eighth aspect may specifically be a mobile terminal (for example, a smart phone), a tablet computer, a notebook computer, an augmented reality/virtual reality device, a vehicle-mounted terminal device, and so on.
  • In a ninth aspect, a computer device is provided, which includes the neural network training device in the fifth aspect or the seventh aspect.
  • the computer device may specifically be a computer, a server, a cloud device, or a device with a certain computing capability that can implement neural network training.
  • In a tenth aspect, the present application provides a computer-readable storage medium.
  • The computer-readable storage medium stores computer instructions.
  • When the computer instructions are run on a computer, the computer is caused to execute the method in any one of the first aspect to the third aspect.
  • In an eleventh aspect, the present application provides a computer program product.
  • The computer program product includes computer program code.
  • When the computer program code is run on a computer, the computer is caused to execute the method in any one of the first aspect to the third aspect.
  • In a twelfth aspect, a chip is provided, which includes a processor and a data interface.
  • The processor reads instructions stored in a memory through the data interface to execute the method in any one of the implementation manners of the first aspect to the third aspect.
  • the chip may further include a memory in which instructions are stored, and the processor is configured to execute instructions stored on the memory.
  • the processor is configured to execute the method in any one of the implementation manners of the first aspect to the third aspect.
  • the above-mentioned chip may specifically be a field programmable gate array FPGA or an application-specific integrated circuit ASIC.
  • FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of an application environment provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for group action recognition provided by an embodiment of the present application
  • FIG. 4 is a schematic flowchart of a method for group action recognition provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a convolutional neural network structure provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a method for training a neural network model provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • FIG. 11 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a process of obtaining a partially visible image provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a method for calculating similarity between image features according to an embodiment of the present application.
  • FIG. 14 is a schematic diagram of the spatial relationship between different character actions provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of the spatial relationship of different character actions provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of the relationship in time of a character's actions according to an embodiment of the present application.
  • FIG. 17 is a schematic diagram of the relationship in time of a character's actions according to an embodiment of the present application.
  • FIG. 18 is a schematic diagram of a system architecture of an image recognition network provided by an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application.
  • FIG. 21 is a schematic structural diagram of a neural network training device provided by an embodiment of the present application.
  • the solution of the present application can be applied to the fields of video analysis, video recognition, abnormal or dangerous behavior detection, etc., which require video analysis of complex scenes of multiple people.
  • the video may be, for example, a sports game video, a daily surveillance video, and the like. Two commonly used application scenarios are briefly introduced below.
  • The trained neural network structure can be used to determine the label corresponding to a short video, that is, to classify short videos to obtain the group action category corresponding to each short video and to tag different short videos with different tags. This is convenient for users to view and find videos, saves manual classification and management time, and improves management efficiency and user experience.
  • the video includes several people, most of whom are not important. Detecting key figures effectively helps to quickly understand the content of the scene. As shown in Figure 2, the group action recognition system provided by the present application can identify key persons in the video, so as to understand the video content more accurately based on the information around the key persons.
  • a neural network can be composed of neural units.
  • A neural unit can refer to an arithmetic unit that takes inputs x_s and an intercept b as inputs.
  • The output of the arithmetic unit can be h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right), where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
  • f() is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
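  • A tiny numerical illustration of such a neural unit with a sigmoid activation; the input size, weights, and bias values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b):
    # Output of the unit: f(sum_s W_s * x_s + b), with f = sigmoid.
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x_s
W = np.array([0.2, 0.4, -0.1])   # weights W_s
b = 0.3                          # bias of the neural unit
print(neural_unit(x, W, b))
```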
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be a region composed of several neural units.
  • Deep neural network also known as multi-layer neural network
  • A DNN can be understood as a neural network with many hidden layers; there is no special metric for "many" here. According to the positions of the different layers, the layers of a DNN can be divided into three categories: the input layer, hidden layers, and the output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. For example, in a fully connected neural network, the layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer.
  • Although a DNN looks complicated, the work of each layer is not complicated.
  • The coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W_{jk}^{L}. It should be noted that the input layer has no W parameter.
  • more hidden layers make the network more capable of portraying complex situations in the real world.
  • a model with more parameters is more complex and has a greater "capacity", which means it can complete more complex learning tasks.
  • Training the deep neural network is also the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
  • Convolutional neural network is a deep neural network with convolutional structure.
  • the convolutional neural network includes a feature extractor composed of a convolutional layer and a sub-sampling layer.
  • the feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolution feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can only be connected to a part of the neighboring neurons.
  • a convolutional layer usually includes several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units in the same feature plane share weights, and the shared weights here are the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistical information of one part of an image is the same as that of other parts, which means that image information learned in one part can also be used in another part; therefore, the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels there are, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
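  • The weight-sharing idea can be seen in a standard convolutional layer, where the same kernel weights are applied at every spatial position; the layer sizes below are arbitrary.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 224, 224)      # one RGB image
y = conv(x)                          # output shape (1, 16, 224, 224)

# The 16 kernels (the shared weights) are reused at every image location,
# so the number of parameters does not depend on the image size.
print(conv.weight.shape)             # torch.Size([16, 3, 3, 3])
```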
  • Recurrent neural network is used to process sequence data.
  • In an ordinary neural network, the layers are fully connected, while the nodes within each layer are unconnected.
  • Although this ordinary neural network has solved many problems, it is still powerless for many others. For example, to predict the next word of a sentence, the previous words are generally needed, because the preceding and following words in a sentence are not independent. The reason why an RNN is called a recurrent neural network is that the current output of a sequence is also related to the previous outputs.
  • The specific form of expression is that the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes within the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
  • RNN can process sequence data of any length.
  • the training of RNN is the same as the training of traditional CNN or DNN.
  • the error back-propagation algorithm is also used, but there is a difference: that is, if the RNN is network expanded, then the parameters, such as W, are shared; while the traditional neural network mentioned above is not the case.
  • the output of each step depends not only on the current step of the network, but also on the state of the previous steps of the network. This learning algorithm is called backpropagation through time (BPTT).
  • BPTT backpropagation through time
  • Taking the loss function as an example: the higher the output value (loss) of the loss function, the greater the difference; the training of the deep neural network then becomes a process of reducing this loss as much as possible.
  • Residual network
  • the residual network includes a convolutional layer and/or a pooling layer.
  • The residual network can be understood as follows: in a deep neural network, in addition to the connections between successive hidden layers (for example, the first hidden layer is connected to the second hidden layer, the second hidden layer is connected to the third hidden layer, and the third hidden layer is connected to the fourth hidden layer; this is a data operation path of the neural network, which can also be called neural network transmission), the residual network has an additional directly connected branch.
  • This directly connected branch connects the 1st hidden layer directly to the 4th hidden layer, that is, it skips the processing of the 2nd and 3rd hidden layers and transmits the data of the 1st hidden layer directly to the 4th hidden layer for calculation.
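  • A minimal PyTorch sketch of such a directly connected (skip) branch; the block sizes are illustrative and this is not the specific residual network used in the application.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Two hidden layers on the ordinary computation path.
        self.layers = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # Directly connected branch: the input skips the hidden layers
        # and is added to their output.
        return torch.relu(self.layers(x) + x)

block = ResidualBlock(64)
out = block(torch.randn(8, 64))
```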
  • The highway network can be understood as follows: in addition to the above-mentioned calculation path and directly connected branch, the deep neural network also includes a weight-acquisition branch. This branch introduces a transform gate to acquire a weight value, and outputs the weight value T for the subsequent operations of the above-mentioned calculation path and the directly connected branch.
  • a transmission gate transform gate
  • the convolutional neural network can use the error back propagation algorithm to modify the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forwarding the input signal until the output will cause error loss, and the parameters in the initial neural network model are updated by backpropagating the error loss information, so that the error loss is converged.
  • the backpropagation algorithm is a backpropagation motion dominated by error loss, and aims to obtain the optimal parameters of the neural network model, such as the weight matrix.
  • the pixel value of the image can be a Red-Green-Blue (RGB) color value, and the pixel value can be a long integer representing the color.
  • the pixel value is 255 ⁇ Red+100 ⁇ Green+76 ⁇ Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness.
  • the pixel values can be grayscale values.
  • Group action recognition (GAR) can also be called group activity recognition, and is used to identify what a group of people is doing in a video. It is an important subject in computer vision. GAR has many potential applications, including video surveillance and sports video analysis. Compared with traditional single-person action recognition, GAR not only needs to recognize the behavior of each person, but also needs to infer the potential relationships between persons.
  • Group action recognition can use the following methods:
  • a group action is composed of different actions of several characters in the group, which is equivalent to actions completed by several characters in cooperation, and these character actions reflect different postures of the body.
  • the traditional method uses a step-by-step method to process the complex information of such an entity, and cannot make full use of its potential time and space dependence. Not only that, these methods are also very likely to destroy the co-occurrence relationship between the space domain and the time domain.
  • Existing methods often train the CNN network directly under the condition of extracting timing-dependent features. Therefore, the features extracted by the feature extraction network ignore the spatial dependence between people in the image.
  • the bounding box contains more redundant information, which may lower the accuracy of the extracted character's action features.
  • Fig. 3 is a schematic flow chart of a method for group action recognition.
  • For details, please refer to "A Hierarchical Deep Temporal Model for Group Activity Recognition" (Ibrahim M S, Muralidharan S, Deng Z, et al. IEEE Conference on Computer Vision and Pattern Recognition, 2016: 1971-1980).
  • the existing algorithm is used to target several people in multiple video frames, and the size and position of each person in each video frame are determined.
  • the person CNN is used to extract the convolutional features of each person in each video frame, and the convolutional features are input into the person's long short-term memory network (LSTM) to extract the time series features of each person.
  • the convolution feature and time sequence feature corresponding to each person are spliced together as the person's action feature of the person.
  • the character action characteristics of multiple characters in the video are spliced and max pooled to obtain the action characteristics of each video frame.
  • the action characteristics of each video frame are input into the group LSTM to obtain the corresponding characteristics of the video frame.
  • the feature corresponding to the video frame is input into the group action classifier to classify the input video, that is, the category to which the group action in the video belongs is determined.
  • the HDTM model includes a character CNN, a character LSTM, a group LSTM, and a group action classifier.
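  • A schematic PyTorch sketch of an HDTM-style pipeline as summarized above (person CNN, person LSTM, feature concatenation, max pooling over persons, group LSTM, group action classifier); the backbone, layer sizes, and pooling details are assumptions for illustration, not the exact model of the cited paper.

```python
import torch
import torch.nn as nn

class HDTMSketch(nn.Module):
    def __init__(self, feat_dim=128, hidden=128, num_classes=8):
        super().__init__()
        self.person_cnn = nn.Sequential(           # stand-in person CNN
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim),
        )
        self.person_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.group_lstm = nn.LSTM(feat_dim + hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, crops):                      # crops: (T, K, 3, H, W)
        T, K = crops.shape[:2]
        conv = self.person_cnn(crops.flatten(0, 1)).view(T, K, -1)   # (T,K,F)
        # Person LSTM runs over time for each person: reshape to (K, T, F).
        temporal, _ = self.person_lstm(conv.permute(1, 0, 2))         # (K,T,H)
        person_feat = torch.cat([conv, temporal.permute(1, 0, 2)], dim=-1)
        frame_feat, _ = person_feat.max(dim=1)      # max pool over K persons
        group_out, _ = self.group_lstm(frame_feat.unsqueeze(0))       # (1,T,H)
        return self.classifier(group_out[:, -1])    # classify last time step

logits = HDTMSketch()(torch.randn(10, 6, 3, 64, 64))
```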
  • the existing algorithm is used to target several people in multiple video frames, and the size and position of each person in each video frame are determined.
  • Each character corresponds to a character action tag.
  • Each input video corresponds to a group action tag.
  • the first step is to train the character CNN, character LSTM, and character action classifier according to the character action label corresponding to each character, so as to obtain the trained character CNN and the trained character LSTM.
  • the second step of training is to train the parameters of the group LSTM and the group action classifier according to the group action tags, so as to obtain the trained group LSTM and the trained group action classifier.
  • the person CNN and the person LSTM are obtained, and the convolutional features and timing features of each person in the input video are extracted.
  • the second step of training is performed according to the feature representation of each video frame obtained by splicing the convolution features and time sequence features of the extracted multiple people.
  • the obtained neural network model can perform group action recognition on the input video.
  • the determination of the character's action feature representation of each character is carried out by the neural network model trained in the first step.
  • the fusion of the character action feature representations of multiple characters to identify group actions is performed by the neural network model trained in the second step.
  • Fig. 4 is a schematic flowchart of a method for group action recognition.
  • For details, please refer to "Social scene understanding: End-to-end multi-person action localization and collective activity recognition" (Bagautdinov, Timur, et al. IEEE Conference on Computer Vision and Pattern Recognition, 2017: 4315-4324).
  • a step of training is required to obtain a neural network model that can recognize videos that include this specific type of group action. That is to say, the training image is input into the FCN, and the parameters of the FCN and RNN are adjusted according to the character action tag and group action tag of each character in the training image to obtain the trained FCN and RNN.
  • The FCN can generate a multi-scale feature map F_t of the t-th frame of image.
  • Several detection frames B_t and corresponding probabilities p_t are generated through a deep fully convolutional network (DFCN), and B_t and p_t are sent to a Markov random field (MRF) to obtain trusted detection frames b_t; the trusted detection frames b_t are then used to determine the corresponding features f_t from the multi-scale feature map F_t.
  • FCN can also be obtained through pre-training.
  • a group action is composed of different actions of several characters, and these character actions are reflected in the different body postures of each character.
  • the temporal characteristics of a character can reflect the time dependence of a character's actions.
  • the spatial dependence between character actions also provides important clues for group action recognition.
  • the accuracy of the group action recognition scheme that does not consider the spatial dependence between characters is affected to a certain extent.
  • an embodiment of the present application provides an image recognition method.
  • When determining the group actions of multiple people, this application considers not only the temporal characteristics of the multiple people but also their spatial characteristics.
  • By integrating the temporal characteristics and spatial characteristics of the multiple people, the group actions of the multiple people can be determined better and more accurately.
  • Fig. 5 is a schematic diagram of a system architecture according to an embodiment of the present application.
  • the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data collection system 560.
  • the execution device 510 includes a calculation module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514.
  • the calculation module 511 may include the target model/rule 501, and the preprocessing module 513 and the preprocessing module 514 are optional.
  • the data collection device 560 is used to collect training data.
  • The training data may include multiple frames of training images (the multiple frames of training images include multiple people) and corresponding labels, where a label gives the group action category of the people.
  • the data collection device 560 stores the training data in the database 530, and the training device 520 obtains the target model/rule 501 based on the training data maintained in the database 530.
  • The training device 520 recognizes the input multi-frame training images and compares the output predicted category with the label, until the difference between the predicted category output by the training device 520 and the label is less than a certain threshold, thereby completing the training of the target model/rule 501.
  • The above-mentioned target model/rule 501 can be used to implement the image recognition method of the embodiment of the present application, that is, one or more frames of images to be processed are input (after relevant preprocessing) into the target model/rule 501 to obtain the group action category of the people in the one or more frames of the image to be processed.
  • the target model/rule 501 in the embodiment of the present application may specifically be a neural network.
  • the training data maintained in the database 530 may not all come from the collection of the data collection device 560, and may also be received from other devices.
  • the training device 520 does not necessarily perform the training of the target model/rule 501 completely based on the training data maintained by the database 530. It may also obtain training data from the cloud or other places for model training.
  • The above description should not be construed as a limitation on the embodiments of this application.
  • The target model/rule 501 trained by the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG. 5. The execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or an in-vehicle terminal, and may also be a server or a cloud.
  • the execution device 510 is configured with an input/output (input/output, I/O) interface 512 for data interaction with external devices.
  • the user can input data to the I/O interface 512 through the client device 540.
  • the input data in this embodiment of the present application may include: a to-be-processed image input by the client device.
  • the client device 540 here may specifically be a terminal device.
  • the preprocessing module 513 and the preprocessing module 514 are used to perform preprocessing according to the input data (such as the image to be processed) received by the I/O interface 512.
  • The preprocessing module 513 and the preprocessing module 514 may be omitted, or there may be only one preprocessing module.
  • the calculation module 511 can be directly used to process the input data.
  • the execution device 510 may call data, codes, etc. in the data storage system 550 for corresponding processing.
  • the data, instructions, etc. obtained by corresponding processing may also be stored in the data storage system 550.
  • the I/O interface 512 presents the processing result, such as the group action category calculated by the target model/rule 501, to the client device 540, so as to provide it to the user.
  • The group action category obtained by the target model/rule 501 in the calculation module 511 can be processed by the preprocessing module 513 (or the preprocessing module 514), and the processing result is then sent to the I/O interface, which sends it to the client device 540 for display.
  • The calculation module 511 may also transmit the group action category obtained by the processing directly to the I/O interface, and the I/O interface then sends the processing result to the client device 540 for display.
  • the training device 520 can generate corresponding target models/rules 501 based on different training data for different goals or tasks, and the corresponding target models/rules 501 can be used to achieve the above goals or complete The above tasks provide users with the desired results.
  • the user can manually set input data, and the manual setting can be operated through the interface provided by the I/O interface 512.
  • the client device 540 can automatically send input data to the I/O interface 512. If the client device 540 is required to automatically send the input data and the user's authorization is required, the user can set the corresponding authority in the client device 540. The user can view the result output by the execution device 510 on the client device 540, and the specific presentation form may be a specific manner such as display, sound, and action.
  • The client device 540 can also be used as a data collection terminal to collect the input data of the I/O interface 512 and the output result of the I/O interface 512 as new sample data, as shown in the figure, and store them in the database 530.
  • Alternatively, the I/O interface 512 directly stores the input data of the I/O interface 512 and the output result of the I/O interface 512 in the database 530 as new sample data.
  • FIG. 5 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, and modules shown in the figure does not constitute any limitation.
  • In FIG. 5, the data storage system 550 is an external memory relative to the execution device 510; in other cases, the data storage system 550 may also be placed in the execution device 510.
  • the target model/rule 501 obtained by training according to the training device 520 may be the neural network in the embodiment of the present application.
  • The neural network provided in the embodiment of the present application may be a CNN, a deep convolutional neural network (DCNN), or the like.
  • CNN is a very common neural network
  • the structure of CNN will be introduced below in conjunction with Figure 6.
  • a convolutional neural network is a deep neural network with a convolutional structure. It is a deep learning architecture.
  • A deep learning architecture refers to performing multiple levels of learning at different abstraction levels through machine learning algorithms.
  • CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network can respond to the input image.
  • FIG. 6 is a schematic diagram of a convolutional neural network structure provided by an embodiment of the present application.
  • the convolutional neural network 600 may include an input layer 610, a convolutional layer/pooling layer 620 (the pooling layer is optional), and a fully connected layer 630.
  • the pooling layer is optional
  • The relevant content of these layers is introduced in detail below.
  • The convolutional layer/pooling layer 620 may include layers 621-626 as shown in the example.
  • In one example, layer 621 is a convolutional layer, layer 622 is a pooling layer, layer 623 is a convolutional layer, layer 624 is a pooling layer, layer 625 is a convolutional layer, and layer 626 is a pooling layer.
  • In another example, layers 621 and 622 are convolutional layers, layer 623 is a pooling layer, layers 624 and 625 are convolutional layers, and layer 626 is a pooling layer.
  • That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 621 can include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator can essentially be a weight matrix, which is usually predefined. In the process of performing convolution on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to process the image, thereby extracting a specific feature from the image.
  • The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends to the entire depth of the input image.
  • Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension; however, in most cases a single weight matrix is not used, and multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same type, are applied instead.
  • the output of each weight matrix is stacked to form the depth dimension of the convolutional image, where the dimension can be understood as determined by the "multiple" mentioned above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to eliminate unwanted noise in the image.
  • Since the multiple weight matrices have the same size (rows × columns), the convolution feature maps extracted by the multiple weight matrices of the same size also have the same size, and the extracted convolution feature maps of the same size are then combined to form the output of the convolution operation (see the sketch below).
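  • The following is a minimal sketch of the idea described above, using PyTorch only as an illustration (the layer sizes and tensor shapes are arbitrary and are not the specific network of the embodiment): eight 3×3 kernels, each spanning the full input depth, each produce one feature map, and the eight feature maps are stacked to form the depth dimension of the output.

```python
import torch
import torch.nn as nn

# 3-channel input image, 8 weight matrices (kernels) of size 3x3.
# Each kernel spans the full input depth (3) and produces one feature map;
# the 8 feature maps are stacked along the depth dimension of the output.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)

image = torch.randn(1, 3, 224, 224)   # a batch containing one RGB image
features = conv(image)

print(features.shape)                 # torch.Size([1, 8, 224, 224])
```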
  • In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training.
  • Each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 600 can make correct predictions.
  • The initial convolutional layer (such as 621) often extracts more general features, which can also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by the subsequent convolutional layers (for example, 626) become more and more complex, such as features with high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
  • A convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image with a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of the average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the image size.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image of the input pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
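  • The following is a minimal sketch of the average and maximum pooling operators described above, again using PyTorch only as an illustration (the 2×2 window and tensor shapes are arbitrary assumptions of this sketch).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 224, 224)       # feature maps output by a convolutional layer

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)   # each output pixel is the maximum of a 2x2 sub-region
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)   # each output pixel is the average of a 2x2 sub-region

print(max_pool(x).shape)              # torch.Size([1, 8, 112, 112]) -- spatial size halved
print(avg_pool(x).shape)              # torch.Size([1, 8, 112, 112])
```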
  • After processing by the convolutional layer/pooling layer 620, the convolutional neural network 600 is still not able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 620 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 600 needs to use the fully connected layer 630 to generate one output or a group of outputs whose number equals the number of required classes. Therefore, the fully connected layer 630 can include multiple hidden layers (631, 632 to 63n as shown in FIG. 6) and an output layer 640. The parameters included in the multiple hidden layers can be obtained by pre-training based on relevant training data of a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the fully connected layer 630, the final layer of the entire convolutional neural network 600 is the output layer 640.
  • the output layer 640 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error.
  • the convolutional neural network 600 shown in FIG. 6 is only used as an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models.
  • The convolutional neural network 600 shown in FIG. 6 can be used to execute the image recognition method of the embodiment of the present application. As shown in FIG. 6, after the image to be processed is processed by the input layer 610, the convolutional layer/pooling layer 620, and the fully connected layer 630, the group action category can be obtained.
  • FIG. 7 is a schematic diagram of a chip hardware structure provided by an embodiment of the application.
  • the chip includes a neural network processor 700.
  • the chip can be set in the execution device 510 as shown in FIG. 5 to complete the calculation work of the calculation module 511.
  • the chip can also be set in the training device 520 as shown in FIG. 5 to complete the training work of the training device 520 and output the target model/rule 501.
  • the algorithms of each layer in the convolutional neural network as shown in FIG. 6 can be implemented in the chip as shown in FIG. 7.
  • The neural network processor (neural-network processing unit, NPU) 700 is mounted on a host central processing unit (CPU) as a coprocessor, and the host CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 703.
  • the controller 704 controls the arithmetic circuit 703 to extract data from the memory (weight memory or input memory) and perform calculations.
  • the arithmetic circuit 703 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 703 is a two-dimensional systolic array. The arithmetic circuit 703 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 703 is a general-purpose matrix processor.
  • the arithmetic circuit 703 fetches the data corresponding to matrix B from the weight memory 702 and buffers it on each PE in the arithmetic circuit 703.
  • The arithmetic circuit 703 takes the data of matrix A from the input memory 701 and performs a matrix operation with matrix B; the partial result or final result of the obtained matrix is stored in the accumulator 708.
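  • The following is a purely conceptual sketch of the matrix operation described above: matrix B acts as the buffered weights, matrix A is streamed in, and partial products are collected in an accumulator. It illustrates only the multiply-accumulate idea, not the actual hardware behavior of the arithmetic circuit 703.

```python
import numpy as np

def matmul_with_accumulator(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Conceptual sketch of C = A @ B built from multiply-accumulate steps,
    with partial results collected in an accumulator."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    accumulator = np.zeros((M, N), dtype=A.dtype)
    for k in range(K):                      # one multiply-accumulate step per inner index
        accumulator += np.outer(A[:, k], B[k, :])
    return accumulator

A = np.random.rand(4, 6).astype(np.float32)
B = np.random.rand(6, 3).astype(np.float32)
assert np.allclose(matmul_with_accumulator(A, B), A @ B, atol=1e-5)
```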
  • the vector calculation unit 707 can perform further processing on the output of the arithmetic circuit 703, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 707 can be used for network calculations in the non-convolutional/non-FC layer of the neural network, such as pooling, batch normalization, local response normalization, etc. .
  • the vector calculation unit 707 can store the processed output vector to the unified buffer 706.
  • the vector calculation unit 707 may apply a nonlinear function to the output of the arithmetic circuit 703, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 707 generates normalized values, combined values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 703, for example for use in a subsequent layer in a neural network.
  • the unified memory 706 is used to store input data and output data.
  • A direct memory access controller (DMAC) 705 is used to transfer the input data in the external memory to the input memory 701 and/or the unified memory 706, to store the weight data in the external memory into the weight memory 702, and to store the data in the unified memory 706 into the external memory.
  • the bus interface unit (BIU) 710 is used to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 709 through the bus.
  • An instruction fetch buffer 709 connected to the controller 704 is used to store instructions used by the controller 704;
  • the controller 704 is used to call the instructions cached in the memory 709 to control the working process of the computing accelerator.
  • the unified memory 706, the input memory 701, the weight memory 702, and the instruction fetch memory 709 are all on-chip memories.
  • the external memory is a memory external to the NPU.
  • The external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • each layer in the convolutional neural network shown in FIG. 6 may be executed by the arithmetic circuit 703 or the vector calculation unit 707.
  • FIG. 8 is a schematic flowchart of a method for training a neural network model provided by an embodiment of the present application.
  • Obtain training data, where the training data includes T1 frames of training images and labeled categories.
  • the T1 frame training image corresponds to a label category.
  • T1 is a positive integer greater than 1.
  • The T1 frames of training images can be consecutive frames in a video, or multiple frames selected from a video according to a preset rule.
  • For example, the T1 frames of training images may be frames selected from a video at a preset time interval, or frames selected from a video at a preset frame-number interval.
  • the training image of the T1 frame may include multiple characters, and the multiple characters may include only humans, animals, or both humans and animals.
  • the above-mentioned label category is used to indicate the category of the group action of the person in the training image of the T1 frame.
  • S802a Extract image features of the training image of frame T1.
  • At least one frame of images is selected from the T1 frame of training images, and image features of multiple people in each frame of the at least one frame of images are extracted.
  • the image feature of a certain person can be used to represent the body posture of the person in the frame of training image, that is, the relative position between different limbs of the person.
  • the above image features can be represented by vectors.
  • S802b Determine the spatial characteristics of multiple people in each frame of training image in at least one frame of training image.
  • The spatial feature of the j-th person in the i-th frame of training image of the at least one frame of training images is determined based on the similarity between the image feature of the j-th person in the i-th frame of training image and the image features of the other people except the j-th person in the i-th frame of image, where i and j are positive integers.
  • the spatial feature of the j-th person in the i-th training image is used to represent the actions of the j-th person in the i-th training image and the actions of other people except the j-th person in the i-th training image The relationship.
  • The similarity between the image features of different people in the same frame of image can reflect the spatial dependence of the actions of the different people. That is, when the similarity of the image features corresponding to two people is higher, the correlation between the actions of the two people is closer; conversely, when the similarity of the image features corresponding to the two people is lower, the association between the actions of the two people is weaker.
  • S802c Determine the timing characteristics of each of the multiple characters in the at least one frame of training images in different frames of images.
  • The time sequence feature of the j-th person in the i-th frame of training image of the at least one frame of training images is determined based on the similarity between the image feature of the j-th person in the i-th frame of training image and the image features of the j-th person in the frames of training image other than the i-th frame, where i and j are positive integers.
  • the time series feature of the j-th person in the i-th frame of training image is used to represent the relationship between the action of the j-th person in the i-th frame of training image and the action of the j-th person in the at least one frame of training image in other frames of the training image.
  • the similarity between corresponding image features of a person in two frames of images can reflect the degree of dependence of the person's actions on time.
  • S802d Determine the action features of multiple characters in each frame of training image in at least one frame of training image.
  • the action feature of the j-th person in the training image of the i-th frame is the spatial feature of the j-th person in the training image of the i-th frame, the time series feature of the j-th person in the training image of the i-th frame, and the The image features of the j-th person in the training image of the i-th frame are fused.
  • the action features of each of the multiple characters in each frame of the training image in the at least one frame of training image may be fused to obtain the feature representation of each frame of the training image in the at least one frame of training image.
  • The average of each bit of the feature representations of the frames of training image in the T1 frames of training images can be calculated to obtain an average feature representation.
  • Each bit of the average feature representation is the average value of the corresponding bit of the feature representation of each frame of training image in the T1 frames of training images.
  • the classification can be performed based on the average feature representation, that is, the group actions of multiple characters in the training image of the T1 frame are recognized to obtain the training category.
  • the training category of each frame of training image in the at least one frame of training image may be determined.
  • the at least one frame of training images may be all or part of the training images in the T1 frame of training images.
  • S803 Determine the loss value of the neural network according to the training category and the label category.
  • The loss value L of the neural network can be expressed as:

    $L = -\sum_{t=1}^{T_1} \sum_{n=1}^{N_Y} \bar{y}_n \log p_{t,n}$

  • where N_Y represents the number of group action categories, that is, the number of categories output by the neural network; the label category is expressed by one-hot encoding and includes N_Y bits, with ȳ_n representing one of them; P_t represents the predicted category of the t-th frame of training image in the T1 frames, comprises N_Y bits, with p_{t,n} representing one of them; the t-th frame of image can also be understood as the image at time t.
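  • The following is a minimal sketch of computing a loss of the form reconstructed above for T1 frames sharing one one-hot label; the function name and the small epsilon for numerical stability are assumptions of this sketch, not part of the embodiment.

```python
import numpy as np

def group_action_loss(pred_probs: np.ndarray, one_hot_label: np.ndarray) -> float:
    """Cross-entropy-style loss summed over the T1 frames.

    pred_probs:    (T1, N_Y) predicted category distribution P_t for each frame
    one_hot_label: (N_Y,)    one-hot label category shared by the T1 frames
    """
    eps = 1e-12                                   # numerical stability
    return float(-(one_hot_label * np.log(pred_probs + eps)).sum())

pred = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1]])               # T1 = 2 frames, N_Y = 3 categories
label = np.array([1.0, 0.0, 0.0])                # one-hot label category
print(group_action_loss(pred, label))
```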
  • the training data generally includes a combination of multiple sets of training images and annotated categories.
  • Each combination of training images and annotated category may include one or more frames of training images, and the one or more frames of training images correspond to one label category.
  • When the difference is within a certain preset range, or when the number of training iterations reaches a preset number, the model parameters of the neural network at this time are determined as the final parameters of the neural network model, thereby completing the training of the neural network.
  • FIG. 9 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • the image to be processed includes a plurality of people, and the image feature of the image to be processed includes the image feature of each of the plurality of people in each of the multiple frames of the image to be processed.
  • an image to be processed can be acquired.
  • the image to be processed can be obtained from the memory, or the image to be processed can also be received.
  • the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image obtained by the image recognition device from other equipment.
  • the received image or the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
  • the above-mentioned image to be processed may be a continuous multi-frame image in a video, or a multi-frame image selected according to a preset rule in a video.
  • For example, in a video, multiple frames of images can be selected according to a preset time interval; or, in a video, multiple frames of images can be selected according to a preset frame-number interval (see the sketch below).
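  • The following is a minimal sketch of the two selection rules just mentioned; the function names, the stand-in frame list, and the frame rate are illustrative assumptions only.

```python
from typing import List, Sequence

def select_frames_by_interval(video_frames: Sequence, frame_interval: int) -> List:
    """Pick one frame every `frame_interval` frames (preset frame-number interval)."""
    return [frame for idx, frame in enumerate(video_frames) if idx % frame_interval == 0]

def select_frames_by_time(video_frames: Sequence, fps: float, time_interval_s: float) -> List:
    """Pick one frame every `time_interval_s` seconds (preset time interval)."""
    step = max(1, round(fps * time_interval_s))
    return select_frames_by_interval(video_frames, step)

frames = list(range(100))                                   # stand-in for decoded video frames
print(len(select_frames_by_interval(frames, 10)))           # 10 frames
print(len(select_frames_by_time(frames, fps=25.0, time_interval_s=1.0)))  # 4 frames
```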
  • the multiple characters may include only humans, or only animals, or both humans and animals.
  • the image feature of a certain person can be used to represent the body posture of the person in the frame of image, that is, the relative position between different limbs of the person.
  • the image feature of a certain person mentioned above can be represented by a vector, which can be called an image feature vector.
  • the above-mentioned image feature extraction can be performed by CNN.
  • the person in the image can be identified to determine the bounding box of the person.
  • the image in each bounding box corresponds to a person.
  • The person's bone nodes in the bounding box corresponding to each person can be identified first, and then the person's image feature vector can be extracted based on the bone nodes, so that the extracted image features can more accurately reflect the actions of the person, improving the accuracy of the extracted image features.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
  • the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
  • the color of the pixel corresponding to the area outside the area where the bone node is located can be set to a certain preset color, such as black.
  • the area where the bone node is located retains the same information as the original image, and the information of the area outside the area where the bone node is located is masked. Therefore, when extracting image features, only the image features of the above-mentioned partially visible image need to be extracted, and there is no need to extract the above-mentioned masked area.
  • the area where the aforementioned bone node is located may be a square, circle or other shape centered on the bone node.
  • the side length (or radius), area, etc. of the region where the bone node is located can be preset values.
  • The above method of extracting the image features of the image to be processed can extract the features according to the locally visible image to obtain the image feature vector of the person corresponding to the bounding box; it can also determine a masking matrix according to the bone nodes and mask the image according to the masking matrix. For details, refer to the descriptions of FIG. 11 and FIG. 12.
  • target tracking can be used to identify different people in the image.
  • Sub-features of the people in the image can be used to distinguish between different people in the image.
  • the sub-features can be colors, edges, motion information, texture information, and so on.
  • S902 Determine the spatial feature of each of the multiple persons in each of the multiple frames of images.
  • the spatial correlation between the actions of different characters in the frame of image is determined.
  • The spatial feature of the j-th person in the i-th frame of the image to be processed can be determined based on the similarity between the image feature of the j-th person in the i-th frame of image and the image features of the other people except the j-th person in the i-th frame of image, where i and j are positive integers.
  • The spatial feature of the j-th person in the i-th frame of image is used to represent the relationship between the action of the j-th person in the i-th frame of image and the actions of the other people except the j-th person in the i-th frame of image.
  • The similarity between the image feature vector of the j-th person in the i-th frame of image and the image feature vectors of the other people except the j-th person can reflect the degree of dependence of the j-th person in the i-th frame of image on the actions of the people other than the j-th person. That is, when the similarity of the image feature vectors corresponding to two people is higher, the correlation between the actions of the two people is closer; conversely, when the similarity of the image feature vectors corresponding to the two people is lower, the association between the actions of the two people is weaker. For the spatial association relationship of the actions of different people in a frame of image, refer to the descriptions of FIG. 14 and FIG. 15.
  • S903. Determine the time sequence characteristics of each of the multiple persons in each frame of the multiple frames of images.
  • the time correlation between the actions of the person at different moments is determined.
  • The time sequence feature of the j-th person in the i-th frame of the image to be processed can be determined based on the similarity between the image feature of the j-th person in the i-th frame of image and the image features of the j-th person in the frames other than the i-th frame, where i and j are positive integers.
  • the time series feature of the j-th person in the i-th frame image is used to indicate the relationship between the action of the j-th person in the i-th frame image and the action in other frame images except the i-th frame image.
  • the similarity between corresponding image features of a person in two frames of images can reflect the degree of dependence of the person's actions on time.
  • The similarity may be measured by, for example, a Minkowski distance (such as the Euclidean distance, the Manhattan distance, or the Chebyshev distance), the cosine similarity, or the Hamming distance.
  • The similarity can also be calculated as the sum of the products of each bit of the two features after a linear transformation, as sketched below.
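  • The following is a minimal sketch of several of the similarity measures listed above, plus the "dot product after a linear change" variant; the embedding matrices W_theta and W_phi, the feature dimension, and the random inputs are illustrative assumptions only.

```python
import numpy as np

def euclidean(a, b):  return float(np.linalg.norm(a - b))
def manhattan(a, b):  return float(np.abs(a - b).sum())
def chebyshev(a, b):  return float(np.abs(a - b).max())
def cosine_sim(a, b): return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def embedded_dot(a, b, W_theta, W_phi):
    """Sum of the products of each bit of the two features after a linear change,
    i.e. r(a, b) = theta(a) . phi(b) with theta and phi as linear embeddings."""
    return float((W_theta @ a) @ (W_phi @ b))

rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)
W_theta, W_phi = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
print(euclidean(a, b), cosine_sim(a, b), embedded_dot(a, b, W_theta, W_phi))
```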
  • the spatial correlation between actions of different characters and the temporal correlation between actions of the same character can provide important clues to the categories of multi-person scenes in the image. Therefore, in the image recognition process of the present application, by comprehensively considering the spatial association relationship between different person actions and the time association relationship between the same person actions, the accuracy of recognition can be effectively improved.
  • S904 Determine the action feature of each of the multiple persons in each frame of the multiple frames of images.
  • The time sequence feature, spatial feature, and image feature corresponding to a person in a frame of image can be fused to obtain the action feature of the person in the frame of image.
  • the spatial characteristics of the j-th person in the i-th frame of the image to be processed, the temporal characteristics of the j-th person in the i-th frame of the image, and the image characteristics of the j-th person in the i-th frame of the image can be fused to obtain The action feature of the j-th person in the i-th frame image.
  • the first way is to use a combination (combine) method for integration.
  • the features to be fused can be added directly or weighted.
  • weighted addition is to add the features to be fused by a certain coefficient, that is, the weight value.
  • The features can also be linearly combined channel-wise. For example, multiple features output by multiple layers of the feature extraction network can be added together, either directly or according to certain weights. If T1 and T2 respectively represent the features output by two layers of the feature extraction network, the weighted addition can be expressed as a·T1 + b·T2, where a and b are the weight values (coefficients), a ≥ 0 and b ≥ 0.
  • The second way is to use concatenation (cascade) and channel fusion. The dimensions of the features to be fused can be spliced together directly, or spliced after being multiplied by certain coefficients, that is, weight values.
  • The third way is to use a pooling layer to process the above features, so as to fuse them. For example, max pooling can be performed on multiple feature vectors to determine a target feature vector; in the target feature vector obtained by max pooling, each bit is the maximum value of the corresponding bit in the multiple feature vectors. It is also possible to perform average pooling on multiple feature vectors; in the target feature vector obtained by average pooling, each bit is the average value of the corresponding bit in the multiple feature vectors.
  • For example, the features corresponding to a person in a frame of image can be merged in the combination manner to obtain the action feature of the person in the frame of image; the three fusion approaches are sketched below.
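  • The following is a minimal sketch of the three fusion approaches described above (weighted addition, concatenation, and pooling); the feature dimension, the weights, and the random feature vectors are illustrative assumptions only.

```python
import numpy as np

def fuse_combine(feats, weights=None):
    """Weighted addition (combine): sum the features, optionally scaled by weight values."""
    weights = weights or [1.0] * len(feats)
    return sum(w * f for w, f in zip(weights, feats))

def fuse_concat(feats):
    """Cascade / channel fusion: splice the feature dimensions together."""
    return np.concatenate(feats, axis=-1)

def fuse_max_pool(feats):
    """Max pooling: each bit is the maximum of the corresponding bits of the features."""
    return np.max(np.stack(feats, axis=0), axis=0)

image_feat    = np.random.rand(64)    # hypothetical image feature of one person in one frame
spatial_feat  = np.random.rand(64)    # hypothetical spatial feature
temporal_feat = np.random.rand(64)    # hypothetical time sequence feature

action_feat = fuse_combine([image_feat, spatial_feat, temporal_feat], weights=[1.0, 0.5, 0.5])
print(action_feat.shape,
      fuse_concat([image_feat, spatial_feat]).shape,
      fuse_max_pool([image_feat, spatial_feat]).shape)
```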
  • the feature vector group corresponding to at least one person in the i-th frame image may further include a time-series feature vector corresponding to at least one person in the i-th frame image.
  • S905 Recognizing group actions of multiple people in the image to be processed according to the action feature of each person in the multiple images in each frame of the image.
  • group actions are composed of actions of several characters in the group, that is, actions completed by multiple characters.
  • the group actions of multiple characters in the image to be processed may be a certain sport or activity.
  • For example, the group action of the multiple people in the image to be processed may be playing basketball, volleyball, or football, or dancing, and so on.
  • the motion characteristics of each frame of the image may be determined according to the motion characteristics of each of the multiple characters in each frame of the image to be processed. Then, the group actions of multiple people in the image to be processed can be identified according to the action characteristics of each frame of image.
  • the action characteristics of multiple characters in a frame of image can be merged by means of maximum pooling, so as to obtain the action characteristics of the frame of image.
  • The action features of the multiple people in each frame of image can be fused to obtain the action feature of that frame of image, and the action feature of each frame of image can then be input into the classification module to obtain the classification result of each frame of image; the classification result that corresponds to the largest number of images in the image to be processed among the output categories of the classification module is taken as the group action of the multiple people in the image to be processed.
  • Alternatively, the action features of the multiple people in each frame of image can be fused to obtain the action feature of that frame of image, the action features of the frames of image obtained above can then be averaged to obtain an average action feature, the average action feature is input into the classification module, and the classification result corresponding to the average action feature is regarded as the group action of the multiple people in the image to be processed.
  • a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
  • Alternatively, the action feature of each of the multiple people in each frame of the image to be processed can be input into the classification module to obtain a classification result of the action feature of each person, that is, the action of each person; then, the action performed by the largest number of people is regarded as the group action of the multiple people.
  • a certain person can be selected from a plurality of people, and the action characteristics of the person in each frame of the image can be input into the classification module to obtain the classification result of the action characteristics of the person, that is, the action of the person, and then the above The obtained action of the character is used as a group action of multiple characters in the image to be processed.
  • Steps S901 to S904 can be implemented by the neural network model trained in FIG. 8.
  • Alternatively, the time sequence features may be determined first, and then the spatial features may be determined.
  • The method shown in FIG. 9 considers not only the temporal features of the multiple people but also their spatial features when determining the group action of the multiple people; by integrating the temporal and spatial features of the multiple people, the group action of the multiple people can be determined better and more accurately.
  • Label information of the image to be processed can be generated according to the group action, and the label information is used to indicate the group action of the multiple people.
  • the foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
  • the key person in the image to be processed is determined according to the group actions.
  • the contribution of each of the multiple characters in the image to be processed to the group action can be determined first, and then the person with the highest contribution rate is determined as the key person.
  • the above method can be used to detect key persons in a video image, for example.
  • the video contains several persons, most of which are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
  • For example, in a sports game video, the player who holds the ball has the greatest impact on all the people present, including players, referees, and spectators, and also contributes the most to the group action; therefore, the player who holds the ball can be identified as the key person. Identifying the key person helps people watching the video understand what is going on and what is about to happen in the game (one possible contribution-based selection is sketched below).
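  • The following is a purely illustrative sketch of selecting the key person by contribution, under the assumption that a person's contribution is measured by scoring that person's action feature against the recognized group-action class with a hypothetical linear classifier; the embodiment does not fix a specific contribution measure, so this is only one possible choice.

```python
import numpy as np

def key_person_by_contribution(person_action_feats: np.ndarray,
                               classifier_weights: np.ndarray,
                               group_action_idx: int) -> int:
    """Illustrative only: score each person's action feature against the recognized
    group-action class and pick the person with the highest score as the key person.

    person_action_feats: (K, D) action feature of each of the K people
    classifier_weights:  (N_Y, D) weights of a hypothetical linear classifier
    group_action_idx:    index of the recognized group action
    """
    scores = person_action_feats @ classifier_weights[group_action_idx]
    return int(np.argmax(scores))        # index of the person with the highest contribution

feats = np.random.rand(6, 64)            # 6 people, 64-dimensional action features
W = np.random.rand(10, 64)               # hypothetical classifier for 10 group-action classes
print(key_person_by_contribution(feats, W, group_action_idx=3))
```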
  • FIG. 10 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • the image to be processed includes at least one frame of image, and the image features of the image to be processed include image features of multiple people in the image to be processed.
  • an image to be processed can be acquired.
  • the image to be processed can be obtained from the memory, or the image to be processed can also be received.
  • the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image obtained by the image recognition device from other equipment.
  • the received image or the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
  • the above-mentioned image to be processed may be one frame of image or multiple frames of image.
  • When the above-mentioned image to be processed includes multiple frames, they may be consecutive frames in a video, or frames selected from a video according to a preset rule. For example, in a video, multiple frames of images can be selected according to a preset time interval, or according to a preset frame-number interval.
  • the above-mentioned image to be processed may include a plurality of persons, and the plurality of persons may include only humans, animals, or both humans and animals.
  • step S901 in FIG. 9 may be used to extract the image features of the image to be processed.
  • S1002 Determine the spatial characteristics of multiple people in each frame of the image to be processed.
  • The spatial feature of a certain person among the multiple people in each frame of the image to be processed is determined based on the similarity between the image feature of that person in the frame of image to be processed and the image features of the other people except that person in the frame of image to be processed.
  • step S902 in FIG. 9 may be used to determine the spatial characteristics of multiple persons in each frame of the image to be processed.
  • S1003 Determine the action characteristics of multiple people in each frame of the image to be processed.
  • The action feature of a person among the multiple people in each frame of the image to be processed is obtained by fusing the spatial feature of the person in the frame of image to be processed and the image feature of the person in the frame of image to be processed.
  • The fusion method shown in step S904 in FIG. 9 may be used to determine the action features of the multiple people in each frame of the image to be processed.
  • S1004 Identify group actions of multiple people in the image to be processed according to the action characteristics of multiple people in each frame of the image to be processed.
  • step S905 in FIG. 9 may be used to identify group actions of multiple characters in the image to be processed.
  • FIG. 11 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • the image to be processed includes multiple frames of images, and the image features of the image to be processed include image features of multiple people in each frame of at least one frame of image selected from the multiple frames of images.
  • feature extraction can be performed on images corresponding to multiple people in the input multiple frames of images.
  • the image feature of a certain person can be used to represent the body posture of the person in the frame of image, that is, the relative position between different limbs of the person.
  • the image feature of a certain person mentioned above can be represented by a vector, which can be called an image feature vector.
  • the above-mentioned image feature extraction can be performed by CNN.
  • target tracking can be performed on each person, and the bounding box of each person in each frame of the image is determined, and the image in each bounding box corresponds to a person, and then The feature extraction is performed on the image of each of the above-mentioned bounding boxes to obtain the image feature of each person.
  • the bone nodes in the bounding box may be connected according to the structure of the person to obtain a connected image, and then image feature vector extraction is performed on the connected image.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors, and then image feature extraction is performed on the processed image.
  • the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the character is located in the bounding box may be masked to obtain the partially visible image.
  • the color of the pixel corresponding to the area outside the area where the bone node is located can be set to a certain preset color, such as black.
  • the area where the bone node is located retains the same information as the original image, and the information of the area outside the area where the bone node is located is masked. Therefore, when extracting image features, only the image features of the above-mentioned partially visible image need to be extracted, and there is no need to extract the above-mentioned masked area.
  • the area where the aforementioned bone node is located may be a square, circle or other shape centered on the bone node.
  • the side length (or radius), area, etc. of the region where the bone node is located can be preset values.
  • the above method of extracting the image features of the image to be processed can extract the features according to the locally visible image to obtain the image feature vector of the person corresponding to the bounding box; it can also determine the masking matrix according to the bone node, and mask the image according to the masking matrix .
  • the following is a specific example of the above method of determining the masking matrix based on the bone node.
  • In the masking matrix M, the value inside each square area centered on a bone node with side length l is set to 1, and the values at other positions are set to 0. The calculation formula of the masking matrix M can be expressed as:

    $M(p) = \begin{cases} 1, & \text{if pixel } p \text{ lies within a square of side length } l \text{ centered on a bone node} \\ 0, & \text{otherwise} \end{cases}$

  • The RGB model assigns each pixel an intensity value in the range of 0 to 255 for each RGB component. If the RGB color mode is used, the masking matrix is applied to each color channel of the image.
  • The original person image I is masked with the matrix M to obtain the partially visible image Ĩ:

    $\tilde{I} = M \odot I$

  • where each bit of M corresponds to a pixel, the RGB components of each pixel of the masked image take values between 0 and 1, and the operator "⊙" means multiplying each bit of M by the corresponding bit of I.
  • FIG. 12 is a schematic diagram of a process of obtaining a partially visible image provided by an embodiment of the present application. As shown in FIG. 12, the original image is masked: the square area with side length l around each bone node is kept, and the other areas are masked (a sketch of this masking is given below).
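  • The following is a minimal sketch of building the masking matrix from bone nodes and computing the partially visible image by element-wise multiplication, following the formulas reconstructed above; the function names, the (row, column) coordinate convention, the image size, and the example bone nodes are illustrative assumptions only.

```python
import numpy as np

def build_mask(height: int, width: int, bone_nodes, l: int) -> np.ndarray:
    """Masking matrix M: 1 inside the square of side length l centered on each bone node, 0 elsewhere."""
    M = np.zeros((height, width), dtype=np.float32)
    half = l // 2
    for (y, x) in bone_nodes:
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        x0, x1 = max(0, x - half), min(width, x + half + 1)
        M[y0:y1, x0:x1] = 1.0
    return M

def partially_visible(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise ("bit by bit") product of the mask and the RGB image normalized to 0-1."""
    return (image.astype(np.float32) / 255.0) * mask[..., None]

image = np.random.randint(0, 256, size=(128, 64, 3), dtype=np.uint8)  # person crop inside a bounding box
bone_nodes = [(20, 32), (60, 30), (100, 34)]                          # illustrative (row, col) bone nodes
mask = build_mask(128, 64, bone_nodes, l=15)
print(partially_visible(image, mask).shape)   # (128, 64, 3); regions away from bone nodes are zeroed
```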
  • the T frame images all include images of K people.
  • The extracted image feature of each person in each frame can be represented by a D-dimensional vector.
  • the image feature extraction of the above-mentioned T frame image can be performed by CNN.
  • The set of image features of the K people in the T frames of images can be expressed as X. For each person, using the partially visible image to extract image features can reduce redundant information in the bounding box, extract image features based on body structure information, and enhance the ability of the image features to express human actions.
  • S1102 determine the dependence relationship between actions of different characters in the image to be processed, and the dependence relationship between actions of the same character at different moments.
  • a cross interaction module (CIM) is used to determine the spatial correlation of the actions of different characters in the image to be processed, and the temporal correlation of the actions of the same character at different times.
  • the cross interaction module is used to implement feature interaction and establish a feature interaction model.
  • the feature interaction model is used to represent the relationship of the character's body posture in time and/or space.
  • Spatial dependence is used to express the dependence of a character's body posture in a certain frame of image on the body posture of other characters in this frame of image, that is, the spatial dependence of character actions.
  • the above-mentioned spatial dependence can be expressed by a spatial feature vector.
  • Assuming that one frame of the image to be processed corresponds to the image at time t, then at time t the spatial feature vector s_t^k of the k-th person can be expressed as:

    $s_t^k = \frac{1}{K} \sum_{k'=1}^{K} r(x_t^k, x_t^{k'}) \, g(x_t^{k'}), \quad r(a, b) = \theta(a) \cdot \phi(b)$

  • where K represents the number of people in the frame of image corresponding to time t, x_t^k represents the image feature of the k-th person at time t, θ(), φ(), and g() respectively represent three linear embedding functions, which can be the same or different, and r(a, b) reflects the dependence of feature b on feature a.
  • the spatial dependence between the body postures of different characters in the same frame of image can be determined.
  • Time dependence can also be called timing dependence, which is used to indicate the dependence of the character's body posture in a certain frame of image on the character's body posture in other frame images, that is, the inherent temporal dependence of a character's actions.
  • the above-mentioned time dependence can be expressed by a time series feature vector.
  • Assuming that one frame of the image to be processed corresponds to the image at time t, then at time t the time series feature vector q_t^k of the k-th person can be expressed as:

    $q_t^k = \frac{1}{T} \sum_{t'=1}^{T} r(x_t^k, x_{t'}^k) \, g(x_{t'}^k)$

  • where T indicates that the image to be processed includes images at T moments, that is, T frames of images; x_t^k represents the image feature of the k-th person at time t, and x_{t'}^k represents the image feature of the k-th person at time t'.
  • the time dependence between the body postures of the same person at different times can be determined.
  • The spatio-temporal feature vector h_t^k can be expressed as the result of an "add" operation on the time series feature vector q_t^k and the spatial feature vector s_t^k:

    $h_t^k = s_t^k + q_t^k$

    (a sketch of this computation is given below)
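  • The following is a minimal numpy sketch of the reconstructed formulas above. The linear embeddings W_theta, W_phi, W_g, the inclusion of the person itself (and the current frame) in the sums, and the 1/K and 1/T normalizations are assumptions of this sketch and may differ from the exact form of the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, D = 5, 4, 16                       # T frames, K people, D-dimensional image features
X = rng.normal(size=(T, K, D))           # image features x_t^k

W_theta = rng.normal(size=(D, D))        # three linear embedding functions theta, phi, g
W_phi   = rng.normal(size=(D, D))
W_g     = rng.normal(size=(D, D))

def r(a, b):
    """Similarity r(a, b): sum of products of the two features after linear embeddings."""
    return (W_theta @ a) @ (W_phi @ b)

def spatial_feature(t, k):
    """s_t^k: average over the people in frame t, weighted by similarity to person k."""
    return sum(r(X[t, k], X[t, kk]) * (W_g @ X[t, kk]) for kk in range(K)) / K

def temporal_feature(t, k):
    """q_t^k: average over the T frames, weighted by similarity of person k across time."""
    return sum(r(X[t, k], X[tt, k]) * (W_g @ X[tt, k]) for tt in range(T)) / T

# Spatio-temporal feature h_t^k as the "add" of the spatial and time series vectors.
H = np.stack([[spatial_feature(t, k) + temporal_feature(t, k) for k in range(K)] for t in range(T)])
print(H.shape)                           # (5, 4, 16)
```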
  • FIG. 13 is a schematic diagram of a method for calculating similarity between image features according to an embodiment of the present application.
  • Specifically, the vector representation of the similarity between the image feature of the k-th person at time t and the image features of the other people at time t, and the vector representation of the similarity between the image feature of the k-th person at time t and the image features of the k-th person at other times, are averaged (Avg) to determine the spatio-temporal feature vector of the k-th person at time t.
  • The set of spatio-temporal feature vectors of the K people in the T frames of images can be expressed as H.
  • S1103 Fuse the image feature with the spatio-temporal feature vector to obtain the action feature of each frame of image.
  • Specifically, the image features and the spatio-temporal feature vectors of dimension T×K×D are fused to obtain the action feature of each frame of image at the T moments.
  • the motion feature of each frame of image can be represented by a motion feature vector.
  • The set B ∈ R^{T×K×D} of the action feature vectors of the K people can be expressed as the fusion of the image features X and the spatio-temporal features H (for example, B = X + H); the action feature vector z_t of the frame of image at time t can then be obtained by fusing (for example, max pooling) the action feature vectors of the K people in that frame.
  • S1104 Perform classification prediction on the action feature of each frame of image to determine the group action of the image to be processed.
  • the classification module can be a softmax classifier.
  • the classification result of the classification module can be one-hot coded, that is, only one bit is valid in the output result.
  • the category corresponding to the classification result of any image feature vector is the only category among the output categories of the classification module.
  • the action feature vector z t of a frame of image at time t can be input to the classification module to obtain the classification result of the frame of image.
  • the classification result of z t at any time t by the classification module can be used as the classification result of the group action in the T frame image.
  • the classification result of the group action in the T frame image can also be understood as the classification result of the group action of the person in the T frame image, or the classification result of the T frame image.
  • the action feature vectors z 1 , z 2 ,..., z T of the T frame images can be input into the classification module respectively to obtain the classification result of each frame of image.
  • the classification result of the T frame image can belong to one or more categories.
  • the category with the largest number of images in the corresponding T-frame image in the output category of the classification module can be used as the classification result of the group action in the T-frame image.
  • Alternatively, the action feature vectors z_1, z_2, ..., z_T of the T frames of images can be averaged to obtain an average action feature vector, each element of which is the average of the corresponding element of z_1, z_2, ..., z_T.
  • The average action feature vector is then input into the classification module to obtain the classification result of the group action in the T frames of images.
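  • The two classification strategies just described (per-frame voting, and averaging z_1, ..., z_T before a single softmax pass) could look like the following sketch; the linear classifier parameters W and b are placeholders rather than anything specified in this application.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def classify_group_action(Z, W, b, average_first=True):
    """Z: (T, D) matrix of per-frame action feature vectors z_1..z_T.
    W: (C, D) and b: (C,) parameters of a linear softmax classifier.

    average_first=True : average z_1..z_T, then classify the mean vector once.
    average_first=False: classify every frame and take a majority vote."""
    if average_first:
        z_bar = Z.mean(axis=0)
        return int(np.argmax(softmax(W @ z_bar + b)))
    votes = [int(np.argmax(softmax(W @ z_t + b))) for z_t in Z]
    return max(set(votes), key=votes.count)
```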
  • The above method completes the complex reasoning process of group action recognition: the image features of multiple frames of images are extracted; temporal and spatial features are determined according to the interdependence between the actions of different people in an image and between the actions of the same person at different times; the temporal features, spatial features and image features are fused to obtain the action feature of each frame of image; and the group action of the multiple frames of images is then inferred by classifying the action feature of each frame of image.
  • When the spatial features do not depend on the temporal features, only the spatial features of the multiple characters may be considered for recognition, which makes it more convenient to determine the group actions of the multiple characters.
  • Table 1 shows the recognition accuracy obtained when a neural network model trained with the image recognition method provided in the embodiments of the present application is used to recognize public data sets.
  • The multi-class accuracy (MCA) indicates the proportion of correctly classified results among the neural network's classification results on the data containing group actions, that is, the number of correct results divided by the total number of samples containing group actions.
  • The mean per class accuracy (MPCA) represents the average, taken over the categories, of the ratio of the number of correctly classified results of each category to the number of group-action samples belonging to that category.
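  • Assuming the conventional definitions behind these two metrics (overall accuracy, and per-class accuracy averaged over the classes), they could be computed as follows; this is an illustration rather than the evaluation code used for Table 1.

```python
import numpy as np

def mca_mpca(y_true, y_pred):
    """MCA: fraction of group-action samples classified correctly overall.
    MPCA: accuracy computed separately per class, then averaged over classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mca = float((y_true == y_pred).mean())
    per_class = [float((y_pred[y_true == c] == c).mean()) for c in np.unique(y_true)]
    return mca, float(np.mean(per_class))
```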
  • The training of the neural network can be completed without relying on the action labels of individual characters.
  • an end-to-end training method is adopted, that is, the neural network is adjusted only according to the final classification results.
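  • A minimal PyTorch-style sketch of such end-to-end training is given below: the only supervision is the group-action label, and gradients from that single loss adjust every module. The model, data loader and hyperparameters are placeholders, not values from this application.

```python
import torch
import torch.nn as nn

def train_end_to_end(model, loader, epochs=10, lr=1e-4):
    """model: maps a batch of inputs (e.g. per-person features of shape (B, T, K, D),
    or raw clips if the backbone is included) to (B, num_classes) group-action logits.
    loader: yields (inputs, group_labels); no per-person action labels are required."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()              # supervised only by the final group-action label
    for _ in range(epochs):
        for inputs, group_labels in loader:
            logits = model(inputs)
            loss = loss_fn(logits, group_labels)
            opt.zero_grad()
            loss.backward()                      # one loss adjusts all modules end to end
            opt.step()
    return model
```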
  • Feature interaction determines the dependence between characters and the dependence of a character's actions on time.
  • the similarity between the two image features is calculated by the function r(a,b).
  • the spatial feature vector of each person in the frame of image is determined.
  • the spatial feature vector of a person in a frame of image is used to express the person's spatial dependence on other people in the frame of image, that is, the dependence of the person's body posture on the body posture of other people.
  • FIG. 14 is a schematic diagram of the spatial relationship between different character actions provided by an embodiment of the present application.
  • the spatial dependence matrix of FIG. 15 represents the dependence of each person in the group action on the body posture of other people.
  • Each element of the spatial dependence matrix is represented by a square, and the color (brightness) of the square represents the similarity of the image features of the two people, that is, the calculation result of the function r(a, b).
  • The calculation result of the function r(a, b) can be normalized, that is, mapped into the interval from 0 to 1, so that the spatial dependence matrix can be drawn.
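  • One simple way to carry out that normalization, assuming min-max scaling over the whole matrix (the passage only requires a mapping into [0, 1]), is sketched below.

```python
import numpy as np

def dependence_matrix(features, sim):
    """features: (N, D) image features -- the K people of one frame for the spatial
    matrix, or one person's T frames for the temporal matrix.
    sim: the similarity function r(a, b).  Returns an (N, N) matrix scaled to [0, 1]."""
    N = features.shape[0]
    M = np.array([[sim(features[i], features[j]) for j in range(N)] for i in range(N)])
    return (M - M.min()) / (M.max() - M.min() + 1e-8)   # min-max mapping to [0, 1]
```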
  • In FIG. 14, the hitter is the No. 10 player, who has a greater influence on the follow-up actions of her teammates.
  • The function r(a, b) can reflect cases in which the body posture of one person in a frame of image is highly correlated with, that is, highly dependent on, the body postures of other people.
  • the spatial dependency between the body postures of players 1-6 is weak.
  • the neural network provided by the embodiments of the present application can better reflect the dependency or association relationship between the body posture of one person and the body posture of other people in a frame of image.
  • the time sequence feature vector of the person in one frame of image is determined.
  • the time series feature vector of a person in one frame of image is used to represent the dependence of the person's body posture on the body posture of the person in other frames of images.
  • The body postures of the No. 10 player shown in FIG. 14, over 10 frames of images in time order, are shown in FIG. 16.
  • The time dependence matrix of FIG. 17 indicates the time dependence of the body posture of the No. 10 player.
  • Each element of the time dependence matrix is represented by a square, and the color (brightness) of the square represents the similarity of the image features of the No. 10 player at the two corresponding times, that is, the calculation result of the function r(a, b).
  • the body posture of the No. 10 player in the 10-frame image corresponds to the take-off (frames 1-3), floating (frames 4-8) and landing (frames 9-10).
  • "jumping" and "landing" should be more discriminative.
  • the image features of the No. 10 player in the 2nd and 10th frames are relatively similar to the image features in other images.
  • The image features of the 4th to 8th frames, that is, the image features of the No. 10 player in the floating state, have low similarity with the image features in the other images. Therefore, the neural network provided by the embodiments of the present application can better reflect the temporal association of a person's body posture across multiple frames of images.
  • FIG. 18 is a schematic diagram of the system architecture of an image recognition device provided by an embodiment of the present application.
  • the image recognition device shown in FIG. 18 includes a feature extraction module 1801, a cross interaction module 1802, a feature fusion module 1803, and a classification module 1804.
  • the image recognition device in FIG. 18 can execute the image recognition method of the embodiment of the present application. The process of processing the input picture by the image recognition device will be introduced below.
  • The feature extraction module 1801, which may also be referred to as a partial-body extractor module, is used to extract the image features of a person according to the bone nodes of the person in the image.
  • the function of the feature extraction module 1801 can be realized by using a convolutional network.
  • the multi-frame images are input to the feature extraction module 1801.
  • the image feature of a person can be represented by a vector, and the vector representing the image feature of the person can be called the image feature vector of the person.
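  • A rough sketch of what the partial-body extractor 1801 might do with one person's bounding box and skeleton keypoints is shown below: pixels outside small patches around the bone nodes are masked, and the partially visible crop is passed to a feature network. The patch radius, the backbone and the output size are assumptions, not values taken from this application.

```python
import numpy as np

def partially_visible_crop(crop, keypoints, radius=8):
    """crop: (H, W, 3) image inside one person's bounding box.
    keypoints: list of (x, y) bone-node coordinates inside the crop.
    Returns the crop with everything outside small windows around the
    bone nodes masked out (set to zero)."""
    mask = np.zeros(crop.shape[:2], dtype=bool)
    H, W = mask.shape
    for x, y in keypoints:
        x, y = int(round(x)), int(round(y))
        mask[max(0, y - radius):min(H, y + radius),
             max(0, x - radius):min(W, x + radius)] = True
    return crop * mask[:, :, None]

def extract_person_feature(crop, keypoints, backbone):
    """backbone: any callable (e.g. a small CNN) turning an image into a D-dim feature vector."""
    return backbone(partially_visible_crop(crop, keypoints))
```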
  • the cross interaction module 1802 is used to map the image features of multiple characters in each frame of the multi-frame images to the spatio-temporal interaction features of each person.
  • the time-space interaction feature is used to indicate the "time-space" associated information of a certain character.
  • the time-space interaction feature of a person in a frame of image may be obtained by fusing the temporal feature and spatial feature of the person in the frame of image.
  • the cross interaction module 1802 may be implemented by a convolutional layer and/or a fully connected layer.
  • the feature fusion module 1803 is used to fuse the action feature of each person in a frame of image with the time-space interaction feature to obtain the image feature vector of the frame of image.
  • the image feature vector of the frame image can be used as the feature representation of the frame image.
  • the classification module 1804 is configured to classify according to the image feature vector, so as to determine the category of the group action of the person in the T frame image input to the feature extraction module 1801.
  • the classification module 1804 may be a classifier.
  • the image recognition device shown in FIG. 18 can be used to execute the image recognition method shown in FIG. 11.
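  • To tie the four modules together, the following is a compact PyTorch sketch in the spirit of FIG. 18 (1801 feature extraction, 1802 cross interaction, 1803 feature fusion, 1804 classification). The linear extractor, the use of multi-head attention for the interaction step, and all layer sizes are simplifications for illustration, not the claimed implementation.

```python
import torch
import torch.nn as nn

class GroupActionRecognizer(nn.Module):
    def __init__(self, d_in, d_feat=128, num_classes=8, heads=4):
        super().__init__()
        self.extractor = nn.Linear(d_in, d_feat)             # 1801: stands in for the partial-body extractor
        self.cross = nn.MultiheadAttention(d_feat, heads,    # 1802: one way to model spatio-temporal interaction
                                           batch_first=True)
        self.fuse = nn.Linear(2 * d_feat, d_feat)            # 1803: fuse image and interaction features
        self.classifier = nn.Linear(d_feat, num_classes)     # 1804: softmax classifier (returns logits)

    def forward(self, x):                                    # x: (B, T, K, d_in) per-person inputs
        B, T, K, _ = x.shape
        f = self.extractor(x)                                # (B, T, K, d_feat)
        tokens = f.reshape(B, T * K, -1)                     # every (t, k) slot becomes a token
        inter, _ = self.cross(tokens, tokens, tokens)        # spatio-temporal interaction features
        fused = self.fuse(torch.cat([tokens, inter], -1))    # per-person action features
        frame = fused.reshape(B, T, K, -1).mean(dim=2)       # pool people -> per-frame features
        return self.classifier(frame.mean(dim=1))            # pool frames -> group-action logits

# e.g. GroupActionRecognizer(d_in=256)(torch.randn(2, 10, 12, 256)) -> logits of shape (2, 8)
```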
  • FIG. 19 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application.
  • the image recognition device 3000 shown in FIG. 19 includes an acquisition unit 3001 and a processing unit 3002.
  • the acquiring unit 3001 is used to acquire an image to be processed
  • the processing unit 3002 is configured to execute the image recognition methods in the embodiments of the present application.
  • the obtaining unit 3001 may be used to obtain the image to be processed; the processing unit 3002 may be used to perform steps S901 to S904 or steps S1001 to S1004 described above to identify group actions of multiple people in the image to be processed.
  • the obtaining unit 3001 may be used to obtain the image to be processed; the processing unit 3002 may be used to execute the above steps S1101 to S1104 to identify group actions of people in the image to be processed.
  • the above-mentioned processing unit 3002 can be divided into multiple modules according to different processing functions.
  • the processing unit 3002 can be divided into an extraction module 1801, a cross interaction module 1802, a feature fusion module 1803, and a classification module 1804 as shown in FIG. 18.
  • The processing unit 3002 can realize the functions of the various modules shown in FIG. 18, and can further be used to realize the image recognition method shown in FIG. 11.
  • FIG. 20 is a schematic diagram of the hardware structure of an image recognition device according to an embodiment of the present application.
  • the image recognition apparatus 4000 shown in FIG. 20 includes a memory 4001, a processor 4002, a communication interface 4003, and a bus 4004.
  • the memory 4001, the processor 4002, and the communication interface 4003 implement communication connections between each other through the bus 4004.
  • the memory 4001 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 4001 may store a program. When the program stored in the memory 4001 is executed by the processor 4002, the processor 4002 is configured to execute each step of the image recognition method in the embodiment of the present application.
  • The processor 4002 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the image recognition method in the method embodiments of the present application.
  • the processor 4002 may also be an integrated circuit chip with signal processing capability.
  • each step of the image recognition method of the present application can be completed by the integrated logic circuit of hardware in the processor 4002 or instructions in the form of software.
  • The aforementioned processor 4002 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 4001, and the processor 4002 reads the information in the memory 4001, combines its hardware to complete the functions required by the units included in the image recognition device, or executes the image recognition method in the method embodiment of the application.
  • the communication interface 4003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 4000 and other devices or a communication network.
  • the image to be processed can be acquired through the communication interface 4003.
  • the bus 4004 may include a path for transferring information between various components of the device 4000 (for example, the memory 4001, the processor 4002, and the communication interface 4003).
  • FIG. 21 is a schematic diagram of the hardware structure of a neural network training device according to an embodiment of the present application. Similar to the above device 4000, the neural network training device 5000 shown in FIG. 21 includes a memory 5001, a processor 5002, a communication interface 5003, and a bus 5004. Among them, the memory 5001, the processor 5002, and the communication interface 5003 implement communication connections between each other through the bus 5004.
  • the memory 5001 may be ROM, static storage device and RAM.
  • the memory 5001 may store a program. When the program stored in the memory 5001 is executed by the processor 5002, the processor 5002 and the communication interface 5003 are used to execute each step of the neural network training method of the embodiment of the present application.
  • The processor 5002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to realize the functions required by the units in the image processing apparatus of the embodiments of the present application, or to execute the neural network training method of the method embodiments of the present application.
  • the processor 5002 may also be an integrated circuit chip with signal processing capability.
  • each step of the neural network training method of the embodiment of the present application can be completed by the integrated logic circuit of hardware in the processor 5002 or instructions in the form of software.
  • the aforementioned processor 5002 may also be a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • The storage medium is located in the memory 5001, and the processor 5002 reads the information in the memory 5001 and, in combination with its hardware, completes the functions required by the units included in the image processing apparatus of the embodiments of the present application, or executes the neural network training method of the method embodiments of the present application.
  • the communication interface 5003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 5000 and other devices or communication networks.
  • the image to be processed can be acquired through the communication interface 5003.
  • the bus 5004 may include a path for transferring information between various components of the device 5000 (for example, the memory 5001, the processor 5002, and the communication interface 5003).
  • Although the device 4000 and the device 5000 only show a memory, a processor, and a communication interface, those skilled in the art should understand that, in a specific implementation process, the device 4000 and the device 5000 may also include other devices necessary for normal operation. At the same time, according to specific needs, those skilled in the art should understand that the device 4000 and the device 5000 may also include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the device 4000 and the device 5000 may also include only the components necessary to implement the embodiments of the present application, and need not include all the components shown in FIG. 20 and FIG. 21.
  • An embodiment of the present application further provides an image recognition device, including at least one processor and a communication interface, where the communication interface is used for the image recognition device to exchange information with other communication devices; when program instructions are executed in the at least one processor, the image recognition device is caused to execute the above method.
  • An embodiment of the present application also provides a computer program storage medium, which is characterized in that the computer program storage medium has program instructions, and when the program instructions are directly or indirectly executed, the foregoing method can be realized.
  • An embodiment of the present application further provides a chip system, characterized in that the chip system includes at least one processor, and when the program instructions are executed in the at least one processor, the foregoing method can be realized.
  • the disclosed system, device, and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: U disk, mobile hard disk, ROM, RAM, magnetic disk or optical disk and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the field of artificial intelligence, in particular to the field of computer vision, and provided therein are an image recognition method and apparatus, a computer-readable storage medium and a chip. The method comprises: extracting image features of an image to be processed; determining a time-sequence feature and spatial feature of each person among a plurality of people in the image to be processed in each image frame among a plurality of image frames in the image to be processed; determining an action feature thereof according to the time-sequence feature and spatial feature; and recognizing the group action of the plurality of people in the image to be processed according to the action features. In the described method, the temporal association between extracted actions of each person among a plurality of people in an image to be processed as well as the association between same and the actions of other people are determined, thereby better recognizing the group action of the plurality of people in the image to be processed.

Description

图像识别方法、装置、计算机可读存储介质及芯片Image recognition method, device, computer readable storage medium and chip
本申请要求于2019年10月15日提交中国专利局、申请号为201910980310.7、申请名称为“图像识别方法、装置、计算机可读存储介质及芯片”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office, the application number is 201910980310.7, and the application name is "Image recognition method, device, computer readable storage medium and chip" on October 15, 2019. The entire content of the application is approved The reference is incorporated in this application.
技术领域Technical field
本申请涉及人工智能领域,尤其涉及一种图像识别方法、装置、计算机可读存储介质及芯片。This application relates to the field of artificial intelligence, and in particular to an image recognition method, device, computer readable storage medium and chip.
背景技术Background technique
计算机视觉是各个应用领域,如制造业、检验、文档分析、医疗诊断,和军事等领域中各种智能/自主系统中不可分割的一部分,它是一门关于如何运用照相机/摄像机和计算机来获取我们所需的,被拍摄对象的数据与信息的学问。形象地说,就是给计算机安装上眼睛(照相机/摄像机)和大脑(算法)用来代替人眼对目标进行识别、跟踪和测量等,从而使计算机能够感知环境。因为感知可以看作是从感官信号中提取信息,所以计算机视觉也可以看作是研究如何使人工系统从图像或多维数据中“感知”的科学。总的来说,计算机视觉就是用各种成像系统代替视觉器官获取输入信息,再由计算机来代替大脑对这些输入信息完成处理和解释。计算机视觉的最终研究目标就是使计算机能像人那样通过视觉观察和理解世界,具有自主适应环境的能力。Computer vision is an inseparable part of various intelligent/autonomous systems in various application fields, such as manufacturing, inspection, document analysis, medical diagnosis, and military. It is about how to use cameras/video cameras and computers to obtain What we need is the knowledge of the data and information of the subject. To put it vividly, it is to install eyes (camera/camcorder) and brain (algorithm) on the computer to replace the human eye to identify, track and measure the target, so that the computer can perceive the environment. Because perception can be seen as extracting information from sensory signals, computer vision can also be seen as a science that studies how to make artificial systems "perceive" from images or multi-dimensional data. In general, computer vision is to use various imaging systems to replace the visual organs to obtain input information, and then the computer replaces the brain to complete the processing and interpretation of the input information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision like humans, and have the ability to adapt to the environment autonomously.
图像中人的行为的识别和理解是最有价值的信息之一。动作识别是计算机视觉领域的一项重要研究课题。计算机通过动作识别能够理解视频的内容。动作识别技术可以广泛应用于公共场所监控、人机交互等多种领域。特征提取是动作识别过程的关键环节,只有根据准确的特征,才能有效进行动作识别。在进行群体动作识别时,视频中的多个人物中的每个人物的动作在时间上的关系以及多个人物的动作之间的关系,均影响着群体动作识别的准确性。The recognition and understanding of human behavior in images is one of the most valuable information. Action recognition is an important research topic in the field of computer vision. The computer can understand the content of the video through motion recognition. Motion recognition technology can be widely used in public place monitoring, human-computer interaction and other fields. Feature extraction is a key link in the process of action recognition. Only based on accurate features, can action recognition be effectively performed. When performing group action recognition, the temporal relationship of the actions of each of the multiple characters in the video and the relationship between the actions of multiple characters affect the accuracy of group action recognition.
现有方案一般通过长短期记忆网络(long short-term memory,LSTM)提取人物的时序特征,其中时序特征用于表示人物的动作在时间上的关联性。然后,根据每个人物的时序特征可以计算每个人物的交互动作特征,从而根据每个人物的交互动作特征确定每个人物的动作特征,以根据每个人物的动作特征推断出多个人物的群体动作。交互动作特征用于表示人物动作之间的关联性。Existing solutions generally extract the temporal characteristics of characters through a long short-term memory (LSTM) network, where the temporal features are used to represent the temporal relevance of the actions of the characters. Then, according to the time sequence characteristics of each character, the interactive action characteristics of each character can be calculated, so that the action characteristics of each character can be determined according to the interactive action characteristics of each character, and the action characteristics of multiple characters can be inferred according to the action characteristics of each character. Group action. Interactive action features are used to express the correlation between characters' actions.
但是在上述方案中,每个人物的交互动作特征仅仅是基于每个人物的动作在时间上的关联性确定的,在用于群体动作的识别时,准确性有待提高。However, in the above solution, the interactive action characteristics of each character are only determined based on the temporal relevance of each character's actions, and the accuracy needs to be improved when used for group action recognition.
发明内容Summary of the invention
本申请提供一种图像识别方法、装置、计算机可读存储介质及芯片,以更好地识别出 待处理图像中的多个人物的群体动作。The present application provides an image recognition method, device, computer readable storage medium, and chip to better recognize group actions of multiple people in an image to be processed.
第一方面,提供了一种图像识别方法,该方法包括:提取待处理图像的图像特征,待处理图像包括多帧图像;确定多个人物中的每个人物在该多帧图像中的每帧图像中的时序特征;确定多个人物中的每个人物在该多帧图像中的每帧图像中的空间特征;确定多个人物中的每个人物在该多帧图像中的每帧图像中的动作特征;根据多个人物中的每个人物在该多帧图像中的每帧图像中的动作特征,识别待处理图像中的多个人物的群体动作。In a first aspect, an image recognition method is provided. The method includes: extracting image features of an image to be processed, the image to be processed includes multiple frames of images; determining that each of a plurality of persons is in each frame of the multiple frames of image Time sequence characteristics in the image; determine the spatial characteristics of each of the multiple characters in each frame of the multi-frame image; determine each of the multiple characters in each frame of the multi-frame image Based on the action characteristics of each of the multiple characters in each frame of the multi-frame image, the group actions of multiple characters in the image to be processed are recognized.
可选地,上述待处理图像中多个人物的群体动作可以是某种运动或者活动,例如,上述待处理图像中多个人物的群体动作可以是打篮球、打排球、踢足球或跳舞等等。Optionally, the group actions of multiple characters in the image to be processed may be a certain sport or activity. For example, the group actions of multiple characters in the image to be processed may be basketball, volleyball, football or dancing, etc. .
其中,上述待处理图像包括多个人物,上述待处理图像的图像特征包括上述多个人物在待处理图像中的多帧图像中的每帧图像中的图像特征。Wherein, the image to be processed includes multiple people, and the image features of the image to be processed include image features of the multiple people in each of the multiple frames of the image to be processed.
本申请中,在确定多个人物的群体动作时,不仅考虑到了多个人物的时序特征,还考虑到了多个人物的空间特征,通过综合多个人物的时序特征和空间特征能够更好更准确地确定出多个人物的群体动作。In this application, when determining the group actions of multiple characters, not only the temporal characteristics of multiple characters are considered, but also the spatial characteristics of multiple characters are considered. By integrating the temporal characteristics and spatial characteristics of multiple characters, it can be better and more accurate. Determine the group actions of multiple characters.
当上述图像识别方法由图像识别装置执行时,上述待处理图像可以是从该图像识别装置中获取到的图像,或者,上述待处理图像也可以是该图像识别装置从其他设备接收到的图像,或者,上述待处理图像也可以是通过该图像识别装置的摄像头拍摄得到的。When the image recognition method is executed by an image recognition device, the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image received by the image recognition device from another device, Alternatively, the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
上述待处理图像,可以是一段视频中连续的多帧图像,也可以按照预设的在一段视频中按照预设规则选取的多帧图像。The above-mentioned image to be processed may be a continuous multi-frame image in a video, or a multi-frame image selected in accordance with a preset rule in a video according to a preset.
应理解,在上述待处理图像中的多个人物中,该多个人物既可以只包括人,也可以只包括动物,也可以既包括人又包括动物。It should be understood that, among the multiple characters in the above-mentioned image to be processed, the multiple characters may include only humans, or only animals, or both humans and animals.
在上述提取待处理图像的图像特征时,可以对图像中的人物进行识别,从而确定人物的边界框,每个边界框中的图像对应于图像中的一个人物,接下来,可以通过对每个边界框的图像进行特征的提取来获取每个人物的图像特征。When extracting the image features of the image to be processed above, the person in the image can be identified to determine the person's bounding box. The image in each bounding box corresponds to a person in the image. Next, you can Feature extraction is performed on the image of the bounding box to obtain the image feature of each person.
可选地,可以先识别每个人物所对应的边界框中的人物的骨骼节点,然后再根据每个人物的骨骼节点,提取该人物的图像特征向量,从而使提取的图像特征更加准确的反映人物的动作,提高提取的图像特征的准确性。Optionally, the bone node of the person in the bounding box corresponding to each person can be identified first, and then the image feature vector of the person can be extracted according to the bone node of each person, so that the extracted image features can be more accurately reflected The actions of the characters improve the accuracy of the extracted image features.
进一步,还可以根据人物结构将边界框中的骨骼节点进行连接,以得到连接图像,接下来,再对连接图像进行图像特征向量的提取。Further, the bone nodes in the bounding box can be connected according to the structure of the person to obtain a connected image, and then the image feature vector is extracted on the connected image.
或者,还可以将骨骼节点所在的区域和骨骼节点所在的区域之外的区域设置不同的颜色进行显示,得到处理后的图像,然后再对处理后的图像进行图像特征的提取。Alternatively, the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
进一步,可以根据上述人物的骨骼节点所在的图像区域确定对应于该边界框的局部可见图像,然后对该局部可见图像进行特征提取,以得到待处理图像的图像特征。Further, the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
上述局部可见图像是由包括待处理图像中的人物的骨骼节点所在的区域组成的图像。具体地,可以将边界框中人物的骨骼节点所在区域之外的区域进行遮掩,以得到所述局部可见图像。The above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
在确定多个人物中的某个人物的时序特征时,可以通过该人物在不同帧图像中的不同动作之间的图像特征向量之间的相似度来确定该人物的不同时刻动作之间的时间关联关系,进而得到该人物的时序特征。When determining the temporal characteristics of a certain character among multiple characters, the time between actions of the character at different moments can be determined by the similarity between the image feature vectors of different actions of the character in different frames of images Association relationship, and then get the character's time series characteristics.
假设上述待处理图像中的多帧图像具体为T帧,i为小于或等于T的正整数,则第i 帧图像表示T帧图像中相应顺序的图像;假设上述待处理图像中的多个人物具体为K个,则第j个人物表示K个人物中相应顺序的人物,i和j均为正整数。Assuming that the multi-frame images in the image to be processed are specifically T frames, and i is a positive integer less than or equal to T, then the i-th frame of image represents the images in the corresponding order in the T frame image; assuming that there are multiple people in the image to be processed Specifically for K, then the j-th character represents the characters in the corresponding order among the K characters, and both i and j are positive integers.
上述待处理的多帧图像中第i帧图像的第j个人物的时序特征是根据第j个人物在第i帧图像的图像特征与在多帧图像的其他帧图像的图像特征的相似度确定的。The timing characteristics of the j-th person in the i-th frame of the image to be processed above are determined based on the similarity between the image characteristics of the j-th person in the i-th frame and the image characteristics of other frames in the multi-frame image. of.
应理解,上述第i帧图像的第j个人物的时序特征用于表示第j个人物在第i帧图像的动作与在上述多帧图像的动作的关联关系。某个人物在两帧图像中对应的图像特征之间的相似度,可以反映该人物的动作在时间上的依赖程度。It should be understood that the timing characteristics of the j-th person in the i-th frame of image are used to indicate the relationship between the action of the j-th person in the i-th frame of image and the action of the above-mentioned multi-frame image. The similarity between the corresponding image features of a certain person in the two frames of images can reflect the degree of dependence of the person's actions on time.
如果某个人物在两帧图像中对应的图像特征的相似度越高,则该人物在两个时间点上的动作之间的关联越紧密;反之,如果某个人物在两帧图像中对应的图像特征的相似度越低,则该人物在两个时间点上的动作之间的关联越弱。If the similarity of the corresponding image features of a person in the two frames of images is higher, the relationship between the actions of the person at two points in time is closer; on the contrary, if a person corresponds to the two frames of images The lower the similarity of image features, the weaker the association between the person's actions at two points in time.
在确定多个人物的空间特征时,通过同一帧图像中不同人物之间的图像特征之间的相似度,确定该帧图像中不同人物动作之间的空间关联关系。When determining the spatial characteristics of multiple characters, the spatial correlation between the actions of different characters in the frame of image is determined by the similarity between the image characteristics of different characters in the same frame of image.
上述待处理的多帧图像中第i帧图像中多个人物中的第j个人物的空间特征是根据第i帧图像中第j个人物的图像特征与第i帧图像中除第j个人物以外的其他人物的图像特征的相似度确定的。也就是说,可以根据第i帧图像中第j个人物的图像特征与第i帧图像中除第j个人物以外的其他人物的图像特征的相似度,确定上述第i帧图像中第j个人物的空间特征。The spatial feature of the j-th person among the multiple people in the i-th frame of the above-mentioned multi-frame image to be processed is based on the image feature of the j-th person in the i-th frame image and the removal of the j-th person from the i-th frame image The similarity of the image features of other characters is determined. That is to say, the j-th person in the i-th frame image can be determined based on the similarity between the image feature of the j-th person in the i-th frame image and the image features of other people except the j-th person in the i-th frame image. The spatial characteristics of the characters.
应理解,第i帧图像中第j个人物的空间特征用于表示第i帧图像中第j个人物的动作与第i帧图像中第i帧图像中除第j个人物以外的其他人物的动作的关联关系。It should be understood that the spatial characteristics of the j-th person in the i-th frame image are used to represent the actions of the j-th person in the i-th frame image and the behavior of other people other than the j-th person in the i-th frame image in the i-th frame image. The relationship of the action.
具体地,上述第i帧图像中第j个人物的图像特征向量与除第j个人物以外的其他人物的图像特征向量的相似度,可以反映第i帧图像中第j个人物对除第j个人物以外的其他人物的动作的依赖程度。也就是说,当两个人物对应的图像特征向量的相似度越高时,这两个人物的动作之间的关联越紧密;反之,当两个人物对应的图像特征向量的相似度越低时,这两个人物的动作之间的关联越弱。Specifically, the similarity between the image feature vector of the j-th person in the i-th frame image and the image feature vector of other people except the j-th person can reflect the difference between the j-th person pair in the i-th frame image and the j-th person. The degree of dependence on the actions of characters other than personal objects. That is to say, when the similarity of the image feature vectors corresponding to two characters is higher, the correlation between the actions of the two characters is closer; conversely, when the similarity of the image feature vectors corresponding to the two characters is lower , The weaker the association between the actions of these two characters.
可选地,可以通过明氏距离(Minkowski distance)(如欧氏距离、曼哈顿距离)、余弦相似度、切比雪夫距离、汉明距离等计算上述时序特征之间和空间特征之间的相似度。Optionally, the similarity between the above-mentioned temporal features and the spatial features can be calculated by Minkowski distance (such as Euclidean distance, Manhattan distance), cosine similarity, Chebyshev distance, Hamming distance, etc. .
不同人物动作之间的空间关联关系以及相同人物动作之间的时间关联关系都可以为图像中的多人场景的类别提供重要线索。因此,本申请在图像识别过程中,通过综合考虑不同人物动作之间的空间关联关系以及相同人物动作之间的时间关联关系,能够有效提高识别的准确性。The spatial correlation between actions of different characters and the temporal correlation between actions of the same character can provide important clues to the categories of multi-person scenes in the image. Therefore, in the image recognition process of the present application, by comprehensively considering the spatial association relationship between different person actions and the time association relationship between the same person actions, the accuracy of recognition can be effectively improved.
可选地,在确定一帧图像中某个人物的动作特征时,可以将对应于一帧图像中的该人物的时序特征、空间特征、图像特征进行融合,从而得到该帧图像中该人物的动作特征。Optionally, when determining the action characteristics of a person in a frame of image, the time series, spatial, and image characteristics corresponding to the person in a frame of image can be fused to obtain the person’s behavior in the frame of image. Movement characteristics.
在对上述时序特征、空间特征、图像特征进行融合时,可以采用组合的融合方式进行融合。When fusing the above-mentioned temporal features, spatial features, and image features, a combined fusion method can be used for fusion.
例如,将对应于一帧图像中一个人物的特征进行融合,以得到该帧图像中该人物的动作特征。For example, the feature corresponding to a person in a frame of image is merged to obtain the action feature of the person in the frame of image.
进一步,在对上述多个特征进行融合时可以将待融合的特征直接相加,或者加权相加。Further, when fusing the above-mentioned multiple features, the features to be fused may be added directly or weighted.
可选地,在对上述多个特征进行融合时,可以采用级联和通道融合的方式进行融合。具体地,可以将待融合的特征的维数直接拼接,或者乘以一定系数即权重值之后进行拼接。Optionally, when fusing the above-mentioned multiple features, cascade and channel fusion can be used for fusion. Specifically, the dimensions of the features to be fused may be directly spliced, or spliced after being multiplied by a certain coefficient, that is, a weight value.
可选地,可以利用池化层对上述多个特征进行处理,以实现对上述多个特征的融合。Optionally, a pooling layer may be used to process the above-mentioned multiple features, so as to realize the fusion of the above-mentioned multiple features.
结合第一方面,在第一方面的某些实现方式中,在根据多个人物中的每个人物在待处理图像中的每帧图像中的动作特征,识别待处理图像中的多个人物的群体动作时,可以对待处理图像中的多个人物中每个人物在每帧图像中的动作特征进行分类,得到每个人物的动作,并据此确定多个人物的群体动作。In combination with the first aspect, in some implementations of the first aspect, the identification of multiple characters in the image to be processed is based on the action characteristics of each of the multiple characters in each frame of the image to be processed. For group actions, the action characteristics of each of the multiple characters in the image to be processed can be classified in each frame of the image to obtain the actions of each person, and determine the group actions of multiple characters accordingly.
可选地,可以将处理图像中的多个人物中每个人物在每帧图像中的动作特征输入分类模块,以得到对上述多个人物中每个人物动作特征的分类结果,即每个人物的动作,进而将对应的人物数量最多的动作作为多个人物的群体动作。Optionally, the action characteristics of each of the multiple characters in the processed image in each frame of the image can be input into the classification module to obtain a classification result of the action characteristics of each of the multiple characters, that is, each character Then, the action with the largest number of characters is regarded as a group action of multiple characters.
可选地,可以从多个人物中选择某一人物,将该人物在每帧图像中的动作特征输入分类模块,以得到对该人物动作特征的分类结果,即该人物的动作,进而将上述得到的该人物的动作作为待处理图像中的多个人物的群体动作。Optionally, a certain person can be selected from a plurality of people, and the action characteristics of the person in each frame of the image can be input into the classification module to obtain the classification result of the action characteristics of the person, that is, the action of the person, and then the above The obtained action of the character is used as a group action of multiple characters in the image to be processed.
结合第一方面,在第一方面的某些实现方式中,在根据多个人物中的每个人物在待处理图像中的每帧图像中的动作特征,识别待处理图像中的多个人物的群体动作时,还可以将每帧图像中多个人物的动作特征进行融合,以得到该帧图像的动作特征,再对每帧图像的动作特征进行分类,得到每帧图像的动作,并据此确定待处理图像中多个人物的群体动作。In combination with the first aspect, in some implementations of the first aspect, the identification of multiple characters in the image to be processed is based on the action characteristics of each of the multiple characters in each frame of the image to be processed. In group action, the action features of multiple people in each frame of image can also be merged to obtain the action feature of the frame of image, and then the action feature of each frame of image is classified to obtain the action of each frame of image, and based on this Determine the group actions of multiple characters in the image to be processed.
可选地,可以将每帧图像中多个人物的动作特征进行融合,以得到该帧图像的动作特征,再将每帧图像的动作特征分别输入分类模块,以得到每帧图像的动作分类结果,将分类模块的输出类别中对应的上述待处理图像中图像数量最多的一个分类结果作为待处理图像中的多个人物的群体动作。Optionally, the action features of multiple characters in each frame of image can be fused to obtain the action feature of the frame of image, and then the action feature of each frame of image can be input into the classification module to obtain the action classification result of each frame of image Taking a classification result with the largest number of images in the image to be processed corresponding to the output category of the classification module as a group action of multiple people in the image to be processed.
可选地,可以将每帧图像中多个人物的动作特征进行融合,以得到该帧图像的动作特征,再对上述得到的每帧图像的动作特征取平均值,以得到每帧图像的平均动作特征,然后将该每帧图像的平均动作特征输入分类模块,进而将该每帧图像的平均动作特征所对应的分类结果作为待处理图像中的多个人物的群体动作。Optionally, the action features of multiple people in each frame of image can be fused to obtain the action feature of the frame of image, and then the action feature of each frame of image obtained above can be averaged to obtain the average of each frame of image Action feature, and then input the average action feature of each frame of image to the classification module, and then the classification result corresponding to the average action feature of each frame of image is regarded as the group action of multiple people in the image to be processed.
可选地,可以从待处理图像中选择一帧图像,将该帧图像中根据多个人物的动作特征融合得到的该帧图像的动作特征输入分类模块,以得到对该帧图像的分类结果,进而将对该帧图像的分类结果作为待处理图像中的多个人物的群体动作。Optionally, a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
结合第一方面,在第一方面的某些实现方式中,在识别出待处理图像中的多个人物的群体动作后,根据该群体动作生成待处理图像的标签信息,该标签信息用于指示待处理图像中多个人物的群体动作。In combination with the first aspect, in some implementations of the first aspect, after identifying group actions of multiple characters in the image to be processed, tag information of the image to be processed is generated according to the group action, and the tag information is used to indicate Group actions of multiple characters in the image to be processed.
上述方式例如可以用于对视频库进行分类,将该视频库中的不同视频根据其对应的群体动作打上标签,便于用户查看和查找。The foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
结合第一方面,在第一方面的某些实现方式中,在识别出待处理图像中的多个人物的群体动作后,根据该群体动作确定待处理图像中的关键人物。With reference to the first aspect, in some implementations of the first aspect, after identifying group actions of multiple characters in the image to be processed, the key person in the image to be processed is determined according to the group actions.
可选地,先确定待处理图像中多个人物中每个人物对上述群体动作的贡献度,再将贡献度最高的人物确定为关键人物。Optionally, first determine the contribution of each of the multiple characters in the image to be processed to the above group action, and then determine the person with the highest contribution as the key person.
应理解,该关键人物对多个人物的群体动作的贡献度大于多个人物中除关键人物之外的其他人物的贡献度。It should be understood that the contribution of the key person to the group actions of the multiple characters is greater than the contribution of other characters among the multiple characters except the key person.
上述方式例如可以用于检测视频图像中的关键人物,通常情况下,视频中包含若干人 物,其中大部分人并不重要。有效地检测出关键人物有助于根据关键人物周围信息,更加快速和准确地理解视频内容。The above method can be used to detect key persons in a video image, for example. Generally, the video contains several persons, most of which are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
例如假设一段视频是一场球赛,则控球的球员对在场包括球员、裁判和观众等所有人员的影响最大,对群体动作的贡献度也最高,因此可以将该控球的球员确定为关键人物,通过确定关键人物,能够帮助观看视频的人理解比赛正在和即将发生的事情。For example, if a video is a ball game, the player who holds the ball has the greatest impact on all personnel present, including players, referees, and spectators, and also contributes the most to the group action. Therefore, the player who holds the ball can be identified as the key person. , By identifying key people, it can help people watching the video understand what is going on and what is about to happen in the game.
第二方面,提供了一种图像识别方法,该方法包括:提取待处理图像的图像特征;确定多个人物在每帧待处理图像中的空间特征;确定多个人物在每帧待处理图像中的动作特征,,根据上述多个人物在每帧待处理图像中的动作特征识别待处理图像中的多个人物的群体动作。In a second aspect, an image recognition method is provided. The method includes: extracting image features of an image to be processed; determining the spatial characteristics of multiple people in each frame of the image to be processed; determining that multiple people are in each frame of the image to be processed Based on the action characteristics of the multiple characters in each frame of the image to be processed, the group actions of multiple characters in the image to be processed are recognized based on the action features of the multiple characters in each frame of the image to be processed.
其中,上述多个人物在待处理图像中的动作特征,是由上述多个人物在该待处理图像中的空间特征和在该待处理图像中的图像特征融合得到的。Wherein, the action features of the multiple people in the image to be processed are obtained by fusing the spatial features of the multiple people in the image to be processed and the image features in the image to be processed.
上述待处理图像可以是一帧图像,或者,可以是多帧连续或非连续的图像。The above-mentioned image to be processed may be one frame of image, or may be multiple frames of continuous or non-continuous images.
本申请中,在确定多个人物的群体动作时,只考虑多个人物的空间特征,而无需计算每个人物的时序特征,特别适用于人物空间特征的确定不依赖于人物时序特征的情况,能够更便于确定出多个人物的群体动作。又例如,当只对一帧图像进行识别时,不存在同一人物在不同时间的时序特征,该方法也更为适用。In this application, when determining the group actions of multiple characters, only the spatial characteristics of the multiple characters are considered, without calculating the temporal characteristics of each character, which is especially suitable for the situation where the determination of the spatial characteristics of the characters does not depend on the temporal characteristics of the characters. It is easier to determine the group actions of multiple characters. For another example, when only one frame of image is recognized, there is no time sequence characteristic of the same person at different times, and this method is more suitable.
当上述图像识别方法由图像识别装置执行时,上述待处理图像可以是从该图像识别装置中获取到的图像,或者,上述待处理图像也可以是该图像识别装置从其他设备接收到的图像,或者,上述待处理图像也可以是通过该图像识别装置的摄像头拍摄得到的。When the image recognition method is executed by an image recognition device, the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image received by the image recognition device from another device, Alternatively, the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
上述待处理图像,可以是一段视频中一帧图像或连续的多帧图像,也可以按照预设的在一段视频中按照预设规则选取的一帧或多帧图像。The above-mentioned image to be processed may be one frame of image or continuous multiple frames of image in a piece of video, or one or multiple frames of image selected according to preset rules in a piece of video according to a preset.
应理解,在上述待处理图像中的多个人物中,该多个人物既可以只包括人,也可以只包括动物,也可以既包括人又包括动物。It should be understood that, among the multiple characters in the above-mentioned image to be processed, the multiple characters may include only humans, or only animals, or both humans and animals.
在提取待处理图像的图像特征时,可以对图像中的人物进行识别,从而确定人物的边界框,每个边界框中的图像对应于图像中的一个人物,接下来,可以通过对每个边界框的图像进行特征的提取,以获取每个人物的图像特征。When extracting the image features of the image to be processed, the person in the image can be identified to determine the bounding box of the person. The image in each bounding box corresponds to a person in the image. Feature extraction is performed on the image of the frame to obtain the image feature of each person.
可选地,可以先识别每个人物所对应的边界框中的人物的骨骼节点,然后再根据每个人物的骨骼节点,提取该人物的图像特征,从而使提取的图像特征更加准确的反映人物的动作,提高提取的图像特征的准确性。Optionally, the bone node of the person in the bounding box corresponding to each person can be identified first, and then the image features of the person can be extracted according to the bone node of each person, so that the extracted image features more accurately reflect the person The action to improve the accuracy of the extracted image features.
进一步,还可以根据人物结构将边界框中的骨骼节点进行连接,以得到连接图像,接下来再对连接图像进行图像特征向量的提取。Further, it is also possible to connect the bone nodes in the bounding box according to the character structure to obtain a connected image, and then extract the image feature vector of the connected image.
或者,还可以将骨骼节点所在的区域和骨骼节点所在的区域之外的区域通过不同的颜色进行显示,得到处理后的图像,然后再对处理后得到的图像进行图像特征的提取。Alternatively, the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
进一步,可以根据上述人物的骨骼节点所在的图像区域确定对应于该边界框的局部可见图像,然后对所该局部可见图像进行特征提取,以得到所述待处理图像的图像特征。Further, a locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
上述局部可见图像是由待处理图像中的人物的骨骼节点所在的区域组成的图像。具体地,可以将边界框中人物的骨骼节点所在区域之外的区域进行遮掩,以得到该局部可见图像。The above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located can be masked to obtain the partially visible image.
在确定多个人物的空间特征时,通过同一帧图像中不同人物之间的图像特征之间的相 似度,确定该帧图像中不同人物动作之间的空间关联关系。When determining the spatial characteristics of multiple characters, the spatial correlation between the actions of different characters in the same frame of image is determined by the similarity between the image characteristics of different characters in the same frame of image.
上述待处理的多帧图像中第i帧图像中多个人物中的第j个人物的空间特征是根据第i帧图像中第j个人物的图像特征与其他人物的图像特征的相似度确定的。也就是说,可以根据第i帧图像中第j个人物的图像特征与其他人物的图像特征的相似度,确定上述第i帧图像中第j个人物的空间特征。The spatial characteristics of the j-th person among the multiple people in the i-th frame of the above-mentioned multi-frame image to be processed are determined based on the similarity between the image characteristics of the j-th person in the i-th frame and the image characteristics of other people. . That is to say, the spatial characteristics of the j-th person in the i-th frame image can be determined according to the similarity between the image characteristics of the j-th person in the i-th frame image and the image characteristics of other people.
应理解,第i帧图像中第j个人物的空间特征用于表示第i帧图像中第j个人物的动作与该第i帧图像中除第j个人物以外的其他人物的动作的关联关系。It should be understood that the spatial characteristics of the j-th person in the i-th frame image are used to represent the relationship between the actions of the j-th person in the i-th frame and the actions of other people in the i-th frame of the image except for the j-th person. .
具体地,第i帧图像中第j个人物的图像特征向量与第i帧图像中除第j个人物以外的其他人物的图像特征向量的相似度,可以反映第i帧图像中第j个人物对其他人物的动作的依赖程度。也就是说,两个人物对应的图像特征向量的相似度越高,这两个的动作之间的关联越紧密;反之,相似度越低,两个人物的动作之间的关联越弱。Specifically, the similarity between the image feature vector of the j-th person in the i-th frame image and the image feature vector of other people in the i-th frame image except for the j-th person can reflect the similarity of the j-th person in the i-th frame image The degree of dependence on the actions of other characters. That is to say, the higher the similarity of the image feature vectors corresponding to the two characters, the closer the association between the two actions; conversely, the lower the similarity, the weaker the association between the actions of the two characters.
可选地,可以通过明氏距离(Minkowski distance)(如欧氏距离、曼哈顿距离)、余弦相似度、切比雪夫距离、汉明距离等计算上述空间特征之间的相似度。Optionally, the similarity between the aforementioned spatial features can be calculated by Minkowski distance (such as Euclidean distance, Manhattan distance), cosine similarity, Chebyshev distance, Hamming distance, and the like.
可选地,在确定一帧图像中某个人物的动作特征时,可以将对应于一帧图像中该人物的空间特征、图像特征进行融合,从而得到该帧图像中该人物的动作特征。Optionally, when determining the action feature of a person in a frame of image, the spatial feature and image feature corresponding to the person in a frame of image can be fused to obtain the action feature of the person in the frame of image.
在对上述空间特征、图像特征进行融合时,可以采用组合的融合方式进行融合。When fusing the above-mentioned spatial features and image features, a combined fusion method can be used for fusion.
例如,将对应于一帧图像中一个人物的特征进行融合,以得到该帧图像中该人物的动作特征。For example, the feature corresponding to a person in a frame of image is merged to obtain the action feature of the person in the frame of image.
进一步,在对上述多个特征进行融合时,可以将待融合的特征直接相加,或者加权相加。Further, when fusing the above-mentioned multiple features, the features to be fused may be added directly, or weighted.
可选地,在对上述多个特征进行融合时,可以采用级联和通道融合的方式进行融合。具体地,可以将待融合的特征的维数直接拼接,或者乘以一定系数即权重值之后进行拼接。Optionally, when fusing the above-mentioned multiple features, cascade and channel fusion can be used for fusion. Specifically, the dimensions of the features to be fused may be directly spliced, or spliced after being multiplied by a certain coefficient, that is, a weight value.
可选地,可以利用池化层对上述多个特征进行处理,以实现对上述多个特征的融合。Optionally, a pooling layer may be used to process the above-mentioned multiple features, so as to realize the fusion of the above-mentioned multiple features.
With reference to the second aspect, in some implementations of the second aspect, when recognizing the group action of the multiple persons in the image to be processed according to the action features of the multiple persons in each frame of the image to be processed, the action features of each of the multiple persons in each frame of image may be classified to obtain the action of each person, and the group action of the multiple persons may be determined accordingly.Optionally, the action features of each of the multiple persons in each frame of image may be input into a classification module to obtain a classification result for the action features of each of the multiple persons, that is, the action of each person; the action corresponding to the largest number of persons is then taken as the group action of the multiple persons.Optionally, one person may be selected from the multiple persons, and the action features of that person in each frame of image may be input into the classification module to obtain a classification result for that person's action features, that is, the action of that person; the action of that person obtained in this way is then taken as the group action of the multiple persons in the image to be processed.
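A minimal sketch of the two alternatives above, assuming a classification module classify_person that maps a person's per-frame action features to one action label (the classifier itself is not specified here and is an assumption):

    from collections import Counter

    def group_action_by_vote(person_action_features, classify_person):
        # person_action_features: dict {person_id: per-frame action features of this person}
        # classify_person: assumed classifier returning one action label per person
        labels = [classify_person(feats) for feats in person_action_features.values()]
        return Counter(labels).most_common(1)[0][0]   # action shared by the most persons

    def group_action_by_selected_person(person_action_features, classify_person, person_id):
        # alternative: take the predicted action of one chosen person as the group action
        return classify_person(person_action_features[person_id])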
With reference to the second aspect, in some implementations of the second aspect, when recognizing the group action of the multiple persons in the image to be processed according to the action features of the multiple persons in each frame of the image to be processed, the action features of the multiple persons in each frame of image may also be fused to obtain the action feature of that frame of image; the action feature of each frame of image is then classified to obtain the action of each frame of image, and the group action of the multiple persons in the image to be processed is determined accordingly.Optionally, the action features of the multiple persons in each frame of image may be fused to obtain the action feature of that frame of image, and the action feature of each frame of image may then be input into the classification module to obtain the action classification result of each frame of image; among the output categories of the classification module, the classification result corresponding to the largest number of frames of the image to be processed is taken as the group action of the multiple persons in the image to be processed.Optionally, the action features of the multiple persons in each frame of image may be fused to obtain the action feature of that frame of image, the action features of the frames obtained in this way may then be averaged to obtain an average action feature, the average action feature may be input into the classification module, and the classification result corresponding to the average action feature is taken as the group action of the multiple persons in the image to be processed.Optionally, one frame of image may be selected from the image to be processed, and the action feature of that frame, obtained by fusing the action features of the multiple persons in that frame, may be input into the classification module to obtain the classification result of that frame of image; the classification result of that frame of image is then taken as the group action of the multiple persons in the image to be processed.
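The frame-level alternative above may be sketched as follows, assuming max pooling as the per-frame fusion and majority voting over the per-frame classification results; classify_frame is an assumed classification module, not part of this application's definition:

    import numpy as np
    from collections import Counter

    def frame_feature(per_person_feats):
        # fuse the action features of all persons in one frame (max pooling is one option)
        return np.max(np.stack(per_person_feats), axis=0)

    def group_action_from_frames(frames, classify_frame):
        # frames: list with one entry per frame, each a list of per-person action features
        # classify_frame: assumed classification module returning a label for a frame feature
        frame_labels = [classify_frame(frame_feature(p)) for p in frames]
        return Counter(frame_labels).most_common(1)[0][0]   # label covering the most frames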
With reference to the second aspect, in some implementations of the second aspect, after the group action of the multiple persons in the image to be processed is recognized, tag information of the image to be processed is generated according to the group action, where the tag information is used to indicate the group action of the multiple persons in the image to be processed.
上述方式例如可以用于对视频库进行分类,将该视频库中的不同视频根据其对应的群体动作打上标签,便于用户查看和查找。The foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
结合第二方面,在第二方面的某些实现方式中,在识别出待处理图像中的多个人物的群体动作后,根据该群体动作确定待处理图像中的关键人物。In combination with the second aspect, in some implementations of the second aspect, after identifying group actions of multiple characters in the image to be processed, the key person in the image to be processed is determined according to the group actions.
可选地,先确定待处理图像中多个人物中每个人物对上述群体动作的贡献度,再将贡献度最高的人物确定为关键人物。Optionally, first determine the contribution of each of the multiple characters in the image to be processed to the above group action, and then determine the person with the highest contribution as the key person.
应理解,该关键人物对多个人物的群体动作的贡献度大于多个人物中除关键人物之外的其他人物的贡献度。It should be understood that the contribution of the key person to the group actions of the multiple characters is greater than the contribution of other characters among the multiple characters except the key person.
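A minimal sketch of selecting the key person, assuming that a contribution score of each person to the group action is already available (for example, derived from attention or affinity weights); the scores shown are hypothetical:

    def find_key_person(contribution_scores: dict) -> str:
        # contribution_scores: {person_id: contribution of this person to the group action};
        # how the scores are obtained is assumed, not specified here
        return max(contribution_scores, key=contribution_scores.get)

    # Example: the ball-handling player would be expected to receive the highest score
    print(find_key_person({"player_1": 0.12, "player_7": 0.55, "referee": 0.03}))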
上述方式例如可以用于检测视频图像中的关键人物,通常情况下,视频中包含若干人物,其中大部分人并不重要。有效地检测出关键人物有助于根据关键人物周围信息,更加快速和准确地理解视频内容。The above method can be used to detect key persons in a video image, for example. Generally, the video contains several persons, most of which are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
For example, assuming that a video is of a ball game, the player in possession of the ball has the greatest influence on everyone present, including the players, referees and spectators, and contributes the most to the group action. Therefore, the player in possession of the ball can be determined as the key person. Identifying the key person helps viewers understand what is happening and what is about to happen in the game.In a third aspect, an image recognition method is provided. The method includes: extracting image features of an image to be processed; determining the dependency relationships between different persons in the image to be processed and the dependency relationships between the actions of the same person at different moments; fusing the image features with the spatio-temporal feature vectors to obtain the action feature of each frame of the image to be processed; and performing classification prediction on the action feature of each frame of image to determine the group action category of the image to be processed.In this application, the complex reasoning process of group action recognition is completed. When determining the group action of multiple persons, not only the temporal features of the multiple persons but also their spatial features are taken into consideration; by combining the temporal features and spatial features of the multiple persons, the group action of the multiple persons can be determined better and more accurately.Optionally, when extracting the image features of the image to be processed, target tracking may be performed on each person to determine the bounding box of each person in each frame of image, where the image in each bounding box corresponds to one person; feature extraction is then performed on the image in each bounding box to obtain the image feature of each person.
在上述提取待处理图像的图像特征时,还可以通过识别人物的骨骼节点,对图像特征进行提取,以减少特征提取过程中图像的冗余信息的影响,提高特征提取的准确性。具体地,可以根据骨骼节点,利用卷积网络提取图像特征。When extracting the image features of the image to be processed, the image features can also be extracted by identifying the bone nodes of the person, so as to reduce the influence of the redundant information of the image during the feature extraction process and improve the accuracy of feature extraction. Specifically, a convolutional network can be used to extract image features based on bone nodes.
可选地,可以根据人物结构,将边界框中的骨骼节点进行连接,以得到连接图像,然后对该连接图像进行图像特征向量的提取。或者,还可以将骨骼节点所在的区域和骨骼节点所在的区域之外的区域通过不同的颜色进行显示,然后再对处理后的图像进行图像特征的提取。Optionally, the bone nodes in the bounding box may be connected according to the structure of the person to obtain a connected image, and then image feature vector extraction is performed on the connected image. Alternatively, the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors, and then image feature extraction is performed on the processed image.
进一步,可以根据上述人物的骨骼节点所在的图像区域确定对应于该边界框的局部可见图像,然后对该局部可见图像进行特征提取,以得到待处理图像的图像特征。Further, the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
上述局部可见图像是由包括待处理图像中的人物的骨骼节点所在的区域组成的图像。具体地,可以将边界框中人物的骨骼节点所在区域之外的区域进行遮掩,以得到所述局部可见图像。The above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
可选地,可以根据人物的图像和骨骼节点,计算人物动作遮掩矩阵。遮掩矩阵中每个点对应于一个像素。遮掩矩阵中,以骨骼点为中心,边长为l的方形区域内的值设置为1,其他位置的值设置为0。Optionally, the character's action concealment matrix can be calculated according to the character's image and bone nodes. Each point in the masking matrix corresponds to a pixel. In the masking matrix, the value in the square area with the bone point as the center and side length l is set to 1, and the values in other positions are set to 0.
进一步,可以采用RGB色彩模式进行遮掩。RGB色彩模式使用RGB模型为图像中每一个像素的RGB分量分配一个0至255范围内的强度值。用遮掩矩阵对原始人物动作图片进行遮掩,得到局部可见图像。Further, the RGB color mode can be used for masking. The RGB color model uses the RGB model to assign an intensity value in the range of 0 to 255 for the RGB component of each pixel in the image. The masking matrix is used to mask the original character action pictures to obtain a partially visible image.
Optionally, a square region with side length l around each of the skeleton nodes is retained, and the other regions are masked.
对于每个人物,利用局部可见图像进行图像特征的提取,可以减少边界框中的冗余信息,可以根据人物的结构信息,提取图像特征,增强图像特征中对于人物动作的表现能力。For each person, the use of locally visible images for image feature extraction can reduce the redundant information in the bounding box, and can extract image features based on the structure information of the person, and enhance the performance of the person's actions in the image features.
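The masking described above may be sketched as follows, assuming the skeleton nodes are given as pixel coordinates inside a person's bounding-box crop; the function name and the exact keypoint format are assumptions, while the square regions of side length l follow the description of the masking matrix:

    import numpy as np

    def partially_visible_image(crop: np.ndarray, keypoints, l: int) -> np.ndarray:
        # crop: H x W x 3 RGB image inside one person's bounding box
        # keypoints: list of (x, y) skeleton-node coordinates inside the crop
        # l: side length of the square kept around each skeleton node
        h, w = crop.shape[:2]
        mask = np.zeros((h, w), dtype=np.uint8)   # masking matrix, one entry per pixel
        half = l // 2
        for x, y in keypoints:
            y0, y1 = max(0, y - half), min(h, y + half + 1)
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            mask[y0:y1, x0:x1] = 1                # 1 inside the squares, 0 elsewhere
        return crop * mask[:, :, None]            # occlude everything outside the squares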
When determining the dependency relationships between different persons in the image to be processed and the dependency relationships between the actions of the same person at different moments as described above, a cross interaction module may be used to determine the temporal correlation of the body postures of the persons in the multiple frames of images, and/or to determine the spatial correlation of the body postures of the persons in the multiple frames of images.
可选地,将上述十字交互模块用于实现特征的交互,建立特征交互模型,特征交互模型用于表示人物的身体姿态在时间上和/或空间上的关联关系。Optionally, the above-mentioned cross interaction module is used to realize the interaction of features to establish a feature interaction model, and the feature interaction model is used to represent the association relationship of the body posture of the character in time and/or space.
可选地,通过计算同一帧图像中不同的人物的图像特征之间的相似度,可以确定同一帧图像中不同的人物的身体姿态之间的空间依赖。所述空间依赖用于表示在某一帧图像中一人物的身体姿态对在其他人物的身体姿态的依赖,即人物动作间的空间依赖。可以通过空间特征向量表示空间依赖性。Optionally, by calculating the similarity between the image features of different characters in the same frame of image, the spatial dependence between the body postures of different characters in the same frame of image can be determined. The spatial dependence is used to indicate the dependence of the body posture of a character on the body posture of other characters in a certain frame of image, that is, the spatial dependence between the actions of the characters. The spatial dependency can be expressed by the spatial feature vector.
可选地,通过计算同一人物在不同时间的图像特征之间的相似度,可以确定同一人物在不同时间的身体姿态之间的时间依赖。所述时间依赖也可以称为时序依赖,用于表示在某一帧图像中该人物的身体姿态对在其他视频帧中该人物的身体姿态的依赖,即一个动作内在的时序依赖。可以通过时序特征向量表示时间依赖性。Optionally, by calculating the similarity between image features of the same person at different times, the time dependence between the body postures of the same person at different times can be determined. The time dependence may also be referred to as timing dependence, which is used to indicate the dependence of the body posture of the character in a certain frame of image on the body posture of the character in other video frames, that is, the inherent temporal dependence of an action. The time dependence can be expressed by the time series feature vector.
可以根据待处理图像中第k个人物的空间特征向量和时序特征向量,计算得到第k个人物的时-空特征向量。The spatio-temporal feature vector of the k-th person can be calculated according to the spatial feature vector and the time-series feature vector of the k-th person in the image to be processed.
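A possible sketch of computing the spatial dependence (between different persons in the same frame), the temporal dependence (of the same person across frames), and a combined spatio-temporal feature vector from pairwise similarities is given below, assuming the image features are arranged as an array X of shape (T, K, D); cosine similarity with softmax normalization, and combination by addition, are illustrative choices rather than requirements of this application:

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def spatio_temporal_features(X: np.ndarray) -> np.ndarray:
        # X: (T, K, D) image features of K persons over T frames
        Xn = X / (np.linalg.norm(X, axis=-1, keepdims=True) + 1e-12)
        # spatial dependence: persons within the same frame (K x K similarity per frame);
        # the diagonal (self-similarity) could be masked out to restrict to "other persons"
        spatial_att = softmax(np.einsum("tkd,tqd->tkq", Xn, Xn), axis=-1)
        spatial_feat = np.einsum("tkq,tqd->tkd", spatial_att, X)
        # temporal dependence: the same person across frames (T x T similarity per person)
        temporal_att = softmax(np.einsum("tkd,skd->kts", Xn, Xn), axis=-1)
        temporal_feat = np.einsum("kts,skd->tkd", temporal_att, X)
        # combine the two into one spatio-temporal feature vector per person per frame
        return spatial_feat + temporal_feat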
In the above process of fusing the image features with the spatio-temporal feature vectors to obtain the action feature of each frame of the image to be processed, the image features in the set X ∈ R^(T×K×D) of image features of the K persons in the images at the T moments are fused with the spatio-temporal feature vectors in the set H ∈ R^(T×K×D) of spatio-temporal feature vectors of the K persons in the images at the T moments, so as to obtain the action feature of each of the images at the T moments.Optionally, the image feature of the k-th person at time t is fused with the corresponding spatio-temporal feature vector to obtain the person feature vector of the k-th person at time t; alternatively, the image feature and the spatio-temporal feature vector are connected by a residual connection to obtain the person feature vector. According to the person feature vector of each of the K persons, the set of person feature vectors of the K persons at time t is determined. Maximum pooling is performed on the set of person feature vectors to obtain the action feature vector.
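A minimal sketch of the fusion at time t described above, assuming the image features and spatio-temporal feature vectors of the K persons are given as (K, D) arrays; the residual connection is realized here simply as an element-wise addition:

    import numpy as np

    def frame_action_feature(X_t: np.ndarray, H_t: np.ndarray) -> np.ndarray:
        # X_t: (K, D) image features of the K persons at time t
        # H_t: (K, D) spatio-temporal feature vectors of the same persons
        person_feats = X_t + H_t           # residual connection (fusion)
        return person_feats.max(axis=0)    # maximum pooling over the K persons -> (D,)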
在上述根据动作特征进行分类预测,以确定待处理图像的群体动作类别的过程中,可以采用不同的方式得到所述群体动作的分类结果。In the foregoing process of classifying and predicting based on the action characteristics to determine the group action category of the image to be processed, the classification result of the group action can be obtained in different ways.
可选地,将t时刻的动作特征向量输入分类模块,以得到对该帧图像的分类结果。可以将分类模块对任意t时刻的所述图像特征向量的分类结果作为T帧图像中的群体动作的分类结果。T帧图像中的群体动作的分类结果也可以理解为T帧图像中的人物的群体动作的分类结果,或者T帧图像的分类结果。Optionally, the action feature vector at time t is input to the classification module to obtain the classification result of the frame image. The classification result of the image feature vector at any time t by the classification module may be used as the classification result of the group action in the T frame image. The classification result of the group action in the T frame image can also be understood as the classification result of the group action of the person in the T frame image, or the classification result of the T frame image.
可选地,将T帧图像的动作特征向量分别输入分类模块,以得到每帧图像的分类结果。T帧图像的分类结果可以属于一个或多个类别。可以将分类模块的输出类别中对应的T帧图像中图像数量最多的一个类别作为T帧图像中的群体动作的分类结果。Optionally, the action feature vectors of the T frame images are respectively input to the classification module to obtain the classification result of each frame of image. The classification result of the T frame image can belong to one or more categories. The category with the largest number of images in the corresponding T-frame image in the output category of the classification module can be used as the classification result of the group action in the T-frame image.
可选地,对T帧图像的动作特征向量取平均值,以得到平均特征向量。平均特征向量中的每一位为T帧图像的图像特征向量表示中对应位的平均值。可以将平均特征向量输入分类模块,以得到T帧图像中的群体动作的分类结果。Optionally, average the action feature vectors of the T frame images to obtain the average feature vector. Each bit in the average feature vector is the average value of the corresponding bit in the image feature vector representation of the T frame image. The average feature vector can be input to the classification module to obtain the classification result of the group action in the T frame image.
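The averaging alternative may be sketched as follows, assuming the per-frame action feature vectors are stacked into a (T, D) array and classify is an assumed classification module:

    import numpy as np

    def group_action_by_average(frame_action_feats: np.ndarray, classify) -> int:
        # frame_action_feats: (T, D) action feature vector of each of the T frames
        # classify: assumed classification module mapping a D-dim vector to a class label
        avg_feat = frame_action_feats.mean(axis=0)   # element-wise average over the T frames
        return classify(avg_feat)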
The above method can complete the complex reasoning process of group action recognition: the image features of the multiple frames of images are determined; the temporal features and spatial features are determined according to the interdependence between different persons in the images and between actions at different times; the image features are then fused with these features to obtain the action feature of each frame of image; and the group action of the multiple frames of images is finally inferred by classifying the action feature of each frame of image.
第四方面,提供了一种图像识别装置,该图像识别装置具有实现第一方面至第三方面或其任意可能的实现方式中的方法的功能。In a fourth aspect, an image recognition device is provided, and the image recognition device has the function of implementing the methods in the first to third aspects or any possible implementation manners thereof.
可选地,该图像识别装置包括实现第一方面至第三方面中的任意一种实现方式中的实现方式中的方法的单元。Optionally, the image recognition device includes a unit that implements the method in any one of the first aspect to the third aspect.
第五方面,提供了一种神经网络的训练装置,该训练装置具有实现第一方面至第三方面中的任意一种实现方式中的方法的单元。In a fifth aspect, a neural network training device is provided, and the training device has a unit for implementing the method in any one of the first aspect to the third aspect.
In a sixth aspect, an image recognition apparatus is provided. The apparatus includes: a memory, configured to store a program; and a processor, configured to execute the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to perform the method in any one of the implementations of the first aspect to the third aspect.In a seventh aspect, a neural network training apparatus is provided. The apparatus includes: a memory, configured to store a program; and a processor, configured to execute the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to perform the method in any one of the implementations of the first aspect to the third aspect.
第八方面,提供了一种电子设备,该电子设备包括第四方面或者第六方面中的图像识别装置。In an eighth aspect, an electronic device is provided, and the electronic device includes the image recognition apparatus in the fourth aspect or the sixth aspect.
上述第八方面中的电子设备具体可以是移动终端(例如,智能手机),平板电脑,笔记本电脑,增强现实/虚拟现实设备以及车载终端设备等等。The electronic device in the above eighth aspect may specifically be a mobile terminal (for example, a smart phone), a tablet computer, a notebook computer, an augmented reality/virtual reality device, a vehicle-mounted terminal device, and so on.
In a ninth aspect, a computer device is provided. The computer device includes the neural network training apparatus in the fifth aspect or the seventh aspect.
该计算机设备具体可以是计算机、服务器、云端设备或者具有一定计算能力能够实现对神经网络的训练的设备。The computer device may specifically be a computer, a server, a cloud device, or a device with a certain computing capability that can implement neural network training.
In a tenth aspect, this application provides a computer-readable storage medium. The computer-readable storage medium stores computer instructions, and when the computer instructions are run on a computer, the computer is enabled to perform the method in any one of the implementations of the first aspect to the third aspect.In an eleventh aspect, this application provides a computer program product. The computer program product includes computer program code, and when the computer program code is run on a computer, the computer is enabled to perform the method in any one of the implementations of the first aspect to the third aspect.In a twelfth aspect, a chip is provided. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory, to perform the method in any one of the implementations of the first aspect to the third aspect.Optionally, as an implementation, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to perform the method in any one of the implementations of the first aspect to the third aspect.
上述芯片具体可以是现场可编程门阵列FPGA或者专用集成电路ASIC。The above-mentioned chip may specifically be a field programmable gate array FPGA or an application-specific integrated circuit ASIC.
附图说明Description of the drawings
图1是本申请实施例提供的一种应用环境的示意图;FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present application;
图2是本申请实施例提供的一种应用环境的示意图;FIG. 2 is a schematic diagram of an application environment provided by an embodiment of the present application;
图3是本申请实施例提供的一种群体动作识别的方法的示意性流程图;FIG. 3 is a schematic flowchart of a method for group action recognition provided by an embodiment of the present application;
图4是本申请实施例提供的一种群体动作识别的方法的示意性流程图;FIG. 4 is a schematic flowchart of a method for group action recognition provided by an embodiment of the present application;
图5是本申请实施例提供的一种系统架构的示意图;FIG. 5 is a schematic diagram of a system architecture provided by an embodiment of the present application;
图6是本申请实施例提供的一种卷积神经网络结构示意图;FIG. 6 is a schematic diagram of a convolutional neural network structure provided by an embodiment of the present application;
图7是本申请实施例提供的一种芯片硬件结构的示意图;FIG. 7 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application;
图8是本申请实施例提供的一种神经网络模型的训练方法的示意性流程图;FIG. 8 is a schematic flowchart of a method for training a neural network model provided by an embodiment of the present application;
图9是本申请实施例提供的一种图像识别方法的示意性流程图;FIG. 9 is a schematic flowchart of an image recognition method provided by an embodiment of the present application;
图10是本申请实施例提供的一种图像识别方法的示意性流程图;FIG. 10 is a schematic flowchart of an image recognition method provided by an embodiment of the present application;
图11是本申请实施例提供的一种图像识别方法的示意性流程图;FIG. 11 is a schematic flowchart of an image recognition method provided by an embodiment of the present application;
图12是本申请实施例提供的一种获取局部可见图像的过程的示意图;FIG. 12 is a schematic diagram of a process of obtaining a partially visible image provided by an embodiment of the present application;
图13是本申请实施例提供的一种计算图像特征之间相似度的方法的示意图;FIG. 13 is a schematic diagram of a method for calculating similarity between image features according to an embodiment of the present application;
图14是本申请实施例提供的不同人物动作在空间上的关系的示意图;FIG. 14 is a schematic diagram of the spatial relationship between different character actions provided by an embodiment of the present application;
图15是本申请实施例提供的不同人物动作在空间上的关系的示意图;15 is a schematic diagram of the spatial relationship of different character actions provided by an embodiment of the present application;
图16是本申请实施例提供的一个人物的动作在时间上的关系的示意图;FIG. 16 is a schematic diagram of the relationship in time of a character's actions according to an embodiment of the present application;
图17是本申请实施例提供的一个人物的动作在时间上的关系的示意图;FIG. 17 is a schematic diagram of the relationship in time of a character's actions according to an embodiment of the present application;
图18是本申请实施例提供的一种图像识别网络的系统架构的示意图;FIG. 18 is a schematic diagram of a system architecture of an image recognition network provided by an embodiment of the present application;
图19是本申请实施例提供的一种图像识别装置的结构示意图;FIG. 19 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application;
图20是本申请实施例提供的一种图像识别装置的结构示意图;FIG. 20 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application;
图21是本申请实施例提供的一种神经网络训练装置的结构示意图。FIG. 21 is a schematic structural diagram of a neural network training device provided by an embodiment of the present application.
Detailed description of embodiments
下面将结合附图,对本申请中的技术方案进行描述。The technical solution in this application will be described below in conjunction with the accompanying drawings.
本申请的方案可以应用在视频分析、视频识别、异常或危险的行为检测等需要对多人 复杂场景的视频分析的领域。该视频例如可以是体育比赛视频、日常监控视频等。下面对两种常用的应用场景进行简单的介绍。The solution of the present application can be applied to the fields of video analysis, video recognition, abnormal or dangerous behavior detection, etc., which require video analysis of complex scenes of multiple people. The video may be, for example, a sports game video, a daily surveillance video, and the like. Two commonly used application scenarios are briefly introduced below.
应用场景一:视频管理系统Application scenario 1: Video management system
随着移动网速的迅速提升,用户在电子设备上存储了大量的短视频。短视频中可能包括不止一个人。对视频库中的短视频进行识别可以方便用户或者系统对视频库进行分类管理,提升用户体验。With the rapid increase of mobile internet speeds, users store a large number of short videos on electronic devices. More than one person may be included in the short video. Recognizing short videos in the video library can facilitate the user or the system to classify and manage the video library and improve user experience.
As shown in Figure 1, using the group action recognition system provided in this application, a neural network structure suitable for short video classification is trained on a given database and then deployed and tested. The trained neural network structure can be used to determine the tag corresponding to each short video, that is, to classify the short videos, obtain the group action category corresponding to each short video, and tag different short videos with different tags. This makes it convenient for users to view and search, saves the time of manual classification and management, and improves management efficiency and user experience.
应用场景二:关键人物检测系统Application Scenario 2: Key Person Detection System
通常情况下,视频中包括若干人物,其中大部分人并不重要。有效地检测出关键人物有助于快速理解场景内容。如图2所示,利用本申请提供的群体动作识别系统,可以识别出视频中关键人物,从而根据关键人物周围信息,更加准确地理解视频内容。Usually, the video includes several people, most of whom are not important. Detecting key figures effectively helps to quickly understand the content of the scene. As shown in Figure 2, the group action recognition system provided by the present application can identify key persons in the video, so as to understand the video content more accurately based on the information around the key persons.
为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。In order to facilitate understanding, the following first introduces related terms and neural network related concepts involved in the embodiments of the present application.
(1)神经网络(1) Neural network
A neural network may be composed of neural units. A neural unit may refer to an arithmetic unit that takes inputs x_s and an intercept b, and the output of the arithmetic unit may be:

    f( Σ_{s=1}^{n} W_s · x_s + b )

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f() is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of the local receptive field, where the local receptive field may be a region composed of several neural units.
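As an illustrative sketch of the neural unit above, with the sigmoid function as one possible choice of the activation function f():

    import numpy as np

    def neural_unit(x: np.ndarray, W: np.ndarray, b: float) -> float:
        # x: inputs x_s; W: weights W_s; b: bias of the neural unit
        z = np.dot(W, x) + b
        return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation f()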
(2)深度神经网络(deep neural network,DNN)(2) Deep neural network (deep neural network, DNN)
A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no special metric for "many" here. According to the positions of the different layers, the neural network inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. For example, in a fully connected neural network, the layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the (i+1)-th layer. Although a DNN looks complicated, the work of each layer is not complicated; simply put, it is the following linear relationship expression:

    y = α( W · x + b )

where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer merely performs this simple operation on the input vector x to obtain the output vector y. Because a DNN has many layers, there are also many coefficients W and offset vectors b. These parameters are defined in the DNN as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.

In summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W^L_{jk}. It should be noted that the input layer has no W parameters. In a deep neural network, more hidden layers enable the network to better characterize complex situations in the real world. In theory, a model with more parameters has higher complexity and a larger "capacity", which means that it can complete more complex learning tasks. Training a deep neural network is the process of learning the weight matrices, and its ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of the many layers).
(3)卷积神经网络(convolutional neuron network,CNN)(3) Convolutional neural network (convolutional neuron network, CNN)
卷积神经网络是一种带有卷积结构的深度神经网络。卷积神经网络包括了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包括若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。Convolutional neural network is a deep neural network with convolutional structure. The convolutional neural network includes a feature extractor composed of a convolutional layer and a sub-sampling layer. The feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolution feature map. The convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network. In the convolutional layer of a convolutional neural network, a neuron can only be connected to a part of the neighboring neurons. A convolutional layer usually includes several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are the convolution kernels. Sharing weight can be understood as the way of extracting image information has nothing to do with location. The underlying principle is that the statistical information of a certain part of the image is the same as that of other parts. This means that the image information learned in one part can also be used in another part. Therefore, the image information obtained by the same learning can be used for all positions on the image. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。The convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
(4)循环神经网络(recurrent neural networks,RNN)(4) Recurrent Neural Networks (RNN)
循环神经网络是用来处理序列数据的。在传统的神经网络模型中,是从输入层到隐含层再到输出层,层与层之间是全连接的,而对于每一层层内之间的各个节点是无连接的。这种普通的神经网络虽然解决了很多难题,但是却仍然对很多问题却无能无力。例如,你要预测句子的下一个单词是什么,一般需要用到前面的单词,因为一个句子中前后单词并不是独立的。RNN之所以称为循环神经网路,即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即隐含层本层之间的节点不再无连接而是有连接的,并且隐含层的输入不仅包括输入层的输出还包括上一时刻隐含层的输出。理论上,RNN能够对任何长度的序列数据进行处理。对于RNN的训练和对传统的CNN或DNN的训练一样。同样使用误差反向传播算法,不过有一点区别:即,如果将RNN进行网络展开,那么其中的参数,如W,是共享的;而如上举例上述的传统神经网络却不是这样。并且在使用梯度下降算法中,每一步的输出不仅依赖当前步的网络,还依赖前面若干步网络的状态。该学习算法称为基于时间的反向传播算法(back propagation through time,BPTT)。Recurrent neural network is used to process sequence data. In the traditional neural network model, from the input layer to the hidden layer and then to the output layer, the layers are fully connected, and the nodes in each layer are disconnected. Although this ordinary neural network has solved many problems, it is still powerless for many problems. For example, if you want to predict what the next word of a sentence is, you generally need to use the previous word, because the preceding and following words in a sentence are not independent. The reason why RNN is called recurrent neural network is that the current output of a sequence is also related to the previous output. The specific form of expression is that the network will memorize the previous information and apply it to the calculation of the current output, that is, the nodes between the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only The output of the input layer also includes the output of the hidden layer at the previous moment. In theory, RNN can process sequence data of any length. The training of RNN is the same as the training of traditional CNN or DNN. The error back-propagation algorithm is also used, but there is a difference: that is, if the RNN is network expanded, then the parameters, such as W, are shared; while the traditional neural network mentioned above is not the case. And in using the gradient descent algorithm, the output of each step depends not only on the current step of the network, but also on the state of the previous steps of the network. This learning algorithm is called backpropagation through time (BPTT).
Since convolutional neural networks already exist, why are recurrent neural networks still needed? The reason is simple: in a convolutional neural network, there is a premise that the elements are independent of each other, and the inputs and outputs are also independent, such as cats and dogs. But in the real world, many elements are connected to each other. For example, stock prices change over time; or someone says: "I like traveling, and my favorite place is Yunnan; I will definitely go there when I have the opportunity." If asked to fill in the blank here, humans all know that the answer is "Yunnan", because humans infer from the context. But how can a machine do this? RNNs came into being for this purpose. An RNN is intended to give machines a memory capability like that of humans. Therefore, the output of an RNN needs to depend on the current input information and the historical memory information.
(5)损失函数(loss function)(5) Loss function (loss function)
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。In the process of training a deep neural network, because it is hoped that the output of the deep neural network is as close as possible to the value that you really want to predict, you can compare the predicted value of the current network with the target value you really want, and then based on the difference between the two To update the weight vector of each layer of neural network (of course, there is usually an initialization process before the first update, that is, pre-configured parameters for each layer in the deep neural network), for example, if the predicted value of the network If it is high, adjust the weight vector to make its prediction lower, and keep adjusting until the deep neural network can predict the really wanted target value or a value very close to the really wanted target value. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value". This is the loss function or objective function, which is an important equation used to measure the difference between the predicted value and the target value. Among them, taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, then the training of the deep neural network becomes a process of reducing this loss as much as possible.
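As an illustrative sketch, two common choices of loss function are shown below: the mean squared error for regression-style targets and the cross-entropy loss for classification (such as group action classification); neither is prescribed by this application:

    import numpy as np

    def mse_loss(pred: np.ndarray, target: np.ndarray) -> float:
        return float(np.mean((pred - target) ** 2))

    def cross_entropy_loss(logits: np.ndarray, label: int) -> float:
        # logits: raw class scores; label: index of the true class
        z = logits - logits.max()
        log_probs = z - np.log(np.exp(z).sum())
        return float(-log_probs[label])   # the lower the loss, the closer prediction and target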
(6)残差网络(residual network,ResNet)(6) Residual network (residual network, ResNet)
在不断加神经网络的深度时,会出现退化的问题,即随着神经网络深度的增加,准确率先上升,然后达到饱和,再持续增加深度则会导致准确率下降。普通直连的卷积神经网络和残差网络的最大区别在于,ResNet有很多旁路的支线将输入直接连到后面的层,通过直接将输入信息绕道传到输出,保护信息的完整性,解决退化的问题。残差网络包括卷积层和/或池化层。When the depth of the neural network is continuously increased, the problem of degradation will occur, that is, as the depth of the neural network increases, the accuracy first increases, and then reaches saturation, and then continues to increase the depth will cause the accuracy to decrease. The biggest difference between the ordinary direct-connected convolutional neural network and the residual network is that ResNet has many bypass branches to directly connect the input to the subsequent layer, and protect the integrity of the information by directly detouring the input information to the output. The problem of degradation. The residual network includes a convolutional layer and/or a pooling layer.
A residual network can be understood as follows: in a deep neural network, in addition to the layer-by-layer connections between multiple hidden layers (for example, the first hidden layer is connected to the second hidden layer, the second hidden layer is connected to the third hidden layer, and the third hidden layer is connected to the fourth hidden layer; this is one data operation path of the neural network and can also be vividly called neural network transmission), the residual network has an additional directly connected branch, which connects the first hidden layer directly to the fourth hidden layer, that is, the processing of the second and third hidden layers is skipped, and the data of the first hidden layer is transmitted directly to the fourth hidden layer for operation. A highway network can be understood as follows: in addition to the above operation path and the directly connected branch, the deep neural network further includes a weight obtaining branch, which introduces a transform gate to obtain a weight value and outputs the weight value T for subsequent operations of the above operation path and the directly connected branch.
(7)反向传播(back propagation,BP)算法(7) Backpropagation (BP) algorithm
卷积神经网络可以采用误差反向传播算法在训练过程中修正初始的神经网络模型中参数的大小,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的参数,例如权重矩阵。The convolutional neural network can use the error back propagation algorithm to modify the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forwarding the input signal until the output will cause error loss, and the parameters in the initial neural network model are updated by backpropagating the error loss information, so that the error loss is converged. The backpropagation algorithm is a backpropagation motion dominated by error loss, and aims to obtain the optimal parameters of the neural network model, such as the weight matrix.
(8)像素值(8) Pixel value
图像的像素值可以是一个红绿蓝(Red-Green-Blue,RGB)颜色值,像素值可以是表示颜色的长整数。例如,像素值为255×Red+100×Green+76×Blue,其中,Blue代表蓝色分量,Green代表绿色分量,Red代表红色分量。各个颜色分量中,数值越小,亮度越低,数值 越大,亮度越高。对于灰度图像来说,像素值可以是灰度值。The pixel value of the image can be a Red-Green-Blue (RGB) color value, and the pixel value can be a long integer representing the color. For example, the pixel value is 255×Red+100×Green+76×Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness. For grayscale images, the pixel values can be grayscale values.
(9)群体动作识别(group activity recognition,GAR)(9) Group activity recognition (GAR)
群体动作识别也可以称为群体活动识别,用于识别视频中一群人所做的事情。是计算机视觉中的一个重要课题。GAR有许多潜在的应用,包括视频监控和体育视频分析。与传统的单人动作识别相比,GAR不仅需要识别人物的行为,还需要推断人物之间的潜在关系。Group action recognition can also be called group activity recognition, which is used to identify what a group of people do in a video. It is an important subject in computer vision. GAR has many potential applications, including video surveillance and sports video analysis. Compared with traditional single-person action recognition, GAR not only needs to recognize the behavior of the characters, but also needs to infer the potential relationship between the characters.
群体动作识别的可以采用以下方式:Group action recognition can use the following methods:
(1)从相应的边界框中提取每个人物的时序特征(又称作人物动作表示);(1) Extract the temporal characteristics of each character from the corresponding bounding box (also known as character action representation);
(2)推断每个人物之间的空间上下文(又称作交互动作表示);(2) Infer the spatial context between each character (also known as interactive action representation);
(3)将这些表示连接为最终的组活动特性(又称作特征聚合)。(3) Connect these representations into the final group activity characteristics (also called feature aggregation).
这些方法确实有效,但却忽略了多级信息的并发性,导致GAR的性能不尽人意。These methods are indeed effective, but they ignore the concurrency of multi-level information, resulting in unsatisfactory performance of GAR.
一个群体动作是由该群体中若干人物的不同动作组成的,即相当于几个人物合作完成的动作,而这些人物动作又反映出身体的不同姿态。A group action is composed of different actions of several characters in the group, which is equivalent to actions completed by several characters in cooperation, and these character actions reflect different postures of the body.
此外,传统的模型往往忽略了不同人物之间的空间依赖关系,人物之间的空间依赖关系以及每个人物动作的时间依赖关系都可以为GAR提供重要的线索。例如,一个人在击球时,必须观察他的队友情况,同时,他必须随着时间推移不断调整自身姿态,以执行这样一个击球动作。而这样几个人互相合作完成一个群体动作。以上所有这些信息,包括每帧图像中每个人的动作特征(也可以称为人物的姿态(human parts)特征)、每个人的动作在时间上和空间上的依赖性特征(也可以称为人物动作(human actions)特征)、每帧图像的特征(也可以称为群体活动(group activity)特征)及这些特征之间的相互关系,共同构成一个实体,影响着群体动作的识别。也就是说,传统方法通过使用分步法去处理这样一个实体的复杂信息,无法充分利用其中潜在的时间和空间的依赖性。不仅如此,这些方法还极有可能会破坏空间域和时间域之间的共现关系。现有方法往往直接在提取时序依赖特征的情况下训练CNN网络,因此特征提取网络提取的特征忽略了图像中的人之间的空间依赖关系。另外,边界框中包括较多的冗余信息,这些信息可能会较低提取的人物的动作特征的准确性。In addition, traditional models often ignore the spatial dependence between different characters. The spatial dependence between characters and the time dependence of each character's actions can provide important clues for GAR. For example, when a person hits the ball, he must observe the situation of his teammates, and at the same time, he must constantly adjust his posture over time to perform such a hitting action. And such a few people cooperate with each other to complete a group action. All of the above information, including the action characteristics of each person in each frame of the image (also called the human parts characteristics), and the dependence characteristics of each person's actions in time and space (also called the characters Human actions (human actions features), the features of each frame of images (also called group activity features), and the interrelationship between these features, together constitute an entity that affects the recognition of group actions. In other words, the traditional method uses a step-by-step method to process the complex information of such an entity, and cannot make full use of its potential time and space dependence. Not only that, these methods are also very likely to destroy the co-occurrence relationship between the space domain and the time domain. Existing methods often train the CNN network directly under the condition of extracting timing-dependent features. Therefore, the features extracted by the feature extraction network ignore the spatial dependence between people in the image. In addition, the bounding box contains more redundant information, which may lower the accuracy of the extracted character's action features.
图3是一种群体动作识别的方法的示意性流程图。具体可参见《A Hierarchical Deep Temporal Model for Group Activity Recognition》(Ibrahim M S,Muralidharan S,Deng Z,et al.IEEE Conference on Computer Vision and Pattern Recognition.2016:1971-1980)。Fig. 3 is a schematic flow chart of a method for group action recognition. For details, please refer to "A Hierarchical Deep Temporal Model for Group Activity Recognition" (Ibrahim M S, Muralidharan S, Deng Z, et al. IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1971-1980).
使用已有算法对多个视频帧中的若干人物进行目标跟踪,确定每个人物在多个视频帧中每个视频帧中的大小和位置。使用人物CNN提取每个视频帧中每个人物的卷积特征,并将卷积特征输入人物长短期记忆网络(long short-term memory,LSTM)提取每个人物的时序特征。将每个人物对应的卷积特征和时序特征进行拼接,作为该人物的人物动作特征。将视频中多个人物的人物动作特征进行拼接和最大池化,以得到每个视频帧的动作特征。将每个视频帧的动作特征输入群体LSTM,以获得视频帧对应的特征。将视频帧对应的特征输入群体动作分类器,从而对输入的视频进行分类,即确定视频中的群体动作所属的类别。The existing algorithm is used to target several people in multiple video frames, and the size and position of each person in each video frame are determined. The person CNN is used to extract the convolutional features of each person in each video frame, and the convolutional features are input into the person's long short-term memory network (LSTM) to extract the time series features of each person. The convolution feature and time sequence feature corresponding to each person are spliced together as the person's action feature of the person. The character action characteristics of multiple characters in the video are spliced and max pooled to obtain the action characteristics of each video frame. The action characteristics of each video frame are input into the group LSTM to obtain the corresponding characteristics of the video frame. The feature corresponding to the video frame is input into the group action classifier to classify the input video, that is, the category to which the group action in the video belongs is determined.
需要进行两步训练,以得到能够对包括该特定类型的群体动作的视频进行识别的分层深度时序模型(hierarchical deep temporal model,HDTM)。HDTM模型包括人物CNN、 人物LSTM、群体LSTM和群体动作分类器。Two-step training is required to obtain a hierarchical deep temporal model (HDTM) that can recognize videos that include this specific type of group action. The HDTM model includes a character CNN, a character LSTM, a group LSTM, and a group action classifier.
使用已有算法对多个视频帧中的若干人物进行目标跟踪,确定每个人物在多个视频帧中每个视频帧中的大小和位置。每个人物对应于一个人物动作标签。每个输入的视频对应于一个群体动作标签。The existing algorithm is used to target several people in multiple video frames, and the size and position of each person in each video frame are determined. Each character corresponds to a character action tag. Each input video corresponds to a group action tag.
第一步训练,根据每个人物对应于的人物动作标签,训练人物CNN、人物LSTM和人物动作分类器,从而得到训练后的人物CNN、训练后的人物LSTM。The first step is to train the character CNN, character LSTM, and character action classifier according to the character action label corresponding to each character, so as to obtain the trained character CNN and the trained character LSTM.
第二步训练,根据群体动作标签训练群体LSTM和群体动作分类器的参数,从而得到训练后的群体LSTM和训练后的群体动作分类器。The second step of training is to train the parameters of the group LSTM and the group action classifier according to the group action tags, so as to obtain the trained group LSTM and the trained group action classifier.
根据第一步训练得到人物CNN、人物LSTM,提取输入的视频中每个人物的卷积特征、时序特征。之后,根据提取的多个人物的卷积特征、时序特征进行拼接得到的每个视频帧的特征表示,进行第二步训练。在两步训练完成之后,得到的神经网络模型能够对输入的视频进行群体动作识别。According to the first step of training, the person CNN and the person LSTM are obtained, and the convolutional features and timing features of each person in the input video are extracted. After that, the second step of training is performed according to the feature representation of each video frame obtained by splicing the convolution features and time sequence features of the extracted multiple people. After the two-step training is completed, the obtained neural network model can perform group action recognition on the input video.
每个人物的人物动作特征表示的确定,是由第一步训练的神经网络模型进行的。对多个人物的人物动作特征表示进行融合,从而识别群体动作,是由第二步训练的神经网络模型进行的。特征提取与群体动作分类之间存在信息隔阂,即第一步训练得到的神经网络模型能够准确提取识别人物动作的特征,但这些特征是否适用于群体动作的识别,并不能得到保证。The determination of the character's action feature representation of each character is carried out by the neural network model trained in the first step. The fusion of the character action feature representations of multiple characters to identify group actions is performed by the neural network model trained in the second step. There is an information gap between feature extraction and group action classification, that is, the neural network model obtained in the first step of training can accurately extract the features of recognizing people's actions, but whether these features are suitable for group action recognition can not be guaranteed.
图4是一种群体动作识别的方法的示意性流程图。具体可参见《Social scene understanding:End-to-end multi-person action localization and collective activity recognition》(Bagautdinov,Timur,et al.IEEE Conference on Computer Vision and Pattern Recognition.2017:4315-4324)。Fig. 4 is a schematic flowchart of a method for group action recognition. For details, please refer to "Social scene understanding: End-to-end multi-person action localization and collective activity recognition" (Bagautdinov, Timur, et al. IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4315-4324).
The t-th frame image of several video frames is sent into a fully convolutional network (FCN) to obtain several person features f_t. Temporal modeling is performed on the person features f_t through an RNN to obtain the temporal feature of each person, and the temporal feature of each person is sent into a classifier to simultaneously recognize the person action p^I_t and the group action p^C_t.
需要进行一步训练,以得到能够对包括该特定类型的群体动作的视频进行识别的神经网络模型。也就是说,将训练图像输入FCN,根据训练图像中每个人物的人物动作标签和群体动作标签,对FCN、RNN的参数进行调整,以得到训练后的FCN、RNN。A step of training is required to obtain a neural network model that can recognize videos that include this specific type of group action. That is to say, the training image is input into the FCN, and the parameters of the FCN and RNN are adjusted according to the character action tag and group action tag of each character in the training image to obtain the trained FCN and RNN.
The FCN can generate a multi-scale feature map F_t of the t-th frame image. Several detection boxes B_t and corresponding probabilities p_t are generated through a deep fully convolutional network (DFCN), and B_t and p_t are sent into a Markov random field (MRF) to obtain reliable detection boxes b_t, so that the features f_t corresponding to the reliable detection boxes b_t are determined from the multi-scale feature map F_t. According to the features of the persons in the reliable detection boxes b_{t-1} and b_t, it can be determined that b_{t-1} and b_t correspond to the same person. The FCN can also be obtained through pre-training.
一个群体动作是由若干人物的不同动作组成,而这些人物动作又反映在每个人物的不同身体姿态。人物的时序特征可以反映一个人物的动作的时间依赖关系。人物动作之间的空间依赖关系,也为群体动作识别提供重要的线索。未考虑人物之间的空间依赖性的群体动作识别方案,准确性受到一定影响。A group action is composed of different actions of several characters, and these character actions are reflected in the different body postures of each character. The temporal characteristics of a character can reflect the time dependence of a character's actions. The spatial dependence between character actions also provides important clues for group action recognition. The accuracy of the group action recognition scheme that does not consider the spatial dependence between characters is affected to a certain extent.
另外,在神经网络的训练过程中,确定每个人物的人物动作标签通常由人工进行,工作量较大。In addition, in the training process of the neural network, determining the character action label of each character is usually done manually, which requires a lot of work.
为了解决上述问题,本申请实施例提供了一种图像识别方法。本申请的在确定多个人 物的群体动作时,不仅考虑到了多个人物的时序特征,还考虑到了多个人物的空间特征,通过综合多个人物的时序特征和空间特征能够更好更准确地确定出多个人物的群体动作。In order to solve the above-mentioned problem, an embodiment of the present application provides an image recognition method. When determining the group actions of multiple characters in this application, not only the temporal characteristics of multiple characters are considered, but also the spatial characteristics of multiple characters are considered. By integrating the temporal characteristics and spatial characteristics of multiple characters, it is possible to better and more accurately Determine the group actions of multiple characters.
A system architecture of an embodiment of the present application is first introduced below with reference to FIG. 5.
FIG. 5 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in FIG. 5, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data collection system 560.
In addition, the execution device 510 includes a calculation module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514. The calculation module 511 may include a target model/rule 501, and the preprocessing module 513 and the preprocessing module 514 are optional.
The data collection device 560 is used to collect training data. For the image recognition method of the embodiment of the present application, the training data may include multiple frames of training images (the multiple frames of training images contain multiple characters, for example multiple persons) and corresponding labels, where a label gives the group action category of the persons in the training images. After the training data are collected, the data collection device 560 stores the training data in the database 530, and the training device 520 trains a target model/rule 501 based on the training data maintained in the database 530.
The following describes how the training device 520 obtains the target model/rule 501 based on the training data. The training device 520 recognizes the input multiple frames of training images and compares the output prediction category with the label, until the difference between the prediction category output by the training device 520 and the label is smaller than a certain threshold, thereby completing the training of the target model/rule 501.
The above target model/rule 501 can be used to implement the image recognition method of the embodiment of the present application, that is, one or more frames of to-be-processed images (after relevant preprocessing) are input into the target model/rule 501 to obtain the group action category of the persons in the one or more frames of to-be-processed images. The target model/rule 501 in the embodiment of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 530 do not necessarily all come from the collection of the data collection device 560 and may also be received from other devices. It should also be noted that the training device 520 does not necessarily train the target model/rule 501 entirely based on the training data maintained in the database 530; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of the present application.
The target model/rule 501 trained by the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG. 5. The execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, and may also be a server or a cloud. In FIG. 5, the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with external devices. A user can input data to the I/O interface 512 through the client device 540. In this embodiment of the present application, the input data may include a to-be-processed image input by the client device. The client device 540 here may specifically be a terminal device.
The preprocessing module 513 and the preprocessing module 514 are used to perform preprocessing according to the input data (such as the to-be-processed image) received by the I/O interface 512. In this embodiment of the present application, the preprocessing module 513 and the preprocessing module 514 may be absent, or there may be only one preprocessing module. When the preprocessing module 513 and the preprocessing module 514 are absent, the calculation module 511 can be used directly to process the input data.
When the execution device 510 preprocesses the input data, or when the calculation module 511 of the execution device 510 performs calculation or other related processing, the execution device 510 may call data, code, and the like in the data storage system 550 for the corresponding processing, and may also store the data, instructions, and the like obtained by the corresponding processing into the data storage system 550.
Finally, the I/O interface 512 presents the processing result, such as the group action category calculated by the target model/rule 501, to the client device 540, so as to provide it to the user.
Specifically, the group action category obtained by the target model/rule 501 in the calculation module 511 may be processed by the preprocessing module 513 (and possibly also by the preprocessing module 514), after which the processing result is sent to the I/O interface, and the I/O interface then sends the processing result to the client device 540 for display.
It should be understood that when the preprocessing module 513 and the preprocessing module 514 are not present in the above system architecture 500, the calculation module 511 may also transmit the group action category obtained by its processing to the I/O interface, and the I/O interface then sends the processing result to the client device 540 for display.
It is worth noting that the training device 520 can generate corresponding target models/rules 501 based on different training data for different goals, or different tasks, and the corresponding target models/rules 501 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.
In the case shown in FIG. 5, the user can manually specify the input data, and this manual specification can be operated through the interface provided by the I/O interface 512. In another case, the client device 540 can automatically send input data to the I/O interface 512. If the client device 540 is required to obtain the user's authorization before automatically sending the input data, the user can set the corresponding permission in the client device 540. The user can view the result output by the execution device 510 on the client device 540, and the specific presentation form may be display, sound, action, or another specific manner. The client device 540 can also serve as a data collection terminal that collects, as shown in the figure, the input data of the I/O interface 512 and the output result of the I/O interface 512 as new sample data and stores them in the database 530. Of course, the collection may also bypass the client device 540; instead, the I/O interface 512 directly stores the input data of the I/O interface 512 and the output result of the I/O interface 512, as shown in the figure, into the database 530 as new sample data.
It is worth noting that FIG. 5 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, components, modules, and the like shown in the figure does not constitute any limitation. For example, in FIG. 5, the data storage system 550 is an external memory relative to the execution device 510; in other cases, the data storage system 550 may also be placed in the execution device 510.
As shown in FIG. 5, the target model/rule 501 obtained by training with the training device 520 may be the neural network in the embodiments of the present application. Specifically, the neural network provided in the embodiments of the present application may be a CNN, a deep convolutional neural network (DCNN), and so on.
Since the CNN is a very common neural network, the structure of the CNN is introduced below with emphasis in conjunction with FIG. 6. As mentioned in the introduction of basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron can respond to the image input into it.
FIG. 6 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application. As shown in FIG. 6, the convolutional neural network 600 may include an input layer 610, a convolutional layer/pooling layer 620 (where the pooling layer is optional), and a fully connected layer 630. The relevant content of these layers is described in detail below.
Convolutional layer/pooling layer 620:
Convolutional layer:
As shown in FIG. 6, the convolutional layer/pooling layer 620 may include layers 621-626 as examples. For example, in one implementation, layer 621 is a convolutional layer, layer 622 is a pooling layer, layer 623 is a convolutional layer, layer 624 is a pooling layer, layer 625 is a convolutional layer, and layer 626 is a pooling layer; in another implementation, layers 621 and 622 are convolutional layers, layer 623 is a pooling layer, layers 624 and 625 are convolutional layers, and layer 626 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The following takes the convolutional layer 621 as an example to introduce the internal working principle of a convolutional layer.
The convolutional layer 621 may include many convolution operators. A convolution operator is also called a kernel, and its role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. In the process of performing a convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride), so as to extract specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), i.e., multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolutional image, where this dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features from the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on. The multiple weight matrices have the same size (rows × columns), so the convolutional feature maps extracted by these weight matrices of the same size also have the same size, and the extracted convolutional feature maps of the same size are then combined to form the output of the convolution operation.
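The following is a minimal sketch, in PyTorch, of a convolutional layer with several weight matrices of the same size whose outputs are stacked along the depth dimension; the channel counts, kernel size, and stride are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

# A convolutional layer with 16 kernels (weight matrices) of size 3x3, moving 1 pixel at a time.
# Each kernel spans the full depth of the input (3 channels here), and the 16 outputs are
# stacked to form the depth dimension of the resulting feature map.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

image = torch.randn(1, 3, 224, 224)   # one RGB input image
feature_map = conv(image)
print(feature_map.shape)              # torch.Size([1, 16, 224, 224])
```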
In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training. The weight matrices formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network 600 can make correct predictions.
When the convolutional neural network 600 has multiple convolutional layers, the initial convolutional layer (for example, 621) often extracts more general features, which may also be called low-level features. As the depth of the convolutional neural network 600 increases, the features extracted by the later convolutional layers (for example, 626) become more and more complex, such as high-level semantic features; features with higher-level semantics are more suitable for the problem to be solved.
Pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. In the layers 621-626 illustrated by 620 in FIG. 6, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image of a smaller size. The average pooling operator can calculate the pixel values in the image within a specific range to produce an average value as the result of average pooling. The maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling. In addition, just as the size of the weight matrix in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input into the pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input into the pooling layer.
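A minimal sketch of the two pooling operators described above, again in PyTorch with arbitrary sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 224, 224)         # feature map from a convolutional layer

max_pool = nn.MaxPool2d(kernel_size=2)   # each output pixel is the maximum of a 2x2 sub-region
avg_pool = nn.AvgPool2d(kernel_size=2)   # each output pixel is the average of a 2x2 sub-region

print(max_pool(x).shape, avg_pool(x).shape)   # both torch.Size([1, 16, 112, 112])
```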
Fully connected layer 630:
After the processing of the convolutional layer/pooling layer 620, the convolutional neural network 600 is not yet able to output the required output information. As mentioned above, the convolutional layer/pooling layer 620 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network 600 needs to use the fully connected layer 630 to generate an output for one or a group of required classes. Therefore, the fully connected layer 630 may include multiple hidden layers (631, 632 to 63n as shown in FIG. 6) and an output layer 640. The parameters included in the multiple hidden layers may be obtained by pre-training on the relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the multiple hidden layers in the fully connected layer 630, the last layer of the entire convolutional neural network 600 is the output layer 640. The output layer 640 has a loss function similar to the classification cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 600 (propagation in the direction from 610 to 640 in FIG. 6 is forward propagation) is completed, back propagation (propagation in the direction from 640 to 610 in FIG. 6 is back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 600 and the error between the result output by the convolutional neural network 600 through the output layer and the ideal result.
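As a rough illustration of the forward propagation, loss calculation, and back propagation described above, the following toy training step uses an arbitrary small network and an assumed number of eight output classes; it is a sketch only, not the network of FIG. 6.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 8))
loss_fn = nn.CrossEntropyLoss()          # classification cross-entropy, as in the output layer 640
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(4, 3, 224, 224)     # a toy batch of images
labels = torch.randint(0, 8, (4,))       # toy class labels

logits = model(images)                   # forward propagation (610 -> 640 direction)
loss = loss_fn(logits, labels)           # prediction error
loss.backward()                          # back propagation (640 -> 610 direction)
optimizer.step()                         # update weights and biases to reduce the loss
optimizer.zero_grad()
```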
It should be noted that the convolutional neural network 600 shown in FIG. 6 is only an example of a convolutional neural network; in specific applications, the convolutional neural network may also exist in the form of other network models.
It should be understood that the convolutional neural network 600 shown in FIG. 6 can be used to execute the image recognition method of the embodiment of the present application. As shown in FIG. 6, after the to-be-processed image is processed by the input layer 610, the convolutional layer/pooling layer 620, and the fully connected layer 630, the group action category can be obtained.
FIG. 7 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application. As shown in FIG. 7, the chip includes a neural network processor 700. The chip can be provided in the execution device 510 shown in FIG. 5 to complete the calculation work of the calculation module 511. The chip can also be provided in the training device 520 shown in FIG. 5 to complete the training work of the training device 520 and output the target model/rule 501. The algorithms of all the layers in the convolutional neural network shown in FIG. 6 can be implemented in the chip shown in FIG. 7.
The neural-network processing unit (NPU) 50 is mounted on a host central processing unit (CPU) as a coprocessor, and the host CPU allocates tasks. The core part of the NPU is the arithmetic circuit 703; the controller 704 controls the arithmetic circuit 703 to fetch data from the memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuit 703 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 703 is a two-dimensional systolic array. The arithmetic circuit 703 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 703 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 703 fetches the data corresponding to matrix B from the weight memory 702 and buffers it on each PE in the arithmetic circuit 703. The arithmetic circuit 703 fetches the data of matrix A from the input memory 701, performs matrix operations with matrix B, and stores the partial results or final results of the obtained matrix in an accumulator 708.
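The following is a simplified software analogue, in NumPy, of accumulating partial matrix products in the way the accumulator 708 is described above; it does not model the actual systolic-array hardware, and the tile size and matrix shapes are arbitrary.

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    """C = A @ B computed slice by slice, summing partial results into an accumulator."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)   # accumulator: partial results are summed here
    for k0 in range(0, K, tile):
        # each pass multiplies a slice of the input matrix A with the matching slice of the weights B
        C += A[:, k0:k0 + tile] @ B[k0:k0 + tile, :]
    return C

A = np.random.rand(8, 64).astype(np.float32)    # input matrix A
B = np.random.rand(64, 32).astype(np.float32)   # weight matrix B
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```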
The vector calculation unit 707 can perform further processing on the output of the arithmetic circuit 703, such as vector multiplication, vector addition, exponential operations, logarithmic operations, magnitude comparison, and so on. For example, the vector calculation unit 707 can be used for network calculations of the non-convolutional/non-FC layers in the neural network, such as pooling, batch normalization, local response normalization, and so on.
In some implementations, the vector calculation unit 707 can store the processed output vector into a unified buffer 706. For example, the vector calculation unit 707 may apply a non-linear function to the output of the arithmetic circuit 703, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 707 generates normalized values, combined values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 703, for example, for use in a subsequent layer of the neural network.
The unified memory 706 is used to store input data and output data.
A direct memory access controller (DMAC) 705 transfers the input data in an external memory to the input memory 701 and/or the unified memory 706, stores the weight data in the external memory into the weight memory 702, and stores the data in the unified memory 706 into the external memory.
A bus interface unit (BIU) 710 is used to implement interaction among the host CPU, the DMAC, and the instruction fetch buffer 709 through a bus.
An instruction fetch buffer 709 connected to the controller 704 is used to store instructions used by the controller 704.
The controller 704 is used to call the instructions buffered in the instruction fetch buffer 709 to control the working process of the computing accelerator.
Generally, the unified memory 706, the input memory 701, the weight memory 702, and the instruction fetch buffer 709 are all on-chip memories, and the external memory is a memory external to the NPU. The external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
In addition, in the present application, the operations of the layers in the convolutional neural network shown in FIG. 6 may be performed by the arithmetic circuit 703 or the vector calculation unit 707.
FIG. 8 is a schematic flowchart of a method for training a neural network model provided by an embodiment of the present application.
S801. Obtain training data, where the training data include T1 frames of training images and an annotated category.
The T1 frames of training images correspond to one annotated category. T1 is a positive integer greater than 1. The T1 frames of training images may be multiple consecutive frames of images in a video, or multiple frames of images selected from a video according to a preset rule. For example, the T1 frames of training images may be multiple frames of images selected from a video at preset time intervals, or multiple frames of images spaced apart by a preset number of frames in a video.
The T1 frames of training images may include multiple characters, and the multiple characters may include only humans, only animals, or both humans and animals.
The above annotated category is used to indicate the category of the group action of the characters in the T1 frames of training images.
S802. Use a neural network to process the T1 frames of training images to obtain a training category.
The neural network performs the following processing on the T1 frames of training images:
S802a. Extract image features of the T1 frames of training images.
At least one frame of image is selected from the T1 frames of training images, and the image features of the multiple characters in each of the at least one frame of image are extracted.
In a frame of training image, the image feature of a certain character can be used to represent the body posture of that character in this frame of training image, that is, the relative positions between different limbs of the character. The above image features can be represented by vectors.
S802b. Determine the spatial features of the multiple characters in each of the at least one frame of training image.
The spatial feature of the j-th character in the i-th frame of training image of the at least one frame of training image is determined according to the similarity between the image feature of the j-th character in the i-th frame of training image and the image features of the characters other than the j-th character in the i-th frame of image, where i and j are positive integers.
The spatial feature of the j-th character in the i-th frame of training image is used to represent the association between the action of the j-th character in the i-th frame of training image and the actions of the characters other than the j-th character in the i-th frame of training image.
The similarity between the image features corresponding to different characters in the same frame of image can reflect the degree of spatial dependence between the actions of these different characters. That is, the higher the similarity between the image features corresponding to two characters, the closer the association between the actions of the two characters; conversely, the lower the similarity between the image features corresponding to two characters, the weaker the association between the actions of the two characters.
S802c. Determine the temporal feature of each of the multiple characters in the at least one frame of training image across different frames of images.
The temporal feature of the j-th character in the i-th frame of training image of the at least one frame of training image is determined according to the similarity between the image feature of the j-th character in the i-th frame of training image and the image features of the j-th character in the frames of training images other than the i-th frame of image, where i and j are positive integers.
The temporal feature of the j-th character in the i-th frame of training image is used to represent the association between the action of the j-th character in the i-th frame of training image and the actions of the j-th character in the other frames of training images of the at least one frame of training image.
The similarity between the image features corresponding to one character in two frames of images can reflect the degree of temporal dependence of that character's actions. The higher the similarity between the image features corresponding to a character in two frames of images, the closer the association between the character's actions at the two time points; conversely, the lower the similarity, the weaker the association between the character's actions at the two time points.
S802d. Determine the action features of the multiple characters in each of the at least one frame of training image.
The action feature of the j-th character in the i-th frame of training image is obtained by fusing the spatial feature of the j-th character in the i-th frame of training image, the temporal feature of the j-th character in the i-th frame of training image, and the image feature of the j-th character in the i-th frame of training image.
S802e. According to the action features of the multiple characters in each of the at least one frame of training image, recognize the group action of the multiple characters in the T1 frames of training images to obtain the training category corresponding to the group action.
The action features of each of the multiple characters in each of the at least one frame of training image may be fused to obtain a feature representation of each of the at least one frame of training image.
The average value of each bit of the feature representation of each of the T1 frames of training images may be calculated to obtain an average feature representation. Each bit of the average feature representation is the average value of the corresponding bits of the feature representations of the T1 frames of training images. Classification may then be performed according to the average feature representation, that is, the group action of the multiple characters in the T1 frames of training images is recognized to obtain the training category.
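A minimal NumPy sketch of step S802e under stated assumptions: max pooling stands in for the unspecified fusion of the characters' action features within each frame, and W_cls is a hypothetical linear classifier with an assumed number of N_Y = 8 group action categories.

```python
import numpy as np

def video_level_score(person_action_feats, W_cls):
    """person_action_feats: (T1, num_persons, D) action features from step S802d."""
    frame_repr = person_action_feats.max(axis=1)   # (T1, D): fuse the characters within each frame
    avg_repr = frame_repr.mean(axis=0)             # (D,): per-bit average over the T1 frames
    return W_cls @ avg_repr                        # (N_Y,) group action scores

feats = np.random.rand(10, 12, 256)                # 10 frames, 12 characters, 256-d action features
W_cls = np.random.rand(8, 256)                     # hypothetical linear classifier, N_Y = 8
print(video_level_score(feats, W_cls).shape)       # (8,)
```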
In order to increase the amount of training data, the training category of each of the at least one frame of training image may be determined; determining the training category of each frame of image is taken as an example for description here. The at least one frame of training image may be all or part of the T1 frames of training images.
S803. Determine the loss value of the neural network according to the training category and the annotated category.
The loss value L of the neural network can be expressed as:

$$L=-\sum_{t=1}^{T1}\sum_{i=1}^{N_Y}\tilde{y}^{i}\log\left(p_{t}^{i}\right)$$

where $N_Y$ denotes the number of group action categories, that is, the number of categories output by the neural network; $\tilde{y}$ denotes the annotated category, $\tilde{y}$ is expressed by one-hot encoding and includes $N_Y$ bits, and $\tilde{y}^{i}$ denotes one of these bits; $p_t$ denotes the training category of the t-th frame image among the T1 frames of images, $p_t$ likewise includes $N_Y$ bits, and $p_{t}^{i}$ denotes one of these bits. The t-th frame image can also be understood as the image at time t.
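A small NumPy sketch of the loss above, assuming the classification cross-entropy form reconstructed here and summing over the T1 frames without normalization; the per-frame predictions and the one-hot annotation are toy values.

```python
import numpy as np

def group_action_loss(pred, label_onehot, eps=1e-12):
    """L = -sum_t sum_i  y_tilde^i * log(p_t^i), summed over the T1 frames and N_Y categories."""
    # pred: (T1, N_Y) per-frame training categories p_t; label_onehot: (N_Y,) annotation y_tilde
    return -np.sum(label_onehot[None, :] * np.log(pred + eps))

pred = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1]])    # T1 = 2 frames, N_Y = 3 categories
label = np.array([1.0, 0.0, 0.0])     # one-hot annotated category
print(group_action_loss(pred, label)) # approximately 0.867
```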
S804. Adjust the neural network through back propagation according to the loss value.
In the above training process, the training data generally include multiple combinations of training images and annotated categories, and each combination of training images and an annotated category may include one or more frames of training images and a unique annotated category corresponding to the one or more frames of training images.
In the process of training the above neural network, an initial set of model parameters can be set for the neural network, and the model parameters of the neural network are then gradually adjusted according to the difference between the training category and the annotated category, until the difference between the training category and the annotated category is within a certain preset range, or until the number of training iterations reaches a preset number; the model parameters of the neural network at that point are determined as the final parameters of the neural network model, thereby completing the training of the neural network.
FIG. 9 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
S901. Extract image features of a to-be-processed image.
The above to-be-processed image includes multiple characters, and the image features of the to-be-processed image include the image features of each of the multiple characters in each of multiple frames of the to-be-processed image.
Before step S901, the to-be-processed image may be obtained. The to-be-processed image may be obtained from a memory, or the to-be-processed image may be received.
For example, when the image recognition method shown in FIG. 9 is executed by an image recognition apparatus, the above to-be-processed image may be an image obtained from the image recognition apparatus, or the above to-be-processed image may be an image received by the image recognition apparatus from another device, or the above to-be-processed image may be captured by a camera of the image recognition apparatus.
The above to-be-processed image may be multiple consecutive frames of images in a video, or multiple frames of images selected from a video according to a preset rule. For example, multiple frames of images may be selected from a video at preset time intervals, or multiple frames of images may be selected from a video at intervals of a preset number of frames.
It should be understood that the multiple characters in the above to-be-processed image may include only humans, only animals, or both humans and animals.
In a frame of image, the image feature of a certain character can be used to represent the body posture of that character in this frame of image, that is, the relative positions between different limbs of the character. The image feature of a certain character can be represented by a vector, which may be called an image feature vector. The above image feature extraction may be performed by a CNN.
Optionally, when extracting the image features of the to-be-processed image, the characters in the image may be recognized so as to determine the bounding boxes of the characters, where the image in each bounding box corresponds to one character, and feature extraction is then performed on the image of each bounding box to obtain the image feature of each character.
Since the image inside a bounding box includes a considerable amount of redundant information that is unrelated to the character's action, the influence of the redundant information can be reduced by recognizing the skeleton nodes of the character in each bounding box, so as to improve the accuracy of the image feature vector.
Optionally, the skeleton nodes of the character in the bounding box corresponding to each character may be recognized first, and the image feature vector of the character may then be extracted according to the skeleton nodes of the character, so that the extracted image features reflect the character's action more accurately and the accuracy of the extracted image features is improved.
Further, the skeleton nodes in the bounding box may be connected according to the character's structure to obtain a connected image, and image feature vector extraction may then be performed on the connected image.
Alternatively, the regions where the skeleton nodes are located and the regions outside the regions where the skeleton nodes are located may be displayed in different colors to obtain a processed image, and image feature extraction may then be performed on the processed image.
Further, a locally visible image corresponding to the bounding box may be determined according to the image regions where the skeleton nodes of the character are located, and feature extraction may then be performed on the locally visible image to obtain the image features of the to-be-processed image.
The above locally visible image is an image composed of the regions where the skeleton nodes of the characters in the to-be-processed image are located. Specifically, the regions in the bounding box outside the regions where the skeleton nodes of the character are located may be masked to obtain the locally visible image.
When masking the regions outside the regions where the skeleton nodes are located, the color of the pixels corresponding to those regions may be set to a certain preset color, such as black. That is to say, the regions where the skeleton nodes are located retain the same information as the original image, while the information of the regions outside the regions where the skeleton nodes are located is masked. Therefore, when extracting image features, only the image features of the above locally visible image need to be extracted, and no extraction operation needs to be performed on the masked regions.
The region where a skeleton node is located may be a square, a circle, or another shape centered on the skeleton node. The side length (or radius), area, and the like of the region where the skeleton node is located may be preset values.
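A sketch of the masking described above, assuming square regions around the skeleton nodes and black as the preset color; mask_outside_keypoints, half_size, and the keypoint coordinates are hypothetical names and values used only for illustration.

```python
import numpy as np

def mask_outside_keypoints(crop, keypoints, half_size=8, fill=0):
    """Keep square regions centered on each skeleton node; mask everything else with a preset color."""
    h, w, _ = crop.shape
    keep = np.zeros((h, w), dtype=bool)
    for x, y in keypoints:                            # keypoints: pixel coordinates of skeleton nodes
        x0, x1 = max(0, x - half_size), min(w, x + half_size)
        y0, y1 = max(0, y - half_size), min(h, y + half_size)
        keep[y0:y1, x0:x1] = True
    out = np.full_like(crop, fill)                    # masked area set to a preset color (black here)
    out[keep] = crop[keep]                            # regions around skeleton nodes keep the original pixels
    return out

crop = np.random.randint(0, 256, (128, 64, 3), dtype=np.uint8)   # image inside one bounding box
visible = mask_outside_keypoints(crop, [(32, 20), (32, 60), (20, 100)])
```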
In the above method of extracting the image features of the to-be-processed image, feature extraction may be performed on the locally visible image to obtain the image feature vector of the character corresponding to the bounding box; alternatively, a masking matrix may be determined according to the skeleton nodes, and the image may be masked according to the masking matrix. For details, refer to the descriptions of FIG. 11 and FIG. 12.
When multiple frames of images are obtained, different characters in the images can be determined through target tracking. For example, the characters in the images can be distinguished according to the sub-features of the characters in the images. The sub-features may be color, edges, motion information, texture information, and so on.
S902. Determine the spatial feature of each of the multiple characters in each of the multiple frames of images.
The spatial association between the actions of different characters in a frame of image is determined through the similarity between the image features of the different characters in the same frame of image.
The spatial feature of the j-th character in the i-th frame of the to-be-processed images may be determined according to the similarity between the image feature of the j-th character in the i-th frame of image and the image features of the characters other than the j-th character in the i-th frame of image, where i and j are positive integers.
It should be understood that the spatial feature of the j-th character in the i-th frame of image is used to represent the association between the action of the j-th character in the i-th frame of image and the actions of the characters other than the j-th character in the i-th frame of image.
Specifically, the similarity between the image feature vector of the j-th character in the i-th frame of image and the image feature vectors of the characters other than the j-th character can reflect the degree to which the j-th character in the i-th frame of image depends on the actions of the characters other than the j-th character. That is, the higher the similarity between the image feature vectors corresponding to two characters, the closer the association between the actions of the two characters; conversely, the lower the similarity between the image feature vectors corresponding to two characters, the weaker the association between the actions of the two characters. For the spatial association between the actions of different characters in a frame of image, refer to the descriptions of FIG. 14 and FIG. 15.
S903. Determine the temporal feature of each of the multiple characters in each of the multiple frames of images.
The temporal association between the actions of the same character at different moments is determined through the similarity between the image feature vectors of the different actions of that character in different frames of images.
The temporal feature of the j-th character in the i-th frame of the to-be-processed images may be determined according to the similarity between the image feature of the j-th character in the i-th frame of image and the image features of the j-th character in the frames of images other than the i-th frame of image, where i and j are positive integers.
The temporal feature of the j-th character in the i-th frame of image is used to represent the association between the action of the j-th character in the i-th frame of image and the actions of the j-th character in the frames of images other than the i-th frame of image.
The similarity between the image features corresponding to one character in two frames of images can reflect the degree of temporal dependence of that character's actions. The higher the similarity between the image features corresponding to a character in two frames of images, the closer the association between the character's actions at the two time points; conversely, the lower the similarity, the weaker the association between the character's actions at the two time points. For the temporal association between the actions of one character, refer to the descriptions of FIG. 16 and FIG. 17.
The above processes all involve the similarity between features, and the similarity can be obtained in different ways. For example, the similarity between the above features can be calculated by methods such as the Minkowski distance (for example, the Euclidean distance or the Manhattan distance), the cosine similarity, the Chebyshev distance, or the Hamming distance.
Optionally, the similarity can be calculated by computing the sum of the products of the corresponding bits of the two features after a linear transformation.
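A sketch of this similarity computation, together with one possible way of turning the similarities into a spatial feature; the linear maps W_a and W_b and the softmax-normalized weighted sum over the other characters are assumptions made for illustration and are not specified in this passage.

```python
import numpy as np

def pairwise_similarity(feats, W_a, W_b):
    """sim(j, k) = sum over bits of (W_a f_j) * (W_b f_k), i.e. a dot product after linear maps."""
    qa = feats @ W_a.T                        # linearly transformed features
    qb = feats @ W_b.T
    return qa @ qb.T                          # (num_characters, num_characters) similarity matrix

def spatial_features(feats, W_a, W_b):
    sim = pairwise_similarity(feats, W_a, W_b)
    np.fill_diagonal(sim, -np.inf)            # exclude the character itself
    weights = np.exp(sim - sim.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feats                    # weighted sum of the other characters' features

feats = np.random.rand(6, 128)                # image features of 6 characters in one frame
W_a = np.random.rand(64, 128); W_b = np.random.rand(64, 128)
print(spatial_features(feats, W_a, W_b).shape)   # (6, 128)
```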
Both the spatial association between the actions of different characters and the temporal association between the actions of the same character can provide important clues for the category of a multi-person scene in an image. Therefore, in the image recognition process of the present application, by comprehensively considering the spatial association between the actions of different characters and the temporal association between the actions of the same character, the accuracy of recognition can be effectively improved.
S904. Determine the action feature of each of the multiple characters in each of the multiple frames of images.
Optionally, when determining the action feature of a certain character in a certain frame of image, the temporal feature, the spatial feature, and the image feature corresponding to that character in that frame of image may be fused to obtain the action feature of that character in that frame of image.
For example, the spatial feature of the j-th character in the i-th frame of the to-be-processed images, the temporal feature of the j-th character in the i-th frame of image, and the image feature of the j-th character in the i-th frame of image may be fused to obtain the action feature of the j-th character in the i-th frame of image.
When fusing the above temporal features, spatial features, and image features, different fusion manners may be adopted. The fusion manners are illustrated with examples below, and a code sketch covering all three manners follows the third manner.
Manner 1: fusion by combination.
The features to be fused may be added directly, or added with weights.
It should be understood that weighted addition means that the features to be fused are multiplied by certain coefficients, namely weight values, and then added.
That is to say, with the combination manner, a channel-wise linear combination can be performed.
Multiple features output by multiple layers of a feature extraction network can be added; for example, they can be added directly, or they can be added according to certain weights. Let T1 and T2 respectively denote the features output by two layers of the feature extraction network, and let T3 denote the fused feature; then T3 = a × T1 + b × T2, where a and b are respectively the coefficients, namely weight values, by which T1 and T2 are multiplied when calculating T3, with a ≠ 0 and b ≠ 0.
Manner 2: fusion by concatenation and channel fusion.
Concatenation and channel fusion is another fusion manner. With concatenation and channel fusion, the dimensions of the features to be fused can be spliced directly, or spliced after the features are multiplied by certain coefficients, namely weight values.
Manner 3: using a pooling layer to process the above features, so as to realize the fusion of the above features.
Maximum pooling may be performed on multiple feature vectors to determine a target feature vector. In the target feature vector obtained by maximum pooling, each bit is the maximum value of the corresponding bits of the multiple feature vectors. Average pooling may also be performed on multiple feature vectors to determine the target feature vector. In the target feature vector obtained by average pooling, each bit is the average value of the corresponding bits of the multiple feature vectors.
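A small NumPy sketch of the three fusion manners above; here T1 and T2 denote two feature vectors to be fused (not the frame count T1 used elsewhere), and the weight values are arbitrary.

```python
import numpy as np

T1 = np.random.rand(256)   # e.g. a character's image feature
T2 = np.random.rand(256)   # e.g. the same character's spatial or temporal feature

# Manner 1: channel-wise linear combination, T3 = a*T1 + b*T2 (a = b = 1 gives direct addition)
a, b = 0.6, 0.4
combined = a * T1 + b * T2                        # (256,)

# Manner 2: concatenation along the feature dimension (optionally after weighting)
concatenated = np.concatenate([a * T1, b * T2])   # (512,)

# Manner 3: pooling across the feature vectors to be fused
stacked = np.stack([T1, T2])                      # (2, 256)
max_pooled = stacked.max(axis=0)                  # each bit is the maximum of the corresponding bits
avg_pooled = stacked.mean(axis=0)                 # each bit is the average of the corresponding bits
```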
Optionally, the features corresponding to one character in a frame of image may be fused in the combination manner to obtain the action feature of that character in that frame of image.
When multiple frames of images are obtained, the feature vector group corresponding to at least one character in the i-th frame of image may further include the temporal feature vector corresponding to the at least one character in the i-th frame of image.
S905. Recognize the group action of the multiple characters in the to-be-processed images according to the action feature of each of the multiple characters in each of the multiple frames of images.
It should be understood that a group action is composed of the actions of several characters in a group, that is, an action completed jointly by multiple characters.
Optionally, the group action of the multiple characters in the above to-be-processed images may be a certain sport or activity; for example, the group action of the multiple characters in the above to-be-processed images may be playing basketball, playing volleyball, playing football, dancing, and so on.
In one implementation, the action feature of each frame of image may be determined according to the action features of each of the multiple characters in each frame of the to-be-processed images. The group action of the multiple characters in the to-be-processed images may then be recognized according to the action feature of each frame of image.
Optionally, the action features of the multiple characters in a frame of image may be fused by means of maximum pooling to obtain the action feature of that frame of image.
Optionally, the action features of the multiple characters in each frame of image may be fused to obtain the action feature of that frame of image, the action feature of each frame of image may then be input into a classification module to obtain the action classification result of each frame of image, and the classification result that corresponds to the largest number of images among the to-be-processed images in the output categories of the classification module may be taken as the group action of the multiple characters in the to-be-processed images.
Optionally, the action features of the multiple characters in each frame of image may be fused to obtain the action feature of that frame of image, the obtained action features of the frames of images may then be averaged to obtain the average action feature of the frames of images, the average action feature may then be input into the classification module, and the classification result corresponding to the average action feature may be taken as the group action of the multiple characters in the to-be-processed images.
可选地,可以从待处理图像中选择一帧图像,将该帧图像中根据多个人物的动作特征融合得到的该帧图像的动作特征输入分类模块,以得到对该帧图像的分类结果,进而将对该帧图像的分类结果作为待处理图像中的多个人物的群体动作。Optionally, a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
在另一种实现方式中,可以对待处理图像中的多个人物中每个人物在每帧图像中的动作特征进行分类,得到每个人物的动作,并据此确定多个人物的群体动作。In another implementation manner, it is possible to classify the action characteristics of each of the multiple characters in the image to be processed in each frame of the image to obtain the actions of each character, and determine the group actions of the multiple characters accordingly.
可选地,可以将处理图像中的多个人物中每个人物在每帧图像中的动作特征输入分类模块,以得到对上述多个人物中每个人物动作特征的分类结果,即每个人物的动作,进而将对应的人物数量最多的动作作为多个人物的群体动作。Optionally, the action characteristics of each of the multiple characters in the processed image in each frame of the image can be input into the classification module to obtain a classification result of the action characteristics of each of the multiple characters, that is, each character Then, the action with the largest number of characters is regarded as a group action of multiple characters.
可选地,可以从多个人物中选择某一人物,将该人物在每帧图像中的动作特征输入分类模块,以得到对该人物动作特征的分类结果,即该人物的动作,进而将上述得到的该人物的动作作为待处理图像中的多个人物的群体动作。Optionally, a certain person can be selected from a plurality of people, and the action characteristics of the person in each frame of the image can be input into the classification module to obtain the classification result of the action characteristics of the person, that is, the action of the person, and then the above The obtained action of the character is used as a group action of multiple characters in the image to be processed.
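The following is a minimal sketch of the voting step described in the foregoing optional manners: the per-frame (or per-person) classification results are counted, and the category with the most votes is taken as the group action. The class labels are hypothetical and stand in for the output of the classification module.

```python
from collections import Counter

# Hypothetical per-frame classification results output by the classification module.
frame_predictions = ["volleyball", "volleyball", "basketball", "volleyball", "dancing"]

# The category assigned to the largest number of frames is taken as the group action.
group_action, votes = Counter(frame_predictions).most_common(1)[0]
print(group_action, votes)   # volleyball 3
```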
Steps S901 to S904 can be implemented by the neural network model obtained through the training shown in FIG. 8.
It should be understood that there is no order limitation on the foregoing steps. For example, the temporal features may alternatively be determined first, and the spatial features determined afterwards. Details are not described again herein.
When determining the group action of a plurality of persons, the method shown in FIG. 9 considers not only the temporal features of the plurality of persons but also their spatial features. By combining the temporal features and the spatial features of the plurality of persons, the group action of the plurality of persons can be determined better and more accurately.
Optionally, in the method shown in FIG. 9, after the group action of the plurality of persons in the image to be processed is recognized, label information of the image to be processed is generated according to the group action, where the label information is used to indicate the group action of the plurality of persons in the image to be processed.
The foregoing manner can be used, for example, to classify a video library, where the videos in the video library are tagged according to their corresponding group actions, so that users can conveniently view and search for them.
Optionally, in the method shown in FIG. 9, after the group action of the plurality of persons in the image to be processed is recognized, a key person in the image to be processed is determined according to the group action.
Optionally, in the process of determining the key person, the contribution of each of the plurality of persons in the image to be processed to the group action may be determined first, and the person with the highest contribution is then determined as the key person.
It should be understood that the contribution of the key person to the group action of the plurality of persons is greater than the contributions of the other persons among the plurality of persons.
The foregoing manner can be used, for example, to detect a key person in a video. Usually, a video contains several persons, most of whom are not important. Effectively detecting the key person helps to understand the video content more quickly and accurately based on the information around the key person.
For example, if a video shows a ball game, the player holding the ball has the greatest influence on everyone present, including the players, referees and spectators, and also contributes the most to the group action. Therefore, the player holding the ball can be determined as the key person. Identifying the key person helps viewers of the video understand what is happening and what is about to happen in the game.
FIG. 10 is a schematic flowchart of an image recognition method according to an embodiment of this application.
S1001: Extract image features of the image to be processed.
The image to be processed includes at least one frame of image, and the image features of the image to be processed include image features of a plurality of persons in the image to be processed.
Before step S1001, the image to be processed can be obtained. The image to be processed may be obtained from a memory, or the image to be processed may be received.
For example, when the image recognition method shown in FIG. 10 is performed by an image recognition apparatus, the image to be processed may be an image obtained from the image recognition apparatus, an image received by the image recognition apparatus from another device, or an image captured by a camera of the image recognition apparatus.
It should be understood that the image to be processed may be one frame of image or a plurality of frames of images.
When the image to be processed includes a plurality of frames, the frames may be consecutive frames in a video, or frames selected from a video according to a preset rule. For example, a plurality of frames of images may be selected from a video at a preset time interval, or at a preset frame-number interval.
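A minimal sketch of the frame-selection rules mentioned above is given below; the function name and parameters are illustrative assumptions, not part of the embodiments.

```python
def sample_frames(num_frames, fps, time_interval=None, frame_stride=None):
    """Return the indices of the frames selected from a video.

    Either a preset time interval (in seconds) or a preset frame-number
    interval can be used; both parameter names are illustrative assumptions.
    """
    if time_interval is not None:
        frame_stride = max(1, int(round(time_interval * fps)))
    return list(range(0, num_frames, frame_stride))

print(sample_frames(num_frames=300, fps=25, time_interval=1.0))  # every 25th frame
print(sample_frames(num_frames=300, fps=25, frame_stride=30))    # every 30th frame
```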
The image to be processed may include a plurality of persons, and the plurality of persons may include only humans, only animals, or both humans and animals.
Optionally, the method shown in step S901 in FIG. 9 may be used to extract the image features of the image to be processed.
S1002: Determine the spatial feature of each of the plurality of persons in each frame of the image to be processed.
The spatial feature of a person among the plurality of persons in a frame of the image to be processed is determined according to the similarity between the image feature of the person in the frame and the image features of the other persons in the frame.
Optionally, the method shown in step S902 in FIG. 9 may be used to determine the spatial features of the plurality of persons in each frame of the image to be processed.
S1003: Determine the action feature of each of the plurality of persons in each frame of the image to be processed.
The action feature of a person among the plurality of persons in a frame of the image to be processed is obtained by fusing the spatial feature of the person in the frame with the image feature of the person in the frame.
Optionally, the fusion method shown in step S904 in FIG. 9 may be used to determine the action features of the plurality of persons in each frame of the image to be processed.
S1004: Recognize the group action of the plurality of persons in the image to be processed according to the action features of the plurality of persons in each frame of the image to be processed.
Optionally, the method shown in step S905 in FIG. 9 may be used to recognize the group action of the plurality of persons in the image to be processed.
In the method shown in FIG. 10, the temporal feature of each person does not need to be calculated. When the determination of the spatial features of the persons does not depend on their temporal features, the group action of the plurality of persons can be determined more conveniently. For another example, when only one frame of image is recognized, there are no temporal features of the same person at different times, and this method is also more suitable for that case.
FIG. 11 is a schematic flowchart of an image recognition method according to an embodiment of this application.
S1101: Extract image features of the image to be processed.
The image to be processed includes a plurality of frames of images, and the image features of the image to be processed include image features of a plurality of persons in each of at least one frame of image selected from the plurality of frames of images.
Optionally, feature extraction may be performed on the images corresponding to the plurality of persons in the input frames of images.
In one frame of image, the image feature of a person can be used to represent the body posture of the person in the frame, that is, the relative positions of the person's limbs. The image feature of a person can be represented by a vector, which may be called an image feature vector. The image feature extraction can be performed by a CNN.
Optionally, when the image features of the image to be processed are extracted, target tracking may be performed on each person to determine the bounding box of each person in each frame of image, where the image in each bounding box corresponds to one person; feature extraction is then performed on the image in each bounding box to obtain the image feature of each person.
The image in a bounding box contains a large amount of redundant information that is unrelated to the person's action. To improve the accuracy of the image feature vector, the influence of the redundant information can be reduced by identifying the skeleton nodes of the person in each bounding box.
Optionally, the skeleton nodes in the bounding box may be connected according to the structure of the person to obtain a connected image, and image feature vector extraction is then performed on the connected image. Alternatively, the region where the skeleton nodes are located and the region outside it may be displayed in different colors, and image feature extraction is then performed on the processed image.
Further, a partially visible image corresponding to the bounding box may be determined according to the image region where the person's skeleton nodes are located, and feature extraction is then performed on the partially visible image to obtain the image features of the image to be processed.
The partially visible image is an image composed of the regions where the skeleton nodes of the person in the image to be processed are located. Specifically, the region outside the region where the person's skeleton nodes are located in the bounding box may be masked to obtain the partially visible image.
When the region outside the region where the skeleton nodes are located is masked, the color of the pixels in that region can be set to a preset color, such as black. In other words, the region where the skeleton nodes are located retains the same information as the original image, while the information of the region outside it is masked. Therefore, when extracting image features, only the image features of the partially visible image need to be extracted, and no extraction operation needs to be performed on the masked region.
The region where a skeleton node is located may be a square, a circle, or another shape centered on the skeleton node. The side length (or radius), area, and the like of the region can be preset values.
In the foregoing method for extracting the image features of the image to be processed, features may be extracted from the partially visible image to obtain the image feature vector of the person corresponding to the bounding box; alternatively, a mask matrix may be determined according to the skeleton nodes, and the image is masked according to the mask matrix.
The method for determining the mask matrix according to the skeleton nodes is described below with a specific example.
S1101a) Determine the bounding box of each person in advance.
At time t, the image of the k-th person within the bounding box is denoted as $I_t^k$.
S1101b) Extract the skeleton nodes of each person in advance.
At time t, the skeleton nodes of the k-th person are extracted and denoted as $J_t^k$.
S1101c) Calculate the mask matrix for the person's action.
The action mask matrix $M_t^k$ of the person can be computed from the person image $I_t^k$ and the skeleton nodes $J_t^k$. Each entry of the mask matrix $M_t^k$ corresponds to one pixel.
Optionally, in the mask matrix $M_t^k$, the values within the square regions of side length l centered on the skeleton points are set to 1, and the values at the other positions are set to 0. The mask matrix $M_t^k$ is calculated as follows:
$$M_t^k(p)=\begin{cases}1, & \text{if pixel } p \text{ lies in a square of side length } l \text{ centered on a skeleton node in } J_t^k\\ 0, & \text{otherwise}\end{cases}$$
In the RGB color mode, the RGB model assigns an intensity value in the range of 0 to 255 to each RGB component of every pixel in the image. When the RGB color mode is used, the calculation formula of the mask matrix $M_t^k$ can be expressed correspondingly, with the mask applied to each of the three color channels of a pixel.
The original person action image $I_t^k$ is masked by the matrix $M_t^k$ to obtain the partially visible image $\tilde{I}_t^k$:
$$\tilde{I}_t^k = M_t^k \circ I_t^k$$
Each element of $\tilde{I}_t^k$ can represent one pixel, and the RGB components of each pixel in $\tilde{I}_t^k$ take values between 0 and 1. The operator "$\circ$" indicates that each element of $M_t^k$ is multiplied by the corresponding element of $I_t^k$.
FIG. 12 is a schematic diagram of a process of obtaining a partially visible image according to an embodiment of this application. As shown in FIG. 12, the image $I_t^k$ is masked. Specifically, a region of side length l around each skeleton node in $J_t^k$ is retained, and the other regions are masked.
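The following sketch illustrates the masking operation described above under the stated assumptions: a 0/1 mask is built from square regions of side length l centered on the skeleton nodes, and the bounding-box image (assumed to be scaled to [0, 1]) is multiplied element-wise by the mask. The function name, default side length and coordinate convention are illustrative choices.

```python
import numpy as np

def partially_visible(image, skeleton_nodes, side=11):
    """Mask everything outside side x side squares centered on the skeleton nodes.

    image:          H x W x 3 array with pixel values already scaled to [0, 1]
    skeleton_nodes: list of (row, col) coordinates of the person's skeleton nodes
    side:           square side length l (the default value is an assumption)
    """
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=image.dtype)
    half = side // 2
    for r, c in skeleton_nodes:
        r0, r1 = max(0, r - half), min(h, r + half + 1)
        c0, c1 = max(0, c - half), min(w, c + half + 1)
        mask[r0:r1, c0:c1] = 1.0
    # Element-wise product (the "o" operator): masked regions become black.
    return image * mask[:, :, None]

# Example: a 64 x 48 person crop with two skeleton nodes.
crop = np.random.rand(64, 48, 3)
visible = partially_visible(crop, [(10, 20), (40, 25)], side=9)
```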
Assume that the number of persons in the T frames of images is the same, that is, each of the T frames includes images of K persons. An image feature $x_t^k$ is extracted from the partially visible image $\tilde{I}_t^k$ corresponding to each of the K persons in each of the T frames. The image feature can be represented by a D-dimensional vector, that is, $x_t^k\in\mathbb{R}^{D}$. The image feature extraction for the T frames of images can be performed by a CNN.
The set of image features of the K persons in the T frames of images can be denoted as $X\in\mathbb{R}^{T\times K\times D}$, where $X=\{x_t^k \mid t=1,\dots,T;\ k=1,\dots,K\}$. For each person, extracting image features from the partially visible image $\tilde{I}_t^k$ reduces the redundant information in the bounding box, extracts image features based on the structure information of the body, and enhances the ability of the image features to represent the person's actions.
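A possible sketch of the CNN-based feature extraction over the T×K partially visible person images is shown below; ResNet-18 is used here only as an example backbone, and the input sizes are assumed.

```python
import torch
import torchvision.models as models

T, K = 10, 12                             # assumed numbers of frames and persons
crops = torch.randn(T * K, 3, 224, 224)   # stacked partially visible person images

backbone = models.resnet18()              # any CNN backbone could play this role
backbone.fc = torch.nn.Identity()         # keep the pooled D = 512 dimensional feature
backbone.eval()

with torch.no_grad():
    feats = backbone(crops)               # shape (T*K, 512)
X = feats.view(T, K, -1)                  # the set X in R^{T x K x D}
```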
S1102: Determine the dependency between the actions of different persons in the image to be processed, and the dependency between the actions of the same person at different moments.
In this step, a cross interaction module (CIM) is used to determine the spatial correlation between the actions of different persons in the image to be processed, and the temporal correlation between the actions of the same person at different times.
The cross interaction module is used to implement feature interaction and establish a feature interaction model, where the feature interaction model is used to represent the temporal and/or spatial association of the persons' body postures.
The spatial correlation of a person's body posture can be reflected through spatial dependence. Spatial dependence is used to express the dependence of a person's body posture in a frame of image on the body postures of the other persons in the frame, that is, the spatial dependence between persons' actions. The spatial dependence can be expressed by a spatial feature vector.
For example, if one frame of the image to be processed corresponds to time t, the spatial feature vector $s_t^k$ of the k-th person at time t can be expressed as:
$$s_t^k=\frac{1}{K}\sum_{k'=1}^{K} r\!\left(x_t^k, x_t^{k'}\right)\, g\!\left(x_t^{k'}\right)$$
Here, K indicates that there are K persons in the corresponding frame of image at time t, $x_t^k$ denotes the image feature of the k-th person among the K persons at time t, and $x_t^{k'}$ denotes the image feature of the k'-th person among the K persons at time t. The function $r(a,b)=\theta(a)^{\mathsf T}\phi(b)$ is used to calculate the similarity between feature a and feature b, and $\theta(\cdot)$, $\phi(\cdot)$ and $g(\cdot)$ denote three linear embedding functions, which may be the same or different. $r(a,b)$ can reflect the dependence of feature b on feature a.
By calculating the similarity between the image features of different persons in the same frame of image, the spatial dependence between the body postures of the different persons in that frame can be determined.
The temporal correlation of a person's body posture can be reflected through time dependence. Time dependence, which can also be called temporal dependence, is used to express the dependence of a person's body posture in a frame of image on the person's body posture in the other frames, that is, the inherent temporal dependence of a person's action. The time dependence can be expressed by a temporal feature vector.
For example, if one frame of the image to be processed corresponds to time t, the temporal feature vector $q_t^k$ of the k-th person at time t can be expressed as:
$$q_t^k=\frac{1}{T}\sum_{t'=1}^{T} r\!\left(x_t^k, x_{t'}^{k}\right)\, g\!\left(x_{t'}^{k}\right)$$
Here, T indicates that the image to be processed contains images at T moments, that is, the image to be processed includes T frames of images; $x_t^k$ denotes the image feature of the k-th person at time t, and $x_{t'}^{k}$ denotes the image feature of the k-th person at time t'.
By calculating the similarity between the image features of the same person at different times, the time dependence between the body postures of the person at different times can be determined.
The spatio-temporal feature vector $h_t^k$ of the k-th person at time t can be computed from the spatial feature vector $s_t^k$ and the temporal feature vector $q_t^k$ of the k-th person at time t in the image to be processed. The spatio-temporal feature vector $h_t^k$ can be used to represent the "spatio-temporal" association information of the k-th person. The spatio-temporal feature vector $h_t^k$ can be expressed as the result of an "addition" operation, denoted $\oplus$, on the temporal feature vector $q_t^k$ and the spatial feature vector $s_t^k$:
$$h_t^k = q_t^k \oplus s_t^k$$
FIG. 13 is a schematic diagram of a method for calculating the similarity between image features according to an embodiment of this application. As shown in FIG. 13, the vector representation of the similarity between the image feature $x_t^k$ of the k-th person at time t and the image features of the other persons at time t, and the vector representation of the similarity between the image feature $x_t^k$ of the k-th person at time t and the image features of the k-th person at the other moments, are averaged (Avg) to determine the spatio-temporal feature vector $h_t^k$ of the k-th person at time t. The set of spatio-temporal feature vectors of the K persons in the T frames of images can be denoted as $H\in\mathbb{R}^{T\times K\times D}$, where $H=\{h_t^k \mid t=1,\dots,T;\ k=1,\dots,K\}$.
S1103: Fuse the image features with the spatio-temporal feature vectors to obtain the action feature of each frame of image.
The image features in the set $X\in\mathbb{R}^{T\times K\times D}$ of image features of the K persons in the images at the T moments are fused with the spatio-temporal feature vectors in the set $H\in\mathbb{R}^{T\times K\times D}$ of spatio-temporal feature vectors of the K persons, to obtain the action feature of each of the images at the T moments. The action feature of each frame of image can be represented by an action feature vector.
The image feature $x_t^k$ of the k-th person at time t can be fused with the spatio-temporal feature vector $h_t^k$ to obtain the person feature vector $b_t^k$ of the k-th person at time t. The image feature $x_t^k$ and the spatio-temporal feature vector $h_t^k$ can be combined through a residual connection to obtain the person feature vector $b_t^k$:
$$b_t^k = x_t^k + h_t^k$$
According to the person feature vector $b_t^k$ of each of the K persons, the set $B_t\in\mathbb{R}^{T\times K\times D}$ of the person feature vectors of the K persons at time t can be expressed as:
$$B_t=\left\{\, b_t^k \mid k=1,2,\dots,K \,\right\}$$
Maximum pooling is performed on the set $B_t$ of person feature vectors to obtain the action feature vector $z_t$, where each element of the action feature vector $z_t$ is the maximum value of the corresponding element over the person feature vectors in $B_t$.
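A minimal sketch of the residual fusion and per-frame maximum pooling described above is given below; the feature values are random stand-ins for the outputs of the earlier steps.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, D = 10, 12, 512
X = rng.standard_normal((T, K, D))   # image features x_t^k (stand-in values)
H = rng.standard_normal((T, K, D))   # spatio-temporal features h_t^k (stand-in values)

B = X + H                            # residual connection: b_t^k = x_t^k + h_t^k
Z = B.max(axis=1)                    # max pooling over the K persons -> z_t, shape (T, D)
```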
S1104: Perform classification prediction on the action feature of each frame of image to determine the group action in the image to be processed.
The classification module can be a softmax classifier. The classification result of the classification module can use one-hot coding, that is, only one bit of the output is valid. In other words, the category corresponding to the classification result of any image feature vector is a single category among the output categories of the classification module.
The action feature vector $z_t$ of one frame of image at time t can be input into the classification module to obtain the classification result of the frame of image. The classification result of the classification module for $z_t$ at any time t can be taken as the classification result of the group action in the T frames of images. The classification result of the group action in the T frames of images can also be understood as the classification result of the group action of the persons in the T frames of images, or the classification result of the T frames of images.
The action feature vectors $z_1, z_2, \dots, z_T$ of the T frames of images can be input into the classification module respectively, to obtain the classification result of each frame of image. The classification results of the T frames of images may belong to one or more categories. The category that corresponds to the largest number of the T frames of images among the output categories of the classification module can be taken as the classification result of the group action in the T frames of images.
The action feature vectors $z_1, z_2, \dots, z_T$ of the T frames of images can also be averaged to obtain an average action feature vector $\bar{z}$, where each element of the average action feature vector $\bar{z}$ is the average of the corresponding element over $z_1, z_2, \dots, z_T$. The average action feature vector $\bar{z}$ can be input into the classification module to obtain the classification result of the group action in the T frames of images.
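The averaging-then-classification variant described above can be sketched as follows; the number of classes and the linear classifier weights are illustrative stand-ins for the trained classification module.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, num_classes = 10, 512, 8            # assumed sizes and group-action class count
Z = rng.standard_normal((T, D))           # per-frame action feature vectors z_1 ... z_T

z_bar = Z.mean(axis=0)                    # average action feature vector
W_cls = rng.standard_normal((D, num_classes)) / np.sqrt(D)  # stand-in linear classifier
logits = z_bar @ W_cls
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax over the group-action categories
predicted_class = int(np.argmax(probs))   # index of the predicted group action
```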
The foregoing method can complete the complex inference process of group action recognition: image features of a plurality of frames of images are extracted; the temporal features and spatial features are determined according to the interdependence between the actions of different persons in the images and between the actions of the same person at different moments; the temporal features, spatial features and image features are then fused to obtain the action feature of each frame of image; and the group action in the plurality of frames of images is inferred by classifying the action feature of each frame of image.
In the embodiments of this application, when the group action of a plurality of persons is determined, not only the temporal features of the plurality of persons but also their spatial features are considered. By combining the temporal features and spatial features of the plurality of persons, the group action of the plurality of persons can be determined better and more accurately.
For the case where the temporal features do not need to be considered, that is, where the spatial features do not depend on the temporal features, the embodiments of this application may also perform recognition by considering only the spatial features of the plurality of persons when determining their group action, so that the group action of the plurality of persons can be determined more conveniently.
Experiments on popular benchmark data sets demonstrate the effectiveness of the image recognition method provided in the embodiments of this application.
Using the trained neural network for image recognition can accurately recognize group actions. Table 1 shows the recognition accuracy obtained by recognizing public data sets with the trained neural network model and the image recognition method provided in the embodiments of this application. Data including group actions in the public data sets is input into the trained neural network. The multi-class accuracy (MCA) represents the proportion of correctly classified results among the classification results of the neural network for the data including group actions. The mean per class accuracy (MPCA) represents the average, over all classes, of the proportion of correctly classified results of each class to the amount of data of that class in the data including group actions.
Table 1
In the neural network training process of this application, the training of the neural network can be completed without relying on per-person action labels.
An end-to-end training manner is adopted in the training process, that is, the neural network is adjusted only according to the final classification result.
With two simple neural networks, the convolutional neural network AlexNet and the residual network ResNet-18, training is performed by using the neural network training method provided in the embodiments of this application, and group action recognition is performed by using the image recognition method provided in the embodiments of this application. The accuracy in terms of MCA and MPCA is high, and good results can be achieved with both networks.
Feature interaction means determining the dependency between persons and the temporal dependency of a person's actions. The similarity between two image features is calculated by the function r(a,b); the larger the value of r(a,b), the stronger the dependency between the body postures corresponding to the two image features.
The spatial feature vector of each person in a frame of image is determined according to the similarity between the image features of the persons in the frame. The spatial feature vector of a person in a frame of image is used to represent the person's spatial dependence on the other persons in the frame, that is, the dependence of the person's body posture on the body postures of the other persons.
FIG. 14 is a schematic diagram of the spatial relationship between the actions of different persons according to an embodiment of this application. For the frame of image of the group action shown in FIG. 14, the spatial dependence matrix in FIG. 15 represents the dependence of each person in the group action on the body postures of the other persons. Each entry of the spatial dependence matrix is represented by a square, and the shade (brightness) of the square represents the similarity between the image features of the two persons, that is, the value of the function r(a,b). The larger the value of r(a,b), the darker the color of the square. The values of r(a,b) can be normalized, that is, mapped to between 0 and 1, so as to draw the spatial dependence matrix.
Intuitively, the hitter in FIG. 14, player No. 10, has a large influence on the subsequent actions of her teammates. Through the calculation of the function r(a,b), the tenth row and the tenth column of the spatial dependence matrix, which represent player No. 10, are darker; that is, player No. 10 is the most relevant to the group action. Therefore, the function r(a,b) can reflect a high degree of association between the body posture of one person and the body postures of the other persons in a frame of image, that is, a case with a high degree of dependence. In FIG. 14, the spatial dependency between the body postures of players No. 1 to No. 6 is weak. In the spatial dependence matrix, this is reflected in the colors within the black-box region in the upper left corner, which represents the dependency between the body postures of players No. 1 to No. 6. Therefore, the neural network provided in the embodiments of this application can well reflect the dependency, or association, between the body posture of one person and the body postures of the other persons in a frame of image.
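A sketch of how such a spatial dependence matrix can be computed and normalized to [0, 1] is given below; the feature values and projection matrices are random stand-ins for the learned embeddings θ and φ.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 12, 512                                   # assumed number of persons and feature dims
x = rng.standard_normal((K, D))                  # image features of the K persons in one frame
W_theta = rng.standard_normal((D, D)) / np.sqrt(D)
W_phi   = rng.standard_normal((D, D)) / np.sqrt(D)

# Pairwise similarities r(a, b) = theta(a)^T phi(b), then min-max normalization to [0, 1].
S = (x @ W_theta) @ (x @ W_phi).T                # K x K spatial dependence matrix
S = (S - S.min()) / (S.max() - S.min())
```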
The temporal feature vector of a person in a frame of image is determined according to the similarity between the image features of the person across the frames of images. The temporal feature vector of a person in one frame of image is used to represent the dependence of the person's body posture on the person's body posture in the other frames of images.
The body postures of player No. 10 shown in FIG. 14 in 10 chronologically ordered frames of images are shown in FIG. 16, and the time dependence matrix in FIG. 17 represents the temporal dependence of the body posture of player No. 10. Each entry of the time dependence matrix is represented by a square, and the shade (brightness) of the square represents the similarity between the two image features, that is, the value of the function r(a,b).
The body postures of player No. 10 in the 10 frames of images correspond to taking off (frames 1-3), being airborne (frames 4-8), and landing (frames 9-10). In people's perception, "taking off" and "landing" should be more discriminative. In the time dependence matrix shown in FIG. 17, the image features of player No. 10 in the 2nd and 10th frames have relatively high similarity with the image features in the other frames. In the black-box region shown in FIG. 17, the image features of frames 4-8, that is, of player No. 10 in the airborne state, have low similarity with the image features in the other frames. Therefore, the neural network provided in the embodiments of this application can well reflect the temporal association of a person's body posture across a plurality of frames of images.
The method embodiments of the embodiments of this application are described above with reference to the accompanying drawings, and the apparatus embodiments of the embodiments of this application are described below. It should be understood that the description of the method embodiments corresponds to the description of the apparatus embodiments; therefore, for parts that are not described, refer to the foregoing method embodiments.
FIG. 18 is a schematic diagram of the system architecture of an image recognition apparatus according to an embodiment of this application. The image recognition apparatus shown in FIG. 18 includes a feature extraction module 1801, a cross interaction module 1802, a feature fusion module 1803 and a classification module 1804. The image recognition apparatus in FIG. 18 can perform the image recognition method of the embodiments of this application. The process in which the image recognition apparatus processes an input picture is described below.
The feature extraction module 1801, which may also be called a partial-body extractor module, is configured to extract the image features of persons according to the skeleton nodes of the persons in the images. The functions of the feature extraction module 1801 can be implemented by a convolutional network. A plurality of frames of images are input into the feature extraction module 1801. The image feature of a person can be represented by a vector, which can be called the image feature vector of the person.
The cross interaction module 1802 is configured to map the image features of the plurality of persons in each of the plurality of frames of images to the spatio-temporal interaction feature of each person. The spatio-temporal interaction feature is used to represent the "spatio-temporal" association information of a person. The spatio-temporal interaction feature of a person in a frame of image may be obtained by fusing the temporal feature and the spatial feature of the person in the frame. The cross interaction module 1802 can be implemented by a convolutional layer and/or a fully connected layer.
The feature fusion module 1803 is configured to fuse the action feature and the spatio-temporal interaction feature of each person in a frame of image, to obtain the image feature vector of the frame of image. The image feature vector of the frame of image can serve as the feature representation of the frame of image.
The classification module 1804 is configured to perform classification according to the image feature vector, so as to determine the category of the group action of the persons in the T frames of images input into the feature extraction module 1801. The classification module 1804 can be a classifier.
The image recognition apparatus shown in FIG. 18 can be used to perform the image recognition method shown in FIG. 11.
FIG. 19 is a schematic structural diagram of an image recognition apparatus according to an embodiment of this application. The image recognition apparatus 3000 shown in FIG. 19 includes an acquisition unit 3001 and a processing unit 3002.
The acquisition unit 3001 is configured to acquire an image to be processed.
The processing unit 3002 is configured to perform the image recognition methods of the embodiments of this application.
Optionally, the acquisition unit 3001 may be configured to acquire the image to be processed, and the processing unit 3002 may be configured to perform steps S901 to S904 or steps S1001 to S1004, to recognize the group action of a plurality of persons in the image to be processed.
Optionally, the acquisition unit 3001 may be configured to acquire the image to be processed, and the processing unit 3002 may be configured to perform steps S1101 to S1104, to recognize the group action of the persons in the image to be processed.
The processing unit 3002 can be divided into a plurality of modules according to different processing functions.
For example, the processing unit 3002 can be divided into the feature extraction module 1801, the cross interaction module 1802, the feature fusion module 1803 and the classification module 1804 shown in FIG. 18. The processing unit 3002 can implement the functions of the modules shown in FIG. 18, and can therefore be used to implement the image recognition method shown in FIG. 11.
FIG. 20 is a schematic diagram of the hardware structure of an image recognition apparatus according to an embodiment of this application. The image recognition apparatus 4000 shown in FIG. 20 (the apparatus 4000 may specifically be a computer device) includes a memory 4001, a processor 4002, a communication interface 4003 and a bus 4004. The memory 4001, the processor 4002 and the communication interface 4003 are communicatively connected to each other through the bus 4004.
The memory 4001 may be a read only memory (ROM), a static storage device, a dynamic storage device or a random access memory (RAM). The memory 4001 may store a program; when the program stored in the memory 4001 is executed by the processor 4002, the processor 4002 is configured to perform the steps of the image recognition method of the embodiments of this application.
The processor 4002 may be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU) or one or more integrated circuits, and is configured to execute a related program to implement the image recognition method of the method embodiments of this application.
The processor 4002 may also be an integrated circuit chip with a signal processing capability. In an implementation process, the steps of the image recognition method of this application can be completed by an integrated logic circuit of hardware in the processor 4002 or by instructions in the form of software.
The processor 4002 may also be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods, steps and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of this application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 4001, and the processor 4002 reads the information in the memory 4001 and, in combination with its hardware, completes the functions that need to be performed by the units included in the image recognition apparatus, or performs the image recognition method of the method embodiments of this application.
The communication interface 4003 uses a transceiver apparatus such as, but not limited to, a transceiver to implement communication between the apparatus 4000 and another device or a communication network. For example, the image to be processed can be obtained through the communication interface 4003.
The bus 4004 may include a path for transferring information between the components of the apparatus 4000 (for example, the memory 4001, the processor 4002 and the communication interface 4003).
FIG. 21 is a schematic diagram of the hardware structure of a neural network training apparatus according to an embodiment of this application. Similar to the apparatus 4000, the neural network training apparatus 5000 shown in FIG. 21 includes a memory 5001, a processor 5002, a communication interface 5003 and a bus 5004. The memory 5001, the processor 5002 and the communication interface 5003 are communicatively connected to each other through the bus 5004.
The memory 5001 may be a ROM, a static storage device or a RAM. The memory 5001 may store a program; when the program stored in the memory 5001 is executed by the processor 5002, the processor 5002 and the communication interface 5003 are configured to perform the steps of the neural network training method of the embodiments of this application.
The processor 5002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU or one or more integrated circuits, and is configured to execute a related program to implement the functions that need to be performed by the units in the image processing apparatus of the embodiments of this application, or to perform the neural network training method of the method embodiments of this application.
The processor 5002 may also be an integrated circuit chip with a signal processing capability. In an implementation process, the steps of the neural network training method of the embodiments of this application can be completed by an integrated logic circuit of hardware in the processor 5002 or by instructions in the form of software.
The processor 5002 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods, steps and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of this application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 5001, and the processor 5002 reads the information in the memory 5001 and, in combination with its hardware, completes the functions that need to be performed by the units included in the image processing apparatus of the embodiments of this application, or performs the neural network training method of the method embodiments of this application.
The communication interface 5003 uses a transceiver apparatus such as, but not limited to, a transceiver to implement communication between the apparatus 5000 and another device or a communication network. For example, the image to be processed can be obtained through the communication interface 5003.
The bus 5004 may include a path for transferring information between the components of the apparatus 5000 (for example, the memory 5001, the processor 5002 and the communication interface 5003).
It should be noted that although the apparatus 4000 and the apparatus 5000 described above show only a memory, a processor and a communication interface, in a specific implementation process, a person skilled in the art should understand that the apparatus 4000 and the apparatus 5000 may further include other components necessary for normal operation. In addition, according to specific needs, a person skilled in the art should understand that the apparatus 4000 and the apparatus 5000 may further include hardware components for implementing other additional functions. Furthermore, a person skilled in the art should understand that the apparatus 4000 and the apparatus 5000 may alternatively include only the components necessary for implementing the embodiments of this application, and do not need to include all the components shown in FIG. 20 and FIG. 21.
An embodiment of the present application further provides an image recognition apparatus, including at least one processor and a communication interface, where the communication interface is used by the image recognition apparatus to exchange information with other communication apparatuses, and when program instructions are executed in the at least one processor, the image recognition apparatus is caused to perform the foregoing method.
An embodiment of the present application further provides a computer program storage medium, characterized in that the computer program storage medium has program instructions, and when the program instructions are executed directly or indirectly, the foregoing method is implemented.
An embodiment of the present application further provides a chip system, characterized in that the chip system includes at least one processor, and when program instructions are executed in the at least one processor, the foregoing method is implemented.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered as going beyond the scope of the present application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application essentially, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

  1. An image recognition method, characterized by comprising:
    extracting image features of an image to be processed, wherein the image to be processed comprises a plurality of persons, and the image features of the image to be processed comprise image features of each of the plurality of persons in each frame of multiple frames of the image to be processed;
    determining a temporal feature of each of the plurality of persons in each frame of the multiple frames, wherein the temporal feature of a j-th person of the plurality of persons in an i-th frame of the image to be processed is determined according to similarities between the image feature of the j-th person in the i-th frame and the image features of the j-th person in frames of the multiple frames other than the i-th frame, and i and j are positive integers;
    determining a spatial feature of each of the plurality of persons in each frame of the multiple frames, wherein the spatial feature of the j-th person of the plurality of persons in the i-th frame of the image to be processed is determined according to similarities between the image feature of the j-th person in the i-th frame and the image features of persons of the plurality of persons other than the j-th person in the i-th frame;
    determining an action feature of each of the plurality of persons in each frame of the multiple frames, wherein the action feature of the j-th person of the plurality of persons in the i-th frame of the multiple frames is obtained by fusing the spatial feature of the j-th person in the i-th frame, the temporal feature of the j-th person in the i-th frame, and the image feature of the j-th person in the i-th frame;
    recognizing a group action of the plurality of persons in the image to be processed according to the action features of the plurality of persons in each frame of the multiple frames.
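As an illustration of the feature computation recited in claim 1, the sketch below aggregates per-person, per-frame image features with similarity weights: the temporal feature of person j in frame i is a similarity-weighted sum of that person's features in the other frames, the spatial feature is a similarity-weighted sum of the other persons' features in the same frame, and the three features are fused into an action feature. This is a minimal, hypothetical NumPy sketch; the dot-product similarity, the softmax normalization, and the concatenation-based fusion are assumptions made for illustration and are not taken from the claims.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def group_action_features(feats):
    """feats: array of shape (T, N, D) -- image features of N persons in T frames.
    Returns fused action features of shape (T, N, 3*D)."""
    T, N, D = feats.shape
    temporal = np.zeros_like(feats)
    spatial = np.zeros_like(feats)

    for j in range(N):
        # Temporal feature: similarity of person j's feature in frame i
        # to person j's features in the other frames.
        f = feats[:, j, :]                      # (T, D)
        sim = f @ f.T                           # (T, T) dot-product similarity
        np.fill_diagonal(sim, -np.inf)          # exclude the i-th frame itself
        w = softmax(sim, axis=-1)               # weights over the other frames
        temporal[:, j, :] = w @ f

    for i in range(T):
        # Spatial feature: similarity of person j's feature in frame i
        # to the other persons' features in the same frame.
        f = feats[i]                            # (N, D)
        sim = f @ f.T                           # (N, N) dot-product similarity
        np.fill_diagonal(sim, -np.inf)          # exclude the j-th person itself
        w = softmax(sim, axis=-1)               # weights over the other persons
        spatial[i] = w @ f

    # Fuse image, temporal, and spatial features (concatenation as one possible fusion).
    return np.concatenate([feats, temporal, spatial], axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fused = group_action_features(rng.standard_normal((5, 12, 64)))  # 5 frames, 12 persons
    print(fused.shape)  # (5, 12, 192)
```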
  2. The method according to claim 1, wherein the extracting image features of the image to be processed comprises:
    determining an image region in which skeleton nodes of each of the plurality of persons are located in each frame of the multiple frames;
    performing feature extraction on the image of the image region in which the skeleton nodes of each of the plurality of persons are located, to obtain the image features of the image to be processed.
  3. The method according to claim 2, wherein the performing feature extraction on the image of the image region in which the skeleton nodes of each of the plurality of persons are located, to obtain the image features of the image to be processed, comprises:
    masking, in each frame of the multiple frames, regions other than the image region in which the skeleton nodes of each of the plurality of persons are located, to obtain a partially visible image, wherein the partially visible image is an image composed of the image regions in which the skeleton nodes of each of the plurality of persons are located;
    performing feature extraction on the partially visible image to obtain the image features of the image to be processed.
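To make the masking step of claim 3 concrete, the sketch below zeroes out everything in a frame that lies outside the image regions containing each person's skeleton nodes, producing a partially visible image from which features could then be extracted. The rectangular regions derived from skeleton-node coordinates, the margin parameter, and the zero-fill masking are illustrative assumptions, not the claimed implementation.

```python
import numpy as np

def partially_visible_image(frame, skeletons, margin=8):
    """frame: (H, W, C) image; skeletons: list of (K, 2) arrays of (x, y) skeleton-node
    coordinates, one array per person. Regions outside the per-person boxes are masked."""
    H, W = frame.shape[:2]
    mask = np.zeros((H, W, 1), dtype=frame.dtype)
    for nodes in skeletons:
        x0 = max(int(nodes[:, 0].min()) - margin, 0)
        x1 = min(int(nodes[:, 0].max()) + margin, W)
        y0 = max(int(nodes[:, 1].min()) - margin, 0)
        y1 = min(int(nodes[:, 1].max()) + margin, H)
        mask[y0:y1, x0:x1] = 1          # keep the region containing this person's skeleton nodes
    return frame * mask                  # everything outside the kept regions is masked

if __name__ == "__main__":
    frame = np.random.rand(240, 320, 3)
    skeletons = [np.array([[50, 60], [70, 120], [60, 180]]),
                 np.array([[200, 40], [220, 100], [210, 150]])]
    visible = partially_visible_image(frame, skeletons)
    print(visible.shape)  # (240, 320, 3)
```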
  4. The method according to any one of claims 1 to 3, wherein the recognizing a group action of the plurality of persons in the image to be processed according to the action features of the plurality of persons in each frame of the multiple frames comprises:
    classifying the action feature of each of the plurality of persons in each frame of the multiple frames to obtain an action of each of the plurality of persons;
    determining the group action of the plurality of persons in the image to be processed according to the action of each of the plurality of persons.
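One possible reading of the two steps in claim 4 is sketched below: each person's fused action feature is classified per frame, the per-frame class scores are averaged over the frames, and the group action is then taken as a simple majority vote over the per-person action labels. The linear classifier, the averaging over frames, and the majority vote are hypothetical choices used only for illustration.

```python
import numpy as np

def classify_actions(action_feats, weights, bias):
    """action_feats: (T, N, F) fused per-person features; weights: (F, A); bias: (A,).
    Returns per-person action labels (N,) and a group-action label (scalar)."""
    scores = action_feats @ weights + bias        # (T, N, A) per-frame class scores
    person_scores = scores.mean(axis=0)           # (N, A) averaged over the frames
    person_actions = person_scores.argmax(axis=-1)

    # Group action determined from the individual actions (here: majority vote).
    group_action = int(np.bincount(person_actions).argmax())
    return person_actions, group_action

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    T, N, F, A = 5, 12, 192, 8                    # frames, persons, feature dim, action classes
    person_actions, group_action = classify_actions(
        rng.standard_normal((T, N, F)), rng.standard_normal((F, A)), np.zeros(A))
    print(person_actions, group_action)
```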
  5. The method according to any one of claims 1 to 4, further comprising:
    generating label information of the image to be processed, wherein the label information is used to indicate the group action of the plurality of persons in the image to be processed.
  6. The method according to any one of claims 1 to 4, further comprising:
    determining, according to the group action of the plurality of persons in the image to be processed, a contribution of each of the plurality of persons to the group action of the plurality of persons;
    determining a key person among the plurality of persons according to the contribution of each of the plurality of persons to the group action of the plurality of persons, wherein the contribution of the key person to the group action of the plurality of persons is greater than the contributions of persons of the plurality of persons other than the key person to the group action of the plurality of persons.
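The contribution and key-person determination of claim 6 could, for example, score each person by how strongly that person's action feature supports the recognized group action, and select the person with the largest normalized score as the key person. The dot-product scoring against a group-action class vector and the softmax normalization below are assumptions made for this sketch only.

```python
import numpy as np

def key_person(person_feats, group_class_vector):
    """person_feats: (N, F) per-person action features (e.g. averaged over frames);
    group_class_vector: (F,) classifier vector of the recognized group action.
    Returns per-person contributions (N,) and the index of the key person."""
    scores = person_feats @ group_class_vector          # alignment with the group action
    contributions = np.exp(scores - scores.max())
    contributions /= contributions.sum()                # normalized contribution per person
    return contributions, int(contributions.argmax())   # key person: largest contribution

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    contributions, key = key_person(rng.standard_normal((12, 192)),
                                    rng.standard_normal(192))
    print(contributions.round(3), key)
```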
  7. An image recognition apparatus, characterized by comprising:
    an acquiring unit, configured to acquire an image to be processed;
    a processing unit, configured to:
    extract image features of the image to be processed, wherein the image to be processed comprises a plurality of persons, and the image features of the image to be processed comprise image features of each of the plurality of persons in each frame of multiple frames of the image to be processed;
    determine a temporal feature of each of the plurality of persons in each frame of the multiple frames, wherein the temporal feature of a j-th person of the plurality of persons in an i-th frame of the image to be processed is determined according to similarities between the image feature of the j-th person in the i-th frame and the image features of the j-th person in frames of the multiple frames other than the i-th frame, and i and j are positive integers;
    determine a spatial feature of each of the plurality of persons in each frame of the multiple frames, wherein the spatial feature of the j-th person of the plurality of persons in the i-th frame of the image to be processed is determined according to similarities between the image feature of the j-th person in the i-th frame and the image features of persons of the plurality of persons other than the j-th person in the i-th frame;
    determine an action feature of each of the plurality of persons in each frame of the multiple frames, wherein the action feature of the j-th person of the plurality of persons in the i-th frame of the multiple frames is obtained by fusing the spatial feature of the j-th person in the i-th frame, the temporal feature of the j-th person in the i-th frame, and the image feature of the j-th person in the i-th frame;
    recognize a group action of the plurality of persons in the image to be processed according to the action features of the plurality of persons in each frame of the multiple frames.
  8. The apparatus according to claim 7, wherein the processing unit is configured to:
    determine, in each frame of the multiple frames, an image region in which skeleton nodes of each of the plurality of persons are located;
    perform feature extraction on the image of the image region in which the skeleton nodes of each of the plurality of persons are located, to obtain the image features of the image to be processed.
  9. The apparatus according to claim 8, wherein the processing unit is configured to:
    mask, in each frame of the multiple frames, regions other than the image region in which the skeleton nodes of each of the plurality of persons are located, to obtain a partially visible image, wherein the partially visible image is an image composed of the image regions in which the skeleton nodes of each of the plurality of persons are located;
    perform feature extraction on the partially visible image to obtain the image features of the image to be processed.
  10. The apparatus according to any one of claims 7 to 9, wherein the processing unit is configured to:
    classify the action feature of each of the plurality of persons in each frame of the multiple frames to obtain an action of each of the plurality of persons;
    determine the group action of the plurality of persons in the image to be processed according to the action of each of the plurality of persons.
  11. The apparatus according to claim 10, wherein the processing unit is configured to:
    generate label information of the image to be processed, wherein the label information is used to indicate the group action of the plurality of persons in the image to be processed.
  12. The apparatus according to any one of claims 7 to 10, wherein the processing unit is configured to:
    determine, according to the group action of the plurality of persons in the image to be processed, a contribution of each of the plurality of persons to the group action of the plurality of persons;
    determine a key person among the plurality of persons according to the contribution of each of the plurality of persons to the group action of the plurality of persons, wherein the contribution of the key person to the group action of the plurality of persons is greater than the contributions of persons of the plurality of persons other than the key person to the group action of the plurality of persons.
  13. An image recognition apparatus, characterized in that the apparatus comprises:
    a memory, configured to store a program;
    a processor, configured to execute the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to perform the method according to any one of claims 1 to 6.
  14. A computer-readable storage medium, characterized in that the computer-readable medium stores program code for execution by a device, and the program code comprises instructions for performing the method according to any one of claims 1 to 6.
  15. A chip, characterized in that the chip comprises a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method according to any one of claims 1 to 6.
PCT/CN2020/113788 2019-10-15 2020-09-07 Image recognition method and apparatus, computer-readable storage medium and chip WO2021073311A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910980310.7 2019-10-15
CN201910980310.7A CN112668366B (en) 2019-10-15 2019-10-15 Image recognition method, device, computer readable storage medium and chip

Publications (1)

Publication Number Publication Date
WO2021073311A1 true WO2021073311A1 (en) 2021-04-22

Family

ID=75400028

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113788 WO2021073311A1 (en) 2019-10-15 2020-09-07 Image recognition method and apparatus, computer-readable storage medium and chip

Country Status (2)

Country Link
CN (1) CN112668366B (en)
WO (1) WO2021073311A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111842A (en) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113283381A (en) * 2021-06-15 2021-08-20 南京工业大学 Human body action detection method suitable for mobile robot platform

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112969058B (en) * 2021-05-18 2021-08-03 南京拓晖信息技术有限公司 Industrial video real-time supervision platform and method with cloud storage function
CN113255518B (en) * 2021-05-25 2021-09-24 神威超算(北京)科技有限公司 Video abnormal event detection method and chip
CN113344562B (en) * 2021-08-09 2021-11-02 四川大学 Method and device for detecting Etheng phishing accounts based on deep neural network
CN114494543A (en) * 2022-01-25 2022-05-13 上海商汤科技开发有限公司 Action generation method and related device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005182660A (en) * 2003-12-22 2005-07-07 Matsushita Electric Works Ltd Recognition method of character/figure
CN108229355A (en) * 2017-12-22 2018-06-29 北京市商汤科技开发有限公司 Activity recognition method and apparatus, electronic equipment, computer storage media, program
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109299646A (en) * 2018-07-24 2019-02-01 北京旷视科技有限公司 Crowd's accident detection method, apparatus, system and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9721086B2 (en) * 2013-03-15 2017-08-01 Advanced Elemental Technologies, Inc. Methods and systems for secure and reliable identity-based computing
WO2018126323A1 (en) * 2017-01-06 2018-07-12 Sportlogiq Inc. Systems and methods for behaviour understanding from trajectories
CN110363279B (en) * 2018-03-26 2021-09-21 华为技术有限公司 Image processing method and device based on convolutional neural network model
CN108764019A (en) * 2018-04-03 2018-11-06 天津大学 A kind of Video Events detection method based on multi-source deep learning
CN109299657B (en) * 2018-08-14 2020-07-03 清华大学 Group behavior identification method and device based on semantic attention retention mechanism
CN109993707B (en) * 2019-03-01 2023-05-12 华为技术有限公司 Image denoising method and device
CN110222717B (en) * 2019-05-09 2022-01-14 华为技术有限公司 Image processing method and device
CN110309856A (en) * 2019-05-30 2019-10-08 华为技术有限公司 Image classification method, the training method of neural network and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005182660A (en) * 2003-12-22 2005-07-07 Matsushita Electric Works Ltd Recognition method of character/figure
CN108229355A (en) * 2017-12-22 2018-06-29 北京市商汤科技开发有限公司 Activity recognition method and apparatus, electronic equipment, computer storage media, program
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109299646A (en) * 2018-07-24 2019-02-01 北京旷视科技有限公司 Crowd's accident detection method, apparatus, system and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111842A (en) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113111842B (en) * 2021-04-26 2023-06-27 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113283381A (en) * 2021-06-15 2021-08-20 南京工业大学 Human body action detection method suitable for mobile robot platform
CN113283381B (en) * 2021-06-15 2024-04-05 南京工业大学 Human body action detection method suitable for mobile robot platform

Also Published As

Publication number Publication date
CN112668366A (en) 2021-04-16
CN112668366B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
WO2020253416A1 (en) Object detection method and device, and computer storage medium
US20220092351A1 (en) Image classification method, neural network training method, and apparatus
WO2021073311A1 (en) Image recognition method and apparatus, computer-readable storage medium and chip
Chen et al. Attention-based context aggregation network for monocular depth estimation
WO2021043112A1 (en) Image classification method and apparatus
WO2021042828A1 (en) Neural network model compression method and apparatus, and storage medium and chip
WO2021043273A1 (en) Image enhancement method and apparatus
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
WO2021155792A1 (en) Processing apparatus, method and storage medium
WO2021147325A1 (en) Object detection method and apparatus, and storage medium
CN110222717B (en) Image processing method and device
US20210398252A1 (en) Image denoising method and apparatus
WO2021013095A1 (en) Image classification method and apparatus, and method and apparatus for training image classification model
Das et al. Where to focus on for human action recognition?
WO2021018245A1 (en) Image classification method and apparatus
CN113065645B (en) Twin attention network, image processing method and device
CN110222718B (en) Image processing method and device
WO2021018251A1 (en) Image classification method and device
CN113011562A (en) Model training method and device
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
CN113807183A (en) Model training method and related equipment
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
CN110705564B (en) Image recognition method and device
CN113449550A (en) Human body weight recognition data processing method, human body weight recognition method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20877961

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20877961

Country of ref document: EP

Kind code of ref document: A1