WO2021073311A1 - Image recognition method and apparatus, computer-readable storage medium and chip - Google Patents

Image recognition method and apparatus, computer-readable storage medium and chip

Info

Publication number
WO2021073311A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
frame
person
processed
characters
Prior art date
Application number
PCT/CN2020/113788
Other languages
French (fr)
Chinese (zh)
Inventor
严锐
谢凌曦
田奇
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021073311A1 publication Critical patent/WO2021073311A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an image recognition method, device, computer readable storage medium and chip.
  • Computer vision is an integral part of intelligent/autonomous systems in a wide range of application fields, such as manufacturing, inspection, document analysis, medical diagnosis, and military applications. It studies how to use cameras/video cameras and computers to obtain the data and information about a subject that we need. Figuratively speaking, it gives the computer eyes (a camera/camcorder) and a brain (algorithms) so that it can replace the human eye in identifying, tracking, and measuring targets, enabling the computer to perceive the environment. Because perception can be regarded as extracting information from sensory signals, computer vision can also be regarded as the science of making artificial systems "perceive" from images or multi-dimensional data.
  • In general, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses the computer in place of the brain to process and interpret that input information.
  • the ultimate research goal of computer vision is to enable computers to observe and understand the world through vision like humans, and have the ability to adapt to the environment autonomously.
  • Action recognition is an important research topic in the field of computer vision.
  • the computer can understand the content of the video through motion recognition.
  • Motion recognition technology can be widely used in public place monitoring, human-computer interaction and other fields.
  • Feature extraction is a key link in the process of action recognition. Only based on accurate features, can action recognition be effectively performed.
  • In group action recognition, the temporal relationship of each person's actions across the video and the relationship between the actions of different persons both affect the accuracy of group action recognition.
  • LSTM long short-term memory
  • The interactive action features of each person can be calculated, so that the action features of each person can be determined according to those interactive action features, and the actions of the multiple persons can then be inferred from the action features of each person.
  • Interactive action features are used to express the correlation between persons' actions.
  • the present application provides an image recognition method, device, computer readable storage medium, and chip to better recognize group actions of multiple people in an image to be processed.
  • In a first aspect, an image recognition method is provided, which includes: extracting image features of an image to be processed, where the image to be processed includes multiple frames of images; determining the temporal feature of each of multiple persons in each frame of the multi-frame image; determining the spatial feature of each of the multiple persons in each frame of the multi-frame image; determining the action feature of each of the multiple persons in each frame of the multi-frame image; and recognizing the group action of the multiple persons in the image to be processed based on the action features of each of the multiple persons in each frame of the multi-frame image.
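  • As an illustration only, the following Python sketch outlines the flow of the method in the first aspect; every helper (extract_image_feature, temporal_feature, spatial_feature, fuse, classify_group_action) is a hypothetical stand-in, not part of the application itself.

```python
def recognize_group_action(frames, persons, extract_image_feature,
                           temporal_feature, spatial_feature,
                           fuse, classify_group_action):
    """Illustrative flow only: frames is a list of T images, persons a list of
    K tracked persons; all helper callables are hypothetical stand-ins."""
    T, K = len(frames), len(persons)

    # 1. Image feature of every person in every frame.
    feats = {(i, j): extract_image_feature(frames[i], persons[j])
             for i in range(T) for j in range(K)}

    action_feats = {}
    for i in range(T):
        for j in range(K):
            # 2. Temporal feature: person j's action in frame i vs. the same
            #    person's actions in the other frames.
            t_feat = temporal_feature(feats, i, j)
            # 3. Spatial feature: person j's action in frame i vs. the other
            #    persons' actions in the same frame.
            s_feat = spatial_feature(feats, i, j)
            # 4. Action feature: fuse temporal, spatial and image features.
            action_feats[(i, j)] = fuse(t_feat, s_feat, feats[(i, j)])

    # 5. Group action of the multiple persons from the per-person features.
    return classify_group_action(action_feats)
```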
  • the group actions of multiple characters in the image to be processed may be a certain sport or activity.
  • the group actions of multiple characters in the image to be processed may be basketball, volleyball, football or dancing, etc. .
  • the image to be processed includes multiple people, and the image features of the image to be processed include image features of the multiple people in each of the multiple frames of the image to be processed.
  • the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image received by the image recognition device from another device, Alternatively, the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
  • The above-mentioned image to be processed may be consecutive multiple frames of images in a video, or multiple frames of images selected from a video according to a preset rule.
  • the multiple characters may include only humans, or only animals, or both humans and animals.
  • the person in the image can be identified to determine the person's bounding box.
  • the image in each bounding box corresponds to a person in the image.
  • Feature extraction can be performed on the image in each bounding box to obtain the image feature of each person.
  • The bone nodes of the person in the bounding box corresponding to each person can be identified first, and the image feature vector of the person can then be extracted according to the bone nodes of each person, so that the extracted image features more accurately reflect the person's actions, which improves the accuracy of the extracted image features.
  • the bone nodes in the bounding box can be connected according to the structure of the person to obtain a connected image, and then the image feature vector is extracted on the connected image.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
  • the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
  • The temporal association between a person's actions at different moments can be determined from the similarity between the image feature vectors of that person's actions in different frames of images, and the person's temporal features are then obtained.
  • the multi-frame images in the image to be processed are specifically T frames, and i is a positive integer less than or equal to T
  • the i-th frame of image represents the images in the corresponding order in the T frame image
  • the j-th character represents the characters in the corresponding order among the K characters
  • both i and j are positive integers.
  • The temporal feature of the j-th person in the i-th frame of the image to be processed is determined based on the similarity between the image feature of the j-th person in the i-th frame and the image features of the j-th person in the other frames of the multi-frame image.
  • The temporal feature of the j-th person in the i-th frame is used to indicate the relationship between the action of the j-th person in the i-th frame and the actions of the j-th person in the other frames of the above-mentioned multi-frame image.
  • the similarity between the corresponding image features of a certain person in the two frames of images can reflect the degree of dependence of the person's actions on time.
  • the spatial correlation between the actions of different characters in the frame of image is determined by the similarity between the image characteristics of different characters in the same frame of image.
  • The spatial feature of the j-th of the multiple persons in the i-th frame of the above-mentioned multi-frame image to be processed is determined based on the similarity between the image feature of the j-th person in the i-th frame and the image features of the persons other than the j-th person in the i-th frame. That is to say, the spatial feature of the j-th person in the i-th frame can be determined based on the similarity between the image feature of the j-th person in the i-th frame and the image features of the other persons in the i-th frame.
  • The spatial feature of the j-th person in the i-th frame is used to represent the relationship between the action of the j-th person in the i-th frame and the actions of the persons other than the j-th person in the i-th frame.
  • The similarity between the image feature vector of the j-th person in the i-th frame and the image feature vectors of the other persons can reflect the degree to which the action of the j-th person depends on the actions of the other persons. That is to say, the higher the similarity of the image feature vectors corresponding to two persons, the closer the association between their actions; conversely, the lower the similarity of the image feature vectors corresponding to two persons, the weaker the association between their actions.
  • The similarities used for the above-mentioned temporal features and spatial features can be calculated using the Minkowski distance (such as the Euclidean distance or the Manhattan distance), cosine similarity, the Chebyshev distance, the Hamming distance, and so on.
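  • As a hedged example of the similarity measures mentioned above, the following sketch computes cosine similarity and the Minkowski distance between two image feature vectors with NumPy; the feature dimensionality and the random vectors are arbitrary.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Higher value -> the two actions are more strongly associated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def minkowski_distance(a: np.ndarray, b: np.ndarray, p: int = 2) -> float:
    # p=2 gives the Euclidean distance, p=1 the Manhattan distance.
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Example: similarity between one person's features in two different frames
# (temporal) or between two persons' features in the same frame (spatial).
f1 = np.random.rand(256)
f2 = np.random.rand(256)
print(cosine_similarity(f1, f2), minkowski_distance(f1, f2))
```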
  • the spatial correlation between actions of different characters and the temporal correlation between actions of the same character can provide important clues to the categories of multi-person scenes in the image. Therefore, in the image recognition process of the present application, by comprehensively considering the spatial association relationship between different person actions and the time association relationship between the same person actions, the accuracy of recognition can be effectively improved.
  • The temporal feature, spatial feature, and image feature corresponding to a person in a frame of image can be fused to obtain that person's action feature in the frame of image.
  • a combined fusion method can be used for fusion.
  • the feature corresponding to a person in a frame of image is merged to obtain the action feature of the person in the frame of image.
  • the features to be fused may be added directly or weighted.
  • cascade and channel fusion can be used for fusion.
  • the dimensions of the features to be fused may be directly spliced, or spliced after being multiplied by a certain coefficient, that is, a weight value.
  • a pooling layer may be used to process the above-mentioned multiple features, so as to realize the fusion of the above-mentioned multiple features.
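  • The fusion options listed above might look as follows in NumPy; the weights and dimensions are illustrative assumptions rather than values from the application.

```python
import numpy as np

def fuse_add(features, weights=None):
    # Combined fusion: add the features directly or as a weighted sum.
    feats = np.stack(features)                       # shape (n, d)
    if weights is None:
        return feats.sum(axis=0)
    return (np.asarray(weights)[:, None] * feats).sum(axis=0)

def fuse_concat(features, weights=None):
    # Cascade/channel fusion: splice the feature dimensions, optionally
    # multiplying each feature by a coefficient (weight value) first.
    if weights is not None:
        features = [w * f for w, f in zip(weights, features)]
    return np.concatenate(features)

def fuse_pool(features):
    # Pooling-based fusion over the feature set (here: max pooling).
    return np.stack(features).max(axis=0)

temporal, spatial, image = (np.random.rand(256) for _ in range(3))
action_feature = fuse_add([temporal, spatial, image], weights=[0.3, 0.3, 0.4])
```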
  • the identification of multiple characters in the image to be processed is based on the action characteristics of each of the multiple characters in each frame of the image to be processed.
  • the action characteristics of each of the multiple characters in the image to be processed can be classified in each frame of the image to obtain the actions of each person, and determine the group actions of multiple characters accordingly.
  • The action feature of each of the multiple persons in each frame of the image to be processed can be input into the classification module to obtain a classification result of each person's action feature, that is, the action of each person; then, the action performed by the largest number of persons is taken as the group action of the multiple persons.
  • a certain person can be selected from a plurality of people, and the action characteristics of the person in each frame of the image can be input into the classification module to obtain the classification result of the action characteristics of the person, that is, the action of the person, and then the above The obtained action of the character is used as a group action of multiple characters in the image to be processed.
  • the identification of multiple characters in the image to be processed is based on the action characteristics of each of the multiple characters in each frame of the image to be processed.
  • the action features of multiple people in each frame of image can also be merged to obtain the action feature of the frame of image, and then the action feature of each frame of image is classified to obtain the action of each frame of image, and based on this Determine the group actions of multiple characters in the image to be processed.
  • The action features of the multiple persons in each frame of image can be fused to obtain the action feature of that frame of image, and the action feature of each frame of image can then be input into the classification module to obtain the action classification result of each frame of image; the classification result corresponding to the largest number of frames in the image to be processed among the output categories of the classification module is taken as the group action of the multiple persons in the image to be processed.
  • the action features of multiple people in each frame of image can be fused to obtain the action feature of the frame of image, and then the action feature of each frame of image obtained above can be averaged to obtain the average of each frame of image Action feature, and then input the average action feature of each frame of image to the classification module, and then the classification result corresponding to the average action feature of each frame of image is regarded as the group action of multiple people in the image to be processed.
  • a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
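  • The three frame-level strategies described above (majority vote over per-frame classifications, classifying the averaged action feature, and classifying one selected frame) could be sketched as follows, where `classifier` is a stand-in for the classification module and is assumed to return a hashable class label.

```python
import numpy as np
from collections import Counter

def group_action_by_vote(frame_action_feats, classifier):
    # Classify every frame, then take the category with the most frames.
    results = [classifier(f) for f in frame_action_feats]
    return Counter(results).most_common(1)[0][0]

def group_action_by_average(frame_action_feats, classifier):
    # Average the per-frame action features, then classify once.
    return classifier(np.mean(np.stack(frame_action_feats), axis=0))

def group_action_by_single_frame(frame_action_feats, classifier, index=0):
    # Classify one selected frame and use its result for the whole clip.
    return classifier(frame_action_feats[index])
```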
  • Tag information of the image to be processed is generated according to the group action, and the tag information is used to indicate the group action of the multiple persons in the image to be processed.
  • the foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
  • the key person in the image to be processed is determined according to the group actions.
  • the above method can be used to detect key persons in a video image, for example.
  • The video contains several persons, most of whom are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
  • For example, the player who holds the ball has the greatest impact on everyone present, including players, referees, and spectators, and also contributes the most to the group action. Therefore, the player who holds the ball can be identified as the key person. Identifying the key person helps people watching the video understand what is going on and what is about to happen in the game.
  • In a second aspect, an image recognition method is provided, which includes: extracting image features of an image to be processed; determining the spatial features of multiple persons in each frame of the image to be processed; determining the action features of the multiple persons in each frame of the image to be processed; and recognizing the group action of the multiple persons in the image to be processed based on the action features of the multiple persons in each frame of the image to be processed.
  • the action features of the multiple people in the image to be processed are obtained by fusing the spatial features of the multiple people in the image to be processed and the image features in the image to be processed.
  • the above-mentioned image to be processed may be one frame of image, or may be multiple frames of continuous or non-continuous images.
  • the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image received by the image recognition device from another device, Alternatively, the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
  • the above-mentioned image to be processed may be one frame of image or continuous multiple frames of image in a piece of video, or one or multiple frames of image selected according to preset rules in a piece of video according to a preset.
  • the multiple characters may include only humans, or only animals, or both humans and animals.
  • the person in the image can be identified to determine the bounding box of the person.
  • the image in each bounding box corresponds to a person in the image.
  • Feature extraction is performed on the image in each bounding box to obtain the image feature of each person.
  • the bone node of the person in the bounding box corresponding to each person can be identified first, and then the image features of the person can be extracted according to the bone node of each person, so that the extracted image features more accurately reflect the person The action to improve the accuracy of the extracted image features.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
  • a locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located can be masked to obtain the partially visible image.
  • the spatial correlation between the actions of different characters in the same frame of image is determined by the similarity between the image characteristics of different characters in the same frame of image.
  • The spatial feature of the j-th of the multiple persons in the i-th frame of the above-mentioned multi-frame image to be processed is determined based on the similarity between the image feature of the j-th person in the i-th frame and the image features of the other persons. That is to say, the spatial feature of the j-th person in the i-th frame can be determined according to the similarity between the image feature of the j-th person in the i-th frame and the image features of the other persons.
  • The spatial feature of the j-th person in the i-th frame is used to represent the relationship between the action of the j-th person in the i-th frame and the actions of the persons other than the j-th person in the i-th frame.
  • the similarity between the image feature vector of the j-th person in the i-th frame image and the image feature vector of other people in the i-th frame image except for the j-th person can reflect the similarity of the j-th person in the i-th frame image The degree of dependence on the actions of other characters. That is to say, the higher the similarity of the image feature vectors corresponding to the two characters, the closer the association between the two actions; conversely, the lower the similarity, the weaker the association between the actions of the two characters.
  • the similarity between the aforementioned spatial features can be calculated by Minkowski distance (such as Euclidean distance, Manhattan distance), cosine similarity, Chebyshev distance, Hamming distance, and the like.
  • the spatial feature and image feature corresponding to the person in a frame of image can be fused to obtain the action feature of the person in the frame of image.
  • a combined fusion method can be used for fusion.
  • the feature corresponding to a person in a frame of image is merged to obtain the action feature of the person in the frame of image.
  • the features to be fused may be added directly, or weighted.
  • cascade and channel fusion can be used for fusion.
  • the dimensions of the features to be fused may be directly spliced, or spliced after being multiplied by a certain coefficient, that is, a weight value.
  • a pooling layer may be used to process the above-mentioned multiple features, so as to realize the fusion of the above-mentioned multiple features.
  • When recognizing the group action of the multiple persons in the image to be processed according to the action features of the multiple persons in each frame of the image to be processed, the action feature of each of the multiple persons in each frame of the image can be classified to obtain the action of each person, and the group action of the multiple persons is determined accordingly.
  • The action feature of each of the multiple persons in each frame of the image to be processed can be input into the classification module to obtain a classification result of each person's action feature, that is, the action of each person; then, the action performed by the largest number of persons is taken as the group action of the multiple persons.
  • a certain person can be selected from a plurality of people, and the action characteristics of the person in each frame of the image can be input into the classification module to obtain the classification result of the action characteristics of the person, that is, the action of the person, and then the above The obtained action of the character is used as a group action of multiple characters in the image to be processed.
  • In the second aspect, when recognizing the group action of the multiple persons in the image to be processed according to the action features of the multiple persons in each frame of the image to be processed, the action features of the multiple persons in each frame of image may also be fused to obtain the action feature of that frame of image; the action feature of each frame of image is then classified to obtain the action of each frame of image, and the group action of the multiple persons in the image to be processed is determined accordingly.
  • The action features of the multiple persons in each frame of image can be fused to obtain the action feature of that frame of image, and the action feature of each frame of image can then be input into the classification module to obtain the action classification result of each frame of image; the classification result corresponding to the largest number of frames in the image to be processed among the output categories of the classification module is taken as the group action of the multiple persons in the image to be processed.
  • the action features of multiple people in each frame of image can be fused to obtain the action feature of the frame of image, and then the action feature of each frame of image obtained above can be averaged to obtain the average of each frame of image Action feature, and then input the average action feature of each frame of image to the classification module, and then the classification result corresponding to the average action feature of each frame of image is regarded as the group action of multiple people in the image to be processed.
  • a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
  • Tag information of the image to be processed is generated according to the group action, and the tag information is used to indicate the group action of the multiple persons in the image to be processed.
  • the foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
  • the key person in the image to be processed is determined according to the group actions.
  • the above method can be used to detect key persons in a video image, for example.
  • The video contains several persons, most of whom are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
  • For example, the player who holds the ball has the greatest impact on everyone present, including players, referees, and spectators, and contributes the most to the group action. Therefore, the player who holds the ball can be identified as the key person. Identifying the key person helps people watching the video understand what is going on and what is about to happen in the game.
  • In a third aspect, an image recognition method is provided, which includes: extracting image features of an image to be processed; determining the dependencies between different persons in the image to be processed and the dependencies between the actions of the same person at different times; fusing the image features with the spatio-temporal feature vectors to obtain the action feature of each frame of the image to be processed; and performing classification prediction on the action feature of each frame of image to determine the group action category of the image to be processed.
  • In this way, the complex reasoning process of group action recognition is completed; when determining the group action of multiple persons, not only the temporal features of the multiple persons but also their spatial features are taken into consideration.
  • By fusing the temporal features and spatial features of the multiple persons, the group action of the multiple persons can be determined better and more accurately.
  • Target tracking can be performed on each person to determine the bounding box of each person in each frame of image, where the image in each bounding box corresponds to one person; feature extraction is then performed on the image in each of the above-mentioned bounding boxes to obtain the image feature of each person.
  • the image features can also be extracted by identifying the bone nodes of the person, so as to reduce the influence of the redundant information of the image during the feature extraction process and improve the accuracy of feature extraction.
  • a convolutional network can be used to extract image features based on bone nodes.
  • the bone nodes in the bounding box may be connected according to the structure of the person to obtain a connected image, and then image feature vector extraction is performed on the connected image.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors, and then image feature extraction is performed on the processed image.
  • the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
  • The person's action masking matrix can be calculated according to the person's image and bone nodes.
  • Each point in the masking matrix corresponds to a pixel.
  • the value in the square area with the bone point as the center and side length l is set to 1, and the values in other positions are set to 0.
  • the RGB color mode can be used for masking.
  • The RGB color model assigns an intensity value in the range of 0 to 255 to each of the R, G, and B components of every pixel in the image.
  • the masking matrix is used to mask the original character action pictures to obtain a partially visible image.
  • In this way, the square area of side length l around each skeleton node is retained, and the other areas are masked.
  • Using locally visible images for image feature extraction can reduce the redundant information in the bounding box, allows image features to be extracted based on the structural information of the person, and enhances the representation of the person's actions in the image features.
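  • A minimal sketch of the masking step, assuming the skeleton-node pixel coordinates and the side length l are given: a square of side l around each node keeps its pixels, everything else is set to zero.

```python
import numpy as np

def skeleton_mask(height, width, nodes, l):
    """Build a masking matrix: 1 inside the l x l square centered on each
    skeleton node (given as (row, col) pixel coordinates), 0 elsewhere."""
    mask = np.zeros((height, width), dtype=np.float32)
    half = l // 2
    for y, x in nodes:
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        x0, x1 = max(0, x - half), min(width, x + half + 1)
        mask[y0:y1, x0:x1] = 1.0
    return mask

def partially_visible_image(crop_rgb, nodes, l):
    # crop_rgb: HxWx3 image inside the person's bounding box.
    mask = skeleton_mask(crop_rgb.shape[0], crop_rgb.shape[1], nodes, l)
    return crop_rgb * mask[:, :, None]   # masked regions become 0
```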
  • the cross interaction module is used to determine the temporal correlation of the body posture of the characters in the multi-frame images, and/or Determine the spatial correlation of the body postures of the characters in the multi-frame images.
  • the above-mentioned cross interaction module is used to realize the interaction of features to establish a feature interaction model
  • the feature interaction model is used to represent the association relationship of the body posture of the character in time and/or space.
  • the spatial dependence between the body postures of different characters in the same frame of image can be determined.
  • the spatial dependence is used to indicate the dependence of the body posture of a character on the body posture of other characters in a certain frame of image, that is, the spatial dependence between the actions of the characters.
  • the spatial dependency can be expressed by the spatial feature vector.
  • the time dependence between the body postures of the same person at different times can be determined.
  • the time dependence may also be referred to as timing dependence, which is used to indicate the dependence of the body posture of the character in a certain frame of image on the body posture of the character in other video frames, that is, the inherent temporal dependence of an action.
  • the time dependence can be expressed by the time series feature vector.
  • the spatio-temporal feature vector of the k-th person can be calculated according to the spatial feature vector and the time-series feature vector of the k-th person in the image to be processed.
  • The image feature of the k-th person at time t is fused with the spatio-temporal feature vector to obtain the person feature vector of the k-th person at time t; alternatively, the image feature and the spatio-temporal feature vector are connected residually to obtain the person feature vector.
  • According to the person feature vector of each of the K persons, the set of person feature vectors of the K persons at time t is determined, and maximum pooling is performed on this set of person feature vectors to obtain the action feature vector.
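  • Assuming the per-person image features and spatio-temporal feature vectors at time t are already available as (K, d) tensors, this step might be sketched as a residual addition followed by maximum pooling over the K persons; the shapes are illustrative.

```python
import torch

def frame_action_feature(image_feats_t, st_feats_t):
    """image_feats_t, st_feats_t: tensors of shape (K, d) for the K persons
    at time t. Returns the action feature vector of shape (d,)."""
    # Residual connection between image features and spatio-temporal features.
    person_feats = image_feats_t + st_feats_t           # (K, d)
    # Maximum pooling over the set of person feature vectors.
    action_feat, _ = person_feats.max(dim=0)             # (d,)
    return action_feat
```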
  • the classification result of the group action can be obtained in different ways.
  • the action feature vector at time t is input to the classification module to obtain the classification result of the frame image.
  • the classification result of the image feature vector at any time t by the classification module may be used as the classification result of the group action in the T frame image.
  • the classification result of the group action in the T frame image can also be understood as the classification result of the group action of the person in the T frame image, or the classification result of the T frame image.
  • the action feature vectors of the T frame images are respectively input to the classification module to obtain the classification result of each frame of image.
  • the classification result of the T frame image can belong to one or more categories.
  • The category that corresponds to the largest number of frames among the T frames of images in the output categories of the classification module can be used as the classification result of the group action in the T frames of images.
  • Each element of the average feature vector is the average value of the corresponding element in the feature vector representations of the T frames of images.
  • the average feature vector can be input to the classification module to obtain the classification result of the group action in the T frame image.
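  • One way to realize this averaging and classification, with illustrative sizes and a linear layer standing in for the classification module:

```python
import torch
import torch.nn as nn

T, d, num_classes = 10, 512, 8          # illustrative sizes
frame_action_feats = torch.randn(T, d)  # action feature vector of each frame
classifier = nn.Linear(d, num_classes)  # stand-in classification module

# Each element of the average feature vector is the mean of the corresponding
# element over the T frames.
avg_feat = frame_action_feats.mean(dim=0)
group_action_logits = classifier(avg_feat)
predicted_group_action = group_action_logits.argmax().item()
```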
  • The above method can complete the complex reasoning process of group action recognition: the image features of multiple frames of images are determined; the temporal and spatial features are determined according to the interdependence between different persons in the images and between the actions of the same person at different times; these features are fused with the above-mentioned image features to obtain the action feature of each frame of image; and the group action of the multiple frames of images is then inferred by classifying the action feature of each frame of image.
  • In a fourth aspect, an image recognition device is provided, which has the function of implementing the method in any one of the first to third aspects or any of their possible implementation manners.
  • the image recognition device includes a unit that implements the method in any one of the first aspect to the third aspect.
  • In a fifth aspect, a neural network training device is provided, which has a unit for implementing the method in any one of the first aspect to the third aspect.
  • In a sixth aspect, an image recognition device is provided, which includes: a memory for storing a program; and a processor for executing the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to execute the method in any one of the foregoing first aspect to the third aspect.
  • In a seventh aspect, a neural network training device is provided, which includes: a memory for storing a program; and a processor for executing the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to execute the method in any one of the foregoing first aspect to the third aspect.
  • In an eighth aspect, an electronic device is provided, which includes the image recognition device in the fourth aspect or the sixth aspect.
  • the electronic device in the above eighth aspect may specifically be a mobile terminal (for example, a smart phone), a tablet computer, a notebook computer, an augmented reality/virtual reality device, a vehicle-mounted terminal device, and so on.
  • In a ninth aspect, a computer device is provided, which includes the neural network training device in the fifth aspect or the seventh aspect.
  • the computer device may specifically be a computer, a server, a cloud device, or a device with a certain computing capability that can implement neural network training.
  • In a tenth aspect, the present application provides a computer-readable storage medium.
  • The computer-readable storage medium stores computer instructions.
  • When the computer instructions are run on a computer, the computer is caused to execute the method in any one of the first aspect to the third aspect.
  • In an eleventh aspect, the present application provides a computer program product.
  • The computer program product includes computer program code.
  • When the computer program code is run on a computer, the computer is caused to execute the method in any one of the first aspect to the third aspect.
  • In a twelfth aspect, a chip is provided, which includes a processor and a data interface.
  • The processor reads instructions stored in a memory through the data interface to execute the method in any one of the implementation manners of the first aspect to the third aspect.
  • the chip may further include a memory in which instructions are stored, and the processor is configured to execute instructions stored on the memory.
  • the processor is configured to execute the method in any one of the implementation manners of the first aspect to the third aspect.
  • the above-mentioned chip may specifically be a field programmable gate array FPGA or an application-specific integrated circuit ASIC.
  • FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of an application environment provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for group action recognition provided by an embodiment of the present application
  • FIG. 4 is a schematic flowchart of a method for group action recognition provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a convolutional neural network structure provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a method for training a neural network model provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • FIG. 11 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a process of obtaining a partially visible image provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a method for calculating similarity between image features according to an embodiment of the present application.
  • FIG. 14 is a schematic diagram of the spatial relationship between different character actions provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of the spatial relationship of different character actions provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of the relationship in time of a character's actions according to an embodiment of the present application.
  • FIG. 17 is a schematic diagram of the relationship in time of a character's actions according to an embodiment of the present application.
  • FIG. 18 is a schematic diagram of a system architecture of an image recognition network provided by an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application.
  • FIG. 21 is a schematic structural diagram of a neural network training device provided by an embodiment of the present application.
  • the solution of the present application can be applied to the fields of video analysis, video recognition, abnormal or dangerous behavior detection, etc., which require video analysis of complex scenes of multiple people.
  • the video may be, for example, a sports game video, a daily surveillance video, and the like. Two commonly used application scenarios are briefly introduced below.
  • The trained neural network structure can be used to determine the label corresponding to a short video, that is, to classify short videos to obtain the group action category corresponding to each short video and to tag different short videos with different tags. This is convenient for users to view and find videos, saves manual classification and management time, and improves management efficiency and user experience.
  • the video includes several people, most of whom are not important. Detecting key figures effectively helps to quickly understand the content of the scene. As shown in Figure 2, the group action recognition system provided by the present application can identify key persons in the video, so as to understand the video content more accurately based on the information around the key persons.
  • a neural network can be composed of neural units.
  • A neural unit can refer to an arithmetic unit that takes inputs x_s and an intercept b as inputs.
  • The output of the arithmetic unit can be h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right), where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
  • f() is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
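  • A tiny numerical illustration of such a neural unit with a sigmoid activation; the input size, weights, and bias values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b):
    # Output of the unit: f(sum_s W_s * x_s + b), with f = sigmoid.
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x_s
W = np.array([0.2, 0.4, -0.1])   # weights W_s
b = 0.3                          # bias of the neural unit
print(neural_unit(x, W, b))
```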
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be a region composed of several neural units.
  • Deep neural network also known as multi-layer neural network
  • A DNN can be understood as a neural network with many hidden layers; there is no special metric for "many" here. According to the positions of the different layers, the layers of a DNN can be divided into three categories: the input layer, hidden layers, and the output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. For example, in a fully connected neural network, the layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer.
  • Although a DNN looks complicated, the work of each layer is not complicated.
  • The coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W_{jk}^{L}. It should be noted that the input layer has no W parameter.
  • more hidden layers make the network more capable of portraying complex situations in the real world.
  • a model with more parameters is more complex and has a greater "capacity", which means it can complete more complex learning tasks.
  • Training the deep neural network is also the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
  • Convolutional neural network is a deep neural network with convolutional structure.
  • the convolutional neural network includes a feature extractor composed of a convolutional layer and a sub-sampling layer.
  • the feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolution feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can only be connected to a part of the neighboring neurons.
  • a convolutional layer usually includes several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units in the same feature plane share weights, and the shared weights here are the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistical information of one part of an image is the same as that of other parts, which means that image information learned in one part can also be used in another part; therefore, the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels there are, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
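  • The weight-sharing idea can be seen in a standard convolutional layer, where the same kernel weights are applied at every spatial position; the layer sizes below are arbitrary.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 224, 224)      # one RGB image
y = conv(x)                          # output shape (1, 16, 224, 224)

# The 16 kernels (the shared weights) are reused at every image location,
# so the number of parameters does not depend on the image size.
print(conv.weight.shape)             # torch.Size([16, 3, 3, 3])
```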
  • Recurrent neural network is used to process sequence data.
  • In an ordinary neural network, the layers are fully connected, while the nodes within each layer are unconnected.
  • Although this ordinary neural network has solved many problems, it is still powerless for many others. For example, to predict the next word of a sentence, the previous words are generally needed, because the preceding and following words in a sentence are not independent. The reason why an RNN is called a recurrent neural network is that the current output of a sequence is also related to the previous outputs.
  • The specific form of expression is that the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes within the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
  • RNN can process sequence data of any length.
  • the training of RNN is the same as the training of traditional CNN or DNN.
  • the error back-propagation algorithm is also used, but there is a difference: that is, if the RNN is network expanded, then the parameters, such as W, are shared; while the traditional neural network mentioned above is not the case.
  • the output of each step depends not only on the current step of the network, but also on the state of the previous steps of the network. This learning algorithm is called backpropagation through time (BPTT).
  • BPTT backpropagation through time
  • Taking the loss function as an example: the higher the output value (loss) of the loss function, the greater the difference; the training of the deep neural network then becomes a process of reducing this loss as much as possible.
  • Residual network
  • the residual network includes a convolutional layer and/or a pooling layer.
  • The residual network can be understood as follows: in a deep neural network, in addition to the connections between successive hidden layers (for example, the first hidden layer is connected to the second hidden layer, the second hidden layer is connected to the third hidden layer, and the third hidden layer is connected to the fourth hidden layer; this is a data operation path of the neural network, which can also be called neural network transmission), the residual network has an additional directly connected branch.
  • This directly connected branch connects the 1st hidden layer directly to the 4th hidden layer, that is, it skips the processing of the 2nd and 3rd hidden layers and transmits the data of the 1st hidden layer directly to the 4th hidden layer for calculation.
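  • A minimal PyTorch sketch of such a directly connected (skip) branch; the block sizes are illustrative and this is not the specific residual network used in the application.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Two hidden layers on the ordinary computation path.
        self.layers = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # Directly connected branch: the input skips the hidden layers
        # and is added to their output.
        return torch.relu(self.layers(x) + x)

block = ResidualBlock(64)
out = block(torch.randn(8, 64))
```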
  • The highway network can be understood as follows: in addition to the above-mentioned calculation path and directly connected branch, the deep neural network also includes a weight-acquisition branch. This branch introduces a transform gate to acquire a weight value, and outputs the weight value T for the subsequent operations of the above-mentioned calculation path and the directly connected branch.
  • a transmission gate transform gate
  • the convolutional neural network can use the error back propagation algorithm to modify the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forwarding the input signal until the output will cause error loss, and the parameters in the initial neural network model are updated by backpropagating the error loss information, so that the error loss is converged.
  • the backpropagation algorithm is a backpropagation motion dominated by error loss, and aims to obtain the optimal parameters of the neural network model, such as the weight matrix.
  • the pixel value of the image can be a Red-Green-Blue (RGB) color value, and the pixel value can be a long integer representing the color.
  • the pixel value is 255 ⁇ Red+100 ⁇ Green+76 ⁇ Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness.
  • the pixel values can be grayscale values.
  • Group action recognition (GAR) can also be called group activity recognition, and is used to identify what a group of people is doing in a video. It is an important subject in computer vision. GAR has many potential applications, including video surveillance and sports video analysis. Compared with traditional single-person action recognition, GAR not only needs to recognize the behavior of each person, but also needs to infer the potential relationships between persons.
  • Group action recognition can use the following methods:
  • a group action is composed of different actions of several characters in the group, which is equivalent to actions completed by several characters in cooperation, and these character actions reflect different postures of the body.
  • the traditional method uses a step-by-step method to process the complex information of such an entity, and cannot make full use of its potential time and space dependence. Not only that, these methods are also very likely to destroy the co-occurrence relationship between the space domain and the time domain.
  • Existing methods often train the CNN network directly under the condition of extracting timing-dependent features. Therefore, the features extracted by the feature extraction network ignore the spatial dependence between people in the image.
  • the bounding box contains more redundant information, which may lower the accuracy of the extracted character's action features.
  • Fig. 3 is a schematic flow chart of a method for group action recognition.
  • For details, please refer to "A Hierarchical Deep Temporal Model for Group Activity Recognition" (Ibrahim M S, Muralidharan S, Deng Z, et al. IEEE Conference on Computer Vision and Pattern Recognition, 2016: 1971-1980).
  • the existing algorithm is used to target several people in multiple video frames, and the size and position of each person in each video frame are determined.
  • the person CNN is used to extract the convolutional features of each person in each video frame, and the convolutional features are input into the person's long short-term memory network (LSTM) to extract the time series features of each person.
  • the convolution feature and time sequence feature corresponding to each person are spliced together as the person's action feature of the person.
  • the character action characteristics of multiple characters in the video are spliced and max pooled to obtain the action characteristics of each video frame.
  • the action characteristics of each video frame are input into the group LSTM to obtain the corresponding characteristics of the video frame.
  • the feature corresponding to the video frame is input into the group action classifier to classify the input video, that is, the category to which the group action in the video belongs is determined.
  • the HDTM model includes a character CNN, a character LSTM, a group LSTM, and a group action classifier.
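  • A schematic PyTorch sketch of an HDTM-style pipeline as summarized above (person CNN, person LSTM, feature concatenation, max pooling over persons, group LSTM, group action classifier); the backbone, layer sizes, and pooling details are assumptions for illustration, not the exact model of the cited paper.

```python
import torch
import torch.nn as nn

class HDTMSketch(nn.Module):
    def __init__(self, feat_dim=128, hidden=128, num_classes=8):
        super().__init__()
        self.person_cnn = nn.Sequential(           # stand-in person CNN
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim),
        )
        self.person_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.group_lstm = nn.LSTM(feat_dim + hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, crops):                      # crops: (T, K, 3, H, W)
        T, K = crops.shape[:2]
        conv = self.person_cnn(crops.flatten(0, 1)).view(T, K, -1)   # (T,K,F)
        # Person LSTM runs over time for each person: reshape to (K, T, F).
        temporal, _ = self.person_lstm(conv.permute(1, 0, 2))         # (K,T,H)
        person_feat = torch.cat([conv, temporal.permute(1, 0, 2)], dim=-1)
        frame_feat, _ = person_feat.max(dim=1)      # max pool over K persons
        group_out, _ = self.group_lstm(frame_feat.unsqueeze(0))       # (1,T,H)
        return self.classifier(group_out[:, -1])    # classify last time step

logits = HDTMSketch()(torch.randn(10, 6, 3, 64, 64))
```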
  • the existing algorithm is used to target several people in multiple video frames, and the size and position of each person in each video frame are determined.
  • Each character corresponds to a character action tag.
  • Each input video corresponds to a group action tag.
  • the first step is to train the character CNN, character LSTM, and character action classifier according to the character action label corresponding to each character, so as to obtain the trained character CNN and the trained character LSTM.
  • the second step of training is to train the parameters of the group LSTM and the group action classifier according to the group action tags, so as to obtain the trained group LSTM and the trained group action classifier.
  • the person CNN and the person LSTM are obtained, and the convolutional features and timing features of each person in the input video are extracted.
  • the second step of training is performed according to the feature representation of each video frame obtained by splicing the convolution features and time sequence features of the extracted multiple people.
  • the obtained neural network model can perform group action recognition on the input video.
  • the determination of the character's action feature representation of each character is carried out by the neural network model trained in the first step.
  • the fusion of the character action feature representations of multiple characters to identify group actions is performed by the neural network model trained in the second step.
  • Fig. 4 is a schematic flowchart of a method for group action recognition.
  • For details, please refer to "Social scene understanding: End-to-end multi-person action localization and collective activity recognition" (Bagautdinov, Timur, et al. IEEE Conference on Computer Vision and Pattern Recognition, 2017: 4315-4324).
  • a step of training is required to obtain a neural network model that can recognize videos that include this specific type of group action. That is to say, the training image is input into the FCN, and the parameters of the FCN and RNN are adjusted according to the character action tag and group action tag of each character in the training image to obtain the trained FCN and RNN.
  • The FCN can generate a multi-scale feature map F_t of the t-th frame of image.
  • Several detection frames B_t and corresponding probabilities p_t are generated through a deep fully convolutional network (DFCN), and B_t and p_t are sent to a Markov random field (MRF) to obtain trusted detection frames b_t; the trusted detection frames b_t are then used to determine the corresponding features f_t from the multi-scale feature map F_t.
  • FCN can also be obtained through pre-training.
  • a group action is composed of different actions of several characters, and these character actions are reflected in the different body postures of each character.
  • the temporal characteristics of a character can reflect the time dependence of a character's actions.
  • the spatial dependence between character actions also provides important clues for group action recognition.
  • the accuracy of the group action recognition scheme that does not consider the spatial dependence between characters is affected to a certain extent.
  • an embodiment of the present application provides an image recognition method.
  • When determining the group actions of multiple people, this application considers not only the temporal characteristics of the multiple people but also their spatial characteristics.
  • By integrating the temporal characteristics and spatial characteristics of the multiple people, the group actions of the multiple people can be determined better and more accurately.
  • Fig. 5 is a schematic diagram of a system architecture according to an embodiment of the present application.
  • the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data collection system 560.
  • the execution device 510 includes a calculation module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514.
  • the calculation module 511 may include the target model/rule 501, and the preprocessing module 513 and the preprocessing module 514 are optional.
  • the data collection device 560 is used to collect training data.
  • The training data may include multiple frames of training images (the multiple frames of training images include multiple people) and corresponding labels, where a label gives the group action category of the people.
  • the data collection device 560 stores the training data in the database 530, and the training device 520 obtains the target model/rule 501 based on the training data maintained in the database 530.
  • The training device 520 recognizes the input multi-frame training images and compares the output predicted category with the label, until the difference between the predicted category output by the training device 520 and the label is less than a certain threshold, thereby completing the training of the target model/rule 501.
  • The above-mentioned target model/rule 501 can be used to implement the image recognition method of the embodiment of the present application, that is, one or more frames of images to be processed are input (after relevant preprocessing) into the target model/rule 501 to obtain the group action category of the people in the one or more frames of the image to be processed.
  • the target model/rule 501 in the embodiment of the present application may specifically be a neural network.
  • the training data maintained in the database 530 may not all come from the collection of the data collection device 560, and may also be received from other devices.
  • the training device 520 does not necessarily perform the training of the target model/rule 501 completely based on the training data maintained by the database 530. It may also obtain training data from the cloud or other places for model training.
  • The above description should not be construed as a limitation on the embodiments of this application.
  • The target model/rule 501 trained by the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG. 5. The execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or an in-vehicle terminal, and may also be a server or a cloud.
  • the execution device 510 is configured with an input/output (input/output, I/O) interface 512 for data interaction with external devices.
  • the user can input data to the I/O interface 512 through the client device 540.
  • the input data in this embodiment of the present application may include: a to-be-processed image input by the client device.
  • the client device 540 here may specifically be a terminal device.
  • the preprocessing module 513 and the preprocessing module 514 are used to perform preprocessing according to the input data (such as the image to be processed) received by the I/O interface 512.
  • The preprocessing module 513 and the preprocessing module 514 may be omitted, or there may be only one preprocessing module.
  • the calculation module 511 can be directly used to process the input data.
  • the execution device 510 may call data, codes, etc. in the data storage system 550 for corresponding processing.
  • the data, instructions, etc. obtained by corresponding processing may also be stored in the data storage system 550.
  • the I/O interface 512 presents the processing result, such as the group action category calculated by the target model/rule 501, to the client device 540, so as to provide it to the user.
  • The group action category obtained by the target model/rule 501 in the calculation module 511 can be processed by the preprocessing module 513 (or the preprocessing module 514), and the processing result is then sent to the I/O interface, which sends it to the client device 540 for display.
  • The calculation module 511 may also transmit the group action category obtained by the processing directly to the I/O interface, and the I/O interface then sends the processing result to the client device 540 for display.
  • the training device 520 can generate corresponding target models/rules 501 based on different training data for different goals or tasks, and the corresponding target models/rules 501 can be used to achieve the above goals or complete The above tasks provide users with the desired results.
  • the user can manually set input data, and the manual setting can be operated through the interface provided by the I/O interface 512.
  • the client device 540 can automatically send input data to the I/O interface 512. If the client device 540 is required to automatically send the input data and the user's authorization is required, the user can set the corresponding authority in the client device 540. The user can view the result output by the execution device 510 on the client device 540, and the specific presentation form may be a specific manner such as display, sound, and action.
  • The client device 540 can also be used as a data collection terminal to collect the input data of the I/O interface 512 and the output result of the I/O interface 512 as new sample data, as shown in the figure, and store them in the database 530.
  • Alternatively, the I/O interface 512 directly stores the input data of the I/O interface 512 and the output result of the I/O interface 512 in the database 530 as new sample data.
  • FIG. 5 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, and modules shown in the figure does not constitute any limitation.
  • In FIG. 5, the data storage system 550 is an external memory relative to the execution device 510; in other cases, the data storage system 550 may also be placed in the execution device 510.
  • the target model/rule 501 obtained by training according to the training device 520 may be the neural network in the embodiment of the present application.
  • The neural network provided in the embodiment of the present application may be a CNN, a deep convolutional neural network (DCNN), or the like.
  • CNN is a very common neural network
  • the structure of CNN will be introduced below in conjunction with Figure 6.
  • a convolutional neural network is a deep neural network with a convolutional structure. It is a deep learning architecture.
  • A deep learning architecture refers to performing multiple levels of learning at different abstraction levels through machine learning algorithms.
  • CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network can respond to the input image.
  • FIG. 6 is a schematic diagram of a convolutional neural network structure provided by an embodiment of the present application.
  • the convolutional neural network 600 may include an input layer 610, a convolutional layer/pooling layer 620 (the pooling layer is optional), and a fully connected layer 630.
  • the pooling layer is optional
  • The relevant content of these layers is introduced in detail below.
  • The convolutional layer/pooling layer 620 may include layers 621-626 as shown in the example.
  • In one example, layer 621 is a convolutional layer, layer 622 is a pooling layer, layer 623 is a convolutional layer, layer 624 is a pooling layer, layer 625 is a convolutional layer, and layer 626 is a pooling layer.
  • In another example, layers 621 and 622 are convolutional layers, layer 623 is a pooling layer, layers 624 and 625 are convolutional layers, and layer 626 is a pooling layer.
  • That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 621 can include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator can essentially be a weight matrix, which is usually predefined. In the process of performing convolution on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to process the image, thereby extracting a specific feature from the image.
  • The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends to the entire depth of the input image.
  • Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension; however, in most cases a single weight matrix is not used, and multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same type, are applied instead.
  • the output of each weight matrix is stacked to form the depth dimension of the convolutional image, where the dimension can be understood as determined by the "multiple" mentioned above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to eliminate unwanted noise in the image.
  • Since the multiple weight matrices have the same size (rows × columns), the convolution feature maps extracted by the multiple weight matrices of the same size also have the same size, and the extracted convolution feature maps of the same size are then combined to form the output of the convolution operation (see the sketch below).
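  • The following is a minimal sketch of the idea described above, using PyTorch only as an illustration (the layer sizes and tensor shapes are arbitrary and are not the specific network of the embodiment): eight 3×3 kernels, each spanning the full input depth, each produce one feature map, and the eight feature maps are stacked to form the depth dimension of the output.

```python
import torch
import torch.nn as nn

# 3-channel input image, 8 weight matrices (kernels) of size 3x3.
# Each kernel spans the full input depth (3) and produces one feature map;
# the 8 feature maps are stacked along the depth dimension of the output.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)

image = torch.randn(1, 3, 224, 224)   # a batch containing one RGB image
features = conv(image)

print(features.shape)                 # torch.Size([1, 8, 224, 224])
```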
  • In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training.
  • Each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 600 can make correct predictions.
  • The initial convolutional layer (such as 621) often extracts more general features, which can also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by the subsequent convolutional layers (for example, 626) become more and more complex, such as features with high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
  • A convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image with a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of the average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the image size.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image of the input pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
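  • The following is a minimal sketch of the average and maximum pooling operators described above, again using PyTorch only as an illustration (the 2×2 window and tensor shapes are arbitrary assumptions of this sketch).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 224, 224)       # feature maps output by a convolutional layer

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)   # each output pixel is the maximum of a 2x2 sub-region
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)   # each output pixel is the average of a 2x2 sub-region

print(max_pool(x).shape)              # torch.Size([1, 8, 112, 112]) -- spatial size halved
print(avg_pool(x).shape)              # torch.Size([1, 8, 112, 112])
```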
  • After processing by the convolutional layer/pooling layer 620, the convolutional neural network 600 is still not able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 620 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 600 needs to use the fully connected layer 630 to generate one output or a group of outputs whose number equals the number of required classes. Therefore, the fully connected layer 630 can include multiple hidden layers (631, 632 to 63n as shown in FIG. 6) and an output layer 640. The parameters included in the multiple hidden layers can be obtained by pre-training based on relevant training data of a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the fully connected layer 630, the final layer of the entire convolutional neural network 600 is the output layer 640.
  • the output layer 640 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error.
  • the convolutional neural network 600 shown in FIG. 6 is only used as an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models.
  • The convolutional neural network 600 shown in FIG. 6 can be used to execute the image recognition method of the embodiment of the present application. As shown in FIG. 6, after the image to be processed is processed by the input layer 610, the convolutional layer/pooling layer 620, and the fully connected layer 630, the group action category can be obtained.
  • FIG. 7 is a schematic diagram of a chip hardware structure provided by an embodiment of the application.
  • the chip includes a neural network processor 700.
  • the chip can be set in the execution device 510 as shown in FIG. 5 to complete the calculation work of the calculation module 511.
  • the chip can also be set in the training device 520 as shown in FIG. 5 to complete the training work of the training device 520 and output the target model/rule 501.
  • the algorithms of each layer in the convolutional neural network as shown in FIG. 6 can be implemented in the chip as shown in FIG. 7.
  • The neural network processor (neural-network processing unit, NPU) 700 is mounted on a host central processing unit (CPU) as a coprocessor, and the host CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 703.
  • the controller 704 controls the arithmetic circuit 703 to extract data from the memory (weight memory or input memory) and perform calculations.
  • the arithmetic circuit 703 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 703 is a two-dimensional systolic array. The arithmetic circuit 703 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 703 is a general-purpose matrix processor.
  • the arithmetic circuit 703 fetches the data corresponding to matrix B from the weight memory 702 and buffers it on each PE in the arithmetic circuit 703.
  • The arithmetic circuit 703 takes the data of matrix A from the input memory 701 and performs a matrix operation with matrix B; the partial result or final result of the obtained matrix is stored in the accumulator 708.
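  • The following is a purely conceptual sketch of the matrix operation described above: matrix B acts as the buffered weights, matrix A is streamed in, and partial products are collected in an accumulator. It illustrates only the multiply-accumulate idea, not the actual hardware behavior of the arithmetic circuit 703.

```python
import numpy as np

def matmul_with_accumulator(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Conceptual sketch of C = A @ B built from multiply-accumulate steps,
    with partial results collected in an accumulator."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    accumulator = np.zeros((M, N), dtype=A.dtype)
    for k in range(K):                      # one multiply-accumulate step per inner index
        accumulator += np.outer(A[:, k], B[k, :])
    return accumulator

A = np.random.rand(4, 6).astype(np.float32)
B = np.random.rand(6, 3).astype(np.float32)
assert np.allclose(matmul_with_accumulator(A, B), A @ B, atol=1e-5)
```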
  • the vector calculation unit 707 can perform further processing on the output of the arithmetic circuit 703, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 707 can be used for network calculations in the non-convolutional/non-FC layer of the neural network, such as pooling, batch normalization, local response normalization, etc. .
  • the vector calculation unit 707 can store the processed output vector to the unified buffer 706.
  • the vector calculation unit 707 may apply a nonlinear function to the output of the arithmetic circuit 703, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 707 generates normalized values, combined values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 703, for example for use in a subsequent layer in a neural network.
  • the unified memory 706 is used to store input data and output data.
  • A direct memory access controller (DMAC) 705 is used to transfer the input data in the external memory to the input memory 701 and/or the unified memory 706, to store the weight data in the external memory into the weight memory 702, and to store the data in the unified memory 706 into the external memory.
  • the bus interface unit (BIU) 710 is used to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 709 through the bus.
  • An instruction fetch buffer 709 connected to the controller 704 is used to store instructions used by the controller 704;
  • the controller 704 is used to call the instructions cached in the memory 709 to control the working process of the computing accelerator.
  • the unified memory 706, the input memory 701, the weight memory 702, and the instruction fetch memory 709 are all on-chip memories.
  • the external memory is a memory external to the NPU.
  • The external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • each layer in the convolutional neural network shown in FIG. 6 may be executed by the arithmetic circuit 703 or the vector calculation unit 707.
  • FIG. 8 is a schematic flowchart of a method for training a neural network model provided by an embodiment of the present application.
  • Obtain training data, where the training data includes T1 frames of training images and labeled categories.
  • the T1 frame training image corresponds to a label category.
  • T1 is a positive integer greater than 1.
  • The T1 frames of training images can be consecutive frames in a video, or multiple frames selected from a video according to a preset rule.
  • For example, the T1 frames of training images may be frames selected from a video at a preset time interval, or frames selected from a video at a preset frame-number interval.
  • the training image of the T1 frame may include multiple characters, and the multiple characters may include only humans, animals, or both humans and animals.
  • the above-mentioned label category is used to indicate the category of the group action of the person in the training image of the T1 frame.
  • S802a Extract image features of the training image of frame T1.
  • At least one frame of images is selected from the T1 frame of training images, and image features of multiple people in each frame of the at least one frame of images are extracted.
  • the image feature of a certain person can be used to represent the body posture of the person in the frame of training image, that is, the relative position between different limbs of the person.
  • the above image features can be represented by vectors.
  • S802b Determine the spatial characteristics of multiple people in each frame of training image in at least one frame of training image.
  • The spatial feature of the j-th person in the i-th frame of training image of the at least one frame of training images is determined based on the similarity between the image feature of the j-th person in the i-th frame of training image and the image features of the other people except the j-th person in the i-th frame of image, where i and j are positive integers.
  • the spatial feature of the j-th person in the i-th training image is used to represent the actions of the j-th person in the i-th training image and the actions of other people except the j-th person in the i-th training image The relationship.
  • The similarity between the image features of different people in the same frame of image can reflect the spatial dependence of the actions of the different people. That is, when the similarity of the image features corresponding to two people is higher, the correlation between the actions of the two people is closer; conversely, when the similarity of the image features corresponding to the two people is lower, the association between the actions of the two people is weaker.
  • S802c Determine the timing characteristics of each of the multiple characters in the at least one frame of training images in different frames of images.
  • The time sequence feature of the j-th person in the i-th frame of training image of the at least one frame of training images is determined based on the similarity between the image feature of the j-th person in the i-th frame of training image and the image features of the j-th person in the frames of training image other than the i-th frame, where i and j are positive integers.
  • the time series feature of the j-th person in the i-th frame of training image is used to represent the relationship between the action of the j-th person in the i-th frame of training image and the action of the j-th person in the at least one frame of training image in other frames of the training image.
  • the similarity between corresponding image features of a person in two frames of images can reflect the degree of dependence of the person's actions on time.
  • S802d Determine the action features of multiple characters in each frame of training image in at least one frame of training image.
  • the action feature of the j-th person in the training image of the i-th frame is the spatial feature of the j-th person in the training image of the i-th frame, the time series feature of the j-th person in the training image of the i-th frame, and the The image features of the j-th person in the training image of the i-th frame are fused.
  • the action features of each of the multiple characters in each frame of the training image in the at least one frame of training image may be fused to obtain the feature representation of each frame of the training image in the at least one frame of training image.
  • The average of each bit of the feature representations of the frames of training image in the T1 frames of training images can be calculated to obtain an average feature representation.
  • Each bit of the average feature representation is the average value of the corresponding bit of the feature representation of each frame of training image in the T1 frames of training images.
  • the classification can be performed based on the average feature representation, that is, the group actions of multiple characters in the training image of the T1 frame are recognized to obtain the training category.
  • the training category of each frame of training image in the at least one frame of training image may be determined.
  • the at least one frame of training images may be all or part of the training images in the T1 frame of training images.
  • S803 Determine the loss value of the neural network according to the training category and the label category.
  • The loss value L of the neural network can be expressed as:

    $L = -\sum_{t=1}^{T_1} \sum_{n=1}^{N_Y} \bar{y}_n \log p_{t,n}$

  • where N_Y represents the number of group action categories, that is, the number of categories output by the neural network; the label category is expressed by one-hot encoding and includes N_Y bits, with ȳ_n representing one of them; P_t represents the predicted category of the t-th frame of training image in the T1 frames, comprises N_Y bits, with p_{t,n} representing one of them; the t-th frame of image can also be understood as the image at time t.
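  • The following is a minimal sketch of computing a loss of the form reconstructed above for T1 frames sharing one one-hot label; the function name and the small epsilon for numerical stability are assumptions of this sketch, not part of the embodiment.

```python
import numpy as np

def group_action_loss(pred_probs: np.ndarray, one_hot_label: np.ndarray) -> float:
    """Cross-entropy-style loss summed over the T1 frames.

    pred_probs:    (T1, N_Y) predicted category distribution P_t for each frame
    one_hot_label: (N_Y,)    one-hot label category shared by the T1 frames
    """
    eps = 1e-12                                   # numerical stability
    return float(-(one_hot_label * np.log(pred_probs + eps)).sum())

pred = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1]])               # T1 = 2 frames, N_Y = 3 categories
label = np.array([1.0, 0.0, 0.0])                # one-hot label category
print(group_action_loss(pred, label))
```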
  • the training data generally includes a combination of multiple sets of training images and annotated categories.
  • Each combination of training images and annotated category may include one or more frames of training images, and the one or more frames of training images correspond to one label category.
  • When the difference is within a certain preset range, or when the number of training iterations reaches a preset number, the model parameters of the neural network at this time are determined as the final parameters of the neural network model, thereby completing the training of the neural network.
  • FIG. 9 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • the image to be processed includes a plurality of people, and the image feature of the image to be processed includes the image feature of each of the plurality of people in each of the multiple frames of the image to be processed.
  • an image to be processed can be acquired.
  • the image to be processed can be obtained from the memory, or the image to be processed can also be received.
  • the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image obtained by the image recognition device from other equipment.
  • the received image or the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
  • the above-mentioned image to be processed may be a continuous multi-frame image in a video, or a multi-frame image selected according to a preset rule in a video.
  • For example, in a video, multiple frames of images can be selected according to a preset time interval; or, in a video, multiple frames of images can be selected according to a preset frame-number interval (see the sketch below).
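  • The following is a minimal sketch of the two selection rules just mentioned; the function names, the stand-in frame list, and the frame rate are illustrative assumptions only.

```python
from typing import List, Sequence

def select_frames_by_interval(video_frames: Sequence, frame_interval: int) -> List:
    """Pick one frame every `frame_interval` frames (preset frame-number interval)."""
    return [frame for idx, frame in enumerate(video_frames) if idx % frame_interval == 0]

def select_frames_by_time(video_frames: Sequence, fps: float, time_interval_s: float) -> List:
    """Pick one frame every `time_interval_s` seconds (preset time interval)."""
    step = max(1, round(fps * time_interval_s))
    return select_frames_by_interval(video_frames, step)

frames = list(range(100))                                   # stand-in for decoded video frames
print(len(select_frames_by_interval(frames, 10)))           # 10 frames
print(len(select_frames_by_time(frames, fps=25.0, time_interval_s=1.0)))  # 4 frames
```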
  • the multiple characters may include only humans, or only animals, or both humans and animals.
  • the image feature of a certain person can be used to represent the body posture of the person in the frame of image, that is, the relative position between different limbs of the person.
  • the image feature of a certain person mentioned above can be represented by a vector, which can be called an image feature vector.
  • the above-mentioned image feature extraction can be performed by CNN.
  • the person in the image can be identified to determine the bounding box of the person.
  • the image in each bounding box corresponds to a person.
  • The person's bone nodes in the bounding box corresponding to each person can be identified first, and then the person's image feature vector can be extracted based on the bone nodes, so that the extracted image features can more accurately reflect the actions of the person, improving the accuracy of the extracted image features.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
  • the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
  • the color of the pixel corresponding to the area outside the area where the bone node is located can be set to a certain preset color, such as black.
  • the area where the bone node is located retains the same information as the original image, and the information of the area outside the area where the bone node is located is masked. Therefore, when extracting image features, only the image features of the above-mentioned partially visible image need to be extracted, and there is no need to extract the above-mentioned masked area.
  • the area where the aforementioned bone node is located may be a square, circle or other shape centered on the bone node.
  • the side length (or radius), area, etc. of the region where the bone node is located can be preset values.
  • The above method of extracting the image features of the image to be processed can extract the features according to the locally visible image to obtain the image feature vector of the person corresponding to the bounding box; it can also determine a masking matrix according to the bone nodes and mask the image according to the masking matrix. For details, refer to the descriptions of FIG. 11 and FIG. 12.
  • target tracking can be used to identify different people in the image.
  • Sub-features of the people in the image can be used to distinguish between different people in the image.
  • the sub-features can be colors, edges, motion information, texture information, and so on.
  • S902 Determine the spatial feature of each of the multiple persons in each of the multiple frames of images.
  • the spatial correlation between the actions of different characters in the frame of image is determined.
  • The spatial feature of the j-th person in the i-th frame of the image to be processed can be determined based on the similarity between the image feature of the j-th person in the i-th frame of image and the image features of the other people except the j-th person in the i-th frame of image, where i and j are positive integers.
  • The spatial feature of the j-th person in the i-th frame of image is used to represent the relationship between the action of the j-th person in the i-th frame of image and the actions of the other people except the j-th person in the i-th frame of image.
  • The similarity between the image feature vector of the j-th person in the i-th frame of image and the image feature vectors of the other people except the j-th person can reflect the degree of dependence of the j-th person in the i-th frame of image on the actions of the people other than the j-th person. That is, when the similarity of the image feature vectors corresponding to two people is higher, the correlation between the actions of the two people is closer; conversely, when the similarity of the image feature vectors corresponding to the two people is lower, the association between the actions of the two people is weaker. For the spatial association relationship of the actions of different people in a frame of image, refer to the descriptions of FIG. 14 and FIG. 15.
  • S903. Determine the time sequence characteristics of each of the multiple persons in each frame of the multiple frames of images.
  • the time correlation between the actions of the person at different moments is determined.
  • The time sequence feature of the j-th person in the i-th frame of the image to be processed can be determined based on the similarity between the image feature of the j-th person in the i-th frame of image and the image features of the j-th person in the frames other than the i-th frame, where i and j are positive integers.
  • the time series feature of the j-th person in the i-th frame image is used to indicate the relationship between the action of the j-th person in the i-th frame image and the action in other frame images except the i-th frame image.
  • the similarity between corresponding image features of a person in two frames of images can reflect the degree of dependence of the person's actions on time.
  • The similarity may be measured by, for example, a Minkowski distance (such as the Euclidean distance, the Manhattan distance, or the Chebyshev distance), the cosine similarity, or the Hamming distance.
  • The similarity can also be calculated as the sum of the products of each bit of the two features after a linear transformation, as sketched below.
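  • The following is a minimal sketch of several of the similarity measures listed above, plus the "dot product after a linear change" variant; the embedding matrices W_theta and W_phi, the feature dimension, and the random inputs are illustrative assumptions only.

```python
import numpy as np

def euclidean(a, b):  return float(np.linalg.norm(a - b))
def manhattan(a, b):  return float(np.abs(a - b).sum())
def chebyshev(a, b):  return float(np.abs(a - b).max())
def cosine_sim(a, b): return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def embedded_dot(a, b, W_theta, W_phi):
    """Sum of the products of each bit of the two features after a linear change,
    i.e. r(a, b) = theta(a) . phi(b) with theta and phi as linear embeddings."""
    return float((W_theta @ a) @ (W_phi @ b))

rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)
W_theta, W_phi = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
print(euclidean(a, b), cosine_sim(a, b), embedded_dot(a, b, W_theta, W_phi))
```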
  • the spatial correlation between actions of different characters and the temporal correlation between actions of the same character can provide important clues to the categories of multi-person scenes in the image. Therefore, in the image recognition process of the present application, by comprehensively considering the spatial association relationship between different person actions and the time association relationship between the same person actions, the accuracy of recognition can be effectively improved.
  • S904 Determine the action feature of each of the multiple persons in each frame of the multiple frames of images.
  • The time sequence feature, spatial feature, and image feature corresponding to a person in a frame of image can be fused to obtain the action feature of the person in the frame of image.
  • the spatial characteristics of the j-th person in the i-th frame of the image to be processed, the temporal characteristics of the j-th person in the i-th frame of the image, and the image characteristics of the j-th person in the i-th frame of the image can be fused to obtain The action feature of the j-th person in the i-th frame image.
  • the first way is to use a combination (combine) method for integration.
  • the features to be fused can be added directly or weighted.
  • weighted addition is to add the features to be fused by a certain coefficient, that is, the weight value.
  • The features can also be linearly combined channel-wise. For example, multiple features output by multiple layers of the feature extraction network can be added together, either directly or according to certain weights. If T1 and T2 respectively represent the features output by two layers of the feature extraction network, the weighted addition can be expressed as a·T1 + b·T2, where a and b are the weight values (coefficients), a ≥ 0 and b ≥ 0.
  • The second way is to use concatenation (cascade) and channel fusion. The dimensions of the features to be fused can be spliced together directly, or spliced after being multiplied by certain coefficients, that is, weight values.
  • The third way is to use a pooling layer to process the above features, so as to fuse them. For example, max pooling can be performed on multiple feature vectors to determine a target feature vector; in the target feature vector obtained by max pooling, each bit is the maximum value of the corresponding bit in the multiple feature vectors. It is also possible to perform average pooling on multiple feature vectors; in the target feature vector obtained by average pooling, each bit is the average value of the corresponding bit in the multiple feature vectors.
  • For example, the features corresponding to a person in a frame of image can be merged in the combination manner to obtain the action feature of the person in the frame of image; the three fusion approaches are sketched below.
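  • The following is a minimal sketch of the three fusion approaches described above (weighted addition, concatenation, and pooling); the feature dimension, the weights, and the random feature vectors are illustrative assumptions only.

```python
import numpy as np

def fuse_combine(feats, weights=None):
    """Weighted addition (combine): sum the features, optionally scaled by weight values."""
    weights = weights or [1.0] * len(feats)
    return sum(w * f for w, f in zip(weights, feats))

def fuse_concat(feats):
    """Cascade / channel fusion: splice the feature dimensions together."""
    return np.concatenate(feats, axis=-1)

def fuse_max_pool(feats):
    """Max pooling: each bit is the maximum of the corresponding bits of the features."""
    return np.max(np.stack(feats, axis=0), axis=0)

image_feat    = np.random.rand(64)    # hypothetical image feature of one person in one frame
spatial_feat  = np.random.rand(64)    # hypothetical spatial feature
temporal_feat = np.random.rand(64)    # hypothetical time sequence feature

action_feat = fuse_combine([image_feat, spatial_feat, temporal_feat], weights=[1.0, 0.5, 0.5])
print(action_feat.shape,
      fuse_concat([image_feat, spatial_feat]).shape,
      fuse_max_pool([image_feat, spatial_feat]).shape)
```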
  • the feature vector group corresponding to at least one person in the i-th frame image may further include a time-series feature vector corresponding to at least one person in the i-th frame image.
  • S905 Recognizing group actions of multiple people in the image to be processed according to the action feature of each person in the multiple images in each frame of the image.
  • group actions are composed of actions of several characters in the group, that is, actions completed by multiple characters.
  • the group actions of multiple characters in the image to be processed may be a certain sport or activity.
  • For example, the group action of the multiple people in the image to be processed may be playing basketball, volleyball, or football, or dancing, and so on.
  • the motion characteristics of each frame of the image may be determined according to the motion characteristics of each of the multiple characters in each frame of the image to be processed. Then, the group actions of multiple people in the image to be processed can be identified according to the action characteristics of each frame of image.
  • the action characteristics of multiple characters in a frame of image can be merged by means of maximum pooling, so as to obtain the action characteristics of the frame of image.
  • The action features of the multiple people in each frame of image can be fused to obtain the action feature of that frame of image, and the action feature of each frame of image can then be input into the classification module to obtain the classification result of each frame of image; the classification result that corresponds to the largest number of images in the image to be processed among the output categories of the classification module is taken as the group action of the multiple people in the image to be processed.
  • Alternatively, the action features of the multiple people in each frame of image can be fused to obtain the action feature of that frame of image, the action features of the frames of image obtained above can then be averaged to obtain an average action feature, the average action feature is input into the classification module, and the classification result corresponding to the average action feature is regarded as the group action of the multiple people in the image to be processed.
  • a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
  • Alternatively, the action feature of each of the multiple people in each frame of the image to be processed can be input into the classification module to obtain a classification result of the action feature of each person, that is, the action of each person; then, the action performed by the largest number of people is regarded as the group action of the multiple people.
  • a certain person can be selected from a plurality of people, and the action characteristics of the person in each frame of the image can be input into the classification module to obtain the classification result of the action characteristics of the person, that is, the action of the person, and then the above The obtained action of the character is used as a group action of multiple characters in the image to be processed.
  • Steps S901 to S904 can be implemented by the neural network model trained in FIG. 8.
  • Alternatively, the time sequence features may be determined first, and then the spatial features may be determined.
  • The method shown in FIG. 9 considers not only the temporal features of the multiple people but also their spatial features when determining the group action of the multiple people; by integrating the temporal and spatial features of the multiple people, the group action of the multiple people can be determined better and more accurately.
  • Label information of the image to be processed can be generated according to the group action, and the label information is used to indicate the group action of the multiple people.
  • the foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
  • the key person in the image to be processed is determined according to the group actions.
  • the contribution of each of the multiple characters in the image to be processed to the group action can be determined first, and then the person with the highest contribution rate is determined as the key person.
  • the above method can be used to detect key persons in a video image, for example.
  • the video contains several persons, most of which are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
  • For example, in a sports game video, the player who holds the ball has the greatest impact on all the people present, including players, referees, and spectators, and also contributes the most to the group action; therefore, the player who holds the ball can be identified as the key person. Identifying the key person helps people watching the video understand what is going on and what is about to happen in the game (one possible contribution-based selection is sketched below).
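  • The following is a purely illustrative sketch of selecting the key person by contribution, under the assumption that a person's contribution is measured by scoring that person's action feature against the recognized group-action class with a hypothetical linear classifier; the embodiment does not fix a specific contribution measure, so this is only one possible choice.

```python
import numpy as np

def key_person_by_contribution(person_action_feats: np.ndarray,
                               classifier_weights: np.ndarray,
                               group_action_idx: int) -> int:
    """Illustrative only: score each person's action feature against the recognized
    group-action class and pick the person with the highest score as the key person.

    person_action_feats: (K, D) action feature of each of the K people
    classifier_weights:  (N_Y, D) weights of a hypothetical linear classifier
    group_action_idx:    index of the recognized group action
    """
    scores = person_action_feats @ classifier_weights[group_action_idx]
    return int(np.argmax(scores))        # index of the person with the highest contribution

feats = np.random.rand(6, 64)            # 6 people, 64-dimensional action features
W = np.random.rand(10, 64)               # hypothetical classifier for 10 group-action classes
print(key_person_by_contribution(feats, W, group_action_idx=3))
```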
  • FIG. 10 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • the image to be processed includes at least one frame of image, and the image features of the image to be processed include image features of multiple people in the image to be processed.
  • an image to be processed can be acquired.
  • the image to be processed can be obtained from the memory, or the image to be processed can also be received.
  • the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image obtained by the image recognition device from other equipment.
  • the received image or the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
  • the above-mentioned image to be processed may be one frame of image or multiple frames of image.
  • When the above-mentioned image to be processed includes multiple frames, they may be consecutive frames in a video, or frames selected from a video according to a preset rule. For example, in a video, multiple frames of images can be selected according to a preset time interval, or according to a preset frame-number interval.
  • the above-mentioned image to be processed may include a plurality of persons, and the plurality of persons may include only humans, animals, or both humans and animals.
  • step S901 in FIG. 9 may be used to extract the image features of the image to be processed.
  • S1002 Determine the spatial characteristics of multiple people in each frame of the image to be processed.
  • The spatial feature of a certain person among the multiple people in each frame of the image to be processed is determined based on the similarity between the image feature of that person in the frame of image to be processed and the image features of the other people except that person in the frame of image to be processed.
  • step S902 in FIG. 9 may be used to determine the spatial characteristics of multiple persons in each frame of the image to be processed.
  • S1003 Determine the action characteristics of multiple people in each frame of the image to be processed.
  • The action feature of a person among the multiple people in each frame of the image to be processed is obtained by fusing the spatial feature of the person in the frame of image to be processed and the image feature of the person in the frame of image to be processed.
  • The fusion method shown in step S904 in FIG. 9 may be used to determine the action features of the multiple people in each frame of the image to be processed.
  • S1004 Identify group actions of multiple people in the image to be processed according to the action characteristics of multiple people in each frame of the image to be processed.
  • step S905 in FIG. 9 may be used to identify group actions of multiple characters in the image to be processed.
  • FIG. 11 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • the image to be processed includes multiple frames of images, and the image features of the image to be processed include image features of multiple people in each frame of at least one frame of image selected from the multiple frames of images.
  • feature extraction can be performed on images corresponding to multiple people in the input multiple frames of images.
  • the image feature of a certain person can be used to represent the body posture of the person in the frame of image, that is, the relative position between different limbs of the person.
  • the image feature of a certain person mentioned above can be represented by a vector, which can be called an image feature vector.
  • the above-mentioned image feature extraction can be performed by CNN.
  • target tracking can be performed on each person, and the bounding box of each person in each frame of the image is determined, and the image in each bounding box corresponds to a person, and then The feature extraction is performed on the image of each of the above-mentioned bounding boxes to obtain the image feature of each person.
  • the bone nodes in the bounding box may be connected according to the structure of the person to obtain a connected image, and then image feature vector extraction is performed on the connected image.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors, and then image feature extraction is performed on the processed image.
  • the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the character is located in the bounding box may be masked to obtain the partially visible image.
  • the color of the pixel corresponding to the area outside the area where the bone node is located can be set to a certain preset color, such as black.
  • the area where the bone node is located retains the same information as the original image, and the information of the area outside the area where the bone node is located is masked. Therefore, when extracting image features, only the image features of the above-mentioned partially visible image need to be extracted, and there is no need to extract the above-mentioned masked area.
  • the area where the aforementioned bone node is located may be a square, circle or other shape centered on the bone node.
  • the side length (or radius), area, etc. of the region where the bone node is located can be preset values.
  • the above method of extracting the image features of the image to be processed can extract the features according to the locally visible image to obtain the image feature vector of the person corresponding to the bounding box; it can also determine the masking matrix according to the bone node, and mask the image according to the masking matrix .
  • the following is a specific example of the above method of determining the masking matrix based on the bone node.
  • In the masking matrix M, the value inside each square area centered on a bone node with side length l is set to 1, and the values at other positions are set to 0. The calculation formula of the masking matrix M can be expressed as:

    $M(p) = \begin{cases} 1, & \text{if pixel } p \text{ lies within a square of side length } l \text{ centered on a bone node} \\ 0, & \text{otherwise} \end{cases}$

  • The RGB model assigns each pixel an intensity value in the range of 0 to 255 for each RGB component. If the RGB color mode is used, the masking matrix is applied to each color channel of the image.
  • The original person image I is masked with the matrix M to obtain the partially visible image Ĩ:

    $\tilde{I} = M \odot I$

  • where each bit of M corresponds to a pixel, the RGB components of each pixel of the masked image take values between 0 and 1, and the operator "⊙" means multiplying each bit of M by the corresponding bit of I.
  • FIG. 12 is a schematic diagram of a process of obtaining a partially visible image provided by an embodiment of the present application. As shown in FIG. 12, the original image is masked: the square area with side length l around each bone node is kept, and the other areas are masked (a sketch of this masking is given below).
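  • The following is a minimal sketch of building the masking matrix from bone nodes and computing the partially visible image by element-wise multiplication, following the formulas reconstructed above; the function names, the (row, column) coordinate convention, the image size, and the example bone nodes are illustrative assumptions only.

```python
import numpy as np

def build_mask(height: int, width: int, bone_nodes, l: int) -> np.ndarray:
    """Masking matrix M: 1 inside the square of side length l centered on each bone node, 0 elsewhere."""
    M = np.zeros((height, width), dtype=np.float32)
    half = l // 2
    for (y, x) in bone_nodes:
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        x0, x1 = max(0, x - half), min(width, x + half + 1)
        M[y0:y1, x0:x1] = 1.0
    return M

def partially_visible(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise ("bit by bit") product of the mask and the RGB image normalized to 0-1."""
    return (image.astype(np.float32) / 255.0) * mask[..., None]

image = np.random.randint(0, 256, size=(128, 64, 3), dtype=np.uint8)  # person crop inside a bounding box
bone_nodes = [(20, 32), (60, 30), (100, 34)]                          # illustrative (row, col) bone nodes
mask = build_mask(128, 64, bone_nodes, l=15)
print(partially_visible(image, mask).shape)   # (128, 64, 3); regions away from bone nodes are zeroed
```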
  • the T frame images all include images of K people.
  • The extracted image feature of each person in each frame can be represented by a D-dimensional vector.
  • the image feature extraction of the above-mentioned T frame image can be performed by CNN.
  • The set of image features of the K people in the T frames of images can be expressed as X. For each person, using the partially visible image to extract image features can reduce redundant information in the bounding box, extract image features based on body structure information, and enhance the ability of the image features to express human actions.
  • S1102 determine the dependence relationship between actions of different characters in the image to be processed, and the dependence relationship between actions of the same character at different moments.
  • a cross interaction module (CIM) is used to determine the spatial correlation of the actions of different characters in the image to be processed, and the temporal correlation of the actions of the same character at different times.
  • the cross interaction module is used to implement feature interaction and establish a feature interaction model.
  • the feature interaction model is used to represent the relationship of the character's body posture in time and/or space.
  • Spatial dependence is used to express the dependence of a character's body posture in a certain frame of image on the body posture of other characters in this frame of image, that is, the spatial dependence of character actions.
  • the above-mentioned spatial dependence can be expressed by a spatial feature vector.
  • Assuming that one frame of the image to be processed corresponds to the image at time t, then at time t the spatial feature vector s_t^k of the k-th person can be expressed as:

    $s_t^k = \frac{1}{K} \sum_{k'=1}^{K} r(x_t^k, x_t^{k'}) \, g(x_t^{k'}), \quad r(a, b) = \theta(a) \cdot \phi(b)$

  • where K represents the number of people in the frame of image corresponding to time t, x_t^k represents the image feature of the k-th person at time t, θ(), φ(), and g() respectively represent three linear embedding functions, which can be the same or different, and r(a, b) reflects the dependence of feature b on feature a.
  • the spatial dependence between the body postures of different characters in the same frame of image can be determined.
  • Time dependence can also be called timing dependence, which is used to indicate the dependence of the character's body posture in a certain frame of image on the character's body posture in other frame images, that is, the inherent temporal dependence of a character's actions.
  • the above-mentioned time dependence can be expressed by a time series feature vector.
  • Assuming that one frame of the image to be processed corresponds to the image at time t, then at time t the time series feature vector q_t^k of the k-th person can be expressed as:

    $q_t^k = \frac{1}{T} \sum_{t'=1}^{T} r(x_t^k, x_{t'}^k) \, g(x_{t'}^k)$

  • where T indicates that the image to be processed includes images at T moments, that is, T frames of images; x_t^k represents the image feature of the k-th person at time t, and x_{t'}^k represents the image feature of the k-th person at time t'.
  • the time dependence between the body postures of the same person at different times can be determined.
  • The spatio-temporal feature vector h_t^k can be expressed as the result of an "add" operation on the time series feature vector q_t^k and the spatial feature vector s_t^k:

    $h_t^k = s_t^k + q_t^k$

    (a sketch of this computation is given below)
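  • The following is a minimal numpy sketch of the reconstructed formulas above. The linear embeddings W_theta, W_phi, W_g, the inclusion of the person itself (and the current frame) in the sums, and the 1/K and 1/T normalizations are assumptions of this sketch and may differ from the exact form of the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, D = 5, 4, 16                       # T frames, K people, D-dimensional image features
X = rng.normal(size=(T, K, D))           # image features x_t^k

W_theta = rng.normal(size=(D, D))        # three linear embedding functions theta, phi, g
W_phi   = rng.normal(size=(D, D))
W_g     = rng.normal(size=(D, D))

def r(a, b):
    """Similarity r(a, b): sum of products of the two features after linear embeddings."""
    return (W_theta @ a) @ (W_phi @ b)

def spatial_feature(t, k):
    """s_t^k: average over the people in frame t, weighted by similarity to person k."""
    return sum(r(X[t, k], X[t, kk]) * (W_g @ X[t, kk]) for kk in range(K)) / K

def temporal_feature(t, k):
    """q_t^k: average over the T frames, weighted by similarity of person k across time."""
    return sum(r(X[t, k], X[tt, k]) * (W_g @ X[tt, k]) for tt in range(T)) / T

# Spatio-temporal feature h_t^k as the "add" of the spatial and time series vectors.
H = np.stack([[spatial_feature(t, k) + temporal_feature(t, k) for k in range(K)] for t in range(T)])
print(H.shape)                           # (5, 4, 16)
```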
  • FIG. 13 is a schematic diagram of a method for calculating similarity between image features according to an embodiment of the present application.
  • Specifically, the vector representation of the similarity between the image feature of the k-th person at time t and the image features of the other people at time t, and the vector representation of the similarity between the image feature of the k-th person at time t and the image features of the k-th person at other times, are averaged (Avg) to determine the spatio-temporal feature vector of the k-th person at time t.
  • The set of spatio-temporal feature vectors of the K people in the T frames of images can be expressed as H.
  • S1103 Fuse the image feature with the spatio-temporal feature vector to obtain the action feature of each frame of image.
  • Specifically, the image features and the spatio-temporal feature vectors of dimension T×K×D are fused to obtain the action feature of each frame of image at the T moments.
  • the motion feature of each frame of image can be represented by a motion feature vector.
  • The set B ∈ R^{T×K×D} of the action feature vectors of the K people can be expressed as the fusion of the image features X and the spatio-temporal features H (for example, B = X + H); the action feature vector z_t of the frame of image at time t can then be obtained by fusing (for example, max pooling) the action feature vectors of the K people in that frame.
  • S1104 Perform classification prediction on the action feature of each frame of image to determine the group action of the image to be processed.
  • the classification module can be a softmax classifier.
  • the classification result of the classification module can be one-hot coded, that is, only one bit is valid in the output result.
  • the category corresponding to the classification result of any image feature vector is the only category among the output categories of the classification module.
  • the action feature vector z t of a frame of image at time t can be input to the classification module to obtain the classification result of the frame of image.
  • the classification result of z t at any time t by the classification module can be used as the classification result of the group action in the T frame image.
  • the classification result of the group action in the T frame image can also be understood as the classification result of the group action of the person in the T frame image, or the classification result of the T frame image.
  • the action feature vectors z 1 , z 2 ,..., z T of the T frame images can be input into the classification module respectively to obtain the classification result of each frame of image.
  • the classification result of the T frame image can belong to one or more categories.
  • the category with the largest number of images in the corresponding T-frame image in the output category of the classification module can be used as the classification result of the group action in the T-frame image.
  • Alternatively, the action feature vectors z_1, z_2, ..., z_T of the T frames of images can be averaged to obtain an average action feature vector, each element of which is the average of the corresponding element of z_1, z_2, ..., z_T.
  • The average action feature vector is then input into the classification module to obtain the classification result of the group action in the T frames of images.
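  • The two classification strategies just described (per-frame voting, and averaging z_1, ..., z_T before a single softmax pass) could look like the following sketch; the linear classifier parameters W and b are placeholders rather than anything specified in this application.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def classify_group_action(Z, W, b, average_first=True):
    """Z: (T, D) matrix of per-frame action feature vectors z_1..z_T.
    W: (C, D) and b: (C,) parameters of a linear softmax classifier.

    average_first=True : average z_1..z_T, then classify the mean vector once.
    average_first=False: classify every frame and take a majority vote."""
    if average_first:
        z_bar = Z.mean(axis=0)
        return int(np.argmax(softmax(W @ z_bar + b)))
    votes = [int(np.argmax(softmax(W @ z_t + b))) for z_t in Z]
    return max(set(votes), key=votes.count)
```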
  • The above method completes the complex reasoning process of group action recognition: the image features of multiple frames of images are extracted; temporal and spatial features are determined according to the interdependence between the actions of different people in an image and between the actions of the same person at different times; the temporal features, spatial features and image features are fused to obtain the action feature of each frame of image; and the group action of the multiple frames of images is then inferred by classifying the action feature of each frame of image.
  • When the spatial features do not depend on the temporal features, only the spatial features of the multiple characters may be considered for recognition, which makes it more convenient to determine the group actions of the multiple characters.
  • Table 1 shows the recognition accuracy obtained when a neural network model trained with the image recognition method provided in the embodiments of the present application is used to recognize public data sets.
  • The multi-class accuracy (MCA) indicates the proportion of correctly classified results among the neural network's classification results on the data containing group actions, that is, the number of correct results divided by the total number of samples containing group actions.
  • The mean per class accuracy (MPCA) represents the average, taken over the categories, of the ratio of the number of correctly classified results of each category to the number of group-action samples belonging to that category.
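  • Assuming the conventional definitions behind these two metrics (overall accuracy, and per-class accuracy averaged over the classes), they could be computed as follows; this is an illustration rather than the evaluation code used for Table 1.

```python
import numpy as np

def mca_mpca(y_true, y_pred):
    """MCA: fraction of group-action samples classified correctly overall.
    MPCA: accuracy computed separately per class, then averaged over classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mca = float((y_true == y_pred).mean())
    per_class = [float((y_pred[y_true == c] == c).mean()) for c in np.unique(y_true)]
    return mca, float(np.mean(per_class))
```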
  • The training of the neural network can be completed without relying on the action labels of individual characters.
  • an end-to-end training method is adopted, that is, the neural network is adjusted only according to the final classification results.
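  • A minimal PyTorch-style sketch of such end-to-end training is given below: the only supervision is the group-action label, and gradients from that single loss adjust every module. The model, data loader and hyperparameters are placeholders, not values from this application.

```python
import torch
import torch.nn as nn

def train_end_to_end(model, loader, epochs=10, lr=1e-4):
    """model: maps a batch of inputs (e.g. per-person features of shape (B, T, K, D),
    or raw clips if the backbone is included) to (B, num_classes) group-action logits.
    loader: yields (inputs, group_labels); no per-person action labels are required."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()              # supervised only by the final group-action label
    for _ in range(epochs):
        for inputs, group_labels in loader:
            logits = model(inputs)
            loss = loss_fn(logits, group_labels)
            opt.zero_grad()
            loss.backward()                      # one loss adjusts all modules end to end
            opt.step()
    return model
```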
  • Feature interaction determines the dependence between characters and the dependence of a character's actions on time.
  • the similarity between the two image features is calculated by the function r(a,b).
  • the spatial feature vector of each person in the frame of image is determined.
  • the spatial feature vector of a person in a frame of image is used to express the person's spatial dependence on other people in the frame of image, that is, the dependence of the person's body posture on the body posture of other people.
  • FIG. 14 is a schematic diagram of the spatial relationship between different character actions provided by an embodiment of the present application.
  • the spatial dependence matrix of FIG. 15 represents the dependence of each person in the group action on the body posture of other people.
  • Each element of the spatial dependence matrix is represented by a square, and the color (brightness) of the square represents the similarity of the image features of the two people, that is, the calculation result of the function r(a, b).
  • The calculation result of the function r(a, b) can be normalized, that is, mapped into the interval from 0 to 1, so that the spatial dependence matrix can be drawn.
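  • One simple way to carry out that normalization, assuming min-max scaling over the whole matrix (the passage only requires a mapping into [0, 1]), is sketched below.

```python
import numpy as np

def dependence_matrix(features, sim):
    """features: (N, D) image features -- the K people of one frame for the spatial
    matrix, or one person's T frames for the temporal matrix.
    sim: the similarity function r(a, b).  Returns an (N, N) matrix scaled to [0, 1]."""
    N = features.shape[0]
    M = np.array([[sim(features[i], features[j]) for j in range(N)] for i in range(N)])
    return (M - M.min()) / (M.max() - M.min() + 1e-8)   # min-max mapping to [0, 1]
```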
  • In FIG. 14, the hitter is the No. 10 player, who has a greater influence on the follow-up actions of her teammates.
  • The function r(a, b) can reflect cases in which the body posture of one person in a frame of image is highly correlated with, that is, highly dependent on, the body postures of other people.
  • the spatial dependency between the body postures of players 1-6 is weak.
  • the neural network provided by the embodiments of the present application can better reflect the dependency or association relationship between the body posture of one person and the body posture of other people in a frame of image.
  • the time sequence feature vector of the person in one frame of image is determined.
  • the time series feature vector of a person in one frame of image is used to represent the dependence of the person's body posture on the body posture of the person in other frames of images.
  • The body postures of the No. 10 player shown in FIG. 14, over 10 frames of images in time order, are shown in FIG. 16.
  • The time dependence matrix of FIG. 17 indicates the time dependence of the body posture of the No. 10 player.
  • Each element of the time dependence matrix is represented by a square, and the color (brightness) of the square represents the similarity of the image features of the No. 10 player at the two corresponding times, that is, the calculation result of the function r(a, b).
  • the body posture of the No. 10 player in the 10-frame image corresponds to the take-off (frames 1-3), floating (frames 4-8) and landing (frames 9-10).
  • "jumping" and "landing" should be more discriminative.
  • the image features of the No. 10 player in the 2nd and 10th frames are relatively similar to the image features in other images.
  • The image features of the 4th to 8th frames, that is, the image features of the No. 10 player in the floating state, have low similarity with the image features in the other images. Therefore, the neural network provided by the embodiments of the present application can better reflect the temporal association of a person's body posture across multiple frames of images.
  • FIG. 18 is a schematic diagram of the system architecture of an image recognition device provided by an embodiment of the present application.
  • the image recognition device shown in FIG. 18 includes a feature extraction module 1801, a cross interaction module 1802, a feature fusion module 1803, and a classification module 1804.
  • the image recognition device in FIG. 18 can execute the image recognition method of the embodiment of the present application. The process of processing the input picture by the image recognition device will be introduced below.
  • The feature extraction module 1801, which may also be referred to as a partial-body extractor module, is used to extract the image features of a person according to the bone nodes of the person in the image.
  • the function of the feature extraction module 1801 can be realized by using a convolutional network.
  • the multi-frame images are input to the feature extraction module 1801.
  • the image feature of a person can be represented by a vector, and the vector representing the image feature of the person can be called the image feature vector of the person.
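  • A rough sketch of what the partial-body extractor 1801 might do with one person's bounding box and skeleton keypoints is shown below: pixels outside small patches around the bone nodes are masked, and the partially visible crop is passed to a feature network. The patch radius, the backbone and the output size are assumptions, not values taken from this application.

```python
import numpy as np

def partially_visible_crop(crop, keypoints, radius=8):
    """crop: (H, W, 3) image inside one person's bounding box.
    keypoints: list of (x, y) bone-node coordinates inside the crop.
    Returns the crop with everything outside small windows around the
    bone nodes masked out (set to zero)."""
    mask = np.zeros(crop.shape[:2], dtype=bool)
    H, W = mask.shape
    for x, y in keypoints:
        x, y = int(round(x)), int(round(y))
        mask[max(0, y - radius):min(H, y + radius),
             max(0, x - radius):min(W, x + radius)] = True
    return crop * mask[:, :, None]

def extract_person_feature(crop, keypoints, backbone):
    """backbone: any callable (e.g. a small CNN) turning an image into a D-dim feature vector."""
    return backbone(partially_visible_crop(crop, keypoints))
```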
  • the cross interaction module 1802 is used to map the image features of multiple characters in each frame of the multi-frame images to the spatio-temporal interaction features of each person.
  • the time-space interaction feature is used to indicate the "time-space" associated information of a certain character.
  • the time-space interaction feature of a person in a frame of image may be obtained by fusing the temporal feature and spatial feature of the person in the frame of image.
  • the cross interaction module 1802 may be implemented by a convolutional layer and/or a fully connected layer.
  • the feature fusion module 1803 is used to fuse the action feature of each person in a frame of image with the time-space interaction feature to obtain the image feature vector of the frame of image.
  • the image feature vector of the frame image can be used as the feature representation of the frame image.
  • the classification module 1804 is configured to classify according to the image feature vector, so as to determine the category of the group action of the person in the T frame image input to the feature extraction module 1801.
  • the classification module 1804 may be a classifier.
  • the image recognition device shown in FIG. 18 can be used to execute the image recognition method shown in FIG. 11.
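  • To tie the four modules together, the following is a compact PyTorch sketch in the spirit of FIG. 18 (1801 feature extraction, 1802 cross interaction, 1803 feature fusion, 1804 classification). The linear extractor, the use of multi-head attention for the interaction step, and all layer sizes are simplifications for illustration, not the claimed implementation.

```python
import torch
import torch.nn as nn

class GroupActionRecognizer(nn.Module):
    def __init__(self, d_in, d_feat=128, num_classes=8, heads=4):
        super().__init__()
        self.extractor = nn.Linear(d_in, d_feat)             # 1801: stands in for the partial-body extractor
        self.cross = nn.MultiheadAttention(d_feat, heads,    # 1802: one way to model spatio-temporal interaction
                                           batch_first=True)
        self.fuse = nn.Linear(2 * d_feat, d_feat)            # 1803: fuse image and interaction features
        self.classifier = nn.Linear(d_feat, num_classes)     # 1804: softmax classifier (returns logits)

    def forward(self, x):                                    # x: (B, T, K, d_in) per-person inputs
        B, T, K, _ = x.shape
        f = self.extractor(x)                                # (B, T, K, d_feat)
        tokens = f.reshape(B, T * K, -1)                     # every (t, k) slot becomes a token
        inter, _ = self.cross(tokens, tokens, tokens)        # spatio-temporal interaction features
        fused = self.fuse(torch.cat([tokens, inter], -1))    # per-person action features
        frame = fused.reshape(B, T, K, -1).mean(dim=2)       # pool people -> per-frame features
        return self.classifier(frame.mean(dim=1))            # pool frames -> group-action logits

# e.g. GroupActionRecognizer(d_in=256)(torch.randn(2, 10, 12, 256)) -> logits of shape (2, 8)
```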
  • FIG. 19 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application.
  • the image recognition device 3000 shown in FIG. 19 includes an acquisition unit 3001 and a processing unit 3002.
  • the acquiring unit 3001 is used to acquire an image to be processed
  • the processing unit 3002 is configured to execute the image recognition methods in the embodiments of the present application.
  • the obtaining unit 3001 may be used to obtain the image to be processed; the processing unit 3002 may be used to perform steps S901 to S904 or steps S1001 to S1004 described above to identify group actions of multiple people in the image to be processed.
  • the obtaining unit 3001 may be used to obtain the image to be processed; the processing unit 3002 may be used to execute the above steps S1101 to S1104 to identify group actions of people in the image to be processed.
  • the above-mentioned processing unit 3002 can be divided into multiple modules according to different processing functions.
  • the processing unit 3002 can be divided into an extraction module 1801, a cross interaction module 1802, a feature fusion module 1803, and a classification module 1804 as shown in FIG. 18.
  • The processing unit 3002 can realize the functions of the various modules shown in FIG. 18, and can further be used to realize the image recognition method shown in FIG. 11.
  • FIG. 20 is a schematic diagram of the hardware structure of an image recognition device according to an embodiment of the present application.
  • the image recognition apparatus 4000 shown in FIG. 20 includes a memory 4001, a processor 4002, a communication interface 4003, and a bus 4004.
  • the memory 4001, the processor 4002, and the communication interface 4003 implement communication connections between each other through the bus 4004.
  • the memory 4001 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 4001 may store a program. When the program stored in the memory 4001 is executed by the processor 4002, the processor 4002 is configured to execute each step of the image recognition method in the embodiment of the present application.
  • The processor 4002 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the image recognition method in the method embodiments of the present application.
  • the processor 4002 may also be an integrated circuit chip with signal processing capability.
  • each step of the image recognition method of the present application can be completed by the integrated logic circuit of hardware in the processor 4002 or instructions in the form of software.
  • The aforementioned processor 4002 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 4001, and the processor 4002 reads the information in the memory 4001, combines its hardware to complete the functions required by the units included in the image recognition device, or executes the image recognition method in the method embodiment of the application.
  • the communication interface 4003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 4000 and other devices or a communication network.
  • the image to be processed can be acquired through the communication interface 4003.
  • the bus 4004 may include a path for transferring information between various components of the device 4000 (for example, the memory 4001, the processor 4002, and the communication interface 4003).
  • FIG. 21 is a schematic diagram of the hardware structure of a neural network training device according to an embodiment of the present application. Similar to the above device 4000, the neural network training device 5000 shown in FIG. 21 includes a memory 5001, a processor 5002, a communication interface 5003, and a bus 5004. Among them, the memory 5001, the processor 5002, and the communication interface 5003 implement communication connections between each other through the bus 5004.
  • the memory 5001 may be ROM, static storage device and RAM.
  • the memory 5001 may store a program. When the program stored in the memory 5001 is executed by the processor 5002, the processor 5002 and the communication interface 5003 are used to execute each step of the neural network training method of the embodiment of the present application.
  • The processor 5002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to realize the functions required by the units in the image processing apparatus of the embodiments of the present application, or to execute the neural network training method of the method embodiments of the present application.
  • the processor 5002 may also be an integrated circuit chip with signal processing capability.
  • each step of the neural network training method of the embodiment of the present application can be completed by the integrated logic circuit of hardware in the processor 5002 or instructions in the form of software.
  • the aforementioned processor 5002 may also be a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • The storage medium is located in the memory 5001, and the processor 5002 reads the information in the memory 5001 and, in combination with its hardware, completes the functions required by the units included in the image processing apparatus of the embodiments of the present application, or executes the neural network training method of the method embodiments of the present application.
  • the communication interface 5003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 5000 and other devices or communication networks.
  • the image to be processed can be acquired through the communication interface 5003.
  • the bus 5004 may include a path for transferring information between various components of the device 5000 (for example, the memory 5001, the processor 5002, and the communication interface 5003).
  • Although the device 4000 and the device 5000 only show a memory, a processor, and a communication interface, those skilled in the art should understand that, in a specific implementation process, the device 4000 and the device 5000 may also include other devices necessary for normal operation. At the same time, according to specific needs, those skilled in the art should understand that the device 4000 and the device 5000 may also include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the device 4000 and the device 5000 may also include only the components necessary to implement the embodiments of the present application, and need not include all the components shown in FIG. 20 and FIG. 21.
  • An embodiment of the present application further provides an image recognition device, including at least one processor and a communication interface, where the communication interface is used for the image recognition device to exchange information with other communication devices; when program instructions are executed in the at least one processor, the image recognition device is caused to execute the above method.
  • An embodiment of the present application also provides a computer program storage medium, which is characterized in that the computer program storage medium has program instructions, and when the program instructions are directly or indirectly executed, the foregoing method can be realized.
  • An embodiment of the present application further provides a chip system, characterized in that the chip system includes at least one processor, and when the program instructions are executed in the at least one processor, the foregoing method can be realized.
  • the disclosed system, device, and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: U disk, mobile hard disk, ROM, RAM, magnetic disk or optical disk and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the field of artificial intelligence, in particular to the field of computer vision, and provided therein are an image recognition method and apparatus, a computer-readable storage medium and a chip. The method comprises: extracting image features of an image to be processed; determining a time-sequence feature and spatial feature of each person among a plurality of people in the image to be processed in each image frame among a plurality of image frames in the image to be processed; determining an action feature thereof according to the time-sequence feature and spatial feature; and recognizing the group action of the plurality of people in the image to be processed according to the action features. In the described method, the temporal association between extracted actions of each person among a plurality of people in an image to be processed as well as the association between same and the actions of other people are determined, thereby better recognizing the group action of the plurality of people in the image to be processed.

Description

图像识别方法、装置、计算机可读存储介质及芯片Image recognition method, device, computer readable storage medium and chip
本申请要求于2019年10月15日提交中国专利局、申请号为201910980310.7、申请名称为“图像识别方法、装置、计算机可读存储介质及芯片”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office, the application number is 201910980310.7, and the application name is "Image recognition method, device, computer readable storage medium and chip" on October 15, 2019. The entire content of the application is approved The reference is incorporated in this application.
技术领域Technical field
本申请涉及人工智能领域,尤其涉及一种图像识别方法、装置、计算机可读存储介质及芯片。This application relates to the field of artificial intelligence, and in particular to an image recognition method, device, computer readable storage medium and chip.
背景技术Background technique
计算机视觉是各个应用领域,如制造业、检验、文档分析、医疗诊断,和军事等领域中各种智能/自主系统中不可分割的一部分,它是一门关于如何运用照相机/摄像机和计算机来获取我们所需的,被拍摄对象的数据与信息的学问。形象地说,就是给计算机安装上眼睛(照相机/摄像机)和大脑(算法)用来代替人眼对目标进行识别、跟踪和测量等,从而使计算机能够感知环境。因为感知可以看作是从感官信号中提取信息,所以计算机视觉也可以看作是研究如何使人工系统从图像或多维数据中“感知”的科学。总的来说,计算机视觉就是用各种成像系统代替视觉器官获取输入信息,再由计算机来代替大脑对这些输入信息完成处理和解释。计算机视觉的最终研究目标就是使计算机能像人那样通过视觉观察和理解世界,具有自主适应环境的能力。Computer vision is an inseparable part of various intelligent/autonomous systems in various application fields, such as manufacturing, inspection, document analysis, medical diagnosis, and military. It is about how to use cameras/video cameras and computers to obtain What we need is the knowledge of the data and information of the subject. To put it vividly, it is to install eyes (camera/camcorder) and brain (algorithm) on the computer to replace the human eye to identify, track and measure the target, so that the computer can perceive the environment. Because perception can be seen as extracting information from sensory signals, computer vision can also be seen as a science that studies how to make artificial systems "perceive" from images or multi-dimensional data. In general, computer vision is to use various imaging systems to replace the visual organs to obtain input information, and then the computer replaces the brain to complete the processing and interpretation of the input information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision like humans, and have the ability to adapt to the environment autonomously.
图像中人的行为的识别和理解是最有价值的信息之一。动作识别是计算机视觉领域的一项重要研究课题。计算机通过动作识别能够理解视频的内容。动作识别技术可以广泛应用于公共场所监控、人机交互等多种领域。特征提取是动作识别过程的关键环节,只有根据准确的特征,才能有效进行动作识别。在进行群体动作识别时,视频中的多个人物中的每个人物的动作在时间上的关系以及多个人物的动作之间的关系,均影响着群体动作识别的准确性。The recognition and understanding of human behavior in images is one of the most valuable information. Action recognition is an important research topic in the field of computer vision. The computer can understand the content of the video through motion recognition. Motion recognition technology can be widely used in public place monitoring, human-computer interaction and other fields. Feature extraction is a key link in the process of action recognition. Only based on accurate features, can action recognition be effectively performed. When performing group action recognition, the temporal relationship of the actions of each of the multiple characters in the video and the relationship between the actions of multiple characters affect the accuracy of group action recognition.
现有方案一般通过长短期记忆网络(long short-term memory,LSTM)提取人物的时序特征,其中时序特征用于表示人物的动作在时间上的关联性。然后,根据每个人物的时序特征可以计算每个人物的交互动作特征,从而根据每个人物的交互动作特征确定每个人物的动作特征,以根据每个人物的动作特征推断出多个人物的群体动作。交互动作特征用于表示人物动作之间的关联性。Existing solutions generally extract the temporal characteristics of characters through a long short-term memory (LSTM) network, where the temporal features are used to represent the temporal relevance of the actions of the characters. Then, according to the time sequence characteristics of each character, the interactive action characteristics of each character can be calculated, so that the action characteristics of each character can be determined according to the interactive action characteristics of each character, and the action characteristics of multiple characters can be inferred according to the action characteristics of each character. Group action. Interactive action features are used to express the correlation between characters' actions.
但是在上述方案中,每个人物的交互动作特征仅仅是基于每个人物的动作在时间上的关联性确定的,在用于群体动作的识别时,准确性有待提高。However, in the above solution, the interactive action characteristics of each character are only determined based on the temporal relevance of each character's actions, and the accuracy needs to be improved when used for group action recognition.
发明内容Summary of the invention
本申请提供一种图像识别方法、装置、计算机可读存储介质及芯片,以更好地识别出 待处理图像中的多个人物的群体动作。The present application provides an image recognition method, device, computer readable storage medium, and chip to better recognize group actions of multiple people in an image to be processed.
第一方面,提供了一种图像识别方法,该方法包括:提取待处理图像的图像特征,待处理图像包括多帧图像;确定多个人物中的每个人物在该多帧图像中的每帧图像中的时序特征;确定多个人物中的每个人物在该多帧图像中的每帧图像中的空间特征;确定多个人物中的每个人物在该多帧图像中的每帧图像中的动作特征;根据多个人物中的每个人物在该多帧图像中的每帧图像中的动作特征,识别待处理图像中的多个人物的群体动作。In a first aspect, an image recognition method is provided. The method includes: extracting image features of an image to be processed, the image to be processed includes multiple frames of images; determining that each of a plurality of persons is in each frame of the multiple frames of image Time sequence characteristics in the image; determine the spatial characteristics of each of the multiple characters in each frame of the multi-frame image; determine each of the multiple characters in each frame of the multi-frame image Based on the action characteristics of each of the multiple characters in each frame of the multi-frame image, the group actions of multiple characters in the image to be processed are recognized.
可选地,上述待处理图像中多个人物的群体动作可以是某种运动或者活动,例如,上述待处理图像中多个人物的群体动作可以是打篮球、打排球、踢足球或跳舞等等。Optionally, the group actions of multiple characters in the image to be processed may be a certain sport or activity. For example, the group actions of multiple characters in the image to be processed may be basketball, volleyball, football or dancing, etc. .
其中,上述待处理图像包括多个人物,上述待处理图像的图像特征包括上述多个人物在待处理图像中的多帧图像中的每帧图像中的图像特征。Wherein, the image to be processed includes multiple people, and the image features of the image to be processed include image features of the multiple people in each of the multiple frames of the image to be processed.
本申请中,在确定多个人物的群体动作时,不仅考虑到了多个人物的时序特征,还考虑到了多个人物的空间特征,通过综合多个人物的时序特征和空间特征能够更好更准确地确定出多个人物的群体动作。In this application, when determining the group actions of multiple characters, not only the temporal characteristics of multiple characters are considered, but also the spatial characteristics of multiple characters are considered. By integrating the temporal characteristics and spatial characteristics of multiple characters, it can be better and more accurate. Determine the group actions of multiple characters.
当上述图像识别方法由图像识别装置执行时,上述待处理图像可以是从该图像识别装置中获取到的图像,或者,上述待处理图像也可以是该图像识别装置从其他设备接收到的图像,或者,上述待处理图像也可以是通过该图像识别装置的摄像头拍摄得到的。When the image recognition method is executed by an image recognition device, the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image received by the image recognition device from another device, Alternatively, the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
上述待处理图像,可以是一段视频中连续的多帧图像,也可以按照预设的在一段视频中按照预设规则选取的多帧图像。The above-mentioned image to be processed may be a continuous multi-frame image in a video, or a multi-frame image selected in accordance with a preset rule in a video according to a preset.
应理解,在上述待处理图像中的多个人物中,该多个人物既可以只包括人,也可以只包括动物,也可以既包括人又包括动物。It should be understood that, among the multiple characters in the above-mentioned image to be processed, the multiple characters may include only humans, or only animals, or both humans and animals.
在上述提取待处理图像的图像特征时,可以对图像中的人物进行识别,从而确定人物的边界框,每个边界框中的图像对应于图像中的一个人物,接下来,可以通过对每个边界框的图像进行特征的提取来获取每个人物的图像特征。When extracting the image features of the image to be processed above, the person in the image can be identified to determine the person's bounding box. The image in each bounding box corresponds to a person in the image. Next, you can Feature extraction is performed on the image of the bounding box to obtain the image feature of each person.
可选地,可以先识别每个人物所对应的边界框中的人物的骨骼节点,然后再根据每个人物的骨骼节点,提取该人物的图像特征向量,从而使提取的图像特征更加准确的反映人物的动作,提高提取的图像特征的准确性。Optionally, the bone node of the person in the bounding box corresponding to each person can be identified first, and then the image feature vector of the person can be extracted according to the bone node of each person, so that the extracted image features can be more accurately reflected The actions of the characters improve the accuracy of the extracted image features.
进一步,还可以根据人物结构将边界框中的骨骼节点进行连接,以得到连接图像,接下来,再对连接图像进行图像特征向量的提取。Further, the bone nodes in the bounding box can be connected according to the structure of the person to obtain a connected image, and then the image feature vector is extracted on the connected image.
或者,还可以将骨骼节点所在的区域和骨骼节点所在的区域之外的区域设置不同的颜色进行显示,得到处理后的图像,然后再对处理后的图像进行图像特征的提取。Alternatively, the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
进一步,可以根据上述人物的骨骼节点所在的图像区域确定对应于该边界框的局部可见图像,然后对该局部可见图像进行特征提取,以得到待处理图像的图像特征。Further, the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
上述局部可见图像是由包括待处理图像中的人物的骨骼节点所在的区域组成的图像。具体地,可以将边界框中人物的骨骼节点所在区域之外的区域进行遮掩,以得到所述局部可见图像。The above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
在确定多个人物中的某个人物的时序特征时,可以通过该人物在不同帧图像中的不同动作之间的图像特征向量之间的相似度来确定该人物的不同时刻动作之间的时间关联关系,进而得到该人物的时序特征。When determining the temporal characteristics of a certain character among multiple characters, the time between actions of the character at different moments can be determined by the similarity between the image feature vectors of different actions of the character in different frames of images Association relationship, and then get the character's time series characteristics.
假设上述待处理图像中的多帧图像具体为T帧,i为小于或等于T的正整数,则第i 帧图像表示T帧图像中相应顺序的图像;假设上述待处理图像中的多个人物具体为K个,则第j个人物表示K个人物中相应顺序的人物,i和j均为正整数。Assuming that the multi-frame images in the image to be processed are specifically T frames, and i is a positive integer less than or equal to T, then the i-th frame of image represents the images in the corresponding order in the T frame image; assuming that there are multiple people in the image to be processed Specifically for K, then the j-th character represents the characters in the corresponding order among the K characters, and both i and j are positive integers.
上述待处理的多帧图像中第i帧图像的第j个人物的时序特征是根据第j个人物在第i帧图像的图像特征与在多帧图像的其他帧图像的图像特征的相似度确定的。The timing characteristics of the j-th person in the i-th frame of the image to be processed above are determined based on the similarity between the image characteristics of the j-th person in the i-th frame and the image characteristics of other frames in the multi-frame image. of.
应理解,上述第i帧图像的第j个人物的时序特征用于表示第j个人物在第i帧图像的动作与在上述多帧图像的动作的关联关系。某个人物在两帧图像中对应的图像特征之间的相似度,可以反映该人物的动作在时间上的依赖程度。It should be understood that the timing characteristics of the j-th person in the i-th frame of image are used to indicate the relationship between the action of the j-th person in the i-th frame of image and the action of the above-mentioned multi-frame image. The similarity between the corresponding image features of a certain person in the two frames of images can reflect the degree of dependence of the person's actions on time.
如果某个人物在两帧图像中对应的图像特征的相似度越高,则该人物在两个时间点上的动作之间的关联越紧密;反之,如果某个人物在两帧图像中对应的图像特征的相似度越低,则该人物在两个时间点上的动作之间的关联越弱。If the similarity of the corresponding image features of a person in the two frames of images is higher, the relationship between the actions of the person at two points in time is closer; on the contrary, if a person corresponds to the two frames of images The lower the similarity of image features, the weaker the association between the person's actions at two points in time.
在确定多个人物的空间特征时,通过同一帧图像中不同人物之间的图像特征之间的相似度,确定该帧图像中不同人物动作之间的空间关联关系。When determining the spatial characteristics of multiple characters, the spatial correlation between the actions of different characters in the frame of image is determined by the similarity between the image characteristics of different characters in the same frame of image.
上述待处理的多帧图像中第i帧图像中多个人物中的第j个人物的空间特征是根据第i帧图像中第j个人物的图像特征与第i帧图像中除第j个人物以外的其他人物的图像特征的相似度确定的。也就是说,可以根据第i帧图像中第j个人物的图像特征与第i帧图像中除第j个人物以外的其他人物的图像特征的相似度,确定上述第i帧图像中第j个人物的空间特征。The spatial feature of the j-th person among the multiple people in the i-th frame of the above-mentioned multi-frame image to be processed is based on the image feature of the j-th person in the i-th frame image and the removal of the j-th person from the i-th frame image The similarity of the image features of other characters is determined. That is to say, the j-th person in the i-th frame image can be determined based on the similarity between the image feature of the j-th person in the i-th frame image and the image features of other people except the j-th person in the i-th frame image. The spatial characteristics of the characters.
应理解,第i帧图像中第j个人物的空间特征用于表示第i帧图像中第j个人物的动作与第i帧图像中第i帧图像中除第j个人物以外的其他人物的动作的关联关系。It should be understood that the spatial characteristics of the j-th person in the i-th frame image are used to represent the actions of the j-th person in the i-th frame image and the behavior of other people other than the j-th person in the i-th frame image in the i-th frame image. The relationship of the action.
具体地,上述第i帧图像中第j个人物的图像特征向量与除第j个人物以外的其他人物的图像特征向量的相似度,可以反映第i帧图像中第j个人物对除第j个人物以外的其他人物的动作的依赖程度。也就是说,当两个人物对应的图像特征向量的相似度越高时,这两个人物的动作之间的关联越紧密;反之,当两个人物对应的图像特征向量的相似度越低时,这两个人物的动作之间的关联越弱。Specifically, the similarity between the image feature vector of the j-th person in the i-th frame image and the image feature vector of other people except the j-th person can reflect the difference between the j-th person pair in the i-th frame image and the j-th person. The degree of dependence on the actions of characters other than personal objects. That is to say, when the similarity of the image feature vectors corresponding to two characters is higher, the correlation between the actions of the two characters is closer; conversely, when the similarity of the image feature vectors corresponding to the two characters is lower , The weaker the association between the actions of these two characters.
可选地,可以通过明氏距离(Minkowski distance)(如欧氏距离、曼哈顿距离)、余弦相似度、切比雪夫距离、汉明距离等计算上述时序特征之间和空间特征之间的相似度。Optionally, the similarity between the above-mentioned temporal features and the spatial features can be calculated by Minkowski distance (such as Euclidean distance, Manhattan distance), cosine similarity, Chebyshev distance, Hamming distance, etc. .
不同人物动作之间的空间关联关系以及相同人物动作之间的时间关联关系都可以为图像中的多人场景的类别提供重要线索。因此,本申请在图像识别过程中,通过综合考虑不同人物动作之间的空间关联关系以及相同人物动作之间的时间关联关系,能够有效提高识别的准确性。The spatial correlation between actions of different characters and the temporal correlation between actions of the same character can provide important clues to the categories of multi-person scenes in the image. Therefore, in the image recognition process of the present application, by comprehensively considering the spatial association relationship between different person actions and the time association relationship between the same person actions, the accuracy of recognition can be effectively improved.
可选地,在确定一帧图像中某个人物的动作特征时,可以将对应于一帧图像中的该人物的时序特征、空间特征、图像特征进行融合,从而得到该帧图像中该人物的动作特征。Optionally, when determining the action characteristics of a person in a frame of image, the time series, spatial, and image characteristics corresponding to the person in a frame of image can be fused to obtain the person’s behavior in the frame of image. Movement characteristics.
在对上述时序特征、空间特征、图像特征进行融合时,可以采用组合的融合方式进行融合。When fusing the above-mentioned temporal features, spatial features, and image features, a combined fusion method can be used for fusion.
例如,将对应于一帧图像中一个人物的特征进行融合,以得到该帧图像中该人物的动作特征。For example, the feature corresponding to a person in a frame of image is merged to obtain the action feature of the person in the frame of image.
进一步,在对上述多个特征进行融合时可以将待融合的特征直接相加,或者加权相加。Further, when fusing the above-mentioned multiple features, the features to be fused may be added directly or weighted.
可选地,在对上述多个特征进行融合时,可以采用级联和通道融合的方式进行融合。具体地,可以将待融合的特征的维数直接拼接,或者乘以一定系数即权重值之后进行拼接。Optionally, when fusing the above-mentioned multiple features, cascade and channel fusion can be used for fusion. Specifically, the dimensions of the features to be fused may be directly spliced, or spliced after being multiplied by a certain coefficient, that is, a weight value.
可选地,可以利用池化层对上述多个特征进行处理,以实现对上述多个特征的融合。Optionally, a pooling layer may be used to process the above-mentioned multiple features, so as to realize the fusion of the above-mentioned multiple features.
结合第一方面,在第一方面的某些实现方式中,在根据多个人物中的每个人物在待处理图像中的每帧图像中的动作特征,识别待处理图像中的多个人物的群体动作时,可以对待处理图像中的多个人物中每个人物在每帧图像中的动作特征进行分类,得到每个人物的动作,并据此确定多个人物的群体动作。In combination with the first aspect, in some implementations of the first aspect, the identification of multiple characters in the image to be processed is based on the action characteristics of each of the multiple characters in each frame of the image to be processed. For group actions, the action characteristics of each of the multiple characters in the image to be processed can be classified in each frame of the image to obtain the actions of each person, and determine the group actions of multiple characters accordingly.
可选地,可以将处理图像中的多个人物中每个人物在每帧图像中的动作特征输入分类模块,以得到对上述多个人物中每个人物动作特征的分类结果,即每个人物的动作,进而将对应的人物数量最多的动作作为多个人物的群体动作。Optionally, the action characteristics of each of the multiple characters in the processed image in each frame of the image can be input into the classification module to obtain a classification result of the action characteristics of each of the multiple characters, that is, each character Then, the action with the largest number of characters is regarded as a group action of multiple characters.
可选地,可以从多个人物中选择某一人物,将该人物在每帧图像中的动作特征输入分类模块,以得到对该人物动作特征的分类结果,即该人物的动作,进而将上述得到的该人物的动作作为待处理图像中的多个人物的群体动作。Optionally, a certain person can be selected from a plurality of people, and the action characteristics of the person in each frame of the image can be input into the classification module to obtain the classification result of the action characteristics of the person, that is, the action of the person, and then the above The obtained action of the character is used as a group action of multiple characters in the image to be processed.
结合第一方面,在第一方面的某些实现方式中,在根据多个人物中的每个人物在待处理图像中的每帧图像中的动作特征,识别待处理图像中的多个人物的群体动作时,还可以将每帧图像中多个人物的动作特征进行融合,以得到该帧图像的动作特征,再对每帧图像的动作特征进行分类,得到每帧图像的动作,并据此确定待处理图像中多个人物的群体动作。In combination with the first aspect, in some implementations of the first aspect, the identification of multiple characters in the image to be processed is based on the action characteristics of each of the multiple characters in each frame of the image to be processed. In group action, the action features of multiple people in each frame of image can also be merged to obtain the action feature of the frame of image, and then the action feature of each frame of image is classified to obtain the action of each frame of image, and based on this Determine the group actions of multiple characters in the image to be processed.
可选地,可以将每帧图像中多个人物的动作特征进行融合,以得到该帧图像的动作特征,再将每帧图像的动作特征分别输入分类模块,以得到每帧图像的动作分类结果,将分类模块的输出类别中对应的上述待处理图像中图像数量最多的一个分类结果作为待处理图像中的多个人物的群体动作。Optionally, the action features of multiple characters in each frame of image can be fused to obtain the action feature of the frame of image, and then the action feature of each frame of image can be input into the classification module to obtain the action classification result of each frame of image Taking a classification result with the largest number of images in the image to be processed corresponding to the output category of the classification module as a group action of multiple people in the image to be processed.
可选地,可以将每帧图像中多个人物的动作特征进行融合,以得到该帧图像的动作特征,再对上述得到的每帧图像的动作特征取平均值,以得到每帧图像的平均动作特征,然后将该每帧图像的平均动作特征输入分类模块,进而将该每帧图像的平均动作特征所对应的分类结果作为待处理图像中的多个人物的群体动作。Optionally, the action features of multiple people in each frame of image can be fused to obtain the action feature of the frame of image, and then the action feature of each frame of image obtained above can be averaged to obtain the average of each frame of image Action feature, and then input the average action feature of each frame of image to the classification module, and then the classification result corresponding to the average action feature of each frame of image is regarded as the group action of multiple people in the image to be processed.
可选地,可以从待处理图像中选择一帧图像,将该帧图像中根据多个人物的动作特征融合得到的该帧图像的动作特征输入分类模块,以得到对该帧图像的分类结果,进而将对该帧图像的分类结果作为待处理图像中的多个人物的群体动作。Optionally, a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
结合第一方面,在第一方面的某些实现方式中,在识别出待处理图像中的多个人物的群体动作后,根据该群体动作生成待处理图像的标签信息,该标签信息用于指示待处理图像中多个人物的群体动作。In combination with the first aspect, in some implementations of the first aspect, after identifying group actions of multiple characters in the image to be processed, tag information of the image to be processed is generated according to the group action, and the tag information is used to indicate Group actions of multiple characters in the image to be processed.
上述方式例如可以用于对视频库进行分类,将该视频库中的不同视频根据其对应的群体动作打上标签,便于用户查看和查找。The foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
结合第一方面,在第一方面的某些实现方式中,在识别出待处理图像中的多个人物的群体动作后,根据该群体动作确定待处理图像中的关键人物。With reference to the first aspect, in some implementations of the first aspect, after identifying group actions of multiple characters in the image to be processed, the key person in the image to be processed is determined according to the group actions.
可选地,先确定待处理图像中多个人物中每个人物对上述群体动作的贡献度,再将贡献度最高的人物确定为关键人物。Optionally, first determine the contribution of each of the multiple characters in the image to be processed to the above group action, and then determine the person with the highest contribution as the key person.
应理解,该关键人物对多个人物的群体动作的贡献度大于多个人物中除关键人物之外的其他人物的贡献度。It should be understood that the contribution of the key person to the group actions of the multiple characters is greater than the contribution of other characters among the multiple characters except the key person.
上述方式例如可以用于检测视频图像中的关键人物,通常情况下,视频中包含若干人 物,其中大部分人并不重要。有效地检测出关键人物有助于根据关键人物周围信息,更加快速和准确地理解视频内容。The above method can be used to detect key persons in a video image, for example. Generally, the video contains several persons, most of which are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
例如假设一段视频是一场球赛,则控球的球员对在场包括球员、裁判和观众等所有人员的影响最大,对群体动作的贡献度也最高,因此可以将该控球的球员确定为关键人物,通过确定关键人物,能够帮助观看视频的人理解比赛正在和即将发生的事情。For example, if a video is a ball game, the player who holds the ball has the greatest impact on all personnel present, including players, referees, and spectators, and also contributes the most to the group action. Therefore, the player who holds the ball can be identified as the key person. , By identifying key people, it can help people watching the video understand what is going on and what is about to happen in the game.
第二方面,提供了一种图像识别方法,该方法包括:提取待处理图像的图像特征;确定多个人物在每帧待处理图像中的空间特征;确定多个人物在每帧待处理图像中的动作特征,,根据上述多个人物在每帧待处理图像中的动作特征识别待处理图像中的多个人物的群体动作。In a second aspect, an image recognition method is provided. The method includes: extracting image features of an image to be processed; determining the spatial characteristics of multiple people in each frame of the image to be processed; determining that multiple people are in each frame of the image to be processed Based on the action characteristics of the multiple characters in each frame of the image to be processed, the group actions of multiple characters in the image to be processed are recognized based on the action features of the multiple characters in each frame of the image to be processed.
其中,上述多个人物在待处理图像中的动作特征,是由上述多个人物在该待处理图像中的空间特征和在该待处理图像中的图像特征融合得到的。Wherein, the action features of the multiple people in the image to be processed are obtained by fusing the spatial features of the multiple people in the image to be processed and the image features in the image to be processed.
上述待处理图像可以是一帧图像,或者,可以是多帧连续或非连续的图像。The above-mentioned image to be processed may be one frame of image, or may be multiple frames of continuous or non-continuous images.
本申请中,在确定多个人物的群体动作时,只考虑多个人物的空间特征,而无需计算每个人物的时序特征,特别适用于人物空间特征的确定不依赖于人物时序特征的情况,能够更便于确定出多个人物的群体动作。又例如,当只对一帧图像进行识别时,不存在同一人物在不同时间的时序特征,该方法也更为适用。In this application, when determining the group actions of multiple characters, only the spatial characteristics of the multiple characters are considered, without calculating the temporal characteristics of each character, which is especially suitable for the situation where the determination of the spatial characteristics of the characters does not depend on the temporal characteristics of the characters. It is easier to determine the group actions of multiple characters. For another example, when only one frame of image is recognized, there is no time sequence characteristic of the same person at different times, and this method is more suitable.
当上述图像识别方法由图像识别装置执行时,上述待处理图像可以是从该图像识别装置中获取到的图像,或者,上述待处理图像也可以是该图像识别装置从其他设备接收到的图像,或者,上述待处理图像也可以是通过该图像识别装置的摄像头拍摄得到的。When the image recognition method is executed by an image recognition device, the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image received by the image recognition device from another device, Alternatively, the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
上述待处理图像,可以是一段视频中一帧图像或连续的多帧图像,也可以按照预设的在一段视频中按照预设规则选取的一帧或多帧图像。The above-mentioned image to be processed may be one frame of image or continuous multiple frames of image in a piece of video, or one or multiple frames of image selected according to preset rules in a piece of video according to a preset.
应理解,在上述待处理图像中的多个人物中,该多个人物既可以只包括人,也可以只包括动物,也可以既包括人又包括动物。It should be understood that, among the multiple characters in the above-mentioned image to be processed, the multiple characters may include only humans, or only animals, or both humans and animals.
在提取待处理图像的图像特征时,可以对图像中的人物进行识别,从而确定人物的边界框,每个边界框中的图像对应于图像中的一个人物,接下来,可以通过对每个边界框的图像进行特征的提取,以获取每个人物的图像特征。When extracting the image features of the image to be processed, the person in the image can be identified to determine the bounding box of the person. The image in each bounding box corresponds to a person in the image. Feature extraction is performed on the image of the frame to obtain the image feature of each person.
可选地,可以先识别每个人物所对应的边界框中的人物的骨骼节点,然后再根据每个人物的骨骼节点,提取该人物的图像特征,从而使提取的图像特征更加准确的反映人物的动作,提高提取的图像特征的准确性。Optionally, the bone node of the person in the bounding box corresponding to each person can be identified first, and then the image features of the person can be extracted according to the bone node of each person, so that the extracted image features more accurately reflect the person The action to improve the accuracy of the extracted image features.
进一步,还可以根据人物结构将边界框中的骨骼节点进行连接,以得到连接图像,接下来再对连接图像进行图像特征向量的提取。Further, it is also possible to connect the bone nodes in the bounding box according to the character structure to obtain a connected image, and then extract the image feature vector of the connected image.
或者,还可以将骨骼节点所在的区域和骨骼节点所在的区域之外的区域通过不同的颜色进行显示,得到处理后的图像,然后再对处理后得到的图像进行图像特征的提取。Alternatively, the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
进一步,可以根据上述人物的骨骼节点所在的图像区域确定对应于该边界框的局部可见图像,然后对所该局部可见图像进行特征提取,以得到所述待处理图像的图像特征。Further, a locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
上述局部可见图像是由待处理图像中的人物的骨骼节点所在的区域组成的图像。具体地,可以将边界框中人物的骨骼节点所在区域之外的区域进行遮掩,以得到该局部可见图像。The above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located can be masked to obtain the partially visible image.
在确定多个人物的空间特征时,通过同一帧图像中不同人物之间的图像特征之间的相 似度,确定该帧图像中不同人物动作之间的空间关联关系。When determining the spatial characteristics of multiple characters, the spatial correlation between the actions of different characters in the same frame of image is determined by the similarity between the image characteristics of different characters in the same frame of image.
上述待处理的多帧图像中第i帧图像中多个人物中的第j个人物的空间特征是根据第i帧图像中第j个人物的图像特征与其他人物的图像特征的相似度确定的。也就是说,可以根据第i帧图像中第j个人物的图像特征与其他人物的图像特征的相似度,确定上述第i帧图像中第j个人物的空间特征。The spatial characteristics of the j-th person among the multiple people in the i-th frame of the above-mentioned multi-frame image to be processed are determined based on the similarity between the image characteristics of the j-th person in the i-th frame and the image characteristics of other people. . That is to say, the spatial characteristics of the j-th person in the i-th frame image can be determined according to the similarity between the image characteristics of the j-th person in the i-th frame image and the image characteristics of other people.
应理解,第i帧图像中第j个人物的空间特征用于表示第i帧图像中第j个人物的动作与该第i帧图像中除第j个人物以外的其他人物的动作的关联关系。It should be understood that the spatial characteristics of the j-th person in the i-th frame image are used to represent the relationship between the actions of the j-th person in the i-th frame and the actions of other people in the i-th frame of the image except for the j-th person. .
具体地,第i帧图像中第j个人物的图像特征向量与第i帧图像中除第j个人物以外的其他人物的图像特征向量的相似度,可以反映第i帧图像中第j个人物对其他人物的动作的依赖程度。也就是说,两个人物对应的图像特征向量的相似度越高,这两个的动作之间的关联越紧密;反之,相似度越低,两个人物的动作之间的关联越弱。Specifically, the similarity between the image feature vector of the j-th person in the i-th frame image and the image feature vector of other people in the i-th frame image except for the j-th person can reflect the similarity of the j-th person in the i-th frame image The degree of dependence on the actions of other characters. That is to say, the higher the similarity of the image feature vectors corresponding to the two characters, the closer the association between the two actions; conversely, the lower the similarity, the weaker the association between the actions of the two characters.
可选地,可以通过明氏距离(Minkowski distance)(如欧氏距离、曼哈顿距离)、余弦相似度、切比雪夫距离、汉明距离等计算上述空间特征之间的相似度。Optionally, the similarity between the aforementioned spatial features can be calculated by Minkowski distance (such as Euclidean distance, Manhattan distance), cosine similarity, Chebyshev distance, Hamming distance, and the like.
可选地,在确定一帧图像中某个人物的动作特征时,可以将对应于一帧图像中该人物的空间特征、图像特征进行融合,从而得到该帧图像中该人物的动作特征。Optionally, when determining the action feature of a person in a frame of image, the spatial feature and image feature corresponding to the person in a frame of image can be fused to obtain the action feature of the person in the frame of image.
在对上述空间特征、图像特征进行融合时,可以采用组合的融合方式进行融合。When fusing the above-mentioned spatial features and image features, a combined fusion method can be used for fusion.
例如,将对应于一帧图像中一个人物的特征进行融合,以得到该帧图像中该人物的动作特征。For example, the feature corresponding to a person in a frame of image is merged to obtain the action feature of the person in the frame of image.
进一步,在对上述多个特征进行融合时,可以将待融合的特征直接相加,或者加权相加。Further, when fusing the above-mentioned multiple features, the features to be fused may be added directly, or weighted.
可选地,在对上述多个特征进行融合时,可以采用级联和通道融合的方式进行融合。具体地,可以将待融合的特征的维数直接拼接,或者乘以一定系数即权重值之后进行拼接。Optionally, when fusing the above-mentioned multiple features, cascade and channel fusion can be used for fusion. Specifically, the dimensions of the features to be fused may be directly spliced, or spliced after being multiplied by a certain coefficient, that is, a weight value.
可选地,可以利用池化层对上述多个特征进行处理,以实现对上述多个特征的融合。Optionally, a pooling layer may be used to process the above-mentioned multiple features, so as to realize the fusion of the above-mentioned multiple features.
With reference to the second aspect, in some implementations of the second aspect, when recognizing the group action of the multiple persons in the image to be processed according to the action features of the multiple persons in each frame of the image to be processed, the action features of each of the multiple persons in each frame of image may be classified to obtain the action of each person, and the group action of the multiple persons may be determined accordingly.Optionally, the action features of each of the multiple persons in each frame of image may be input into a classification module to obtain a classification result for the action features of each of the multiple persons, that is, the action of each person; the action corresponding to the largest number of persons is then taken as the group action of the multiple persons.Optionally, one person may be selected from the multiple persons, and the action features of that person in each frame of image may be input into the classification module to obtain a classification result for that person's action features, that is, the action of that person; the action of that person obtained in this way is then taken as the group action of the multiple persons in the image to be processed.
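A minimal sketch of the two alternatives above, assuming a classification module classify_person that maps a person's per-frame action features to one action label (the classifier itself is not specified here and is an assumption):

    from collections import Counter

    def group_action_by_vote(person_action_features, classify_person):
        # person_action_features: dict {person_id: per-frame action features of this person}
        # classify_person: assumed classifier returning one action label per person
        labels = [classify_person(feats) for feats in person_action_features.values()]
        return Counter(labels).most_common(1)[0][0]   # action shared by the most persons

    def group_action_by_selected_person(person_action_features, classify_person, person_id):
        # alternative: take the predicted action of one chosen person as the group action
        return classify_person(person_action_features[person_id])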
With reference to the second aspect, in some implementations of the second aspect, when recognizing the group action of the multiple persons in the image to be processed according to the action features of the multiple persons in each frame of the image to be processed, the action features of the multiple persons in each frame of image may also be fused to obtain the action feature of that frame of image; the action feature of each frame of image is then classified to obtain the action of each frame of image, and the group action of the multiple persons in the image to be processed is determined accordingly.Optionally, the action features of the multiple persons in each frame of image may be fused to obtain the action feature of that frame of image, and the action feature of each frame of image may then be input into the classification module to obtain the action classification result of each frame of image; among the output categories of the classification module, the classification result corresponding to the largest number of frames of the image to be processed is taken as the group action of the multiple persons in the image to be processed.Optionally, the action features of the multiple persons in each frame of image may be fused to obtain the action feature of that frame of image, the action features of the frames obtained in this way may then be averaged to obtain an average action feature, the average action feature may be input into the classification module, and the classification result corresponding to the average action feature is taken as the group action of the multiple persons in the image to be processed.Optionally, one frame of image may be selected from the image to be processed, and the action feature of that frame, obtained by fusing the action features of the multiple persons in that frame, may be input into the classification module to obtain the classification result of that frame of image; the classification result of that frame of image is then taken as the group action of the multiple persons in the image to be processed.
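The frame-level alternative above may be sketched as follows, assuming max pooling as the per-frame fusion and majority voting over the per-frame classification results; classify_frame is an assumed classification module, not part of this application's definition:

    import numpy as np
    from collections import Counter

    def frame_feature(per_person_feats):
        # fuse the action features of all persons in one frame (max pooling is one option)
        return np.max(np.stack(per_person_feats), axis=0)

    def group_action_from_frames(frames, classify_frame):
        # frames: list with one entry per frame, each a list of per-person action features
        # classify_frame: assumed classification module returning a label for a frame feature
        frame_labels = [classify_frame(frame_feature(p)) for p in frames]
        return Counter(frame_labels).most_common(1)[0][0]   # label covering the most frames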
With reference to the second aspect, in some implementations of the second aspect, after the group action of the multiple persons in the image to be processed is recognized, tag information of the image to be processed is generated according to the group action, where the tag information is used to indicate the group action of the multiple persons in the image to be processed.
上述方式例如可以用于对视频库进行分类,将该视频库中的不同视频根据其对应的群体动作打上标签,便于用户查看和查找。The foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
结合第二方面,在第二方面的某些实现方式中,在识别出待处理图像中的多个人物的群体动作后,根据该群体动作确定待处理图像中的关键人物。In combination with the second aspect, in some implementations of the second aspect, after identifying group actions of multiple characters in the image to be processed, the key person in the image to be processed is determined according to the group actions.
可选地,先确定待处理图像中多个人物中每个人物对上述群体动作的贡献度,再将贡献度最高的人物确定为关键人物。Optionally, first determine the contribution of each of the multiple characters in the image to be processed to the above group action, and then determine the person with the highest contribution as the key person.
应理解,该关键人物对多个人物的群体动作的贡献度大于多个人物中除关键人物之外的其他人物的贡献度。It should be understood that the contribution of the key person to the group actions of the multiple characters is greater than the contribution of other characters among the multiple characters except the key person.
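A minimal sketch of selecting the key person, assuming that a contribution score of each person to the group action is already available (for example, derived from attention or affinity weights); the scores shown are hypothetical:

    def find_key_person(contribution_scores: dict) -> str:
        # contribution_scores: {person_id: contribution of this person to the group action};
        # how the scores are obtained is assumed, not specified here
        return max(contribution_scores, key=contribution_scores.get)

    # Example: the ball-handling player would be expected to receive the highest score
    print(find_key_person({"player_1": 0.12, "player_7": 0.55, "referee": 0.03}))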
上述方式例如可以用于检测视频图像中的关键人物,通常情况下,视频中包含若干人物,其中大部分人并不重要。有效地检测出关键人物有助于根据关键人物周围信息,更加快速和准确地理解视频内容。The above method can be used to detect key persons in a video image, for example. Generally, the video contains several persons, most of which are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
For example, assuming that a video is of a ball game, the player in possession of the ball has the greatest influence on everyone present, including the players, referees and spectators, and contributes the most to the group action. Therefore, the player in possession of the ball can be determined as the key person. Identifying the key person helps viewers understand what is happening and what is about to happen in the game.In a third aspect, an image recognition method is provided. The method includes: extracting image features of an image to be processed; determining the dependency relationships between different persons in the image to be processed and the dependency relationships between the actions of the same person at different moments; fusing the image features with the spatio-temporal feature vectors to obtain the action feature of each frame of the image to be processed; and performing classification prediction on the action feature of each frame of image to determine the group action category of the image to be processed.In this application, the complex reasoning process of group action recognition is completed. When determining the group action of multiple persons, not only the temporal features of the multiple persons but also their spatial features are taken into consideration; by combining the temporal features and spatial features of the multiple persons, the group action of the multiple persons can be determined better and more accurately.Optionally, when extracting the image features of the image to be processed, target tracking may be performed on each person to determine the bounding box of each person in each frame of image, where the image in each bounding box corresponds to one person; feature extraction is then performed on the image in each bounding box to obtain the image feature of each person.
在上述提取待处理图像的图像特征时,还可以通过识别人物的骨骼节点,对图像特征进行提取,以减少特征提取过程中图像的冗余信息的影响,提高特征提取的准确性。具体地,可以根据骨骼节点,利用卷积网络提取图像特征。When extracting the image features of the image to be processed, the image features can also be extracted by identifying the bone nodes of the person, so as to reduce the influence of the redundant information of the image during the feature extraction process and improve the accuracy of feature extraction. Specifically, a convolutional network can be used to extract image features based on bone nodes.
可选地,可以根据人物结构,将边界框中的骨骼节点进行连接,以得到连接图像,然后对该连接图像进行图像特征向量的提取。或者,还可以将骨骼节点所在的区域和骨骼节点所在的区域之外的区域通过不同的颜色进行显示,然后再对处理后的图像进行图像特征的提取。Optionally, the bone nodes in the bounding box may be connected according to the structure of the person to obtain a connected image, and then image feature vector extraction is performed on the connected image. Alternatively, the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors, and then image feature extraction is performed on the processed image.
进一步,可以根据上述人物的骨骼节点所在的图像区域确定对应于该边界框的局部可见图像,然后对该局部可见图像进行特征提取,以得到待处理图像的图像特征。Further, the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
上述局部可见图像是由包括待处理图像中的人物的骨骼节点所在的区域组成的图像。具体地,可以将边界框中人物的骨骼节点所在区域之外的区域进行遮掩,以得到所述局部可见图像。The above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
可选地,可以根据人物的图像和骨骼节点,计算人物动作遮掩矩阵。遮掩矩阵中每个点对应于一个像素。遮掩矩阵中,以骨骼点为中心,边长为l的方形区域内的值设置为1,其他位置的值设置为0。Optionally, the character's action concealment matrix can be calculated according to the character's image and bone nodes. Each point in the masking matrix corresponds to a pixel. In the masking matrix, the value in the square area with the bone point as the center and side length l is set to 1, and the values in other positions are set to 0.
进一步,可以采用RGB色彩模式进行遮掩。RGB色彩模式使用RGB模型为图像中每一个像素的RGB分量分配一个0至255范围内的强度值。用遮掩矩阵对原始人物动作图片进行遮掩,得到局部可见图像。Further, the RGB color mode can be used for masking. The RGB color model uses the RGB model to assign an intensity value in the range of 0 to 255 for the RGB component of each pixel in the image. The masking matrix is used to mask the original character action pictures to obtain a partially visible image.
Optionally, a square region with side length l around each of the skeleton nodes is retained, and the other regions are masked.
对于每个人物,利用局部可见图像进行图像特征的提取,可以减少边界框中的冗余信息,可以根据人物的结构信息,提取图像特征,增强图像特征中对于人物动作的表现能力。For each person, the use of locally visible images for image feature extraction can reduce the redundant information in the bounding box, and can extract image features based on the structure information of the person, and enhance the performance of the person's actions in the image features.
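The masking described above may be sketched as follows, assuming the skeleton nodes are given as pixel coordinates inside a person's bounding-box crop; the function name and the exact keypoint format are assumptions, while the square regions of side length l follow the description of the masking matrix:

    import numpy as np

    def partially_visible_image(crop: np.ndarray, keypoints, l: int) -> np.ndarray:
        # crop: H x W x 3 RGB image inside one person's bounding box
        # keypoints: list of (x, y) skeleton-node coordinates inside the crop
        # l: side length of the square kept around each skeleton node
        h, w = crop.shape[:2]
        mask = np.zeros((h, w), dtype=np.uint8)   # masking matrix, one entry per pixel
        half = l // 2
        for x, y in keypoints:
            y0, y1 = max(0, y - half), min(h, y + half + 1)
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            mask[y0:y1, x0:x1] = 1                # 1 inside the squares, 0 elsewhere
        return crop * mask[:, :, None]            # occlude everything outside the squares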
When determining the dependency relationships between different persons in the image to be processed and the dependency relationships between the actions of the same person at different moments as described above, a cross interaction module may be used to determine the temporal correlation of the body postures of the persons in the multiple frames of images, and/or to determine the spatial correlation of the body postures of the persons in the multiple frames of images.
可选地,将上述十字交互模块用于实现特征的交互,建立特征交互模型,特征交互模型用于表示人物的身体姿态在时间上和/或空间上的关联关系。Optionally, the above-mentioned cross interaction module is used to realize the interaction of features to establish a feature interaction model, and the feature interaction model is used to represent the association relationship of the body posture of the character in time and/or space.
可选地,通过计算同一帧图像中不同的人物的图像特征之间的相似度,可以确定同一帧图像中不同的人物的身体姿态之间的空间依赖。所述空间依赖用于表示在某一帧图像中一人物的身体姿态对在其他人物的身体姿态的依赖,即人物动作间的空间依赖。可以通过空间特征向量表示空间依赖性。Optionally, by calculating the similarity between the image features of different characters in the same frame of image, the spatial dependence between the body postures of different characters in the same frame of image can be determined. The spatial dependence is used to indicate the dependence of the body posture of a character on the body posture of other characters in a certain frame of image, that is, the spatial dependence between the actions of the characters. The spatial dependency can be expressed by the spatial feature vector.
可选地,通过计算同一人物在不同时间的图像特征之间的相似度,可以确定同一人物在不同时间的身体姿态之间的时间依赖。所述时间依赖也可以称为时序依赖,用于表示在某一帧图像中该人物的身体姿态对在其他视频帧中该人物的身体姿态的依赖,即一个动作内在的时序依赖。可以通过时序特征向量表示时间依赖性。Optionally, by calculating the similarity between image features of the same person at different times, the time dependence between the body postures of the same person at different times can be determined. The time dependence may also be referred to as timing dependence, which is used to indicate the dependence of the body posture of the character in a certain frame of image on the body posture of the character in other video frames, that is, the inherent temporal dependence of an action. The time dependence can be expressed by the time series feature vector.
可以根据待处理图像中第k个人物的空间特征向量和时序特征向量,计算得到第k个人物的时-空特征向量。The spatio-temporal feature vector of the k-th person can be calculated according to the spatial feature vector and the time-series feature vector of the k-th person in the image to be processed.
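A possible sketch of computing the spatial dependence (between different persons in the same frame), the temporal dependence (of the same person across frames), and a combined spatio-temporal feature vector from pairwise similarities is given below, assuming the image features are arranged as an array X of shape (T, K, D); cosine similarity with softmax normalization, and combination by addition, are illustrative choices rather than requirements of this application:

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def spatio_temporal_features(X: np.ndarray) -> np.ndarray:
        # X: (T, K, D) image features of K persons over T frames
        Xn = X / (np.linalg.norm(X, axis=-1, keepdims=True) + 1e-12)
        # spatial dependence: persons within the same frame (K x K similarity per frame);
        # the diagonal (self-similarity) could be masked out to restrict to "other persons"
        spatial_att = softmax(np.einsum("tkd,tqd->tkq", Xn, Xn), axis=-1)
        spatial_feat = np.einsum("tkq,tqd->tkd", spatial_att, X)
        # temporal dependence: the same person across frames (T x T similarity per person)
        temporal_att = softmax(np.einsum("tkd,skd->kts", Xn, Xn), axis=-1)
        temporal_feat = np.einsum("kts,skd->tkd", temporal_att, X)
        # combine the two into one spatio-temporal feature vector per person per frame
        return spatial_feat + temporal_feat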
In the above process of fusing the image features with the spatio-temporal feature vectors to obtain the action feature of each frame of the image to be processed, the image features in the set X ∈ R^(T×K×D) of image features of the K persons in the images at the T moments are fused with the spatio-temporal feature vectors in the set H ∈ R^(T×K×D) of spatio-temporal feature vectors of the K persons in the images at the T moments, so as to obtain the action feature of each of the images at the T moments.Optionally, the image feature of the k-th person at time t is fused with the corresponding spatio-temporal feature vector to obtain the person feature vector of the k-th person at time t; alternatively, the image feature and the spatio-temporal feature vector are connected by a residual connection to obtain the person feature vector. According to the person feature vector of each of the K persons, the set of person feature vectors of the K persons at time t is determined. Maximum pooling is performed on the set of person feature vectors to obtain the action feature vector.
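A minimal sketch of the fusion at time t described above, assuming the image features and spatio-temporal feature vectors of the K persons are given as (K, D) arrays; the residual connection is realized here simply as an element-wise addition:

    import numpy as np

    def frame_action_feature(X_t: np.ndarray, H_t: np.ndarray) -> np.ndarray:
        # X_t: (K, D) image features of the K persons at time t
        # H_t: (K, D) spatio-temporal feature vectors of the same persons
        person_feats = X_t + H_t           # residual connection (fusion)
        return person_feats.max(axis=0)    # maximum pooling over the K persons -> (D,)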
在上述根据动作特征进行分类预测,以确定待处理图像的群体动作类别的过程中,可以采用不同的方式得到所述群体动作的分类结果。In the foregoing process of classifying and predicting based on the action characteristics to determine the group action category of the image to be processed, the classification result of the group action can be obtained in different ways.
可选地,将t时刻的动作特征向量输入分类模块,以得到对该帧图像的分类结果。可以将分类模块对任意t时刻的所述图像特征向量的分类结果作为T帧图像中的群体动作的分类结果。T帧图像中的群体动作的分类结果也可以理解为T帧图像中的人物的群体动作的分类结果,或者T帧图像的分类结果。Optionally, the action feature vector at time t is input to the classification module to obtain the classification result of the frame image. The classification result of the image feature vector at any time t by the classification module may be used as the classification result of the group action in the T frame image. The classification result of the group action in the T frame image can also be understood as the classification result of the group action of the person in the T frame image, or the classification result of the T frame image.
可选地,将T帧图像的动作特征向量分别输入分类模块,以得到每帧图像的分类结果。T帧图像的分类结果可以属于一个或多个类别。可以将分类模块的输出类别中对应的T帧图像中图像数量最多的一个类别作为T帧图像中的群体动作的分类结果。Optionally, the action feature vectors of the T frame images are respectively input to the classification module to obtain the classification result of each frame of image. The classification result of the T frame image can belong to one or more categories. The category with the largest number of images in the corresponding T-frame image in the output category of the classification module can be used as the classification result of the group action in the T-frame image.
可选地,对T帧图像的动作特征向量取平均值,以得到平均特征向量。平均特征向量中的每一位为T帧图像的图像特征向量表示中对应位的平均值。可以将平均特征向量输入分类模块,以得到T帧图像中的群体动作的分类结果。Optionally, average the action feature vectors of the T frame images to obtain the average feature vector. Each bit in the average feature vector is the average value of the corresponding bit in the image feature vector representation of the T frame image. The average feature vector can be input to the classification module to obtain the classification result of the group action in the T frame image.
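The averaging alternative may be sketched as follows, assuming the per-frame action feature vectors are stacked into a (T, D) array and classify is an assumed classification module:

    import numpy as np

    def group_action_by_average(frame_action_feats: np.ndarray, classify) -> int:
        # frame_action_feats: (T, D) action feature vector of each of the T frames
        # classify: assumed classification module mapping a D-dim vector to a class label
        avg_feat = frame_action_feats.mean(axis=0)   # element-wise average over the T frames
        return classify(avg_feat)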
The above method can complete the complex reasoning process of group action recognition: the image features of the multiple frames of images are determined; the temporal features and spatial features are determined according to the interdependence between different persons in the images and between actions at different times; the image features are then fused with these features to obtain the action feature of each frame of image; and the group action of the multiple frames of images is finally inferred by classifying the action feature of each frame of image.
第四方面,提供了一种图像识别装置,该图像识别装置具有实现第一方面至第三方面或其任意可能的实现方式中的方法的功能。In a fourth aspect, an image recognition device is provided, and the image recognition device has the function of implementing the methods in the first to third aspects or any possible implementation manners thereof.
可选地,该图像识别装置包括实现第一方面至第三方面中的任意一种实现方式中的实现方式中的方法的单元。Optionally, the image recognition device includes a unit that implements the method in any one of the first aspect to the third aspect.
第五方面,提供了一种神经网络的训练装置,该训练装置具有实现第一方面至第三方面中的任意一种实现方式中的方法的单元。In a fifth aspect, a neural network training device is provided, and the training device has a unit for implementing the method in any one of the first aspect to the third aspect.
In a sixth aspect, an image recognition apparatus is provided. The apparatus includes: a memory, configured to store a program; and a processor, configured to execute the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to perform the method in any one of the implementations of the first aspect to the third aspect.In a seventh aspect, a neural network training apparatus is provided. The apparatus includes: a memory, configured to store a program; and a processor, configured to execute the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to perform the method in any one of the implementations of the first aspect to the third aspect.
第八方面,提供了一种电子设备,该电子设备包括第四方面或者第六方面中的图像识别装置。In an eighth aspect, an electronic device is provided, and the electronic device includes the image recognition apparatus in the fourth aspect or the sixth aspect.
上述第八方面中的电子设备具体可以是移动终端(例如,智能手机),平板电脑,笔记本电脑,增强现实/虚拟现实设备以及车载终端设备等等。The electronic device in the above eighth aspect may specifically be a mobile terminal (for example, a smart phone), a tablet computer, a notebook computer, an augmented reality/virtual reality device, a vehicle-mounted terminal device, and so on.
In a ninth aspect, a computer device is provided. The computer device includes the neural network training apparatus in the fifth aspect or the seventh aspect.
该计算机设备具体可以是计算机、服务器、云端设备或者具有一定计算能力能够实现对神经网络的训练的设备。The computer device may specifically be a computer, a server, a cloud device, or a device with a certain computing capability that can implement neural network training.
In a tenth aspect, this application provides a computer-readable storage medium. The computer-readable storage medium stores computer instructions, and when the computer instructions are run on a computer, the computer is enabled to perform the method in any one of the implementations of the first aspect to the third aspect.In an eleventh aspect, this application provides a computer program product. The computer program product includes computer program code, and when the computer program code is run on a computer, the computer is enabled to perform the method in any one of the implementations of the first aspect to the third aspect.In a twelfth aspect, a chip is provided. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory, to perform the method in any one of the implementations of the first aspect to the third aspect.Optionally, as an implementation, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to perform the method in any one of the implementations of the first aspect to the third aspect.
上述芯片具体可以是现场可编程门阵列FPGA或者专用集成电路ASIC。The above-mentioned chip may specifically be a field programmable gate array FPGA or an application-specific integrated circuit ASIC.
附图说明Description of the drawings
图1是本申请实施例提供的一种应用环境的示意图;FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present application;
图2是本申请实施例提供的一种应用环境的示意图;FIG. 2 is a schematic diagram of an application environment provided by an embodiment of the present application;
图3是本申请实施例提供的一种群体动作识别的方法的示意性流程图;FIG. 3 is a schematic flowchart of a method for group action recognition provided by an embodiment of the present application;
图4是本申请实施例提供的一种群体动作识别的方法的示意性流程图;FIG. 4 is a schematic flowchart of a method for group action recognition provided by an embodiment of the present application;
图5是本申请实施例提供的一种系统架构的示意图;FIG. 5 is a schematic diagram of a system architecture provided by an embodiment of the present application;
图6是本申请实施例提供的一种卷积神经网络结构示意图;FIG. 6 is a schematic diagram of a convolutional neural network structure provided by an embodiment of the present application;
图7是本申请实施例提供的一种芯片硬件结构的示意图;FIG. 7 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application;
图8是本申请实施例提供的一种神经网络模型的训练方法的示意性流程图;FIG. 8 is a schematic flowchart of a method for training a neural network model provided by an embodiment of the present application;
图9是本申请实施例提供的一种图像识别方法的示意性流程图;FIG. 9 is a schematic flowchart of an image recognition method provided by an embodiment of the present application;
图10是本申请实施例提供的一种图像识别方法的示意性流程图;FIG. 10 is a schematic flowchart of an image recognition method provided by an embodiment of the present application;
图11是本申请实施例提供的一种图像识别方法的示意性流程图;FIG. 11 is a schematic flowchart of an image recognition method provided by an embodiment of the present application;
图12是本申请实施例提供的一种获取局部可见图像的过程的示意图;FIG. 12 is a schematic diagram of a process of obtaining a partially visible image provided by an embodiment of the present application;
图13是本申请实施例提供的一种计算图像特征之间相似度的方法的示意图;FIG. 13 is a schematic diagram of a method for calculating similarity between image features according to an embodiment of the present application;
图14是本申请实施例提供的不同人物动作在空间上的关系的示意图;FIG. 14 is a schematic diagram of the spatial relationship between different character actions provided by an embodiment of the present application;
图15是本申请实施例提供的不同人物动作在空间上的关系的示意图;15 is a schematic diagram of the spatial relationship of different character actions provided by an embodiment of the present application;
图16是本申请实施例提供的一个人物的动作在时间上的关系的示意图;FIG. 16 is a schematic diagram of the relationship in time of a character's actions according to an embodiment of the present application;
图17是本申请实施例提供的一个人物的动作在时间上的关系的示意图;FIG. 17 is a schematic diagram of the relationship in time of a character's actions according to an embodiment of the present application;
图18是本申请实施例提供的一种图像识别网络的系统架构的示意图;FIG. 18 is a schematic diagram of a system architecture of an image recognition network provided by an embodiment of the present application;
图19是本申请实施例提供的一种图像识别装置的结构示意图;FIG. 19 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application;
图20是本申请实施例提供的一种图像识别装置的结构示意图;FIG. 20 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application;
图21是本申请实施例提供的一种神经网络训练装置的结构示意图。FIG. 21 is a schematic structural diagram of a neural network training device provided by an embodiment of the present application.
Detailed description of embodiments
下面将结合附图,对本申请中的技术方案进行描述。The technical solution in this application will be described below in conjunction with the accompanying drawings.
本申请的方案可以应用在视频分析、视频识别、异常或危险的行为检测等需要对多人 复杂场景的视频分析的领域。该视频例如可以是体育比赛视频、日常监控视频等。下面对两种常用的应用场景进行简单的介绍。The solution of the present application can be applied to the fields of video analysis, video recognition, abnormal or dangerous behavior detection, etc., which require video analysis of complex scenes of multiple people. The video may be, for example, a sports game video, a daily surveillance video, and the like. Two commonly used application scenarios are briefly introduced below.
应用场景一:视频管理系统Application scenario 1: Video management system
随着移动网速的迅速提升,用户在电子设备上存储了大量的短视频。短视频中可能包括不止一个人。对视频库中的短视频进行识别可以方便用户或者系统对视频库进行分类管理,提升用户体验。With the rapid increase of mobile internet speeds, users store a large number of short videos on electronic devices. More than one person may be included in the short video. Recognizing short videos in the video library can facilitate the user or the system to classify and manage the video library and improve user experience.
As shown in Figure 1, using the group action recognition system provided in this application, a neural network structure suitable for short video classification is trained on a given database and then deployed and tested. The trained neural network structure can be used to determine the tag corresponding to each short video, that is, to classify the short videos, obtain the group action category corresponding to each short video, and tag different short videos with different tags. This makes it convenient for users to view and search, saves the time of manual classification and management, and improves management efficiency and user experience.
应用场景二:关键人物检测系统Application Scenario 2: Key Person Detection System
通常情况下,视频中包括若干人物,其中大部分人并不重要。有效地检测出关键人物有助于快速理解场景内容。如图2所示,利用本申请提供的群体动作识别系统,可以识别出视频中关键人物,从而根据关键人物周围信息,更加准确地理解视频内容。Usually, the video includes several people, most of whom are not important. Detecting key figures effectively helps to quickly understand the content of the scene. As shown in Figure 2, the group action recognition system provided by the present application can identify key persons in the video, so as to understand the video content more accurately based on the information around the key persons.
为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。In order to facilitate understanding, the following first introduces related terms and neural network related concepts involved in the embodiments of the present application.
(1)神经网络(1) Neural network
A neural network may be composed of neural units. A neural unit may refer to an arithmetic unit that takes inputs x_s and an intercept b, and the output of the arithmetic unit may be:

    f( Σ_{s=1}^{n} W_s · x_s + b )

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f() is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of the local receptive field, where the local receptive field may be a region composed of several neural units.
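As an illustrative sketch of the neural unit above, with the sigmoid function as one possible choice of the activation function f():

    import numpy as np

    def neural_unit(x: np.ndarray, W: np.ndarray, b: float) -> float:
        # x: inputs x_s; W: weights W_s; b: bias of the neural unit
        z = np.dot(W, x) + b
        return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation f()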
(2)深度神经网络(deep neural network,DNN)(2) Deep neural network (deep neural network, DNN)
A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no special metric for "many" here. According to the positions of the different layers, the neural network inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. For example, in a fully connected neural network, the layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the (i+1)-th layer. Although a DNN looks complicated, the work of each layer is not complicated; simply put, it is the following linear relationship expression:

    y = α( W · x + b )

where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer merely performs this simple operation on the input vector x to obtain the output vector y. Because a DNN has many layers, there are also many coefficients W and offset vectors b. These parameters are defined in the DNN as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.

In summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W^L_{jk}. It should be noted that the input layer has no W parameters. In a deep neural network, more hidden layers enable the network to better characterize complex situations in the real world. In theory, a model with more parameters has higher complexity and a larger "capacity", which means that it can complete more complex learning tasks. Training a deep neural network is the process of learning the weight matrices, and its ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of the many layers).
(3)卷积神经网络(convolutional neuron network,CNN)(3) Convolutional neural network (convolutional neuron network, CNN)
卷积神经网络是一种带有卷积结构的深度神经网络。卷积神经网络包括了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包括若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。Convolutional neural network is a deep neural network with convolutional structure. The convolutional neural network includes a feature extractor composed of a convolutional layer and a sub-sampling layer. The feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolution feature map. The convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network. In the convolutional layer of a convolutional neural network, a neuron can only be connected to a part of the neighboring neurons. A convolutional layer usually includes several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are the convolution kernels. Sharing weight can be understood as the way of extracting image information has nothing to do with location. The underlying principle is that the statistical information of a certain part of the image is the same as that of other parts. This means that the image information learned in one part can also be used in another part. Therefore, the image information obtained by the same learning can be used for all positions on the image. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。The convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
(4)循环神经网络(recurrent neural networks,RNN)(4) Recurrent Neural Networks (RNN)
循环神经网络是用来处理序列数据的。在传统的神经网络模型中,是从输入层到隐含层再到输出层,层与层之间是全连接的,而对于每一层层内之间的各个节点是无连接的。这种普通的神经网络虽然解决了很多难题,但是却仍然对很多问题却无能无力。例如,你要预测句子的下一个单词是什么,一般需要用到前面的单词,因为一个句子中前后单词并不是独立的。RNN之所以称为循环神经网路,即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即隐含层本层之间的节点不再无连接而是有连接的,并且隐含层的输入不仅包括输入层的输出还包括上一时刻隐含层的输出。理论上,RNN能够对任何长度的序列数据进行处理。对于RNN的训练和对传统的CNN或DNN的训练一样。同样使用误差反向传播算法,不过有一点区别:即,如果将RNN进行网络展开,那么其中的参数,如W,是共享的;而如上举例上述的传统神经网络却不是这样。并且在使用梯度下降算法中,每一步的输出不仅依赖当前步的网络,还依赖前面若干步网络的状态。该学习算法称为基于时间的反向传播算法(back propagation through time,BPTT)。Recurrent neural network is used to process sequence data. In the traditional neural network model, from the input layer to the hidden layer and then to the output layer, the layers are fully connected, and the nodes in each layer are disconnected. Although this ordinary neural network has solved many problems, it is still powerless for many problems. For example, if you want to predict what the next word of a sentence is, you generally need to use the previous word, because the preceding and following words in a sentence are not independent. The reason why RNN is called recurrent neural network is that the current output of a sequence is also related to the previous output. The specific form of expression is that the network will memorize the previous information and apply it to the calculation of the current output, that is, the nodes between the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only The output of the input layer also includes the output of the hidden layer at the previous moment. In theory, RNN can process sequence data of any length. The training of RNN is the same as the training of traditional CNN or DNN. The error back-propagation algorithm is also used, but there is a difference: that is, if the RNN is network expanded, then the parameters, such as W, are shared; while the traditional neural network mentioned above is not the case. And in using the gradient descent algorithm, the output of each step depends not only on the current step of the network, but also on the state of the previous steps of the network. This learning algorithm is called backpropagation through time (BPTT).
Since convolutional neural networks already exist, why are recurrent neural networks still needed? The reason is simple: in a convolutional neural network, there is a premise that the elements are independent of each other, and the inputs and outputs are also independent, such as cats and dogs. But in the real world, many elements are connected to each other. For example, stock prices change over time; or someone says: "I like traveling, and my favorite place is Yunnan; I will definitely go there when I have the opportunity." If asked to fill in the blank here, humans all know that the answer is "Yunnan", because humans infer from the context. But how can a machine do this? RNNs came into being for this purpose. An RNN is intended to give machines a memory capability like that of humans. Therefore, the output of an RNN needs to depend on the current input information and the historical memory information.
(5)损失函数(loss function)(5) Loss function (loss function)
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。In the process of training a deep neural network, because it is hoped that the output of the deep neural network is as close as possible to the value that you really want to predict, you can compare the predicted value of the current network with the target value you really want, and then based on the difference between the two To update the weight vector of each layer of neural network (of course, there is usually an initialization process before the first update, that is, pre-configured parameters for each layer in the deep neural network), for example, if the predicted value of the network If it is high, adjust the weight vector to make its prediction lower, and keep adjusting until the deep neural network can predict the really wanted target value or a value very close to the really wanted target value. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value". This is the loss function or objective function, which is an important equation used to measure the difference between the predicted value and the target value. Among them, taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, then the training of the deep neural network becomes a process of reducing this loss as much as possible.
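As an illustrative sketch, two common choices of loss function are shown below: the mean squared error for regression-style targets and the cross-entropy loss for classification (such as group action classification); neither is prescribed by this application:

    import numpy as np

    def mse_loss(pred: np.ndarray, target: np.ndarray) -> float:
        return float(np.mean((pred - target) ** 2))

    def cross_entropy_loss(logits: np.ndarray, label: int) -> float:
        # logits: raw class scores; label: index of the true class
        z = logits - logits.max()
        log_probs = z - np.log(np.exp(z).sum())
        return float(-log_probs[label])   # the lower the loss, the closer prediction and target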
(6)残差网络(residual network,ResNet)(6) Residual network (residual network, ResNet)
在不断加神经网络的深度时,会出现退化的问题,即随着神经网络深度的增加,准确率先上升,然后达到饱和,再持续增加深度则会导致准确率下降。普通直连的卷积神经网络和残差网络的最大区别在于,ResNet有很多旁路的支线将输入直接连到后面的层,通过直接将输入信息绕道传到输出,保护信息的完整性,解决退化的问题。残差网络包括卷积层和/或池化层。When the depth of the neural network is continuously increased, the problem of degradation will occur, that is, as the depth of the neural network increases, the accuracy first increases, and then reaches saturation, and then continues to increase the depth will cause the accuracy to decrease. The biggest difference between the ordinary direct-connected convolutional neural network and the residual network is that ResNet has many bypass branches to directly connect the input to the subsequent layer, and protect the integrity of the information by directly detouring the input information to the output. The problem of degradation. The residual network includes a convolutional layer and/or a pooling layer.
A residual network can be understood as follows: in a deep neural network, in addition to the layer-by-layer connections between multiple hidden layers (for example, the first hidden layer is connected to the second hidden layer, the second hidden layer is connected to the third hidden layer, and the third hidden layer is connected to the fourth hidden layer; this is one data operation path of the neural network and can also be vividly called neural network transmission), the residual network has an additional directly connected branch, which connects the first hidden layer directly to the fourth hidden layer, that is, the processing of the second and third hidden layers is skipped, and the data of the first hidden layer is transmitted directly to the fourth hidden layer for operation. A highway network can be understood as follows: in addition to the above operation path and the directly connected branch, the deep neural network further includes a weight obtaining branch, which introduces a transform gate to obtain a weight value and outputs the weight value T for subsequent operations of the above operation path and the directly connected branch.
(7)反向传播(back propagation,BP)算法(7) Backpropagation (BP) algorithm
卷积神经网络可以采用误差反向传播算法在训练过程中修正初始的神经网络模型中参数的大小,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的参数,例如权重矩阵。The convolutional neural network can use the error back propagation algorithm to modify the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forwarding the input signal until the output will cause error loss, and the parameters in the initial neural network model are updated by backpropagating the error loss information, so that the error loss is converged. The backpropagation algorithm is a backpropagation motion dominated by error loss, and aims to obtain the optimal parameters of the neural network model, such as the weight matrix.
(8)像素值(8) Pixel value
图像的像素值可以是一个红绿蓝(Red-Green-Blue,RGB)颜色值,像素值可以是表示颜色的长整数。例如,像素值为255×Red+100×Green+76×Blue,其中,Blue代表蓝色分量,Green代表绿色分量,Red代表红色分量。各个颜色分量中,数值越小,亮度越低,数值 越大,亮度越高。对于灰度图像来说,像素值可以是灰度值。The pixel value of the image can be a Red-Green-Blue (RGB) color value, and the pixel value can be a long integer representing the color. For example, the pixel value is 255×Red+100×Green+76×Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness. For grayscale images, the pixel values can be grayscale values.
(9)群体动作识别(group activity recognition,GAR)(9) Group activity recognition (GAR)
群体动作识别也可以称为群体活动识别,用于识别视频中一群人所做的事情。是计算机视觉中的一个重要课题。GAR有许多潜在的应用,包括视频监控和体育视频分析。与传统的单人动作识别相比,GAR不仅需要识别人物的行为,还需要推断人物之间的潜在关系。Group action recognition can also be called group activity recognition, which is used to identify what a group of people do in a video. It is an important subject in computer vision. GAR has many potential applications, including video surveillance and sports video analysis. Compared with traditional single-person action recognition, GAR not only needs to recognize the behavior of the characters, but also needs to infer the potential relationship between the characters.
群体动作识别的可以采用以下方式:Group action recognition can use the following methods:
(1)从相应的边界框中提取每个人物的时序特征(又称作人物动作表示);(1) Extract the temporal characteristics of each character from the corresponding bounding box (also known as character action representation);
(2)推断每个人物之间的空间上下文(又称作交互动作表示);(2) Infer the spatial context between each character (also known as interactive action representation);
(3)将这些表示连接为最终的组活动特性(又称作特征聚合)。(3) Connect these representations into the final group activity characteristics (also called feature aggregation).
这些方法确实有效,但却忽略了多级信息的并发性,导致GAR的性能不尽人意。These methods are indeed effective, but they ignore the concurrency of multi-level information, resulting in unsatisfactory performance of GAR.
一个群体动作是由该群体中若干人物的不同动作组成的,即相当于几个人物合作完成的动作,而这些人物动作又反映出身体的不同姿态。A group action is composed of different actions of several characters in the group, which is equivalent to actions completed by several characters in cooperation, and these character actions reflect different postures of the body.
此外,传统的模型往往忽略了不同人物之间的空间依赖关系,人物之间的空间依赖关系以及每个人物动作的时间依赖关系都可以为GAR提供重要的线索。例如,一个人在击球时,必须观察他的队友情况,同时,他必须随着时间推移不断调整自身姿态,以执行这样一个击球动作。而这样几个人互相合作完成一个群体动作。以上所有这些信息,包括每帧图像中每个人的动作特征(也可以称为人物的姿态(human parts)特征)、每个人的动作在时间上和空间上的依赖性特征(也可以称为人物动作(human actions)特征)、每帧图像的特征(也可以称为群体活动(group activity)特征)及这些特征之间的相互关系,共同构成一个实体,影响着群体动作的识别。也就是说,传统方法通过使用分步法去处理这样一个实体的复杂信息,无法充分利用其中潜在的时间和空间的依赖性。不仅如此,这些方法还极有可能会破坏空间域和时间域之间的共现关系。现有方法往往直接在提取时序依赖特征的情况下训练CNN网络,因此特征提取网络提取的特征忽略了图像中的人之间的空间依赖关系。另外,边界框中包括较多的冗余信息,这些信息可能会较低提取的人物的动作特征的准确性。In addition, traditional models often ignore the spatial dependence between different characters. The spatial dependence between characters and the time dependence of each character's actions can provide important clues for GAR. For example, when a person hits the ball, he must observe the situation of his teammates, and at the same time, he must constantly adjust his posture over time to perform such a hitting action. And such a few people cooperate with each other to complete a group action. All of the above information, including the action characteristics of each person in each frame of the image (also called the human parts characteristics), and the dependence characteristics of each person's actions in time and space (also called the characters Human actions (human actions features), the features of each frame of images (also called group activity features), and the interrelationship between these features, together constitute an entity that affects the recognition of group actions. In other words, the traditional method uses a step-by-step method to process the complex information of such an entity, and cannot make full use of its potential time and space dependence. Not only that, these methods are also very likely to destroy the co-occurrence relationship between the space domain and the time domain. Existing methods often train the CNN network directly under the condition of extracting timing-dependent features. Therefore, the features extracted by the feature extraction network ignore the spatial dependence between people in the image. In addition, the bounding box contains more redundant information, which may lower the accuracy of the extracted character's action features.
图3是一种群体动作识别的方法的示意性流程图。具体可参见《A Hierarchical Deep Temporal Model for Group Activity Recognition》(Ibrahim M S,Muralidharan S,Deng Z,et al.IEEE Conference on Computer Vision and Pattern Recognition.2016:1971-1980)。Fig. 3 is a schematic flow chart of a method for group action recognition. For details, please refer to "A Hierarchical Deep Temporal Model for Group Activity Recognition" (Ibrahim M S, Muralidharan S, Deng Z, et al. IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1971-1980).
使用已有算法对多个视频帧中的若干人物进行目标跟踪,确定每个人物在多个视频帧中每个视频帧中的大小和位置。使用人物CNN提取每个视频帧中每个人物的卷积特征,并将卷积特征输入人物长短期记忆网络(long short-term memory,LSTM)提取每个人物的时序特征。将每个人物对应的卷积特征和时序特征进行拼接,作为该人物的人物动作特征。将视频中多个人物的人物动作特征进行拼接和最大池化,以得到每个视频帧的动作特征。将每个视频帧的动作特征输入群体LSTM,以获得视频帧对应的特征。将视频帧对应的特征输入群体动作分类器,从而对输入的视频进行分类,即确定视频中的群体动作所属的类别。The existing algorithm is used to target several people in multiple video frames, and the size and position of each person in each video frame are determined. The person CNN is used to extract the convolutional features of each person in each video frame, and the convolutional features are input into the person's long short-term memory network (LSTM) to extract the time series features of each person. The convolution feature and time sequence feature corresponding to each person are spliced together as the person's action feature of the person. The character action characteristics of multiple characters in the video are spliced and max pooled to obtain the action characteristics of each video frame. The action characteristics of each video frame are input into the group LSTM to obtain the corresponding characteristics of the video frame. The feature corresponding to the video frame is input into the group action classifier to classify the input video, that is, the category to which the group action in the video belongs is determined.
需要进行两步训练,以得到能够对包括该特定类型的群体动作的视频进行识别的分层深度时序模型(hierarchical deep temporal model,HDTM)。HDTM模型包括人物CNN、 人物LSTM、群体LSTM和群体动作分类器。Two-step training is required to obtain a hierarchical deep temporal model (HDTM) that can recognize videos that include this specific type of group action. The HDTM model includes a character CNN, a character LSTM, a group LSTM, and a group action classifier.
使用已有算法对多个视频帧中的若干人物进行目标跟踪,确定每个人物在多个视频帧中每个视频帧中的大小和位置。每个人物对应于一个人物动作标签。每个输入的视频对应于一个群体动作标签。The existing algorithm is used to target several people in multiple video frames, and the size and position of each person in each video frame are determined. Each character corresponds to a character action tag. Each input video corresponds to a group action tag.
第一步训练,根据每个人物对应于的人物动作标签,训练人物CNN、人物LSTM和人物动作分类器,从而得到训练后的人物CNN、训练后的人物LSTM。The first step is to train the character CNN, character LSTM, and character action classifier according to the character action label corresponding to each character, so as to obtain the trained character CNN and the trained character LSTM.
第二步训练,根据群体动作标签训练群体LSTM和群体动作分类器的参数,从而得到训练后的群体LSTM和训练后的群体动作分类器。The second step of training is to train the parameters of the group LSTM and the group action classifier according to the group action tags, so as to obtain the trained group LSTM and the trained group action classifier.
根据第一步训练得到人物CNN、人物LSTM,提取输入的视频中每个人物的卷积特征、时序特征。之后,根据提取的多个人物的卷积特征、时序特征进行拼接得到的每个视频帧的特征表示,进行第二步训练。在两步训练完成之后,得到的神经网络模型能够对输入的视频进行群体动作识别。According to the first step of training, the person CNN and the person LSTM are obtained, and the convolutional features and timing features of each person in the input video are extracted. After that, the second step of training is performed according to the feature representation of each video frame obtained by splicing the convolution features and time sequence features of the extracted multiple people. After the two-step training is completed, the obtained neural network model can perform group action recognition on the input video.
每个人物的人物动作特征表示的确定,是由第一步训练的神经网络模型进行的。对多个人物的人物动作特征表示进行融合,从而识别群体动作,是由第二步训练的神经网络模型进行的。特征提取与群体动作分类之间存在信息隔阂,即第一步训练得到的神经网络模型能够准确提取识别人物动作的特征,但这些特征是否适用于群体动作的识别,并不能得到保证。The determination of the character's action feature representation of each character is carried out by the neural network model trained in the first step. The fusion of the character action feature representations of multiple characters to identify group actions is performed by the neural network model trained in the second step. There is an information gap between feature extraction and group action classification, that is, the neural network model obtained in the first step of training can accurately extract the features of recognizing people's actions, but whether these features are suitable for group action recognition can not be guaranteed.
图4是一种群体动作识别的方法的示意性流程图。具体可参见《Social scene understanding:End-to-end multi-person action localization and collective activity recognition》(Bagautdinov,Timur,et al.IEEE Conference on Computer Vision and Pattern Recognition.2017:4315-4324)。Fig. 4 is a schematic flowchart of a method for group action recognition. For details, please refer to "Social scene understanding: End-to-end multi-person action localization and collective activity recognition" (Bagautdinov, Timur, et al. IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4315-4324).
The t-th frame image of several video frames is sent into a fully convolutional network (FCN) to obtain several person features f_t. Temporal modeling is performed on the person features f_t through an RNN to obtain the temporal feature of each person, and the temporal feature of each person is sent into a classifier to simultaneously recognize the person action p^I_t and the group action p^C_t.
需要进行一步训练,以得到能够对包括该特定类型的群体动作的视频进行识别的神经网络模型。也就是说,将训练图像输入FCN,根据训练图像中每个人物的人物动作标签和群体动作标签,对FCN、RNN的参数进行调整,以得到训练后的FCN、RNN。A step of training is required to obtain a neural network model that can recognize videos that include this specific type of group action. That is to say, the training image is input into the FCN, and the parameters of the FCN and RNN are adjusted according to the character action tag and group action tag of each character in the training image to obtain the trained FCN and RNN.
The FCN can generate a multi-scale feature map F_t of the t-th frame image. Several detection boxes B_t and corresponding probabilities p_t are generated through a deep fully convolutional network (DFCN), and B_t and p_t are sent into a Markov random field (MRF) to obtain reliable detection boxes b_t, so that the features f_t corresponding to the reliable detection boxes b_t are determined from the multi-scale feature map F_t. According to the features of the persons in the reliable detection boxes b_{t-1} and b_t, it can be determined that b_{t-1} and b_t correspond to the same person. The FCN can also be obtained through pre-training.
一个群体动作是由若干人物的不同动作组成,而这些人物动作又反映在每个人物的不同身体姿态。人物的时序特征可以反映一个人物的动作的时间依赖关系。人物动作之间的空间依赖关系,也为群体动作识别提供重要的线索。未考虑人物之间的空间依赖性的群体动作识别方案,准确性受到一定影响。A group action is composed of different actions of several characters, and these character actions are reflected in the different body postures of each character. The temporal characteristics of a character can reflect the time dependence of a character's actions. The spatial dependence between character actions also provides important clues for group action recognition. The accuracy of the group action recognition scheme that does not consider the spatial dependence between characters is affected to a certain extent.
另外,在神经网络的训练过程中,确定每个人物的人物动作标签通常由人工进行,工作量较大。In addition, in the training process of the neural network, determining the character action label of each character is usually done manually, which requires a lot of work.
为了解决上述问题,本申请实施例提供了一种图像识别方法。本申请的在确定多个人 物的群体动作时,不仅考虑到了多个人物的时序特征,还考虑到了多个人物的空间特征,通过综合多个人物的时序特征和空间特征能够更好更准确地确定出多个人物的群体动作。In order to solve the above-mentioned problem, an embodiment of the present application provides an image recognition method. When determining the group actions of multiple characters in this application, not only the temporal characteristics of multiple characters are considered, but also the spatial characteristics of multiple characters are considered. By integrating the temporal characteristics and spatial characteristics of multiple characters, it is possible to better and more accurately Determine the group actions of multiple characters.
A system architecture of an embodiment of the present application is first introduced below with reference to FIG. 5.
FIG. 5 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in FIG. 5, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data collection system 560.
In addition, the execution device 510 includes a calculation module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514. The calculation module 511 may include a target model/rule 501, and the preprocessing module 513 and the preprocessing module 514 are optional.
The data collection device 560 is used to collect training data. For the image recognition method of the embodiment of the present application, the training data may include multiple frames of training images (the multiple frames of training images contain multiple characters, for example multiple persons) and corresponding labels, where a label gives the group action category of the persons in the training images. After the training data are collected, the data collection device 560 stores the training data in the database 530, and the training device 520 trains a target model/rule 501 based on the training data maintained in the database 530.
The following describes how the training device 520 obtains the target model/rule 501 based on the training data. The training device 520 recognizes the input multiple frames of training images and compares the output prediction category with the label, until the difference between the prediction category output by the training device 520 and the label is smaller than a certain threshold, thereby completing the training of the target model/rule 501.
The above target model/rule 501 can be used to implement the image recognition method of the embodiment of the present application, that is, one or more frames of to-be-processed images (after relevant preprocessing) are input into the target model/rule 501 to obtain the group action category of the persons in the one or more frames of to-be-processed images. The target model/rule 501 in the embodiment of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 530 do not necessarily all come from the collection of the data collection device 560 and may also be received from other devices. It should also be noted that the training device 520 does not necessarily train the target model/rule 501 entirely based on the training data maintained in the database 530; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of the present application.
The target model/rule 501 trained by the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG. 5. The execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, and may also be a server or a cloud. In FIG. 5, the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with external devices. A user can input data to the I/O interface 512 through the client device 540. In this embodiment of the present application, the input data may include a to-be-processed image input by the client device. The client device 540 here may specifically be a terminal device.
The preprocessing module 513 and the preprocessing module 514 are used to perform preprocessing according to the input data (such as the to-be-processed image) received by the I/O interface 512. In this embodiment of the present application, the preprocessing module 513 and the preprocessing module 514 may be absent, or there may be only one preprocessing module. When the preprocessing module 513 and the preprocessing module 514 are absent, the calculation module 511 can be used directly to process the input data.
When the execution device 510 preprocesses the input data, or when the calculation module 511 of the execution device 510 performs calculation or other related processing, the execution device 510 may call data, code, and the like in the data storage system 550 for the corresponding processing, and may also store the data, instructions, and the like obtained by the corresponding processing into the data storage system 550.
Finally, the I/O interface 512 presents the processing result, such as the group action category calculated by the target model/rule 501, to the client device 540, so as to provide it to the user.
Specifically, the group action category obtained by the target model/rule 501 in the calculation module 511 may be processed by the preprocessing module 513 (and possibly also by the preprocessing module 514), after which the processing result is sent to the I/O interface, and the I/O interface then sends the processing result to the client device 540 for display.
It should be understood that when the preprocessing module 513 and the preprocessing module 514 are not present in the above system architecture 500, the calculation module 511 may also transmit the group action category obtained by its processing to the I/O interface, and the I/O interface then sends the processing result to the client device 540 for display.
It is worth noting that the training device 520 can generate corresponding target models/rules 501 based on different training data for different goals, or different tasks, and the corresponding target models/rules 501 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.
In the case shown in FIG. 5, the user can manually specify the input data, and this manual specification can be operated through the interface provided by the I/O interface 512. In another case, the client device 540 can automatically send input data to the I/O interface 512. If the client device 540 is required to obtain the user's authorization before automatically sending the input data, the user can set the corresponding permission in the client device 540. The user can view the result output by the execution device 510 on the client device 540, and the specific presentation form may be display, sound, action, or another specific manner. The client device 540 can also serve as a data collection terminal that collects, as shown in the figure, the input data of the I/O interface 512 and the output result of the I/O interface 512 as new sample data and stores them in the database 530. Of course, the collection may also bypass the client device 540; instead, the I/O interface 512 directly stores the input data of the I/O interface 512 and the output result of the I/O interface 512, as shown in the figure, into the database 530 as new sample data.
It is worth noting that FIG. 5 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, components, modules, and the like shown in the figure does not constitute any limitation. For example, in FIG. 5, the data storage system 550 is an external memory relative to the execution device 510; in other cases, the data storage system 550 may also be placed in the execution device 510.
As shown in FIG. 5, the target model/rule 501 obtained by training with the training device 520 may be the neural network in the embodiments of the present application. Specifically, the neural network provided in the embodiments of the present application may be a CNN, a deep convolutional neural network (DCNN), and so on.
Since the CNN is a very common neural network, the structure of the CNN is introduced below with emphasis in conjunction with FIG. 6. As mentioned in the introduction of basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron can respond to the image input into it.
FIG. 6 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application. As shown in FIG. 6, the convolutional neural network 600 may include an input layer 610, a convolutional layer/pooling layer 620 (where the pooling layer is optional), and a fully connected layer 630. The relevant content of these layers is described in detail below.
Convolutional layer/pooling layer 620:
Convolutional layer:
As shown in FIG. 6, the convolutional layer/pooling layer 620 may include layers 621-626 as examples. For example, in one implementation, layer 621 is a convolutional layer, layer 622 is a pooling layer, layer 623 is a convolutional layer, layer 624 is a pooling layer, layer 625 is a convolutional layer, and layer 626 is a pooling layer; in another implementation, layers 621 and 622 are convolutional layers, layer 623 is a pooling layer, layers 624 and 625 are convolutional layers, and layer 626 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The following takes the convolutional layer 621 as an example to introduce the internal working principle of a convolutional layer.
The convolutional layer 621 may include many convolution operators. A convolution operator is also called a kernel, and its role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. In the process of performing a convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride), so as to extract specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), i.e., multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolutional image, where this dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features from the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on. The multiple weight matrices have the same size (rows × columns), so the convolutional feature maps extracted by these weight matrices of the same size also have the same size, and the extracted convolutional feature maps of the same size are then combined to form the output of the convolution operation.
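The following is a minimal sketch, in PyTorch, of a convolutional layer with several weight matrices of the same size whose outputs are stacked along the depth dimension; the channel counts, kernel size, and stride are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

# A convolutional layer with 16 kernels (weight matrices) of size 3x3, moving 1 pixel at a time.
# Each kernel spans the full depth of the input (3 channels here), and the 16 outputs are
# stacked to form the depth dimension of the resulting feature map.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

image = torch.randn(1, 3, 224, 224)   # one RGB input image
feature_map = conv(image)
print(feature_map.shape)              # torch.Size([1, 16, 224, 224])
```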
In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training. The weight matrices formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network 600 can make correct predictions.
When the convolutional neural network 600 has multiple convolutional layers, the initial convolutional layer (for example, 621) often extracts more general features, which may also be called low-level features. As the depth of the convolutional neural network 600 increases, the features extracted by the later convolutional layers (for example, 626) become more and more complex, such as high-level semantic features; features with higher-level semantics are more suitable for the problem to be solved.
Pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. In the layers 621-626 illustrated by 620 in FIG. 6, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image of a smaller size. The average pooling operator can calculate the pixel values in the image within a specific range to produce an average value as the result of average pooling. The maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling. In addition, just as the size of the weight matrix in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input into the pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input into the pooling layer.
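A minimal sketch of the two pooling operators described above, again in PyTorch with arbitrary sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 224, 224)         # feature map from a convolutional layer

max_pool = nn.MaxPool2d(kernel_size=2)   # each output pixel is the maximum of a 2x2 sub-region
avg_pool = nn.AvgPool2d(kernel_size=2)   # each output pixel is the average of a 2x2 sub-region

print(max_pool(x).shape, avg_pool(x).shape)   # both torch.Size([1, 16, 112, 112])
```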
Fully connected layer 630:
After the processing of the convolutional layer/pooling layer 620, the convolutional neural network 600 is not yet able to output the required output information. As mentioned above, the convolutional layer/pooling layer 620 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network 600 needs to use the fully connected layer 630 to generate an output for one or a group of required classes. Therefore, the fully connected layer 630 may include multiple hidden layers (631, 632 to 63n as shown in FIG. 6) and an output layer 640. The parameters included in the multiple hidden layers may be obtained by pre-training on the relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the multiple hidden layers in the fully connected layer 630, the last layer of the entire convolutional neural network 600 is the output layer 640. The output layer 640 has a loss function similar to the classification cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 600 (propagation in the direction from 610 to 640 in FIG. 6 is forward propagation) is completed, back propagation (propagation in the direction from 640 to 610 in FIG. 6 is back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 600 and the error between the result output by the convolutional neural network 600 through the output layer and the ideal result.
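As a rough illustration of the forward propagation, loss calculation, and back propagation described above, the following toy training step uses an arbitrary small network and an assumed number of eight output classes; it is a sketch only, not the network of FIG. 6.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 8))
loss_fn = nn.CrossEntropyLoss()          # classification cross-entropy, as in the output layer 640
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(4, 3, 224, 224)     # a toy batch of images
labels = torch.randint(0, 8, (4,))       # toy class labels

logits = model(images)                   # forward propagation (610 -> 640 direction)
loss = loss_fn(logits, labels)           # prediction error
loss.backward()                          # back propagation (640 -> 610 direction)
optimizer.step()                         # update weights and biases to reduce the loss
optimizer.zero_grad()
```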
It should be noted that the convolutional neural network 600 shown in FIG. 6 is only an example of a convolutional neural network; in specific applications, the convolutional neural network may also exist in the form of other network models.
It should be understood that the convolutional neural network 600 shown in FIG. 6 can be used to execute the image recognition method of the embodiment of the present application. As shown in FIG. 6, after the to-be-processed image is processed by the input layer 610, the convolutional layer/pooling layer 620, and the fully connected layer 630, the group action category can be obtained.
FIG. 7 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application. As shown in FIG. 7, the chip includes a neural network processor 700. The chip can be provided in the execution device 510 shown in FIG. 5 to complete the calculation work of the calculation module 511. The chip can also be provided in the training device 520 shown in FIG. 5 to complete the training work of the training device 520 and output the target model/rule 501. The algorithms of all the layers in the convolutional neural network shown in FIG. 6 can be implemented in the chip shown in FIG. 7.
The neural-network processing unit (NPU) 50 is mounted on a host central processing unit (CPU) as a coprocessor, and the host CPU allocates tasks. The core part of the NPU is the arithmetic circuit 703; the controller 704 controls the arithmetic circuit 703 to fetch data from the memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuit 703 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 703 is a two-dimensional systolic array. The arithmetic circuit 703 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 703 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 703 fetches the data corresponding to matrix B from the weight memory 702 and buffers it on each PE in the arithmetic circuit 703. The arithmetic circuit 703 fetches the data of matrix A from the input memory 701, performs matrix operations with matrix B, and stores the partial results or final results of the obtained matrix in an accumulator 708.
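The following is a simplified software analogue, in NumPy, of accumulating partial matrix products in the way the accumulator 708 is described above; it does not model the actual systolic-array hardware, and the tile size and matrix shapes are arbitrary.

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    """C = A @ B computed slice by slice, summing partial results into an accumulator."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)   # accumulator: partial results are summed here
    for k0 in range(0, K, tile):
        # each pass multiplies a slice of the input matrix A with the matching slice of the weights B
        C += A[:, k0:k0 + tile] @ B[k0:k0 + tile, :]
    return C

A = np.random.rand(8, 64).astype(np.float32)    # input matrix A
B = np.random.rand(64, 32).astype(np.float32)   # weight matrix B
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```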
The vector calculation unit 707 can perform further processing on the output of the arithmetic circuit 703, such as vector multiplication, vector addition, exponential operations, logarithmic operations, magnitude comparison, and so on. For example, the vector calculation unit 707 can be used for network calculations of the non-convolutional/non-FC layers in the neural network, such as pooling, batch normalization, local response normalization, and so on.
In some implementations, the vector calculation unit 707 can store the processed output vector into a unified buffer 706. For example, the vector calculation unit 707 may apply a non-linear function to the output of the arithmetic circuit 703, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 707 generates normalized values, combined values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 703, for example, for use in a subsequent layer of the neural network.
The unified memory 706 is used to store input data and output data.
A direct memory access controller (DMAC) 705 transfers the input data in an external memory to the input memory 701 and/or the unified memory 706, stores the weight data in the external memory into the weight memory 702, and stores the data in the unified memory 706 into the external memory.
A bus interface unit (BIU) 710 is used to implement interaction among the host CPU, the DMAC, and the instruction fetch buffer 709 through a bus.
An instruction fetch buffer 709 connected to the controller 704 is used to store instructions used by the controller 704.
The controller 704 is used to call the instructions buffered in the instruction fetch buffer 709 to control the working process of the computing accelerator.
Generally, the unified memory 706, the input memory 701, the weight memory 702, and the instruction fetch buffer 709 are all on-chip memories, and the external memory is a memory external to the NPU. The external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
In addition, in the present application, the operations of the layers in the convolutional neural network shown in FIG. 6 may be performed by the arithmetic circuit 703 or the vector calculation unit 707.
FIG. 8 is a schematic flowchart of a method for training a neural network model provided by an embodiment of the present application.
S801. Obtain training data, where the training data include T1 frames of training images and an annotated category.
The T1 frames of training images correspond to one annotated category. T1 is a positive integer greater than 1. The T1 frames of training images may be multiple consecutive frames of images in a video, or multiple frames of images selected from a video according to a preset rule. For example, the T1 frames of training images may be multiple frames of images selected from a video at preset time intervals, or multiple frames of images spaced apart by a preset number of frames in a video.
The T1 frames of training images may include multiple characters, and the multiple characters may include only humans, only animals, or both humans and animals.
The above annotated category is used to indicate the category of the group action of the characters in the T1 frames of training images.
S802. Use a neural network to process the T1 frames of training images to obtain a training category.
The neural network performs the following processing on the T1 frames of training images:
S802a. Extract image features of the T1 frames of training images.
At least one frame of image is selected from the T1 frames of training images, and the image features of the multiple characters in each of the at least one frame of image are extracted.
In a frame of training image, the image feature of a certain character can be used to represent the body posture of that character in this frame of training image, that is, the relative positions between different limbs of the character. The above image features can be represented by vectors.
S802b. Determine the spatial features of the multiple characters in each of the at least one frame of training image.
The spatial feature of the j-th character in the i-th frame of training image of the at least one frame of training image is determined according to the similarity between the image feature of the j-th character in the i-th frame of training image and the image features of the characters other than the j-th character in the i-th frame of image, where i and j are positive integers.
The spatial feature of the j-th character in the i-th frame of training image is used to represent the association between the action of the j-th character in the i-th frame of training image and the actions of the characters other than the j-th character in the i-th frame of training image.
The similarity between the image features corresponding to different characters in the same frame of image can reflect the degree of spatial dependence between the actions of these different characters. That is, the higher the similarity between the image features corresponding to two characters, the closer the association between the actions of the two characters; conversely, the lower the similarity between the image features corresponding to two characters, the weaker the association between the actions of the two characters.
S802c. Determine the temporal feature of each of the multiple characters in the at least one frame of training image across different frames of images.
The temporal feature of the j-th character in the i-th frame of training image of the at least one frame of training image is determined according to the similarity between the image feature of the j-th character in the i-th frame of training image and the image features of the j-th character in the frames of training images other than the i-th frame of image, where i and j are positive integers.
The temporal feature of the j-th character in the i-th frame of training image is used to represent the association between the action of the j-th character in the i-th frame of training image and the actions of the j-th character in the other frames of training images of the at least one frame of training image.
The similarity between the image features corresponding to one character in two frames of images can reflect the degree of temporal dependence of that character's actions. The higher the similarity between the image features corresponding to a character in two frames of images, the closer the association between the character's actions at the two time points; conversely, the lower the similarity, the weaker the association between the character's actions at the two time points.
S802d. Determine the action features of the multiple characters in each of the at least one frame of training image.
The action feature of the j-th character in the i-th frame of training image is obtained by fusing the spatial feature of the j-th character in the i-th frame of training image, the temporal feature of the j-th character in the i-th frame of training image, and the image feature of the j-th character in the i-th frame of training image.
S802e. According to the action features of the multiple characters in each of the at least one frame of training image, recognize the group action of the multiple characters in the T1 frames of training images to obtain the training category corresponding to the group action.
The action features of each of the multiple characters in each of the at least one frame of training image may be fused to obtain a feature representation of each of the at least one frame of training image.
The average value of each bit of the feature representation of each of the T1 frames of training images may be calculated to obtain an average feature representation. Each bit of the average feature representation is the average value of the corresponding bits of the feature representations of the T1 frames of training images. Classification may then be performed according to the average feature representation, that is, the group action of the multiple characters in the T1 frames of training images is recognized to obtain the training category.
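A minimal NumPy sketch of step S802e under stated assumptions: max pooling stands in for the unspecified fusion of the characters' action features within each frame, and W_cls is a hypothetical linear classifier with an assumed number of N_Y = 8 group action categories.

```python
import numpy as np

def video_level_score(person_action_feats, W_cls):
    """person_action_feats: (T1, num_persons, D) action features from step S802d."""
    frame_repr = person_action_feats.max(axis=1)   # (T1, D): fuse the characters within each frame
    avg_repr = frame_repr.mean(axis=0)             # (D,): per-bit average over the T1 frames
    return W_cls @ avg_repr                        # (N_Y,) group action scores

feats = np.random.rand(10, 12, 256)                # 10 frames, 12 characters, 256-d action features
W_cls = np.random.rand(8, 256)                     # hypothetical linear classifier, N_Y = 8
print(video_level_score(feats, W_cls).shape)       # (8,)
```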
In order to increase the amount of training data, the training category of each of the at least one frame of training image may be determined; determining the training category of each frame of image is taken as an example for description here. The at least one frame of training image may be all or part of the T1 frames of training images.
S803. Determine the loss value of the neural network according to the training category and the annotated category.
The loss value L of the neural network can be expressed as:

$$L=-\sum_{t=1}^{T1}\sum_{i=1}^{N_Y}\tilde{y}^{i}\log\left(p_{t}^{i}\right)$$

where $N_Y$ denotes the number of group action categories, that is, the number of categories output by the neural network; $\tilde{y}$ denotes the annotated category, $\tilde{y}$ is expressed by one-hot encoding and includes $N_Y$ bits, and $\tilde{y}^{i}$ denotes one of these bits; $p_t$ denotes the training category of the t-th frame image among the T1 frames of images, $p_t$ likewise includes $N_Y$ bits, and $p_{t}^{i}$ denotes one of these bits. The t-th frame image can also be understood as the image at time t.
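A small NumPy sketch of the loss above, assuming the classification cross-entropy form reconstructed here and summing over the T1 frames without normalization; the per-frame predictions and the one-hot annotation are toy values.

```python
import numpy as np

def group_action_loss(pred, label_onehot, eps=1e-12):
    """L = -sum_t sum_i  y_tilde^i * log(p_t^i), summed over the T1 frames and N_Y categories."""
    # pred: (T1, N_Y) per-frame training categories p_t; label_onehot: (N_Y,) annotation y_tilde
    return -np.sum(label_onehot[None, :] * np.log(pred + eps))

pred = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1]])    # T1 = 2 frames, N_Y = 3 categories
label = np.array([1.0, 0.0, 0.0])     # one-hot annotated category
print(group_action_loss(pred, label)) # approximately 0.867
```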
S804. Adjust the neural network through back propagation according to the loss value.
In the above training process, the training data generally include multiple combinations of training images and annotated categories, and each combination of training images and an annotated category may include one or more frames of training images and a unique annotated category corresponding to the one or more frames of training images.
In the process of training the above neural network, an initial set of model parameters can be set for the neural network, and the model parameters of the neural network are then gradually adjusted according to the difference between the training category and the annotated category, until the difference between the training category and the annotated category is within a certain preset range, or until the number of training iterations reaches a preset number; the model parameters of the neural network at that point are determined as the final parameters of the neural network model, thereby completing the training of the neural network.
FIG. 9 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
S901. Extract image features of a to-be-processed image.
The above to-be-processed image includes multiple characters, and the image features of the to-be-processed image include the image features of each of the multiple characters in each of multiple frames of the to-be-processed image.
Before step S901, the to-be-processed image may be obtained. The to-be-processed image may be obtained from a memory, or the to-be-processed image may be received.
For example, when the image recognition method shown in FIG. 9 is executed by an image recognition apparatus, the above to-be-processed image may be an image obtained from the image recognition apparatus, or the above to-be-processed image may be an image received by the image recognition apparatus from another device, or the above to-be-processed image may be captured by a camera of the image recognition apparatus.
The above to-be-processed image may be multiple consecutive frames of images in a video, or multiple frames of images selected from a video according to a preset rule. For example, multiple frames of images may be selected from a video at preset time intervals, or multiple frames of images may be selected from a video at intervals of a preset number of frames.
It should be understood that the multiple characters in the above to-be-processed image may include only humans, only animals, or both humans and animals.
In a frame of image, the image feature of a certain character can be used to represent the body posture of that character in this frame of image, that is, the relative positions between different limbs of the character. The image feature of a certain character can be represented by a vector, which may be called an image feature vector. The above image feature extraction may be performed by a CNN.
Optionally, when extracting the image features of the to-be-processed image, the characters in the image may be recognized so as to determine the bounding boxes of the characters, where the image in each bounding box corresponds to one character, and feature extraction is then performed on the image of each bounding box to obtain the image feature of each character.
Since the image inside a bounding box includes a considerable amount of redundant information that is unrelated to the character's action, the influence of the redundant information can be reduced by recognizing the skeleton nodes of the character in each bounding box, so as to improve the accuracy of the image feature vector.
Optionally, the skeleton nodes of the character in the bounding box corresponding to each character may be recognized first, and the image feature vector of the character may then be extracted according to the skeleton nodes of the character, so that the extracted image features reflect the character's action more accurately and the accuracy of the extracted image features is improved.
Further, the skeleton nodes in the bounding box may be connected according to the character's structure to obtain a connected image, and image feature vector extraction may then be performed on the connected image.
Alternatively, the regions where the skeleton nodes are located and the regions outside the regions where the skeleton nodes are located may be displayed in different colors to obtain a processed image, and image feature extraction may then be performed on the processed image.
Further, a locally visible image corresponding to the bounding box may be determined according to the image regions where the skeleton nodes of the character are located, and feature extraction may then be performed on the locally visible image to obtain the image features of the to-be-processed image.
The above locally visible image is an image composed of the regions where the skeleton nodes of the characters in the to-be-processed image are located. Specifically, the regions in the bounding box outside the regions where the skeleton nodes of the character are located may be masked to obtain the locally visible image.
When masking the regions outside the regions where the skeleton nodes are located, the color of the pixels corresponding to those regions may be set to a certain preset color, such as black. That is to say, the regions where the skeleton nodes are located retain the same information as the original image, while the information of the regions outside the regions where the skeleton nodes are located is masked. Therefore, when extracting image features, only the image features of the above locally visible image need to be extracted, and no extraction operation needs to be performed on the masked regions.
The region where a skeleton node is located may be a square, a circle, or another shape centered on the skeleton node. The side length (or radius), area, and the like of the region where the skeleton node is located may be preset values.
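A sketch of the masking described above, assuming square regions around the skeleton nodes and black as the preset color; mask_outside_keypoints, half_size, and the keypoint coordinates are hypothetical names and values used only for illustration.

```python
import numpy as np

def mask_outside_keypoints(crop, keypoints, half_size=8, fill=0):
    """Keep square regions centered on each skeleton node; mask everything else with a preset color."""
    h, w, _ = crop.shape
    keep = np.zeros((h, w), dtype=bool)
    for x, y in keypoints:                            # keypoints: pixel coordinates of skeleton nodes
        x0, x1 = max(0, x - half_size), min(w, x + half_size)
        y0, y1 = max(0, y - half_size), min(h, y + half_size)
        keep[y0:y1, x0:x1] = True
    out = np.full_like(crop, fill)                    # masked area set to a preset color (black here)
    out[keep] = crop[keep]                            # regions around skeleton nodes keep the original pixels
    return out

crop = np.random.randint(0, 256, (128, 64, 3), dtype=np.uint8)   # image inside one bounding box
visible = mask_outside_keypoints(crop, [(32, 20), (32, 60), (20, 100)])
```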
In the above method of extracting the image features of the to-be-processed image, feature extraction may be performed on the locally visible image to obtain the image feature vector of the character corresponding to the bounding box; alternatively, a masking matrix may be determined according to the skeleton nodes, and the image may be masked according to the masking matrix. For details, refer to the descriptions of FIG. 11 and FIG. 12.
When multiple frames of images are obtained, different characters in the images can be determined through target tracking. For example, the characters in the images can be distinguished according to the sub-features of the characters in the images. The sub-features may be color, edges, motion information, texture information, and so on.
S902. Determine the spatial feature of each of the multiple characters in each of the multiple frames of images.
The spatial association between the actions of different characters in a frame of image is determined through the similarity between the image features of the different characters in the same frame of image.
The spatial feature of the j-th character in the i-th frame of the to-be-processed images may be determined according to the similarity between the image feature of the j-th character in the i-th frame of image and the image features of the characters other than the j-th character in the i-th frame of image, where i and j are positive integers.
It should be understood that the spatial feature of the j-th character in the i-th frame of image is used to represent the association between the action of the j-th character in the i-th frame of image and the actions of the characters other than the j-th character in the i-th frame of image.
Specifically, the similarity between the image feature vector of the j-th character in the i-th frame of image and the image feature vectors of the characters other than the j-th character can reflect the degree to which the j-th character in the i-th frame of image depends on the actions of the characters other than the j-th character. That is, the higher the similarity between the image feature vectors corresponding to two characters, the closer the association between the actions of the two characters; conversely, the lower the similarity between the image feature vectors corresponding to two characters, the weaker the association between the actions of the two characters. For the spatial association between the actions of different characters in a frame of image, refer to the descriptions of FIG. 14 and FIG. 15.
S903. Determine the temporal feature of each of the multiple characters in each of the multiple frames of images.
The temporal association between the actions of the same character at different moments is determined through the similarity between the image feature vectors of the different actions of that character in different frames of images.
The temporal feature of the j-th character in the i-th frame of the to-be-processed images may be determined according to the similarity between the image feature of the j-th character in the i-th frame of image and the image features of the j-th character in the frames of images other than the i-th frame of image, where i and j are positive integers.
The temporal feature of the j-th character in the i-th frame of image is used to represent the association between the action of the j-th character in the i-th frame of image and the actions of the j-th character in the frames of images other than the i-th frame of image.
The similarity between the image features corresponding to one character in two frames of images can reflect the degree of temporal dependence of that character's actions. The higher the similarity between the image features corresponding to a character in two frames of images, the closer the association between the character's actions at the two time points; conversely, the lower the similarity, the weaker the association between the character's actions at the two time points. For the temporal association between the actions of one character, refer to the descriptions of FIG. 16 and FIG. 17.
The above processes all involve the similarity between features, and the similarity can be obtained in different ways. For example, the similarity between the above features can be calculated by methods such as the Minkowski distance (for example, the Euclidean distance or the Manhattan distance), the cosine similarity, the Chebyshev distance, or the Hamming distance.
Optionally, the similarity can be calculated by computing the sum of the products of the corresponding bits of the two features after a linear transformation.
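A sketch of this similarity computation, together with one possible way of turning the similarities into a spatial feature; the linear maps W_a and W_b and the softmax-normalized weighted sum over the other characters are assumptions made for illustration and are not specified in this passage.

```python
import numpy as np

def pairwise_similarity(feats, W_a, W_b):
    """sim(j, k) = sum over bits of (W_a f_j) * (W_b f_k), i.e. a dot product after linear maps."""
    qa = feats @ W_a.T                        # linearly transformed features
    qb = feats @ W_b.T
    return qa @ qb.T                          # (num_characters, num_characters) similarity matrix

def spatial_features(feats, W_a, W_b):
    sim = pairwise_similarity(feats, W_a, W_b)
    np.fill_diagonal(sim, -np.inf)            # exclude the character itself
    weights = np.exp(sim - sim.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feats                    # weighted sum of the other characters' features

feats = np.random.rand(6, 128)                # image features of 6 characters in one frame
W_a = np.random.rand(64, 128); W_b = np.random.rand(64, 128)
print(spatial_features(feats, W_a, W_b).shape)   # (6, 128)
```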
Both the spatial association between the actions of different characters and the temporal association between the actions of the same character can provide important clues for the category of a multi-person scene in an image. Therefore, in the image recognition process of the present application, by comprehensively considering the spatial association between the actions of different characters and the temporal association between the actions of the same character, the accuracy of recognition can be effectively improved.
S904. Determine the action feature of each of the multiple characters in each of the multiple frames of images.
Optionally, when determining the action feature of a certain character in a certain frame of image, the temporal feature, the spatial feature, and the image feature corresponding to that character in that frame of image may be fused to obtain the action feature of that character in that frame of image.
For example, the spatial feature of the j-th character in the i-th frame of the to-be-processed images, the temporal feature of the j-th character in the i-th frame of image, and the image feature of the j-th character in the i-th frame of image may be fused to obtain the action feature of the j-th character in the i-th frame of image.
When fusing the above temporal features, spatial features, and image features, different fusion manners may be adopted. The fusion manners are illustrated with examples below, and a code sketch covering all three manners follows the third manner.
Manner 1: fusion by combination.
The features to be fused may be added directly, or added with weights.
It should be understood that weighted addition means that the features to be fused are multiplied by certain coefficients, namely weight values, and then added.
That is to say, with the combination manner, a channel-wise linear combination can be performed.
Multiple features output by multiple layers of a feature extraction network can be added; for example, they can be added directly, or they can be added according to certain weights. Let T1 and T2 respectively denote the features output by two layers of the feature extraction network, and let T3 denote the fused feature; then T3 = a × T1 + b × T2, where a and b are respectively the coefficients, namely weight values, by which T1 and T2 are multiplied when calculating T3, with a ≠ 0 and b ≠ 0.
Manner 2: fusion by concatenation and channel fusion.
Concatenation and channel fusion is another fusion manner. With concatenation and channel fusion, the dimensions of the features to be fused can be spliced directly, or spliced after the features are multiplied by certain coefficients, namely weight values.
Manner 3: using a pooling layer to process the above features, so as to realize the fusion of the above features.
Maximum pooling may be performed on multiple feature vectors to determine a target feature vector. In the target feature vector obtained by maximum pooling, each bit is the maximum value of the corresponding bits of the multiple feature vectors. Average pooling may also be performed on multiple feature vectors to determine the target feature vector. In the target feature vector obtained by average pooling, each bit is the average value of the corresponding bits of the multiple feature vectors.
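A small NumPy sketch of the three fusion manners above; here T1 and T2 denote two feature vectors to be fused (not the frame count T1 used elsewhere), and the weight values are arbitrary.

```python
import numpy as np

T1 = np.random.rand(256)   # e.g. a character's image feature
T2 = np.random.rand(256)   # e.g. the same character's spatial or temporal feature

# Manner 1: channel-wise linear combination, T3 = a*T1 + b*T2 (a = b = 1 gives direct addition)
a, b = 0.6, 0.4
combined = a * T1 + b * T2                        # (256,)

# Manner 2: concatenation along the feature dimension (optionally after weighting)
concatenated = np.concatenate([a * T1, b * T2])   # (512,)

# Manner 3: pooling across the feature vectors to be fused
stacked = np.stack([T1, T2])                      # (2, 256)
max_pooled = stacked.max(axis=0)                  # each bit is the maximum of the corresponding bits
avg_pooled = stacked.mean(axis=0)                 # each bit is the average of the corresponding bits
```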
Optionally, the features corresponding to one character in a frame of image may be fused in the combination manner to obtain the action feature of that character in that frame of image.
When multiple frames of images are obtained, the feature vector group corresponding to at least one character in the i-th frame of image may further include the temporal feature vector corresponding to the at least one character in the i-th frame of image.
S905. Recognize the group action of the multiple characters in the to-be-processed images according to the action feature of each of the multiple characters in each of the multiple frames of images.
It should be understood that a group action is composed of the actions of several characters in a group, that is, an action completed jointly by multiple characters.
Optionally, the group action of the multiple characters in the above to-be-processed images may be a certain sport or activity; for example, the group action of the multiple characters in the above to-be-processed images may be playing basketball, playing volleyball, playing football, dancing, and so on.
In one implementation, the action feature of each frame of image may be determined according to the action features of each of the multiple characters in each frame of the to-be-processed images. The group action of the multiple characters in the to-be-processed images may then be recognized according to the action feature of each frame of image.
Optionally, the action features of the multiple characters in a frame of image may be fused by means of maximum pooling to obtain the action feature of that frame of image.
Optionally, the action features of the multiple characters in each frame of image may be fused to obtain the action feature of that frame of image, the action feature of each frame of image may then be input into a classification module to obtain the action classification result of each frame of image, and the classification result that corresponds to the largest number of images among the to-be-processed images in the output categories of the classification module may be taken as the group action of the multiple characters in the to-be-processed images.
Optionally, the action features of the multiple characters in each frame of image may be fused to obtain the action feature of that frame of image, the obtained action features of the frames of images may then be averaged to obtain the average action feature of the frames of images, the average action feature may then be input into the classification module, and the classification result corresponding to the average action feature may be taken as the group action of the multiple characters in the to-be-processed images.
可选地,可以从待处理图像中选择一帧图像,将该帧图像中根据多个人物的动作特征融合得到的该帧图像的动作特征输入分类模块,以得到对该帧图像的分类结果,进而将对该帧图像的分类结果作为待处理图像中的多个人物的群体动作。Optionally, a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
在另一种实现方式中,可以对待处理图像中的多个人物中每个人物在每帧图像中的动作特征进行分类,得到每个人物的动作,并据此确定多个人物的群体动作。In another implementation manner, it is possible to classify the action characteristics of each of the multiple characters in the image to be processed in each frame of the image to obtain the actions of each character, and determine the group actions of the multiple characters accordingly.
可选地,可以将处理图像中的多个人物中每个人物在每帧图像中的动作特征输入分类模块,以得到对上述多个人物中每个人物动作特征的分类结果,即每个人物的动作,进而将对应的人物数量最多的动作作为多个人物的群体动作。Optionally, the action characteristics of each of the multiple characters in the processed image in each frame of the image can be input into the classification module to obtain a classification result of the action characteristics of each of the multiple characters, that is, each character Then, the action with the largest number of characters is regarded as a group action of multiple characters.
可选地,可以从多个人物中选择某一人物,将该人物在每帧图像中的动作特征输入分类模块,以得到对该人物动作特征的分类结果,即该人物的动作,进而将上述得到的该人物的动作作为待处理图像中的多个人物的群体动作。Optionally, a certain person can be selected from a plurality of people, and the action characteristics of the person in each frame of the image can be input into the classification module to obtain the classification result of the action characteristics of the person, that is, the action of the person, and then the above The obtained action of the character is used as a group action of multiple characters in the image to be processed.
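The following is a minimal sketch of the voting step described in the foregoing optional manners: the per-frame (or per-person) classification results are counted, and the category with the most votes is taken as the group action. The class labels are hypothetical and stand in for the output of the classification module.

```python
from collections import Counter

# Hypothetical per-frame classification results output by the classification module.
frame_predictions = ["volleyball", "volleyball", "basketball", "volleyball", "dancing"]

# The category assigned to the largest number of frames is taken as the group action.
group_action, votes = Counter(frame_predictions).most_common(1)[0]
print(group_action, votes)   # volleyball 3
```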
Steps S901 to S904 can be implemented by the neural network model obtained through the training shown in FIG. 8.
It should be understood that there is no order limitation on the foregoing steps. For example, the temporal features may alternatively be determined first, and the spatial features determined afterwards. Details are not described again herein.
When determining the group action of a plurality of persons, the method shown in FIG. 9 considers not only the temporal features of the plurality of persons but also their spatial features. By combining the temporal features and the spatial features of the plurality of persons, the group action of the plurality of persons can be determined better and more accurately.
Optionally, in the method shown in FIG. 9, after the group action of the plurality of persons in the image to be processed is recognized, label information of the image to be processed is generated according to the group action, where the label information is used to indicate the group action of the plurality of persons in the image to be processed.
The foregoing manner can be used, for example, to classify a video library, where the videos in the video library are tagged according to their corresponding group actions, so that users can conveniently view and search for them.
Optionally, in the method shown in FIG. 9, after the group action of the plurality of persons in the image to be processed is recognized, a key person in the image to be processed is determined according to the group action.
Optionally, in the process of determining the key person, the contribution of each of the plurality of persons in the image to be processed to the group action may be determined first, and the person with the highest contribution is then determined as the key person.
It should be understood that the contribution of the key person to the group action of the plurality of persons is greater than the contributions of the other persons among the plurality of persons.
The foregoing manner can be used, for example, to detect a key person in a video. Usually, a video contains several persons, most of whom are not important. Effectively detecting the key person helps to understand the video content more quickly and accurately based on the information around the key person.
For example, if a video shows a ball game, the player holding the ball has the greatest influence on everyone present, including the players, referees and spectators, and also contributes the most to the group action. Therefore, the player holding the ball can be determined as the key person. Identifying the key person helps viewers of the video understand what is happening and what is about to happen in the game.
FIG. 10 is a schematic flowchart of an image recognition method according to an embodiment of this application.
S1001: Extract image features of the image to be processed.
The image to be processed includes at least one frame of image, and the image features of the image to be processed include image features of a plurality of persons in the image to be processed.
Before step S1001, the image to be processed can be obtained. The image to be processed may be obtained from a memory, or the image to be processed may be received.
For example, when the image recognition method shown in FIG. 10 is performed by an image recognition apparatus, the image to be processed may be an image obtained from the image recognition apparatus, an image received by the image recognition apparatus from another device, or an image captured by a camera of the image recognition apparatus.
It should be understood that the image to be processed may be one frame of image or a plurality of frames of images.
When the image to be processed includes a plurality of frames, the frames may be consecutive frames in a video, or frames selected from a video according to a preset rule. For example, a plurality of frames of images may be selected from a video at a preset time interval, or at a preset frame-number interval.
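A minimal sketch of the frame-selection rules mentioned above is given below; the function name and parameters are illustrative assumptions, not part of the embodiments.

```python
def sample_frames(num_frames, fps, time_interval=None, frame_stride=None):
    """Return the indices of the frames selected from a video.

    Either a preset time interval (in seconds) or a preset frame-number
    interval can be used; both parameter names are illustrative assumptions.
    """
    if time_interval is not None:
        frame_stride = max(1, int(round(time_interval * fps)))
    return list(range(0, num_frames, frame_stride))

print(sample_frames(num_frames=300, fps=25, time_interval=1.0))  # every 25th frame
print(sample_frames(num_frames=300, fps=25, frame_stride=30))    # every 30th frame
```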
The image to be processed may include a plurality of persons, and the plurality of persons may include only humans, only animals, or both humans and animals.
Optionally, the method shown in step S901 in FIG. 9 may be used to extract the image features of the image to be processed.
S1002: Determine the spatial feature of each of the plurality of persons in each frame of the image to be processed.
The spatial feature of a person among the plurality of persons in a frame of the image to be processed is determined according to the similarity between the image feature of the person in the frame and the image features of the other persons in the frame.
Optionally, the method shown in step S902 in FIG. 9 may be used to determine the spatial features of the plurality of persons in each frame of the image to be processed.
S1003: Determine the action feature of each of the plurality of persons in each frame of the image to be processed.
The action feature of a person among the plurality of persons in a frame of the image to be processed is obtained by fusing the spatial feature of the person in the frame with the image feature of the person in the frame.
Optionally, the fusion method shown in step S904 in FIG. 9 may be used to determine the action features of the plurality of persons in each frame of the image to be processed.
S1004: Recognize the group action of the plurality of persons in the image to be processed according to the action features of the plurality of persons in each frame of the image to be processed.
Optionally, the method shown in step S905 in FIG. 9 may be used to recognize the group action of the plurality of persons in the image to be processed.
In the method shown in FIG. 10, the temporal feature of each person does not need to be calculated. When the determination of the spatial features of the persons does not depend on their temporal features, the group action of the plurality of persons can be determined more conveniently. For another example, when only one frame of image is recognized, there are no temporal features of the same person at different times, and this method is also more suitable for that case.
FIG. 11 is a schematic flowchart of an image recognition method according to an embodiment of this application.
S1101: Extract image features of the image to be processed.
The image to be processed includes a plurality of frames of images, and the image features of the image to be processed include image features of a plurality of persons in each of at least one frame of image selected from the plurality of frames of images.
Optionally, feature extraction may be performed on the images corresponding to the plurality of persons in the input frames of images.
In one frame of image, the image feature of a person can be used to represent the body posture of the person in the frame, that is, the relative positions of the person's limbs. The image feature of a person can be represented by a vector, which may be called an image feature vector. The image feature extraction can be performed by a CNN.
Optionally, when the image features of the image to be processed are extracted, target tracking may be performed on each person to determine the bounding box of each person in each frame of image, where the image in each bounding box corresponds to one person; feature extraction is then performed on the image in each bounding box to obtain the image feature of each person.
The image in a bounding box contains a large amount of redundant information that is unrelated to the person's action. To improve the accuracy of the image feature vector, the influence of the redundant information can be reduced by identifying the skeleton nodes of the person in each bounding box.
Optionally, the skeleton nodes in the bounding box may be connected according to the structure of the person to obtain a connected image, and image feature vector extraction is then performed on the connected image. Alternatively, the region where the skeleton nodes are located and the region outside it may be displayed in different colors, and image feature extraction is then performed on the processed image.
Further, a partially visible image corresponding to the bounding box may be determined according to the image region where the person's skeleton nodes are located, and feature extraction is then performed on the partially visible image to obtain the image features of the image to be processed.
The partially visible image is an image composed of the regions where the skeleton nodes of the person in the image to be processed are located. Specifically, the region outside the region where the person's skeleton nodes are located in the bounding box may be masked to obtain the partially visible image.
When the region outside the region where the skeleton nodes are located is masked, the color of the pixels in that region can be set to a preset color, such as black. In other words, the region where the skeleton nodes are located retains the same information as the original image, while the information of the region outside it is masked. Therefore, when extracting image features, only the image features of the partially visible image need to be extracted, and no extraction operation needs to be performed on the masked region.
The region where a skeleton node is located may be a square, a circle, or another shape centered on the skeleton node. The side length (or radius), area, and the like of the region can be preset values.
In the foregoing method for extracting the image features of the image to be processed, features may be extracted from the partially visible image to obtain the image feature vector of the person corresponding to the bounding box; alternatively, a mask matrix may be determined according to the skeleton nodes, and the image is masked according to the mask matrix.
The method for determining the mask matrix according to the skeleton nodes is described below with a specific example.
S1101a) Determine the bounding box of each person in advance.
At time t, the image of the k-th person within the bounding box is denoted as $I_t^k$.
S1101b) Extract the skeleton nodes of each person in advance.
At time t, the skeleton nodes of the k-th person are extracted and denoted as $J_t^k$.
S1101c) Calculate the mask matrix for the person's action.
The action mask matrix $M_t^k$ of the person can be computed from the person image $I_t^k$ and the skeleton nodes $J_t^k$. Each entry of the mask matrix $M_t^k$ corresponds to one pixel.
Optionally, in the mask matrix $M_t^k$, the values within the square regions of side length l centered on the skeleton points are set to 1, and the values at the other positions are set to 0. The mask matrix $M_t^k$ is calculated as follows:
$$M_t^k(p)=\begin{cases}1, & \text{if pixel } p \text{ lies in a square of side length } l \text{ centered on a skeleton node in } J_t^k\\ 0, & \text{otherwise}\end{cases}$$
In the RGB color mode, the RGB model assigns an intensity value in the range of 0 to 255 to each RGB component of every pixel in the image. When the RGB color mode is used, the calculation formula of the mask matrix $M_t^k$ can be expressed correspondingly, with the mask applied to each of the three color channels of a pixel.
The original person action image $I_t^k$ is masked by the matrix $M_t^k$ to obtain the partially visible image $\tilde{I}_t^k$:
$$\tilde{I}_t^k = M_t^k \circ I_t^k$$
Each element of $\tilde{I}_t^k$ can represent one pixel, and the RGB components of each pixel in $\tilde{I}_t^k$ take values between 0 and 1. The operator "$\circ$" indicates that each element of $M_t^k$ is multiplied by the corresponding element of $I_t^k$.
FIG. 12 is a schematic diagram of a process of obtaining a partially visible image according to an embodiment of this application. As shown in FIG. 12, the image $I_t^k$ is masked. Specifically, a region of side length l around each skeleton node in $J_t^k$ is retained, and the other regions are masked.
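The following sketch illustrates the masking operation described above under the stated assumptions: a 0/1 mask is built from square regions of side length l centered on the skeleton nodes, and the bounding-box image (assumed to be scaled to [0, 1]) is multiplied element-wise by the mask. The function name, default side length and coordinate convention are illustrative choices.

```python
import numpy as np

def partially_visible(image, skeleton_nodes, side=11):
    """Mask everything outside side x side squares centered on the skeleton nodes.

    image:          H x W x 3 array with pixel values already scaled to [0, 1]
    skeleton_nodes: list of (row, col) coordinates of the person's skeleton nodes
    side:           square side length l (the default value is an assumption)
    """
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=image.dtype)
    half = side // 2
    for r, c in skeleton_nodes:
        r0, r1 = max(0, r - half), min(h, r + half + 1)
        c0, c1 = max(0, c - half), min(w, c + half + 1)
        mask[r0:r1, c0:c1] = 1.0
    # Element-wise product (the "o" operator): masked regions become black.
    return image * mask[:, :, None]

# Example: a 64 x 48 person crop with two skeleton nodes.
crop = np.random.rand(64, 48, 3)
visible = partially_visible(crop, [(10, 20), (40, 25)], side=9)
```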
Assume that the number of persons in the T frames of images is the same, that is, each of the T frames includes images of K persons. An image feature $x_t^k$ is extracted from the partially visible image $\tilde{I}_t^k$ corresponding to each of the K persons in each of the T frames. The image feature can be represented by a D-dimensional vector, that is, $x_t^k\in\mathbb{R}^{D}$. The image feature extraction for the T frames of images can be performed by a CNN.
The set of image features of the K persons in the T frames of images can be denoted as $X\in\mathbb{R}^{T\times K\times D}$, where $X=\{x_t^k \mid t=1,\dots,T;\ k=1,\dots,K\}$. For each person, extracting image features from the partially visible image $\tilde{I}_t^k$ reduces the redundant information in the bounding box, extracts image features based on the structure information of the body, and enhances the ability of the image features to represent the person's actions.
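A possible sketch of the CNN-based feature extraction over the T×K partially visible person images is shown below; ResNet-18 is used here only as an example backbone, and the input sizes are assumed.

```python
import torch
import torchvision.models as models

T, K = 10, 12                             # assumed numbers of frames and persons
crops = torch.randn(T * K, 3, 224, 224)   # stacked partially visible person images

backbone = models.resnet18()              # any CNN backbone could play this role
backbone.fc = torch.nn.Identity()         # keep the pooled D = 512 dimensional feature
backbone.eval()

with torch.no_grad():
    feats = backbone(crops)               # shape (T*K, 512)
X = feats.view(T, K, -1)                  # the set X in R^{T x K x D}
```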
S1102: Determine the dependency between the actions of different persons in the image to be processed, and the dependency between the actions of the same person at different moments.
In this step, a cross interaction module (CIM) is used to determine the spatial correlation between the actions of different persons in the image to be processed, and the temporal correlation between the actions of the same person at different times.
The cross interaction module is used to implement feature interaction and establish a feature interaction model, where the feature interaction model is used to represent the temporal and/or spatial association of the persons' body postures.
The spatial correlation of a person's body posture can be reflected through spatial dependence. Spatial dependence is used to express the dependence of a person's body posture in a frame of image on the body postures of the other persons in the frame, that is, the spatial dependence between persons' actions. The spatial dependence can be expressed by a spatial feature vector.
For example, if one frame of the image to be processed corresponds to time t, the spatial feature vector $s_t^k$ of the k-th person at time t can be expressed as:
$$s_t^k=\frac{1}{K}\sum_{k'=1}^{K} r\!\left(x_t^k, x_t^{k'}\right)\, g\!\left(x_t^{k'}\right)$$
Here, K indicates that there are K persons in the corresponding frame of image at time t, $x_t^k$ denotes the image feature of the k-th person among the K persons at time t, and $x_t^{k'}$ denotes the image feature of the k'-th person among the K persons at time t. The function $r(a,b)=\theta(a)^{\mathsf T}\phi(b)$ is used to calculate the similarity between feature a and feature b, and $\theta(\cdot)$, $\phi(\cdot)$ and $g(\cdot)$ denote three linear embedding functions, which may be the same or different. $r(a,b)$ can reflect the dependence of feature b on feature a.
By calculating the similarity between the image features of different persons in the same frame of image, the spatial dependence between the body postures of the different persons in that frame can be determined.
The temporal correlation of a person's body posture can be reflected through time dependence. Time dependence, which can also be called temporal dependence, is used to express the dependence of a person's body posture in a frame of image on the person's body posture in the other frames, that is, the inherent temporal dependence of a person's action. The time dependence can be expressed by a temporal feature vector.
For example, if one frame of the image to be processed corresponds to time t, the temporal feature vector $q_t^k$ of the k-th person at time t can be expressed as:
$$q_t^k=\frac{1}{T}\sum_{t'=1}^{T} r\!\left(x_t^k, x_{t'}^{k}\right)\, g\!\left(x_{t'}^{k}\right)$$
Here, T indicates that the image to be processed contains images at T moments, that is, the image to be processed includes T frames of images; $x_t^k$ denotes the image feature of the k-th person at time t, and $x_{t'}^{k}$ denotes the image feature of the k-th person at time t'.
By calculating the similarity between the image features of the same person at different times, the time dependence between the body postures of the person at different times can be determined.
The spatio-temporal feature vector $h_t^k$ of the k-th person at time t can be computed from the spatial feature vector $s_t^k$ and the temporal feature vector $q_t^k$ of the k-th person at time t in the image to be processed. The spatio-temporal feature vector $h_t^k$ can be used to represent the "spatio-temporal" association information of the k-th person. The spatio-temporal feature vector $h_t^k$ can be expressed as the result of an "addition" operation, denoted $\oplus$, on the temporal feature vector $q_t^k$ and the spatial feature vector $s_t^k$:
$$h_t^k = q_t^k \oplus s_t^k$$
FIG. 13 is a schematic diagram of a method for calculating the similarity between image features according to an embodiment of this application. As shown in FIG. 13, the vector representation of the similarity between the image feature $x_t^k$ of the k-th person at time t and the image features of the other persons at time t, and the vector representation of the similarity between the image feature $x_t^k$ of the k-th person at time t and the image features of the k-th person at the other moments, are averaged (Avg) to determine the spatio-temporal feature vector $h_t^k$ of the k-th person at time t. The set of spatio-temporal feature vectors of the K persons in the T frames of images can be denoted as $H\in\mathbb{R}^{T\times K\times D}$, where $H=\{h_t^k \mid t=1,\dots,T;\ k=1,\dots,K\}$.
S1103: Fuse the image features with the spatio-temporal feature vectors to obtain the action feature of each frame of image.
The image features in the set $X\in\mathbb{R}^{T\times K\times D}$ of image features of the K persons in the images at the T moments are fused with the spatio-temporal feature vectors in the set $H\in\mathbb{R}^{T\times K\times D}$ of spatio-temporal feature vectors of the K persons, to obtain the action feature of each of the images at the T moments. The action feature of each frame of image can be represented by an action feature vector.
The image feature $x_t^k$ of the k-th person at time t can be fused with the spatio-temporal feature vector $h_t^k$ to obtain the person feature vector $b_t^k$ of the k-th person at time t. The image feature $x_t^k$ and the spatio-temporal feature vector $h_t^k$ can be combined through a residual connection to obtain the person feature vector $b_t^k$:
$$b_t^k = x_t^k + h_t^k$$
According to the person feature vector $b_t^k$ of each of the K persons, the set $B_t\in\mathbb{R}^{T\times K\times D}$ of the person feature vectors of the K persons at time t can be expressed as:
$$B_t=\left\{\, b_t^k \mid k=1,2,\dots,K \,\right\}$$
Maximum pooling is performed on the set $B_t$ of person feature vectors to obtain the action feature vector $z_t$, where each element of the action feature vector $z_t$ is the maximum value of the corresponding element over the person feature vectors in $B_t$.
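A minimal sketch of the residual fusion and per-frame maximum pooling described above is given below; the feature values are random stand-ins for the outputs of the earlier steps.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, D = 10, 12, 512
X = rng.standard_normal((T, K, D))   # image features x_t^k (stand-in values)
H = rng.standard_normal((T, K, D))   # spatio-temporal features h_t^k (stand-in values)

B = X + H                            # residual connection: b_t^k = x_t^k + h_t^k
Z = B.max(axis=1)                    # max pooling over the K persons -> z_t, shape (T, D)
```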
S1104: Perform classification prediction on the action feature of each frame of image to determine the group action in the image to be processed.
The classification module can be a softmax classifier. The classification result of the classification module can use one-hot coding, that is, only one bit of the output is valid. In other words, the category corresponding to the classification result of any image feature vector is a single category among the output categories of the classification module.
The action feature vector $z_t$ of one frame of image at time t can be input into the classification module to obtain the classification result of the frame of image. The classification result of the classification module for $z_t$ at any time t can be taken as the classification result of the group action in the T frames of images. The classification result of the group action in the T frames of images can also be understood as the classification result of the group action of the persons in the T frames of images, or the classification result of the T frames of images.
The action feature vectors $z_1, z_2, \dots, z_T$ of the T frames of images can be input into the classification module respectively, to obtain the classification result of each frame of image. The classification results of the T frames of images may belong to one or more categories. The category that corresponds to the largest number of the T frames of images among the output categories of the classification module can be taken as the classification result of the group action in the T frames of images.
The action feature vectors $z_1, z_2, \dots, z_T$ of the T frames of images can also be averaged to obtain an average action feature vector $\bar{z}$, where each element of the average action feature vector $\bar{z}$ is the average of the corresponding element over $z_1, z_2, \dots, z_T$. The average action feature vector $\bar{z}$ can be input into the classification module to obtain the classification result of the group action in the T frames of images.
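The averaging-then-classification variant described above can be sketched as follows; the number of classes and the linear classifier weights are illustrative stand-ins for the trained classification module.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, num_classes = 10, 512, 8            # assumed sizes and group-action class count
Z = rng.standard_normal((T, D))           # per-frame action feature vectors z_1 ... z_T

z_bar = Z.mean(axis=0)                    # average action feature vector
W_cls = rng.standard_normal((D, num_classes)) / np.sqrt(D)  # stand-in linear classifier
logits = z_bar @ W_cls
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax over the group-action categories
predicted_class = int(np.argmax(probs))   # index of the predicted group action
```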
The foregoing method can complete the complex inference process of group action recognition: image features of a plurality of frames of images are extracted; the temporal features and spatial features are determined according to the interdependence between the actions of different persons in the images and between the actions of the same person at different moments; the temporal features, spatial features and image features are then fused to obtain the action feature of each frame of image; and the group action in the plurality of frames of images is inferred by classifying the action feature of each frame of image.
In the embodiments of this application, when the group action of a plurality of persons is determined, not only the temporal features of the plurality of persons but also their spatial features are considered. By combining the temporal features and spatial features of the plurality of persons, the group action of the plurality of persons can be determined better and more accurately.
For the case where the temporal features do not need to be considered, that is, where the spatial features do not depend on the temporal features, the embodiments of this application may also perform recognition by considering only the spatial features of the plurality of persons when determining their group action, so that the group action of the plurality of persons can be determined more conveniently.
Experiments on popular benchmark data sets demonstrate the effectiveness of the image recognition method provided in the embodiments of this application.
Using the trained neural network for image recognition can accurately recognize group actions. Table 1 shows the recognition accuracy obtained by recognizing public data sets with the trained neural network model and the image recognition method provided in the embodiments of this application. Data including group actions in the public data sets is input into the trained neural network. The multi-class accuracy (MCA) represents the proportion of correctly classified results among the classification results of the neural network for the data including group actions. The mean per class accuracy (MPCA) represents the average, over all classes, of the proportion of correctly classified results of each class to the amount of data of that class in the data including group actions.
Table 1
In the neural network training process of this application, the training of the neural network can be completed without relying on per-person action labels.
An end-to-end training manner is adopted in the training process, that is, the neural network is adjusted only according to the final classification result.
With two simple neural networks, the convolutional neural network AlexNet and the residual network ResNet-18, training is performed by using the neural network training method provided in the embodiments of this application, and group action recognition is performed by using the image recognition method provided in the embodiments of this application. The accuracy in terms of MCA and MPCA is high, and good results can be achieved with both networks.
Feature interaction means determining the dependency between persons and the temporal dependency of a person's actions. The similarity between two image features is calculated by the function r(a,b); the larger the value of r(a,b), the stronger the dependency between the body postures corresponding to the two image features.
The spatial feature vector of each person in a frame of image is determined according to the similarity between the image features of the persons in the frame. The spatial feature vector of a person in a frame of image is used to represent the person's spatial dependence on the other persons in the frame, that is, the dependence of the person's body posture on the body postures of the other persons.
FIG. 14 is a schematic diagram of the spatial relationship between the actions of different persons according to an embodiment of this application. For the frame of image of the group action shown in FIG. 14, the spatial dependence matrix in FIG. 15 represents the dependence of each person in the group action on the body postures of the other persons. Each entry of the spatial dependence matrix is represented by a square, and the shade (brightness) of the square represents the similarity between the image features of the two persons, that is, the value of the function r(a,b). The larger the value of r(a,b), the darker the color of the square. The values of r(a,b) can be normalized, that is, mapped to between 0 and 1, so as to draw the spatial dependence matrix.
Intuitively, the hitter in FIG. 14, player No. 10, has a large influence on the subsequent actions of her teammates. Through the calculation of the function r(a,b), the tenth row and the tenth column of the spatial dependence matrix, which represent player No. 10, are darker; that is, player No. 10 is the most relevant to the group action. Therefore, the function r(a,b) can reflect a high degree of association between the body posture of one person and the body postures of the other persons in a frame of image, that is, a case with a high degree of dependence. In FIG. 14, the spatial dependency between the body postures of players No. 1 to No. 6 is weak. In the spatial dependence matrix, this is reflected in the colors within the black-box region in the upper left corner, which represents the dependency between the body postures of players No. 1 to No. 6. Therefore, the neural network provided in the embodiments of this application can well reflect the dependency, or association, between the body posture of one person and the body postures of the other persons in a frame of image.
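A sketch of how such a spatial dependence matrix can be computed and normalized to [0, 1] is given below; the feature values and projection matrices are random stand-ins for the learned embeddings θ and φ.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 12, 512                                   # assumed number of persons and feature dims
x = rng.standard_normal((K, D))                  # image features of the K persons in one frame
W_theta = rng.standard_normal((D, D)) / np.sqrt(D)
W_phi   = rng.standard_normal((D, D)) / np.sqrt(D)

# Pairwise similarities r(a, b) = theta(a)^T phi(b), then min-max normalization to [0, 1].
S = (x @ W_theta) @ (x @ W_phi).T                # K x K spatial dependence matrix
S = (S - S.min()) / (S.max() - S.min())
```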
The temporal feature vector of a person in a frame of image is determined according to the similarity between the image features of the person across the frames of images. The temporal feature vector of a person in one frame of image is used to represent the dependence of the person's body posture on the person's body posture in the other frames of images.
The body postures of player No. 10 shown in FIG. 14 in 10 chronologically ordered frames of images are shown in FIG. 16, and the time dependence matrix in FIG. 17 represents the temporal dependence of the body posture of player No. 10. Each entry of the time dependence matrix is represented by a square, and the shade (brightness) of the square represents the similarity between the two image features, that is, the value of the function r(a,b).
The body postures of player No. 10 in the 10 frames of images correspond to taking off (frames 1-3), being airborne (frames 4-8), and landing (frames 9-10). In people's perception, "taking off" and "landing" should be more discriminative. In the time dependence matrix shown in FIG. 17, the image features of player No. 10 in the 2nd and 10th frames have relatively high similarity with the image features in the other frames. In the black-box region shown in FIG. 17, the image features of frames 4-8, that is, of player No. 10 in the airborne state, have low similarity with the image features in the other frames. Therefore, the neural network provided in the embodiments of this application can well reflect the temporal association of a person's body posture across a plurality of frames of images.
The method embodiments of the embodiments of this application are described above with reference to the accompanying drawings, and the apparatus embodiments of the embodiments of this application are described below. It should be understood that the description of the method embodiments corresponds to the description of the apparatus embodiments; therefore, for parts that are not described, refer to the foregoing method embodiments.
FIG. 18 is a schematic diagram of the system architecture of an image recognition apparatus according to an embodiment of this application. The image recognition apparatus shown in FIG. 18 includes a feature extraction module 1801, a cross interaction module 1802, a feature fusion module 1803 and a classification module 1804. The image recognition apparatus in FIG. 18 can perform the image recognition method of the embodiments of this application. The process in which the image recognition apparatus processes an input picture is described below.
The feature extraction module 1801, which may also be called a partial-body extractor module, is configured to extract the image features of persons according to the skeleton nodes of the persons in the images. The functions of the feature extraction module 1801 can be implemented by a convolutional network. A plurality of frames of images are input into the feature extraction module 1801. The image feature of a person can be represented by a vector, which can be called the image feature vector of the person.
The cross interaction module 1802 is configured to map the image features of the plurality of persons in each of the plurality of frames of images to the spatio-temporal interaction feature of each person. The spatio-temporal interaction feature is used to represent the "spatio-temporal" association information of a person. The spatio-temporal interaction feature of a person in a frame of image may be obtained by fusing the temporal feature and the spatial feature of the person in the frame. The cross interaction module 1802 can be implemented by a convolutional layer and/or a fully connected layer.
The feature fusion module 1803 is configured to fuse the action feature and the spatio-temporal interaction feature of each person in a frame of image, to obtain the image feature vector of the frame of image. The image feature vector of the frame of image can serve as the feature representation of the frame of image.
The classification module 1804 is configured to perform classification according to the image feature vector, so as to determine the category of the group action of the persons in the T frames of images input into the feature extraction module 1801. The classification module 1804 can be a classifier.
The image recognition apparatus shown in FIG. 18 can be used to perform the image recognition method shown in FIG. 11.
FIG. 19 is a schematic structural diagram of an image recognition apparatus according to an embodiment of this application. The image recognition apparatus 3000 shown in FIG. 19 includes an acquisition unit 3001 and a processing unit 3002.
The acquisition unit 3001 is configured to acquire an image to be processed.
The processing unit 3002 is configured to perform the image recognition methods of the embodiments of this application.
Optionally, the acquisition unit 3001 may be configured to acquire the image to be processed, and the processing unit 3002 may be configured to perform steps S901 to S904 or steps S1001 to S1004, to recognize the group action of a plurality of persons in the image to be processed.
Optionally, the acquisition unit 3001 may be configured to acquire the image to be processed, and the processing unit 3002 may be configured to perform steps S1101 to S1104, to recognize the group action of the persons in the image to be processed.
The processing unit 3002 can be divided into a plurality of modules according to different processing functions.
For example, the processing unit 3002 can be divided into the feature extraction module 1801, the cross interaction module 1802, the feature fusion module 1803 and the classification module 1804 shown in FIG. 18. The processing unit 3002 can implement the functions of the modules shown in FIG. 18, and can therefore be used to implement the image recognition method shown in FIG. 11.
FIG. 20 is a schematic diagram of the hardware structure of an image recognition apparatus according to an embodiment of this application. The image recognition apparatus 4000 shown in FIG. 20 (the apparatus 4000 may specifically be a computer device) includes a memory 4001, a processor 4002, a communication interface 4003 and a bus 4004. The memory 4001, the processor 4002 and the communication interface 4003 are communicatively connected to each other through the bus 4004.
The memory 4001 may be a read only memory (ROM), a static storage device, a dynamic storage device or a random access memory (RAM). The memory 4001 may store a program; when the program stored in the memory 4001 is executed by the processor 4002, the processor 4002 is configured to perform the steps of the image recognition method of the embodiments of this application.
The processor 4002 may be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU) or one or more integrated circuits, and is configured to execute a related program to implement the image recognition method of the method embodiments of this application.
The processor 4002 may also be an integrated circuit chip with a signal processing capability. In an implementation process, the steps of the image recognition method of this application can be completed by an integrated logic circuit of hardware in the processor 4002 or by instructions in the form of software.
The processor 4002 may also be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods, steps and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of this application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 4001, and the processor 4002 reads the information in the memory 4001 and, in combination with its hardware, completes the functions that need to be performed by the units included in the image recognition apparatus, or performs the image recognition method of the method embodiments of this application.
The communication interface 4003 uses a transceiver apparatus such as, but not limited to, a transceiver to implement communication between the apparatus 4000 and another device or a communication network. For example, the image to be processed can be obtained through the communication interface 4003.
The bus 4004 may include a path for transferring information between the components of the apparatus 4000 (for example, the memory 4001, the processor 4002 and the communication interface 4003).
FIG. 21 is a schematic diagram of the hardware structure of a neural network training apparatus according to an embodiment of this application. Similar to the apparatus 4000, the neural network training apparatus 5000 shown in FIG. 21 includes a memory 5001, a processor 5002, a communication interface 5003 and a bus 5004. The memory 5001, the processor 5002 and the communication interface 5003 are communicatively connected to each other through the bus 5004.
The memory 5001 may be a ROM, a static storage device or a RAM. The memory 5001 may store a program; when the program stored in the memory 5001 is executed by the processor 5002, the processor 5002 and the communication interface 5003 are configured to perform the steps of the neural network training method of the embodiments of this application.
The processor 5002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU or one or more integrated circuits, and is configured to execute a related program to implement the functions that need to be performed by the units in the image processing apparatus of the embodiments of this application, or to perform the neural network training method of the method embodiments of this application.
The processor 5002 may also be an integrated circuit chip with a signal processing capability. In an implementation process, the steps of the neural network training method of the embodiments of this application can be completed by an integrated logic circuit of hardware in the processor 5002 or by instructions in the form of software.
The processor 5002 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods, steps and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of this application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 5001, and the processor 5002 reads the information in the memory 5001 and, in combination with its hardware, completes the functions that need to be performed by the units included in the image processing apparatus of the embodiments of this application, or performs the neural network training method of the method embodiments of this application.
The communication interface 5003 uses a transceiver apparatus such as, but not limited to, a transceiver to implement communication between the apparatus 5000 and another device or a communication network. For example, the image to be processed can be obtained through the communication interface 5003.
The bus 5004 may include a path for transferring information between the components of the apparatus 5000 (for example, the memory 5001, the processor 5002 and the communication interface 5003).
It should be noted that although the apparatus 4000 and the apparatus 5000 described above show only a memory, a processor and a communication interface, in a specific implementation process, a person skilled in the art should understand that the apparatus 4000 and the apparatus 5000 may further include other components necessary for normal operation. In addition, according to specific needs, a person skilled in the art should understand that the apparatus 4000 and the apparatus 5000 may further include hardware components for implementing other additional functions. Furthermore, a person skilled in the art should understand that the apparatus 4000 and the apparatus 5000 may alternatively include only the components necessary for implementing the embodiments of this application, and do not need to include all the components shown in FIG. 20 and FIG. 21.
An embodiment of the present application further provides an image recognition apparatus, including at least one processor and a communication interface, where the communication interface is used by the image recognition apparatus to exchange information with other communication apparatuses, and when program instructions are executed in the at least one processor, the image recognition apparatus is caused to perform the foregoing method.
An embodiment of the present application further provides a computer program storage medium, characterized in that the computer program storage medium has program instructions, and when the program instructions are executed directly or indirectly, the foregoing method is implemented.
An embodiment of the present application further provides a chip system, characterized in that the chip system includes at least one processor, and when program instructions are executed in the at least one processor, the foregoing method is implemented.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered as going beyond the scope of the present application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application essentially, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

  1. An image recognition method, characterized by comprising:
    extracting image features of an image to be processed, wherein the image to be processed comprises a plurality of persons, and the image features of the image to be processed comprise image features of each of the plurality of persons in each frame of multiple frames of the image to be processed;
    determining a temporal feature of each of the plurality of persons in each frame of the multiple frames, wherein the temporal feature of a j-th person of the plurality of persons in an i-th frame of the image to be processed is determined according to similarities between the image feature of the j-th person in the i-th frame and the image features of the j-th person in frames of the multiple frames other than the i-th frame, and i and j are positive integers;
    determining a spatial feature of each of the plurality of persons in each frame of the multiple frames, wherein the spatial feature of the j-th person of the plurality of persons in the i-th frame of the image to be processed is determined according to similarities between the image feature of the j-th person in the i-th frame and the image features of persons of the plurality of persons other than the j-th person in the i-th frame;
    determining an action feature of each of the plurality of persons in each frame of the multiple frames, wherein the action feature of the j-th person of the plurality of persons in the i-th frame of the multiple frames is obtained by fusing the spatial feature of the j-th person in the i-th frame, the temporal feature of the j-th person in the i-th frame, and the image feature of the j-th person in the i-th frame;
    recognizing a group action of the plurality of persons in the image to be processed according to the action features of the plurality of persons in each frame of the multiple frames.
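As an illustration of the feature computation recited in claim 1, the sketch below aggregates per-person, per-frame image features with similarity weights: the temporal feature of person j in frame i is a similarity-weighted sum of that person's features in the other frames, the spatial feature is a similarity-weighted sum of the other persons' features in the same frame, and the three features are fused into an action feature. This is a minimal, hypothetical NumPy sketch; the dot-product similarity, the softmax normalization, and the concatenation-based fusion are assumptions made for illustration and are not taken from the claims.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def group_action_features(feats):
    """feats: array of shape (T, N, D) -- image features of N persons in T frames.
    Returns fused action features of shape (T, N, 3*D)."""
    T, N, D = feats.shape
    temporal = np.zeros_like(feats)
    spatial = np.zeros_like(feats)

    for j in range(N):
        # Temporal feature: similarity of person j's feature in frame i
        # to person j's features in the other frames.
        f = feats[:, j, :]                      # (T, D)
        sim = f @ f.T                           # (T, T) dot-product similarity
        np.fill_diagonal(sim, -np.inf)          # exclude the i-th frame itself
        w = softmax(sim, axis=-1)               # weights over the other frames
        temporal[:, j, :] = w @ f

    for i in range(T):
        # Spatial feature: similarity of person j's feature in frame i
        # to the other persons' features in the same frame.
        f = feats[i]                            # (N, D)
        sim = f @ f.T                           # (N, N) dot-product similarity
        np.fill_diagonal(sim, -np.inf)          # exclude the j-th person itself
        w = softmax(sim, axis=-1)               # weights over the other persons
        spatial[i] = w @ f

    # Fuse image, temporal, and spatial features (concatenation as one possible fusion).
    return np.concatenate([feats, temporal, spatial], axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fused = group_action_features(rng.standard_normal((5, 12, 64)))  # 5 frames, 12 persons
    print(fused.shape)  # (5, 12, 192)
```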
  2. The method according to claim 1, wherein the extracting image features of the image to be processed comprises:
    determining an image region in which skeleton nodes of each of the plurality of persons are located in each frame of the multiple frames;
    performing feature extraction on the image of the image region in which the skeleton nodes of each of the plurality of persons are located, to obtain the image features of the image to be processed.
  3. The method according to claim 2, wherein the performing feature extraction on the image of the image region in which the skeleton nodes of each of the plurality of persons are located, to obtain the image features of the image to be processed, comprises:
    masking, in each frame of the multiple frames, regions other than the image region in which the skeleton nodes of each of the plurality of persons are located, to obtain a partially visible image, wherein the partially visible image is an image composed of the image regions in which the skeleton nodes of each of the plurality of persons are located;
    performing feature extraction on the partially visible image to obtain the image features of the image to be processed.
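To make the masking step of claim 3 concrete, the sketch below zeroes out everything in a frame that lies outside the image regions containing each person's skeleton nodes, producing a partially visible image from which features could then be extracted. The rectangular regions derived from skeleton-node coordinates, the margin parameter, and the zero-fill masking are illustrative assumptions, not the claimed implementation.

```python
import numpy as np

def partially_visible_image(frame, skeletons, margin=8):
    """frame: (H, W, C) image; skeletons: list of (K, 2) arrays of (x, y) skeleton-node
    coordinates, one array per person. Regions outside the per-person boxes are masked."""
    H, W = frame.shape[:2]
    mask = np.zeros((H, W, 1), dtype=frame.dtype)
    for nodes in skeletons:
        x0 = max(int(nodes[:, 0].min()) - margin, 0)
        x1 = min(int(nodes[:, 0].max()) + margin, W)
        y0 = max(int(nodes[:, 1].min()) - margin, 0)
        y1 = min(int(nodes[:, 1].max()) + margin, H)
        mask[y0:y1, x0:x1] = 1          # keep the region containing this person's skeleton nodes
    return frame * mask                  # everything outside the kept regions is masked

if __name__ == "__main__":
    frame = np.random.rand(240, 320, 3)
    skeletons = [np.array([[50, 60], [70, 120], [60, 180]]),
                 np.array([[200, 40], [220, 100], [210, 150]])]
    visible = partially_visible_image(frame, skeletons)
    print(visible.shape)  # (240, 320, 3)
```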
  4. The method according to any one of claims 1 to 3, wherein the recognizing a group action of the plurality of persons in the image to be processed according to the action features of the plurality of persons in each frame of the multiple frames comprises:
    classifying the action feature of each of the plurality of persons in each frame of the multiple frames to obtain an action of each of the plurality of persons;
    determining the group action of the plurality of persons in the image to be processed according to the action of each of the plurality of persons.
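One possible reading of the two steps in claim 4 is sketched below: each person's fused action feature is classified per frame, the per-frame class scores are averaged over the frames, and the group action is then taken as a simple majority vote over the per-person action labels. The linear classifier, the averaging over frames, and the majority vote are hypothetical choices used only for illustration.

```python
import numpy as np

def classify_actions(action_feats, weights, bias):
    """action_feats: (T, N, F) fused per-person features; weights: (F, A); bias: (A,).
    Returns per-person action labels (N,) and a group-action label (scalar)."""
    scores = action_feats @ weights + bias        # (T, N, A) per-frame class scores
    person_scores = scores.mean(axis=0)           # (N, A) averaged over the frames
    person_actions = person_scores.argmax(axis=-1)

    # Group action determined from the individual actions (here: majority vote).
    group_action = int(np.bincount(person_actions).argmax())
    return person_actions, group_action

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    T, N, F, A = 5, 12, 192, 8                    # frames, persons, feature dim, action classes
    person_actions, group_action = classify_actions(
        rng.standard_normal((T, N, F)), rng.standard_normal((F, A)), np.zeros(A))
    print(person_actions, group_action)
```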
  5. The method according to any one of claims 1 to 4, further comprising:
    generating label information of the image to be processed, wherein the label information is used to indicate the group action of the plurality of persons in the image to be processed.
  6. The method according to any one of claims 1 to 4, further comprising:
    determining, according to the group action of the plurality of persons in the image to be processed, a contribution of each of the plurality of persons to the group action of the plurality of persons;
    determining a key person among the plurality of persons according to the contribution of each of the plurality of persons to the group action of the plurality of persons, wherein the contribution of the key person to the group action of the plurality of persons is greater than the contributions of persons of the plurality of persons other than the key person to the group action of the plurality of persons.
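The contribution and key-person determination of claim 6 could, for example, score each person by how strongly that person's action feature supports the recognized group action, and select the person with the largest normalized score as the key person. The dot-product scoring against a group-action class vector and the softmax normalization below are assumptions made for this sketch only.

```python
import numpy as np

def key_person(person_feats, group_class_vector):
    """person_feats: (N, F) per-person action features (e.g. averaged over frames);
    group_class_vector: (F,) classifier vector of the recognized group action.
    Returns per-person contributions (N,) and the index of the key person."""
    scores = person_feats @ group_class_vector          # alignment with the group action
    contributions = np.exp(scores - scores.max())
    contributions /= contributions.sum()                # normalized contribution per person
    return contributions, int(contributions.argmax())   # key person: largest contribution

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    contributions, key = key_person(rng.standard_normal((12, 192)),
                                    rng.standard_normal(192))
    print(contributions.round(3), key)
```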
  7. An image recognition apparatus, characterized by comprising:
    an acquiring unit, configured to acquire an image to be processed;
    a processing unit, configured to:
    extract image features of the image to be processed, wherein the image to be processed comprises a plurality of persons, and the image features of the image to be processed comprise image features of each of the plurality of persons in each frame of multiple frames of the image to be processed;
    determine a temporal feature of each of the plurality of persons in each frame of the multiple frames, wherein the temporal feature of a j-th person of the plurality of persons in an i-th frame of the image to be processed is determined according to similarities between the image feature of the j-th person in the i-th frame and the image features of the j-th person in frames of the multiple frames other than the i-th frame, and i and j are positive integers;
    determine a spatial feature of each of the plurality of persons in each frame of the multiple frames, wherein the spatial feature of the j-th person of the plurality of persons in the i-th frame of the image to be processed is determined according to similarities between the image feature of the j-th person in the i-th frame and the image features of persons of the plurality of persons other than the j-th person in the i-th frame;
    determine an action feature of each of the plurality of persons in each frame of the multiple frames, wherein the action feature of the j-th person of the plurality of persons in the i-th frame of the multiple frames is obtained by fusing the spatial feature of the j-th person in the i-th frame, the temporal feature of the j-th person in the i-th frame, and the image feature of the j-th person in the i-th frame;
    recognize a group action of the plurality of persons in the image to be processed according to the action features of the plurality of persons in each frame of the multiple frames.
  8. The apparatus according to claim 7, wherein the processing unit is configured to:
    determine, in each frame of the multiple frames, an image region in which skeleton nodes of each of the plurality of persons are located;
    perform feature extraction on the image of the image region in which the skeleton nodes of each of the plurality of persons are located, to obtain the image features of the image to be processed.
  9. The apparatus according to claim 8, wherein the processing unit is configured to:
    mask, in each frame of the multiple frames, regions other than the image region in which the skeleton nodes of each of the plurality of persons are located, to obtain a partially visible image, wherein the partially visible image is an image composed of the image regions in which the skeleton nodes of each of the plurality of persons are located;
    perform feature extraction on the partially visible image to obtain the image features of the image to be processed.
  10. The apparatus according to any one of claims 7 to 9, wherein the processing unit is configured to:
    classify the action feature of each of the plurality of persons in each frame of the multiple frames to obtain an action of each of the plurality of persons;
    determine the group action of the plurality of persons in the image to be processed according to the action of each of the plurality of persons.
  11. The apparatus according to claim 10, wherein the processing unit is configured to:
    generate label information of the image to be processed, wherein the label information is used to indicate the group action of the plurality of persons in the image to be processed.
  12. The apparatus according to any one of claims 7 to 10, wherein the processing unit is configured to:
    determine, according to the group action of the plurality of persons in the image to be processed, a contribution of each of the plurality of persons to the group action of the plurality of persons;
    determine a key person among the plurality of persons according to the contribution of each of the plurality of persons to the group action of the plurality of persons, wherein the contribution of the key person to the group action of the plurality of persons is greater than the contributions of persons of the plurality of persons other than the key person to the group action of the plurality of persons.
  13. An image recognition apparatus, characterized in that the apparatus comprises:
    a memory, configured to store a program;
    a processor, configured to execute the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to perform the method according to any one of claims 1 to 6.
  14. A computer-readable storage medium, characterized in that the computer-readable medium stores program code for execution by a device, and the program code comprises instructions for performing the method according to any one of claims 1 to 6.
  15. A chip, characterized in that the chip comprises a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method according to any one of claims 1 to 6.
PCT/CN2020/113788 2019-10-15 2020-09-07 Image recognition method and apparatus, computer-readable storage medium and chip WO2021073311A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910980310.7 2019-10-15
CN201910980310.7A CN112668366B (en) 2019-10-15 2019-10-15 Image recognition method, device, computer readable storage medium and chip

Publications (1)

Publication Number Publication Date
WO2021073311A1 true WO2021073311A1 (en) 2021-04-22

Family

ID=75400028

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113788 WO2021073311A1 (en) 2019-10-15 2020-09-07 Image recognition method and apparatus, computer-readable storage medium and chip

Country Status (2)

Country Link
CN (1) CN112668366B (en)
WO (1) WO2021073311A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111842A (en) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113283381A (en) * 2021-06-15 2021-08-20 南京工业大学 Human body action detection method suitable for mobile robot platform

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112969058B (en) * 2021-05-18 2021-08-03 南京拓晖信息技术有限公司 Industrial video real-time supervision platform and method with cloud storage function
CN113255518B (en) * 2021-05-25 2021-09-24 神威超算(北京)科技有限公司 Video abnormal event detection method and chip
CN113344562B (en) * 2021-08-09 2021-11-02 四川大学 Method and device for detecting Etheng phishing accounts based on deep neural network
CN114494543A (en) * 2022-01-25 2022-05-13 上海商汤科技开发有限公司 Action generation method and related device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005182660A (en) * 2003-12-22 2005-07-07 Matsushita Electric Works Ltd Recognition method of character/figure
CN108229355A (en) * 2017-12-22 2018-06-29 北京市商汤科技开发有限公司 Activity recognition method and apparatus, electronic equipment, computer storage media, program
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109299646A (en) * 2018-07-24 2019-02-01 北京旷视科技有限公司 Crowd's accident detection method, apparatus, system and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9721086B2 (en) * 2013-03-15 2017-08-01 Advanced Elemental Technologies, Inc. Methods and systems for secure and reliable identity-based computing
WO2018126323A1 (en) * 2017-01-06 2018-07-12 Sportlogiq Inc. Systems and methods for behaviour understanding from trajectories
CN110363279B (en) * 2018-03-26 2021-09-21 华为技术有限公司 Image processing method and device based on convolutional neural network model
CN108764019A (en) * 2018-04-03 2018-11-06 天津大学 A kind of Video Events detection method based on multi-source deep learning
CN109299657B (en) * 2018-08-14 2020-07-03 清华大学 Group behavior identification method and device based on semantic attention retention mechanism
CN109993707B (en) * 2019-03-01 2023-05-12 华为技术有限公司 Image denoising method and device
CN110222717B (en) * 2019-05-09 2022-01-14 华为技术有限公司 Image processing method and device
CN110309856A (en) * 2019-05-30 2019-10-08 华为技术有限公司 Image classification method, the training method of neural network and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005182660A (en) * 2003-12-22 2005-07-07 Matsushita Electric Works Ltd Recognition method of character/figure
CN108229355A (en) * 2017-12-22 2018-06-29 北京市商汤科技开发有限公司 Activity recognition method and apparatus, electronic equipment, computer storage media, program
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109299646A (en) * 2018-07-24 2019-02-01 北京旷视科技有限公司 Crowd's accident detection method, apparatus, system and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111842A (en) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113111842B (en) * 2021-04-26 2023-06-27 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113283381A (en) * 2021-06-15 2021-08-20 南京工业大学 Human body action detection method suitable for mobile robot platform
CN113283381B (en) * 2021-06-15 2024-04-05 南京工业大学 Human body action detection method suitable for mobile robot platform

Also Published As

Publication number Publication date
CN112668366A (en) 2021-04-16
CN112668366B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
WO2020253416A1 (en) Object detection method and device, and computer storage medium
US20220092351A1 (en) Image classification method, neural network training method, and apparatus
WO2021073311A1 (en) Image recognition method and apparatus, computer-readable storage medium and chip
Chen et al. Attention-based context aggregation network for monocular depth estimation
WO2021043112A1 (en) Image classification method and apparatus
WO2021042828A1 (en) Neural network model compression method and apparatus, and storage medium and chip
WO2021043273A1 (en) Image enhancement method and apparatus
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
WO2021155792A1 (en) Processing apparatus, method and storage medium
WO2021147325A1 (en) Object detection method and apparatus, and storage medium
CN110222717B (en) Image processing method and device
US20210398252A1 (en) Image denoising method and apparatus
WO2021013095A1 (en) Image classification method and apparatus, and method and apparatus for training image classification model
Das et al. Where to focus on for human action recognition?
WO2021018245A1 (en) Image classification method and apparatus
CN113065645B (en) Twin attention network, image processing method and device
CN110222718B (en) Image processing method and device
WO2021018251A1 (en) Image classification method and device
CN113011562A (en) Model training method and device
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
CN113807183A (en) Model training method and related equipment
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
CN110705564B (en) Image recognition method and device
CN113449550A (en) Human body weight recognition data processing method, human body weight recognition method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20877961

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20877961

Country of ref document: EP

Kind code of ref document: A1