WO2021073311A1 - Image recognition method and apparatus, computer-readable storage medium, and chip

Image recognition method and apparatus, computer-readable storage medium, and chip

Info

Publication number
WO2021073311A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
frame
person
processed
characters
Prior art date
Application number
PCT/CN2020/113788
Other languages
English (en)
Chinese (zh)
Inventor
严锐
谢凌曦
田奇
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021073311A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an image recognition method, device, computer readable storage medium and chip.
  • Computer vision is an inseparable part of intelligent/autonomous systems in many application fields, such as manufacturing, inspection, document analysis, medical diagnosis, and the military. It studies how to use cameras/video cameras and computers to obtain the data and information about a subject that we need. Figuratively speaking, it equips the computer with eyes (the camera/camcorder) and a brain (the algorithm) so that the computer can identify, track, and measure targets in place of the human eye and thereby perceive its environment. Because perception can be regarded as extracting information from sensory signals, computer vision can also be regarded as the science of how to make artificial systems "perceive" from images or multi-dimensional data.
  • in general, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses the computer in place of the brain to process and interpret that input information.
  • the ultimate research goal of computer vision is to enable computers to observe and understand the world through vision as humans do, and to adapt to the environment autonomously.
  • Action recognition is an important research topic in the field of computer vision.
  • through action recognition, a computer can understand the content of a video.
  • action recognition technology can be widely used in public place monitoring, human-computer interaction, and other fields.
  • feature extraction is a key link in the action recognition process; only with accurate features can action recognition be performed effectively.
  • in group action recognition, the temporal relationship of each character's actions across the video and the relationship between the actions of the multiple characters both affect the accuracy of recognition.
  • in one existing solution, a long short-term memory (LSTM) network is used to compute the interactive action features of each character, so that the action features of each character can be determined from its interactive action features, and the group action of the multiple characters can then be inferred from the action features of each character.
  • the interactive action features are used to express the correlation between the characters' actions.
  • the present application provides an image recognition method, device, computer readable storage medium, and chip to better recognize group actions of multiple people in an image to be processed.
  • In a first aspect, an image recognition method is provided, which includes: extracting image features of an image to be processed, where the image to be processed includes multiple frames of images; determining the temporal feature of each of a plurality of persons in each frame of the multi-frame image; determining the spatial feature of each of the plurality of persons in each frame of the multi-frame image; determining the action feature of each of the plurality of persons in each frame of the multi-frame image; and recognizing, based on the action features of each of the plurality of persons in each frame of the multi-frame image, the group action of the plurality of persons in the image to be processed.
  • the group actions of multiple characters in the image to be processed may be a certain sport or activity.
  • the group action of the multiple characters in the image to be processed may be basketball, volleyball, football, dancing, or the like.
  • the image to be processed includes multiple people, and the image features of the image to be processed include image features of the multiple people in each of the multiple frames of the image to be processed.
  • the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image received by the image recognition device from another device, Alternatively, the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
  • the above-mentioned image to be processed may be consecutive multiple frames of a video, or multiple frames selected from a video according to a preset rule.
  • the multiple characters may include only humans, or only animals, or both humans and animals.
  • the person in the image can be identified to determine the person's bounding box.
  • the image in each bounding box corresponds to a person in the image.
  • feature extraction can be performed on the image in each bounding box to obtain the image feature of each person.
  • the bone nodes of the person in the bounding box corresponding to each person can be identified first, and the image feature vector of the person can then be extracted according to the person's bone nodes, so that the extracted image features reflect the person's actions more accurately and the accuracy of the extracted image features is improved.
  • the bone nodes in the bounding box can be connected according to the structure of the person to obtain a connected image, and then the image feature vector is extracted on the connected image.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
  • the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
  • the temporal association between the actions of a character at different moments can be determined from the similarity between the image feature vectors of that character's actions in different frames of the image, and the character's temporal feature is then obtained.
  • the multi-frame images in the image to be processed are specifically T frames, and i is a positive integer less than or equal to T
  • the i-th frame of image represents the images in the corresponding order in the T frame image
  • the j-th character represents the characters in the corresponding order among the K characters
  • both i and j are positive integers.
  • the temporal feature of the j-th person in the i-th frame of the image to be processed is determined based on the similarity between the image feature of the j-th person in the i-th frame of image and the image features of the j-th person in the other frames of the multi-frame image.
  • the temporal feature of the j-th person in the i-th frame of image is used to indicate the relationship between the action of the j-th person in the i-th frame of image and the actions of the j-th person in the other frames of the multi-frame image.
  • the similarity between the corresponding image features of a certain person in the two frames of images can reflect the degree of dependence of the person's actions on time.
  • the spatial correlation between the actions of different characters in the frame of image is determined by the similarity between the image characteristics of different characters in the same frame of image.
  • the spatial feature of the j-th person among the multiple people in the i-th frame of the multi-frame image to be processed is determined based on the similarity between the image feature of the j-th person in the i-th frame of image and the image features of the other people in the i-th frame of image except the j-th person. That is to say, the spatial feature of the j-th person in the i-th frame of image can be determined based on the similarity between the image feature of the j-th person in the i-th frame of image and the image features of the other people in the i-th frame of image except the j-th person.
  • the spatial feature of the j-th person in the i-th frame of image is used to represent the relationship between the action of the j-th person in the i-th frame of image and the actions of the other people in the i-th frame of image except the j-th person.
  • the similarity between the image feature vector of the j-th person in the i-th frame of image and the image feature vectors of the other people except the j-th person can reflect the degree to which the action of the j-th person in the i-th frame of image depends on the actions of people other than the j-th person. That is to say, the higher the similarity between the image feature vectors corresponding to two people, the closer the association between the actions of the two people; conversely, the lower the similarity between the image feature vectors corresponding to two people, the weaker the association between their actions.
  • the similarity used to determine the above-mentioned temporal features and spatial features can be calculated by Minkowski distance (such as Euclidean distance or Manhattan distance), cosine similarity, Chebyshev distance, Hamming distance, and so on.
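  • For illustration only, the following sketch (in Python with NumPy, using hypothetical array shapes and a softmax weighting that the application does not prescribe) shows one possible way to compute such similarity-based temporal and spatial features from per-person, per-frame image feature vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical image features: T frames, K persons, D-dimensional vector each.
T, K, D = 4, 3, 8
rng = np.random.default_rng(0)
feats = rng.normal(size=(T, K, D))

def temporal_feature(feats, i, j):
    # Weight the j-th person's features in the other frames by their
    # similarity to the j-th person's feature in frame i.
    sims = np.array([cosine_similarity(feats[i, j], feats[t, j])
                     for t in range(T) if t != i])
    others = np.stack([feats[t, j] for t in range(T) if t != i])
    return softmax(sims) @ others

def spatial_feature(feats, i, j):
    # Weight the other persons' features in frame i by their
    # similarity to the j-th person's feature in the same frame.
    sims = np.array([cosine_similarity(feats[i, j], feats[i, k])
                     for k in range(K) if k != j])
    others = np.stack([feats[i, k] for k in range(K) if k != j])
    return softmax(sims) @ others

t_feat = temporal_feature(feats, i=0, j=1)  # temporal feature of person 1 in frame 0
s_feat = spatial_feature(feats, i=0, j=1)   # spatial feature of person 1 in frame 0
```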
  • the spatial correlation between actions of different characters and the temporal correlation between actions of the same character can provide important clues to the categories of multi-person scenes in the image. Therefore, in the image recognition process of the present application, by comprehensively considering the spatial association relationship between different person actions and the time association relationship between the same person actions, the accuracy of recognition can be effectively improved.
  • the temporal feature, spatial feature, and image feature corresponding to a person in a frame of image can be fused to obtain the action feature of that person in that frame of image.
  • a combined fusion method can be used for fusion.
  • the features corresponding to a person in a frame of image are merged to obtain the action feature of that person in that frame of image.
  • the features to be fused may be added directly or weighted.
  • cascade and channel fusion can be used for fusion.
  • the dimensions of the features to be fused may be directly spliced, or spliced after being multiplied by a certain coefficient, that is, a weight value.
  • a pooling layer may be used to process the above-mentioned multiple features, so as to realize the fusion of the above-mentioned multiple features.
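  • As a hedged illustration of these fusion options (direct or weighted addition, concatenation, and pooling), and not the prescribed implementation of the application, consider:

```python
import numpy as np

# Hypothetical features to be fused for one person in one frame.
image_feat = np.ones(8)
temporal_feat = 0.5 * np.ones(8)
spatial_feat = 0.25 * np.ones(8)

# Combined fusion: direct or weighted element-wise addition.
fused_add = image_feat + temporal_feat + spatial_feat
fused_weighted = 0.5 * image_feat + 0.3 * temporal_feat + 0.2 * spatial_feat

# Cascade (channel) fusion: splice the feature dimensions, optionally scaled by weights.
fused_concat = np.concatenate([image_feat, temporal_feat, spatial_feat])

# Pooling-based fusion: e.g. element-wise max over the stacked features.
fused_pooled = np.stack([image_feat, temporal_feat, spatial_feat]).max(axis=0)
```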
  • the group action of the multiple characters in the image to be processed is recognized based on the action feature of each of the multiple characters in each frame of the image to be processed.
  • the action feature of each of the multiple characters in each frame of the image to be processed can be classified to obtain the action of each character, and the group action of the multiple characters is determined accordingly.
  • the action feature of each of the multiple characters in each frame of the image to be processed can be input into a classification module to obtain a classification result for each character's action feature, that is, the action of each character; the action performed by the largest number of characters is then taken as the group action of the multiple characters.
  • alternatively, a certain person can be selected from the multiple people, and the action features of that person in each frame of image can be input into the classification module to obtain the classification result of that person's action features, that is, the person's action; the obtained action of that person is then taken as the group action of the multiple characters in the image to be processed.
  • alternatively, the group action of the multiple characters in the image to be processed may be recognized as follows based on the action feature of each of the multiple characters in each frame of the image to be processed.
  • the action features of the multiple people in each frame of image can also be merged to obtain the action feature of that frame of image; the action feature of each frame of image is then classified to obtain the action of each frame of image, and the group action of the multiple characters in the image to be processed is determined accordingly.
  • the action features of the multiple characters in each frame of image can be fused to obtain the action feature of that frame of image, and the action feature of each frame of image can then be input into the classification module to obtain an action classification result for each frame of image; the classification result output by the classification module for the largest number of frames of the image to be processed is taken as the group action of the multiple people in the image to be processed.
  • alternatively, the action features of the multiple people in each frame of image can be fused to obtain the action feature of that frame of image, the action features of the frames of image obtained above can then be averaged to obtain an average action feature, the average action feature is input into the classification module, and the classification result corresponding to the average action feature is taken as the group action of the multiple people in the image to be processed.
  • a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
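  • The following sketch, with assumed shapes and a hypothetical linear classification module, illustrates two of the decision strategies described above: classifying each person's action feature and taking a majority vote, or fusing the per-person action features of each frame and voting over frame-level classification results:

```python
import numpy as np
from collections import Counter

T, K, D, C = 4, 3, 8, 5          # frames, persons, feature dimension, action classes
rng = np.random.default_rng(0)
action_feats = rng.normal(size=(T, K, D))    # per-person action features
W, b = rng.normal(size=(D, C)), np.zeros(C)  # hypothetical classifier weights

def classify(x):
    # Hypothetical linear classification module: returns a class index.
    return int(np.argmax(x @ W + b))

# Option 1: classify each person's action feature and take the most frequent action.
person_actions = [classify(action_feats[t, k]) for t in range(T) for k in range(K)]
group_action_by_person_vote = Counter(person_actions).most_common(1)[0][0]

# Option 2: fuse the persons of each frame (max pooling), classify each frame,
# and take the most frequent frame-level result as the group action.
frame_feats = action_feats.max(axis=1)
frame_actions = [classify(f) for f in frame_feats]
group_action_by_frame_vote = Counter(frame_actions).most_common(1)[0][0]
```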
  • tag information of the image to be processed is generated according to the group action, and the tag information is used to indicate the group action of the multiple characters in the image to be processed.
  • the foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
  • the key person in the image to be processed is determined according to the group actions.
  • the above method can be used to detect key persons in a video image, for example.
  • the video contains several persons, most of which are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
  • the player who holds the ball has the greatest impact on all personnel present, including players, referees, and spectators, and also contributes the most to the group action; therefore, the player who holds the ball can be identified as the key person. By identifying the key person, people watching the video can be helped to understand what is going on in the game and what is about to happen.
  • In a second aspect, an image recognition method is provided, which includes: extracting image features of an image to be processed; determining the spatial features of multiple people in each frame of the image to be processed; determining the action features of the multiple people in each frame of the image to be processed; and recognizing, based on the action features of the multiple characters in each frame of the image to be processed, the group action of the multiple characters in the image to be processed.
  • the action features of the multiple people in the image to be processed are obtained by fusing the spatial features of the multiple people in the image to be processed and the image features in the image to be processed.
  • the above-mentioned image to be processed may be one frame of image, or may be multiple frames of continuous or non-continuous images.
  • the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image received by the image recognition device from another device, Alternatively, the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
  • the above-mentioned image to be processed may be one frame of image or continuous multiple frames of image in a piece of video, or one or multiple frames of image selected according to preset rules in a piece of video according to a preset.
  • the multiple characters may include only humans, or only animals, or both humans and animals.
  • the person in the image can be identified to determine the bounding box of the person.
  • the image in each bounding box corresponds to a person in the image.
  • feature extraction is performed on the image in each bounding box to obtain the image feature of each person.
  • the bone node of the person in the bounding box corresponding to each person can be identified first, and then the image features of the person can be extracted according to the bone node of each person, so that the extracted image features more accurately reflect the person The action to improve the accuracy of the extracted image features.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
  • a locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located can be masked to obtain the partially visible image.
  • the spatial correlation between the actions of different characters in the same frame of image is determined by the similarity between the image characteristics of different characters in the same frame of image.
  • the spatial feature of the j-th person among the multiple people in the i-th frame of the multi-frame image to be processed is determined based on the similarity between the image feature of the j-th person in the i-th frame of image and the image features of the other people. That is to say, the spatial feature of the j-th person in the i-th frame of image can be determined according to the similarity between the image feature of the j-th person in the i-th frame of image and the image features of the other people.
  • the spatial characteristics of the j-th person in the i-th frame image are used to represent the relationship between the actions of the j-th person in the i-th frame and the actions of other people in the i-th frame of the image except for the j-th person. .
  • the similarity between the image feature vector of the j-th person in the i-th frame image and the image feature vector of other people in the i-th frame image except for the j-th person can reflect the similarity of the j-th person in the i-th frame image The degree of dependence on the actions of other characters. That is to say, the higher the similarity of the image feature vectors corresponding to the two characters, the closer the association between the two actions; conversely, the lower the similarity, the weaker the association between the actions of the two characters.
  • the similarity between the aforementioned spatial features can be calculated by Minkowski distance (such as Euclidean distance, Manhattan distance), cosine similarity, Chebyshev distance, Hamming distance, and the like.
  • the spatial feature and image feature corresponding to the person in a frame of image can be fused to obtain the action feature of the person in the frame of image.
  • a combined fusion method can be used for fusion.
  • the features corresponding to a person in a frame of image are merged to obtain the action feature of that person in that frame of image.
  • the features to be fused may be added directly, or weighted.
  • cascade and channel fusion can be used for fusion.
  • the dimensions of the features to be fused may be directly spliced, or spliced after being multiplied by a certain coefficient, that is, a weight value.
  • a pooling layer may be used to process the above-mentioned multiple features, so as to realize the fusion of the above-mentioned multiple features.
  • when recognizing the group action of the multiple characters in the image to be processed according to the action features of the multiple characters in each frame of the image to be processed, the action feature of each of the multiple characters in each frame of the image to be processed can be classified to obtain the action of each character, and the group action of the multiple characters is determined accordingly.
  • the action feature of each of the multiple characters in each frame of the image to be processed can be input into the classification module to obtain a classification result for each character's action feature, that is, the action of each character; the action performed by the largest number of characters is then taken as the group action of the multiple characters.
  • alternatively, a certain person can be selected from the multiple people, and the action features of that person in each frame of image can be input into the classification module to obtain the classification result of that person's action features, that is, the person's action; the obtained action of that person is then taken as the group action of the multiple characters in the image to be processed.
  • in the second aspect, when recognizing the group action of the multiple characters in the image to be processed according to the action features of the multiple characters in each frame of the image to be processed, the action features of the multiple characters in each frame of image can also be merged to obtain the action feature of that frame of image; the action feature of each frame of image is then classified to obtain the action of each frame of image, and the group action of the multiple characters in the image to be processed is determined accordingly.
  • the action features of the multiple characters in each frame of image can be fused to obtain the action feature of that frame of image, and the action feature of each frame of image can then be input into the classification module to obtain an action classification result for each frame of image; the classification result output by the classification module for the largest number of frames of the image to be processed is taken as the group action of the multiple people in the image to be processed.
  • alternatively, the action features of the multiple people in each frame of image can be fused to obtain the action feature of that frame of image, the action features of the frames of image obtained above can then be averaged to obtain an average action feature, the average action feature is input into the classification module, and the classification result corresponding to the average action feature is taken as the group action of the multiple people in the image to be processed.
  • a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
  • tag information of the image to be processed is generated according to the group action, and the tag information is used to indicate the group action of the multiple characters in the image to be processed.
  • the foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
  • the key person in the image to be processed is determined according to the group actions.
  • the above method can be used to detect key persons in a video image, for example.
  • the video contains several persons, most of which are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
  • the player who holds the ball has the greatest impact on all personnel present, including players, referees, and spectators, and contributes the most to the group action; therefore, the player who holds the ball can be identified as the key person. By identifying the key person, people watching the video can be helped to understand what is going on in the game and what is about to happen.
  • In a third aspect, an image recognition method is provided, which includes: extracting image features of an image to be processed; determining the dependency between different characters in the image to be processed and the dependency between the actions of the same character at different times to obtain a spatio-temporal feature vector; fusing the image features with the spatio-temporal feature vector to obtain the action feature of each frame of the image to be processed; and performing classification prediction on the action feature of each frame of image to determine the group action category of the image to be processed.
  • in this way, the complex reasoning process of group action recognition is completed; when determining the group action of multiple characters, not only the temporal features of the multiple characters but also their spatial features are taken into consideration.
  • by integrating the temporal features and spatial features of the multiple characters, the group action of the multiple characters can be determined better and more accurately.
  • target tracking can be performed on each person to determine the bounding box of each person in each frame of image, where the image in each bounding box corresponds to one person; feature extraction is then performed on the image in each of the above-mentioned bounding boxes to obtain the image feature of each person.
  • the image features can also be extracted by identifying the bone nodes of the person, so as to reduce the influence of the redundant information of the image during the feature extraction process and improve the accuracy of feature extraction.
  • a convolutional network can be used to extract image features based on bone nodes.
  • the bone nodes in the bounding box may be connected according to the structure of the person to obtain a connected image, and then image feature vector extraction is performed on the connected image.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors, and then image feature extraction is performed on the processed image.
  • the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
  • the character's action masking matrix can be calculated according to the character's image and bone nodes.
  • Each point in the masking matrix corresponds to a pixel.
  • the value in the square area with the bone point as the center and side length l is set to 1, and the values in other positions are set to 0.
  • the RGB color mode can be used for masking.
  • the RGB color mode assigns an intensity value in the range of 0 to 255 to each of the RGB components of every pixel in the image.
  • the masking matrix is used to mask the original character action pictures to obtain a partially visible image.
  • the area of side length l around each bone node in the skeleton is reserved, and the other areas are masked.
  • the use of locally visible images for image feature extraction can reduce the redundant information in the bounding box, and can extract image features based on the structure information of the person, and enhance the performance of the person's actions in the image features.
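  • For illustration only, the mask construction described above (value 1 inside the square of side length l centered on each bone point, value 0 elsewhere) could be sketched as follows, with hypothetical image and keypoint coordinates:

```python
import numpy as np

def build_mask(height, width, bone_points, l):
    # Masking matrix: 1 inside the square of side length l centered on each
    # bone point, 0 everywhere else.
    mask = np.zeros((height, width))
    half = l // 2
    for (y, x) in bone_points:
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        x0, x1 = max(0, x - half), min(width, x + half + 1)
        mask[y0:y1, x0:x1] = 1.0
    return mask

# Hypothetical bounding-box crop (H x W x 3, RGB) and detected bone points.
crop = np.random.default_rng(0).integers(0, 256, size=(64, 32, 3))
bone_points = [(10, 16), (20, 16), (32, 12), (32, 20)]   # (row, col) coordinates

mask = build_mask(64, 32, bone_points, l=9)
partially_visible = crop * mask[..., None]   # areas away from the bone nodes are masked out
```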
  • the cross interaction module is used to determine the temporal correlation of the body posture of the characters in the multi-frame images, and/or Determine the spatial correlation of the body postures of the characters in the multi-frame images.
  • the above-mentioned cross interaction module is used to realize the interaction of features to establish a feature interaction model
  • the feature interaction model is used to represent the association relationship of the body posture of the character in time and/or space.
  • the spatial dependence between the body postures of different characters in the same frame of image can be determined.
  • the spatial dependence is used to indicate the dependence of the body posture of a character on the body posture of other characters in a certain frame of image, that is, the spatial dependence between the actions of the characters.
  • the spatial dependency can be expressed by the spatial feature vector.
  • the time dependence between the body postures of the same person at different times can be determined.
  • the time dependence may also be referred to as timing dependence, which is used to indicate the dependence of the body posture of the character in a certain frame of image on the body posture of the character in other video frames, that is, the inherent temporal dependence of an action.
  • the time dependence can be expressed by the time series feature vector.
  • the spatio-temporal feature vector of the k-th person can be calculated according to the spatial feature vector and the time-series feature vector of the k-th person in the image to be processed.
  • the image feature of the k-th person at time t is fused with the spatio-temporal feature vector to obtain the person feature vector of the k-th person at time t; alternatively, the image feature and the spatio-temporal feature vector are residually connected to obtain the person feature vector.
  • according to the person feature vector of each of the K persons, the set of person feature vectors of the K persons at time t is determined, and maximum pooling is performed on this set of person feature vectors to obtain the action feature vector.
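  • A minimal sketch of this step, under assumed shapes and without claiming to reproduce the exact cross interaction module of the application: the k-th person's image feature at time t is residually combined with its spatio-temporal feature, and the K person feature vectors are then max-pooled into the action feature vector of the frame:

```python
import numpy as np

T, K, D = 4, 3, 8
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(T, K, D))  # image feature of person k at time t
st_feats = rng.normal(size=(T, K, D))     # spatio-temporal feature (from spatial + temporal features)

# Residual connection: person feature vector = image feature + spatio-temporal feature.
person_feats = image_feats + st_feats

# Max pooling over the K person feature vectors gives the action feature vector of each frame.
action_feats = person_feats.max(axis=1)   # shape (T, D)
```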
  • the classification result of the group action can be obtained in different ways.
  • the action feature vector at time t is input to the classification module to obtain the classification result of the frame image.
  • the classification result given by the classification module for the action feature vector at any single time t may be used as the classification result of the group action in the T frames of image.
  • the classification result of the group action in the T frame image can also be understood as the classification result of the group action of the person in the T frame image, or the classification result of the T frame image.
  • the action feature vectors of the T frame images are respectively input to the classification module to obtain the classification result of each frame of image.
  • the classification result of the T frame image can belong to one or more categories.
  • the output category of the classification module that corresponds to the largest number of images among the T frames of image can be used as the classification result of the group action in the T frames of image.
  • alternatively, an average feature vector can be computed, where each element of the average feature vector is the average of the corresponding elements of the feature vector representations of the T frames of image.
  • the average feature vector can be input to the classification module to obtain the classification result of the group action in the T frame image.
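  • As a hedged example of the averaging alternative (hypothetical classification module and shapes):

```python
import numpy as np

T, D, C = 4, 8, 5
rng = np.random.default_rng(0)
action_feats = rng.normal(size=(T, D))        # action feature vector of each of the T frames
W, b = rng.normal(size=(D, C)), np.zeros(C)   # hypothetical classification module

# Each element of the average feature vector is the mean of the corresponding
# elements over the T frame action feature vectors.
avg_feat = action_feats.mean(axis=0)
group_action = int(np.argmax(avg_feat @ W + b))
```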
  • the above method can complete the complex reasoning process of group action recognition: the image features of the multiple frames of image are determined; the temporal and spatial features are determined according to the interdependence between different characters in the image and between the actions of the same character at different times; these features are then fused with the image features to obtain the action feature of each frame of image; and the group action of the multiple frames of image is finally inferred by classifying the action feature of each frame of image.
  • In a fourth aspect, an image recognition device is provided, which has the function of implementing the method in any one of the first aspect to the third aspect or any possible implementation manner thereof.
  • the image recognition device includes a unit that implements the method in any one of the first aspect to the third aspect.
  • In a fifth aspect, a neural network training device is provided, which has a unit for implementing the method in any one of the first aspect to the third aspect.
  • In a sixth aspect, an image recognition device is provided, which includes: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is used to execute the method in any one of the foregoing first aspect to third aspect.
  • In a seventh aspect, a neural network training device is provided, which includes: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to execute the method in any one of the foregoing first aspect to third aspect.
  • In an eighth aspect, an electronic device is provided, which includes the image recognition apparatus in the fourth aspect or the sixth aspect.
  • the electronic device in the above eighth aspect may specifically be a mobile terminal (for example, a smart phone), a tablet computer, a notebook computer, an augmented reality/virtual reality device, a vehicle-mounted terminal device, and so on.
  • In a ninth aspect, a computer device is provided, which includes the neural network training device in the fifth aspect or the seventh aspect.
  • the computer device may specifically be a computer, a server, a cloud device, or a device with a certain computing capability that can implement neural network training.
  • the present application provides a computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions.
  • when the computer instructions run on the computer, the computer executes the method in any one of the first aspect to the third aspect.
  • the present application provides a computer program product.
  • the computer program product includes computer program code.
  • when the computer program code runs on a computer, the computer executes the method in any one of the first aspect to the third aspect.
  • In a twelfth aspect, a chip is provided, which includes a processor and a data interface.
  • the processor reads instructions stored in a memory through the data interface to execute the method in any one of the implementation manners of the first aspect to the third aspect.
  • the chip may further include a memory in which instructions are stored, and the processor is configured to execute instructions stored on the memory.
  • the processor is configured to execute the method in any one of the implementation manners of the first aspect to the third aspect.
  • the above-mentioned chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
  • FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of an application environment provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for group action recognition provided by an embodiment of the present application
  • FIG. 4 is a schematic flowchart of a method for group action recognition provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a convolutional neural network structure provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a method for training a neural network model provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • FIG. 11 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a process of obtaining a partially visible image provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a method for calculating similarity between image features according to an embodiment of the present application.
  • FIG. 14 is a schematic diagram of the spatial relationship between different character actions provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of the spatial relationship of different character actions provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of the relationship in time of a character's actions according to an embodiment of the present application.
  • FIG. 17 is a schematic diagram of the relationship in time of a character's actions according to an embodiment of the present application.
  • FIG. 18 is a schematic diagram of a system architecture of an image recognition network provided by an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application.
  • FIG. 21 is a schematic structural diagram of a neural network training device provided by an embodiment of the present application.
  • the solution of the present application can be applied to the fields of video analysis, video recognition, abnormal or dangerous behavior detection, etc., which require video analysis of complex scenes of multiple people.
  • the video may be, for example, a sports game video, a daily surveillance video, and the like. Two commonly used application scenarios are briefly introduced below.
  • the trained neural network structure can be used to determine the label corresponding to a short video, that is, to classify short videos to obtain the group action category corresponding to each short video and to tag different short videos with different labels. This is convenient for users to view and search, saves manual classification and management time, and improves management efficiency and user experience.
  • the video includes several people, most of whom are not important. Detecting key figures effectively helps to quickly understand the content of the scene. As shown in Figure 2, the group action recognition system provided by the present application can identify key persons in the video, so as to understand the video content more accurately based on the information around the key persons.
  • a neural network can be composed of neural units.
  • a neural unit can refer to an arithmetic unit that takes inputs x_s and an intercept b, and the output of the arithmetic unit can be:
  • h_{W,b}(x) = f(W^T x) = f(Σ_{s=1}^{n} W_s x_s + b)
  • where s = 1, 2, ..., n, and n is a natural number greater than 1; W_s is the weight of x_s; and b is the bias of the neural unit.
  • f() is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
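  • For example, a single neural unit with a sigmoid activation can be sketched as follows (illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b):
    # Output of the unit: h = f(sum_s W_s * x_s + b), here with f = sigmoid.
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.2, -1.0, 0.5])   # inputs x_s
W = np.array([0.4, 0.1, -0.3])   # weights W_s
b = 0.05                         # bias
h = neural_unit(x, W, b)
```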
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be a region composed of several neural units.
  • Deep neural network (DNN) is also known as a multi-layer neural network.
  • a DNN can be understood as a neural network with many hidden layers; there is no special criterion for "many" here. According to the positions of the different layers, the layers of a DNN can be divided into three categories: the input layer, hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. For example, in a fully connected neural network, the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer.
  • DNN looks complicated, it is not complicated as far as the work of each layer is concerned.
  • the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W_{jk}^{L}. It should be noted that there is no W parameter in the input layer.
  • more hidden layers make the network more capable of portraying complex situations in the real world.
  • a model with more parameters is more complex and has a greater "capacity", which means it can complete more complex learning tasks.
  • Training the deep neural network is also the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
  • Convolutional neural network is a deep neural network with convolutional structure.
  • the convolutional neural network includes a feature extractor composed of a convolutional layer and a sub-sampling layer.
  • the feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolution feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can only be connected to a part of the neighboring neurons.
  • a convolutional layer usually includes several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units in the same feature plane share weights, and the shared weights here are the convolution kernels. Sharing weight can be understood as the way of extracting image information has nothing to do with location. The underlying principle is that the statistical information of a certain part of the image is the same as that of other parts. This means that the image information learned in one part can also be used in another part. Therefore, the image information obtained by the same learning can be used for all positions on the image. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
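  • The weight sharing described above can be illustrated with a naive 2-D convolution in which one and the same kernel is applied at every position of the input (illustration only; padding and stride are ignored):

```python
import numpy as np

def conv2d(image, kernel):
    # The same kernel (shared weights) is applied at every spatial position.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(8, 8))
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])   # e.g. a vertical-edge kernel
feature_map = conv2d(image, kernel)     # shape (6, 6)
```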
  • Recurrent neural network is used to process sequence data.
  • in a traditional neural network, the layers are fully connected, while the nodes within each layer are unconnected.
  • although this ordinary neural network has solved many problems, it is still powerless for many others. For example, to predict the next word of a sentence, the previous words generally need to be used, because the preceding and following words in a sentence are not independent. RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs.
  • the specific form of expression is that the network will memorize the previous information and apply it to the calculation of the current output, that is, the nodes between the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only The output of the input layer also includes the output of the hidden layer at the previous moment.
  • RNN can process sequence data of any length.
  • the training of RNN is the same as the training of traditional CNN or DNN.
  • the error back-propagation algorithm is also used, but with one difference: if the RNN is unrolled over time, the parameters, such as W, are shared, whereas this is not the case for the traditional neural network described above.
  • the output of each step depends not only on the current step of the network, but also on the state of the previous steps of the network. This learning algorithm is called backpropagation through time (BPTT).
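  • A minimal sketch of this recurrence, showing that the same parameter matrices are reused at every time step (illustrative only):

```python
import numpy as np

T, D_in, D_h = 5, 4, 6
rng = np.random.default_rng(0)
inputs = rng.normal(size=(T, D_in))     # a sequence of T input vectors

# The same W_xh, W_hh and b are reused at every time step (shared parameters).
W_xh = rng.normal(size=(D_in, D_h))
W_hh = rng.normal(size=(D_h, D_h))
b = np.zeros(D_h)

h = np.zeros(D_h)
for t in range(T):
    # The hidden state depends on the current input and the previous hidden state.
    h = np.tanh(inputs[t] @ W_xh + h @ W_hh + b)
```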
  • taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference; training the deep neural network then becomes a process of reducing this loss as much as possible.
  • Residual network
  • the residual network includes a convolutional layer and/or a pooling layer.
  • the residual network can be understood as follows: in a deep neural network, in addition to connecting the hidden layers in sequence, for example, the first hidden layer connected to the second hidden layer, the second hidden layer connected to the third hidden layer, and the third hidden layer connected to the fourth hidden layer (this is one data operation path of the neural network, which may also be called neural network transmission), the residual network has an additional directly connected branch.
  • this directly connected branch goes directly from the first hidden layer to the fourth hidden layer, that is, it skips the processing of the second and third hidden layers and directly transmits the data of the first hidden layer to the fourth hidden layer for calculation.
  • the highway network can be understood as follows: in addition to the above-mentioned operation path and directly connected branch, the deep neural network also includes a weight acquisition branch. This branch introduces a transmission gate (transform gate) to acquire a weight value, and outputs a weight value T for the subsequent operations of the above operation path and the directly connected branch.
  • the convolutional neural network can use the error back propagation algorithm to modify the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forwarding the input signal until the output will cause error loss, and the parameters in the initial neural network model are updated by backpropagating the error loss information, so that the error loss is converged.
  • the back propagation algorithm is a back propagation process dominated by the error loss, and aims to obtain the optimal parameters of the neural network model, such as the weight matrices.
  • the pixel value of the image can be a Red-Green-Blue (RGB) color value, and the pixel value can be a long integer representing the color.
  • the pixel value is 255 ⁇ Red+100 ⁇ Green+76 ⁇ Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness.
  • the pixel values can be grayscale values.
  • Group action recognition (GAR), which can also be called group activity recognition, is used to identify what a group of people are doing in a video and is an important subject in computer vision. GAR has many potential applications, including video surveillance and sports video analysis. Compared with traditional single-person action recognition, GAR not only needs to recognize the behavior of the characters, but also needs to infer the potential relationships between the characters.
  • Group action recognition can use the following methods:
  • a group action is composed of different actions of several characters in the group, which is equivalent to actions completed by several characters in cooperation, and these character actions reflect different postures of the body.
  • the traditional method uses a step-by-step method to process the complex information of such an entity, and cannot make full use of its potential time and space dependence. Not only that, these methods are also very likely to destroy the co-occurrence relationship between the space domain and the time domain.
  • Existing methods often train the CNN network directly under the condition of extracting timing-dependent features. Therefore, the features extracted by the feature extraction network ignore the spatial dependence between people in the image.
  • the bounding box contains more redundant information, which may lower the accuracy of the extracted character's action features.
  • Fig. 3 is a schematic flow chart of a method for group action recognition.
  • For details, refer to "A Hierarchical Deep Temporal Model for Group Activity Recognition" (Ibrahim M S, Muralidharan S, Deng Z, et al., IEEE Conference on Computer Vision and Pattern Recognition, 2016: 1971-1980).
  • the existing algorithm is used to target several people in multiple video frames, and the size and position of each person in each video frame are determined.
  • the person CNN is used to extract the convolutional features of each person in each video frame, and the convolutional features are input into the person's long short-term memory network (LSTM) to extract the time series features of each person.
  • the convolution feature and time sequence feature corresponding to each person are spliced together as that person's action feature.
  • the character action characteristics of multiple characters in the video are spliced and max pooled to obtain the action characteristics of each video frame.
  • the action characteristics of each video frame are input into the group LSTM to obtain the corresponding characteristics of the video frame.
  • the feature corresponding to the video frame is input into the group action classifier to classify the input video, that is, the category to which the group action in the video belongs is determined.
  • the HDTM model includes a character CNN, a character LSTM, a group LSTM, and a group action classifier.
  • the existing algorithm is used to target several people in multiple video frames, and the size and position of each person in each video frame are determined.
  • Each character corresponds to a character action tag.
  • Each input video corresponds to a group action tag.
  • the first step is to train the character CNN, character LSTM, and character action classifier according to the character action label corresponding to each character, so as to obtain the trained character CNN and the trained character LSTM.
  • the second step of training is to train the parameters of the group LSTM and the group action classifier according to the group action tags, so as to obtain the trained group LSTM and the trained group action classifier.
  • the person CNN and the person LSTM are obtained, and the convolutional features and timing features of each person in the input video are extracted.
  • the second step of training is performed according to the feature representation of each video frame obtained by splicing the convolution features and time sequence features of the extracted multiple people.
  • the obtained neural network model can perform group action recognition on the input video.
  • the determination of the character's action feature representation of each character is carried out by the neural network model trained in the first step.
  • the fusion of the character action feature representations of multiple characters to identify group actions is performed by the neural network model trained in the second step.
  • Fig. 4 is a schematic flowchart of a method for group action recognition.
  • For details, refer to "Social scene understanding: End-to-end multi-person action localization and collective activity recognition" (Bagautdinov, Timur, et al., IEEE Conference on Computer Vision and Pattern Recognition, 2017: 4315-4324).
  • a step of training is required to obtain a neural network model that can recognize videos that include this specific type of group action. That is to say, the training image is input into the FCN, and the parameters of the FCN and RNN are adjusted according to the character action tag and group action tag of each character in the training image to obtain the trained FCN and RNN.
• The FCN can generate a multi-scale feature map F_t of the t-th frame image.
• Several detection frames B_t and corresponding probabilities p_t are generated through a deep fully convolutional network (DFCN), and B_t and p_t are sent to a Markov random field (MRF) to obtain trusted detection frames b_t; the features f_t corresponding to the trusted detection frames b_t are then determined from the multi-scale feature map F_t.
  • FCN can also be obtained through pre-training.
  • a group action is composed of different actions of several characters, and these character actions are reflected in the different body postures of each character.
  • the temporal characteristics of a character can reflect the time dependence of a character's actions.
  • the spatial dependence between character actions also provides important clues for group action recognition.
  • the accuracy of the group action recognition scheme that does not consider the spatial dependence between characters is affected to a certain extent.
  • an embodiment of the present application provides an image recognition method.
• When determining the group actions of multiple persons in this application, not only the temporal characteristics of the multiple persons are considered, but also their spatial characteristics are considered.
• By integrating the temporal characteristics and spatial characteristics of the multiple persons, the group actions of the multiple persons can be determined better and more accurately.
  • Fig. 5 is a schematic diagram of a system architecture according to an embodiment of the present application.
  • the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data collection system 560.
  • the execution device 510 includes a calculation module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514.
  • the calculation module 511 may include the target model/rule 501, and the preprocessing module 513 and the preprocessing module 514 are optional.
  • the data collection device 560 is used to collect training data.
• the training data may include multiple frames of training images (the multiple frames of training images include multiple characters, such as multiple persons) and corresponding labels, where the label gives the group action category of the persons.
  • the data collection device 560 stores the training data in the database 530, and the training device 520 obtains the target model/rule 501 based on the training data maintained in the database 530.
• the training device 520 recognizes the input multi-frame training images and compares the output prediction category with the label, until the difference between the prediction category output by the training device 520 and the label is less than a certain threshold, so that the training of the target model/rule 501 is completed.
• the above-mentioned target model/rule 501 can be used to implement the image recognition method of the embodiment of the present application, that is, one or more frames of images to be processed (after relevant preprocessing) are input into the target model/rule 501 to obtain the group action category of the persons in the one or more frames of images to be processed.
  • the target model/rule 501 in the embodiment of the present application may specifically be a neural network.
  • the training data maintained in the database 530 may not all come from the collection of the data collection device 560, and may also be received from other devices.
  • the training device 520 does not necessarily perform the training of the target model/rule 501 completely based on the training data maintained by the database 530. It may also obtain training data from the cloud or other places for model training.
• the above description should not be construed as a limitation on the embodiments of this application.
• the target model/rule 501 trained by the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG. 5. The execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) terminal, or an in-vehicle terminal, and may also be a server or a cloud.
  • the execution device 510 is configured with an input/output (input/output, I/O) interface 512 for data interaction with external devices.
  • the user can input data to the I/O interface 512 through the client device 540.
  • the input data in this embodiment of the present application may include: a to-be-processed image input by the client device.
  • the client device 540 here may specifically be a terminal device.
  • the preprocessing module 513 and the preprocessing module 514 are used to perform preprocessing according to the input data (such as the image to be processed) received by the I/O interface 512.
• the preprocessing module 513 and the preprocessing module 514 may be omitted, or only one preprocessing module may be present.
  • the calculation module 511 can be directly used to process the input data.
  • the execution device 510 may call data, codes, etc. in the data storage system 550 for corresponding processing.
  • the data, instructions, etc. obtained by corresponding processing may also be stored in the data storage system 550.
  • the I/O interface 512 presents the processing result, such as the group action category calculated by the target model/rule 501, to the client device 540, so as to provide it to the user.
• the group action category obtained by the target model/rule 501 in the calculation module 511 can be processed by the preprocessing module 513 (or the preprocessing module 514), and the processing result is then sent to the I/O interface, which sends it to the client device 540 for display.
• Alternatively, the calculation module 511 may directly transmit the group action category obtained by the processing to the I/O interface, which then sends the result to the client device 540 for display.
• the training device 520 can generate corresponding target models/rules 501 based on different training data for different goals or tasks, and the corresponding target models/rules 501 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.
  • the user can manually set input data, and the manual setting can be operated through the interface provided by the I/O interface 512.
  • the client device 540 can automatically send input data to the I/O interface 512. If the client device 540 is required to automatically send the input data and the user's authorization is required, the user can set the corresponding authority in the client device 540. The user can view the result output by the execution device 510 on the client device 540, and the specific presentation form may be a specific manner such as display, sound, and action.
  • the client device 540 can also be used as a data collection terminal to collect the input data of the input I/O interface 512 and the output result of the output I/O interface 512 as new sample data as shown in the figure, and store it in the database 530.
• Alternatively, the I/O interface 512 may directly store the input data input to the I/O interface 512 and the output result of the I/O interface 512, as shown in the figure, as new sample data in the database 530.
  • FIG. 5 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
• For example, in FIG. 5, the data storage system 550 is an external memory relative to the execution device 510; in other cases, the data storage system 550 may also be placed in the execution device 510.
  • the target model/rule 501 obtained by training according to the training device 520 may be the neural network in the embodiment of the present application.
• the neural network provided in the embodiment of the present application may be a CNN, a deep convolutional neural network (DCNN), and so on.
• CNN is a very commonly used neural network; the structure of a CNN is introduced below in conjunction with FIG. 6.
  • a convolutional neural network is a deep neural network with a convolutional structure. It is a deep learning architecture.
• A deep learning architecture refers to multi-level learning at different levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network can respond to the input image.
  • FIG. 6 is a schematic diagram of a convolutional neural network structure provided by an embodiment of the present application.
  • the convolutional neural network 600 may include an input layer 610, a convolutional layer/pooling layer 620 (the pooling layer is optional), and a fully connected layer 630.
• The following is a detailed introduction to the relevant content of these layers.
  • the convolutional layer/pooling layer 620 may include layers 621-626 as shown in the examples.
• In one example, layer 621 is a convolutional layer, layer 622 is a pooling layer, layer 623 is a convolutional layer, layer 624 is a pooling layer, layer 625 is a convolutional layer, and layer 626 is a pooling layer.
• In another example, layers 621 and 622 are convolutional layers, layer 623 is a pooling layer, layers 624 and 625 are convolutional layers, and layer 626 is a pooling layer.
• That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 621 can include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
• The convolution operator can essentially be a weight matrix, which is usually predefined. In the process of performing a convolution operation on the image, the weight matrix is usually processed pixel by pixel (or two pixels by two pixels, depending on the value of the stride) along the horizontal direction of the input image, so as to complete the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix and the depth dimension of the input image are the same.
• The weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (row × column), that is, multiple matrices of the same type, are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolutional image, where the dimension can be understood as determined by the "multiple" mentioned above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to eliminate unwanted noise in the image.
• Because the multiple weight matrices have the same size (row × column), the convolution feature maps extracted by the multiple weight matrices of the same size also have the same size, and the multiple extracted convolution feature maps of the same size are then combined to form the output of the convolution operation, as sketched below.
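• The following is a minimal sketch of the sliding-window convolution described above, showing how several weight matrices of the same size are applied with a given stride and stacked into the depth dimension; the shapes and values are illustrative assumptions.

```python
# Minimal 2D convolution sketch with stride and several kernels (illustrative only).
import numpy as np

def conv2d(image, kernels, stride=1):
    """image: (H, W, C); kernels: (N, kH, kW, C) -> output: (outH, outW, N)."""
    H, W, C = image.shape
    N, kH, kW, _ = kernels.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    out = np.zeros((out_h, out_w, N))
    for n in range(N):                       # one output channel per weight matrix
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i*stride:i*stride+kH, j*stride:j*stride+kW, :]
                out[i, j, n] = np.sum(patch * kernels[n])
    return out

img = np.random.rand(8, 8, 3)                # depth of the kernels matches image depth
kernels = np.random.randn(4, 3, 3, 3)        # 4 weight matrices of the same size
print(conv2d(img, kernels, stride=2).shape)  # (3, 3, 4): outputs stacked in depth
```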
• In practical applications, the weight values in these weight matrices need to be obtained through a lot of training. Each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 600 can make correct predictions.
• When the convolutional neural network 600 has multiple convolutional layers, the initial convolutional layer (such as 621) often extracts more general features, which can also be called low-level features; as the depth of the convolutional neural network 600 increases, the features extracted by the subsequent convolutional layers (such as 626) become more and more complex, such as high-level semantic features, and features with higher-level semantics are more suitable for the problem to be solved.
• Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. This can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the sole purpose of the pooling layer is to reduce the size of the image space.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image with a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of the average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the image size.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image of the input pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
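• A minimal sketch of the two pooling operators described above, with an assumed 2×2 pooling window:

```python
# Average and maximum pooling over non-overlapping 2x2 regions (illustrative only).
import numpy as np

def pool2d(image, size=2, mode="max"):
    H, W = image.shape
    out = np.zeros((H // size, W // size))
    for i in range(H // size):
        for j in range(W // size):
            region = image[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(img, mode="max"))   # each output pixel is the max of its 2x2 sub-region
print(pool2d(img, mode="avg"))   # each output pixel is the average of its sub-region
```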
• After processing by the convolutional layer/pooling layer 620, the convolutional neural network 600 is not yet able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 620 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network 600 needs to use the fully connected layer 630 to generate the output of one or a group of required classes. Therefore, the fully connected layer 630 may include multiple hidden layers (631, 632 to 63n as shown in FIG. 6) and an output layer 640, and the parameters included in the multiple hidden layers can be obtained by pre-training based on the relevant training data of the specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
• After the multiple hidden layers in the fully connected layer 630, the final layer of the entire convolutional neural network 600 is the output layer 640.
  • the output layer 640 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error.
  • the convolutional neural network 600 shown in FIG. 6 is only used as an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models.
• The convolutional neural network 600 shown in FIG. 6 can be used to execute the image recognition method of the embodiment of the present application. As shown in FIG. 6, after the image to be processed is processed by the input layer 610, the convolutional layer/pooling layer 620, and the fully connected layer 630, the group action category can be obtained.
  • FIG. 7 is a schematic diagram of a chip hardware structure provided by an embodiment of the application.
  • the chip includes a neural network processor 700.
  • the chip can be set in the execution device 510 as shown in FIG. 5 to complete the calculation work of the calculation module 511.
  • the chip can also be set in the training device 520 as shown in FIG. 5 to complete the training work of the training device 520 and output the target model/rule 501.
  • the algorithms of each layer in the convolutional neural network as shown in FIG. 6 can be implemented in the chip as shown in FIG. 7.
• The neural network processor (neural-network processing unit, NPU) 700 is mounted on a main central processing unit (central processing unit, CPU) (host CPU) as a coprocessor, and the main CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 703.
  • the controller 704 controls the arithmetic circuit 703 to extract data from the memory (weight memory or input memory) and perform calculations.
  • the arithmetic circuit 703 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 703 is a two-dimensional systolic array. The arithmetic circuit 703 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 703 is a general-purpose matrix processor.
  • the arithmetic circuit 703 fetches the data corresponding to matrix B from the weight memory 702 and buffers it on each PE in the arithmetic circuit 703.
  • the arithmetic circuit 703 takes the matrix A data and the matrix B from the input memory 701 to perform matrix operations, and the partial result or final result of the obtained matrix is stored in an accumulator 708.
  • the vector calculation unit 707 can perform further processing on the output of the arithmetic circuit 703, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 707 can be used for network calculations in the non-convolutional/non-FC layer of the neural network, such as pooling, batch normalization, local response normalization, etc. .
  • the vector calculation unit 707 can store the processed output vector to the unified buffer 706.
  • the vector calculation unit 707 may apply a nonlinear function to the output of the arithmetic circuit 703, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 707 generates normalized values, combined values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 703, for example for use in a subsequent layer in a neural network.
  • the unified memory 706 is used to store input data and output data.
• The direct memory access controller (DMAC) 705 transfers the input data in the external memory to the input memory 701 and/or the unified memory 706, stores the weight data in the external memory into the weight memory 702, and stores the data in the unified memory 706 into the external memory.
  • the bus interface unit (BIU) 710 is used to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 709 through the bus.
  • An instruction fetch buffer 709 connected to the controller 704 is used to store instructions used by the controller 704;
  • the controller 704 is used to call the instructions cached in the memory 709 to control the working process of the computing accelerator.
  • the unified memory 706, the input memory 701, the weight memory 702, and the instruction fetch memory 709 are all on-chip memories.
  • the external memory is a memory external to the NPU.
• The external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM) or a high bandwidth memory (HBM).
  • each layer in the convolutional neural network shown in FIG. 6 may be executed by the arithmetic circuit 703 or the vector calculation unit 707.
  • FIG. 8 is a schematic flowchart of a method for training a neural network model provided by an embodiment of the present application.
• S801 Obtain training data, where the training data includes T1 frames of training images and a labeled category.
  • the T1 frame training image corresponds to a label category.
  • T1 is a positive integer greater than 1.
• The T1 frames of training images can be consecutive multiple frames of images in a video, or multiple frames of images selected from a video according to a preset rule.
  • the T1 frame training image may be a multi-frame image obtained by selecting every preset time in a video, or it may be a multi-frame image with a preset number of frames in a video.
  • the training image of the T1 frame may include multiple characters, and the multiple characters may include only humans, animals, or both humans and animals.
  • the above-mentioned label category is used to indicate the category of the group action of the person in the training image of the T1 frame.
  • S802a Extract image features of the training image of frame T1.
  • At least one frame of images is selected from the T1 frame of training images, and image features of multiple people in each frame of the at least one frame of images are extracted.
  • the image feature of a certain person can be used to represent the body posture of the person in the frame of training image, that is, the relative position between different limbs of the person.
  • the above image features can be represented by vectors.
  • S802b Determine the spatial characteristics of multiple people in each frame of training image in at least one frame of training image.
• The spatial feature of the j-th person in the i-th frame of training image of the at least one frame of training image is determined based on the similarity between the image feature of the j-th person in the i-th frame of training image and the image features of the persons other than the j-th person in the i-th frame of image, where i and j are positive integers.
  • the spatial feature of the j-th person in the i-th training image is used to represent the actions of the j-th person in the i-th training image and the actions of other people except the j-th person in the i-th training image The relationship.
  • the similarity between corresponding image features of different characters in the same frame of image can reflect the spatial dependence of the actions of different characters. That is to say, when the similarity of the image features corresponding to two characters is higher, the correlation between the actions of the two characters is closer; conversely, when the similarity of the image features corresponding to the two characters is lower, this The weaker the association between the actions of the two characters.
  • S802c Determine the timing characteristics of each of the multiple characters in the at least one frame of training images in different frames of images.
• The time sequence feature of the j-th person in the i-th frame of training image of the at least one frame of training image is determined based on the similarity between the image feature of the j-th person in the i-th frame of training image and the image features of the j-th person in the frames of training images other than the i-th frame, where i and j are positive integers.
  • the time series feature of the j-th person in the i-th frame of training image is used to represent the relationship between the action of the j-th person in the i-th frame of training image and the action of the j-th person in the at least one frame of training image in other frames of the training image.
  • the similarity between corresponding image features of a person in two frames of images can reflect the degree of dependence of the person's actions on time.
  • S802d Determine the action features of multiple characters in each frame of training image in at least one frame of training image.
• The action feature of the j-th person in the i-th frame of training image is obtained by fusing the spatial feature of the j-th person in the i-th frame of training image, the time series feature of the j-th person in the i-th frame of training image, and the image feature of the j-th person in the i-th frame of training image.
  • the action features of each of the multiple characters in each frame of the training image in the at least one frame of training image may be fused to obtain the feature representation of each frame of the training image in the at least one frame of training image.
  • the average value of each bit represented by the training feature of each frame of the training image in the T1 training frame image can be calculated to obtain the average feature representation.
  • Each bit represented by the average training feature is the average value of the corresponding bit represented by the feature of each frame of the training image in the T1 frame of training image.
  • the classification can be performed based on the average feature representation, that is, the group actions of multiple characters in the training image of the T1 frame are recognized to obtain the training category.
  • the training category of each frame of training image in the at least one frame of training image may be determined.
  • the at least one frame of training images may be all or part of the training images in the T1 frame of training images.
  • S803 Determine the loss value of the neural network according to the training category and the label category.
  • the loss value L of the neural network can be expressed as:
• Here, N_Y represents the number of group action categories, that is, the number of categories output by the neural network; the label category is expressed by one-hot encoding and includes N_Y bits, each bit indicating one category; P_t represents the predicted category of the t-th frame of training image among the T1 frames, P_t is also represented by one-hot encoding and includes N_Y bits, each bit indicating one category; the t-th frame image can also be understood as the image at time t.
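• The exact loss formula is not reproduced in the text above; a minimal sketch assuming a standard cross-entropy loss between the one-hot label and the per-frame predictions P_t, averaged over the T1 frames, is as follows.

```python
# Hedged sketch: cross-entropy between the one-hot label and per-frame predictions.
# This assumed standard form is not necessarily the exact loss of the embodiment.
import numpy as np

def group_action_loss(pred, label_onehot):
    """pred: (T1, N_Y) softmax outputs P_t; label_onehot: (N_Y,) one-hot label."""
    eps = 1e-12
    per_frame = -np.sum(label_onehot * np.log(pred + eps), axis=1)  # one value per frame
    return per_frame.mean()                                         # average over T1 frames

T1, N_Y = 10, 8
pred = np.random.dirichlet(np.ones(N_Y), size=T1)   # each row sums to 1
label = np.eye(N_Y)[3]                               # labeled category: class 3
print(group_action_loss(pred, label))
```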
  • the training data generally includes a combination of multiple sets of training images and annotated categories.
• Each combination of training images and annotated categories may include one or more frames of training images, and the one or more frames of training images correspond to a unique label category.
• When the difference between the training category and the label category is within a certain preset range, or when the number of training iterations reaches a preset number, the model parameters of the neural network at this time are determined as the final parameters of the neural network model, thereby completing the training of the neural network.
  • FIG. 9 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • the image to be processed includes a plurality of people, and the image feature of the image to be processed includes the image feature of each of the plurality of people in each of the multiple frames of the image to be processed.
  • an image to be processed can be acquired.
  • the image to be processed can be obtained from the memory, or the image to be processed can also be received.
  • the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image obtained by the image recognition device from other equipment.
  • the received image or the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
  • the above-mentioned image to be processed may be a continuous multi-frame image in a video, or a multi-frame image selected according to a preset rule in a video.
• For example, in a video, multiple frames of images can be selected according to a preset time interval; or, in a video, multiple frames of images can be selected according to a preset frame number interval.
  • the multiple characters may include only humans, or only animals, or both humans and animals.
  • the image feature of a certain person can be used to represent the body posture of the person in the frame of image, that is, the relative position between different limbs of the person.
  • the image feature of a certain person mentioned above can be represented by a vector, which can be called an image feature vector.
  • the above-mentioned image feature extraction can be performed by CNN.
  • the person in the image can be identified to determine the bounding box of the person.
  • the image in each bounding box corresponds to a person.
• The bone nodes of the person in the bounding box corresponding to each person can be identified first, and then the image feature vector of the person can be extracted based on the bone nodes of the person, so that the extracted image features can more accurately reflect the actions of the person, which improves the accuracy of the extracted image features.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors to obtain a processed image, and then image feature extraction is performed on the processed image.
  • the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the person in the bounding box is located may be masked to obtain the partially visible image.
  • the color of the pixel corresponding to the area outside the area where the bone node is located can be set to a certain preset color, such as black.
  • the area where the bone node is located retains the same information as the original image, and the information of the area outside the area where the bone node is located is masked. Therefore, when extracting image features, only the image features of the above-mentioned partially visible image need to be extracted, and there is no need to extract the above-mentioned masked area.
  • the area where the aforementioned bone node is located may be a square, circle or other shape centered on the bone node.
  • the side length (or radius), area, etc. of the region where the bone node is located can be preset values.
  • the above method of extracting the image features of the image to be processed can extract the features according to the locally visible image to obtain the image feature vector of the person corresponding to the bounding box; it can also determine the masking matrix according to the bone node, and mask the image according to the masking matrix . For details, refer to the description of FIG. 11 and FIG. 12.
  • target tracking can be used to identify different people in the image.
• The sub-features of the persons in the image can be used to distinguish between different persons in the image.
  • the sub-features can be colors, edges, motion information, texture information, and so on.
  • S902 Determine the spatial feature of each of the multiple persons in each of the multiple frames of images.
  • the spatial correlation between the actions of different characters in the frame of image is determined.
• The spatial feature of the j-th person in the i-th frame of the image to be processed can be determined based on the similarity between the image feature of the j-th person in the i-th frame of image and the image features of the persons other than the j-th person in the i-th frame of image, where i and j are positive integers.
• The spatial feature of the j-th person in the i-th frame of image is used to represent the relationship between the action of the j-th person in the i-th frame of image and the actions of the persons other than the j-th person in the i-th frame of image.
• The similarity between the image feature vector of the j-th person in the i-th frame of image and the image feature vectors of the persons other than the j-th person can reflect the degree to which the action of the j-th person in the i-th frame of image depends on the actions of the other persons. That is to say, when the similarity of the image feature vectors corresponding to two persons is higher, the correlation between the actions of the two persons is closer; conversely, when the similarity of the image feature vectors corresponding to the two persons is lower, the association between the actions of these two persons is weaker. Refer to the descriptions of FIG. 14 and FIG. 15 for the spatial association relationship of the actions of different persons in a frame of image.
  • S903. Determine the time sequence characteristics of each of the multiple persons in each frame of the multiple frames of images.
  • the time correlation between the actions of the person at different moments is determined.
• The time sequence feature of the j-th person in the i-th frame of the image to be processed can be determined based on the similarity between the image feature of the j-th person in the i-th frame of image and the image features of the j-th person in the frames other than the i-th frame, where i and j are positive integers.
  • the time series feature of the j-th person in the i-th frame image is used to indicate the relationship between the action of the j-th person in the i-th frame image and the action in other frame images except the i-th frame image.
  • the similarity between corresponding image features of a person in two frames of images can reflect the degree of dependence of the person's actions on time.
• The similarity between image features can be calculated using, for example, a Minkowski distance (such as the Euclidean distance, the Manhattan distance, or the Chebyshev distance), the cosine similarity, or the Hamming distance. The similarity can also be calculated as the sum of the products of the corresponding bits of the two features after a linear transformation, as illustrated in the sketch below.
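• A minimal sketch of these similarity measures, with assumed feature dimensions; the projection matrices W1 and W2 used for the linear transformation are random stand-ins for learned parameters.

```python
# Illustrative similarity measures between two image feature vectors (assumed shapes).
import numpy as np

a = np.random.rand(128)
b = np.random.rand(128)

euclidean = np.linalg.norm(a - b)                        # Minkowski distance, p = 2
manhattan = np.abs(a - b).sum()                          # Minkowski distance, p = 1
chebyshev = np.abs(a - b).max()                          # Minkowski distance, p -> inf
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b)) # cosine similarity

# Similarity as the sum of bit-wise products after a linear transformation
# (W1, W2 are hypothetical stand-ins for learned projection matrices).
W1 = np.random.randn(64, 128)
W2 = np.random.randn(64, 128)
linear_similarity = (W1 @ a) @ (W2 @ b)
print(euclidean, manhattan, chebyshev, cosine, linear_similarity)
```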
  • the spatial correlation between actions of different characters and the temporal correlation between actions of the same character can provide important clues to the categories of multi-person scenes in the image. Therefore, in the image recognition process of the present application, by comprehensively considering the spatial association relationship between different person actions and the time association relationship between the same person actions, the accuracy of recognition can be effectively improved.
  • S904 Determine the action feature of each of the multiple persons in each frame of the multiple frames of images.
• Specifically, the time sequence feature, the spatial feature, and the image feature corresponding to a person in a frame of image can be fused to obtain the action feature of the person in the frame of image.
  • the spatial characteristics of the j-th person in the i-th frame of the image to be processed, the temporal characteristics of the j-th person in the i-th frame of the image, and the image characteristics of the j-th person in the i-th frame of the image can be fused to obtain The action feature of the j-th person in the i-th frame image.
  • the first way is to use a combination (combine) method for integration.
  • the features to be fused can be added directly or weighted.
  • weighted addition is to add the features to be fused by a certain coefficient, that is, the weight value.
  • the channel wise (channel wise) can be linearly combined.
• For example, multiple features output by multiple layers of the feature extraction network can be added together: they can be added directly, or added according to certain weights. The fused feature can be expressed as T = a·T1 + b·T2, where T1 and T2 respectively represent the features output by two layers of the feature extraction network, and a and b are the corresponding coefficients, that is, the weight values, with a ≥ 0 and b ≥ 0.
• The second way is to use concatenation (cascade) and channel fusion, which is another way of fusion.
  • the dimensions of the features to be fused can be directly spliced, or they can be spliced after being multiplied by a certain coefficient, that is, a weight value.
  • the third way is to use the pooling layer to process the above-mentioned features, so as to realize the integration of the above-mentioned features.
• For example, maximum pooling can be performed on multiple feature vectors to determine a target feature vector; in the target feature vector obtained by maximum pooling, each bit is the maximum value of the corresponding bit in the multiple feature vectors. It is also possible to perform average pooling on multiple feature vectors to determine the target feature vector; in the target feature vector obtained by average pooling, each bit is the average value of the corresponding bit in the multiple feature vectors.
• For example, the features corresponding to a person in a frame of image can be merged in the combination manner to obtain the action feature of the person in the frame of image; a sketch of the three fusion strategies is given below.
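• A minimal sketch of the three fusion strategies (combination, concatenation, pooling), with assumed feature dimensions and illustrative weight values:

```python
# Three illustrative fusion strategies for per-person features (assumed dimensions).
import numpy as np

image_feat    = np.random.rand(256)   # image feature of person j in frame i
spatial_feat  = np.random.rand(256)
temporal_feat = np.random.rand(256)
feats = np.stack([image_feat, spatial_feat, temporal_feat])   # (3, 256)

# 1) combine: direct or weighted addition (weights a, b, c are illustrative)
a, b, c = 1.0, 0.5, 0.5
combined = a * image_feat + b * spatial_feat + c * temporal_feat

# 2) concatenate (cascade) along the channel dimension
concatenated = np.concatenate([image_feat, spatial_feat, temporal_feat])  # (768,)

# 3) pooling across the features: each bit is the max / average of corresponding bits
max_pooled = feats.max(axis=0)
avg_pooled = feats.mean(axis=0)
print(combined.shape, concatenated.shape, max_pooled.shape, avg_pooled.shape)
```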
  • the feature vector group corresponding to at least one person in the i-th frame image may further include a time-series feature vector corresponding to at least one person in the i-th frame image.
  • S905 Recognizing group actions of multiple people in the image to be processed according to the action feature of each person in the multiple images in each frame of the image.
  • group actions are composed of actions of several characters in the group, that is, actions completed by multiple characters.
  • the group actions of multiple characters in the image to be processed may be a certain sport or activity.
• For example, the group actions of multiple persons in the image to be processed may be basketball, volleyball, football or dancing, etc.
  • the motion characteristics of each frame of the image may be determined according to the motion characteristics of each of the multiple characters in each frame of the image to be processed. Then, the group actions of multiple people in the image to be processed can be identified according to the action characteristics of each frame of image.
  • the action characteristics of multiple characters in a frame of image can be merged by means of maximum pooling, so as to obtain the action characteristics of the frame of image.
• Alternatively, the action features of the multiple persons in each frame of image can be fused to obtain the action feature of that frame of image, then the action feature of each frame of image is input into the classification module to obtain the action classification result of each frame of image, and the classification result corresponding to the largest number of images among the output categories of the classification module is taken as the group action of the multiple persons in the image to be processed.
  • the action features of multiple people in each frame of image can be fused to obtain the action feature of the frame of image, and then the action feature of each frame of image obtained above can be averaged to obtain the average of each frame of image Action feature, and then input the average action feature of each frame of image to the classification module, and then the classification result corresponding to the average action feature of each frame of image is regarded as the group action of multiple people in the image to be processed.
  • a frame of image can be selected from the image to be processed, and the action feature of the frame of image obtained by fusing the action features of multiple characters in the frame of image is input into the classification module to obtain the classification result of the frame of image, Then, the classification result of the frame image is taken as the group action of multiple people in the image to be processed.
• Alternatively, the action features of each of the multiple persons in each frame of the image to be processed can be input into the classification module to obtain a classification result of the action features of each of the multiple persons, that is, the action of each person; the action corresponding to the largest number of persons is then regarded as the group action of the multiple persons.
• Alternatively, a certain person can be selected from the multiple persons, and the action features of that person in each frame of image can be input into the classification module to obtain the classification result of that person's action features, that is, the action of that person; the obtained action of that person is then used as the group action of the multiple persons in the image to be processed. A sketch of two of these classification strategies is given below.
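• A minimal sketch of two of the classification strategies described above (per-frame majority voting and classification of the averaged frame feature); the classifier here is a random stand-in for the trained classification module.

```python
# Illustrative ways to turn per-frame classification results into one group action label.
import numpy as np

rng = np.random.default_rng(0)
n_classes = 8
W = rng.standard_normal((n_classes, 512))           # stand-in for the trained classifier

def classify(feats):
    return (feats @ W.T).argmax(axis=1)              # predicted category per row

frame_feats = rng.random((10, 512))                  # T=10 per-frame action features

# Strategy A: classify every frame, take the category predicted for the most frames
per_frame_pred = classify(frame_feats)
majority_vote = np.bincount(per_frame_pred, minlength=n_classes).argmax()

# Strategy B: average the frame action features first, then classify once
average_pred = classify(frame_feats.mean(axis=0, keepdims=True))[0]
print(majority_vote, average_pred)
```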
  • Steps S901 to S904 can be implemented by the neural network model trained in FIG. 8.
• Alternatively, the time sequence features may be determined first, and then the spatial features may be determined.
  • the method shown in Figure 9 not only considers the temporal characteristics of multiple characters when determining group actions of multiple characters, but also takes into account the spatial characteristics of multiple characters. It can be better by integrating the temporal and spatial characteristics of multiple characters. Determine the group actions of multiple characters more accurately.
• Optionally, label information of the image to be processed is generated according to the group action, and the label information is used to indicate the group action of the multiple persons.
  • the foregoing method can be used, for example, to classify a video library, and tag different videos in the video library according to their corresponding group actions, so as to facilitate users to view and find.
  • the key person in the image to be processed is determined according to the group actions.
  • the contribution of each of the multiple characters in the image to be processed to the group action can be determined first, and then the person with the highest contribution rate is determined as the key person.
  • the above method can be used to detect key persons in a video image, for example.
  • the video contains several persons, most of which are not important. Detecting the key person effectively helps to understand the video content more quickly and accurately based on the information around the key person.
• For example, the player who holds the ball has the greatest impact on all personnel present, including players, referees, and spectators, and also contributes the most to the group action; therefore, the player who holds the ball can be identified as the key person. By identifying the key person, people watching the video can be helped to understand what is going on and what is about to happen in the game.
  • FIG. 10 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • the image to be processed includes at least one frame of image, and the image features of the image to be processed include image features of multiple people in the image to be processed.
  • an image to be processed can be acquired.
  • the image to be processed can be obtained from the memory, or the image to be processed can also be received.
  • the image to be processed may be an image obtained from the image recognition device, or the image to be processed may also be an image obtained by the image recognition device from other equipment.
  • the received image or the above-mentioned image to be processed may also be captured by the camera of the image recognition device.
  • the above-mentioned image to be processed may be one frame of image or multiple frames of image.
• When the above-mentioned image to be processed includes multiple frames, they may be consecutive multiple frames of images in a video, or multiple frames of images selected from a video according to a preset rule. For example, in a video, multiple frames of images can be selected according to a preset time interval; or, in a video, multiple frames of images can be selected according to a preset frame number interval.
  • the above-mentioned image to be processed may include a plurality of persons, and the plurality of persons may include only humans, animals, or both humans and animals.
  • step S901 in FIG. 9 may be used to extract the image features of the image to be processed.
  • S1002 Determine the spatial characteristics of multiple people in each frame of the image to be processed.
  • the spatial characteristics of a certain person among the multiple characters in each frame of the to-be-processed image are based on the image characteristics of the person in the frame of the image to be processed and the images of other people except the person in the frame of the image to be processed The similarity of features is determined.
  • step S902 in FIG. 9 may be used to determine the spatial characteristics of multiple persons in each frame of the image to be processed.
  • S1003 Determine the action characteristics of multiple people in each frame of the image to be processed.
• The action feature of a person among the multiple persons in each frame of the image to be processed is obtained by fusing the spatial feature of the person in that frame of the image to be processed and the image feature of the person in that frame of the image to be processed.
• Specifically, the fusion method shown in step S904 in FIG. 9 may be used to determine the action features of the multiple persons in each frame of the image to be processed.
  • S1004 Identify group actions of multiple people in the image to be processed according to the action characteristics of multiple people in each frame of the image to be processed.
  • step S905 in FIG. 9 may be used to identify group actions of multiple characters in the image to be processed.
  • FIG. 11 is a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • the image to be processed includes multiple frames of images, and the image features of the image to be processed include image features of multiple people in each frame of at least one frame of image selected from the multiple frames of images.
  • feature extraction can be performed on images corresponding to multiple people in the input multiple frames of images.
  • the image feature of a certain person can be used to represent the body posture of the person in the frame of image, that is, the relative position between different limbs of the person.
  • the image feature of a certain person mentioned above can be represented by a vector, which can be called an image feature vector.
  • the above-mentioned image feature extraction can be performed by CNN.
  • target tracking can be performed on each person, and the bounding box of each person in each frame of the image is determined, and the image in each bounding box corresponds to a person, and then The feature extraction is performed on the image of each of the above-mentioned bounding boxes to obtain the image feature of each person.
  • the bone nodes in the bounding box may be connected according to the structure of the person to obtain a connected image, and then image feature vector extraction is performed on the connected image.
  • the area where the bone node is located and the area outside the area where the bone node is located can be displayed in different colors, and then image feature extraction is performed on the processed image.
  • the locally visible image corresponding to the bounding box may be determined according to the image area where the bone node of the above-mentioned person is located, and then feature extraction is performed on the locally visible image to obtain the image feature of the image to be processed.
  • the above-mentioned partially visible image is an image composed of the area where the bone node of the person in the image to be processed is located. Specifically, the area outside the area where the bone node of the character is located in the bounding box may be masked to obtain the partially visible image.
  • the color of the pixel corresponding to the area outside the area where the bone node is located can be set to a certain preset color, such as black.
  • the area where the bone node is located retains the same information as the original image, and the information of the area outside the area where the bone node is located is masked. Therefore, when extracting image features, only the image features of the above-mentioned partially visible image need to be extracted, and there is no need to extract the above-mentioned masked area.
  • the area where the aforementioned bone node is located may be a square, circle or other shape centered on the bone node.
  • the side length (or radius), area, etc. of the region where the bone node is located can be preset values.
  • the above method of extracting the image features of the image to be processed can extract the features according to the locally visible image to obtain the image feature vector of the person corresponding to the bounding box; it can also determine the masking matrix according to the bone node, and mask the image according to the masking matrix .
  • the following is a specific example of the above method of determining the masking matrix based on the bone node.
• In the masking matrix, the value in the square area centered on each bone node with side length l is set to 1, and the values at other positions are set to 0. The RGB model assigns an intensity value in the range of 0 to 255 to the RGB components of each pixel in the image, and when the RGB color mode is used, the masking matrix is defined accordingly.
• The masking matrix is used to mask the original person action image to obtain the partially visible image. Each bit of the masking matrix can represent a pixel, and the RGB component of each pixel takes a value between 0 and 1. The masking operation means that each bit of the masking matrix is multiplied by the corresponding bit of the original image.
• FIG. 12 is a schematic diagram of a process of obtaining a partially visible image provided by an embodiment of the present application. As shown in FIG. 12, the image is masked: the square area of side length l around each bone node is kept, and the other areas are masked, as sketched below.
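• A minimal sketch of the masking operation, assuming RGB values in [0, 1], hypothetical bone-node coordinates, and a preset side length l:

```python
# Illustrative construction of a masking matrix from bone-node coordinates and its
# element-wise application to a person crop (side length l and coordinates assumed).
import numpy as np

def partially_visible(image, bone_nodes, l=8):
    """image: (H, W, 3) with RGB values in [0, 1]; bone_nodes: list of (row, col)."""
    H, W, _ = image.shape
    mask = np.zeros((H, W, 1))
    half = l // 2
    for r, c in bone_nodes:
        r0, r1 = max(0, r - half), min(H, r + half)   # square of side length l
        c0, c1 = max(0, c - half), min(W, c + half)   # centered on the bone node
        mask[r0:r1, c0:c1] = 1.0                       # keep this area
    return image * mask                                # masked areas become 0 (black)

crop = np.random.rand(64, 48, 3)
nodes = [(10, 20), (30, 24), (50, 22)]                 # hypothetical bone-node positions
visible = partially_visible(crop, nodes)
```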
  • the T frame images all include images of K people.
• The extracted image feature of each person can be represented by a D-dimensional vector.
  • the image feature extraction of the above-mentioned T frame image can be performed by CNN.
• The set of image features of the K persons in the T frames of images can be expressed as X. For each person, extracting image features from the partially visible image can reduce redundant information in the bounding box, extract image features based on body structure information, and enhance the ability of the image features to express human actions.
• S1102 Determine the dependence relationship between the actions of different persons in the image to be processed, and the dependence relationship between the actions of the same person at different moments.
  • a cross interaction module (CIM) is used to determine the spatial correlation of the actions of different characters in the image to be processed, and the temporal correlation of the actions of the same character at different times.
  • the cross interaction module is used to implement feature interaction and establish a feature interaction model.
  • the feature interaction model is used to represent the relationship of the character's body posture in time and/or space.
  • Spatial dependence is used to express the dependence of a character's body posture in a certain frame of image on the body posture of other characters in this frame of image, that is, the spatial dependence of character actions.
  • the above-mentioned spatial dependence can be expressed by a spatial feature vector.
• Assuming that one frame of the image to be processed corresponds to the image at time t, then at time t the spatial feature vector of the k-th person can be expressed in terms of the image features of the K persons in that frame, where K represents that there are K persons in the frame of image corresponding to time t, and θ(), φ() and g() respectively represent three linear embedding functions, which can be the same or different. The function r(a, b) can reflect the dependence of feature b on feature a.
  • the spatial dependence between the body postures of different characters in the same frame of image can be determined.
  • Time dependence can also be called timing dependence, which is used to indicate the dependence of the character's body posture in a certain frame of image on the character's body posture in other frame images, that is, the inherent temporal dependence of a character's actions.
  • the above-mentioned time dependence can be expressed by a time series feature vector.
• Similarly, assuming that one frame of the image to be processed corresponds to the image at time t, then at time t the time series feature vector of the k-th person can be expressed in terms of the image feature of the k-th person at time t and the image features of the k-th person at the other times t', where T indicates that the image to be processed includes images at T times, that is, T frames of images.
  • the time dependence between the body postures of the same person at different times can be determined.
• The spatio-temporal feature vector can be expressed as the result of an "add" operation on the time series feature vector and the spatial feature vector, as sketched below.
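• Since the exact formulas are not reproduced in the text above, the following is a hedged sketch of one plausible reading of the cross interaction module: r(a, b) is the similarity of two linearly embedded features, the spatial/temporal feature of the k-th person at time t averages the similarity-weighted embeddings g(·) of the other persons at time t / of the same person at the other times, and the spatio-temporal feature is their element-wise sum.

```python
# Hedged sketch of the cross interaction module (CIM); the aggregation form and the
# random embedding matrices below are assumptions, not the embodiment's exact formulas.
import numpy as np

rng = np.random.default_rng(0)
T, K, D = 5, 6, 64                       # frames, persons, feature dimension
X = rng.standard_normal((T, K, D))       # image features of each person in each frame

W_theta = rng.standard_normal((D, D))    # three linear embedding functions
W_phi   = rng.standard_normal((D, D))    # (theta, phi, g); they may be the same or not
W_g     = rng.standard_normal((D, D))

def r(a, b):
    """Similarity of feature b to feature a after linear embeddings."""
    return (a @ W_theta) @ (b @ W_phi)

def spatial_feature(t, k):
    others = [k2 for k2 in range(K) if k2 != k]
    return np.mean([r(X[t, k], X[t, k2]) * (X[t, k2] @ W_g) for k2 in others], axis=0)

def temporal_feature(t, k):
    others = [t2 for t2 in range(T) if t2 != t]
    return np.mean([r(X[t, k], X[t2, k]) * (X[t2, k] @ W_g) for t2 in others], axis=0)

# spatio-temporal feature: element-wise "add" of the two parts, as stated above
H = np.stack([[spatial_feature(t, k) + temporal_feature(t, k) for k in range(K)]
              for t in range(T)])        # (T, K, D)
print(H.shape)
```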
  • FIG. 13 is a schematic diagram of a method for calculating similarity between image features according to an embodiment of the present application.
• As shown in FIG. 13, the vector representation of the similarity between the image feature of the k-th person at time t and the image features of the other persons at time t, and the vector representation of the similarity between the image feature of the k-th person at time t and the image features of the k-th person at other times, are combined by averaging (Avg) to determine the spatio-temporal feature vector of the k-th person at time t.
• The set of spatio-temporal feature vectors of the K persons in the T frames of images can be expressed as H.
  • S1103 Fuse the image feature with the spatio-temporal feature vector to obtain the action feature of each frame of image.
• That is, the image features and the spatio-temporal feature vectors of dimensions T × K × D are fused to obtain the action feature of each of the images at the T times.
  • the motion feature of each frame of image can be represented by a motion feature vector.
• The set of action feature vectors of the K persons in the T frames of images can be expressed as B, with dimensions T × K × D.
  • S1104 Perform classification prediction on the action feature of each frame of image to determine the group action of the image to be processed.
  • the classification module can be a softmax classifier.
  • the classification result of the classification module can be one-hot coded, that is, only one bit is valid in the output result.
  • the category corresponding to the classification result of any image feature vector is the only category among the output categories of the classification module.
  • the action feature vector z t of a frame of image at time t can be input to the classification module to obtain the classification result of the frame of image.
  • the classification result of z t at any time t by the classification module can be used as the classification result of the group action in the T frame image.
  • the classification result of the group action in the T frame image can also be understood as the classification result of the group action of the person in the T frame image, or the classification result of the T frame image.
  • the action feature vectors z 1 , z 2 ,..., z T of the T frame images can be input into the classification module respectively to obtain the classification result of each frame of image.
  • the classification result of the T frame image can belong to one or more categories.
  • the category with the largest number of images in the corresponding T-frame image in the output category of the classification module can be used as the classification result of the group action in the T-frame image.
  • The action feature vectors z_1, z_2, ..., z_T of the T frames of images can also be averaged to obtain an average action feature vector, in which each element is the average of the corresponding elements of z_1, z_2, ..., z_T.
  • The average action feature vector is then input into the classification module to obtain the classification result of the group action in the T frames of images.
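  • The three classification strategies described above (classify a single frame, classify every frame and vote, or classify the average action feature vector) could be realized as in the following sketch, which assumes a simple linear classifier with parameters W and b feeding a softmax; these parameters and the helper names are illustrative only.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def classify_group_action(z, W, b, strategy="average"):
    """z: (T, D) per-frame action feature vectors.
    W: (C, D) weights and b: (C,) bias of an assumed linear classifier."""
    if strategy == "single":
        # Use the classification result of any one frame, e.g. the first.
        return int(np.argmax(softmax(W @ z[0] + b)))
    if strategy == "vote":
        # Classify every frame and keep the category chosen most often.
        preds = [int(np.argmax(softmax(W @ z_t + b))) for z_t in z]
        return max(set(preds), key=preds.count)
    # "average": average z_1, ..., z_T element-wise, then classify once.
    return int(np.argmax(softmax(W @ z.mean(axis=0) + b)))
```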
  • The above method can complete the complex reasoning process of group action recognition: image features are extracted from multiple frames of images; temporal and spatial features are determined according to the interdependence between the actions of different people in an image and of the same person at different times; the temporal features, spatial features, and image features are fused to obtain the action feature of each frame of image; and the group action in the multiple frames of images is then inferred by classifying the action feature of each frame of image.
  • Because the spatial features do not depend on the temporal features, recognition may also be performed by considering only the spatial features of the multiple persons, which makes it convenient to determine the group action of the multiple persons.
  • Table 1 shows the recognition accuracy achieved on public data sets by the trained neural network model using the image recognition method provided in the embodiments of the present application.
  • The multi-class accuracy (MCA) indicates the proportion of correctly classified results among all of the neural network's classification results on the data containing group actions.
  • The mean per-class accuracy (MPCA) indicates the average, over all categories, of the ratio of the number of correctly classified results of each category to the number of data samples of that category.
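  • Under the usual definitions assumed here, MCA and MPCA can be computed from predicted and true group-action labels as follows (the function names are illustrative):

```python
import numpy as np

def mca(y_true, y_pred):
    # Multi-class accuracy: fraction of all samples classified correctly.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def mpca(y_true, y_pred):
    # Mean per-class accuracy: accuracy of each category, averaged over categories.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(per_class))
```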
  • The training of the neural network can be completed without relying on per-person action labels.
  • an end-to-end training method is adopted, that is, the neural network is adjusted only according to the final classification results.
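  • A sketch of one end-to-end update step under these assumptions: the only supervision is the group-action label, and the loss computed from the final classification result is back-propagated through the whole network. The model, optimizer, and input names are placeholders rather than a prescribed training loop.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, frames, group_label):
    """One end-to-end update. `frames` is the multi-frame input for one sample
    and `group_label` is a tensor of shape (1,) holding the group-action class
    index; no per-person action labels are used."""
    optimizer.zero_grad()
    logits = model(frames)                               # (1, num_group_actions)
    loss = nn.functional.cross_entropy(logits, group_label)
    loss.backward()                                      # gradients flow through all modules
    optimizer.step()
    return loss.item()
```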
  • Feature interaction is used to determine the dependence between persons and the dependence of a person's actions over time.
  • the similarity between the two image features is calculated by the function r(a,b).
  • the spatial feature vector of each person in the frame of image is determined.
  • the spatial feature vector of a person in a frame of image is used to express the person's spatial dependence on other people in the frame of image, that is, the dependence of the person's body posture on the body posture of other people.
  • FIG. 14 is a schematic diagram of the spatial relationship between different character actions provided by an embodiment of the present application.
  • the spatial dependence matrix of FIG. 15 represents the dependence of each person in the group action on the body posture of other people.
  • Each element in the spatial dependence matrix is represented by a square, and the color (that is, the brightness) of the square represents the similarity between the image features of the two corresponding persons, that is, the calculation result of the function r(a, b).
  • The calculation result of the function r(a, b) can be normalized, that is, mapped to a value between 0 and 1, so that the spatial dependence matrix can be drawn.
  • The hitter in FIG. 14 is the No. 10 player, whose action has a greater influence on the follow-up actions of her teammates.
  • The function r(a, b) can reflect cases in which the body posture of a person in a frame of image is highly correlated with, that is, highly dependent on, the body postures of other people.
  • the spatial dependency between the body postures of players 1-6 is weak.
  • the neural network provided by the embodiments of the present application can better reflect the dependency or association relationship between the body posture of one person and the body posture of other people in a frame of image.
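  • A short sketch of how such a spatial dependence matrix could be built and normalized for display, assuming a dot product in place of r(a, b):

```python
import numpy as np

def spatial_dependence_matrix(frame_feats):
    """frame_feats: (K, D) image features of the K persons in one frame.
    Returns a K x K matrix of pairwise similarities mapped to [0, 1];
    a dot product stands in for r(a, b)."""
    m = frame_feats @ frame_feats.T                      # pairwise r(a, b) values
    # Map to [0, 1] so the matrix can be drawn as a brightness map.
    return (m - m.min()) / (m.max() - m.min() + 1e-8)
```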
  • the time sequence feature vector of the person in one frame of image is determined.
  • the time series feature vector of a person in one frame of image is used to represent the dependence of the person's body posture on the body posture of the person in other frames of images.
  • The body postures of the No. 10 player shown in FIG. 14 over 10 frames of images, arranged in time sequence, are shown in FIG. 16.
  • The time dependence matrix of FIG. 17 indicates the time dependence of the body posture of the No. 10 player.
  • Each element in the time dependence matrix is represented by a square, and the color (that is, the brightness) of the square represents the similarity between the image features of the No. 10 player at the two corresponding times, that is, the calculation result of the function r(a, b).
  • the body posture of the No. 10 player in the 10-frame image corresponds to the take-off (frames 1-3), floating (frames 4-8) and landing (frames 9-10).
  • "jumping" and "landing" should be more discriminative.
  • the image features of the No. 10 player in the 2nd and 10th frames are relatively similar to the image features in other images.
  • The image features of the 4th to 8th frames, that is, the image features of the No. 10 player in the floating state, have low similarity with the image features in the other images. Therefore, the neural network provided by the embodiments of the present application can better reflect the temporal association relationship of a person's body posture across multiple frames of images.
  • FIG. 18 is a schematic diagram of the system architecture of an image recognition device provided by an embodiment of the present application.
  • the image recognition device shown in FIG. 18 includes a feature extraction module 1801, a cross interaction module 1802, a feature fusion module 1803, and a classification module 1804.
  • the image recognition device in FIG. 18 can execute the image recognition method of the embodiment of the present application. The process of processing the input picture by the image recognition device will be introduced below.
  • The feature extraction module 1801, which may also be referred to as a partial-body extractor module, is used to extract the image features of each person according to the bone nodes of the person in the image.
  • the function of the feature extraction module 1801 can be realized by using a convolutional network.
  • the multi-frame images are input to the feature extraction module 1801.
  • the image feature of a person can be represented by a vector, and the vector representing the image feature of the person can be called the image feature vector of the person.
  • the cross interaction module 1802 is used to map the image features of multiple characters in each frame of the multi-frame images to the spatio-temporal interaction features of each person.
  • the time-space interaction feature is used to indicate the "time-space" associated information of a certain character.
  • the time-space interaction feature of a person in a frame of image may be obtained by fusing the temporal feature and spatial feature of the person in the frame of image.
  • the cross interaction module 1802 may be implemented by a convolutional layer and/or a fully connected layer.
  • the feature fusion module 1803 is used to fuse the action feature of each person in a frame of image with the time-space interaction feature to obtain the image feature vector of the frame of image.
  • the image feature vector of the frame image can be used as the feature representation of the frame image.
  • the classification module 1804 is configured to classify according to the image feature vector, so as to determine the category of the group action of the person in the T frame image input to the feature extraction module 1801.
  • the classification module 1804 may be a classifier.
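  • The following skeleton shows one possible wiring of the four modules in FIG. 18. The specific layer choices (a small fully connected layer for the cross interaction module, addition plus mean pooling for feature fusion, a linear classifier) and all names are assumptions made for illustration, not the filed implementation.

```python
import torch
import torch.nn as nn

class GroupActionRecognizer(nn.Module):
    """Illustrative skeleton of the four modules shown in FIG. 18."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        # 1801: feature extraction (placeholder MLP over per-person input;
        # the text suggests a convolutional network over bone nodes).
        self.feature_extractor = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        # 1802: cross interaction module (a fully connected layer, one of the
        # options mentioned for its implementation).
        self.cross_interaction = nn.Linear(feat_dim, feat_dim)
        # 1804: classification module (softmax classifier; the softmax is applied
        # by the loss during training or at inference time).
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, person_inputs):        # person_inputs: (T, K, feat_dim)
        feats = self.feature_extractor(person_inputs)
        st = self.cross_interaction(feats)   # stands in for spatio-temporal interaction
        fused = feats + st                   # 1803: feature fusion by addition (assumed)
        frame_feats = fused.mean(dim=1)      # one feature vector per frame
        video_feat = frame_feats.mean(dim=0) # pool over the T frames
        return self.classifier(video_feat).unsqueeze(0)   # (1, num_classes)
```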
  • the image recognition device shown in FIG. 18 can be used to execute the image recognition method shown in FIG. 11.
  • FIG. 19 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application.
  • the image recognition device 3000 shown in FIG. 19 includes an acquisition unit 3001 and a processing unit 3002.
  • the acquiring unit 3001 is used to acquire an image to be processed
  • the processing unit 3002 is configured to execute the image recognition methods in the embodiments of the present application.
  • the obtaining unit 3001 may be used to obtain the image to be processed; the processing unit 3002 may be used to perform steps S901 to S904 or steps S1001 to S1004 described above to identify group actions of multiple people in the image to be processed.
  • the obtaining unit 3001 may be used to obtain the image to be processed; the processing unit 3002 may be used to execute the above steps S1101 to S1104 to identify group actions of people in the image to be processed.
  • the above-mentioned processing unit 3002 can be divided into multiple modules according to different processing functions.
  • the processing unit 3002 can be divided into an extraction module 1801, a cross interaction module 1802, a feature fusion module 1803, and a classification module 1804 as shown in FIG. 18.
  • The processing unit 3002 can realize the functions of the various modules shown in FIG. 18, and can further be used to realize the image recognition method shown in FIG. 11.
  • FIG. 20 is a schematic diagram of the hardware structure of an image recognition device according to an embodiment of the present application.
  • the image recognition apparatus 4000 shown in FIG. 20 includes a memory 4001, a processor 4002, a communication interface 4003, and a bus 4004.
  • the memory 4001, the processor 4002, and the communication interface 4003 implement communication connections between each other through the bus 4004.
  • the memory 4001 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 4001 may store a program. When the program stored in the memory 4001 is executed by the processor 4002, the processor 4002 is configured to execute each step of the image recognition method in the embodiment of the present application.
  • The processor 4002 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the image recognition method in the method embodiments of the present application.
  • the processor 4002 may also be an integrated circuit chip with signal processing capability.
  • each step of the image recognition method of the present application can be completed by the integrated logic circuit of hardware in the processor 4002 or instructions in the form of software.
  • The aforementioned processor 4002 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • The software module can be located in a storage medium mature in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • The storage medium is located in the memory 4001, and the processor 4002 reads the information in the memory 4001 and, in combination with its hardware, completes the functions required by the units included in the image recognition device, or executes the image recognition method in the method embodiments of the application.
  • the communication interface 4003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 4000 and other devices or a communication network.
  • the image to be processed can be acquired through the communication interface 4003.
  • the bus 4004 may include a path for transferring information between various components of the device 4000 (for example, the memory 4001, the processor 4002, and the communication interface 4003).
  • FIG. 21 is a schematic diagram of the hardware structure of a neural network training device according to an embodiment of the present application. Similar to the above device 4000, the neural network training device 5000 shown in FIG. 21 includes a memory 5001, a processor 5002, a communication interface 5003, and a bus 5004. Among them, the memory 5001, the processor 5002, and the communication interface 5003 implement communication connections between each other through the bus 5004.
  • the memory 5001 may be ROM, static storage device and RAM.
  • the memory 5001 may store a program. When the program stored in the memory 5001 is executed by the processor 5002, the processor 5002 and the communication interface 5003 are used to execute each step of the neural network training method of the embodiment of the present application.
  • The processor 5002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to realize the functions required by the units in the image processing apparatus of the embodiments of the present application, or to execute the neural network training method of the method embodiments of this application.
  • the processor 5002 may also be an integrated circuit chip with signal processing capability.
  • each step of the neural network training method of the embodiment of the present application can be completed by the integrated logic circuit of hardware in the processor 5002 or instructions in the form of software.
  • the aforementioned processor 5002 may also be a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • The software module can be located in a storage medium mature in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • The storage medium is located in the memory 5001, and the processor 5002 reads the information in the memory 5001 and, in combination with its hardware, completes the functions required by the units included in the image processing apparatus of the embodiments of the present application, or executes the neural network training method of the method embodiments of the present application.
  • the communication interface 5003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 5000 and other devices or communication networks.
  • the image to be processed can be acquired through the communication interface 5003.
  • the bus 5004 may include a path for transferring information between various components of the device 5000 (for example, the memory 5001, the processor 5002, and the communication interface 5003).
  • Although the device 4000 and the device 5000 show only a memory, a processor, and a communication interface, in a specific implementation process, those skilled in the art should understand that the device 4000 and the device 5000 may also include other components necessary for normal operation. In addition, according to specific needs, those skilled in the art should understand that the device 4000 and the device 5000 may also include hardware components that implement other additional functions. Furthermore, those skilled in the art should understand that the device 4000 and the device 5000 may also include only the components necessary to implement the embodiments of the present application, and need not include all of the components shown in FIG. 20 and FIG. 21.
  • An embodiment of the present application further provides an image recognition device, including at least one processor and a communication interface, where the communication interface is used for the image recognition device to exchange information with other communication devices, and when program instructions are executed in the at least one processor, the image recognition device is caused to execute the above method.
  • An embodiment of the present application also provides a computer program storage medium, which is characterized in that the computer program storage medium has program instructions, and when the program instructions are directly or indirectly executed, the foregoing method can be realized.
  • An embodiment of the present application further provides a chip system, characterized in that the chip system includes at least one processor, and when the program instructions are executed in the at least one processor, the foregoing method can be realized.
  • the disclosed system, device, and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of the present application essentially, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to enable a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the field of artificial intelligence, in particular to the field of computer vision, and provides an image recognition method and apparatus, a computer-readable storage medium, and a chip. The method comprises: extracting image features of an image to be processed; determining a temporal feature and a spatial feature of each person among a plurality of persons in the image to be processed in each of a plurality of frames of images in the image to be processed; determining an associated action feature according to the temporal feature and the spatial feature; and recognizing the group action of the plurality of persons in the image to be processed according to the action features. In the described method, the temporal association between the extracted actions of each of the plurality of persons in the image to be processed, as well as the association with the actions of the other persons, are determined, which makes it possible to better recognize the group action of the plurality of persons in the image to be processed.
PCT/CN2020/113788 2019-10-15 2020-09-07 Procédé et appareil de reconnaissance d'image, support d'enregistrement lisible par ordinateur et puce WO2021073311A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910980310.7A CN112668366B (zh) 2019-10-15 2019-10-15 图像识别方法、装置、计算机可读存储介质及芯片
CN201910980310.7 2019-10-15

Publications (1)

Publication Number Publication Date
WO2021073311A1 true WO2021073311A1 (fr) 2021-04-22

Family

ID=75400028

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113788 WO2021073311A1 (fr) 2019-10-15 2020-09-07 Procédé et appareil de reconnaissance d'image, support d'enregistrement lisible par ordinateur et puce

Country Status (2)

Country Link
CN (1) CN112668366B (fr)
WO (1) WO2021073311A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111842A (zh) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 一种动作识别方法、装置、设备及计算机可读存储介质
CN113283381A (zh) * 2021-06-15 2021-08-20 南京工业大学 一种适用于移动机器人平台的人体动作检测方法

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112969058B (zh) * 2021-05-18 2021-08-03 南京拓晖信息技术有限公司 一种具有云储存功能的工业视频实时监管平台及方法
CN113255518B (zh) * 2021-05-25 2021-09-24 神威超算(北京)科技有限公司 一种视频异常事件检测方法和芯片
CN113505733A (zh) * 2021-07-26 2021-10-15 浙江大华技术股份有限公司 行为识别方法、装置、存储介质及电子装置
CN113344562B (zh) * 2021-08-09 2021-11-02 四川大学 基于深度神经网络的以太坊钓鱼诈骗账户检测方法与装置
CN114494543A (zh) * 2022-01-25 2022-05-13 上海商汤科技开发有限公司 动作生成方法及相关装置、电子设备和存储介质
JP7563662B1 (ja) 2023-06-16 2024-10-08 コニカミノルタ株式会社 画像解析装置、画像解析方法、および画像解析プログラム

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005182660A (ja) * 2003-12-22 2005-07-07 Matsushita Electric Works Ltd 文字・図形の認識方法
CN108229355A (zh) * 2017-12-22 2018-06-29 北京市商汤科技开发有限公司 行为识别方法和装置、电子设备、计算机存储介质、程序
CN109101896A (zh) * 2018-07-19 2018-12-28 电子科技大学 一种基于时空融合特征和注意力机制的视频行为识别方法
CN109299646A (zh) * 2018-07-24 2019-02-01 北京旷视科技有限公司 人群异常事件检测方法、装置、系统和存储介质

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9721086B2 (en) * 2013-03-15 2017-08-01 Advanced Elemental Technologies, Inc. Methods and systems for secure and reliable identity-based computing
CA3041148C (fr) * 2017-01-06 2023-08-15 Sportlogiq Inc. Systemes et procedes de comprehension de comportements a partir de trajectoires
CN110363279B (zh) * 2018-03-26 2021-09-21 华为技术有限公司 基于卷积神经网络模型的图像处理方法和装置
CN108764019A (zh) * 2018-04-03 2018-11-06 天津大学 一种基于多源深度学习的视频事件检测方法
CN109299657B (zh) * 2018-08-14 2020-07-03 清华大学 基于语义注意力保留机制的群体行为识别方法及装置
CN109993707B (zh) * 2019-03-01 2023-05-12 华为技术有限公司 图像去噪方法和装置
CN110222717B (zh) * 2019-05-09 2022-01-14 华为技术有限公司 图像处理方法和装置
CN110309856A (zh) * 2019-05-30 2019-10-08 华为技术有限公司 图像分类方法、神经网络的训练方法及装置

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111842A (zh) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 一种动作识别方法、装置、设备及计算机可读存储介质
CN113111842B (zh) * 2021-04-26 2023-06-27 浙江商汤科技开发有限公司 一种动作识别方法、装置、设备及计算机可读存储介质
CN113283381A (zh) * 2021-06-15 2021-08-20 南京工业大学 一种适用于移动机器人平台的人体动作检测方法
CN113283381B (zh) * 2021-06-15 2024-04-05 南京工业大学 一种适用于移动机器人平台的人体动作检测方法

Also Published As

Publication number Publication date
CN112668366A (zh) 2021-04-16
CN112668366B (zh) 2024-04-26

Similar Documents

Publication Publication Date Title
WO2021043168A1 (fr) Procédé d'entraînement de réseau de ré-identification de personnes et procédé et appareil de ré-identification de personnes
WO2021073311A1 (fr) Procédé et appareil de reconnaissance d'image, support d'enregistrement lisible par ordinateur et puce
WO2020253416A1 (fr) Procédé et dispositif de détection d'objet et support de stockage informatique
US20220092351A1 (en) Image classification method, neural network training method, and apparatus
Chen et al. Attention-based context aggregation network for monocular depth estimation
CN112446398B (zh) 图像分类方法以及装置
CN110555481B (zh) 一种人像风格识别方法、装置和计算机可读存储介质
WO2021043273A1 (fr) Procédé et appareil d'amélioration d'image
WO2021022521A1 (fr) Procédé de traitement de données et procédé et dispositif d'apprentissage de modèle de réseau neuronal
CN110222717B (zh) 图像处理方法和装置
WO2021155792A1 (fr) Appareil de traitement, procédé et support de stockage
WO2021013095A1 (fr) Procédé et appareil de classification d'image, et procédé et appareil pour apprentissage de modèle de classification d'image
CN112446476A (zh) 神经网络模型压缩的方法、装置、存储介质和芯片
CN113011562B (zh) 一种模型训练方法及装置
CN111310604A (zh) 一种物体检测方法、装置以及存储介质
WO2021018245A1 (fr) Procédé et appareil de classification d'images
CN113065645B (zh) 孪生注意力网络、图像处理方法和装置
CN110222718B (zh) 图像处理的方法及装置
WO2021227787A1 (fr) Procédé et appareil de formation de prédicteur de réseau neuronal, et procédé et appareil de traitement d'image
WO2021018251A1 (fr) Procédé et dispositif de classification d'image
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
CN113807183A (zh) 模型训练方法及相关设备
CN112464930A (zh) 目标检测网络构建方法、目标检测方法、装置和存储介质
CN110705564A (zh) 图像识别的方法和装置
WO2021189321A1 (fr) Procédé et dispositif de traitement d'image

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20877961

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20877961

Country of ref document: EP

Kind code of ref document: A1