WO2019228316A1 - Motion recognition method and device - Google Patents

Motion recognition method and device

Info

Publication number
WO2019228316A1
Authority
WO
WIPO (PCT)
Prior art keywords
picture
optical flow
features
feature
spatial
Prior art date
Application number
PCT/CN2019/088694
Other languages
English (en)
French (fr)
Inventor
乔宇
周磊
王亚立
江立辉
刘健庄
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP19810779.9A (EP3757874B1)
Publication of WO2019228316A1
Priority to US17/034,654 (US11392801B2)
Priority to US17/846,533 (US11704938B2)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/72 Data preparation, e.g. statistical preprocessing of image or video features
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V10/82 Arrangements for image or video recognition or understanding using neural networks
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Definitions

  • the present application relates to the technical field of motion recognition, and more particularly, to a method and device for motion recognition.
  • Motion recognition includes motion recognition of people in videos and motion recognition of people in pictures. Since more information is contained in videos, it is relatively easy to recognize motion of people in videos. Compared with videos, pictures contain less information, so how to effectively identify the action categories of people in pictures is a problem that needs to be solved.
  • the present application provides a motion recognition method and device, which can improve the accuracy of motion recognition.
  • In a first aspect, a motion recognition method is provided, which includes: acquiring a picture to be processed; determining a spatial feature of the picture to be processed; determining a virtual optical flow feature of the picture to be processed according to the spatial feature of the picture to be processed and X spatial features and X optical flow features in a feature library; determining a first-type confidence of the picture to be processed on different action categories according to the similarity between the virtual optical flow feature of the picture to be processed and Y optical flow features in the feature library; and determining the action category of the picture to be processed according to the first-type confidence.
  • the above feature library is a preset feature library, and the feature library includes multiple spatial features and multiple optical flow features.
  • Each spatial feature in the feature library corresponds to an action category
  • each optical flow feature in the feature database corresponds to an action category.
  • Both X and Y are integers greater than one.
  • the above-mentioned action category corresponding to each spatial feature and the action category corresponding to each optical flow feature are obtained by training according to a convolutional neural network model in advance.
  • each spatial feature in the feature library corresponds to an optical flow feature
  • each optical flow feature in the feature library corresponds to a spatial feature
  • In this application, the virtual optical flow feature of the picture to be processed can be obtained through the spatial feature of the picture to be processed and the spatial features and optical flow features in the feature library, thereby simulating, for the picture, the temporal information that is closely related to the action.
  • the similarity between the virtual optical flow feature of the picture to be processed and the optical flow feature in the feature database is used for motion recognition of the picture to be processed.
  • Because this application performs motion recognition directly by comparing the virtual optical flow feature of the picture to be processed with the optical flow features in the feature library, there is no need to build a separate training model to perform motion recognition on the picture to be processed, and motion recognition of the picture can be realized with relatively few optical flow features.
  • the above-mentioned X spatial features and X optical flow features are all spatial features and all optical flow features in the feature library, respectively.
  • In this way, the virtual optical flow feature of the picture to be processed can be determined more accurately, and the action category of the picture to be processed can in turn be determined more accurately.
  • Optionally, the above-mentioned X spatial features and X optical flow features are a part of the spatial features and a part of the optical flow features in the feature library, respectively.
  • In this way, the amount of calculation needed to obtain the virtual optical flow feature of the picture to be processed can be reduced, thereby increasing the speed of motion recognition of the picture to be processed.
  • the X spatial features and the X optical flow features correspond one-to-one.
  • that is, among the X spatial features and the X optical flow features, each spatial feature corresponds to an optical flow feature, and each optical flow feature corresponds to a spatial feature.
  • Y optical flow features may be all the optical flow features in the feature library or part of the optical flow features in the feature library.
  • X and Y may be the same or different.
  • When the Y optical flow features are all the optical flow features in the feature library, the action category of the picture to be processed is obtained according to the similarity between the virtual optical flow feature of the picture to be processed and all the optical flow features in the feature library, which can improve the accuracy of the first-type confidence and thus the effect of motion recognition on the picture to be processed.
  • When the Y optical flow features are only a part of the optical flow features in the feature library, the amount of calculation when determining the first-type confidence can be reduced, and the speed of motion recognition of the picture to be processed can be increased.
  • the to-be-processed picture is a picture containing a person
  • determining the action category of the to-be-processed picture according to the first type of confidence includes determining the action category of the person in the to-be-processed picture according to the first-type confidence.
  • determining the action category of the picture to be processed is actually determining the action category of a person or other target object in the picture to be processed.
  • the picture to be processed is a static picture.
  • the spatial feature is specifically a spatial feature vector
  • the optical flow feature is specifically an optical flow feature vector
  • Optionally, determining the virtual optical flow feature of the picture to be processed according to the spatial feature of the picture to be processed and the X spatial features and X optical flow features in the feature library includes: weighting and summing the X optical flow features according to the similarity between the spatial feature of the picture to be processed and each of the X spatial features in the feature library, to obtain the virtual optical flow feature of the picture to be processed.
  • the feature library includes spatial features and optical flow features of the training video.
  • a virtual optical flow feature of a to-be-processed picture may be determined according to a spatial feature and an optical flow feature of a training video and a spatial feature of a picture to be processed, and then an action category of the to-be-processed picture may be determined according to the virtual optical flow feature.
  • the feature library further includes spatial features and virtual optical flow features of the training picture.
  • In this case, the virtual optical flow feature of the picture to be processed can be determined according to the spatial features and optical flow features of both the training video and the training picture, together with the spatial feature of the picture to be processed. In this way, more accurate virtual optical flow features can be obtained, which can further improve the accuracy of motion recognition.
  • the action category of the training picture and the action category of the training video are not exactly the same.
  • the types of action categories that can be identified can be increased, thereby increasing the applicable range of action recognition.
  • the number of videos of different action categories in the training video is the same.
  • Because the number of videos of different action categories in the training video is the same, the training videos are balanced across action categories, which helps guarantee the stability of the action recognition result.
  • the method further includes: selecting a picture that matches the action category that needs to be identified from a preset picture library to obtain the training picture.
  • the above picture library may be a local picture database or a picture database located on a network server.
  • the method further includes: selecting, from a preset video library, videos whose similarity with the spatial characteristics of the training pictures meets the preset requirements to obtain the training videos.
  • the above video library may be a local video library or a video library in a network server.
  • Optionally, selecting, from the preset video library, videos whose similarity with the spatial features of the training picture meets the preset requirement to obtain the training videos includes: selecting, from the preset video library, videos whose similarity with the spatial features of the training picture is greater than a preset similarity threshold, to obtain the training videos.
  • all videos in the preset video library that have a similarity with the spatial features of the training pictures greater than 0.5 can be selected to form a training video.
  • Optionally, selecting, from the preset video library, videos whose similarity with the spatial features of the training picture meets the preset requirement to obtain the training videos includes: determining the similarity between the spatial features of the videos in the video library and the spatial features of the training picture, and selecting the first J videos with the largest similarity to the spatial features of the training picture to obtain the training videos, where J is less than K, J and K are integers greater than 0, and K is the total number of videos in the video library.
  • For example, if the video library contains a total of 100 videos, the first 50 videos with the greatest similarity to the spatial features of the training pictures can be selected to form the training videos.
  • Optionally, determining the virtual optical flow feature of the picture to be processed according to the spatial feature of the picture to be processed and the X spatial features and X optical flow features in the feature library specifically includes: determining, according to the similarity between the spatial feature of the picture to be processed and each of the X spatial features, the weight coefficient of the optical flow feature corresponding to each of the X spatial features in the feature library; and weighting and summing the X optical flow features according to the weight coefficient of each of the X optical flow features, to obtain the virtual optical flow feature of the picture to be processed (see the sketch below).
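  • The following Python sketch illustrates this weighted summation. It is only a minimal illustration, not the exact formulation of this application: the cosine similarity measure, the softmax normalization of the weight coefficients, and all function and variable names are assumptions introduced for the example. V_rgb and V_flow are assumed to be X x M matrices holding the library spatial and optical flow features, in one-to-one correspondence by row.

```python
import numpy as np

def virtual_optical_flow(u_rgb, V_rgb, V_flow):
    """Illustrative sketch: weight the X library optical flow features (rows of
    V_flow) by the similarity between the picture's spatial feature u_rgb and
    each library spatial feature (rows of V_rgb), then sum them."""
    # similarity between u_rgb and each of the X library spatial features (cosine, assumed)
    sims = V_rgb @ u_rgb / (np.linalg.norm(V_rgb, axis=1) * np.linalg.norm(u_rgb) + 1e-8)
    # weight coefficients, positively related to the similarity (softmax, assumed)
    w = np.exp(sims) / np.exp(sims).sum()
    # weighted sum of the X optical flow features -> virtual optical flow feature
    return w @ V_flow
```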
  • the corresponding spatial features and optical flow features in the feature database correspond to the same video or picture, that is, the corresponding spatial features and optical flow features in the feature database belong to the same video or the same picture.
  • Optionally, the weight coefficient of each of the X optical flow features is positively related to a first similarity, where the first similarity is the similarity between the spatial feature of the picture to be processed and the spatial feature, among the X spatial features, that corresponds to the optical flow feature.
  • the X spatial features include a first spatial feature
  • the X optical flow features include a first optical flow feature
  • the first spatial feature corresponds to the first optical flow feature
  • If the similarity between the spatial feature of the picture to be processed and the first spatial feature is similarity 1, then the weight coefficient of the first optical flow feature is positively related to similarity 1 (specifically, it can be a proportional relationship).
  • the virtual optical flow features of the picture to be processed obtained according to the optical flow features in the feature library are more accurate.
  • Optionally, the feature library includes the spatial features and optical flow features of the training video.
  • In this case, determining the virtual optical flow feature of the picture to be processed according to the spatial feature of the picture to be processed and the spatial features and optical flow features in the feature library specifically includes: determining, according to the similarity between the spatial feature of the picture to be processed and each spatial feature of the training video, the weight coefficient of the optical flow feature corresponding to each spatial feature of the training video; and weighting and summing the optical flow features in the feature library according to the weight coefficient of each optical flow feature of the training video, to obtain the virtual optical flow feature of the picture to be processed.
  • the virtual optical flow characteristics of the picture to be processed are determined only based on the spatial characteristics and optical flow characteristics of the training video, which can reduce the complexity of determining the virtual optical flow characteristics.
  • Optionally, the feature library includes the spatial features and optical flow features of the training video and the spatial features and virtual optical flow features of the training picture. In this case, determining the virtual optical flow feature of the picture to be processed according to the spatial feature of the picture to be processed and the spatial features and optical flow features in the feature library specifically includes: determining, according to the similarity between the spatial feature of the picture to be processed and each spatial feature of the training video and the training picture, the weight coefficient of the optical flow feature corresponding to each such spatial feature; and weighting and summing the optical flow features of the training video and the training picture according to those weight coefficients, to obtain the virtual optical flow feature of the picture to be processed.
  • In this way, the virtual optical flow feature of the picture to be processed is determined comprehensively from the spatial features and optical flow features of the training video and the spatial features and virtual optical flow features of the training picture, so that the obtained virtual optical flow feature better reflects the motion information of the picture to be processed.
  • Optionally, the virtual optical flow feature of the training picture is obtained by weighting and summing the optical flow features of the training video according to the similarity between the spatial features of the training picture and the spatial features of the training video.
  • the above method further includes: weighting and summing the optical flow features of the training video according to the similarity between the spatial features of the training picture and the spatial features of the training video to obtain the virtual optical flow features of the training picture.
  • Specifically, weighting and summing the optical flow features of the training video to obtain the virtual optical flow feature of the training picture includes: determining, according to the similarity between the spatial feature of the training picture and each spatial feature of the training video, the weight coefficient of the optical flow feature corresponding to each spatial feature of the training video; and weighting and summing the optical flow features of the training video according to those weight coefficients, to obtain the virtual optical flow feature of the training picture.
  • the above feature database may initially include only the spatial features and optical flow features of the training video.
  • In this case, the spatial features and virtual optical flow features of the training pictures may be added to the feature library, and the virtual optical flow features of the training pictures can be determined according to the spatial features and optical flow features of the training videos already contained in the feature library.
  • That is, the virtual optical flow feature of the training picture is determined from the spatial features and optical flow features of the training video, and the spatial feature and virtual optical flow feature of the training picture are then incorporated into the feature library, which can improve the effect of motion recognition to a certain extent.
  • Optionally, the method further includes: determining a second-type confidence of the picture to be processed on different action categories according to the similarity between the spatial feature of the picture to be processed and Z spatial features in the preset feature library, where each of the Z spatial features corresponds to an action category; and determining the action category of the picture to be processed according to the first-type confidence includes: determining the action category of the picture to be processed according to the first-type confidence and the second-type confidence.
  • the first type of confidence is obtained through an optical flow prediction process
  • the second type of confidence is obtained through a spatial prediction process.
  • Z is an integer greater than 1. Any two values of X, Y, and Z may be the same or different.
  • the above-mentioned Z spatial features may be all the spatial features in the feature database or only part of the spatial features in the feature database.
  • the confidence level of the picture to be processed is comprehensively obtained through optical flow prediction and spatial prediction, which can more accurately determine the action category of the picture to be processed.
  • Optionally, determining the action category of the picture to be processed according to the first-type confidence and the second-type confidence includes: weighting and summing the first-type confidence and the second-type confidence to obtain the final confidence of the picture to be processed on different action categories; and determining the action category of the picture to be processed according to the final confidence.
  • In this way, a confidence that comprehensively reflects the picture to be processed on different action categories can be obtained, and the action category of the picture to be processed can be determined better.
  • the above method further includes: adding spatial characteristics and virtual optical flow features of the picture to be processed, and action category information of the picture to be processed to the feature database.
  • In this way, the spatial features and optical flow features contained in the feature library can be expanded, which facilitates subsequent motion recognition of pictures based on the spatial features and optical flow features in the feature library.
  • In a second aspect, a motion recognition device is provided, which includes modules for executing the method in the first aspect.
  • In a third aspect, a motion recognition device is provided, which includes: a memory for storing a program; and a processor for executing the program stored in the memory; when the program stored in the memory is executed, the processor is configured to execute the method in the first aspect.
  • In a fourth aspect, a computer-readable medium is provided, which stores program code for execution by a device, the program code including instructions for performing the method in the first aspect.
  • A computer program product containing instructions is also provided; when the instructions are run on a computer, the computer is caused to execute the method in the first aspect.
  • an electronic device including the motion recognition device according to the second aspect or the third aspect.
  • FIG. 1 is a schematic flowchart of a motion recognition method according to an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a motion recognition method according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of extracting spatial features according to a CNN model
  • FIG. 4 is a schematic diagram of extracting a spatial feature and an optical flow feature of a video
  • FIG. 5 is a schematic diagram of acquiring a virtual optical flow feature of an input picture
  • FIG. 6 is a schematic diagram of performing optical flow prediction on an input picture
  • FIG. 7 is a schematic diagram of establishing an optical flow feature database and a spatial feature database
  • FIG. 8 is a schematic block diagram of a motion recognition device according to an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of a motion recognition device according to an embodiment of the present application.
  • FIG. 10 is a schematic block diagram of a motion recognition device according to an embodiment of the present application.
  • FIG. 11 is a schematic block diagram of a motion recognition device performing motion recognition on an input picture according to an embodiment of the present application.
  • The motion recognition method according to the embodiments of the present application can be applied to image retrieval, album management, safe city, human-computer interaction, and other scenarios that require motion recognition.
  • the action recognition method in the embodiment of the present application can be applied to an album management system and a picture search system. The following briefly introduces the album management system and the picture search system, respectively.
  • Specifically, the motion recognition method in the embodiment of the present application can be used to perform motion recognition on the pictures in an album to obtain the action category of each picture, so that the user can classify and manage pictures of different action categories; this makes it easier for the user to find pictures, saves management time, and improves the efficiency of album management.
  • the action recognition method in the embodiment of the present application can identify the action type in the picture, and then can find the picture of interest from the Internet or a database according to the action category of the picture.
  • The traditional scheme is to extract person images from a large number of training pictures, and then train a convolutional neural network (CNN) model based on the person images extracted from the training pictures and the action categories corresponding to those person images, to obtain the parameters of the CNN model.
  • When motion recognition is performed, a person image can be extracted from the picture to be processed, and the person image extracted from the picture to be processed is input to the trained CNN model for motion recognition, so as to determine the action category of the picture to be processed.
  • However, the traditional scheme only considers the spatial characteristics of the picture when performing motion recognition, and does not further explore the temporal attributes of the person's motion in the picture to be processed, resulting in low accuracy of motion recognition.
  • the known optical flow characteristics are used to simulate the optical flow characteristics of the picture to be processed to obtain the virtual optical flow characteristics of the picture to be processed.
  • the virtual optical flow feature is used to identify the action of the picture to be processed.
  • the motion recognition method according to the embodiment of the present application is described in detail below with reference to FIG. 1.
  • the method shown in FIG. 1 may be executed by a motion recognition device.
  • the motion recognition device may specifically be a device with a picture processing function, such as a monitoring device, a terminal device, a network server, and a network cloud platform.
  • the method shown in FIG. 1 includes steps 101 to 105, and steps 101 to 105 are described in detail below.
  • the picture to be processed (also referred to as a to-be-processed image) may be a picture containing a person.
  • the action recognition of the picture to be processed is essentially to recognize the action of the person in the picture to be processed, and determine the action category of the picture to be processed.
  • the picture to be processed may be a picture taken by an electronic device, or may be a picture taken from a video.
  • The picture to be processed may be stored in a local picture database or on the network; when the picture to be processed is obtained, it can be retrieved directly from the local picture database or downloaded from the network.
  • The action category may refer to what action the person in the picture to be processed specifically performs; for example, the action category may include: run, walk, baseball pitch (baseball_pitch), baseball swing (baseball_swing), bowling, clean and jerk (clean_and_jerk), golf swing (golf_swing), jump rope (jump_rope), pull-up (pullup), push-up (pushup), sit-up (situp), squat, playing guitar (strum_guitar), swimming (swim), and so on.
  • A convolutional neural network (CNN) model may be used to extract the spatial feature of the picture to be processed.
  • The CNN model may be a pre-trained model.
  • Specifically, a convolution operation can be performed on the picture to be processed by using the CNN model, to obtain the spatial feature of the picture to be processed.
  • X and Y are both integers greater than 1.
  • The above feature library may be a preset feature library, and the feature library includes multiple spatial features and multiple optical flow features; each of the multiple spatial features corresponds to an action category, and each of the multiple optical flow features also corresponds to an action category.
  • the action category corresponding to each spatial feature and the action category corresponding to each optical flow feature may be obtained by training according to a convolutional neural network model in advance.
  • any one of the multiple spatial features corresponds to one optical flow feature among the multiple optical flow features.
  • Any one of the optical flow features corresponds to one of a plurality of spatial features.
  • the number of spatial features and optical flow features in the feature library is generally the same.
  • Optionally, the above-mentioned X spatial features and X optical flow features may also have a one-to-one correspondence; that is, among the X spatial features and X optical flow features, each spatial feature corresponds to an optical flow feature, and each optical flow feature corresponds to a spatial feature.
  • the above X spatial features may be all the spatial features or part of the spatial features in the feature database.
  • the X optical flow features may be all or some of the optical flow features in the feature library.
  • When the above-mentioned X spatial features and X optical flow features are respectively all the spatial features and all the optical flow features in the feature library, the virtual optical flow feature of the picture to be processed can be determined based on the spatial feature of the picture to be processed and all the spatial features and all the optical flow features in the feature library.
  • In this way, the virtual optical flow feature of the picture to be processed can be determined more accurately, and the action category of the picture to be processed can in turn be determined more accurately.
  • When the X spatial features and X optical flow features are respectively a part of the spatial features and a part of the optical flow features in the feature library, determining the virtual optical flow feature of the picture to be processed by combining a part of the spatial features and a part of the optical flow features in the feature library with the spatial feature of the picture to be processed can reduce the amount of calculation, thereby increasing the speed of motion recognition of the picture to be processed.
  • Corresponding spatial features and optical flow features in the feature library come from the same video or picture; that is, a spatial feature and its corresponding optical flow feature in the feature library belong to the same video or the same picture.
  • the specific expression form of the spatial feature mentioned in this application may be a spatial feature vector
  • the specific expression form of the optical flow feature or the virtual optical flow feature may be an optical flow feature vector or a virtual optical flow feature vector.
  • Because each spatial feature in the feature library corresponds to an optical flow feature (the spatial features and optical flow features in the feature library have a one-to-one correspondence), when the virtual optical flow feature of the picture to be processed is determined according to the spatial feature of the picture to be processed and the spatial features and optical flow features in the preset feature library, the optical flow features in the feature library can be weighted and summed according to the similarity between the spatial feature of the picture to be processed and each spatial feature in the feature library, to obtain the virtual optical flow feature of the picture to be processed.
  • determining the virtual optical flow feature of the picture to be processed according to the spatial features of the picture to be processed and the X spatial features and X optical flow features in the preset feature database in step 103 includes the following specific processes:
  • Optionally, the weight coefficient of each of the X optical flow features is positively related to a first similarity, where the first similarity is the similarity between the spatial feature of the picture to be processed and the spatial feature, among the X spatial features, that corresponds to the optical flow feature.
  • the X spatial features include a first spatial feature and a second spatial feature
  • the X optical flow features include a first optical flow feature and a second optical flow feature
  • the first spatial feature corresponds to the first optical flow feature
  • the second spatial feature corresponds to the second optical flow feature.
  • the similarity between the spatial feature of the picture to be processed and the first spatial feature is similarity 1
  • the similarity between the spatial feature of the picture to be processed and the second spatial feature is similarity 2.
  • If similarity 1 is greater than similarity 2, then, when the first optical flow feature, the second optical flow feature, and the other optical flow features among the X optical flow features are weighted and summed, the weight coefficient of the first optical flow feature is greater than the weight coefficient of the second optical flow feature.
  • the virtual optical flow features of the pictures to be processed obtained according to the optical flow features in the feature library are more accurate.
  • each of the Y optical flow features corresponds to an action category, and Y is an integer greater than 1.
  • Y optical flow features may be all the optical flow features in the feature database or part of the optical flow features in the feature database.
  • Y and X may be the same or different.
  • When the Y optical flow features are all the optical flow features in the feature library, the action category of the picture to be processed is obtained according to the similarity between the virtual optical flow feature of the picture to be processed and all the optical flow features in the feature library, which can improve the accuracy of the first-type confidence and thus the effect of motion recognition on the picture to be processed.
  • When the Y optical flow features are only a part of the optical flow features in the feature library, the amount of calculation when determining the first-type confidence can be reduced, and the speed of motion recognition of the picture to be processed can be increased.
  • In this embodiment, the virtual optical flow feature of the picture to be processed can be obtained through the spatial feature of the picture to be processed and the spatial features and optical flow features in the preset feature library, thereby simulating, for the picture, the temporal information that is closely related to the action.
  • Action recognition may be performed on the picture to be processed according to the similarity between the virtual optical flow feature of the picture to be processed and the optical flow feature in a preset feature library.
  • Because this application performs motion recognition directly by comparing the virtual optical flow feature of the picture to be processed with the optical flow features in the feature library, there is no need to build a separate training model to perform motion recognition on the picture to be processed, and motion recognition of the picture can be realized with relatively few optical flow features.
  • the spatial features in the feature database include the spatial features of the training video
  • the optical flow features in the feature database include the optical flow features of the training video
  • the virtual optical flow characteristics of the picture to be processed can be simulated based on the spatial characteristics and optical flow characteristics of the training video, and then the spatial features and virtual optical flow characteristics of the picture to be processed can be combined to perform motion recognition, thereby improving the accuracy of motion recognition .
  • the spatial features of the feature database further include the spatial features of the training picture
  • the optical flow features in the feature database further include the virtual optical flow features of the training picture
  • In this case, the above feature library contains not only the spatial features and optical flow features of the training video, but also the spatial features and virtual optical flow features of the training picture.
  • The spatial features and optical flow features of the training video and the training picture can then be integrated to determine the virtual optical flow feature of the picture to be processed, which can further improve the accuracy of final motion recognition.
  • the virtual optical flow feature of the picture to be processed may be determined only based on the spatial feature and optical flow feature of the training video in the feature library. It is also possible to comprehensively determine the virtual optical flow features of the picture to be processed by combining the spatial features and optical flow features of the training video in the feature library with the spatial features and virtual optical flow features of the training picture.
  • Optionally, the feature library includes the spatial features and optical flow features of the training video.
  • In this case, determining the virtual optical flow feature of the picture to be processed according to the spatial feature of the picture to be processed and the spatial features and optical flow features in the feature library specifically includes: determining, according to the similarity between the spatial feature of the picture to be processed and each spatial feature of the training video, the weight coefficient of the optical flow feature corresponding to each spatial feature of the training video; and weighting and summing the optical flow features in the feature library according to the weight coefficient of each optical flow feature of the training video, to obtain the virtual optical flow feature of the picture to be processed.
  • the virtual optical flow characteristics of the picture to be processed are determined only based on the spatial characteristics and optical flow characteristics of the training video, which can reduce the complexity of determining the virtual optical flow characteristics.
  • Optionally, the feature library includes the spatial features and optical flow features of the training video and the spatial features and virtual optical flow features of the training picture. In this case, determining the virtual optical flow feature of the picture to be processed according to the spatial feature of the picture to be processed and the spatial features and optical flow features in the feature library specifically includes: determining, according to the similarity between the spatial feature of the picture to be processed and each spatial feature of the training video and the training picture, the weight coefficient of the optical flow feature corresponding to each such spatial feature; and weighting and summing the optical flow features of the training video and the training picture according to those weight coefficients, to obtain the virtual optical flow feature of the picture to be processed.
  • In this way, the virtual optical flow feature of the picture to be processed is determined comprehensively from the spatial features and optical flow features of the training video and the spatial features and virtual optical flow features of the training picture, so that the obtained virtual optical flow feature better reflects the motion information of the picture to be processed.
  • the virtual optical flow features of the training pictures in the feature database may be obtained according to the spatial features and optical flow features of the training video and the spatial features of the training picture. That is, the virtual optical flow features of the training picture are obtained by weighting and summing the optical flow features of the training video according to the similarity between the spatial features of the training picture and the spatial features of the training video.
  • the virtual optical flow characteristics of the training picture may be determined before performing motion recognition on the picture to be processed.
  • Specifically, the method shown in FIG. 1 further includes: weighting and summing the optical flow features of the training video according to the similarity between the spatial features of the training picture and the spatial features of the training video, to obtain the virtual optical flow feature of the training picture.
  • Weighting and summing the optical flow features of the training video to obtain the virtual optical flow feature of the training picture specifically includes: determining, according to the similarity between the spatial feature of the training picture and each spatial feature of the training video, the weight coefficient of the optical flow feature corresponding to each spatial feature of the training video; and weighting and summing the optical flow features of the training video according to those weight coefficients, to obtain the virtual optical flow feature of the training picture.
  • In addition to the first-type confidence, the confidence of the picture to be processed on different action categories can also be calculated according to the spatial feature of the picture to be processed, and these two types of confidence can then be used to comprehensively judge the action category of the picture to be processed.
  • Specifically, after the spatial feature of the picture to be processed is extracted in step 102, a second-type confidence of the picture to be processed on different action categories can be determined according to the similarity between the spatial feature of the picture to be processed and Z spatial features in the preset feature library, where each of the Z spatial features corresponds to an action category.
  • Z is an integer greater than 1, and Z may be the same as or different from X or Y.
  • the Z spatial features may be all the spatial features in the feature database or only a part of the spatial features in the feature database.
  • the action category of the picture to be processed may be determined according to the first-type confidence and the second-type confidence.
  • the confidence level of the picture to be processed is comprehensively obtained through optical flow prediction and spatial prediction, so that the action category of the picture to be processed can be determined more accurately.
  • Specifically, the first-type confidence and the second-type confidence may first be weighted and summed to obtain the final confidence of the picture to be processed on different action categories, and the action category of the picture to be processed is then determined according to the final confidence (an illustrative sketch of this fusion appears after this discussion).
  • In this way, a confidence that comprehensively reflects the picture to be processed on different action categories can be obtained, and the action category of the picture to be processed can be determined better.
  • Alternatively, the action category of the picture to be processed may be determined separately according to the first-type confidence and the second-type confidence, and the two results may then be compared to determine the action category of the picture to be processed.
  • Specifically, determining the action category of the picture to be processed according to the first-type confidence and the second-type confidence includes: determining, according to the first-type confidence, that the action category of the picture to be processed is a first action category; determining, according to the second-type confidence, that the action category of the picture to be processed is a second action category; and, in the case where the first action category and the second action category are the same, determining that the action category of the picture to be processed is the first action category.
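  • The following is a minimal illustrative sketch of the weighted fusion of the two types of confidence. The fusion weight alpha and all names are assumptions introduced for the example, not values given in this application.

```python
import numpy as np

def fuse_confidences(l_flow, l_rgb, alpha=0.5):
    """Weighted sum of the first-type (optical flow) confidence l_flow and the
    second-type (spatial) confidence l_rgb, followed by picking the action
    category with the largest final confidence."""
    final = alpha * l_flow + (1.0 - alpha) * l_rgb   # final confidence on each action category
    return int(np.argmax(final)), final              # index of the predicted action category
```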
  • Optionally, the spatial feature and virtual optical flow feature of the picture to be processed may be added to the feature library.
  • Specifically, the method shown in FIG. 1 further includes: adding the spatial feature and virtual optical flow feature of the picture to be processed, together with the action category information of the picture to be processed, to the feature library.
  • In this way, the spatial features and optical flow features contained in the feature library can be expanded, which facilitates subsequent motion recognition of pictures based on the spatial features and optical flow features in the feature library.
  • FIG. 2 is a schematic diagram of a motion recognition method according to an embodiment of the present application.
  • the specific process of the action recognition method shown in FIG. 2 includes:
  • the input picture is equivalent to the picture to be processed above.
  • a convolutional neural network CNN model may be used to extract the spatial features of the input picture.
  • Specifically, the CNN model is used to perform convolution processing on the input picture to obtain the convolution feature map of the input picture.
  • Next, the convolution feature map is pulled (flattened) into a one-dimensional vector to obtain the vector u_rgb.
  • The vector u_rgb is the spatial feature of the input picture.
  • the CNN module can be implemented using a variety of architectures, such as VGG16, TSN networks, and so on.
  • the coefficients of the CNN module need to be pre-trained on the data set for action recognition.
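  • A minimal sketch of this spatial-feature extraction with a VGG16 backbone from torchvision follows. The file name, the input size, and the choice of taking the flattened convolution feature map as u_rgb are assumptions for illustration; in practice the backbone weights would be pre-trained on an action recognition data set.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

backbone = models.vgg16()   # illustrative backbone; weights should be pre-trained for action recognition
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),   # assumed input size
    T.ToTensor(),
])

img = preprocess(Image.open("input.jpg").convert("RGB")).unsqueeze(0)  # hypothetical input picture
with torch.no_grad():
    fmap = backbone.features(img)              # convolution feature map, shape 1 x 512 x 7 x 7
    u_rgb = torch.flatten(fmap, start_dim=1)   # pulled into a one-dimensional vector (the spatial feature)
```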
  • a virtual optical flow feature of the input image may be simulated or generated according to the optical flow feature of the video stored in the video warehouse.
  • Specifically, the virtual optical flow feature of the input picture may be generated according to the optical flow features of the N videos.
  • the spatial features and optical flow features of the video can be extracted according to the process shown in FIG. 4.
  • the specific process of extracting the spatial and optical flow features of a video includes:
  • the intermediate optical flow map includes optical flow x and optical flow y
  • the RGB images in the middle of the video are sent to a pre-trained spatial feature CNN model to obtain the spatial features of the video;
  • the optical flow map in the middle of the video is sent to a pre-trained optical flow feature CNN model to obtain the optical flow feature of the video;
  • optical flow map in the middle of the video may be generated from several frames of pictures before and after the middle of the video.
  • the spatial features of the extracted video and the optical flow features of the extracted video may be independent of each other, and they may be performed simultaneously or sequentially.
  • The spatial features and optical flow features of the videos extracted in the process shown in FIG. 4 may specifically be spatial feature vectors and optical flow feature vectors, where the length of the spatial feature vector and the optical flow feature vector of each video may be M. Then, the spatial feature vectors of the N videos can be represented by a matrix V_rgb of size N x M, and the optical flow feature vectors of the N videos can be represented by a matrix V_flow of size N x M. In this way, the spatial feature matrix V_rgb and the optical flow feature matrix V_flow of the N videos are obtained.
  • Specifically, the optical flow features of the N videos can be weighted and summed according to the similarity between the spatial feature of the input picture and the spatial feature of each of the N videos, to obtain the virtual optical flow feature of the input picture.
  • the specific process includes:
  • the spatial features of the input picture are compared with the spatial features of the video in the video warehouse, and the similarity between the spatial features of the input picture and the spatial features of each video in the video warehouse is obtained.
  • the optical flow features of each video in the video warehouse are weighted and summed to obtain the virtual optical flow features of the input picture.
  • a Gaussian process can be used to calculate the similarity between the spatial features of the input picture and the spatial features of the video in the video warehouse.
  • formula (3) can be used to determine the similarity between the spatial features of the input picture and the spatial features of the video in the video warehouse.
  • In formula (3), u_rgb is the spatial feature of the input picture, and V_rgb is the matrix of spatial features of the videos in the video warehouse.
  • Each element in K_h(u_rgb, V_rgb), a 1 x N vector, is the kernel value between u_rgb and one row of V_rgb.
  • K_h(V_rgb, V_rgb) is the covariance matrix of V_rgb, and I is the identity matrix.
  • w_h is the similarity between the spatial feature of the input picture and the spatial features of the videos in the video warehouse; it is a one-dimensional vector of length N.
  • The i-th element in w_h represents the similarity between the spatial feature of the input picture and the spatial feature of the i-th video: the larger the value of the i-th element, the greater this similarity.
  • formula (4) can be specifically used to calculate the virtual optical flow feature of the input picture.
  • In formula (4), w_h represents the similarity between the spatial feature of the input picture and the spatial features of the videos in the video warehouse, V_flow represents the optical flow features of the videos in the video warehouse, and u_flow represents the virtual optical flow feature of the input picture.
  • u_flow is also a feature vector of length M.
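  • The following sketch illustrates a Gaussian-process style computation of w_h and u_flow in the spirit of formulas (3) and (4). The RBF kernel, the noise parameter, and all names are assumptions; the exact kernel used in this application is not specified in this excerpt.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Kernel value between each row of A and each row of B (RBF kernel assumed)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def gp_similarity(u_rgb, V_rgb, noise=1e-2):
    """w_h = K(u_rgb, V_rgb) @ inv(K(V_rgb, V_rgb) + noise * I): similarity between
    the input picture and each of the N videos in the video warehouse."""
    K_uv = rbf_kernel(u_rgb[None, :], V_rgb)                       # 1 x N
    K_vv = rbf_kernel(V_rgb, V_rgb)                                # N x N covariance matrix
    w_h = K_uv @ np.linalg.inv(K_vv + noise * np.eye(len(V_rgb)))
    return w_h.ravel()                                             # length-N similarity vector

# In the spirit of formula (4): the virtual optical flow feature of the input picture
# is the similarity-weighted combination of the video optical flow features:
# u_flow = gp_similarity(u_rgb, V_rgb) @ V_flow   # length-M vector
```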
  • the optical flow prediction of the input picture to obtain the first type of confidence is as follows:
  • one-hot-label is used to indicate an action-type label of each video or picture.
  • the action category label can be represented by a vector whose length is the same as the total number of action types, and each position in the action category label corresponds to an action category.
  • the vector has only one position with a value of 1 and the remaining positions with a value of 0.
  • the action category corresponding to the position with a value of 1 is the action category of the video or picture.
  • For example, suppose there are 3 videos and 3 pictures: Video1, Video2, Video3, Image1, Image2, and Image3.
  • the action categories are, in turn, run, dance, run, jump, dance, and run.
  • the action category labels of these 3 videos and 3 pictures are, in turn, [1,0,0] (Video1, run), [0,1,0] (Video2, dance), [1,0,0] (Video3, run), [0,0,1] (Image1, jump), [0,1,0] (Image2, dance), and [1,0,0] (Image3, run). Then, according to these action category labels, it can be known that the action categories corresponding to the 3 videos and 3 pictures are, in turn, run, dance, run, jump, dance, and run.
  • The above optical flow feature library may include the optical flow features of N_v videos and the virtual optical flow features of N_i pictures; together, they form the optical flow features M_flow of the optical flow feature warehouse.
  • formula (5) may be used to calculate the similarity between the virtual optical flow feature of the input picture and the optical flow feature in the optical flow feature library.
  • In formula (5), u_flow represents the virtual optical flow feature of the input picture, and M_flow represents the optical flow features in the optical flow feature warehouse.
  • Each element in K_P(u_flow, M_flow), a 1 x (N_v + N_i) vector, is the kernel value between u_flow and one row of M_flow.
  • K_p(M_flow, M_flow) is the covariance matrix of M_flow, to which a noise parameter matrix is added.
  • w_flow represents the similarity between the virtual optical flow feature of the input picture and the optical flow features in the optical flow feature library.
  • w_flow is a one-dimensional vector of length N_v + N_i, where the i-th element indicates the similarity between the virtual optical flow feature of the input picture and the i-th optical flow feature in the warehouse; the larger the value, the closer the virtual optical flow feature of the input picture is to that optical flow feature.
  • the first type of confidence of the input picture on each action category can be calculated according to formula (6).
  • Each row in L, an (N_v + N_i) x P matrix, represents the action category label corresponding to one optical flow feature in the optical flow feature warehouse, where P is the total number of action categories; each action category label is 1 only at the position of the category to which it belongs and 0 elsewhere. L_flow is the first-type confidence of the person in the input picture on each action category (a numerical sketch follows below).
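  • A small numerical sketch of this confidence computation, using the 3 videos and 3 pictures from the example above; the similarity values in w_flow are made up purely for illustration.

```python
import numpy as np

# One-hot action category labels L (rows: Video1..3, Image1..3; columns: run, dance, jump)
L = np.array([[1, 0, 0],   # Video1: run
              [0, 1, 0],   # Video2: dance
              [1, 0, 0],   # Video3: run
              [0, 0, 1],   # Image1: jump
              [0, 1, 0],   # Image2: dance
              [1, 0, 0]])  # Image3: run

# Similarity between the input picture's virtual optical flow feature and each
# optical flow feature in the warehouse (illustrative values only)
w_flow = np.array([0.4, 0.1, 0.2, 0.05, 0.05, 0.2])

# In the spirit of formula (6): first-type confidence on each action category
L_flow = w_flow @ L
print(L_flow)   # [0.8, 0.15, 0.05] -> "run" receives the highest confidence
```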
  • the following describes the process of calculating the first-class confidence of the person in the input picture on each action category with reference to Table 2.
  • the optical flow feature warehouse contains 3 videos and 3 pictures: Video1 (Video1), Video2 (Video2), Video3 (Video3), Picture1 (Image1), Picture2 (Image2), and Picture3 (Image3) .
  • the action categories are Run, Dance, Run, Jump, Dance, and Run.
  • the action category labels corresponding to these 3 videos and 3 pictures are shown in columns 2 to 7 (excluding the last row) of Table 2.
  • the similarity of the optical flow features is shown in the last row of Table 2.
  • the first-class confidence of the input picture in each action category is shown in the last column of Table 2.
  • the process of spatial prediction of the input picture is basically the same as the process of optical flow prediction of the input picture.
  • The difference is that the spatial feature of the input picture is compared with the spatial features in the spatial feature warehouse, and the action category labels are weighted by the resulting similarities and summed to obtain the spatial confidence L_rgb of each category, which serves as the second-type confidence of the person in the input picture on each action category.
  • optical flow feature database and the spatial feature database used in the above steps 204 and 205 may be feature databases that have been established in advance.
  • the process shown in FIG. 7 may be used to establish an optical flow feature library (also referred to as an optical flow feature warehouse) and a spatial feature library (also referred to as a spatial feature warehouse).
  • Specifically, spatial features and optical flow features are extracted from the training video set and the training picture set; the extracted spatial features of the training video set and of the training picture set are sent to the spatial feature library, and the extracted optical flow features of the training video set and of the training picture set are sent to the optical flow feature library.
  • the resulting spatial feature database contains the spatial features of the N v training videos and the spatial features of the N i training pictures
  • the resulting optical flow feature database contains the optical flow features of the N v training videos and the virtual optical flow features of the N i training pictures
  • the spatial features in the spatial feature library can be expressed as M rgb ∈ (N v + N i ) * M, and the optical flow features in the optical flow feature library can be expressed as M flow ∈ (N v + N i ) * M, where M is the length of each feature vector
  • the training video set and training picture set may be a video set and a picture set stored in a local database; a minimal sketch of assembling the two feature libraries is given below
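  • a minimal sketch of assembling the two feature warehouses is shown below; it assumes that per-video and per-picture feature extraction (the spatial-feature CNN, the optical-flow CNN, and the virtual-optical-flow generation) has already been done, and only the bookkeeping that stacks the features and their one-hot labels into M rgb, M flow, and L is shown, with all names being assumptions for illustration

```python
import numpy as np

def one_hot(category_index, num_categories):
    v = np.zeros(num_categories)
    v[category_index] = 1.0
    return v

def build_feature_libraries(video_feats, picture_feats, num_categories):
    """Stack training-video and training-picture features into the two libraries.

    video_feats   : list of (spatial_vec, flow_vec, category_index) for Nv training videos
    picture_feats : list of (spatial_vec, virtual_flow_vec, category_index) for Ni pictures
    Returns M_rgb and M_flow of shape (Nv+Ni, M) and one-hot labels L of shape (Nv+Ni, P).
    """
    spatial, flow, labels = [], [], []
    for s, f, c in list(video_feats) + list(picture_feats):
        spatial.append(s)                        # goes into the spatial feature warehouse
        flow.append(f)                           # goes into the optical flow feature warehouse
        labels.append(one_hot(c, num_categories))
    M_rgb = np.stack(spatial)                    # (Nv+Ni, M)
    M_flow = np.stack(flow)                      # (Nv+Ni, M)
    L = np.stack(labels)                         # (Nv+Ni, P)
    return M_rgb, M_flow, L
```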
  • the above target confidence includes the confidence of the input picture on each action category, so when determining the action category of the person in the input picture according to the target confidence, the action category corresponding to the largest confidence may be determined as the action category of the person in the input picture
  • alternatively, confidences greater than a preset threshold may first be selected from the target confidence, and the maximum confidence is then selected from among them
  • the action category corresponding to that maximum confidence is determined as the action category of the person in the input picture; when no confidence in the target confidence exceeds the preset threshold, no action category is recognized for the input picture (see the sketch below)
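  • the fusion and decision step can be sketched as follows; equal-ratio fusion is used because the disclosure mentions it as one option, while the threshold value, the fusion weight parameter, and all names are illustrative assumptions

```python
import numpy as np

def decide_action(L_flow, L_rgb, categories, alpha=0.5, threshold=None):
    """Fuse the two confidence vectors and pick the output action category.

    L_flow, L_rgb : (P,) first-class and second-class confidences
    alpha         : fusion weight; 0.5 corresponds to equal-ratio fusion
    threshold     : if set, return None when no fused confidence exceeds it
    """
    target = alpha * L_flow + (1.0 - alpha) * L_rgb      # target confidence
    if threshold is not None and target.max() <= threshold:
        return None                                      # no action category recognized
    return categories[int(np.argmax(target))]

# Example with the Table 2 flow confidences and made-up spatial confidences.
print(decide_action(np.array([0.3, 0.45, 0.25]),
                    np.array([0.4, 0.35, 0.25]),
                    ["Run", "Dance", "Jump"]))           # prints: Dance
```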
  • before the above steps are performed, videos with a high degree of relevance to the training pictures may also be selected from the local video library and put into the video warehouse
  • assume the videos in the existing video library correspond to a total of P v action categories, and the number of videos differs across action categories
  • to avoid imbalance across action categories, the same number of candidate videos (for example, K, where K is an integer greater than 0) is selected from each action category to form P v video bags, where each video bag contains K candidate videos
  • the existing training picture set has a total of P i action categories, and these action categories are not exactly the same as those of the video.
  • in each video bag, the top J videos (J &lt; K, where J and K are integers greater than 0) with the largest similarity metric (the maximum confidence obtained by comparing the video's spatial and optical flow features with the picture feature libraries built from the training pictures) are selected as the final videos for storage; a rough sketch of this selection follows
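  • the candidate-video selection can be sketched roughly as below; similarity_metric stands in for comparing a candidate video's features against the picture feature libraries and taking the maximum per-category confidence, random sampling is used only as a placeholder for how the K candidates per category are chosen, and all names are assumptions

```python
import random

def select_videos_for_warehouse(video_library, similarity_metric, K, J):
    """Pick J of K candidate videos per action category for the video warehouse.

    video_library     : dict mapping an action category to its list of videos
    similarity_metric : function(video) -> float, e.g. the maximum confidence obtained by
                        comparing the video's features with the picture feature libraries
    K                 : number of candidate videos per category (the video bag size)
    J                 : number of videos kept per bag, J < K
    """
    selected = []
    for category, videos in video_library.items():
        bag = random.sample(videos, K) if len(videos) > K else list(videos)
        bag.sort(key=similarity_metric, reverse=True)
        selected.extend(bag[:J])      # keep the top-J most picture-like videos
    return selected
```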
  • compared with videos, actions in a single picture lack temporal context, which makes recognition more difficult; this application therefore proposes a motion recognition method based on virtual optical flow and a feature library: by generating optical flow features closely related to motion for a single picture and then combining the picture's spatial and motion features for recognition, the accuracy of motion recognition can be improved
  • this application uses the spatial features and (virtual) optical flow features of training videos and training pictures to establish a feature database, and compares the spatial features and virtual optical flow features of the input picture against this database to obtain the action category, so that high recognition accuracy can be achieved even when training data is scarce
  • Table 3 shows the recognition accuracy of the motion recognition method of the embodiments of this application and of existing motion recognition methods on different motion recognition data sets
  • to highlight that this application suits cases where training pictures are scarce, only one picture per category was used as the training set for these experiments; under this setting the method of this application reaches accuracies of 35.4, 42.2, and 60.2 on the WEB101, VOC, and DIFF20 data sets, higher than the KNN, SVM, TSN, and RCNN baselines
  • the motion recognition method according to the embodiment of the present application has been described in detail above with reference to FIGS. 1 to 7.
  • the following describes the motion recognition device according to the embodiment of the present application with reference to FIGS. 8 to 11.
  • the motion recognition devices shown in FIG. 8 to FIG. 11 may be devices with picture processing functions, such as monitoring devices, terminal devices, network servers, and network cloud platforms
  • the motion recognition device shown in FIG. 8 to FIG. 11 may execute each step of the motion recognition method in the embodiment of the present application. For brevity, repeated descriptions are appropriately omitted below.
  • FIG. 8 is a schematic block diagram of a motion recognition device according to an embodiment of the present application.
  • the motion recognition device 800 shown in FIG. 8 includes:
  • An acquisition module 801 configured to acquire a picture to be processed
  • An extraction module 802 configured to extract a spatial feature of the picture to be processed
  • a processing module 803 is configured to determine a virtual optical flow feature of the picture to be processed according to the spatial feature of the picture to be processed and the X spatial features and X optical flow features in a preset feature library. There is a one-to-one correspondence between the X spatial features and the X optical flow features, and X is an integer greater than 1;
  • the processing module 803 is further configured to determine a first-class confidence of the picture to be processed on different action categories according to the similarity between the virtual optical flow feature of the picture to be processed and the Y optical flow features in the feature database, wherein each of the Y optical flow features corresponds to an action category, and Y is an integer greater than 1;
  • the processing module 803 is further configured to determine an action category of the picture to be processed according to the first-type confidence.
  • the virtual optical flow characteristics of the to-be-processed picture can be obtained through the spatial features of the picture to be processed and the spatial and optical flow features in the feature library, thereby simulating the time sequence information closely related to the action for the picture.
  • the similarity between the virtual optical flow feature of the picture to be processed and the optical flow feature in the feature database is used for motion recognition of the picture to be processed.
  • FIG. 9 is a schematic diagram of a hardware structure of a motion recognition device according to an embodiment of the present application.
  • the motion recognition device 900 shown in FIG. 9 includes a memory 901, a processor 902, a communication interface 903, and a bus 904.
  • the memory 901, the processor 902, and the communication interface 903 implement a communication connection between each other through a bus 904.
  • the memory 901 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 901 may store a program. When the program stored in the memory 901 is executed by the processor 902, the processor 902 and the communication interface 903 are configured to execute each step of the action recognition method in the embodiment of the present application.
  • the processor 902 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the functions required by the modules in the motion recognition apparatus of the embodiments of this application, or to perform the motion recognition method in the method embodiments of this application
  • the processor 902 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the motion recognition method of the present application may be completed by an integrated logic circuit of hardware in the processor 902 or an instruction in the form of software.
  • the processor 902 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • a software module may be located in a mature storage medium such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, or an electrically erasable programmable memory, a register, and the like.
  • the storage medium is located in the memory 901; the processor 902 reads the information in the memory 901 and, in combination with its hardware, performs the functions required by the modules included in the motion recognition device of the embodiments of this application, or performs the motion recognition method in the method embodiments of this application
  • the communication interface 903 uses a transceiving device such as but not limited to a transceiver to implement communication between the device 900 and other devices or a communication network. For example, a picture to be processed may be acquired through the communication interface 903.
  • the bus 904 may include a path for transmitting information between various components of the device 900 (for example, the memory 901, the processor 902, and the communication interface 903).
  • although the device 900 shown in FIG. 9 only shows the memory 901, the processor 902, and the communication interface 903, those skilled in the art should understand that in a specific implementation the device 900 also includes other components necessary for normal operation; depending on specific needs, the device 900 may further include hardware devices that implement other additional functions, and it may also include only the components necessary to implement the embodiments of the present application, rather than all the components shown in FIG. 9
  • the acquisition module 801 in the motion recognition device 800 corresponds to the communication interface 903 in the motion recognition device 900
  • the extraction module 802 and the processing module 803 correspond to the processor 902.
  • FIG. 10 is a schematic block diagram of a motion recognition device according to an embodiment of the present application.
  • the motion recognition device 1000 shown in FIG. 10 includes a CNN module 1001, a virtual optical flow module 1002, a spatial prediction module 1003, an optical flow prediction module 1004, a fusion output module 1005, a video warehouse 1006, a spatial feature warehouse 1007, and an optical flow feature warehouse 1008
  • the video warehouse 1006, the spatial feature warehouse 1007, and the optical flow feature warehouse 1008 may be located inside the motion recognition device 1000 or outside the motion recognition device 1000.
  • for example, the video warehouse 1006, the spatial feature warehouse 1007, and the optical flow feature warehouse 1008 may be located on a server or in a local database, and the motion recognition device may retrieve the features they contain from the server or the local database
  • the motion recognition device 1000 may execute each step of the motion recognition method in the embodiment of the present application.
  • the specific functions of each module are as follows:
  • a CNN module 1001 is configured to perform a convolution operation on a picture to be processed to obtain a spatial feature u rgb of the image.
  • a video warehouse 1006 is used to store the spatial feature V rgb and optical flow feature V flow of the training video.
  • the virtual optical flow module 1002 is configured to use the spatial feature u rgb of the picture to be processed and the spatial features V rgb and optical flow features V flow of the training videos stored in the video warehouse 1006 to generate the virtual optical flow feature u flow of the picture to be processed; a sketch of this step is given below
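  • a minimal sketch of this step is given below; it mirrors the Gaussian-process-style weighting described for formulas (3) and (4) (similarity of the picture's spatial feature against the stored video spatial features, then a weighted sum of the stored video optical flow features), with the function name, the noise value, and the use of a plain Gram matrix for the covariance term being assumptions

```python
import numpy as np

def virtual_optical_flow(u_rgb, V_rgb, V_flow, sigma_h=1e-3):
    """Generate the virtual optical flow feature u_flow of the input picture.

    u_rgb  : (M,)   spatial feature of the input picture
    V_rgb  : (N, M) spatial features of the N training videos in the video warehouse
    V_flow : (N, M) optical flow features of the same N training videos
    """
    # Similarity w_h between the input picture and every training video (formula (3) style).
    K_u = u_rgb @ V_rgb.T                    # dot products with every row of V_rgb
    K_V = V_rgb @ V_rgb.T                    # Gram matrix standing in for the covariance
    w_h = K_u @ np.linalg.inv(K_V + sigma_h * np.eye(K_V.shape[0]))
    # Weighted sum of the stored optical flow features (formula (4)).
    u_flow = w_h @ V_flow                    # (M,)
    return u_flow
```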
  • a spatial feature warehouse 1007 is used to store spatial features and class labels of training videos and images.
  • An optical flow feature warehouse 1008 is used to store optical flow features of training videos, virtual optical flow features of training pictures, and their action category tags.
  • the spatial prediction module 1003 is configured to compare the spatial features of the picture to be processed with the features in the spatial feature warehouse to obtain the confidence degree of the picture to be processed on each action category.
  • the optical flow prediction module 1004 is configured to compare the virtual optical flow features of the pictures to be processed with the features in the optical flow feature warehouse, to obtain the confidence of the pictures to be processed in various categories.
  • a fusion output module 1005 is used to fuse the confidence levels of the to-be-processed pictures obtained by the spatial prediction module 1003 and the optical flow prediction module 1004 in each action category to obtain the final confidence of each category, and select the action category with the highest confidence to output.
  • the motion recognition device 1000 shown in FIG. 10 may further include an acquisition module 1009 and a video selection module 1010, where the acquisition module 1009 is used to acquire a picture to be processed, and the video selection module 1010 is used to select a video for the video warehouse 1006.
  • the acquisition module 1009 in the motion recognition device 1000 corresponds to the communication interface 903 in the motion recognition device 900, and other modules in the motion recognition device 1000 correspond to the processor 902 in the motion recognition device 900.
  • the CNN module 1001, the virtual optical flow module 1002, the spatial prediction module 1003, the optical flow prediction module 1004, the fusion output module 1005, the acquisition module 1009, and the video selection module 1010 in the above motion recognition device 1000 may be implemented in hardware, in software, or in a combination of hardware and software
  • the process of performing motion recognition on the input picture by the motion recognition device 1000 is as follows:
  • CNN module 1001 obtains the input picture and extracts the spatial features of the input picture
  • the virtual optical flow module 1002 determines the virtual optical flow characteristics of the input picture according to the optical flow characteristics and spatial characteristics in the video warehouse 1006 and the spatial characteristics of the input picture;
  • the optical flow prediction module 1004 performs optical flow prediction according to the extracted virtual optical flow characteristics of the input picture and the optical flow characteristics in the optical flow characteristic warehouse 1008 to obtain the first type of confidence;
  • the spatial prediction module 1003 performs spatial prediction according to the extracted spatial features of the input picture and the spatial features in the spatial feature warehouse 1007 to obtain a second type of confidence;
  • the fusion output module 1005 fuses the first-class confidence and the second-class confidence to obtain the target confidence, and then determines the action category of the person in the input picture according to the target confidence; an end-to-end sketch of this workflow is given below
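  • putting the modules together, one recognition pass of device 1000 could be sketched as below; the feature-extraction CNN is abstracted behind a caller-supplied function, the helper reuses the Gaussian-process-style weighting sketched earlier, equal-ratio fusion is assumed, and all names are illustrative rather than taken from the disclosure

```python
import numpy as np

def gp_weights(query, bank, sigma=1e-3):
    """Gaussian-process-style similarity of one feature vector against a feature bank."""
    K_q = query @ bank.T
    K_b = bank @ bank.T
    return K_q @ np.linalg.inv(K_b + sigma * np.eye(bank.shape[0]))

def recognize(picture, extract_spatial, V_rgb, V_flow, M_rgb, M_flow, L, categories):
    """End-to-end sketch of one recognition pass of the motion recognition device 1000.

    picture         : input picture, in whatever form extract_spatial accepts
    extract_spatial : caller-supplied CNN wrapper, picture -> (M,) spatial feature
    V_rgb, V_flow   : video-warehouse spatial / optical flow features, shape (N, M)
    M_rgb, M_flow   : spatial / optical flow feature warehouses, shape (Nv+Ni, M)
    L               : one-hot labels of the warehouse entries, shape (Nv+Ni, P)
    """
    u_rgb = extract_spatial(picture)               # CNN module 1001
    u_flow = gp_weights(u_rgb, V_rgb) @ V_flow     # virtual optical flow module 1002
    L_rgb = gp_weights(u_rgb, M_rgb) @ L           # spatial prediction module 1003
    L_flow = gp_weights(u_flow, M_flow) @ L        # optical flow prediction module 1004
    target = 0.5 * (L_flow + L_rgb)                # fusion output module 1005 (equal ratio)
    return categories[int(np.argmax(target))]      # action category with highest confidence
```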
  • the disclosed systems, devices, and methods may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division into units is only a logical function division; in actual implementation there may be other division manners
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, which may be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium
  • based on this understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application
  • the aforementioned storage media include media that can store program code, such as USB flash drives, removable hard disks, read-only memories (ROM), random access memories (RAM), magnetic disks, and optical discs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

This application relates to artificial intelligence and provides an action recognition method and apparatus. The method includes: acquiring a picture to be processed; extracting a spatial feature of the picture to be processed; determining a virtual optical flow feature of the picture to be processed according to the spatial feature of the picture to be processed and X spatial features and X optical flow features in a preset feature library, where the X spatial features and the X optical flow features in the preset feature library are in one-to-one correspondence and X is an integer greater than 1; determining a first-class confidence of the picture to be processed on different action categories according to the similarity between the virtual optical flow feature of the picture to be processed and Y optical flow features in the preset feature library, where each of the Y optical flow features in the preset feature library corresponds to an action category and Y is an integer greater than 1; and determining the action category of the picture to be processed according to the first-class confidence. This application can improve the accuracy of action recognition.

Description

动作识别方法和装置
本申请要求于2018年05月29日提交中国专利局、申请号为201810533284.9、申请名称为“动作识别方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及动作识别技术领域,并且更具体地,涉及一种动作识别方法和装置。
背景技术
动作识别包括对视频中的人物的动作识别和对图片中的人物的动作识别,由于视频中包含的信息较多,对视频中的人物的动作识别相对比较容易。与视频相比,图片中包含的信息较少,因此,如何有效地识别图片中的人物的动作类别是一个需要解决的问题。
发明内容
本申请提供一种动作识别方法和装置,能够提高动作识别的准确度。
第一方面,提供了一种动作识别方法,该方法包括:获取待处理图片;确定待处理图片的空间特征;根据待处理图片的空间特征和特征库中的X个空间特征和X个光流特征,确定待处理图片的虚拟光流特征;根据待处理图片的虚拟光流特征与特征库中的Y个光流特征的相似度,确定待处理图片在不同动作类别上的第一类置信度;根据第一类置信度确定待处理图片动作类别。
应理解,上述特征库为预先设置的特征库,该特征库中包含多个空间特征和多个光流特征。特征库中的每个空间特征对应一种动作类别,特征库中的每个光流特征对应一种动作类别。上述X和Y均为大于1的整数。
可选地,上述每个空间特征对应的动作类别和每个光流特征对应的动作类别是预先根据卷积神经网络模型训练得到的。
可选地,特征库中的多个空间特征和多个光流特征存在一一对应关系,特征库中的每个空间特征对应一个光流特征,特征库中的每个光流特征对应一个空间特征。
本申请中,通过待处理图片的空间特征以及特征库中的空间特征和光流特征能够获取待处理图片的虚拟光流特征,从而为图片模拟出与动作密切相关的时序信息,接下来就可以根据待处理图片的虚拟光流特征与特征库中的光流特征的相似度对待处理图片进行动作识别。
另外,由于本申请是直接通过对比待处理图片的虚拟光流特征与特征库中的光流特征的方式来进行动作识别,无需建立训练模型来对待处理图片进行动作识别,可以利用较少的光流特征实现对待处理图片的动作识别。
可选地,上述X个空间特征和X个光流特征分别是特征库中的全部空间特征和全部 光流特征。
通过根据待处理图片的空间特征,以及特征库中的全部空间特征和全部光流特征来确定待处理图片的虚拟光流特征,可以更准确地确定待处理图片的虚拟光流特征,进而能够更准确地确定待处理图片的动作类别。
可选地,上述X个空间特征和X个光流特征分别是特征库中的部分空间特征和部分光流特征。
通过结合特征库中的部分空间特征和部分光流特征以及待处理图片的空间特征来确定待处理图片的虚拟光流特征,能够减少计算待处理图片的虚拟光流特征的计算量,进而提高对待处理图片进行动作识别的速度。
可选地,上述X个空间特征和X个光流特征一一对应,在X个空间特征和X个光流特征中,每个空间特征对应一个光流特征,每个光流特征对应一个空间特征。
应理解,上述Y个光流特征可以是特征库中的全部光流特征,也可以是特征库中的部分光流特征,另外,X与Y既可以相同,也可以不同。
当Y个光流特征为特征库中的全部光流特征时,是依据待处理图片的虚拟光流特征与特征库中的全部光流特征的相似度来获得待处理图片的动作类别,可以提高第一类置信度的准确程度,进而提高对待处理图片进行动作识别的效果。
而当Y个光流特征为特征库中的部分光流特征时,能够减少确定第一类置信度时的运算量,进而可以提高对待处理图片进行动作识别的速度。
可选地,上述待处理图片为包含人物的图片,根据第一类置信度确定待处理图片动作类别包括:根据第一类置信度确定待处理图片中人物的动作类别。
也就是说,在本申请中,确定待处理图片的动作类别其实是确定待处理图片中的人物或者其它目标物体的动作类别。
可选地,上述待处理图片为静态图片。
可选地,上述空间特征的具体为空间特征向量,上述光流特征具体为光流特征向量。
在某些实现方式中,根据待处理图片的空间特征以及特征库中的X个空间特征和X个光流特征,确定待处理图片的虚拟光流特征,包括:根据待处理图片的空间特征与特征库中X个的空间特征中的每个空间特征的相似度,对X个光流特征进行加权求和,得到待处理图片的虚拟光流特征。
在某些实现方式中,上述特征库包含训练视频的空间特征和光流特征。
本申请中,可以根据训练视频的空间特征和光流特征以及待处理图片的空间特征来确定待处理图片的虚拟光流特征,进而根据该虚拟光流特征确定待处理图片的动作类别。
在某些实现方式中,上述特征库还包含训练图片的空间特征和虚拟光流特征。
本申请中,可以综合根据训练视频和训练图片的各自的空间特征和光流特征以及待处理图片的空间特征来确定待处理图片的虚拟光流特征,可以得到更准确的虚拟光流特征,能够进一步提高动作识别的准确度。
可选地,训练图片的动作类别与训练视频的动作类别不完全相同。
由于训练视频的动作类别和训练图片的动作类别不完全相同,可以增加可以识别的动作类别的种类,进而提高动作识别的适用范围。
可选地,训练视频中不同动作类别的视频的数目相同。
当训练视频中不同类别的视频数目相同时,能够保证不同动作类别的训练视频的数量均衡性,保证动作识别结果的稳定性。
可选地,上述方法还包括:从预设的图片库中选择出与需要识别的动作类别相匹配的图片,得到所述训练图片。
上述图片库可以是本地的图片数据库,也可以是位于网络服务器中的图片数据库。
可选地,上述方法还包括:从预设的视频库中选择出与训练图片的空间特征的相似度满足预设要求的视频,得到所述训练视频。
上述视频库可以是本地的视频库,也可以是网络服务器中的视频库。
具体地,从预设的视频库中选择出与训练图片的空间特征的相似度满足预设要求的视频,得到训练视频,包括:从预设的视频库中选择出与训练图片的空间特征的相似度大于预设的相似度阈值的视频,得到训练视频。
例如,可以将预设的视频库中与训练图片的空间特征的相似度大于0.5的视频都选择出来,组成训练视频。
可选地,从预设的视频库中选择出与训练图片的空间特征的相似度满足预设要求的视频,得到训练视频,包括:确定视频库中的视频的空间特征与训练图片的空间特征的相似度;将视频库中与训练图片的空间特征的相似度最大的前J个视频选择出来,得到训练视频,其中,J小于K,J和K均为大于0的整数,K为视频库中视频的总数。
例如,视频库中一共包含100个视频,那么,可以将视频库中与训练图片的空间特征的相似度最大的前50个视频选择出来构成训练视频。
在某些实现方式中,上述根据待处理图片的空间特征以及特征库中的X个空间特征和X个光流特征,确定待处理图片的虚拟光流特征,具体包括:根据待处理图片的空间特征与X个空间特征中的每个空间特征的相似度,确定特征库中与X个空间特征中的每个空间特征相对应的光流特征的权重系数;根据X个光流特征中的每个光流特征的权重系数,对X个光流特征进行加权求和,得到待处理图片的虚拟光流特征。
应理解,特征库中相互对应的空间特征和光流特征对应的是同一个视频或者图片,也就是说,特征库中相互对应的空间特征和光流特征属于同一个视频或者同一个图片。
在某些实现方式中,上述X个光流特征中的每个光流特征的权重系数的大小与第一相似度是正相关的关系,其中,该第一相似度是X个空间特征中与X个光流特征中的每个光流特征相对应的空间特征与待处理图片的空间特征的相似度。
例如,上述X个空间特征中包括第一空间特征,上述X个光流特征中包括第一光流特征,第一空间特征与第一光流特征存在对应关系,第一空间特征与待处理图片的空间特征的相似度为相似度1,那么,第一光流特征的权重系数的大小与相似度1是正相关的关系(具体可以是成正比的关系)。
通过合理设置光流特征的权重系数使得根据特征库中的光流特征得到的待处理图片的虚拟光流特征更准确。
可选地,特征库包含训练视频的空间特征和光流特征,根据待处理图片的空间特征以及特征库中的空间特征和光流特征,确定待处理图片的虚拟光流特征,具体包括:根据待处理图片的空间特征与训练视频的每个空间特征的相似度,确定与训练视频的每个空间特征相对应的光流特征的权重系数;根据训练视频中的每个光流特征的权重系数,对特征库 中的光流特征进行加权求和,得到待处理图片的虚拟光流特征。
应理解,上述训练视频的空间特征和光流特征均为多个。
本申请中,只根据训练视频的空间特征和光流特征来确定待处理图片的虚拟光流特征,能够减少确定虚拟光流特征的复杂度。
可选地,特征库包含训练视频的空间特征和光流特征以及训练图片的空间特征和虚拟光流特征,根据待处理图片的空间特征以及特征库中的空间特征和光流特征,确定待处理图片的虚拟光流特征,具体包括:根据待处理图片的空间特征与训练视频和训练图片的每个空间特征的相似度,确定与训练视频和训练图片的每个空间特征相对应的光流特征的权重系数;根据训练视频和训练图片中的每个光流特征的权重系数,对训练视频和训练图片中的光流特征进行加权求和,得到待处理图片的虚拟光流特征。
应理解,上述训练图片的空间特征和光流特征均为多个。
本申请中,通过训练视频的空间特征和光流特征以及训练图片的空间特征和虚拟光流特征来综合确定待处理图片的虚拟光流特征,能够使得获取的待处理图片的虚拟光流特征更能够反映待处理图片的运动信息。
在某些实现方式中,所述训练图片的虚拟光流特征是根据所述训练图片的空间特征与所述训练视频的空间特征的相似度,对所述训练视频的光流特征进行加权求和得到的。
在某些实现方式中,上述方法还包括:根据训练图片的空间特征与训练视频的空间特征的相似度,对训练视频的光流特征进行加权求和,得到训练图片的虚拟光流特征。
可选地,根据训练图片的空间特征与训练视频的空间特征的相似度,对训练视频的光流特征进行加权求和,得到训练图片的虚拟光流特征,包括:根据训练图片的空间特征与训练视频中的每个空间特征的相似度,确定训练视频中与每个空间特征相对应的光流特征的权重系数;根据训练视频中的每个光流特征的权重系数,对训练视频中的光流特征进行加权求和,得到训练图片的虚拟光流特征。
应理解,上述特征库在初始时可以仅包含训练视频的空间特征和光流特征,为了进一步提高最终动作识别的准确性,可以在特征库中再加入训练图片的空间特征和虚拟光流特征,而该训练图片的虚拟光流特征可以根据特征库中包含的训练视频的空间特征和光流特征来确定。
因此,本申请中,通过训练视频的空间特征和光流特征确定训练图片的虚拟光流特征,并将训练图片的空间特征和虚拟光流特征并入到特征库中,能够在一定程度上提高动作识别的效果。
在某些实现方式中,上述方法还包括:根据待处理图片的空间特征与预设的特征库中的Z个空间特征的相似度,确定待处理图片在不同动作类别上的第二类置信度,其中,Z个空间特征中的每个空间特征对应一种动作类别;根据第一类置信度确定待处理图片的动作类别,包括:根据第一类置信度和第二类置信度,确定待处理图片的动作类别。
应理解,上述第一类置信度是通过光流预测过程得到的,上述第二类置信度是通过空间预测流程得到的。Z为大于1的整数。X、Y和Z中的任意两个数值可以相同,也可以不同。另外,上述Z个空间特征既可以是特征库中的全部空间特征也可以只是特征库中的部分空间特征。
本申请中,通过光流预测和空间预测来综合得到待处理图片的置信度,能够更准确地 确定待处理图片的动作类别。
在某些实现方式中,根据第一类置信度和第二类置信度,确定待处理图片的动作类别,包括:对第一类置信度和第二类置信度进行加权求和,得到待处理图片在不同动作类别上的最终置信度;根据最终置信度确定待处理图片的动作类别。
通过对第一类置信度和第二类置信度进行加权求和,能够得到可以综合反映待处理图片在不同动作类别上的置信度,能够更好地确定待处理图片的动作类别。
在某些实现方式中,在确定待处理图片的动作类别之后,上述方法还包括:将待处理图片的空间特征和虚拟光流特征,以及待处理图片的动作类别信息添加到所述特征库中。
通过将待处理图片的空间特征和虚拟光流特征,以及对应的动作类别信息添加到特征库中,能够扩充特征库中包含的空间特征和光流特征,便于后续依据特征库中的空间特征和光流特征更好地对图片进行动作识别。
第二方面,提供一种动作识别装置,该动作识别装置包括用于执行第一方面中的方法的模块。
第三方面,提供一种动作识别装置,该动作识别装置包括:存储器,用于存储程序;处理器,用于执行存储器存储的程序,当存储器存储的程序被执行时,处理器用于执行第一方面中的方法。
第四方面,提供一种计算机可读介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行第一方面中的方法。
第五方面,提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述第一方面中的方法。
第六方面,提供一种电子设备,该电子设备包括上述第二方面或第三方面中的动作识别装置。
附图说明
图1是本申请实施例的动作识别方法的示意性流程图;
图2是本申请实施例的动作识别方法的示意性流程图;
图3是根据CNN模型提取空间特征的示意图;
图4是提取视频的空间特征和光流特征的示意图;
图5是获取输入图片的虚拟光流特征的示意图;
图6是对输入图片进行光流预测的示意图;
图7是建立光流特征库和空间特征库的示意图;
图8是本申请实施例的动作识别装置的示意性框图;
图9是本申请实施例的动作识别装置的示意性框图;
图10是本申请实施例的动作识别装置的示意性框图;
图11是本申请实施例的动作识别装置对输入图片进行动作识别的示意性框图。
具体实施方式
下面将结合附图,对本申请中的技术方案进行描述。
本申请实施例的动作识别方法能够应用在图片检索、相册管理、平安城市、人机交互 以及需要识别进行动作识别的场景。具体而言,本申请实施例的动作识别方法能够应用在相册管理系统和图片查找系统中,下面分别对相册管理系统和图片查找系统进行简单的介绍。
相册管理系统:
当用户在手机或者云盘上存储了大量的图片时,为了方便地查找不同种类的图片,可以对相册中的图片进行分类。例如,可以利用本申请实施例的动作识别方法对相册中的图片进行动作识别,得到每个图片的动作类别,使得用户能够对不同动作类别的图片进行分类管理,从而方便用户查找图片,能够节省管理时间,提高相册管理的效率。
图片查找系统:
互联网上有海量的图片,目前查询图片时主要是根据图片所在的网页的文字进行查找,而图片所在的网页的文字并不能完全反映图片本身的特征。采用本申请实施例的动作识别方法能够识别出图片中的动作类型,进而可以根据图片的动作类别从互联网或者数据库中查找出自己感兴趣的图片。
为了实现对图片的动作识别,传统方案是从大量的训练图片中提取人物图像,然后根据从训练图片中提取到的人物图像以及人物图像对应的动作类别对卷积神经网络(convolutional neural networks,CNN)模型进行训练,得到CNN模型的各个参数。当需要对待处理图片中的人物进行动作识别时,可以从该待处理图片中提取人物图像,并将从待处理图片中提取出来的人物图像输入到已经训练好的CNN模型进行动作识别,从而确定待处理图片的动作类别。传统方案在进行动作识别时仅考虑到了图片的空间特征,而未进一步探索待处理图片中的人物的动作的时间属性,导致进行行动作识别的准确率仍然较低。
因此,本申请提出了一种新的动作识别方法,通过已知的光流特征来模拟待处理图片的光流特征,以得到待处理图片的虚拟光流特征,接下来再根据待处理图片的虚拟光流特征来对待处理图片的动作进行识别。
下面结合图1对本申请实施例的动作识别方法进行详细的介绍。图1所示的方法可以由动作识别装置执行,该动作识别装置具体可以是监控设备、终端设备、网络服务器以及网络云平台等具有图片处理功能的设备。图1所示的方法包括步骤101至步骤105,下面分别对步骤101至步骤105分别进行详细的描述。
101、获取待处理图片。
上述待处理图片(也可以称为待处理图像)可以是包含人物的图片。对待处理图片进行动作识别实质上是对待处理图片中的人物的动作进行识别,确定待处理图片的动作类别。
上述待处理图片可以是通过电子设备拍摄的照片,也可以是从视频中截取的图片。上述待处理图片可以存储在本地图片数据库中,也可以存储在网络中。
在获取待处理图片时可以从本地图片数据库中直接调取,也可以从网络中在线获取。
在本申请中,动作类别可以是指待处理图片中的人物具体在做什么动作,例如,动作类别可以包括:跑步(run)、行走(walk)、棒球投掷(Baseball_pitch)、棒球击球(baseball_swing)、投球(bowl)、挺举(clean_and_jerk)、打高尔夫球(golf_swing)、跳绳(jump_rope)、引体向上(pullup)、俯卧撑(pushup)、端坐(situp)、蹲坐(squat)、弹吉他(strum_guitar)以及游泳(swim)等等。
应理解,上述例子只是动作类型的一些具体例子(主要是体育运动方面的动作类别),事实上,本申请实施例包含的动作类型不限于此,本申请的动作类别还可以包含体育运动之外的其它动作,例如,看手机、人机交互系统中人的姿势等等。另外,上述动作类别还可以称为动作种类、动作类型等等。
102、提取待处理图片的空间特征。
在确定待处理图片的空间特征时,可以采用卷积神经网络(convolutional neural networks,CNN)模型(该CNN模型可以是预先训练好的模型)对待处理图片进行卷积运算,从而得到待处理图片的空间特征。
103、根据待处理图片的空间特征以及预设的特征库中的X个空间特征和Y个光流特征,确定待处理图片的虚拟光流特征。
其中,上述X和Y均为大于1的整数。
应理解,上述特征库可以是预先设置的特征库,该特征库中包含多个空间特征和多个光流特征,该多个空间特征中的每个空间特征对应一种动作类别,该多个光流特征中的每个光流特征也对应一种动作类别。其中,每个空间特征对应的动作类别以及每个光流特征对应的动作类别可以是预先根据卷积神经网络模型训练得到的。
另外,上述特征库中的多个空间特征和多个光流特征存在一一对应关系,多个空间特征中的任意一个空间特征对应多个光流特征中的一个光流特征,多个光流特征中的任意一个光流特征对应多个空间特征中的一个特征。特征库中的空间特征和光流特征的数目一般是相同的。
上述X个空间特征和X个光流特征也可以是一一对应的,也就是说,在X个空间特征和X个光流特征中,每个空间特征对应一个光流特征,每个光流特征对应一个空间特征。
上述X个空间特征可以是特征库中的全部空间特征或者部分空间特征。
上述X个光流特征可以是特征库中的全部光流特征或者部分光流特征。
当上述X个空间特征和X个光流特征分别为特征库中的全部空间特征和全部光流特征时,能够根据待处理图片的空间特征,以及特征库中的全部空间特征和全部光流特征来确定待处理图片的虚拟光流特征,可以更准确地确定待处理图片的虚拟光流特征,进而能够更准确地确定待处理图片的动作类别。
当上述X个空间特征和X个光流特征分别为特征库中的部分空间特征和部分光流特征时,通过结合特征库中的部分空间特征和部分光流特征以及待处理图片的空间特征来确定待处理图片的虚拟光流特征,能够减少计算待处理图片的虚拟光流特征的计算量,进而提高对待处理图片进行动作识别的速度。
另外,上述特征库中存在对应关系的空间特征和光流特征对应的是同一个视频或者图片,也就是说,特征库中对应关系的空间特征和光流特征属于同一个视频或者同一个图片。另外,本申请中提及的空间特征具体表现形式可以为空间特征向量,光流特征或者虚拟光流特征具体表现形式可以为光流特征向量或者虚拟光流特征向量。
具体地,由于特征库中的每个空间特征对应一个光流特征(特征库中的空间特征和光流特征是一一对应关系),因此,在根据待处理图片的空间特征以及预设的特征库中的空间特征和光流特征,确定待处理图片的虚拟光流特征时可以根据待处理图片的空间特征与 特征库中的空间特征的相似度,对特征库中与特征库中的空间特征相对应的光流特征进行加权求和,得到待处理图片的虚拟光流特征。
因此,上述步骤103中的根据待处理图片的空间特征以及预设的特征库中的X个空间特征和X个光流特征,确定待处理图片的虚拟光流特征,包含以下具体过程:
(1)、根据待处理图片的空间特征与X个空间特征中的每个空间特征的相似度,确定X个光流特征中的每个光流特征的权重系数(也可以称为加权系数);
(2)、根据X个光流特征中的每个光流特征的权重系数,对X个光流特征进行加权求和,得到待处理图片的虚拟光流特征。
可选地,上述X个光流特征中的每个光流特征的权重系数的大小与第一相似度是正相关的关系,其中,该第一相似度是X个空间特征中与X个光流特征中的每个光流特征相对应的空间特征与待处理图片的空间特征的相似度。
例如,上述X个空间特征中包括第一空间特征和第二空间特征,上述Y个光流特征中包括第一光流特征和第二光流特征,第一空间特征对应第一光流特征,第二空间特征对应第二光流特征,待处理图片的空间特征与第一空间特征的相似度为相似度1,待处理图片的空间特征与第二空间特征的相似度为相似度2,相似度1大于相似度2,那么,在对第一光流特征和第二光流特征以及X个光流特征中的其它光流特征进行加权求和时,第一光流特征的权重系数大于第二光流特征的权重系数。
本申请中,通过合理设置光流特征的权重系数使得根据特征库中的光流特征得到的待处理图片的虚拟光流特征更准确。
104、根据待处理图片的虚拟光流特征与预设的特征库中的Y个光流特征的相似度,确定待处理图片在不同动作类别上的第一类置信度。
其中,上述Y个光流特征中的每个光流特征对应一种动作类别,Y为大于1的整数。
应理解,上述Y个光流特征可以是特征库中的全部光流特征,也可以是特征库中的部分光流特征,另外,Y与X既可以相同,也可以不同。
当Y个光流特征为特征库中的全部光流特征时,是依据待处理图片的虚拟光流特征与特征库中的全部光流特征的相似度来获得待处理图片的动作类别,可以提高第一类置信度的准确程度,进而提高对待处理图片进行动作识别的效果。
而当Y个光流特征为特征库中的部分光流特征时,能够减少确定第一类置信度时的运算量,进而可以提高对待处理图片进行动作识别的速度。
105、根据第一类置信度确定待处理图片的动作类别。
本申请中,通过待处理图片的空间特征以及预设特征库中的空间特征和光流特征能够获取待处理图片的虚拟光流特征,从而为图片模拟出与动作密切相关的时序信息,接下来就可以根据待处理图片的虚拟光流特征与预设特征库中的光流特征的相似度对待处理图片进行动作识别。
进一步地,由于本申请是直接通过对比待处理图片的虚拟光流特征与特征库中的光流特征的方式来进行动作识别,无需建立训练模型来对待处理图片进行动作识别,可以利用较少的光流特征实现对待处理图片的动作识别。
可选地,上述特征库中的空间特征包含训练视频的空间特征,特征库中的光流特征包含训练视频的光流特征。
其中,上述训练视频的空间特征可以为多个,上述训练视频的光流特征也可以为多个。
本申请中,根据训练视频的空间特征和光流特征能够模拟出待处理图片的虚拟光流特征,进而可以综合待处理图片的空间特征和虚拟光流特征进行动作识别,提高动作动作识别的准确性。
可选地,上述特征库的空间特征还包含训练图片的空间特征,上述特征库中的光流特征还包括训练图片的虚拟光流特征。
其中,上述训练图片的空间特征可以为多个,上述训练图片的光流特征也可以为多个。
上述特征库中不仅包含训练视频的空间特征和光流特征,还包含训练图片的空间特征和光流特征,能够综合训练视频和训练图片的空间特征和光流特征来确定待处理图片的虚拟光流特征,可以进一步提高最终动作识别的准确度。
通过训练视频的空间特征和光流特征以及训练图片的空间特征和虚拟光流特征来综合确定待处理图片的虚拟光流特征,能够得到更准确的虚拟光流特征。
应理解,在根据特征库中空间特征和光流特征确定待处理图片的虚拟光流特征时,既可以只根据特征库中训练视频的空间特征和光流特征来确定待处理图片的虚拟光流特征,也可以结合特征库中训练视频的空间特征和光流特征以及训练图片的空间特征和虚拟光流特征来综合确定待处理图片的虚拟光流特征。
可选地,特征库包含训练视频的空间特征和光流特征,根据待处理图片的空间特征以及特征库中的空间特征和光流特征,确定待处理图片的虚拟光流特征,具体包括:根据待处理图片的空间特征与训练视频的每个空间特征的相似度,确定与训练视频的每个空间特征相对应的光流特征的权重系数;根据训练视频中的每个光流特征的权重系数,对特征库中的光流特征进行加权求和,得到待处理图片的虚拟光流特征。
本申请中,只根据训练视频的空间特征和光流特征来确定待处理图片的虚拟光流特征,能够减少确定虚拟光流特征的复杂度。
可选地,特征库包含训练视频的空间特征和光流特征以及训练图片的空间特征和虚拟光流特征,根据待处理图片的空间特征以及特征库中的空间特征和光流特征,确定待处理图片的虚拟光流特征,具体包括:根据待处理图片的空间特征与训练视频和训练图片的每个空间特征的相似度,确定与训练视频和训练图片的每个空间特征相对应的光流特征的权重系数;根据训练视频和训练图片中的每个光流特征的权重系数,对训练视频和训练图片中的光流特征进行加权求和,得到待处理图片的虚拟光流特征。
本申请中,通过训练视频的空间特征和光流特征以及训练图片的空间特征和虚拟光流特征来综合确定待处理图片的虚拟光流特征,能够使得获取的待处理图片的虚拟光流特征更能够反映待处理图片的运动信息。
可选地,上述特征库中的训练图片的虚拟光流特征可以是根据训练视频的空间特征和光流特征以及训练图片的空间特征得到的。也就是说,训练图片的虚拟光流特征是根据训练图片的空间特征与训练视频的空间特征的相似度,对训练视频的光流特征进行加权求和得到的。
具体地,可以在对待处理图片进行动作识别之前先确定训练图片的虚拟光流特征。
可选地,作为一个实施例,图1所示的方法还包括:根据训练图片的空间特征与训练视频的空间特征的相似度,对训练视频的光流特征进行加权求和,得到训练图片的虚拟光 流特征。
上述根据训练图片的空间特征与训练视频的空间特征的相似度,对训练视频的光流特征进行加权求和,得到训练图片的虚拟光流特征,具体包括:
根据训练图片的空间特征与训练视频中的每个空间特征的相似度,确定训练视频中与每个空间特征相对应的光流特征的权重系数;根据训练视频中的每个光流特征的权重系数,对训练视频中的光流特征进行加权求和,得到训练图片的虚拟光流特征。
除了根据待处理图片的虚拟光流特征计算待处理图片在不同动作类别上置信度之外,还可以再根据待处理图片的空间特征来计算待处理图片在不同动作类别上的置信度,然后根据这两类置信度来综合判断待处理图片的动作类别。
具体地,在步骤102中提取到待处理图片的空间特征之后,可以根据待处理图片的空间特征与预设的特征库中的Z个空间特征的相似度,确定待处理图片在不同动作类别上的第二类置信度,其中,该Z个空间特征中的每个空间特征对应一种动作类别。
上述Z为大于1的整数,Z与上述X或者Y的数值既可以相同,也可以不同,上述Z个空间特征既可以是特征库中的全部空间特征也可以只是特征库中的部分空间特征。
在得到了上述第二类置信度之后,可以综合根据第一类置信度和第二类置信度来确定待处理图片的动作类别。
本申请中,通过光流预测和空间预测来综合得到待处理图片的置信度,能够更准确地确定待处理图片的动作类别。
具体地,在根据第一类置信度和第二类置信度来确定待处理图片的动作类别时,可以先对第一类置信度和第二类置信度进行加权求和,得到待处理图片在不同动作类别上的最终置信度,然后再根据最终置信度确定待处理图片的动作类别。
通过对第一类置信度和第二类置信度进行加权求和,能够得到可以综合反映待处理图片在不同动作类别上的置信度,能够更好地确定待处理图片的动作类别。
应理解,还可以根据第一类置信度和第二类置信度分别确定待处理图片的动作类别,然后再确定待处理图片的动作类别。
可选地,根据第一类置信度和第二类置信度,确定待处理图片的动作类别,包括:根据第一类置信度确定待处理图片的动作类别为第一动作类别;根据第二类置信度确定待处理图片的动作类别为第二动作类别;在第一动作类别和第二动作类别相同的情况下,确定待处理图片的动作类别为第一动作类别。
为了增加特征库中包含的空间特征和光流特征,便于后续对图片进行更好的动作识别,在确定了待处理图片的动作类别之后,还可以将待处理图片的空间特征和光流特征以及该待处理图片的动作类别等信息添加到特征库中。
可选地,作为一个实施例,在确定了待处理图片的动作类别之后,图1所示的方法还包括:将待处理图片的空间特征和虚拟光流特征,以及待处理图片的动作类别信息添加到所述特征库中。
通过将待处理图片的空间特征和虚拟光流特征,以及对应的动作类别信息添加到特征库中,能够扩充特征库中包含的空间特征和光流特征,便于后续依据特征库中的空间特征和光流特征更好地对图片进行动作识别。
下面结合图2对本申请实施例的动作识别方法的过程进行详细的描述,
图2是本申请实施例的动作识别方法的示意图。图2所示的动作识别方法的具体过程包括:
201、获取输入图片。
输入图片相当于上文中的待处理图片。
202、提取输入图片的空间特征。
具体地,在步骤202中,可以采用卷积神经网络CNN模型来提取输入图片的空间特征。如图3所示,通过CNN模型对输入图片进行卷积处理,得到输入图像的卷积特征图
Figure PCTCN2019088694-appb-000001
接下来再将
Figure PCTCN2019088694-appb-000002
拉成一维向量,得到向量u rgb。向量u rgb就是输入图片的空间特征。CNN模块可以采用多种构架实现,例如,VGG16、TSN网络等。另外,CNN模块的系数需要在动作识别的数据集上进行预训练。
203、生成输入图片的虚拟光流特征。
在步骤203中,可以根据视频仓库中存储的视频的光流特征来模拟或者生成输入图像的虚拟光流特征。
具体地,假设视频仓库中存储了N个视频,那么,可以根据该N个视频的虚拟光流特征来生成输入图片的虚拟光流特征。
在生成输入图片的虚拟光流特征之前需要先获取N个视频的空间特征和光流特征。对于N个视频中的每个视频来说,都可以按照图4所示的过程来提取该视频的空间特征和光流特征。
如图4所示,提取视频的空间特征和光流特征的具体过程包括:
首先,抽取视频中间的RGB图像和中间的光流图(该中间的光流图包括光流x和光流y);
其次,将视频中间的RGB图像送入到预先训练好的空间特征CNN模型,得到视频的空间特征;
再次,将视频中间的光流图送入到预先训练好的光流特征CNN模型,得到视频的光流特征;
最后,将每个视频的空间特征和光流特征送入到视频仓库中。
应理解,视频中间的光流图可以由视频中间时刻前后若干帧图片产生。另外,提取视频的空间特征与提取视频的光流特征可以是相互独立的,两者既可以同时进行,也可以依次进行。
上述图4所示的过程中提取到的视频的空间特征和光流特征具体可以是空间特征向量和光流特征向量,其中,每个视频的空间特征向量和光流特征向量的长度可以均为M,那么,N个视频空间特征向量就可以用一个矩阵V rgb∈N*M表示,N个视频光流特征向量可以用一个矩阵V flow∈N*M表示。这样就获取到了N个视频的空间特征向量V rgb∈N*M和光流特征向量V flow∈N*M。
在得到了视频仓库中的N个视频的空间特征和光流特征之后,就可以根据输入图片的空间特征与N个视频中的每个视频的空间特征的相似度,对N个视频的光流特征进行加权求和,从而得到输入图片的虚拟光流特征。
如图5所示,根据输入图片的空间特征与N个视频中的每个视频的空间特征的相似度,对N个视频的光流特征进行加权求和,得到输入图片的虚拟光流特征的具体过程包 括:
首先,将输入图片的空间特征和视频仓库中的视频的空间特征进行比较,得到输入图片的空间特征与视频仓库中的每个视频的空间特征的相似度。
其次,根据输入图片的空间特征与视频仓库中的每个视频的空间特征的相似度,对视频仓库中的每个视频的光流特征进行加权求和,得到输入图片的虚拟光流特征。
具体地,可以采用高斯过程来计算输入图片的空间特征和视频仓库中的视频的空间特征的相似度。例如,可以采用公式(3)来确定输入图片的空间特征和视频仓库中的视频的空间特征的相似度。
Figure PCTCN2019088694-appb-000003
其中,u rgb为输入图片的空间特征,V rgb为视频仓库中的视频的空间特征,K h(u rgb,V rgb)∈1*N中的每个元素是u rgb与V rgb每行的点积,K h(V rgb,V rgb)是V rgb的协方差矩阵,
Figure PCTCN2019088694-appb-000004
是一个噪声参数,I为单位矩阵,w h为输入图片的空间特征和视频仓库中的视频的空间特征的相似度。其中,w h是一个长度为N的一维向量,w h中的第i个元素表示输入图片的空间特征与第i个视频的空间特征的相似度,w h中的第i个元素的数值越大,输入图片的空间特征与第i个视频的空间特征的相似度越大。
在得到了输入图片的空间特征与视频仓库中的每个视频的空间特征的相似度之后,具体可以采用公式(4)来计算输入图片的虚拟光流特征。
u flow=w h*V flow∈1*M        (4)
其中,w h表示输入图片的空间特征与视频仓库中的视频的空间特征的相似度,V flow表示视频仓库中的视频的光流特征,u flow表示输入图片的虚拟光流特征,u flow也是一个长度为M的特征向量。
204、对输入图片进行光流预测,得到输入图片中的人物在各个动作类别上的第一类置信度。
如图6所示,对输入图片进行光流预测,得到第一类置信度的具体过程如下:
首先,确定输入图片的虚拟光流特征与光流特征库中的光流特征的相似度;
其次,根据输入图片的虚拟光流特征与光流特征库中的光流特征的相似度,对光流特征库中的光流特征对应的动作类别标签进行加权求和,得到光流预测的置信度(相当于步骤204中的第一类置信度)。
其中,动作类别标签(One-hot-label)用来表示每个视频或图片的动作类别标签。动作类别标签可以采用一个向量来表示,该向量的长度与动作类型的总数相同,动作类别标签中的每个位置对应一个动作类别。该向量有且仅有一个位置的值为1,其余位置为0,其中,值为1的位置对应的动作类别即为该视频或图片的动作类别。
例如,现有3个视频和3张图片:视频1(Video1)、视频2(Video2)、视频3(Video3)、图片1(Image1)、图片2(Image2)和图片3(Image3)。动作类别依次为跑步(Run)、跳舞(Dance)、跑步(Run)、跳跃(Jump)、跳舞(Dance)和跑步(Run)。
表1
  视频1 视频2 视频3 图片1 图片2 图片3
跑步 1 0 1 0 0 1
跳舞 0 1 0 0 1 0
跳跃 0 0 0 1 0 0
如表1所示,这3个视频和3张图片的动作类别标签依次为[1,0,0],[0,1,0],[1,0,0],[0,0,1],[0,1,0],[1,0,0]。那么,根据这些动作类别标签可知这3个视频和3张图片依次对应的动作类别分别是跑步、跳舞、跑步、跳跃、跳舞和跑步。
上述光流特征库中可以包含N v个视频的光流特征和N i张图片的虚拟光流特征,其中,N v个视频的光流特征为
Figure PCTCN2019088694-appb-000005
N i张图片的虚拟光流特征为
Figure PCTCN2019088694-appb-000006
两者共同构成了光流特征仓库的光流特征
Figure PCTCN2019088694-appb-000007
在计算输入图片的虚拟光流特征与光流特征库中的光流特征的相似度时,仍然可以采用高斯过程进行计算。例如,可以采用公式(5)来计算输入图片的虚拟光流特征与光流特征库中的光流特征的相似度。
w flow=K P(u flow,M flow)[K p(M flow,M flow)+Σ p] -1∈1*(N v+N i)   (5)
其中,u flow表示输入图片的虚拟光流特征,M flow表示光流特征仓库的光流特征,K P(u flow,M flow)∈1*(N v+N i)中每个元素是u flow与M flow每行的点积,K p(M flow,M flow)是M flow的协方差矩阵,
Figure PCTCN2019088694-appb-000008
是一个噪声参数矩阵,w flow表示输入图片的虚拟光流特征与光流特征库中的光流特征的相似度,w flow是一个长度为N v+N i的一维向量,其中第i个元素表示输入图片的虚拟光流特征与第i个光流特征的相似度,这个值越大,说明输入图片的光流特征与该第i个光流特征相越接近。
在得到输入图片的虚拟光流特征与光流特征库中的光流特征的相似度之后,可以根据公式(6)来计算输入图片在各个动作类别上的第一类置信度。
L flow=(w flowL)∈1*P   (6)
其中L∈(N v+N i)*P中的每一行表示光流特征仓库中每个光流特征对应的动作类别标签,P是动作类别的总数,对于每个动作类别标签来说,只有在其所属类别上为1,其余位置为0。其中,L flow也就是输入图片中的人物在各个动作类别上的第一类置信度。
下面结合表2对计算输入图片中的人物在各个动作类别上的第一类置信度的过程进行说明。假设光流特征仓库中包含3个视频和3张图片:视频1(Video1)、视频2(Video2)、视频3(Video3)、图片1(Image1)、图片2(Image2)和图片3(Image3)。动作类别依次为跑步(Run)、跳舞(Dance)、跑步(Run)、跳跃(Jump)、跳舞(Dance)和跑步(Run)。这3个视频和3个图片各自对应的动作类别标签如表2的第2列至第7列(不包含最后一行)所示,输入图片与光流特征库中的3个视频/图片的光流特征的相似度如表2最后一行所示,最终得到的输入图片在各个动作类别上的第一类置信度如表2最后一列所示。
表2
  视频1 视频2 视频3 图片1 图片2 图片3 置信度
跑步 1 0 1 0 0 1 0.3
跳舞 0 1 0 0 1 0 0.45
跳跃 0 0 0 1 0 0 0.25
  0.1 0.2 0.01 0.25 0.25 0.19  
205、对输入图片进行空间预测,得到输入图片中的人物在各个动作上的第二类置信度。
对输入图片进行空间预测的过程与对输入图片进行光流预测的过程基本相同,首先比较输入图片的空间特征u rgb与空间特征库中的空间特征M rgb的相似度w rgb,然后利用w rgb去加权动作类别得到在空间上每个类别的置信度L rgb,最终得到的输入图片中的人物在各个动作类别上的第二类置信度。
上述步骤204和步骤205中采用的光流特征库和空间特征库可以是预先建立好的特征库。
具体地,可以采用图7所示的过程来建立光流特征库(也可以称为光流特征仓库)和空间特征库(也可以称为空间特征仓库)。
如图7所示,从训练视频集合和训练图片集合中提取空间特征和光流特征,并将提取到的训练视频集合的空间特征和训练图片集合中的空间特征送入到空间特征库中,将提取到的训练视频集合的光流特征和训练图片集合中的光流特征送入到光流特征库中。
假设,最终得到的空间特征库包含N v个训练视频的空间特征和N i张图片的空间特征,最终得到的光流特征库中包含N i个训练视频的光流特征和N i张图片的虚拟光流特征。那么,空间特征库中的空间特征可以表示为
Figure PCTCN2019088694-appb-000009
光流特征库中的光流特征可以表示为
Figure PCTCN2019088694-appb-000010
另外,上述训练视频集合和训练图片集合可以是存储在本地的数据库中的视频集合和图片集合。
206、对第一类置信度和第二类置信度进行融合处理,得到目标置信度。
在对第一类置信度和第二类置信度进行融合时,可以采用但不限于等比例融合,从而得到输入图片在每个动作类别上的目标置信度(也就是输入图片在每个动作类别上的最终置信度)。
207、根据目标置信度确定输入图片中的人物的动作类别。
应理解,上述目标置信度包含输入图片在各个动作类别上的置信度,因此,在根据目标置信度确定输入图片中的人物的动作类别时,可以将目标置信度中最大的置信度对应的动作类别确定为输入图片中的人物的动作类别。
进一步地,在根据目标置信度确定输入图片中的人物的动作类别时,还可以先从目标置信度中选择出大于预设阈值的置信度,然后再从该置信度中选择出最大置信度,并将该最大的置信度对应的动作类别确定为输入图片中的人物的动作类别。
当目标置信度中不存在大于预设阈值的置信度时,说明在进行动作识别时,没有能够识别出与该输入图片的准确动作类别。
应理解,在上述步骤201至步骤207之前,还可以先从本地视频库中选择出训练图片相关度较高的视频放入到视频仓库中。
假设,现有的视频库中的视频共对应P v个动作类别,不同动作类别的视频个数不一致。为了避免动作类别的不均衡性,要从每个动作类别选择相同个数(比如K,K为大于0的整数)个候选视频,组成P v个视频包(Video Bag),其中,每个视频包里面有K个候选视频。现有的训练图片集合共有P i个动作类别,并且这些动作类别与视频的动作类别不完全相同。
那么,根据训练图片选择与训练图片相关度较高的视频的具体过程如下:
(1)、提取视频库中每个视频包中的视频的空间特征和光流特征;
(2)、提取训练图片集合中所有训练图片的空间特征,并根据视频库中每个视频包中的视频的空间特征和光流特征确定每个训练图片的虚拟光流特征;
(3)、根据训练图片的空间特征和虚拟光流特征建立本地的图片空间特征库和图片光流特征库;
(4)、将视频包中的每个视频的空间特征和光流特征分别与图片空间特征库和图片光流特征库进行相似度比较,最后得到视频包中的每个视频在不同动作类别置信度,然后将置信度的最大值作为每个视频与训练图片的相似度度量;
(5)、在每个视频包中,选择相似度度量最大的前J(J<K,J和K均为大于0的整数,)个视频作为最终入库的视频。
与视频相比,图片中的动作缺乏时间上下文关系,进行动作识别时难度较大。本申请提出一种基于虚拟光流和特征库的动作识别方法。通过为单张图片产生与动作密切相关的光流特征,进而结合图片的空间特征和动作特征进行动作识别,能够提高动作识别准确率。
另外,本申请利用训练视频和训练图片的空间特征和(虚拟)光流特征建立特征库,并通过输入图片的空间特征和虚拟光流特征与特征库进行比较得到动作类别,进而可以在训练数据较为稀少的情况下,取得较高的动作识别准确率。
为了与现有的动作识别方法的性能进行对比,下面结合表3对本申请实施例的动作识别方法的识别效果进行说明。表3示出了本申请实施例的动作识别方法和现有动作识别方法在不同动作识别数据集上的识别准确率。为了凸显本申请适合于训练图片稀少的情况,表2所示的训练集中,每个类别的训练图片仅用1张图片作为训练集。
表3
动作识别方法 WEB101数据集 VOC数据集 DIFF20数据集
KNN算法 26.1 38.3 55.7
SVM算法 22.3 32.0 54.2
TSN算法 26.1 40.3 56.3
RCNN算法 n/a 28.3 n/a
本申请 35.4 42.2 60.2
从表3中可以看出,在不同数据集上,本申请的动作识别方法的识别准确率均高其它现有方案的识别准确率,因此,本申请实施例的动作识别方法具有较高的识别准确率。
上文结合图1至图7对本申请实施例的动作识别方法进行了详细的描述。下文结合图8至图11对本申请实施例的动作识别装置进行描述,应理解,图8至图11所示的动作识别装置具体可以是监控设备、终端设备、网络服务器以及网络云平台等具有图片处理功能的设备。图8至图11所示的动作识别装置可以执行本申请实施例的动作识别方法的各个步骤,为了简洁,下面适当省略重复的描述。
图8是本申请实施例的动作识别装置的示意性框图。图8所示的动作识别装置800包括:
获取模块801,用于获取待处理图片;
提取模块802,用于提取所述待处理图片的空间特征;
处理模块803,用于根据所述待处理图片的空间特征以及预设的特征库中的X个空间特征和X个光流特征,确定所述待处理图片的虚拟光流特征,其中,所述X个空间特征和所述X个光流特征存在一一对应关系,X为大于1的整数;
所述处理模块803还用于根据所述待处理图片的虚拟光流特征与所述特征库中的Y个光流特征的相似度,确定所述待处理图片在不同动作类别上的第一类置信度,其中,所述Y个光流特征中的每个光流特征对应一种动作类别,Y为大于1的整数;
所述处理模块803还用于根据所述第一类置信度确定所述待处理图片的动作类别。
本申请中,通过待处理图片的空间特征以及特征库中的空间特征和光流特征能够获取待处理图片的虚拟光流特征,从而为图片模拟出与动作密切相关的时序信息,接下来就可以根据待处理图片的虚拟光流特征与特征库中的光流特征的相似度对待处理图片进行动作识别。
图9是本申请实施例的动作识别的装置的硬件结构示意图。图9所示的动作识别装置900(该动作识别装置900具体可以是一种计算机设备)包括存储器901、处理器902、通信接口903以及总线904。其中,存储器901、处理器902、通信接口903通过总线904实现彼此之间的通信连接。
存储器901可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器901可以存储程序,当存储器901中存储的程序被处理器902执行时,处理器902和通信接口903用于执行本申请实施例的动作识别方法的各个步骤。
处理器902可以采用通用的中央处理器(central processing unit,CPU),微处理器,应用专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的动作识别的装置中的模块所需执行的功能,或者执行本申请方法实施例的动作识别方法。
处理器902还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的动作识别方法的各个步骤可以通过处理器902中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器902还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是 微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器901,处理器902读取存储器901中的信息,结合其硬件完成本申请实施例的动作识别的装置中包括的模块所需执行的功能,或者执行本申请方法实施例的动作识别方法。
通信接口903使用例如但不限于收发器一类的收发装置,来实现装置900与其他设备或通信网络之间的通信。例如,可以通过通信接口903获取待处理图片。
总线904可包括在装置900各个部件(例如,存储器901、处理器902、通信接口903)之间传送信息的通路。
应注意,尽管图9所示的装置900仅仅示出了存储器901、处理器902、通信接口903,但是在具体实现过程中,本领域的技术人员应当理解,装置900还包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置900还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置900也可仅仅包括实现本申请实施例所必须的器件,而不必包括图9中所示的全部器件。
应理解,动作识别装置800中的获取模块801相当于动作识别装置900中的通信接口903,提取模块802和处理模块803相当于处理器902。
图10是本申请实施例的动作识别装置的示意性框图。图10所示的动作识别装置1000包括:CNN模块1001、虚拟光流模块1002、空间预测模块1003、光流预测模块1004、融合输出模块1005、视频仓库1006、空间特征仓库1007和光流特征仓库1008。
其中,视频仓库1006、空间特征仓库1007和光流特征仓库1008既可以位于动作识别装置1000的内部,也可以位于动作识别装置1000的外部,例如,视频仓库1006、空间特征仓库1007和光流特征仓库1008可以位于服务器中或者本地的数据库中,动作识别装置可以从服务器或者本地的数据库中调取视频仓库1006、空间特征仓库1007和光流特征仓库1008中包含的特征。
应理解,动作识别装置1000可以执行本申请实施例的动作识别方法的各个步骤。各个模块的具体作用如下:
CNN模块1001,用于对待处理图片进行卷积运算,得到图像的空间特征u rgb
视频仓库1006,用于存储训练视频的空间特征V rgb和光流特征V flow
虚拟光流模块1002,用于利用待处理图片的空间特征u rgb以及视频仓库1006中存储的训练视频的空间特征V rgb和光流特征V flow,产生待处理图片的虚拟光流特征u flow
空间特征仓库1007,用于存储训练视频和图像的空间特征以及类标签。
光流特征仓库1008,用于存储训练视频的光流特征、训练图片的虚拟光流特征以及它们的动作类别标签。
空间预测模块1003,用于把待处理图片的空间特征和空间特征仓库中的特征进行比较,得到待处理图片在各个动作类别上的置信度。
光流预测模块1004,用于把待处理图片的虚拟光流特征和光流特征仓库中的特征进行比较,得到待处理图片在各个类别上的置信度。
融合输出模块1005,用于把空间预测模块1003和光流预测模块1004得到的待处理 图片在各个动作类别的置信度进行融合,得到最终每个类别的置信度,选取置信度最大的动作类别输出。
可选地,图10所示的动作识别装置1000还可以包括获取模块1009和视频选择模块1010,其中,获取模块1009用于获取待处理图片,视频选择模块1010用于为视频仓库1006选择视频。
应理解,动作识别装置1000中的获取模块1009相当于动作识别装置900中的通信接口903,动作识别装置1000中的其它模块相当于动作识别装置900中的处理器902。
上述动作识别装置1000中的CNN模块1001、虚拟光流模块1002、空间预测模块1003、光流预测模块1004、融合输出模块1005、获取模块1009以及视频选择模块1010在具体实现时可以采用硬件或者软件,或者硬件和软件相结合的方式来实现。
为了更好地理解动作识别装置1000中各个模块的工作流程,下面结合图11对动作识别装置1000进行动作识别的过程进行大致的描述(详细过程可以参见图2所示的方法中的各个步骤,这里不再详细描述)。
如图11所示,动作识别装置1000对输入图片进行动作识别的过程具体如下:
CNN模块1001得到输入图片,提取该输入图片的空间特征;
虚拟光流模块1002根据视频仓库1006中的光流特征和空间特征以及输入图片的空间特征确定输入图片的虚拟光流特征;
接下来,光流预测模块1004根据提取到的输入图片的虚拟光流特征以及光流特征仓库1008中的光流特征进行光流预测,得到第一类置信度;
空间预测模块1003根据提取到的输入图片的空间特征以及空间特征仓库1007中的空间特征进行空间预测,得到第二类置信度;
融合输出模块1005对第一类置信度和第二类置信度进行融合,得到目标置信度,然后根据目标置信度确定输入图片中的人物的动作类别。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (18)

  1. 一种动作识别方法,其特征在于,包括:
    获取待处理图片;
    提取所述待处理图片的空间特征;
    根据所述待处理图片的空间特征以及预设的特征库中的X个空间特征和X个光流特征,确定所述待处理图片的虚拟光流特征,其中,所述X个空间特征和所述X个光流特征存在一一对应关系,X为大于1的整数;
    根据所述待处理图片的虚拟光流特征与所述特征库中的Y个光流特征的相似度,确定所述待处理图片在不同动作类别上的第一类置信度,其中,所述Y个光流特征中的每个光流特征对应一种动作类别,Y为大于1的整数;
    根据所述第一类置信度确定所述待处理图片的动作类别。
  2. 如权利要求1所述的方法,其特征在于,所述根据所述待处理图片的空间特征以及预设的特征库中的X个空间特征和X个光流特征,确定所述待处理图片的虚拟光流特征,包括:
    根据所述待处理图片的空间特征与所述X个空间特征中的每个空间特征的相似度,确定所述X个光流特征中的每个光流特征的权重系数;
    根据X个光流特征中的每个光流特征的权重系数,对所述X个光流特征进行加权求和,得到所述待处理图片的虚拟光流特征。
  3. 如权利要求2所述的方法,其特征在于,所述X个光流特征中的每个光流特征的权重系数的大小与第一相似度正相关,所述第一相似度为所述X个空间特征中与所述X个光流特征中的每个光流特征相对应的空间特征与所述待处理图片的空间特征的相似度。
  4. 如权利要求1-3中任一项所述的方法,其特征在于,所述特征库中的空间特征包括训练视频的空间特征,所述特征库中的光流特征包括所述训练视频的光流特征。
  5. 如权利要求4所述的方法,其特征在于,所述特征库中的空间特征还包括训练图片的空间特征,所述特征库的光流特征还包括所述训练图片的虚拟光流特征。
  6. 如权利要求5所述的方法,其特征在于,所述训练图片的虚拟光流特征是根据所述训练图片的空间特征与所述训练视频的空间特征的相似度,对所述训练视频的光流特征进行加权求和得到的。
  7. 如权利要求1-6中任一项所述的方法,其特征在于,所述方法还包括:
    根据所述待处理图片的空间特征与所述特征库中的Z个空间特征的相似度,确定所述待处理图片在不同动作类别上的第二类置信度,其中,所述Z个空间特征中的每个空间特征对应一种动作类别,Z为大于1的整数;
    所述根据所述第一类置信度确定所述待处理图片的动作类别,包括:
    根据所述第一类置信度和所述第二类置信度,确定所述待处理图片的动作类别。
  8. 如权利要求7所述的方法,其特征在于,所述根据所述第一类置信度和第二类置信度,确定所述待处理图片的动作类别,包括:
    对所述第一类置信度和所述第二类置信度进行加权求和,得到所述待处理图片在不同 动作类别上的最终置信度;
    根据所述最终置信度确定所述待处理图片的动作类别。
  9. 如权利要求1-8中任一项所述的方法,其特征在于,在确定所述待处理图片的动作类别之后,所述方法还包括:
    将所述待处理图片的空间特征和虚拟光流特征,以及所述待处理图片的动作类别信息添加到所述特征库中。
  10. 一种动作识别装置,其特征在于,包括:
    存储器,用于存储程序;
    处理器,用于执行所述存储器中存储的程序,当所述存储器中存储的程序被执行时,所述处理器用于:
    获取待处理图片;
    提取所述待处理图片的空间特征;
    根据所述待处理图片的空间特征以及预设的特征库中的X个空间特征和X个光流特征,确定所述待处理图片的虚拟光流特征,其中,所述X个空间特征和所述X个光流特征存在一一对应关系,X为大于1的整数;
    根据所述待处理图片的虚拟光流特征与所述特征库中的Y个光流特征的相似度,确定所述待处理图片在不同动作类别上的第一类置信度,其中,所述Y个光流特征中的每个光流特征对应一种动作类别,Y为大于1的整数;
    根据所述第一类置信度确定所述待处理图片的动作类别。
  11. 如权利要求10所述的装置,其特征在于,所述处理器用于:
    根据所述待处理图片的空间特征与所述X个空间特征中的每个空间特征的相似度,确定所述X个光流特征中的每个光流特征的权重系数;
    根据所述特征库中的每个光流特征的权重系数,对所述特征库中的光流特征进行加权求和,得到所述待处理图片的虚拟光流特征。
  12. 如权利要求11所述的装置,其特征在于,所述X个光流特征中的每个光流特征的权重系数的大小与第一相似度正相关,所述第一相似度为所述X个空间特征中与所述X个光流特征中的每个光流特征相对应的空间特征与所述待处理图片的空间特征的相似度。
  13. 如权利要求10-12中任一项所述的装置,其特征在于,所述特征库中的空间特征包括训练视频的空间特征,所述特征库中的光流特征包括所述训练视频的光流特征。
  14. 如权利要求13所述的装置,其特征在于,所述特征库中的空间特征还包括训练图片的空间特征,所述特征库的光流特征还包括所述训练图片的虚拟光流特征。
  15. 如权利要求14所述的装置,其特征在于,所述训练图片的虚拟光流特征是根据所述训练图片的空间特征与所述训练视频的空间特征的相似度,对所述训练视频的光流特征进行加权求和得到的。
  16. 如权利要求10-15中任一项所述的装置,其特征在于,所述处理器用于:
    根据所述待处理图片的空间特征与所述特征库中的Z个空间特征的相似度,确定所述待处理图片在不同动作类别上的第二类置信度,其中,所述Z个空间特征中的每个空间特征对应一种动作类别,Z为大于1的整数;
    根据所述第一类置信度和所述第二类置信度,确定所述待处理图片的动作类别。
  17. 如权利要求16所述的装置,其特征在于,所述处理器用于:
    对所述第一类置信度和所述第二类置信度进行加权求和,得到所述待处理图片在不同动作类别上的最终置信度;
    根据所述最终置信度确定所述待处理图片的动作类别。
  18. 如权利要求10-17中任一项所述的装置,其特征在于,在确定所述待处理图片的动作类别之后,所述处理器还用于将所述待处理图片的空间特征和虚拟光流特征,以及所述待处理图片的动作类别信息添加到所述特征库中。
PCT/CN2019/088694 2018-05-29 2019-05-28 动作识别方法和装置 WO2019228316A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP19810779.9A EP3757874B1 (en) 2018-05-29 2019-05-28 Action recognition method and apparatus
US17/034,654 US11392801B2 (en) 2018-05-29 2020-09-28 Action recognition method and apparatus
US17/846,533 US11704938B2 (en) 2018-05-29 2022-06-22 Action recognition method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810533284.9 2018-05-29
CN201810533284.9A CN109902547B (zh) 2018-05-29 2018-05-29 动作识别方法和装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/034,654 Continuation US11392801B2 (en) 2018-05-29 2020-09-28 Action recognition method and apparatus

Publications (1)

Publication Number Publication Date
WO2019228316A1 true WO2019228316A1 (zh) 2019-12-05

Family

ID=66943243

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/088694 WO2019228316A1 (zh) 2018-05-29 2019-05-28 动作识别方法和装置

Country Status (4)

Country Link
US (2) US11392801B2 (zh)
EP (1) EP3757874B1 (zh)
CN (1) CN109902547B (zh)
WO (1) WO2019228316A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766207A (zh) * 2021-01-28 2021-05-07 珠海格力电器股份有限公司 行为识别模型的构建方法、行为识别方法及智能家居

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11714961B2 (en) 2019-02-24 2023-08-01 Wrethink, Inc. Methods and apparatus for suggesting and/or associating tags corresponding to identified image content and/or storing said image content in association with tags to facilitate retrieval and use
US11741699B2 (en) * 2019-02-24 2023-08-29 Wrethink, Inc. Methods and apparatus for detecting features of scanned images, associating tags with images and/or using tagged images
US11748509B2 (en) 2019-02-24 2023-09-05 Wrethink, Inc. Methods and apparatus for automatically controlling access to stored data, a storage location of stored data, and/or ownership of stored data based on life event information
CN110597251B (zh) * 2019-09-03 2022-10-25 三星电子(中国)研发中心 用于控制智能移动设备的方法及装置
CN112668597B (zh) * 2019-10-15 2023-07-28 杭州海康威视数字技术股份有限公司 一种特征比对方法、装置及设备
KR102554848B1 (ko) * 2022-10-31 2023-07-12 주식회사 디퍼아이 행동인지 장치 및 그 방법

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164694A (zh) * 2013-02-20 2013-06-19 上海交通大学 一种人体动作识别的方法
CN105303571A (zh) * 2015-10-23 2016-02-03 苏州大学 用于视频处理的时空显著性检测方法
CN107967441A (zh) * 2017-09-19 2018-04-27 北京工业大学 一种基于双通道3d-2d rbm模型的视频行为识别方法

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100421740B1 (ko) * 2000-11-14 2004-03-10 삼성전자주식회사 객체 활동 모델링 방법
US8189866B1 (en) * 2008-08-26 2012-05-29 Adobe Systems Incorporated Human-action recognition in images and videos
US8345984B2 (en) 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
US8761437B2 (en) * 2011-02-18 2014-06-24 Microsoft Corporation Motion recognition
US8917907B2 (en) * 2011-02-28 2014-12-23 Seiko Epson Corporation Continuous linear dynamic systems
US9111375B2 (en) * 2012-01-05 2015-08-18 Philip Meier Evaluation of three-dimensional scenes using two-dimensional representations
US9436890B2 (en) * 2014-01-23 2016-09-06 Samsung Electronics Co., Ltd. Method of generating feature vector, generating histogram, and learning classifier for recognition of behavior
US10083233B2 (en) * 2014-09-09 2018-09-25 Microsoft Technology Licensing, Llc Video processing for motor task analysis
CN104408444A (zh) * 2014-12-15 2015-03-11 北京国双科技有限公司 人体动作识别方法和装置
CN104899554A (zh) 2015-05-07 2015-09-09 东北大学 一种基于单目视觉的车辆测距方法
CN105069413B (zh) 2015-07-27 2018-04-06 电子科技大学 一种基于深度卷积神经网络的人体姿势识别方法
CN105160310A (zh) * 2015-08-25 2015-12-16 西安电子科技大学 基于3d卷积神经网络的人体行为识别方法
CN105551182A (zh) 2015-11-26 2016-05-04 吉林大学 基于Kinect人体姿势识别的驾驶状态监测系统
CN106933340B (zh) * 2015-12-31 2024-04-26 北京体基科技有限公司 手势动作识别方法、控制方法和装置以及腕式设备
CN106570480B (zh) 2016-11-07 2019-04-19 南京邮电大学 一种基于姿势识别的人体动作分类方法
CN106846367B (zh) * 2017-02-15 2019-10-01 北京大学深圳研究生院 一种基于运动约束光流法的复杂动态场景的运动物体检测方法
CN107169415B (zh) * 2017-04-13 2019-10-11 西安电子科技大学 基于卷积神经网络特征编码的人体动作识别方法
CN107194419A (zh) * 2017-05-10 2017-09-22 百度在线网络技术(北京)有限公司 视频分类方法及装置、计算机设备与可读介质
JP6870114B2 (ja) * 2017-05-15 2021-05-12 ディープマインド テクノロジーズ リミテッド 3d時空畳み込みニューラルネットワークを使用した映像におけるアクション認識
CN107609460B (zh) * 2017-05-24 2021-02-02 南京邮电大学 一种融合时空双重网络流和attention机制的人体行为识别方法
CN107463949B (zh) * 2017-07-14 2020-02-21 北京协同创新研究院 一种视频动作分类的处理方法及装置
CN107644519A (zh) * 2017-10-09 2018-01-30 中电科新型智慧城市研究院有限公司 一种基于视频人体行为识别的智能报警方法和系统
CN107862376A (zh) * 2017-10-30 2018-03-30 中山大学 一种基于双流神经网络的人体图像动作识别方法

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164694A (zh) * 2013-02-20 2013-06-19 上海交通大学 一种人体动作识别的方法
CN105303571A (zh) * 2015-10-23 2016-02-03 苏州大学 用于视频处理的时空显著性检测方法
CN107967441A (zh) * 2017-09-19 2018-04-27 北京工业大学 一种基于双通道3d-2d rbm模型的视频行为识别方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QIAO, QINGWEI: "Human . Action Recognition via Dual Spatio-Temporal Network Flow and Attention Mechanism Fusion", MASTER'S THESES , vol. 40, no. 10, October 2018 (2018-10-01), pages 2395 - 2401, XP055734724 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766207A (zh) * 2021-01-28 2021-05-07 珠海格力电器股份有限公司 行为识别模型的构建方法、行为识别方法及智能家居

Also Published As

Publication number Publication date
CN109902547B (zh) 2020-04-28
EP3757874A1 (en) 2020-12-30
EP3757874B1 (en) 2023-10-25
EP3757874A4 (en) 2021-04-07
US20210012164A1 (en) 2021-01-14
CN109902547A (zh) 2019-06-18
US11704938B2 (en) 2023-07-18
US20220391645A1 (en) 2022-12-08
US11392801B2 (en) 2022-07-19

Similar Documents

Publication Publication Date Title
WO2019228316A1 (zh) 动作识别方法和装置
WO2021082743A1 (zh) 视频分类方法、装置及电子设备
US11398062B2 (en) Face synthesis
AU2019201787B2 (en) Compositing aware image search
Kao et al. Visual aesthetic quality assessment with a regression model
US8605957B2 (en) Face clustering device, face clustering method, and program
US8750602B2 (en) Method and system for personalized advertisement push based on user interest learning
US20170124400A1 (en) Automatic video summarization
US10783402B2 (en) Information processing apparatus, information processing method, and storage medium for generating teacher information
WO2018196718A1 (zh) 图像消歧方法、装置、存储介质和电子设备
CN111405360B (zh) 视频处理方法、装置、电子设备和存储介质
US20170103284A1 (en) Selecting a set of exemplar images for use in an automated image object recognition system
CN111400615B (zh) 一种资源推荐方法、装置、设备及存储介质
CN112200041B (zh) 视频动作识别方法、装置、存储介质与电子设备
WO2020077999A1 (zh) 视频摘要生成方法和装置、电子设备、计算机存储介质
Wang et al. Aspect-ratio-preserving multi-patch image aesthetics score prediction
CN112131944B (zh) 一种视频行为识别方法及系统
JP7085600B2 (ja) 画像間の類似度を利用した類似領域強調方法およびシステム
US20240193790A1 (en) Data processing method and apparatus, electronic device, storage medium, and program product
Kapadia et al. Improved CBIR system using Multilayer CNN
KR102215285B1 (ko) 키 프레임 선택 방법 및 이를 수행하는 장치들
CN108961314B (zh) 运动图像生成方法、装置、电子设备及计算机可读存储介质
CN111046232B (zh) 一种视频分类方法、装置及系统
CN110633387A (zh) 基于局部信息的图像检索方法
US20220207366A1 (en) Action-Actor Detection with Graph Neural Networks from Spatiotemporal Tracking Data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19810779

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019810779

Country of ref document: EP

Effective date: 20200923

NENP Non-entry into the national phase

Ref country code: DE