WO2021179898A1 - Action recognition method and apparatus, electronic device, and computer-readable storage medium - Google Patents
- Publication number
- WO2021179898A1 PCT/CN2021/077268 CN2021077268W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature information
- target object
- action
- key frame
- frame image
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Definitions
- The present disclosure relates to the fields of computer technology and image processing, and in particular to an action recognition method and apparatus, an electronic device, and a computer-readable storage medium.
- Action detection and recognition are widely used in robotics, security, healthcare, and other fields.
- In the related art, when performing action recognition, factors such as the limited data-processing capability of the recognition device and the single type of data used for recognition lead to low action-recognition accuracy.
- The present disclosure provides at least an action recognition method and apparatus, an electronic device, and a computer-readable storage medium.
- An action recognition method is provided, including:
- The object frame corresponding to the target object is used to determine the action feature information, instead of the entire frame image. This effectively reduces the amount of data used for action recognition in each frame image, so that the number of images used for action recognition can be increased, which is conducive to improving the accuracy of action recognition. In addition, this aspect not only uses the action feature information of the target object for action classification and recognition, but also uses the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time-sequence feature information associated with the target object's actions. Combining the scene feature information and the time-sequence feature information with the action feature information can further improve the accuracy of action recognition.
- the above-mentioned action recognition method further includes the step of determining an object frame in the key frame image:
- The initial object bounding box is expanded to obtain the object frame of the target object in the key frame image.
- The object detection method is used to determine the bounding box of the target object in the image, which reduces the amount of data to be processed for action recognition. After a smaller initial object bounding box is determined, it is expanded, so that the object frame used for action recognition includes more complete target-object information and more environmental information and retains more spatial detail, thereby helping to improve the accuracy of action recognition.
- the determining the action feature information of the target object based on the object frame in the key frame image of the video clip includes:
- According to the object frame corresponding to the key frame image, partial images are respectively cropped from at least some of the associated images corresponding to the key frame image, to obtain multiple target-object images corresponding to the key frame image;
- the action characteristic information of the target object is determined.
- The object frame in the key frame image is used to locate the target object, and the target-object images used to determine the action feature information are cropped from multiple associated images associated with the key frame image. This improves the accuracy of the images used to determine the action feature information and increases their number, so that the accuracy of action recognition can be improved.
- filtering out multiple associated images corresponding to key frame images from the video clip includes:
- A first sub-video segment that includes the key frame image is selected from the video clip; the first sub-video segment also includes N images that are temporally adjacent to the key frame image, where N is a positive integer;
- the multiple associated images are filtered from the first sub video segment.
- The images associated with the key frame image are filtered from a sub-video segment whose shooting time is close to that of the key frame image, so that the images most closely associated with the key frame image can be selected; using these most closely associated images can improve the accuracy of the determined action feature information.
- the method further includes:
- the target object image is set as an image with a preset image resolution.
- The target object image is set to a preset resolution, which increases the amount of information it contains; that is, the cropped target-object image is effectively magnified, which is conducive to obtaining finer-grained details of the target object and thereby improves the accuracy of the determined action feature information.
- the determining the scene characteristic information and time sequence characteristic information corresponding to the target object based on the video clip and the action characteristic information includes:
- the time sequence feature information corresponding to the target object is determined.
- In the embodiments of the present disclosure, by extracting scene features from the associated images associated with the key frame image, relatively complete scene feature information can be obtained, and the accuracy of action recognition can be improved on that basis. In addition, the embodiments of the present disclosure extract the time-sequence features of objects other than the target object, that is, the above-mentioned initial time-sequence feature information, and determine the time-sequence feature information associated with the target object based on the time-sequence features of those other objects and the action feature information of the target object, which can further improve the accuracy of action recognition.
- the performing a time-series feature extraction operation on objects other than the target object in the video clip to obtain initial time-series feature information includes:
- A second sub-video segment including the key frame image is selected from the video clip; the second sub-video segment further includes P images that are temporally adjacent to the key frame image, where P is a positive integer;
- Sub-video segments whose shooting times are close to that of the key frame image are selected from the video clip for time-sequence feature extraction, which reduces the amount of data involved and improves the relevance of the extracted time-sequence features to the key frame image, which is conducive to improving the accuracy of action recognition. In addition, in the embodiments of the present disclosure, the action features of other objects are used as time-sequence features, which improves the pertinence of the time-sequence features used in action recognition and is likewise conducive to improving its accuracy.
- the determining the time sequence characteristic information corresponding to the target object based on the initial time sequence characteristic information and the action characteristic information includes:
- The initial time-sequence feature information after the mean-pooling operation and the dimensionality-reduced action feature information are combined to obtain the time-sequence feature information corresponding to the target object.
- The initial time-sequence feature information and the action feature information are reduced in dimensionality, which decreases the amount of data to be processed and is beneficial to improving the efficiency of action recognition. In addition, the embodiments of the present disclosure perform a mean-pooling operation on the dimensionality-reduced initial time-sequence feature information, which simplifies the time-sequence feature extraction and can further improve the efficiency of action recognition.
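The dimensionality reduction, mean pooling, and combination described above can be sketched as follows. This is an illustrative sketch only: the patent does not specify the projections or the combination operator, so the random linear projections (standing in for learned layers) and the use of concatenation are assumptions.

```python
import numpy as np

def fuse_temporal_features(initial_seq_feats, action_feat, reduced_dim=512, seed=0):
    """Reduce the dimensionality of the per-object initial time-sequence
    features and of the target object's action feature, mean-pool the
    former over objects, then combine the two vectors."""
    rng = np.random.default_rng(seed)
    # Hypothetical learned projections (random here for illustration only).
    w_seq = rng.standard_normal((initial_seq_feats.shape[1], reduced_dim))
    w_act = rng.standard_normal((action_feat.shape[0], reduced_dim))
    reduced_seq = initial_seq_feats @ w_seq           # (num_objects, reduced_dim)
    pooled_seq = reduced_seq.mean(axis=0)             # mean pooling over objects
    reduced_act = action_feat @ w_act                 # (reduced_dim,)
    # One simple way to "combine" the two: concatenation.
    return np.concatenate([pooled_seq, reduced_act])  # (2 * reduced_dim,)
```

With five other objects carrying 16-d features and an 8-d action feature, the fused vector has length `2 * reduced_dim`.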
- the determining the time sequence characteristic information corresponding to the target object based on the initial time sequence characteristic information and the action characteristic information further includes:
- the time sequence feature extraction operation for determining the time sequence feature information corresponding to the target object is repeatedly executed, which can improve the accuracy of the determined time sequence feature information.
- an action recognition device including:
- a video acquisition module configured to acquire a video clip;
- an action feature determining module configured to determine the action feature information of the target object based on the object frame of the target object in the key frame image of the video clip;
- a scene timing feature determining module configured to determine the scene feature information and timing feature information corresponding to the target object based on the video clip and the action feature information;
- the action recognition module is configured to determine the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
- the action feature determining module is further configured to determine the object frame in the key frame image:
- the initial object bounding box is expanded to obtain the object bounding box of the target object in the key frame image.
- When the action feature determining module determines the action feature information of the target object based on the object frame in the key frame image of the video clip, it is configured to:
- According to the object frame corresponding to the key frame image, partial images are respectively cropped from at least some of the associated images corresponding to the key frame image, to obtain multiple target-object images corresponding to the key frame image;
- the action characteristic information of the target object is determined.
- When the action feature determining module filters out multiple associated images corresponding to the key frame image from the video clip, it is configured to:
- select a first sub-video segment that includes the key frame image from the video clip; the first sub-video segment also includes N images that are temporally adjacent to the key frame image, where N is a positive integer;
- the motion characteristic determination module is further configured to:
- the target object image is set as an image with a preset image resolution.
- When the scene and time-sequence feature determining module determines the scene feature information and time-sequence feature information corresponding to the target object based on the video clip and the action feature information, it is configured to:
- the time sequence feature information corresponding to the target object is determined.
- When the scene and time-sequence feature determining module performs a time-sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time-sequence feature information, it is configured to:
- a second sub video segment including a key frame image is selected from the video segments; the second sub video segment further includes P images that are temporally adjacent to the key frame image; wherein, P is a positive integer;
- When the scene and time-sequence feature determining module determines the time-sequence feature information corresponding to the target object based on the initial time-sequence feature information and the action feature information, it is configured to:
- the initial time series feature information after the mean pooling operation and the action feature information after the dimensionality reduction process are combined to obtain the time series feature information corresponding to the target object.
- When the scene and time-sequence feature determining module determines the time-sequence feature information corresponding to the target object based on the initial time-sequence feature information and the action feature information, it is further configured to:
- the present disclosure provides an electronic device including: a processor and a storage medium connected to each other.
- The storage medium stores machine-readable instructions executable by the processor, and the processor executes the machine-readable instructions to perform the steps of the above-mentioned action recognition method.
- The present disclosure also provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the above-mentioned action recognition method are performed.
- The present disclosure also provides a computer program including computer-readable code; when the code runs on an electronic device, a processor in the electronic device performs the steps of the above-mentioned action recognition method.
- The foregoing apparatus, electronic device, and computer-readable storage medium of the present disclosure contain technical features that are substantially the same as or similar to those of any aspect of the foregoing method or any embodiment of any aspect of the present disclosure. Therefore, for a description of their effects, please refer to the description of the effects of the above method, which will not be repeated here.
- FIG. 1A shows a flowchart of an action recognition method provided by an embodiment of the present disclosure
- FIG. 1B shows a schematic diagram of a network architecture provided by an embodiment of the present disclosure
- FIG. 2 shows a flowchart of determining the action feature information of a target object in another action recognition method provided by an embodiment of the present disclosure
- FIG. 3 shows a flowchart of determining scene feature information and time sequence feature information corresponding to the target object in yet another action recognition method provided by an embodiment of the present disclosure
- FIG. 4 shows a schematic structural diagram of a simplified timing feature extraction module in an embodiment of the present disclosure
- FIG. 5 shows a flowchart of still another method for action recognition provided by an embodiment of the present disclosure
- FIG. 6 shows a schematic structural diagram of an action recognition device provided by an embodiment of the present disclosure
- FIG. 7 shows a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
- the present disclosure provides an action recognition method and device, electronic equipment, and computer-readable storage medium.
- The present disclosure uses the object frame corresponding to the target object to determine the action feature information, instead of the entire frame image, which effectively reduces the amount of data used for action recognition in each frame image and thereby allows the number of images used for action recognition to be increased, which is conducive to improving the accuracy of action recognition. In addition, the present disclosure not only uses the action feature information of the target object for action classification and recognition, but also uses the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time-sequence feature information associated with the target object's actions. Combining the scene feature information and the time-sequence feature information with the action feature information can further improve the accuracy of action recognition.
- the embodiments of the present disclosure provide an action recognition method, which is applied to a hardware device such as a terminal device that performs action recognition, and the method may also be implemented by a processor executing a computer program.
- the action recognition method provided by the embodiment of the present disclosure includes the following steps:
- The video clip is a video used for action recognition and includes multiple images.
- The images include a target object whose action needs to be recognized, and the target object may be a human or an animal.
- The above-mentioned video clip may be shot by the terminal device that performs action recognition, using its own camera or other shooting equipment; alternatively, it may be shot by another shooting device and then transferred to the terminal device for action recognition.
- S120: Determine the action feature information of the target object based on the object frame of the target object in the key frame image of the video clip.
- the object frame is the bounding box surrounding the target object.
- When the image information within the bounding box is used to determine the action feature information of the target object, the amount of data processed by the terminal device can be reduced.
- the present disclosure does not limit the method of filtering key frame images from video clips.
- The object frame in each key frame image can be used to determine the action feature information of the target object.
- Object detection methods can be used; for example, a human body detector may determine the object frame using a human body detection method.
- Other methods can also be used to determine the object frame; the present disclosure does not limit the method for determining the object frame.
- the object frame detected by the human body detector may be used as the final object frame used to determine the action feature information.
- Since the object frame detected by the human body detector may be a smaller frame that tightly bounds the target object, in order to obtain more complete target-object information and more environmental information, after the human body detector detects the object frames, each detected object frame may be expanded according to preset expansion size information to obtain the final object frame of the target object in each key frame image. The determined final object frame is then used to determine the action feature information of the target object.
- the expansion size information for expanding the object frame is preset.
- the expansion size information includes a first extension length of the object frame in the length direction and a second extension length of the object frame in the width direction.
- the length of the object frame is extended to both sides according to the first extension length, and the two sides in the length direction are respectively extended by half of the first extension length.
- the width of the object frame is extended to both sides according to the second extension length, and both sides in the width direction are respectively extended by half of the second extension length.
- the first extension length and the second extension length may be preset specific values, or may be determined based on the length and width of the object frame directly detected by the human body detector.
- the first extension length may be equal to the length of the object frame directly detected by the human body detector
- the second extension length may be equal to the width of the object frame directly detected by the human body detector.
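The expansion described above can be sketched as a small helper. The symmetric half-extension on each side follows the description; clamping to the image bounds is an added assumption not stated in the text.

```python
def expand_object_frame(box, frame_w, frame_h, ext_len=None, ext_wid=None):
    """Expand a detected box (x1, y1, x2, y2): half of the first extension
    length on each side along the length (horizontal) axis, half of the
    second extension length on each side along the width (vertical) axis.
    When no extension is given, use the box's own length/width, matching
    the example above where the extensions equal the detected dimensions."""
    x1, y1, x2, y2 = box
    length, width = x2 - x1, y2 - y1
    ext_len = length if ext_len is None else ext_len
    ext_wid = width if ext_wid is None else ext_wid
    x1 = max(0, x1 - ext_len / 2)           # clamp to image bounds (assumption)
    x2 = min(frame_w, x2 + ext_len / 2)
    y1 = max(0, y1 - ext_wid / 2)
    y2 = min(frame_h, y2 + ext_wid / 2)
    return (x1, y1, x2, y2)
```

For example, a 100×50 box at (100, 100, 200, 150) in a 640×480 frame expands to (50.0, 75.0, 250.0, 175.0), doubling each dimension around the original centre.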
- In the embodiments of the present disclosure, the object detection method is used to determine the bounding box of the target object in the image, which reduces the amount of data to be processed for action recognition. After a smaller initial object bounding box is determined, it is expanded, so that the object frame used for action recognition includes more complete target-object information and more environmental information, thereby helping to improve the accuracy of action recognition.
- the aforementioned action feature information is extracted from the image in the video clip, and can characterize the action feature of the target object.
- the scene feature information is used to characterize the scene feature of the scene in which the target object is located, and may be obtained by performing scene feature extraction from at least part of the associated images associated with the key frame image.
- The time-sequence feature information is feature information related to the target object's action in time sequence; for example, it may be the action feature information of objects in the video clip other than the target object. In a specific implementation, it may be determined based on the video clip and the action feature information of the target object.
- S140: Determine the action type of the target object based on the action feature information, the scene feature information, and the time-sequence feature information.
- In a specific implementation, the above three types of information can be combined, for example by concatenation; the combined information is then classified to obtain the action type of the target object, thereby realizing action recognition of the target object.
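The combine-then-classify step can be sketched as follows. The linear classifier and softmax are illustrative stand-ins; the patent specifies only that the concatenated features are classified, not how.

```python
import numpy as np

def classify_action(action_feat, scene_feat, seq_feat, class_weights):
    """Concatenate the three feature vectors and apply a linear classifier
    (standing in for a learned fully connected layer) followed by softmax."""
    fused = np.concatenate([action_feat, scene_feat, seq_feat])
    logits = class_weights @ fused               # (num_classes,)
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    probs = exp / exp.sum()
    return int(np.argmax(probs)), probs          # predicted action type, scores
```

`class_weights` has shape `(num_classes, len(action_feat) + len(scene_feat) + len(seq_feat))`; the returned index is the recognized action type.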
- The object frame corresponding to the target object is used to determine the action feature information, instead of the entire frame image, which effectively reduces the amount of data used for action recognition in each frame image and thereby allows the number of images used for action recognition to be increased, which is conducive to improving the accuracy of action recognition. In addition, the embodiments of the present disclosure not only use the action feature information of the target object for action classification and recognition, but also use the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time-sequence feature information associated with the target object's actions. Combining the scene feature information and time-sequence feature information with the action feature information can further improve the accuracy of action recognition.
- FIG. 1B shows a schematic diagram of a network architecture provided by an embodiment of the present disclosure. The network architecture includes: a user terminal 201, a network 202, and a terminal device 203 for action recognition.
- the user terminal 201 and the action recognition terminal device 203 establish a communication connection through the network 202.
- The request information for determining the action type is sent through the network 202 to the terminal device 203 for action recognition. The terminal device 203 obtains the video clip and uses the object frame corresponding to the target object to determine the action feature information; it then uses the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time-sequence feature information associated with the target object's actions. Finally, the scene feature information and the time-sequence feature information are considered together with the action feature information to determine the target object's action type with higher accuracy, and the determined action type is fed back to the user terminal 201.
- the user terminal 201 may include a device with data processing capabilities
- the motion recognition terminal device 203 may include an image acquisition device, and a processing device with data processing capabilities or a remote server.
- the network 202 may adopt a wired connection or a wireless connection.
- When the terminal device for action recognition is a processing device with data processing capabilities, the user terminal 201 can communicate with the processing device through a wired connection, such as data communication via a bus; when the terminal device 203 for action recognition is a remote server, the user terminal can interact with the remote server through a wireless network.
- The foregoing determination of the action feature information of the target object based on the object frame of the target object in the key frame image of the video clip can be implemented using the following steps:
- An associated image of a key frame image is an image whose features are similar to those of the key frame image; for example, it may be an image whose shooting time is close to that of the key frame image.
- the following sub-steps can be used to filter the associated images corresponding to the key frame images:
- Sub-step 1: Select a first sub-video segment that includes the key frame image from the video clip; the first sub-video segment also includes N images that are temporally adjacent to the key frame image, where N is a positive integer.
- The key frame image may be located in the first half of the first sub-video segment, in the second half, or in the middle or near the middle of the first sub-video segment.
- In a specific implementation, a sub-video segment including the key frame image can be cut from the video clip; for example, a 64-frame sub-video segment can be selected.
- For example, the key frame image may be in the middle of the sub-video segment or close to the middle: the sub-video segment then includes the 32 frames before the key frame image, the key frame image itself, and the 31 frames after it. As another example, the key frame image may be in the first half of the sub-video segment; or it may be in the second half, with the sub-video segment including the 50 frames before the key frame image, the key frame image, and the 13 frames after it.
- Of course, the key frame image may also be located at either end of the first sub-video segment; that is, the above-mentioned N images temporally adjacent to the key frame image are the N images preceding the key frame image or the N images following it.
- the present disclosure does not limit the position of the key frame image in the first sub-video segment.
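The window selection described above can be sketched as follows. Centring the key frame when possible (32 frames before, 31 after, as in the example) and shifting the window at clip borders are the choices made here; the disclosure itself does not fix the key frame's position.

```python
def select_sub_segment(num_frames, key_idx, window=64):
    """Return `window` consecutive frame indices containing the key frame.
    The key frame is centred when possible and the window is shifted to
    stay inside [0, num_frames) near the clip borders."""
    start = key_idx - window // 2
    start = max(0, min(start, num_frames - window))
    return list(range(start, start + window))
```

For a 200-frame clip with the key frame at index 100, this yields frames 68-131: 32 frames before the key frame and 31 after it.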
- Sub-step 2: Filter the multiple associated images from the first sub-video segment.
- In a specific implementation, the associated images may be filtered from the first sub-video segment based on a preset time interval; for example, T associated frames can be obtained by sparsely sampling the first sub-video segment with a time span τ.
- The associated images obtained by this sampling may or may not include the key frame image; there is a certain randomness, and the present disclosure does not limit whether the associated images include the key frame image.
- Alternatively, the image similarity between each frame in the first sub-video segment and the key frame image may first be calculated, and then the images with the highest similarity are selected as the associated images of the key frame image.
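The interval-based sampling above can be sketched in a few lines. Taking every τ-th index of the sub-video segment is one straightforward reading of "sparsely sampling with a time span τ"; other sampling schedules would also fit the description.

```python
def sparse_sample(segment_indices, t_frames, tau):
    """Sparsely sample up to T frame indices from the first sub-video
    segment with time span tau. Whether the key frame happens to land in
    the sample is incidental, as noted above."""
    return segment_indices[::tau][:t_frames]
```

Sampling a 64-frame segment with T = 8 and τ = 8 yields indices 0, 8, 16, …, 56.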
- S220: According to the object frame corresponding to the key frame image, respectively crop partial images from at least some of the associated images corresponding to the key frame image, to obtain multiple target-object images corresponding to the key frame image.
- In a specific implementation, the object frame corresponding to the key frame image is used to crop partial images from some or all of the associated images. If target-object images are cropped from only part of the associated images, the associated images whose shooting times are closest to that of the key frame image may be selected from all the associated images; of course, other selection methods can also be used, for example selecting some associated images at a certain time interval.
- The specific process includes: first replicating the object frame onto all or some of the associated images in chronological order.
- The coordinate information of the object frame on the key frame image is used to replicate the frame on the associated images; for example, according to this coordinate information, the frame position can be offset over time or copied directly to obtain the object frame on each associated image.
- Then, each associated image is cropped according to the object frame to obtain a target-object image; that is, the image within the object frame of the associated image is cut out as the target-object image.
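The copy-and-crop step above can be sketched as follows, using the simplest variant mentioned in the text: the key frame's box coordinates are copied directly onto each associated frame (no per-frame offset).

```python
import numpy as np

def crop_target_images(frames, box):
    """Replicate the key frame's object frame (x1, y1, x2, y2) onto each
    associated frame and cut out the region inside it, yielding one
    target-object image per associated frame."""
    x1, y1, x2, y2 = (int(v) for v in box)
    return [frame[y1:y2, x1:x2] for frame in frames]
```

Each frame is assumed to be an H×W×3 array; a tracking-based variant would instead offset the box per frame, as the text also allows.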
- The function of the key frame image here is to locate the target object; it is not necessarily used to directly determine the action feature information. For example, when the associated images do not include the key frame image, no target-object image used to determine the action feature information is cropped from the key frame image.
- S230: Determine the action feature information of the target object based on the multiple target-object images corresponding to the key frame image.
- In a specific implementation, action feature extraction can be performed on the multiple target-object images; for example, a three-dimensional (3D) convolutional neural network can be used to process the target-object images and extract the action features in them, thereby obtaining the action feature information of the target object.
- the following steps may be used to process the target object image:
- the target object image is set as an image with a preset image resolution.
- the aforementioned preset image resolution is higher than the original image resolution of the target object image.
- existing methods or tools can be used to set the image resolution of the target object image, for example, interpolation and other methods can be used to adjust the image resolution of the target object image.
- setting the target object image to the preset resolution can increase the amount of information it contains; that is, the intercepted target object image is enlarged so that finer-grained details of the target object are visible, which can improve the accuracy of the determined action feature information.
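The upscaling step can be sketched with plain nearest-neighbour interpolation; this is a minimal assumption-laden example (the function name and shapes are illustrative), and a real system would more likely use bilinear or bicubic resampling from an image library as the text suggests.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Enlarge a cropped target-object image to a preset (higher)
    resolution using nearest-neighbour interpolation."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return img[rows][:, cols]

crop = np.arange(8 * 8 * 3, dtype=np.uint8).reshape(8, 8, 3)
enlarged = resize_nearest(crop, 64, 64)   # preset resolution H x W = 64 x 64
print(enlarged.shape)
```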
- the above-mentioned preset image resolution can be set to H×W, the number of target object images intercepted for each key frame image is T, and the number of channels of each target object image is 3; the image block input to the 3D convolutional neural network for action feature extraction is then T×H×W×3.
- a 2048-dimensional feature vector can be obtained, which is the aforementioned action feature information.
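The shape flow from the T×H×W×3 block to the 2048-dimensional vector can be illustrated with global average pooling over the network's final feature map. The 2048-channel feature map below is a stand-in for the 3D CNN's last stage, not the network itself:

```python
import numpy as np

T, H, W, C = 4, 8, 8, 2048  # illustrative final-stage feature map shape
features = np.random.rand(T, H, W, C).astype(np.float32)

# Global average pooling over time and space leaves one value per channel,
# yielding the 2048-dimensional action feature vector described above.
action_feature = features.mean(axis=(0, 1, 2))
print(action_feature.shape)
```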
- the object frame in the key frame image is used for positioning, and the target object images used to determine the action feature information are intercepted from multiple associated images of the key frame image. This improves the accuracy of the images used to determine the action feature information and increases their number, so the accuracy of action recognition can be improved.
- the foregoing determination of scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information includes:
- For the key frame image, filter multiple associated images corresponding to the key frame image from the video clip, and perform a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information.
- a 3D convolutional neural network can be used to perform video scene feature extraction and global average pooling on part or all of the associated images to obtain a 2048-dimensional feature vector, which is the aforementioned scene feature information.
- S320 Perform a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information.
- the initial time sequence feature information is the time sequence feature of objects other than the target object, such as the action features of those other objects, and can be determined through the following sub-steps:
- Sub-step 1 For the key frame image, select a second sub-video segment that includes the key frame image from the video clip; the second sub-video segment also includes P images that are temporally adjacent to the key frame image, where P is a positive integer.
- the key frame image may be located in the first half of the second sub-video segment, in the second half, or in or near the middle of the second sub-video segment.
- the key frame image may also be located at either end of the second sub-video segment; that is, the P images temporally adjacent to the key frame image are the previous P images or the next P images of the key frame image.
- the present disclosure does not limit the position of the key frame image in the second sub-video segment.
- a sub-video segment including a key frame image can be intercepted from a video segment.
- for example, a 2-second sub-video segment can be intercepted; this sub-video spans a longer time and is used to determine long-term time sequence features.
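Selecting the second sub-video segment can be sketched as picking a window of P+1 frame indices that contains the key frame, clamped to the clip boundaries. The function name and the centre-then-clamp policy are assumptions for illustration; as noted above, the disclosure does not fix the key frame's position in the window.

```python
def select_sub_segment(num_frames, key_idx, p):
    """Pick a sub-video segment containing the key frame plus P temporal
    neighbours. Here the key frame is centred when possible and the window
    is clamped (and re-sized) at the clip boundaries."""
    start = max(0, key_idx - p // 2)
    end = min(num_frames, start + p + 1)
    start = max(0, end - (p + 1))  # re-clamp so the window keeps its size
    return list(range(start, end))

print(select_sub_segment(100, 50, 8))  # key frame roughly centred
print(select_sub_segment(100, 2, 8))   # clamped at the start of the clip
```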
- Sub-step 2 Extract the action features of objects other than the target object in each image of the second sub-video segment, and use the obtained action features as the initial time sequence feature information.
- the 3D convolutional neural network can be used to extract the action features of objects other than the target object in the sub-video segment, and the obtained initial timing feature information can be stored and used in the form of a long-term feature bank (LFB).
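A feature bank of this kind can be sketched as a timestamp-indexed store that later processing reads back over a time window. This is only a minimal illustration of the storage idea; the class and method names are invented here, not taken from the LFB work or the disclosure.

```python
import numpy as np

class LongTermFeatureBank:
    """Minimal sketch of an LFB-style store: features of non-target objects
    are kept per timestamp so later windows can read them back."""
    def __init__(self):
        self._bank = {}

    def write(self, timestamp, feats):
        self._bank.setdefault(timestamp, []).append(np.asarray(feats))

    def read(self, t0, t1):
        """Return every stored feature whose timestamp falls in [t0, t1]."""
        return [f for t, fs in sorted(self._bank.items())
                if t0 <= t <= t1 for f in fs]

bank = LongTermFeatureBank()
bank.write(0, np.ones(4))
bank.write(5, np.zeros(4))
bank.write(12, np.ones(4))
print(len(bank.read(0, 10)))
```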
- video timing feature bank long-term Feature Bank, LFB
- sub-video segments close to the shooting time of the key frame image are selected from the video clip for timing feature extraction. This reduces the amount of data in the extracted timing features and improves their relevance to the key frame image, which is beneficial to the accuracy of action recognition. In addition, using the action features of other objects as the time-series features improves the pertinence of the time-series features used in action recognition, which is likewise conducive to its accuracy.
- time sequence feature extraction is performed on the initial time sequence feature information and the action feature information to obtain the time sequence feature information corresponding to the target object.
- the following sub-steps can be used to perform time-series feature extraction on the initial time-series feature information and action feature information to obtain the time-series feature information corresponding to the target object:
- Sub-step 1 Perform dimensionality reduction processing on the initial time series feature information and the action feature information respectively.
- reducing the dimensionality of the initial time series feature information and the action feature information reduces the amount of data that needs to be processed and helps improve the efficiency of action recognition.
- random deactivation (Dropout) processing can also be performed on the initial time sequence feature information and the action feature information.
- the Dropout processing can be implemented at the final network layer of the neural network used to extract the initial time sequence feature information and the action feature information, or at each network layer of that neural network.
- Sub-step 2 Perform the mean pooling operation on the dimensionality-reduced initial time series feature information.
- Sub-step 3 The initial time series feature information after the mean pooling operation and the action feature information after the dimensionality reduction process are combined to obtain the time series feature information corresponding to the target object.
- the above merging operation can specifically be channel splicing, that is, the channels of one feature information are appended to the channels of the other to realize the merging; the merging operation can also be an addition operation, that is, the initial timing feature information after the mean pooling operation and the action feature information after the dimensionality reduction processing are added element-wise.
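The two merging options just described can be shown in a few lines; the 512-dimensional size and the constant-valued vectors are illustrative assumptions only:

```python
import numpy as np

pooled = np.ones(512, dtype=np.float32)       # initial timing feature after mean pooling
action = np.full(512, 2.0, dtype=np.float32)  # action feature after dimensionality reduction

# Channel splicing: append one feature's channels after the other's.
merged_concat = np.concatenate([pooled, action])
# Addition: element-wise sum, which requires matching dimensions.
merged_add = pooled + action

print(merged_concat.shape, merged_add.shape)
```

Splicing doubles the channel count while addition preserves it, which is why addition requires the two features to have already been reduced to the same dimension.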
- Sub-steps 2 and 3 essentially perform a time sequence feature extraction operation on the initial time sequence feature information and the action feature information, and can be specifically implemented by using the simplified time sequence feature extraction module shown in FIG. 4.
- the simplified time-series feature extraction module 400 shown in FIG. 4 is used to extract the above-mentioned time-series feature information, and may specifically include a linear layer 401, an average pooling (Average) layer 402, a normalization and activation function (LN+ReLU) layer 403, and a dropout layer 404.
- the time series feature extraction operation is simplified: only the average pooling (Average) layer is used to perform the average pooling operation on the dimensionality-reduced initial time series feature information, and no normalization (softmax) operation is performed. This simplifies the operation steps of time sequence feature extraction, that is, simplifies the existing time sequence feature extraction module, and can improve the efficiency of action recognition.
- in contrast, the existing time series feature extraction module does not include an average pooling layer but includes a classification normalization (softmax) layer, and the processing complexity of the softmax layer is higher than that of the average pooling operation.
- the existing time sequence feature extraction module further includes a linear layer before the random inactivation (dropout) layer; the simplified time sequence feature extraction module of the present disclosure does not include that linear layer, so the efficiency of action recognition can be further improved.
- the time series feature information output by the time series feature extraction module may be a 512-dimensional feature vector, and the 512-dimensional feature vector is the time series feature information of the aforementioned target object.
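The layer ordering of the simplified module (linear projection, average pooling instead of softmax, LN+ReLU, optional dropout, then merging with the action feature) can be sketched in NumPy. All weights, shapes, and the additive merge below are toy assumptions for illustration, not the disclosure's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm_relu(x, eps=1e-5):
    # LN+ReLU layer: normalize the vector, then apply ReLU.
    x = (x - x.mean()) / np.sqrt(x.var() + eps)
    return np.maximum(x, 0.0)

def simplified_temporal_module(bank_feats, action_feat, w_bank, w_act, drop=0.0):
    """Sketch of the simplified module 400: linear projection
    (dimensionality reduction), mean pooling over the bank features,
    LN + ReLU, optional dropout, and merging with the action feature."""
    reduced_bank = bank_feats @ w_bank   # (T, 512) after the linear layer
    reduced_act = action_feat @ w_act    # (512,)
    pooled = reduced_bank.mean(axis=0)   # average pooling replaces softmax
    fused = layer_norm_relu(pooled + reduced_act)
    if drop > 0.0:                       # random deactivation (Dropout)
        fused = fused * (rng.random(fused.shape) >= drop) / (1.0 - drop)
    return fused

bank_feats = rng.standard_normal((10, 2048))   # initial timing features (LFB entries)
action_feat = rng.standard_normal(2048)        # action feature information
w = rng.standard_normal((2048, 512)) * 0.01    # toy 2048 -> 512 projection
out = simplified_temporal_module(bank_feats, action_feat, w, w)
print(out.shape)
```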
- extracting scene features from part or all of the associated images associated with the key frame images can obtain relatively complete scene feature information, and based on relatively complete scene feature information, the accuracy of action recognition can be improved.
- based on the time series characteristics of objects other than the target object (that is, the above-mentioned initial time series feature information) and the action feature information, the time sequence feature information associated with the target object is determined; using the time sequence feature information associated with the target object can further improve the accuracy of action recognition.
- multiple time series feature extraction modules can be connected in series to extract the above time series feature information, with the time series feature information extracted by one module used as the input of the next.
- specifically, the time sequence feature information corresponding to the target object extracted by the previous time sequence feature extraction module is used as the new initial time sequence feature information, and the process returns to the above step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
- three simplified timing feature extraction modules can be connected in series to determine the final timing feature information.
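The series connection above (each round's output becomes the next round's initial features) can be sketched generically; the module here is a toy stand-in, not the simplified module itself:

```python
def stack_temporal_modules(module, init_feats, action_feat, depth=3):
    """Run a (hypothetical) temporal-feature module several times in
    series: each round's output is fed back as the next round's initial
    time sequence features, as in the three-module chain described above."""
    feats = init_feats
    for _ in range(depth):
        feats = module(feats, action_feat)
    return feats

# Toy module: averages the two inputs, so the call chain is easy to trace.
toy = lambda f, a: (f + a) / 2.0
print(stack_temporal_modules(toy, 0.0, 8.0))  # 0 -> 4 -> 6 -> 7 over three rounds
```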
- the embodiment of the present disclosure uses a person as a target object for action recognition.
- the action recognition method of the embodiment of the present disclosure may include:
- Step 1 Obtain a video clip 501, and filter key frame images from the above video clip;
- Step 2 Use the human body detector 502 to locate the person in each key frame image, obtaining the initial object bounding box of the person, that is, of the target object;
- Step 3 Expand the initial object bounding box according to the preset expansion size information to obtain the final object frame; then use the object frame to intercept partial images from the associated images of the key frame image, obtaining the target object images corresponding to each key frame image;
- Step 4 Input the target object images corresponding to all the obtained key frame images into the 3D convolutional neural network 503, and use the 3D convolutional neural network 503 to extract the action features of the target object to obtain the action feature information corresponding to the target object.
- Step 5 Input the associated image associated with the key frame image into the aforementioned 3D convolutional neural network 503, and use the 3D convolutional neural network 503 to extract the video scene features of the scene where the target object is located, to obtain scene feature information.
- Step 6 Use another 3D convolutional neural network 503 to perform time-series feature extraction on the video clip, that is, extract the action features of objects other than the target object to obtain initial time-series feature information.
- the above-mentioned initial time-series feature information can exist in the form of a time-series feature bank; when the timing features are extracted, they can be extracted from the entire video clip, or from a longer sub-video segment of the video clip that includes the key frame image.
- Step 7 Using the simplified time sequence feature extraction module 504, perform a time sequence feature extraction operation on the initial time sequence feature information and the action feature information to obtain the time sequence feature information corresponding to the target object.
- Step 8 Perform splicing processing on the time series feature information, action feature information, and scene feature information, and use the action classifier 505 to classify the spliced information to obtain the action type of the target object.
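Steps 7-8 end with splicing the three feature vectors and classifying them. A minimal sketch of that final stage follows; the 512/2048/2048 dimensions come from the text above, while the weight matrix and the assumption of 10 action types are illustrative placeholders for classifier 505:

```python
import numpy as np

def classify_action(timing, action, scene, weights):
    """Splice the timing, action, and scene feature vectors and apply a
    linear classifier followed by softmax to pick an action type."""
    spliced = np.concatenate([timing, action, scene])  # 512 + 2048 + 2048
    logits = spliced @ weights
    probs = np.exp(logits - logits.max())              # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(1)
timing, action, scene = rng.random(512), rng.random(2048), rng.random(2048)
weights = rng.standard_normal((512 + 2048 + 2048, 10)) * 0.01  # 10 action types assumed
label, probs = classify_action(timing, action, scene, weights)
print(probs.shape)
```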
- the present disclosure also provides an action recognition device, which is applied to hardware devices such as terminal equipment that perform action recognition on a target object. Each module can implement the same method steps as in the above-mentioned method and achieve the same beneficial effects, so the repeated parts will not be described again in this disclosure.
- an action recognition device provided by the present disclosure may include:
- the video acquisition module 610 is configured to acquire video clips.
- the action feature determining module 620 is configured to determine the action feature information of the target object based on the object frame in the key frame image of the target object in the video clip.
- the scene timing feature determining module 630 is configured to determine the scene feature information and timing feature information corresponding to the target object based on the video clip and the action feature information.
- the action recognition module 640 is configured to determine the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
- the action feature determining module 620 is further configured to determine the object frame in the key frame image:
- the initial object bounding box is expanded to obtain the object bounding box of the target object in the key frame image.
- the action feature determination module 620 is configured to determine the action feature information of the target object based on the object frame in the key frame image of the target object in the video clip:
- according to the object frame corresponding to the key frame image, respectively intercept partial images from at least some of the associated images corresponding to the key frame image to obtain multiple target object images corresponding to the key frame image;
- the action characteristic information of the target object is determined.
- when the action feature determining module 620 filters out multiple associated images corresponding to the key frame image from the video clip, it is configured to:
- select a first sub-video segment that includes the key frame image from the video clip; the first sub-video segment also includes N images that are temporally adjacent to the key frame image, where N is a positive integer;
- the motion characteristic determination module 620 is further configured to:
- the target object image is set as an image with a preset image resolution.
- when the scene timing feature determination module 630 determines the scene feature information and timing feature information corresponding to the target object based on the video clip and the action feature information, it is configured to:
- the time sequence feature information corresponding to the target object is determined.
- when the scene timing feature determination module 630 performs a timing feature extraction operation on objects other than the target object in the video clip to obtain initial timing feature information, it is configured to:
- a second sub video segment including a key frame image is selected from the video segments; the second sub video segment further includes P images that are temporally adjacent to the key frame image; wherein, P is a positive integer;
- when the scene timing feature determining module 630 determines the timing feature information corresponding to the target object based on the initial timing feature information and the action feature information, it is configured to:
- the initial time series feature information after the mean pooling operation and the action feature information after the dimensionality reduction process are combined to obtain the time series feature information corresponding to the target object.
- when the scene timing feature determining module 630 determines the timing feature information corresponding to the target object based on the initial timing feature information and the action feature information, it is further configured to:
- the embodiment of the present disclosure discloses an electronic device, as shown in FIG. 7, comprising: a processor 701 and a storage medium 702 connected to each other, and the storage medium stores machine-readable instructions executable by the processor.
- the processor executes the machine-readable instructions to execute the steps of the above-mentioned action recognition method.
- the processor 701 and the storage medium 702 may be connected through a bus 703.
- the embodiments of the present disclosure also provide a computer program product corresponding to the above method and device, including a computer-readable storage medium storing program code. The instructions included in the program code can be used to execute the method in the foregoing method embodiment; for the specific implementation, please refer to the method embodiment, which will not be repeated here.
- the computer-readable storage medium may be a volatile or non-volatile storage medium.
- the modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present disclosure.
- the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile computer-readable storage medium executable by a processor.
- the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product can be stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
- the aforementioned storage media include: USB flash drives, removable hard disks, ROM, RAM, magnetic disks, optical disks, and other media that can store program code.
- the present disclosure provides an action recognition method and device, an electronic device, and a computer-readable storage medium. The method includes: obtaining a video clip; determining the action feature information of a target object based on the object frame of the target object in a key frame image of the video clip; determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information; and determining the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
Claims (21)
- An action recognition method, comprising: obtaining a video clip; determining action feature information of a target object based on an object frame of the target object in a key frame image of the video clip; determining scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information; and determining an action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
- The action recognition method according to claim 1, further comprising the step of determining the object frame in the key frame image: screening key frame images from the video clip; performing object detection on the screened key frame image to determine an initial object bounding box of the target object in the key frame image; and expanding the initial object bounding box according to preset expansion size information to obtain the object frame of the target object in the key frame image.
- The action recognition method according to claim 1 or 2, wherein determining the action feature information of the target object based on the object frame in the key frame image of the target object in the video clip comprises: for the key frame image, filtering out multiple associated images corresponding to the key frame image from the video clip; according to the object frame corresponding to the key frame image, respectively intercepting partial images from at least some of the associated images corresponding to the key frame image to obtain multiple target object images corresponding to the key frame image; and determining the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
- The action recognition method according to claim 3, wherein filtering out multiple associated images corresponding to the key frame image from the video clip comprises: selecting, from the video clip, a first sub-video segment that includes the key frame image, the first sub-video segment further including N images temporally adjacent to the key frame image, where N is a positive integer; and filtering the multiple associated images from the first sub-video segment.
- The action recognition method according to claim 3, wherein after obtaining the multiple target object images and before determining the action feature information of the target object, the method further comprises: setting the target object image as an image with a preset image resolution.
- The action recognition method according to any one of claims 1 to 5, wherein determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information comprises: for the key frame image, filtering out multiple associated images corresponding to the key frame image from the video clip; performing a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information; performing a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information; and determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
- The action recognition method according to claim 6, wherein performing the time sequence feature extraction operation on objects other than the target object in the video clip to obtain the initial time sequence feature information comprises: for the key frame image, selecting, from the video clip, a second sub-video segment that includes the key frame image, the second sub-video segment further including P images temporally adjacent to the key frame image, where P is a positive integer; and extracting action features of objects other than the target object from the images in the second sub-video segment, and using the obtained action features as the initial time sequence feature information.
- The action recognition method according to claim 6 or 7, wherein determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information comprises: performing dimensionality reduction processing on the initial time sequence feature information and the action feature information respectively; performing a mean pooling operation on the initial time sequence feature information after the dimensionality reduction processing; and merging the initial time sequence feature information after the mean pooling operation with the action feature information after the dimensionality reduction processing to obtain the time sequence feature information corresponding to the target object.
- The action recognition method according to claim 8, wherein determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information further comprises: using the obtained time sequence feature information corresponding to the target object as new initial time sequence feature information, and returning to the step of performing dimensionality reduction processing on the initial time sequence feature information and the action feature information respectively.
- An action recognition apparatus, comprising: a video acquisition module, configured to acquire a video clip; an action feature determining module, configured to determine action feature information of a target object based on an object frame of the target object in a key frame image of the video clip; a scene timing feature determining module, configured to determine scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information; and an action recognition module, configured to determine an action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
- The action recognition apparatus according to claim 10, wherein the action feature determining module is further configured to determine the object frame in the key frame image by: screening key frame images from the video clip; performing object detection on the screened key frame image to determine an initial object bounding box of the target object in the key frame image; and expanding the initial object bounding box according to preset expansion size information to obtain the object frame of the target object in the key frame image.
- The action recognition apparatus according to claim 10 or 11, wherein, when determining the action feature information of the target object based on the object frame in the key frame image of the target object in the video clip, the action feature determining module is configured to: for the key frame image, filter out multiple associated images corresponding to the key frame image from the video clip; according to the object frame corresponding to the key frame image, respectively intercept partial images from at least some of the associated images corresponding to the key frame image to obtain multiple target object images corresponding to the key frame image; and determine the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
- The action recognition apparatus according to claim 12, wherein, when filtering out multiple associated images corresponding to the key frame image from the video clip, the action feature determining module is configured to: select, from the video clip, a first sub-video segment that includes the key frame image, the first sub-video segment further including N images temporally adjacent to the key frame image, where N is a positive integer; and filter the multiple associated images from the first sub-video segment.
- The action recognition apparatus according to claim 12, wherein, after obtaining the multiple target object images and before determining the action feature information of the target object, the action feature determining module is further configured to: set the target object image as an image with a preset image resolution.
- The action recognition device according to any one of claims 10 to 14, wherein when determining the scene feature information and the temporal feature information corresponding to the target object based on the video clip and the action feature information, the scene and temporal feature determination module is configured to: for the key frame image, filter out, from the video clip, a plurality of associated images corresponding to the key frame image; perform a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information; perform a temporal feature extraction operation on objects other than the target object in the video clip to obtain initial temporal feature information; and determine the temporal feature information corresponding to the target object based on the initial temporal feature information and the action feature information.
- The action recognition device according to claim 15, wherein when performing the temporal feature extraction operation on objects other than the target object in the video clip to obtain the initial temporal feature information, the scene and temporal feature determination module is configured to: for the key frame image, select, from the video clip, a second sub-video clip including the key frame image, the second sub-video clip further including P images temporally adjacent to the key frame image, where P is a positive integer; and extract action features of objects other than the target object from the images in the second sub-video clip, and take the obtained action features as the initial temporal feature information.
- The action recognition device according to claim 15 or 16, wherein when determining the temporal feature information corresponding to the target object based on the initial temporal feature information and the action feature information, the scene and temporal feature determination module is configured to: perform dimensionality reduction on the initial temporal feature information and the action feature information respectively; perform a mean pooling operation on the dimensionality-reduced initial temporal feature information; and merge the mean-pooled initial temporal feature information with the dimensionality-reduced action feature information to obtain the temporal feature information corresponding to the target object.
- The action recognition device according to claim 17, wherein when determining the temporal feature information corresponding to the target object based on the initial temporal feature information and the action feature information, the scene and temporal feature determination module is further configured to: take the obtained temporal feature information corresponding to the target object as new initial temporal feature information, and return to the step of performing dimensionality reduction on the initial temporal feature information and the action feature information respectively.
- An electronic device, comprising a processor and a storage medium connected to each other, the storage medium storing machine-readable instructions executable by the processor, wherein when the electronic device runs, the processor executes the machine-readable instructions to perform the action recognition method according to any one of claims 1 to 9.
- A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is run by a processor, the action recognition method according to any one of claims 1 to 9 is performed.
- A computer program comprising computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device executes the action recognition method according to any one of claims 1 to 9.
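The cropping and resizing steps described in claims 12 and 14 can be sketched as follows. This is an illustrative sketch only, not the patented implementation: images are represented as plain 2-D lists of pixel values and the object box as `(top, left, bottom, right)` in pixel coordinates, both of which are assumptions made for this sketch rather than details taken from the claims.

```python
# Illustrative sketch of claims 12 and 14: apply the key frame's object
# box to each associated frame, then normalise every crop to a preset
# resolution. The list-based image representation is an assumption.

def crop_box(frame, box):
    """Cut the region inside `box` = (top, left, bottom, right) out of `frame`."""
    top, left, bottom, right = box
    return [row[left:right] for row in frame[top:bottom]]

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize to a preset resolution (claim 14);
    a real pipeline would typically use bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def target_object_images(associated_frames, key_frame_box, out_h, out_w):
    """Claim 12: crop the same box from every associated frame, yielding
    one target object image per frame, each at the preset resolution."""
    return [resize_nearest(crop_box(f, key_frame_box), out_h, out_w)
            for f in associated_frames]
```

For example, with 4x4 frames and the box `(1, 1, 3, 3)`, each associated frame yields one 2x2 crop, which is then brought to the preset resolution before feature extraction.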
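The first sub-video clip of claim 13 (the key frame plus N temporally adjacent images) can be sketched like this. The claim does not fix how the N neighbours are split around the key frame; the even split below, clipped at the clip boundaries, is an assumption for illustration.

```python
# Sketch of claim 13's sub-clip selection: the key frame plus up to N
# temporally adjacent frames. The even before/after split is an
# assumption; the claim only requires temporal adjacency.

def first_sub_clip(frames, key_index, n):
    """Return the key frame and up to N temporally adjacent frames."""
    before = n // 2
    start = max(0, key_index - before)
    end = min(len(frames), key_index + (n - before) + 1)
    return frames[start:end]
```

In a 10-frame clip with the key frame at index 5 and N = 4, this yields frames 3 through 7; the associated images of claim 12 would then be filtered from this sub-clip.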
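The scene feature extraction of claim 15 collapses a set of associated images into a single scene descriptor. A real system would run the frames through a pretrained backbone network; the per-clip mean pixel value below is a toy stand-in, included only to illustrate the data flow (many frames in, one descriptor out), and is entirely an assumption of this sketch.

```python
# Toy stand-in for claim 15's scene feature extraction. Averaging pixel
# values is NOT the patented method; it merely shows the shape of the
# operation: a set of associated frames reduces to one scene descriptor.

def scene_feature(associated_frames):
    """Average pixel values per frame, then average over the frames,
    yielding one scalar scene descriptor for the clip."""
    per_frame = [sum(sum(row) for row in f) / (len(f) * len(f[0]))
                 for f in associated_frames]
    return sum(per_frame) / len(per_frame)
```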
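The fusion pipeline of claims 17 and 18 (dimensionality reduction, mean pooling, merging, then feeding the result back as new initial features) can be sketched numerically. Features are plain Python lists, and "dimensionality reduction" is simulated by truncating to the first k components; a real model would use a learned projection. All of these representations are assumptions of this sketch.

```python
# Sketch of claims 17 and 18: reduce both feature sets, mean-pool the
# initial temporal features, concatenate with the action features, and
# optionally iterate by feeding the fused result back in.

def reduce_dim(vec, k):
    """Toy dimensionality reduction: keep the first k components
    (a learned projection would be used in practice)."""
    return vec[:k]

def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def fuse(initial_feats, action_feat, k):
    """Claim 17: reduce both inputs, mean-pool the initial temporal
    features, then merge (concatenate) with the action features."""
    pooled = mean_pool([reduce_dim(v, k) for v in initial_feats])
    return pooled + reduce_dim(action_feat, k)

def fuse_iteratively(initial_feats, action_feat, k, rounds):
    """Claim 18: take the fused result as the new initial temporal
    feature information and return to the reduction step."""
    feats = initial_feats
    for _ in range(rounds):
        feats = [fuse(feats, action_feat, k)]
    return feats[0]
```

With two 4-dimensional initial feature vectors and k = 2, one fusion round pools them to a 2-vector and appends the reduced action features, giving a 4-dimensional temporal feature for the target object.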
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2021562324A JP2022529299A (en) | 2020-03-11 | 2021-02-22 | Operation identification methods and devices, electronic devices, computer readable storage media |
KR1020217036106A KR20210145271A (en) | 2020-03-11 | 2021-02-22 | Motion recognition method and apparatus, electronic device, computer readable storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010166148.8 | 2020-03-11 | ||
CN202010166148.8A CN111401205B (en) | 2020-03-11 | 2020-03-11 | Action recognition method and device, electronic equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021179898A1 true WO2021179898A1 (en) | 2021-09-16 |
Family
ID=71432295
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/077268 WO2021179898A1 (en) | 2020-03-11 | 2021-02-22 | Action recognition method and apparatus, electronic device, and computer-readable storage medium |
Country Status (5)
Country | Link |
---|---|
JP (1) | JP2022529299A (en) |
KR (1) | KR20210145271A (en) |
CN (1) | CN111401205B (en) |
TW (1) | TW202135002A (en) |
WO (1) | WO2021179898A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401205B (en) * | 2020-03-11 | 2022-09-23 | 深圳市商汤科技有限公司 | Action recognition method and device, electronic equipment and computer readable storage medium |
US11270147B1 (en) | 2020-10-05 | 2022-03-08 | International Business Machines Corporation | Action-object recognition in cluttered video scenes using text |
CN112800278B (en) * | 2021-03-30 | 2021-07-09 | 腾讯科技(深圳)有限公司 | Video type determination method and device and electronic equipment |
US11423252B1 (en) | 2021-04-29 | 2022-08-23 | International Business Machines Corporation | Object dataset creation or modification using labeled action-object videos |
CN114120180B (en) * | 2021-11-12 | 2023-07-21 | 北京百度网讯科技有限公司 | Time sequence nomination generation method, device, equipment and medium |
TWI797014B (en) * | 2022-05-16 | 2023-03-21 | 國立虎尾科技大學 | Table tennis pose classifying method and table tennis interaction system |
CN116824641B (en) * | 2023-08-29 | 2024-01-09 | 卡奥斯工业智能研究院(青岛)有限公司 | Gesture classification method, device, equipment and computer storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105183758A (en) * | 2015-07-22 | 2015-12-23 | 深圳市万姓宗祠网络科技股份有限公司 | Content recognition method for continuously recorded video or image |
CN109492581A (en) * | 2018-11-09 | 2019-03-19 | 中国石油大学(华东) | A kind of human motion recognition method based on TP-STG frame |
CN109800689A (en) * | 2019-01-04 | 2019-05-24 | 西南交通大学 | A kind of method for tracking target based on space-time characteristic fusion study |
CN110309784A (en) * | 2019-07-02 | 2019-10-08 | 北京百度网讯科技有限公司 | Action recognition processing method, device, equipment and storage medium |
CN110427807A (en) * | 2019-06-21 | 2019-11-08 | 诸暨思阔信息科技有限公司 | A kind of temporal events motion detection method |
CN111401205A (en) * | 2020-03-11 | 2020-07-10 | 深圳市商汤科技有限公司 | Action recognition method and device, electronic equipment and computer readable storage medium |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101334845B (en) * | 2007-06-27 | 2010-12-22 | 中国科学院自动化研究所 | Video frequency behaviors recognition method based on track sequence analysis and rule induction |
CN101236656B (en) * | 2008-02-29 | 2011-06-15 | 上海华平信息技术股份有限公司 | Movement target detection method based on block-dividing image |
CN101826155B (en) * | 2010-04-02 | 2012-07-25 | 浙江大学 | Method for identifying act of shooting based on Haar characteristic and dynamic time sequence matching |
US8855369B2 (en) * | 2012-06-22 | 2014-10-07 | Microsoft Corporation | Self learning face recognition using depth based tracking for database generation and update |
JP6393495B2 (en) * | 2014-03-20 | 2018-09-19 | 日本ユニシス株式会社 | Image processing apparatus and object recognition method |
EP3107069A4 (en) * | 2014-03-24 | 2017-10-04 | Hitachi, Ltd. | Object detection apparatus, object detection method, and mobile robot |
JP6128501B2 (en) * | 2016-03-17 | 2017-05-17 | ヤフー株式会社 | Time-series data analysis device, time-series data analysis method, and program |
JP2017187850A (en) * | 2016-04-01 | 2017-10-12 | 株式会社リコー | Image processing system, information processing device, and program |
US10997421B2 (en) * | 2017-03-30 | 2021-05-04 | Hrl Laboratories, Llc | Neuromorphic system for real-time visual activity recognition |
JP6773061B2 (en) * | 2018-02-16 | 2020-10-21 | 新東工業株式会社 | Evaluation system, evaluation device, evaluation method, evaluation program, and recording medium |
CN108537829B (en) * | 2018-03-28 | 2021-04-13 | 哈尔滨工业大学 | Monitoring video personnel state identification method |
CN110147711B (en) * | 2019-02-27 | 2023-11-14 | 腾讯科技(深圳)有限公司 | Video scene recognition method and device, storage medium and electronic device |
CN110414335A (en) * | 2019-06-20 | 2019-11-05 | 北京奇艺世纪科技有限公司 | Video frequency identifying method, device and computer readable storage medium |
CN110826447A (en) * | 2019-10-29 | 2020-02-21 | 北京工商大学 | Restaurant kitchen staff behavior identification method based on attention mechanism |
- 2020
  - 2020-03-11 CN CN202010166148.8A patent/CN111401205B/en active Active
- 2021
  - 2021-02-22 JP JP2021562324A patent/JP2022529299A/en active Pending
  - 2021-02-22 WO PCT/CN2021/077268 patent/WO2021179898A1/en active Application Filing
  - 2021-02-22 KR KR1020217036106A patent/KR20210145271A/en active Search and Examination
  - 2021-03-09 TW TW110108378A patent/TW202135002A/en unknown
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114268849A (en) * | 2022-01-29 | 2022-04-01 | 北京卡路里信息技术有限公司 | Video processing method and device |
CN115229804A (en) * | 2022-09-21 | 2022-10-25 | 荣耀终端有限公司 | Method and device for attaching component |
CN115229804B (en) * | 2022-09-21 | 2023-02-17 | 荣耀终端有限公司 | Method and device for attaching component |
WO2024082943A1 (en) * | 2022-10-20 | 2024-04-25 | 腾讯科技(深圳)有限公司 | Video detection method and apparatus, storage medium, and electronic device |
CN117711014A (en) * | 2023-07-28 | 2024-03-15 | 荣耀终端有限公司 | Method and device for identifying space-apart gestures, electronic equipment and readable storage medium |
CN117711014B (en) * | 2023-07-28 | 2024-08-27 | 荣耀终端有限公司 | Method and device for identifying space-apart gestures, electronic equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
KR20210145271A (en) | 2021-12-01 |
TW202135002A (en) | 2021-09-16 |
CN111401205B (en) | 2022-09-23 |
JP2022529299A (en) | 2022-06-20 |
CN111401205A (en) | 2020-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021179898A1 (en) | Action recognition method and apparatus, electronic device, and computer-readable storage medium | |
CN110569721B (en) | Recognition model training method, image recognition method, device, equipment and medium | |
CN105243395B (en) | A kind of human body image comparison method and device | |
WO2016127478A1 (en) | Image processing method and device, and terminal | |
WO2016187888A1 (en) | Keyword notification method and device based on character recognition, and computer program product | |
KR102087882B1 (en) | Device and method for media stream recognition based on visual image matching | |
US9213898B2 (en) | Object detection and extraction from image sequences | |
US11871125B2 (en) | Method of processing a series of events received asynchronously from an array of pixels of an event-based light sensor | |
JP7419080B2 (en) | computer systems and programs | |
JP2018170003A (en) | Detection device and method for event in video, and image processor | |
KR101821989B1 (en) | Method of providing detection of moving objects in the CCTV video data by reconstructive video processing | |
US10296539B2 (en) | Image extraction system, image extraction method, image extraction program, and recording medium storing program | |
US20210084198A1 (en) | Method and apparatus for removing video jitter | |
US20160110909A1 (en) | Method and apparatus for creating texture map and method of creating database | |
CN108921150B (en) | Face recognition system based on network hard disk video recorder | |
JP2010257267A (en) | Device, method and program for detecting object area | |
US9392146B2 (en) | Apparatus and method for extracting object | |
JPWO2018179119A1 (en) | Video analysis device, video analysis method, and program | |
CN111860559A (en) | Image processing method, image processing device, electronic equipment and storage medium | |
KR20210067824A (en) | Method for single image dehazing based on deep learning, recording medium and device for performing the method | |
CN112232113B (en) | Person identification method, person identification device, storage medium, and electronic apparatus | |
US10339660B2 (en) | Video fingerprint system and method thereof | |
CN103268606B (en) | A kind of depth information compensation method of motion blur image and device | |
CN109492755B (en) | Image processing method, image processing apparatus, and computer-readable storage medium | |
KR101826463B1 (en) | Method and apparatus for synchronizing time line of videos |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| ENP | Entry into the national phase | Ref document number: 2021562324; Country of ref document: JP; Kind code of ref document: A |
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21768369; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 20217036106; Country of ref document: KR; Kind code of ref document: A |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.01.2023) |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 21768369; Country of ref document: EP; Kind code of ref document: A1 |