WO2021179898A1 - Action recognition method and apparatus, electronic device, and computer-readable storage medium - Google Patents

Action recognition method and apparatus, electronic device, and computer-readable storage medium Download PDF

Info

Publication number
WO2021179898A1
WO2021179898A1 (PCT/CN2021/077268)
Authority
WO
WIPO (PCT)
Prior art keywords
feature information
target object
action
key frame
frame image
Prior art date
Application number
PCT/CN2021/077268
Other languages
French (fr)
Chinese (zh)
Inventor
吴建超
段佳琦
旷章辉
张伟
Original Assignee
深圳市商汤科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市商汤科技有限公司
Priority to JP2021562324A priority Critical patent/JP2022529299A/en
Priority to KR1020217036106A priority patent/KR20210145271A/en
Publication of WO2021179898A1 publication Critical patent/WO2021179898A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Definitions

  • the present disclosure relates to the fields of computer technology and image processing, and in particular, to an action recognition method and device, electronic equipment, and computer-readable storage media.
  • Motion detection and recognition are widely used in robotics, safety and health and other fields.
  • In the related art, when performing action recognition, factors such as the limited data processing capability of the recognition device and the single type of data used for the recognition lead to low action recognition accuracy.
  • the present disclosure at least provides an action recognition method and device, electronic equipment, and computer-readable storage medium.
  • an action recognition method including:
  • In the embodiments of the present disclosure, the object frame corresponding to the target object is used to determine the action feature information, instead of the entire frame of the image. This effectively reduces the amount of data used for action recognition in each frame, which in turn allows more images to be used for action recognition and helps improve its accuracy. In addition, this aspect not only uses the action feature information of the target object for action classification and recognition, but also uses the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time sequence feature information associated with the target object's actions. On the basis of the action feature information, combining the scene feature information and the time sequence feature information can further improve the accuracy of action recognition.
  • the above-mentioned action recognition method further includes the step of determining an object frame in the key frame image:
  • According to preset expansion size information, the initial object bounding box is expanded to obtain the object frame of the target object in the key frame image.
  • In the embodiments of the present disclosure, an object detection method is used to determine the bounding box of the target object in the image, which reduces the amount of data that needs to be processed for action recognition. After a smaller initial object bounding box is determined, it is expanded, so that the object frame used for action recognition includes more complete target object information, more environmental information, and more spatial detail, thereby helping to improve the accuracy of action recognition.
  • the determining the action feature information of the target object based on the object frame in the key frame image of the video clip includes:
  • according to the object frame corresponding to the key frame image, respectively crop partial images from at least some of the associated images corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image;
  • the action characteristic information of the target object is determined.
  • In the embodiments of the present disclosure, the object frame in the key frame image is used to locate the target object, and the target object images used to determine the action feature information are cropped from multiple associated images associated with the key frame image. This improves the accuracy of the images used to determine the action feature information and increases the number of such images, so that the accuracy of action recognition can be improved.
  • filtering out multiple associated images corresponding to key frame images from the video clip includes:
  • selecting, from the video clip, a first sub-video segment that includes the key frame image; the first sub-video segment also includes N images that are temporally adjacent to the key frame image, where N is a positive integer;
  • the multiple associated images are filtered from the first sub video segment.
  • In the embodiments of the present disclosure, the images associated with the key frame image are filtered from a sub-video segment shot close in time to the key frame image, so the images most closely associated with the key frame image can be selected; basing the determined action feature information on these most closely associated images improves its accuracy.
  • the method further includes:
  • the target object image is set as an image with a preset image resolution.
  • In the embodiments of the present disclosure, setting the target object image to a preset resolution can increase the amount of information it carries; that is, the cropped target object image can be magnified, which helps capture fine-grained details of the target object and thereby improves the accuracy of the determined action feature information.
  • the determining the scene characteristic information and time sequence characteristic information corresponding to the target object based on the video clip and the action characteristic information includes:
  • the time sequence feature information corresponding to the target object is determined.
  • In the embodiments of the present disclosure, by extracting scene features from the associated images associated with the key frame image, relatively complete scene feature information can be obtained, and the accuracy of action recognition can be improved on that basis. In addition, the embodiments of the present disclosure extract the time sequence features of objects other than the target object, that is, the above-mentioned initial time sequence feature information, and determine the time sequence feature information associated with the target object based on those features and the action feature information of the target object. Using the time sequence feature information associated with the target object can further improve the accuracy of action recognition.
  • the performing a time-series feature extraction operation on objects other than the target object in the video clip to obtain initial time-series feature information includes:
  • a second sub-video segment including the key frame image is selected from the video clip; the second sub-video segment further includes P images that are temporally adjacent to the key frame image, where P is a positive integer;
  • In the embodiments of the present disclosure, sub-video segments shot close in time to the key frame image are selected from the video clip for time sequence feature extraction, which reduces the amount of feature data to extract and improves the relevance of the extracted features to the key frame image, benefiting recognition accuracy. In addition, the action features of other objects are used as the time sequence features, which makes the time sequence features used in action recognition more targeted and further helps improve its accuracy.
  • the determining the time sequence characteristic information corresponding to the target object based on the initial time sequence characteristic information and the action characteristic information includes:
  • the initial time series feature information after the mean pooling operation and the action feature information after the dimensionality reduction process are combined to obtain the time series feature information corresponding to the target object.
  • In the embodiments of the present disclosure, the initial time sequence feature information and the action feature information are reduced in dimensionality, which reduces the amount of data to be processed and improves the efficiency of action recognition. In addition, a mean pooling operation is performed on the dimensionality-reduced initial time sequence feature information, which simplifies the time sequence feature extraction steps and further improves the efficiency of action recognition.
  • the determining the time sequence characteristic information corresponding to the target object based on the initial time sequence characteristic information and the action characteristic information further includes:
  • the time sequence feature extraction operation for determining the time sequence feature information corresponding to the target object is repeatedly executed, which can improve the accuracy of the determined time sequence feature information.
  • an action recognition device including:
  • Video acquisition module configured to acquire video clips
  • An action feature determining module configured to determine the action feature information of the target object based on the object frame in the key frame image of the target object in the video clip;
  • a scene timing feature determining module configured to determine the scene feature information and timing feature information corresponding to the target object based on the video clip and the action feature information;
  • the action recognition module is configured to determine the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
  • the action feature determining module is further configured to determine the object frame in the key frame image:
  • the initial object bounding box is expanded to obtain the object bounding box of the target object in the key frame image.
  • When the action feature determination module determines the action feature information of the target object based on the object frame in the key frame image of the video clip, it is configured to:
  • according to the object frame corresponding to the key frame image, respectively crop partial images from at least some of the associated images corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image;
  • the action characteristic information of the target object is determined.
  • When the action feature determining module filters out multiple associated images corresponding to the key frame image from the video clip, it is configured to:
  • select, from the video clip, a first sub-video segment that includes the key frame image; the first sub-video segment also includes N images that are temporally adjacent to the key frame image, where N is a positive integer;
  • the motion characteristic determination module is further configured to:
  • the target object image is set as an image with a preset image resolution.
  • When the scene time sequence feature determination module determines the scene feature information and time sequence feature information corresponding to the target object based on the video segment and the action feature information, it is configured to:
  • the time sequence feature information corresponding to the target object is determined.
  • When the scene timing feature determination module performs a timing feature extraction operation on objects other than the target object in the video clip to obtain initial timing feature information, it is configured to:
  • a second sub video segment including a key frame image is selected from the video segments; the second sub video segment further includes P images that are temporally adjacent to the key frame image; wherein, P is a positive integer;
  • When the scene time sequence feature determination module determines the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, it is configured to:
  • the initial time series feature information after the mean pooling operation and the action feature information after the dimensionality reduction process are combined to obtain the time series feature information corresponding to the target object.
  • When the scene sequence feature determination module determines the sequence feature information corresponding to the target object based on the initial sequence feature information and the action feature information, it is further configured to:
  • the present disclosure provides an electronic device including: a processor and a storage medium connected to each other.
  • the storage medium stores machine-readable instructions executable by the processor.
  • The processor executes the machine-readable instructions to perform the steps of the above-mentioned action recognition method.
  • the present disclosure also provides a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and the computer program executes the steps of the above-mentioned action recognition method when the computer program is run by a processor.
  • The present disclosure also provides a computer program, including computer-readable code; when the code runs in an electronic device, a processor in the electronic device executes the steps of the above-mentioned action recognition method.
  • The foregoing apparatus, electronic device, and computer-readable storage medium of the present disclosure contain technical features that are substantially the same as, or similar to, the technical features of the foregoing method or of any embodiment of any aspect of the present disclosure. Therefore, for the effect descriptions of the apparatus, the electronic device, and the computer-readable storage medium, please refer to the effect description of the above method, which will not be repeated here.
  • FIG. 1A shows a flowchart of an action recognition method provided by an embodiment of the present disclosure
  • FIG. 1B shows a schematic diagram of a network architecture provided by an embodiment of the present disclosure
  • FIG. 2 shows a flowchart of determining the action feature information of a target object in another action recognition method provided by an embodiment of the present disclosure
  • FIG. 3 shows a flowchart of determining scene feature information and time sequence feature information corresponding to the target object in yet another action recognition method provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic structural diagram of a simplified timing feature extraction module in an embodiment of the present disclosure
  • FIG. 5 shows a flowchart of still another method for action recognition provided by an embodiment of the present disclosure
  • FIG. 6 shows a schematic structural diagram of an action recognition device provided by an embodiment of the present disclosure
  • Fig. 7 shows a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the present disclosure provides an action recognition method and device, electronic equipment, and computer-readable storage medium.
  • The present disclosure uses the object frame corresponding to the target object to determine the action feature information, instead of the entire frame of the image, which effectively reduces the amount of data used for action recognition in each frame and thus allows more images to be used, helping improve recognition accuracy. In addition, the present disclosure not only uses the action feature information of the target object for action classification and recognition, but also uses the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time sequence feature information associated with the target object's actions. On the basis of the action feature information, combining the scene feature information and the time sequence feature information can further improve the accuracy of action recognition.
  • the embodiments of the present disclosure provide an action recognition method, which is applied to a hardware device such as a terminal device that performs action recognition, and the method may also be implemented by a processor executing a computer program.
  • the action recognition method provided by the embodiment of the present disclosure includes the following steps:
  • The video clip is a piece of video used for action recognition, and includes multiple frames of images.
  • the images include a target object that needs to be motion recognized, and the target object may be a human or an animal.
  • The above-mentioned video clip can be shot by the terminal device that performs action recognition, using its own camera or other shooting equipment, or it can be shot by another shooting device and then passed to the terminal device for action recognition.
  • S120 Determine the motion feature information of the target object based on the object frame in the key frame image of the target object in the video clip.
  • the object frame is the bounding box surrounding the target object.
  • the image information in the bounding box is used to determine the motion characteristic information of the target object, the amount of data processed by the terminal device can be reduced.
  • the present disclosure does not limit the method of filtering key frame images from video clips.
  • After the key frame images are filtered out, the object frame in each key frame image can be used to determine the action feature information of the target object.
  • Object detection methods can be used; for example, a human body detector can determine the object frame using a human body detection method.
  • Other methods can also be used to determine the object frame; the present disclosure does not limit the method for determining the object frame.
  • the object frame detected by the human body detector may be used as the final object frame used to determine the action feature information.
  • Since the object frame detected by the human body detector may be a small frame that just encloses the target object, in order to obtain more complete target object information and more environmental information, after detection the object frame in each key frame image may be expanded according to preset expansion size information to obtain the final object frame of the target object in that key frame image. The determined final object frame is then used to determine the action feature information of the target object.
  • the expansion size information for expanding the object frame is preset.
  • the expansion size information includes a first extension length of the object frame in the length direction and a second extension length of the object frame in the width direction.
  • the length of the object frame is extended to both sides according to the first extension length, and the two sides in the length direction are respectively extended by half of the first extension length.
  • the width of the object frame is extended to both sides according to the second extension length, and both sides in the width direction are respectively extended by half of the second extension length.
  • the first extension length and the second extension length may be preset specific values, or may be determined based on the length and width of the object frame directly detected by the human body detector.
  • the first extension length may be equal to the length of the object frame directly detected by the human body detector
  • the second extension length may be equal to the width of the object frame directly detected by the human body detector.
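  • For illustration, the expansion rule above can be written in a few lines. The following is a minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) pixel coordinates and the example where each extension equals the detected length or width; the function name and the clamping to the image bounds are illustrative choices, not taken from the patent.

```python
def expand_box(box, img_w, img_h, scale=1.0):
    """Symmetrically expand a detected bounding box.

    box: (x1, y1, x2, y2) initial box from the detector.
    scale=1.0 extends the box by its own width/height in total,
    half on each side, matching the example above.
    """
    x1, y1, x2, y2 = box
    dx = scale * (x2 - x1) / 2  # half of the first extension length per side
    dy = scale * (y2 - y1) / 2  # half of the second extension length per side
    return (max(0, x1 - dx), max(0, y1 - dy),      # clamp to the image bounds
            min(img_w, x2 + dx), min(img_h, y2 + dy))

# A 100x200 detection inside a 1280x720 frame doubles in each dimension:
print(expand_box((500, 200, 600, 400), 1280, 720))  # (450.0, 100.0, 650.0, 500.0)
```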
  • In the embodiments of the present disclosure, an object detection method is used to determine the bounding box of the target object in the image, which reduces the amount of data that needs to be processed for action recognition. After a smaller initial bounding box is determined, it is expanded, so that the object frame used for action recognition includes more complete target object information and more environmental information, thereby helping to improve the accuracy of action recognition.
  • the aforementioned action feature information is extracted from the image in the video clip, and can characterize the action feature of the target object.
  • the scene feature information is used to characterize the scene feature of the scene in which the target object is located, and may be obtained by performing scene feature extraction from at least part of the associated images associated with the key frame image.
  • The time sequence feature information is feature information temporally related to the target object's actions; for example, it can be the action feature information of objects in the video clip other than the target object. In specific implementations, it can be determined based on the video clip and the target object's action feature information.
  • S140 Determine the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
  • In specific implementations, the above three types of information can be combined, for example by splicing; the combined information is then classified to obtain the action type of the target object, thereby realizing action recognition of the target object.
  • In the embodiments of the present disclosure, the object frame corresponding to the target object is used to determine the action feature information instead of the entire frame of the image, which effectively reduces the amount of data used for action recognition in each frame and thus allows more images to be used, helping improve recognition accuracy. In addition, the embodiments of the present disclosure not only use the action feature information of the target object for action classification and recognition, but also use the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time sequence feature information associated with the target object's actions. On the basis of the action feature information, combining the scene feature information and the time sequence feature information can further improve the accuracy of action recognition.
  • FIG. 1B shows a schematic diagram of a network architecture provided by an embodiment of the present disclosure. The network architecture includes: a user terminal 201, a network 202, and an action recognition terminal device 203.
  • the user terminal 201 and the action recognition terminal device 203 establish a communication connection through the network 202.
  • In a possible implementation, the user terminal 201 sends request information for determining an action type to the action recognition terminal device 203 through the network 202. The action recognition terminal device 203 then obtains the video clip and uses the object frame corresponding to the target object to determine the action feature information; it also uses the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time sequence feature information associated with the target object's actions. Finally, it considers the scene feature information and the time sequence feature information together to determine the target object's action type with higher accuracy, and feeds the determined action type back to the user terminal 201.
  • the user terminal 201 may include a device with data processing capabilities
  • the motion recognition terminal device 203 may include an image acquisition device, and a processing device with data processing capabilities or a remote server.
  • the network 202 may adopt a wired connection or a wireless connection.
  • When the action recognition terminal device 203 is a processing device with data processing capabilities, the user terminal 201 can communicate with it through a wired connection, such as data communication via a bus; when the action recognition terminal device 203 is a remote server, the user terminal can interact with the remote server through a wireless network.
  • the foregoing determination of the motion feature information of the target object based on the object frame of the target object in the key frame image in the video clip can be specifically implemented by using the following steps:
  • An associated image of the key frame image is an image whose features are similar to those of the key frame image; for example, it may be an image shot at a time close to that of the key frame image.
  • the following sub-steps can be used to filter the associated images corresponding to the key frame images:
  • Sub-step 1 Select a first sub-video segment that includes the key frame image from the video clip; the first sub-video segment also includes N images that are temporally adjacent to the key frame image, where N is a positive integer.
  • The key frame image may be located in the first half of the first sub-video segment, in the second half, or in the middle of the segment or near the middle.
  • a sub-video segment including a key frame image can be intercepted from the video segment, for example, a 64-frame sub-video segment can be intercepted.
  • the key frame image is in the middle of the sub-video segment or a position close to the middle.
  • For example, the sub-video segment includes the 32 frames before the key frame image, the key frame image, and the 31 frames after it. In another example, the key frame image is in the first half of the sub-video segment; in yet another, the key frame image is in the second half, and the sub-video segment includes the 50 frames before the key frame image, the key frame image, and the 13 frames after it.
  • In specific implementations, the key frame image may also be located at either end of the first sub-video segment; that is, the above-mentioned N images temporally adjacent to the key frame image are the N images before, or the N images after, the key frame image.
  • the present disclosure does not limit the position of the key frame image in the first sub-video segment.
  • Sub-step 2 Filter the multiple associated images from the first sub-video segment.
  • In specific implementations, the associated images may be filtered from the first sub-video segment based on a preset time interval; for example, T frames of associated images can be obtained by sparsely sampling the first sub-video segment with a time span τ.
  • the related images obtained by screening may include or may not include key frame images, and have a certain degree of randomness. The present disclosure does not limit whether the related images include key frame images.
  • the image similarity between each frame of the image in the first sub-video segment and the key frame image may be calculated first, and then multiple images with the highest image similarity are selected as the associated images associated with the key frame image.
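  • As a sketch of the stride-based option, the following picks T frame indices from a window around the key frame; the window size, T, and stride are illustrative values, not prescribed by the patent.

```python
def sample_associated_frames(clip_len, key_idx, window=64, T=8, stride=8):
    """Sparsely sample T frame indices from a window containing the key frame.

    The window is shifted where necessary so it stays inside the clip,
    then every `stride`-th frame within it is kept.
    """
    start = max(0, min(key_idx - window // 2, clip_len - window))
    return [start + i * stride for i in range(T)]

print(sample_associated_frames(clip_len=300, key_idx=150))
# [118, 126, 134, 142, 150, 158, 166, 174]  (here the key frame happens to be kept)
```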
  • S220 According to the object frame corresponding to the key frame image, respectively intercept partial images from at least part of the associated images corresponding to the key frame image to obtain multiple target object images corresponding to the key frame image.
  • In specific implementations, the object frame corresponding to the key frame image is used to crop partial images from some or all of the associated images associated with it. When cropping target object images from only part of the associated images, the associated images closest in shooting time to the key frame image can be selected from all the associated images; of course, other selection methods can also be used, for example selecting some associated images from all of them at a certain time interval.
  • The specific process includes: first copying the object frame onto all or some of the associated images in chronological order.
  • In specific implementations, the coordinate information of the object frame on the key frame image is used to replicate the frame on the associated images; for example, according to that coordinate information, the frame position can be offset in time order or copied directly to obtain the object frame on each associated image.
  • the associated image is cropped according to the object frame to obtain the target object image, that is, the image within the object frame in the associated image is intercepted as the target object image.
  • It should be noted that the function of the key frame image here is to locate the target object image; it is not necessarily used to determine the action feature information directly. For example, when the associated images do not include the key frame image, no target object image used to determine the action feature information is cropped from the key frame image.
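  • A small sketch of the copy-and-crop step described above, assuming the simpler of the two options mentioned: the expanded key-frame box is copied unchanged onto every sampled associated frame before cropping.

```python
import numpy as np

def crop_target_images(frames, box):
    """Crop the same object box out of every associated frame.

    frames: list of HxWx3 uint8 arrays (the sampled associated images).
    box: (x1, y1, x2, y2) copied from the key frame image.
    """
    x1, y1, x2, y2 = (int(v) for v in box)
    return [f[y1:y2, x1:x2] for f in frames]

frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(8)]
crops = crop_target_images(frames, (450, 100, 650, 500))
print(crops[0].shape)  # (400, 200, 3)
```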
  • S230 Determine the action feature information of the target object based on the multiple target object images corresponding to the key frame.
  • action feature extraction can be performed on multiple target object images.
  • In specific implementations, a three-dimensional (3D) convolutional neural network can be used to process the target object images and extract the action features in them, so as to obtain the action feature information of the target object.
  • the following steps may be used to process the target object image:
  • the target object image is set as an image with a preset image resolution.
  • the aforementioned preset image resolution is higher than the original image resolution of the target object image.
  • existing methods or tools can be used to set the image resolution of the target object image, for example, interpolation and other methods can be used to adjust the image resolution of the target object image.
  • Setting the target object image to the preset resolution can increase the amount of information it includes; that is, the cropped target object image can be enlarged so that finer-grained details of the target object are retained, which can improve the accuracy of the determined action feature information.
  • The above-mentioned preset image resolution can be set to H×W; the number of target object images cropped for each key frame image is T, and each target object image has 3 channels, so the input to the 3D convolutional neural network is a T×H×W×3 image block. After feature extraction, a 2048-dimensional feature vector can be obtained, which is the aforementioned action feature information.
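  • Putting the crop, resize, and extraction steps together, the following hedged sketch resizes the T crops to the preset H×W, stacks them into a T×H×W×3 block, and runs a 3D convolutional network. The patent names no backbone; torchvision's r3d_18 is used purely as a stand-in, with its classifier head removed so a single pooled feature vector is produced (2048-dimensional in the text above, 512-dimensional for this particular stand-in).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models.video import r3d_18  # stand-in 3D CNN, not the patent's network

T, H, W = 8, 224, 224
crops = [torch.randint(0, 256, (400, 200, 3), dtype=torch.uint8) for _ in range(T)]

# Resize each crop to the preset resolution (interpolation, as mentioned above).
block = torch.stack([
    F.interpolate(c.permute(2, 0, 1).unsqueeze(0).float() / 255.0,
                  size=(H, W), mode="bilinear", align_corners=False).squeeze(0)
    for c in crops
])                                     # T x 3 x H x W

model = r3d_18(weights=None)
model.fc = nn.Identity()               # drop the classifier, keep the pooled feature
model.eval()

with torch.no_grad():                  # 3D CNNs expect N x C x T x H x W
    action_feat = model(block.permute(1, 0, 2, 3).unsqueeze(0))
print(action_feat.shape)               # torch.Size([1, 512])
```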
  • In the embodiments of the present disclosure, the object frame in the key frame image is used for positioning, and the target object images used to determine the action feature information are cropped from multiple associated images associated with the key frame image. This improves the accuracy of the images used for determining the action feature information and increases their number, so that the accuracy of action recognition can be improved.
  • the foregoing determination of scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information includes:
  • For the key frame image, filter multiple associated images corresponding to it from the video clip, and perform a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information.
  • a 3D convolutional neural network can be used to perform video scene feature extraction and global average pooling on part or all of the associated images to obtain a 2048-dimensional feature vector, which is the aforementioned scene feature information.
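  • The scene branch can be sketched the same way: the same kind of 3D backbone is applied to the full, uncropped associated frames, and its internal global average pooling yields one scene feature vector. Again, the network is a stand-in assumption, not the patent's.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18  # stand-in backbone, as before

model = r3d_18(weights=None)
model.fc = nn.Identity()     # global average pooling happens inside; classifier dropped
model.eval()

full_frames = torch.rand(1, 3, 8, 224, 224)  # whole associated images, not crops
with torch.no_grad():
    scene_feat = model(full_frames)
print(scene_feat.shape)      # torch.Size([1, 512]) scene feature vector
```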
  • S320 Perform a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information.
  • the initial time sequence feature information is the time sequence feature of other objects other than the target object, such as the action feature of other objects, which can be determined through the following sub-steps during specific implementation:
  • Sub-step 1 For the key frame image, select a second sub-video segment that includes the key frame image from the video clip; the second sub-video segment also includes P images that are temporally adjacent to the key frame image, where P is a positive integer.
  • The key frame image may be located in the first half of the second sub-video segment, in the second half, or in the middle of the segment or near the middle.
  • In specific implementations, the key frame image may also be located at either end of the second sub-video segment; that is, the P images temporally adjacent to the key frame image are the P images before, or the P images after, the key frame image.
  • the present disclosure does not limit the position of the key frame image in the second sub-video segment.
  • During specific implementation, a sub-video segment including the key frame image can be taken from the video clip; for example, a 2-second sub-video segment can be taken. This sub-video covers a longer time and is used to determine long-term time sequence features.
  • Sub-step 2 Extract the action features of objects other than the target object from each image in the second sub-video segment, and use the obtained action features as the initial time sequence feature information.
  • In specific implementations, a 3D convolutional neural network can be used to extract the action features of objects other than the target object in the sub-video segment, and the obtained initial time sequence feature information can be stored and used in the form of a long-term feature bank (LFB).
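  • As an illustration of the feature-bank pattern just described, a minimal sketch of storing per-frame features of the other objects and reading back a window around a key frame; the actual LFB design is considerably more elaborate, and all names and sizes here are assumptions.

```python
import torch

class FeatureBank:
    """Minimal long-term feature bank: per-frame features of objects
    other than the target, retrievable as a window around a key frame."""

    def __init__(self):
        self.bank = {}  # frame index -> list of feature vectors

    def add(self, frame_idx, feats):
        self.bank.setdefault(frame_idx, []).extend(feats)

    def window(self, key_idx, half_span):
        out = []
        for i in range(key_idx - half_span, key_idx + half_span + 1):
            out.extend(self.bank.get(i, []))
        return torch.stack(out) if out else torch.empty(0, 512)

bank = FeatureBank()
for i in range(100):  # pretend two other people were detected in each frame
    bank.add(i, [torch.rand(512), torch.rand(512)])
print(bank.window(key_idx=50, half_span=30).shape)  # torch.Size([122, 512])
```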
  • In the embodiments of the present disclosure, sub-video segments shot close in time to the key frame image are selected from the video clip for time sequence feature extraction, which reduces the amount of extracted feature data and improves the relevance of the features to the key frame image, benefiting recognition accuracy. In addition, the action features of other objects are used as the time sequence features, which makes the time sequence features used in action recognition more targeted and further helps improve its accuracy.
  • time sequence feature extraction on the initial time sequence feature information and the action feature information to obtain the time sequence feature information corresponding to the target object.
  • the following sub-steps can be used to perform time-series feature extraction on the initial time-series feature information and action feature information to obtain the time-series feature information corresponding to the target object:
  • Sub-step 1 Perform dimensionality reduction processing on the initial time series feature information and the action feature information respectively.
  • In specific implementations, the initial time sequence feature information and the action feature information can be reduced in dimensionality, which reduces the amount of data that needs to be processed and helps improve the efficiency of action recognition.
  • In specific implementations, dropout (random deactivation) processing can also be performed on the initial time sequence feature information and the action feature information. The dropout processing can be applied at the final network layer of the neural networks used to extract the initial time sequence feature information and the action feature information, or it can be applied at each network layer of those networks.
  • Sub-step 2 Perform the mean pooling operation on the dimensionality-reduced initial time sequence feature information.
  • Sub-step 3 The initial time series feature information after the mean pooling operation and the action feature information after the dimensionality reduction process are combined to obtain the time series feature information corresponding to the target object.
  • The above merging operation can specifically be channel splicing, that is, appending the channels of one piece of feature information to the channels of the other; the merging operation can also be an addition operation, that is, adding the mean-pooled initial time sequence feature information and the dimensionality-reduced action feature information together.
  • Sub-step two and sub-step three are essentially to perform a sequence feature extraction operation on the initial sequence feature information and the action feature information, and can be specifically implemented by using the simplified sequence feature extraction module as shown in FIG. 4.
  • The simplified time sequence feature extraction module 400 shown in FIG. 4 is used to extract the above-mentioned time sequence feature information, and may specifically include a linear layer 401, an average pooling layer 402, a normalization and activation (LN+ReLU) layer 403, and a dropout layer 404.
  • In the embodiment of the present disclosure, the time sequence feature extraction operation is simplified: only the average pooling layer is used to perform the mean pooling operation on the dimensionality-reduced initial time sequence feature information, and no normalization (softmax) operation is performed. This simplifies the operation steps of time sequence feature extraction, that is, it simplifies the existing time sequence feature extraction module, and can improve the efficiency of action recognition.
  • The existing time sequence feature extraction module does not include an average pooling layer but includes a classification normalization (softmax) layer, whose processing complexity is higher than that of the average pooling operation. The existing module also includes a linear layer before the dropout layer, which the simplified module in the present disclosure omits, so the efficiency of action recognition can be further improved.
  • the time series feature information output by the time series feature extraction module may be a 512-dimensional feature vector, and the 512-dimensional feature vector is the time series feature information of the aforementioned target object.
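  • A sketch of the simplified module of FIG. 4: linear reduction of both inputs (layer 401), mean pooling over the bank features (layer 402), LN+ReLU (layer 403), dropout (layer 404), then channel splicing. The exact placement of LN+ReLU and dropout, and all dimensions, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimplifiedTemporalModule(nn.Module):
    """No softmax attention and no extra linear layer before dropout:
    just reduce, mean-pool, normalize, and merge, as described above."""

    def __init__(self, bank_dim=512, action_dim=2048, out_dim=512):
        super().__init__()
        self.reduce_bank = nn.Linear(bank_dim, out_dim // 2)             # layer 401
        self.reduce_action = nn.Linear(action_dim, out_dim // 2)
        self.norm = nn.Sequential(nn.LayerNorm(out_dim // 2), nn.ReLU()) # layer 403
        self.drop = nn.Dropout(0.3)                                      # layer 404

    def forward(self, bank_feats, action_feat):
        # bank_feats: M x bank_dim (other objects); action_feat: action_dim.
        pooled = self.reduce_bank(bank_feats).mean(dim=0)                # layer 402
        pooled = self.drop(self.norm(pooled))
        return torch.cat([pooled, self.reduce_action(action_feat)], dim=-1)

m = SimplifiedTemporalModule()
print(m(torch.rand(122, 512), torch.rand(2048)).shape)  # torch.Size([512])
```

  • To connect several such modules in series, as described below, the 512-dimensional output can be fed back in as the new initial time sequence feature information, with the module dimensions chosen to match.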
  • In the embodiments of the present disclosure, extracting scene features from some or all of the associated images associated with the key frame image yields relatively complete scene feature information, on the basis of which the accuracy of action recognition can be improved. In addition, the time sequence features of objects other than the target object, that is, the above-mentioned initial time sequence feature information, are extracted, and the time sequence feature information associated with the target object is determined from them together with the action feature information; using this time sequence feature information can further improve the accuracy of action recognition.
  • In specific implementations, multiple time sequence feature extraction modules can be connected in series to extract the above time sequence feature information, with the time sequence feature information extracted by one module used as the input of the next.
  • That is, the time sequence feature information corresponding to the target object extracted by the previous module may be used as the new initial time sequence feature information, and the process returns to the above step of respectively reducing the dimensionality of the initial time sequence feature information and the action feature information.
  • three simplified timing feature extraction modules can be connected in series to determine the final timing feature information.
  • the embodiment of the present disclosure uses a person as a target object for action recognition.
  • the action recognition method of the embodiment of the present disclosure may include:
  • Step 1 Obtain a video clip 501, and filter key frame images from the above video clip;
  • Step 2 Use the human body detector 502 to locate the person in each key frame image, obtaining the initial object bounding box of the person, that is, of the target object;
  • Step 3 Expand the initial object bounding box according to the preset expansion size information to obtain the final object bounding box; then use the object bounding box to crop partial images from the associated images associated with the key frame image, obtaining the target object images corresponding to each key frame image;
  • Step 4 Input the target object images corresponding to all the obtained key frame images into the 3D convolutional neural network 503, and use the 3D convolutional neural network 503 to extract the action features of the target object, obtaining the action feature information corresponding to the target object.
  • Step 5 Input the associated image associated with the key frame image into the aforementioned 3D convolutional neural network 503, and use the 3D convolutional neural network 503 to extract the video scene features of the scene where the target object is located, to obtain scene feature information.
  • Step 6 Use another 3D convolutional neural network 503 to perform time sequence feature extraction on the video clip, that is, extract the action features of objects other than the target object to obtain the initial time sequence feature information. The initial time sequence feature information can exist in the form of a time sequence feature bank; the features can be extracted from the entire video clip, or from a longer sub-video segment of the video clip that includes the key frame image.
  • Step 7 Using the simplified time sequence feature extraction module 504, perform a time sequence feature extraction operation on the initial time sequence feature information and the action feature information to obtain the time sequence feature information corresponding to the target object.
  • Step 8 Perform splicing processing on the time series feature information, action feature information, and scene feature information, and use the action classifier 505 to classify the spliced information to obtain the action type of the target object.
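  • Tying steps 1 to 8 together, a high-level sketch of the recognition head: the three feature vectors are spliced and classified. The patent does not specify the classifier, so the plain linear layer and the class count below are illustrative assumptions.

```python
import torch
import torch.nn as nn

temporal_feat = torch.rand(512)   # step 7 output
action_feat = torch.rand(2048)    # step 4 output
scene_feat = torch.rand(2048)     # step 5 output

classifier = nn.Linear(512 + 2048 + 2048, 80)  # e.g. 80 hypothetical action classes
logits = classifier(torch.cat([temporal_feat, action_feat, scene_feat]))
print(logits.shape, logits.argmax().item())    # torch.Size([80]) and the action type
```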
  • Based on the same inventive concept, the present disclosure also provides an action recognition device, which is applied to hardware devices such as terminal equipment that performs action recognition on a target object. Each module can implement the same method steps as in the above-mentioned method and achieve the same beneficial effects, so the same parts will not be repeated in this disclosure.
  • As shown in FIG. 6, an action recognition device provided by the present disclosure may include:
  • the video acquisition module 610 is configured to acquire video clips.
  • the action feature determining module 620 is configured to determine the action feature information of the target object based on the object frame in the key frame image of the target object in the video clip.
  • the scene timing feature determining module 630 is configured to determine the scene feature information and timing feature information corresponding to the target object based on the video clip and the action feature information.
  • the action recognition module 640 is configured to determine the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
  • the action feature determining module 620 is further configured to determine the object frame in the key frame image:
  • the initial object bounding box is expanded to obtain the object bounding box of the target object in the key frame image.
  • When the action feature determination module 620 determines the action feature information of the target object based on the object frame in the key frame image of the target object in the video clip, it is configured to:
  • according to the object frame corresponding to the key frame image, respectively crop partial images from at least some of the associated images corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image;
  • the action characteristic information of the target object is determined.
  • When the action feature determining module 620 filters out multiple associated images corresponding to the key frame images from the video clip, it is configured to:
  • select, from the video clip, a first sub-video segment that includes the key frame image; the first sub-video segment also includes N images that are temporally adjacent to the key frame image, where N is a positive integer;
  • the motion characteristic determination module 620 is further configured to:
  • the target object image is set as an image with a preset image resolution.
  • When the scene timing feature determination module 630 determines the scene feature information and timing feature information corresponding to the target object based on the video clip and the action feature information, it is configured to:
  • the time sequence feature information corresponding to the target object is determined.
  • When the scene timing feature determination module 630 performs a timing feature extraction operation on objects other than the target object in the video clip to obtain initial timing feature information, it is configured to:
  • a second sub video segment including a key frame image is selected from the video segments; the second sub video segment further includes P images that are temporally adjacent to the key frame image; wherein, P is a positive integer;
  • When the scene timing feature determining module 630 determines the timing feature information corresponding to the target object based on the initial timing feature information and the action feature information, it is configured to:
  • the initial time series feature information after the mean pooling operation and the action feature information after the dimensionality reduction process are combined to obtain the time series feature information corresponding to the target object.
  • When the scene timing feature determining module 630 determines the timing feature information corresponding to the target object based on the initial timing feature information and the action feature information, it is further configured to:
  • the embodiment of the present disclosure discloses an electronic device, as shown in FIG. 7, comprising: a processor 701 and a storage medium 702 connected to each other, and the storage medium stores machine-readable instructions executable by the processor.
  • the processor executes the machine-readable instructions to execute the steps of the above-mentioned action recognition method.
  • the processor 701 and the storage medium 702 may be connected through a bus 703.
  • The embodiments of the present disclosure also provide a computer program product corresponding to the above method and device, including a computer-readable storage medium storing program code. The instructions included in the program code can be used to execute the method in the preceding method embodiments; for specific implementation, please refer to the method embodiments, which will not be repeated here.
  • the computer-readable storage medium may be a volatile or non-volatile storage medium.
  • the modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present disclosure.
  • the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile computer-readable storage medium executable by a processor.
  • Based on this understanding, the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • The aforementioned storage media include: USB flash drives, mobile hard disks, ROM, RAM, magnetic disks, optical disks, and other media that can store program code.
  • The present disclosure provides an action recognition method and device, an electronic device, and a computer-readable storage medium. The method includes: obtaining a video clip; determining the action feature information of the target object based on the object frame in the key frame image in the video clip; determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information; and determining the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Provided are an action recognition method and apparatus, an electronic device and a computer-readable storage medium. In the method, action feature information is determined by using an object border frame corresponding to a target object instead of using the whole frame of an image, such that the data volume for action recognition in each frame of the image can be effectively reduced, thereby being able to increase the number of images for action recognition and facilitating the improvement in the accuracy of action recognition. In addition, in the method, the action feature information of the target object is used to perform action classification and recognition, and scenario feature information of a scenario where the target object is located and time sequence feature information associated with an action of the target object are extracted by using a video clip and the determined action feature information, and the accuracy of action recognition can be further improved by combining the scenario information and the time sequence feature information on the basis of the action feature information.

Description

Action recognition method and device, electronic equipment, and computer-readable storage medium
Cross-Reference to Related Applications
The present disclosure is filed based on the Chinese patent application with application number 202010166148.8, filed on March 11, 2020, and claims priority to that Chinese patent application, the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the fields of computer technology and image processing, and in particular, to an action recognition method and device, an electronic device, and a computer-readable storage medium.
Background
Motion detection and recognition are widely used in fields such as robotics, security, and health. In the related art, factors such as the limited data processing capability of the recognition device and the single type of data used for action recognition lead to low action recognition accuracy.
Summary of the Invention
In view of this, the present disclosure provides at least an action recognition method and device, an electronic device, and a computer-readable storage medium.
In a first aspect, the present disclosure provides an action recognition method, including:
obtaining a video clip;
determining action feature information of a target object based on an object frame of the target object in a key frame image of the video clip;
determining, based on the video clip and the action feature information, scene feature information and time sequence feature information corresponding to the target object; and
determining an action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
In the embodiments of the present disclosure, the object frame corresponding to the target object, rather than the whole image frame, is used to determine the action feature information. This effectively reduces the amount of data used for action recognition in each image frame, so that the number of images used for action recognition can be increased, which helps improve the accuracy of action recognition. In addition, this aspect not only uses the action feature information of the target object for action classification and recognition, but also uses the video clip and the determined action feature information to extract scene feature information of the scene where the target object is located and time sequence feature information associated with the action of the target object. Combining the scene feature information and the time sequence feature information with the action feature information can further improve the accuracy of action recognition.
In a possible implementation, the action recognition method further includes a step of determining the object frame in the key frame image:
filtering a key frame image from the video clip;
performing object detection on the filtered key frame image to determine an initial object bounding box of the target object in the key frame image; and
expanding the initial object bounding box according to preset expansion size information to obtain the object frame of the target object in the key frame image.
In the embodiments of the present disclosure, object detection is used to determine the frame of the target object in the image, which reduces the amount of data that needs to be processed for action recognition. Moreover, after a smaller initial object bounding box is determined, it is expanded, so that the object frame used for action recognition can include more complete information about the target object and more environmental information and retain more spatial details, which helps improve the accuracy of action recognition.
In a possible implementation, the determining the action feature information of the target object based on the object frame of the target object in the key frame image of the video clip includes:
for the key frame image, filtering out multiple associated images corresponding to the key frame image from the video clip;
cropping partial images from at least some of the associated images corresponding to the key frame image according to the object frame corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image; and
determining the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
In the embodiments of the present disclosure, the object frame of the target object in the key frame image is used for positioning, and the target object images used to determine the action feature information are cropped from multiple associated images associated with the key frame image. This improves the precision of the images used to determine the action feature information and can increase the number of such images, thereby improving the accuracy of action recognition.
In a possible implementation, filtering out multiple associated images corresponding to the key frame image from the video clip includes:
selecting, from the video clip, a first sub-video clip that includes the key frame image, the first sub-video clip further including N images temporally adjacent to the key frame image, where N is a positive integer; and
filtering the multiple associated images from the first sub-video clip.
In the embodiments of the present disclosure, images associated with the key frame image are filtered from a sub-video clip captured close in time to the key frame image, so that the images most closely associated with the key frame image can be selected. Based on these most closely associated images, the accuracy of the determined action feature information can be improved.
In a possible implementation, after obtaining the multiple target object images and before determining the action feature information of the target object, the method further includes:
setting the target object images to images with a preset image resolution.
In the embodiments of the present disclosure, after the target object images are cropped, setting them to a preset resolution can increase the amount of information included in each target object image; that is, the cropped target object images can be enlarged, which helps capture fine-grained details of the target object and thereby improves the accuracy of the determined action feature information.
In a possible implementation, the determining the scene feature information and the time sequence feature information corresponding to the target object based on the video clip and the action feature information includes:
for the key frame image, filtering out multiple associated images corresponding to the key frame image from the video clip;
performing a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information;
performing a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information; and
determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
In the embodiments of the present disclosure, extracting scene features from the associated images associated with the key frame image yields relatively complete scene feature information, which can improve the accuracy of action recognition. In addition, the time sequence features of objects other than the target object, i.e., the above initial time sequence feature information, are extracted, and the time sequence feature information associated with the target object is determined based on the time sequence features of the other objects and the action feature information of the target object. Using this time sequence feature information associated with the target object can further improve the accuracy of action recognition.
In a possible implementation, the performing a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information includes:
for the key frame image, selecting, from the video clip, a second sub-video clip that includes the key frame image, the second sub-video clip further including P images temporally adjacent to the key frame image, where P is a positive integer; and
extracting action features of objects other than the target object from the images in the second sub-video clip, and using the obtained action features as the initial time sequence feature information.
In the embodiments of the present disclosure, a sub-video clip captured close in time to the key frame image is selected from the video clip to extract time sequence features, which reduces the amount of extracted time sequence feature data and strengthens the association between the determined time sequence features and the key frame image, thereby helping improve the accuracy of action recognition. In addition, in the embodiments of the present disclosure, the action features of other objects are used as time sequence features, which makes the time sequence features used for action recognition more targeted and helps improve the accuracy of action recognition.
In a possible implementation, the determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information includes:
performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively;
performing a mean pooling operation on the dimensionality-reduced initial time sequence feature information; and
merging the mean-pooled initial time sequence feature information with the dimensionality-reduced action feature information to obtain the time sequence feature information corresponding to the target object.
In the embodiments of the present disclosure, when the time sequence feature information is determined based on the initial time sequence feature information and the action feature information, dimensionality reduction is applied to both, which reduces the amount of data to be processed and helps improve the efficiency of action recognition. In addition, the mean pooling operation on the dimensionality-reduced initial time sequence feature information simplifies the time sequence feature extraction steps and can improve the efficiency of action recognition.
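For concreteness, the three sub-steps above can be sketched as follows in Python; this is an illustrative sketch only, in which the feature dimensions, the use of linear projections as the dimensionality-reduction step, and concatenation as the merge operation are assumptions not fixed by the disclosure.

```python
import torch
import torch.nn as nn

class TemporalFeatureFusion(nn.Module):
    """Illustrative sketch: reduce the dimensionality of the initial time
    sequence features (other objects, many clips) and of the target's action
    feature, mean-pool the former over clips, then merge by concatenation."""

    def __init__(self, in_dim=2048, reduced_dim=512):
        super().__init__()
        # Linear projections serve as the assumed dimensionality-reduction step.
        self.reduce_seq = nn.Linear(in_dim, reduced_dim)
        self.reduce_act = nn.Linear(in_dim, reduced_dim)

    def forward(self, initial_seq_feats, action_feat):
        # initial_seq_feats: (num_clips, in_dim) features of other objects
        # action_feat:       (in_dim,) action feature of the target object
        seq = self.reduce_seq(initial_seq_feats)   # (num_clips, reduced_dim)
        act = self.reduce_act(action_feat)         # (reduced_dim,)
        pooled = seq.mean(dim=0)                   # mean pooling over clips
        return torch.cat([pooled, act], dim=-1)    # merged time sequence feature

fusion = TemporalFeatureFusion()
seq_info = fusion(torch.rand(10, 2048), torch.rand(2048))  # shape (1024,)
```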
In a possible implementation, the determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information further includes:
using the obtained time sequence feature information corresponding to the target object as new initial time sequence feature information, and returning to the step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
In the embodiments of the present disclosure, repeatedly performing the time sequence feature extraction operation of determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information can improve the accuracy of the determined time sequence feature information.
In a second aspect, the present disclosure provides an action recognition device, including:
a video obtaining module, configured to obtain a video clip;
an action feature determination module, configured to determine action feature information of a target object based on an object frame of the target object in a key frame image of the video clip;
a scene and time sequence feature determination module, configured to determine scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information; and
an action recognition module, configured to determine an action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
In a possible implementation, the action feature determination module is further configured to determine the object frame in the key frame image by:
filtering a key frame image from the video clip;
performing object detection on the filtered key frame image to determine an initial object bounding box of the target object in the key frame image; and
expanding the initial object bounding box according to preset expansion size information to obtain the object frame of the target object in the key frame image.
In a possible implementation, when determining the action feature information of the target object based on the object frame of the target object in the key frame image of the video clip, the action feature determination module is configured to:
for the key frame image, filter out multiple associated images corresponding to the key frame image from the video clip;
crop partial images from at least some of the associated images corresponding to the key frame image according to the object frame corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image; and
determine the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
In a possible implementation, when filtering out multiple associated images corresponding to the key frame image from the video clip, the action feature determination module is configured to:
select, from the video clip, a first sub-video clip that includes the key frame image, the first sub-video clip further including N images temporally adjacent to the key frame image, where N is a positive integer; and
filter the multiple associated images from the first sub-video clip.
In a possible implementation, after obtaining the multiple target object images and before determining the action feature information of the target object, the action feature determination module is further configured to:
set the target object images to images with a preset image resolution.
In a possible implementation, when determining the scene feature information and the time sequence feature information corresponding to the target object based on the video clip and the action feature information, the scene and time sequence feature determination module is configured to:
for the key frame image, filter out multiple associated images corresponding to the key frame image from the video clip;
perform a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information;
perform a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information; and
determine the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
In a possible implementation, when performing the time sequence feature extraction operation on objects other than the target object in the video clip to obtain the initial time sequence feature information, the scene and time sequence feature determination module is configured to:
for the key frame image, select, from the video clip, a second sub-video clip that includes the key frame image, the second sub-video clip further including P images temporally adjacent to the key frame image, where P is a positive integer; and
extract action features of objects other than the target object from the images in the second sub-video clip, and use the obtained action features as the initial time sequence feature information.
In a possible implementation, when determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, the scene and time sequence feature determination module is configured to:
perform dimensionality reduction on the initial time sequence feature information and the action feature information respectively;
perform a mean pooling operation on the dimensionality-reduced initial time sequence feature information; and
merge the mean-pooled initial time sequence feature information with the dimensionality-reduced action feature information to obtain the time sequence feature information corresponding to the target object.
In a possible implementation, when determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, the scene and time sequence feature determination module is further configured to:
use the obtained time sequence feature information corresponding to the target object as new initial time sequence feature information, and return to the step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
In a third aspect, the present disclosure provides an electronic device, including a processor and a storage medium connected to each other, the storage medium storing machine-readable instructions executable by the processor. When the electronic device runs, the processor executes the machine-readable instructions to perform the steps of the above action recognition method.
In a fourth aspect, the present disclosure further provides a computer-readable storage medium having a computer program stored thereon, the computer program performing the steps of the above action recognition method when run by a processor.
In a fifth aspect, the present disclosure further provides a computer program, including computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device executes steps for implementing the above action recognition method.
The above device, electronic device, and computer-readable storage medium of the present disclosure contain at least technical features substantially the same as or similar to the technical features of any aspect of the above method of the present disclosure or of any implementation of any aspect. Therefore, for the description of the effects of the above device, electronic device, and computer-readable storage medium, reference may be made to the description of the effects of the above method, which is not repeated here.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present disclosure more clearly, the following briefly introduces the drawings needed in the embodiments. It should be understood that the following drawings show only certain embodiments of the present disclosure and therefore should not be regarded as limiting the scope. For those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.
FIG. 1A shows a flowchart of an action recognition method provided by an embodiment of the present disclosure;
FIG. 1B shows a schematic diagram of a network architecture provided by an embodiment of the present disclosure;
FIG. 2 shows a flowchart of determining action feature information of a target object in another action recognition method provided by an embodiment of the present disclosure;
FIG. 3 shows a flowchart of determining scene feature information and time sequence feature information corresponding to the target object in yet another action recognition method provided by an embodiment of the present disclosure;
FIG. 4 shows a schematic structural diagram of a simplified time sequence feature extraction module in an embodiment of the present disclosure;
FIG. 5 shows a flowchart of still another action recognition method provided by an embodiment of the present disclosure;
FIG. 6 shows a schematic structural diagram of an action recognition device provided by an embodiment of the present disclosure;
FIG. 7 shows a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. It should be understood that the drawings in the present disclosure are for illustration and description only and are not used to limit the protection scope of the present disclosure. In addition, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in the present disclosure show operations implemented according to some embodiments of the present disclosure. It should be understood that the operations of the flowcharts may be implemented out of order, and steps without a logical contextual relationship may be implemented in reverse order or at the same time. In addition, under the guidance of the content of the present disclosure, those skilled in the art may add one or more other operations to a flowchart, or remove one or more operations from a flowchart.
In addition, the described embodiments are only some of the embodiments of the present disclosure, rather than all of them. The components of the embodiments of the present disclosure generally described and illustrated in the drawings herein may be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present disclosure provided in the drawings is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
It should be noted that the term "include" will be used in the embodiments of the present disclosure to indicate the presence of the features declared thereafter, without excluding the addition of other features.
In view of the technical problem of low recognition accuracy in current action recognition, the present disclosure provides an action recognition method and device, an electronic device, and a computer-readable storage medium. The present disclosure uses the object frame corresponding to the target object, rather than the whole image frame, to determine the action feature information, which effectively reduces the amount of data used for action recognition in each image frame, so that the number of images used for action recognition can be increased, which helps improve the accuracy of action recognition. In addition, the present disclosure not only uses the action feature information of the target object for action classification and recognition, but also uses the video clip and the determined action feature information to extract scene feature information of the scene where the target object is located and time sequence feature information associated with the action of the target object. Combining the scene feature information and the time sequence feature information with the action feature information can further improve the accuracy of action recognition.
The action recognition method and device, electronic device, and computer-readable storage medium of the present disclosure are described below through specific embodiments.
An embodiment of the present disclosure provides an action recognition method, which is applied to a hardware device such as a terminal device that performs action recognition; the method may also be implemented by a processor executing a computer program. Specifically, as shown in FIG. 1A, the action recognition method provided by the embodiment of the present disclosure includes the following steps:
S110. Obtain a video clip.
Here, the video clip is a video clip used for action recognition and includes multiple images. The images include a target object on which action recognition needs to be performed, and the target object may be a human, an animal, or the like.
The above video clip may be shot by the terminal device performing action recognition using its own camera or other shooting equipment, or may be shot by another shooting device; after shooting, the other shooting device transfers the video clip to the terminal device performing action recognition.
S120. Determine action feature information of the target object based on an object frame of the target object in a key frame image of the video clip.
Here, the object frame is the bounding box surrounding the target object. Using the image information within the bounding box to determine the action feature information of the target object can reduce the amount of data processed by the terminal device.
Before the action feature information is determined based on the object frame, key frame images first need to be filtered from the video clip, and the object frame of the target object in each key frame image needs to be determined.
In specific implementation, key frame images may be filtered from the video clip at a preset time interval. Of course, other methods may also be used to filter key frame images from the video clip, for example, dividing the video clip into multiple sub-clips and extracting one image frame from each sub-clip as a key frame image. The present disclosure does not limit the method of filtering key frame images from the video clip.
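As a minimal sketch of the two key frame selection strategies just described (the interval and frame-count values below are assumed examples, not mandated by the disclosure):

```python
def keyframes_by_interval(num_frames, interval):
    """Select key frame indices at a preset frame interval."""
    return list(range(0, num_frames, interval))

def keyframes_by_subclips(num_frames, num_subclips):
    """Split the clip into sub-clips and take the middle frame of each."""
    subclip_len = num_frames // num_subclips
    return [i * subclip_len + subclip_len // 2 for i in range(num_subclips)]

# Example: a 256-frame clip.
print(keyframes_by_interval(256, 64))   # [0, 64, 128, 192]
print(keyframes_by_subclips(256, 4))    # [32, 96, 160, 224]
```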
After multiple key frame images are filtered from the video clip, the object frame in each key frame image may be used to determine the action feature information of the target object; of course, the object frames in only some of the filtered key frame images may also be used. When the object frames in some of the key frame images are used to determine the action feature information of the target object, the object frames in those key frame images need to be extracted or determined first, and the extracted or determined frames are then used to determine the action feature information of the target object.
In specific implementation, an object detection method may be used to determine the object frame, for example, using a human body detector to determine the object frame through human body detection. Of course, other methods may also be used to determine the object frame; the present disclosure does not limit the method of determining the object frame.
In specific implementation, the object frame detected by the human body detector may be used as the final object frame for determining the action feature information. However, since the object frame detected by the human body detector may be a small frame that just encloses the target object, in order to obtain more complete information about the target object and more environmental information, after the human body detector detects the object frames, each detected object frame may also be expanded according to preset expansion size information to obtain the final object frame of the target object in each key frame image. The determined final object frame is then used to determine the action feature information of the target object.
The expansion size information for expanding the object frame is set in advance. For example, the expansion size information includes a first extension length of the object frame in the length direction and a second extension length of the object frame in the width direction. The length of the object frame is extended toward both sides according to the first extension length, with each side in the length direction extended by half of the first extension length; the width of the object frame is extended toward both sides according to the second extension length, with each side in the width direction extended by half of the second extension length.
The first extension length and the second extension length may be preset specific values, or values determined based on the length and width of the object frame directly detected by the human body detector. For example, the first extension length may be equal to the length of the object frame directly detected by the human body detector, and the second extension length may be equal to the width of the object frame directly detected by the human body detector.
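A minimal sketch of this expansion rule follows, assuming boxes are given as (x1, y1, x2, y2) pixel coordinates and, per the example above, defaulting each extension length to the detected box's own size (so the box is doubled in each dimension); the clipping to image bounds is an added practical assumption.

```python
def expand_box(box, image_w, image_h, ext_len=None, ext_wid=None):
    """Expand a detected box by half of each extension length on each side,
    clipped to the image bounds. If no extension is given, use the box's
    own height/width as the extension lengths, as in the example above."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    ext_len = h if ext_len is None else ext_len   # length direction (assumed vertical)
    ext_wid = w if ext_wid is None else ext_wid   # width direction
    return (max(0, x1 - ext_wid / 2), max(0, y1 - ext_len / 2),
            min(image_w, x2 + ext_wid / 2), min(image_h, y2 + ext_len / 2))

# Example: a 100x200 detection in a 1280x720 frame.
print(expand_box((300, 100, 400, 300), 1280, 720))  # (250.0, 0.0, 450.0, 400.0)
```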
In the above manner, object detection is used to determine the frame of the target object in the image, which reduces the amount of data that needs to be processed for action recognition. After a smaller initial object bounding box is determined, it is expanded, so that the object frame configured for action recognition can include more complete information about the target object and more environmental information, which helps improve the accuracy of action recognition.
The above action feature information is extracted from the images in the video clip and can characterize the action features of the target object.
S130. Determine, based on the video clip and the action feature information, scene feature information and time sequence feature information corresponding to the target object.
Here, the scene feature information is used to characterize the scene features of the scene where the target object is located, and may be obtained by performing scene feature extraction on at least some of the associated images associated with the key frame image.
The time sequence feature information is feature information temporally associated with the action of the target object; for example, it may be the action feature information of objects other than the target object in the video clip. In specific implementation, it may be determined based on the video clip and the action feature information of the target object.
S140. Determine an action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
After the action feature information, the scene feature information, and the time sequence feature information are determined, the three types of information may be merged, for example, by concatenation; the merged information is then classified to obtain the action type of the target object, thereby realizing action recognition of the target object.
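A minimal sketch of this merge-and-classify step in PyTorch; the feature dimensions, the number of action classes, and the single linear classification head are illustrative assumptions rather than the disclosed design.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Concatenate action, scene, and time sequence features, then classify."""

    def __init__(self, action_dim=2048, scene_dim=2048, seq_dim=1024, num_classes=80):
        super().__init__()
        self.head = nn.Linear(action_dim + scene_dim + seq_dim, num_classes)

    def forward(self, action_feat, scene_feat, seq_feat):
        merged = torch.cat([action_feat, scene_feat, seq_feat], dim=-1)
        return self.head(merged)  # per-class action scores

clf = ActionClassifier()
scores = clf(torch.rand(2048), torch.rand(2048), torch.rand(1024))
action_type = scores.argmax().item()  # index of the recognized action type
```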
In the embodiments of the present disclosure, the object frame corresponding to the target object, rather than the whole image frame, is used to determine the action feature information, which effectively reduces the amount of data used for action recognition in each image frame, so that the number of images used for action recognition can be increased, which helps improve the accuracy of action recognition. In addition, the embodiments of the present disclosure not only use the action feature information of the target object for action classification and recognition, but also use the video clip and the determined action feature information to extract scene feature information of the scene where the target object is located and time sequence feature information associated with the action of the target object. Combining the scene feature information and the time sequence feature information with the action feature information can further improve the accuracy of action recognition.
In the embodiments of the present disclosure, action recognition of the target object may be implemented through the network architecture shown in FIG. 1B. FIG. 1B shows a schematic diagram of a network architecture provided by an embodiment of the present disclosure. The network architecture includes: a user terminal 201, a network 202, and an action recognition terminal device 203. To support an exemplary application, the user terminal 201 and the action recognition terminal device 203 establish a communication connection through the network 202. When the user terminal 201 needs to obtain the action type of a target object, it first sends request information configured to determine the action type to the action recognition terminal device 203 through the network 202. Then, the action recognition terminal device 203 obtains a video clip and uses the object frame corresponding to the target object to determine the action feature information, and uses the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time sequence feature information associated with the action of the target object. Finally, it comprehensively considers the scene feature information and the time sequence feature information to determine the action type of the target object with higher accuracy, and feeds the determined action type back to the user terminal 201.
As an example, the user terminal 201 may include a device with data processing capability, and the action recognition terminal device 203 may include an image acquisition apparatus and a processing device with data processing capability or a remote server. The network 202 may use a wired or wireless connection. When the action recognition terminal device is a processing device with data processing capability, the user terminal 201 may communicate with the processing device through a wired connection, for example, data communication over a bus; when the action recognition terminal device 203 is a remote server, the user terminal may exchange data with the remote server through a wireless network.
In some embodiments, as shown in FIG. 2, the above determining the action feature information of the target object based on the object frame of the target object in the key frame image of the video clip may specifically be implemented through the following steps:
S210. For a key frame image, filter out multiple associated images corresponding to the key frame image from the video clip.
Here, the associated images associated with the key frame image are images whose image features are similar to those of the key frame image, for example, images captured close in time to the key frame image.
In specific implementation, the associated images corresponding to the key frame image may be filtered using the following sub-steps:
Sub-step 1: Select, from the video clip, a first sub-video clip that includes the key frame image; the first sub-video clip further includes N images temporally adjacent to the key frame image, where N is a positive integer.
In the first sub-video clip, the key frame image may be located in the first half of the clip, in the second half of the clip, or, of course, at or near the middle of the clip.
In a possible implementation, a sub-video clip including the key frame image may be cut from the video clip, for example, a 64-frame sub-video clip in which the key frame image is at or near the middle. For example, the sub-video clip includes the 32 frames before the key frame image, the key frame image, and the 31 frames after the key frame image. As another example, the key frame image may be in the first half of the sub-video clip, with the sub-video clip including the 10 frames before the key frame image, the key frame image, and the 53 frames after the key frame image. As yet another example, the key frame image may be in the second half of the sub-video clip, with the sub-video clip including the 50 frames before the key frame image, the key frame image, and the 13 frames after the key frame image.
In addition, in the first sub-video clip, the key frame image may also be located at either end of the clip; that is, the N images temporally adjacent to the key frame image are the N images before or the N images after the key frame image. The present disclosure does not limit the position of the key frame image in the first sub-video clip.
Sub-step 2: Filter the multiple associated images from the first sub-video clip.
In a possible implementation, the associated images may be filtered from the first sub-video clip based on a preset time interval, for example, sparsely sampling the first sub-video clip with a time span τ to obtain T frames of associated images. The filtered associated images may or may not include the key frame image; there is a certain randomness, and the present disclosure does not limit whether the associated images include the key frame image.
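A minimal sketch of this sparse sampling, assuming the 64-frame sub-clip from the example above and illustrative values of τ and T:

```python
def sample_associated_frames(subclip_start, subclip_len, tau, T):
    """Sparsely sample up to T frame indices from a sub-clip, spaced by tau."""
    indices = [subclip_start + i * tau for i in range(T)]
    # Keep only indices that fall inside the sub-clip.
    return [i for i in indices if i < subclip_start + subclip_len]

# Example: a 64-frame sub-clip starting at frame 96, tau = 8, T = 8.
print(sample_associated_frames(96, 64, 8, 8))
# [96, 104, 112, 120, 128, 136, 144, 152]
```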
Based on a predetermined time interval, images associated with the key frame image are filtered from a sub-video clip captured close in time to the key frame image, so that the images most closely associated with the key frame image can be selected; based on these images, the accuracy of the determined action feature information can be improved.
In addition, other methods may also be used to filter the associated images associated with the key frame image. For example, the image similarity between each image frame in the first sub-video clip and the key frame image may be computed first, and then the multiple images with the highest image similarity are selected as the associated images associated with the key frame image.
S220. Crop partial images from at least some of the associated images corresponding to the key frame image, according to the object frame corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image.
Here, the object frame corresponding to the key frame image is used to crop partial images from some or all of the associated images associated with the key frame image. If the target object images are cropped from only some of the associated images, specifically, the associated images captured closest in time to the key frame image may be selected from all associated images for cropping; of course, other methods may also be used to select some of the associated images, for example, selecting some of the associated images from all associated images at a certain time interval.
When the target object images are cropped according to the object frame corresponding to the key frame image, the specific process includes: first, copying the object frame onto all or some of the associated images in chronological order. When copying the object frame onto an associated image, the coordinate information of the object frame on the key frame image is used to reproduce the frame on the associated image, for example, by offsetting the frame position over time according to that coordinate information, or by directly copying the frame position, to obtain the object frame on the associated image. Then, after the object frame is copied, each associated image is cropped according to the object frame to obtain a target object image; that is, the image within the object frame in the associated image is cropped out as the target object image.
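A minimal sketch of the direct-copy variant, in which the key frame's box coordinates are copied unchanged onto each associated frame before cropping (the NumPy (H, W, 3) frame layout is an assumption; the time-offset variant is not shown):

```python
import numpy as np

def crop_target_object_images(frames, box):
    """Copy the key frame's object frame onto each associated frame and crop.
    frames: list of (H, W, 3) arrays; box: (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = [int(v) for v in box]
    return [frame[y1:y2, x1:x2] for frame in frames]

# Example: 8 associated frames of size 720x1280, box copied from the key frame.
frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(8)]
crops = crop_target_object_images(frames, (250, 0, 450, 400))
print(crops[0].shape)  # (400, 200, 3)
```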
The role of the key frame image is to locate the target object image; it is not necessarily used to directly determine the action feature information. For example, when the associated images do not include the key frame image, no target object image used to determine the action feature information is cropped from the key frame image.
S230. Determine the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
After the above target object images are cropped, action feature extraction may be performed on the multiple target object images. Specifically, a three-dimensional (3D) convolutional neural network may be used to process the target object images and extract the action features in them, obtaining the action feature information of the target object.
In addition, in the embodiments of the present disclosure, after the multiple target object images are obtained and before the action feature information of the target object is determined, the target object images may also be processed using the following step:
setting the target object images to images with a preset image resolution. The preset image resolution is higher than the original image resolution of the target object images. In specific implementation, existing methods or tools may be used to set the image resolution of the target object images; for example, interpolation may be used to adjust the image resolution.
Here, after the target object images are cropped, setting them to the preset resolution can increase the amount of information included in each target object image; that is, the cropped target object images can be enlarged, retaining more fine-grained details of the target object, thereby improving the accuracy of the determined action feature information.
In specific implementation, the preset image resolution may be set to H×W, with T target object images cropped for each key frame image and 3 channels per target object image; the input to the 3D convolutional neural network for action feature extraction is then a T×H×W×3 image block. After the 3D convolutional neural network performs global average pooling on the input image block, a 2048-dimensional feature vector is obtained, which is the above action feature information.
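The resize-and-extract step can be sketched as follows in PyTorch. The shallow stand-in network below is not the disclosed 3D convolutional neural network; only the T×H×W×3 input layout and the 2048-dimensional globally pooled output follow the description above, and H = W = 224 and T = 8 are assumed example values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionFeatureExtractor(nn.Module):
    """Stand-in 3D CNN: a real system would use a deep 3D backbone."""

    def __init__(self, out_dim=2048):
        super().__init__()
        self.conv = nn.Conv3d(3, 64, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)   # global average pooling
        self.proj = nn.Linear(64, out_dim)    # to the 2048-d feature vector

    def forward(self, clip):
        # clip: (T, H, W, 3) -> (1, 3, T, H, W) as Conv3d expects
        x = clip.permute(3, 0, 1, 2).unsqueeze(0)
        x = self.pool(self.conv(x)).flatten(1)  # (1, 64)
        return self.proj(x)                     # (1, 2048) action feature

# Upsample the T cropped images to the preset resolution H x W, then extract.
T, H, W = 8, 224, 224
crops = torch.rand(T, 3, 400, 200)              # cropped target object images
resized = F.interpolate(crops, size=(H, W), mode="bilinear", align_corners=False)
clip = resized.permute(0, 2, 3, 1)              # (T, H, W, 3) image block
feat = ActionFeatureExtractor()(clip)
print(feat.shape)  # torch.Size([1, 2048])
```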
In the embodiments of the present disclosure, the object frame of the target object in the key frame image is used for positioning, and the target object images used to determine the action feature information are cropped from multiple associated images associated with the key frame image, which improves the precision of the images used to determine the action feature information and can increase their number, thereby improving the accuracy of action recognition.
In some embodiments, as shown in FIG. 3, the above determining the scene feature information and the time sequence feature information corresponding to the target object based on the video clip and the action feature information includes:
S310. For a key frame image, filter out multiple associated images corresponding to the key frame image from the video clip, and perform a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information.
Here, specifically, a 3D convolutional neural network may be used to perform video scene feature extraction and global average pooling on some or all of the associated images to obtain a 2048-dimensional feature vector, which is the above scene feature information.
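Continuing the sketch above under the same assumptions (and reusing the ActionFeatureExtractor stand-in defined there), the scene feature is obtained in the same way but from the full associated frames rather than the cropped object regions; in practice a separately trained backbone would be used:

```python
# Whole associated frames (not cropped to the object frame) carry the scene.
scene_frames = torch.rand(8, 224, 224, 3)   # (T, H, W, 3), assumed sizes
scene_extractor = ActionFeatureExtractor()  # separate weights in practice
scene_feat = scene_extractor(scene_frames)  # (1, 2048) scene feature vector
```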
S320. Perform a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information.
Here, the initial time sequence feature information consists of the time sequence features of objects other than the target object, for example, the action features of other objects. In specific implementation, it may be determined through the following sub-steps:
Sub-step 1: For the key frame image, select, from the video clip, a second sub-video clip that includes the key frame image; the second sub-video clip further includes P images temporally adjacent to the key frame image, where P is a positive integer.
In the second sub-video clip, the key frame image may be located in the first half of the clip, in the second half of the clip, or, of course, at or near the middle of the clip.
In addition, in the second sub-video clip, the key frame image may also be located at either end of the clip; that is, the P images temporally adjacent to the key frame image are the P images before or the P images after the key frame image. The present disclosure does not limit the position of the key frame image in the second sub-video clip.
一种可能的实现方式中,从视频片段中截取一段包括关键帧图像的子视频片段,例如可以截取一段2秒钟的子视频片段,该子视频的时间较长用于确定一个长时序的时序特征。In a possible implementation manner, a sub-video segment including a key frame image can be intercepted from a video segment. For example, a 2-second sub-video segment can be intercepted. The sub-video has a longer time and is used to determine the timing of a long time sequence. feature.
子步骤二、提取所述第二子视频片段中的每张图像中,除所述目标对象以外的其他对象的动作特征,并将得到动作特征作为所述初始时序特征信息。The second sub-step is to extract the action features of objects other than the target object in each image in the second sub-video segment, and use the obtained action features as the initial time sequence feature information.
Here, a 3D convolutional neural network may be used to extract the action features of objects other than the target object in the sub-video clip, and the obtained initial time sequence feature information may be stored and used in the form of a long-term feature bank (LFB).
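A minimal sketch of such a feature bank follows. The dictionary-keyed-by-frame layout and the 2048-dimensional per-object features are assumptions for illustration; the disclosure specifies only that the extracted features are stored and used as a long-term feature bank.

    import torch

    class LongTermFeatureBank:
        def __init__(self):
            self.bank = {}                              # frame index -> (K, 2048) features

        def add(self, frame_idx: int, other_object_feats: torch.Tensor) -> None:
            # other_object_feats: (K, 2048) action features of the K non-target objects
            self.bank[frame_idx] = other_object_feats

        def window(self, center: int, p: int) -> torch.Tensor:
            # gather the features stored for the key frame and its P temporal neighbours
            feats = [self.bank[i] for i in range(center - p, center + p + 1) if i in self.bank]
            return torch.cat(feats, dim=0)              # initial time sequence feature information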
In the embodiments of the present disclosure, a sub-video clip whose capture time is close to that of the key frame image is selected from the video clip for time sequence feature extraction. This reduces the data volume of the extracted time sequence features and strengthens their association with the key frame image, which helps improve the accuracy of action recognition. In addition, in the embodiments of the present disclosure, the action features of other objects are used as the time sequence features, which makes the time sequence features used for action recognition more targeted and likewise helps improve the accuracy of action recognition.
S330: Determine the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
Here, time sequence feature extraction may be performed on the initial time sequence feature information and the action feature information to obtain the time sequence feature information corresponding to the target object.
In one possible implementation, the following sub-steps may be used to perform this extraction:
Sub-step 1: Perform dimensionality reduction on the initial time sequence feature information and the action feature information, respectively.
After the initial time sequence feature information of objects other than the target object and the action feature information of the target object are obtained, applying dimensionality reduction to both reduces the amount of data to be processed, which helps improve the efficiency of action recognition.
In one possible implementation, after the initial time sequence feature information and the action feature information are obtained, dropout may additionally be applied to them. The dropout may be implemented at the last network layer of the neural network used to extract the initial time sequence feature information and the action feature information, or at each of its network layers.
Sub-step 2: Perform a mean pooling operation on the dimensionality-reduced initial time sequence feature information.
Sub-step 3: Merge the mean-pooled initial time sequence feature information with the dimensionality-reduced action feature information to obtain the time sequence feature information corresponding to the target object. The merge operation may specifically be channel concatenation, in which the channels of one piece of feature information are appended to those of the other; it may also be an addition operation, in which the mean-pooled initial time sequence feature information and the dimensionality-reduced action feature information are added element-wise.
Sub-steps 2 and 3 essentially perform a time sequence feature extraction operation on the initial time sequence feature information and the action feature information, which may be implemented by the simplified time sequence feature extraction module shown in FIG. 4. The simplified module 400 of FIG. 4 is used to extract the above time sequence feature information and may specifically include a linear layer 401, an average pooling layer 402, a layer normalization and activation (LN+ReLU) layer 403, and a dropout layer 404. In sub-step 2 above, the time sequence feature extraction operation is simplified: only the average pooling layer is used to mean-pool the dimensionality-reduced initial time sequence feature information, and no softmax normalization is performed. This simplifies the operation steps of time sequence feature extraction, i.e., simplifies the existing time sequence feature extraction module, and improves the efficiency of action recognition. The existing time sequence feature extraction module includes a classification softmax layer instead of an average pooling layer, and the processing performed by the softmax layer is more complex than the average pooling operation. In addition, the existing module includes a further linear layer before the dropout layer; the simplified module of the present disclosure omits that linear layer, which further improves the efficiency of action recognition.
In a specific implementation, the time sequence feature information output by the time sequence feature extraction module may be a 512-dimensional feature vector; this 512-dimensional vector is the time sequence feature information of the target object.
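The following is a sketch of the simplified module of FIG. 4 (linear dimensionality reduction, average pooling, LN+ReLU, dropout). PyTorch is assumed; the 2048-dimensional inputs, the 512-dimensional reduced width, the dropout rate, and the choice of the addition variant of the merge operation (which keeps the output at 512 dimensions) are illustrative assumptions consistent with the dimensions described above.

    import torch
    import torch.nn as nn

    class SimplifiedTemporalModule(nn.Module):
        def __init__(self, bank_dim: int = 2048, action_dim: int = 2048,
                     dim: int = 512, p_drop: float = 0.2):
            super().__init__()
            self.reduce_bank = nn.Linear(bank_dim, dim)       # linear layer 401: dim reduction
            self.reduce_action = nn.Linear(action_dim, dim)
            self.norm_act = nn.Sequential(nn.LayerNorm(dim),  # LN+ReLU layer 403
                                          nn.ReLU(inplace=True))
            self.dropout = nn.Dropout(p_drop)                 # dropout layer 404

        def forward(self, bank_feats: torch.Tensor, action_feat: torch.Tensor) -> torch.Tensor:
            # bank_feats: (K, bank_dim) initial time sequence features from the feature bank
            # action_feat: (action_dim,) action feature information of the target object
            bank = self.reduce_bank(bank_feats).mean(dim=0)   # average pooling 402; no softmax
            act = self.reduce_action(action_feat)
            merged = bank + act                               # addition variant of the merge
            return self.dropout(self.norm_act(merged))        # (dim,) time sequence feature info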
In the embodiments of the present disclosure, extracting scene features from some or all of the associated images associated with the key frame image yields relatively complete scene feature information, which in turn improves the accuracy of action recognition. In addition, the embodiments of the present disclosure extract the time sequence features of objects other than the target object, i.e., the above initial time sequence feature information, and determine the time sequence feature information associated with the target object based on these features together with the action feature information of the target object; using this time sequence feature information associated with the target object further improves the accuracy of action recognition.
To further improve the accuracy of the extracted time sequence feature information, multiple time sequence feature extraction modules may be connected in series, the time sequence feature information output by one module serving as the input of the next. Specifically, the time sequence feature information corresponding to the target object extracted by the previous module may be used as the new initial time sequence feature information, and the process returns to the above step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
In a specific implementation, three simplified time sequence feature extraction modules may be connected in series to determine the final time sequence feature information, as in the sketch below.
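Continuing the sketch above, a cascade of three such modules could look as follows. The layer sizing of the second and third modules (512-dimensional bank inputs) is an assumption: the disclosure specifies only that the output of one module becomes the new initial time sequence feature information of the next, while the action feature information is reused at each stage.

    modules = [
        SimplifiedTemporalModule(bank_dim=2048, action_dim=2048, dim=512),
        SimplifiedTemporalModule(bank_dim=512, action_dim=2048, dim=512),
        SimplifiedTemporalModule(bank_dim=512, action_dim=2048, dim=512),
    ]

    def cascade(bank_feats: torch.Tensor, action_feat: torch.Tensor) -> torch.Tensor:
        feats = bank_feats                        # (K, 2048) from the long-term feature bank
        out = None
        for module in modules:
            out = module(feats, action_feat)      # (512,)
            feats = out.unsqueeze(0)              # becomes the new initial features
        return out                                # final time sequence feature information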
The action recognition method of the present disclosure is described below through a specific embodiment.
As shown in FIG. 5, the embodiment of the present disclosure takes a person as the target object for action recognition. Specifically, the action recognition method of this embodiment may include:
Step 1: Obtain a video clip 501, and filter key frame images from the video clip.
Step 2: Use the human body detector 502 to locate the person in each key frame image, obtaining the initial object bounding box of the person, i.e., of the target object.
Step 3: Expand the initial object bounding box according to the preset expansion size information to obtain the final object border; then use the object border to crop partial images from the associated images associated with the key frame image, obtaining the target object images corresponding to each key frame image.
Step 4: Input the target object images corresponding to all key frame images into the 3D convolutional neural network 503, and use it to extract the action features of the target object, obtaining the action feature information corresponding to the target object.
Step 5: Input the associated images associated with the key frame images into the 3D convolutional neural network 503, and use it to extract the video scene features of the scene in which the target object is located, obtaining the scene feature information.
Step 6: Use another 3D convolutional neural network 503 to perform time sequence feature extraction on the video clip, i.e., extract the action features of objects other than the target object, obtaining the initial time sequence feature information, which may exist in the form of a time sequence feature bank. Here, the time sequence features may be extracted either from the entire video clip or from a longer sub-video clip of it that includes the key frame image.
Step 7: Use the simplified time sequence feature extraction module 504 to perform a time sequence feature extraction operation on the initial time sequence feature information and the action feature information, obtaining the time sequence feature information corresponding to the target object.
Step 8: Concatenate the time sequence feature information, the action feature information, and the scene feature information, and use the action classifier 505 to classify the concatenated information, obtaining the action type of the target object; a sketch of this step follows the list.
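A sketch of this final fusion and classification step follows. The single linear classifier and the feature widths (512 + 2048 + 2048) are assumptions for illustration; the disclosure specifies concatenation of the three kinds of feature information followed by an action classifier.

    import torch
    import torch.nn as nn

    class ActionClassifier(nn.Module):
        def __init__(self, num_actions: int, dims=(512, 2048, 2048)):
            super().__init__()
            self.fc = nn.Linear(sum(dims), num_actions)

        def forward(self, temporal_feat, action_feat, scene_feat):
            # concatenate time sequence, action, and scene feature information
            fused = torch.cat([temporal_feat, action_feat, scene_feat], dim=-1)
            return self.fc(fused)                 # scores over the candidate action types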
Corresponding to the above action recognition method, the present disclosure further provides an action recognition apparatus. The apparatus is applied to hardware devices, such as terminal devices, that perform action recognition on a target object, and each of its modules can implement the same method steps and achieve the same beneficial effects as the above method; the parts that are the same are therefore not repeated in this disclosure.
Specifically, as shown in FIG. 6, the action recognition apparatus provided by the present disclosure may include:
a video acquisition module 610, configured to acquire a video clip;
an action feature determination module 620, configured to determine the action feature information of the target object based on the object border of the target object in a key frame image of the video clip;
a scene and time sequence feature determination module 630, configured to determine the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information;
an action recognition module 640, configured to determine the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
In some embodiments, the action feature determination module 620 is further configured to determine the object border in the key frame image by:
filtering key frame images from the video clip;
performing object detection on the filtered key frame images to determine the initial object bounding box of the target object in the key frame image;
expanding the initial object bounding box according to preset expansion size information to obtain the object border of the target object in the key frame image.
In some embodiments, when determining the action feature information of the target object based on the object border of the target object in the key frame image of the video clip, the action feature determination module 620 is configured to:
filter out, for the key frame image, multiple associated images corresponding to the key frame image from the video clip;
crop, according to the object border corresponding to the key frame image, partial images from at least some of the associated images corresponding to the key frame image, obtaining multiple target object images corresponding to the key frame image;
determine the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
In some embodiments, when filtering out the multiple associated images corresponding to a key frame image from the video clip, the action feature determination module 620 is configured to:
select from the video clip a first sub-video clip that includes the key frame image, the first sub-video clip further including N images temporally adjacent to the key frame image, where N is a positive integer;
filter the multiple associated images from the first sub-video clip.
In some embodiments, after the multiple target object images are obtained and before the action feature information of the target object is determined, the action feature determination module 620 is further configured to:
set each target object image to an image with a preset image resolution.
In some embodiments, when determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information, the scene and time sequence feature determination module 630 is configured to:
filter out, for the key frame image, multiple associated images corresponding to the key frame image from the video clip;
perform a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information;
perform a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information;
determine the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
In some embodiments, when performing the time sequence feature extraction operation on objects other than the target object in the video clip to obtain the initial time sequence feature information, the scene and time sequence feature determination module 630 is configured to:
select from the video clip, for the key frame image, a second sub-video clip that includes the key frame image, the second sub-video clip further including P images temporally adjacent to the key frame image, where P is a positive integer;
extract, from the images in the second sub-video clip, the action features of objects other than the target object, and use the obtained action features as the initial time sequence feature information.
In some embodiments, when determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, the scene and time sequence feature determination module 630 is configured to:
perform dimensionality reduction on the initial time sequence feature information and the action feature information, respectively;
perform a mean pooling operation on the dimensionality-reduced initial time sequence feature information;
merge the mean-pooled initial time sequence feature information with the dimensionality-reduced action feature information to obtain the time sequence feature information corresponding to the target object.
In some embodiments, when determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, the scene and time sequence feature determination module 630 is further configured to:
use the obtained time sequence feature information corresponding to the target object as new initial time sequence feature information, and return to the step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
An embodiment of the present disclosure discloses an electronic device. As shown in FIG. 7, it includes a processor 701 and a storage medium 702 connected to each other, the storage medium storing machine-readable instructions executable by the processor. When the electronic device runs, the processor executes the machine-readable instructions to perform the steps of the above action recognition method. Specifically, the processor 701 and the storage medium 702 may be connected through a bus 703.
When the machine-readable instructions are executed by the processor 701, the following steps of the action recognition method are performed:
obtaining a video clip;
determining the action feature information of the target object based on the object border of the target object in a key frame image of the video clip;
determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information;
determining the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
In addition, when executed by the processor 701, the machine-readable instructions may also carry out the method content of any of the embodiments described in the method section above, which is not repeated here.
An embodiment of the present disclosure further provides a computer program product corresponding to the above method and apparatus, comprising a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the method in the foregoing method embodiments. For the specific implementation, refer to the method embodiments, which are not repeated here. The computer-readable storage medium may be a volatile or non-volatile storage medium.
The above descriptions of the various embodiments tend to emphasize the differences between them; for their common or similar aspects, the embodiments may be referred to one another, and for brevity these are not repeated herein.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system and apparatus described above, reference may be made to the corresponding processes in the method embodiments, which are not repeated in this disclosure. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into modules is only a division by logical function, and there may be other ways of division in actual implementation; as another example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or modules, and may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present disclosure.
In addition, the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any person skilled in the art could readily conceive of changes or substitutions within the technical scope disclosed by the present disclosure, and all such changes or substitutions shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Industrial applicability
The present disclosure provides an action recognition method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: obtaining a video clip; determining the action feature information of a target object based on the object border of the target object in a key frame image of the video clip; determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information; and determining the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.

Claims (21)

  1. An action recognition method, comprising:
    obtaining a video clip;
    determining action feature information of a target object based on an object border of the target object in a key frame image of the video clip;
    determining scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information;
    determining an action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
  2. The action recognition method according to claim 1, further comprising the step of determining the object border in the key frame image:
    filtering a key frame image from the video clip;
    performing object detection on the filtered key frame image to determine an initial object bounding box of the target object in the key frame image;
    expanding the initial object bounding box according to preset expansion size information to obtain the object border of the target object in the key frame image.
  3. The action recognition method according to claim 1 or 2, wherein determining the action feature information of the target object based on the object border of the target object in the key frame image of the video clip comprises:
    filtering out, for the key frame image, multiple associated images corresponding to the key frame image from the video clip;
    cropping, according to the object border corresponding to the key frame image, partial images from at least some of the associated images corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image;
    determining the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
  4. The action recognition method according to claim 3, wherein filtering out the multiple associated images corresponding to the key frame image from the video clip comprises:
    selecting from the video clip a first sub-video clip that includes the key frame image, the first sub-video clip further including N images temporally adjacent to the key frame image, where N is a positive integer;
    filtering the multiple associated images from the first sub-video clip.
  5. The action recognition method according to claim 3, wherein after the multiple target object images are obtained and before the action feature information of the target object is determined, the method further comprises:
    setting each target object image to an image with a preset image resolution.
  6. The action recognition method according to any one of claims 1 to 5, wherein determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information comprises:
    filtering out, for the key frame image, multiple associated images corresponding to the key frame image from the video clip;
    performing a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information;
    performing a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information;
    determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
  7. The action recognition method according to claim 6, wherein performing the time sequence feature extraction operation on objects other than the target object in the video clip to obtain the initial time sequence feature information comprises:
    selecting from the video clip, for the key frame image, a second sub-video clip that includes the key frame image, the second sub-video clip further including P images temporally adjacent to the key frame image, where P is a positive integer;
    extracting, from the images in the second sub-video clip, action features of objects other than the target object, and using the obtained action features as the initial time sequence feature information.
  8. The action recognition method according to claim 6 or 7, wherein determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information comprises:
    performing dimensionality reduction on the initial time sequence feature information and the action feature information, respectively;
    performing a mean pooling operation on the dimensionality-reduced initial time sequence feature information;
    merging the mean-pooled initial time sequence feature information with the dimensionality-reduced action feature information to obtain the time sequence feature information corresponding to the target object.
  9. The action recognition method according to claim 8, wherein determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information further comprises:
    using the obtained time sequence feature information corresponding to the target object as new initial time sequence feature information, and returning to the step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
  10. An action recognition apparatus, comprising:
    a video acquisition module, configured to acquire a video clip;
    an action feature determination module, configured to determine action feature information of a target object based on an object border of the target object in a key frame image of the video clip;
    a scene and time sequence feature determination module, configured to determine scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information;
    an action recognition module, configured to determine an action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
  11. The action recognition apparatus according to claim 10, wherein the action feature determination module is further configured to determine the object border in the key frame image by:
    filtering a key frame image from the video clip;
    performing object detection on the filtered key frame image to determine an initial object bounding box of the target object in the key frame image;
    expanding the initial object bounding box according to preset expansion size information to obtain the object border of the target object in the key frame image.
  12. The action recognition apparatus according to claim 10 or 11, wherein, when determining the action feature information of the target object based on the object border of the target object in the key frame image of the video clip, the action feature determination module is configured to:
    filter out, for the key frame image, multiple associated images corresponding to the key frame image from the video clip;
    crop, according to the object border corresponding to the key frame image, partial images from at least some of the associated images corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image;
    determine the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
  13. The action recognition apparatus according to claim 12, wherein, when filtering out the multiple associated images corresponding to the key frame image from the video clip, the action feature determination module is configured to:
    select from the video clip a first sub-video clip that includes the key frame image, the first sub-video clip further including N images temporally adjacent to the key frame image, where N is a positive integer;
    filter the multiple associated images from the first sub-video clip.
  14. The action recognition apparatus according to claim 12, wherein, after the multiple target object images are obtained and before the action feature information of the target object is determined, the action feature determination module is further configured to:
    set each target object image to an image with a preset image resolution.
  15. The action recognition apparatus according to any one of claims 10 to 14, wherein, when determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information, the scene and time sequence feature determination module is configured to:
    filter out, for the key frame image, multiple associated images corresponding to the key frame image from the video clip;
    perform a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information;
    perform a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information;
    determine the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
  16. The action recognition apparatus according to claim 15, wherein, when performing the time sequence feature extraction operation on objects other than the target object in the video clip to obtain the initial time sequence feature information, the scene and time sequence feature determination module is configured to:
    select from the video clip, for the key frame image, a second sub-video clip that includes the key frame image, the second sub-video clip further including P images temporally adjacent to the key frame image, where P is a positive integer;
    extract, from the images in the second sub-video clip, action features of objects other than the target object, and use the obtained action features as the initial time sequence feature information.
  17. The action recognition apparatus according to claim 15 or 16, wherein, when determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, the scene and time sequence feature determination module is configured to:
    perform dimensionality reduction on the initial time sequence feature information and the action feature information, respectively;
    perform a mean pooling operation on the dimensionality-reduced initial time sequence feature information;
    merge the mean-pooled initial time sequence feature information with the dimensionality-reduced action feature information to obtain the time sequence feature information corresponding to the target object.
  18. The action recognition apparatus according to claim 17, wherein, when determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, the scene and time sequence feature determination module is further configured to:
    use the obtained time sequence feature information corresponding to the target object as new initial time sequence feature information, and return to the step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
  19. An electronic device, comprising a processor and a storage medium connected to each other, the storage medium storing machine-readable instructions executable by the processor, wherein, when the electronic device runs, the processor executes the machine-readable instructions to perform the action recognition method according to any one of claims 1 to 9.
  20. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is run by a processor, the action recognition method according to any one of claims 1 to 9 is performed.
  21. A computer program, comprising computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device executes instructions for implementing the method according to any one of claims 1 to 9.
PCT/CN2021/077268 2020-03-11 2021-02-22 Action recognition method and apparatus, electronic device, and computer-readable storage medium WO2021179898A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021562324A JP2022529299A (en) 2020-03-11 2021-02-22 Operation identification methods and devices, electronic devices, computer readable storage media
KR1020217036106A KR20210145271A (en) 2020-03-11 2021-02-22 Motion recognition method and apparatus, electronic device, computer readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010166148.8 2020-03-11
CN202010166148.8A CN111401205B (en) 2020-03-11 2020-03-11 Action recognition method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2021179898A1 true WO2021179898A1 (en) 2021-09-16

Family

ID=71432295

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/077268 WO2021179898A1 (en) 2020-03-11 2021-02-22 Action recognition method and apparatus, electronic device, and computer-readable storage medium

Country Status (5)

Country Link
JP (1) JP2022529299A (en)
KR (1) KR20210145271A (en)
CN (1) CN111401205B (en)
TW (1) TW202135002A (en)
WO (1) WO2021179898A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401205B (en) * 2020-03-11 2022-09-23 深圳市商汤科技有限公司 Action recognition method and device, electronic equipment and computer readable storage medium
US11270147B1 (en) 2020-10-05 2022-03-08 International Business Machines Corporation Action-object recognition in cluttered video scenes using text
CN112800278B (en) * 2021-03-30 2021-07-09 腾讯科技(深圳)有限公司 Video type determination method and device and electronic equipment
US11423252B1 (en) 2021-04-29 2022-08-23 International Business Machines Corporation Object dataset creation or modification using labeled action-object videos
CN114120180B (en) * 2021-11-12 2023-07-21 北京百度网讯科技有限公司 Time sequence nomination generation method, device, equipment and medium
TWI797014B (en) * 2022-05-16 2023-03-21 國立虎尾科技大學 Table tennis pose classifying method and table tennis interaction system
CN116824641B (en) * 2023-08-29 2024-01-09 卡奥斯工业智能研究院(青岛)有限公司 Gesture classification method, device, equipment and computer storage medium

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101334845B (en) * 2007-06-27 2010-12-22 中国科学院自动化研究所 Video frequency behaviors recognition method based on track sequence analysis and rule induction
CN101236656B (en) * 2008-02-29 2011-06-15 上海华平信息技术股份有限公司 Movement target detection method based on block-dividing image
CN101826155B (en) * 2010-04-02 2012-07-25 浙江大学 Method for identifying act of shooting based on Haar characteristic and dynamic time sequence matching
US8855369B2 (en) * 2012-06-22 2014-10-07 Microsoft Corporation Self learning face recognition using depth based tracking for database generation and update
JP6393495B2 (en) * 2014-03-20 2018-09-19 日本ユニシス株式会社 Image processing apparatus and object recognition method
EP3107069A4 (en) * 2014-03-24 2017-10-04 Hitachi, Ltd. Object detection apparatus, object detection method, and mobile robot
JP6128501B2 (en) * 2016-03-17 2017-05-17 ヤフー株式会社 Time-series data analysis device, time-series data analysis method, and program
JP2017187850A (en) * 2016-04-01 2017-10-12 株式会社リコー Image processing system, information processing device, and program
US10997421B2 (en) * 2017-03-30 2021-05-04 Hrl Laboratories, Llc Neuromorphic system for real-time visual activity recognition
JP6773061B2 (en) * 2018-02-16 2020-10-21 新東工業株式会社 Evaluation system, evaluation device, evaluation method, evaluation program, and recording medium
CN108537829B (en) * 2018-03-28 2021-04-13 哈尔滨工业大学 Monitoring video personnel state identification method
CN110147711B (en) * 2019-02-27 2023-11-14 腾讯科技(深圳)有限公司 Video scene recognition method and device, storage medium and electronic device
CN110414335A (en) * 2019-06-20 2019-11-05 北京奇艺世纪科技有限公司 Video frequency identifying method, device and computer readable storage medium
CN110826447A (en) * 2019-10-29 2020-02-21 北京工商大学 Restaurant kitchen staff behavior identification method based on attention mechanism

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105183758A (en) * 2015-07-22 2015-12-23 深圳市万姓宗祠网络科技股份有限公司 Content recognition method for continuously recorded video or image
CN109492581A (en) * 2018-11-09 2019-03-19 中国石油大学(华东) A kind of human motion recognition method based on TP-STG frame
CN109800689A (en) * 2019-01-04 2019-05-24 西南交通大学 A kind of method for tracking target based on space-time characteristic fusion study
CN110427807A (en) * 2019-06-21 2019-11-08 诸暨思阔信息科技有限公司 A kind of temporal events motion detection method
CN110309784A (en) * 2019-07-02 2019-10-08 北京百度网讯科技有限公司 Action recognition processing method, device, equipment and storage medium
CN111401205A (en) * 2020-03-11 2020-07-10 深圳市商汤科技有限公司 Action recognition method and device, electronic equipment and computer readable storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114268849A (en) * 2022-01-29 2022-04-01 北京卡路里信息技术有限公司 Video processing method and device
CN115229804A (en) * 2022-09-21 2022-10-25 荣耀终端有限公司 Method and device for attaching component
CN115229804B (en) * 2022-09-21 2023-02-17 荣耀终端有限公司 Method and device for attaching component
WO2024082943A1 (en) * 2022-10-20 2024-04-25 腾讯科技(深圳)有限公司 Video detection method and apparatus, storage medium, and electronic device
CN117711014A (en) * 2023-07-28 2024-03-15 荣耀终端有限公司 Method and device for identifying space-apart gestures, electronic equipment and readable storage medium
CN117711014B (en) * 2023-07-28 2024-08-27 荣耀终端有限公司 Method and device for identifying space-apart gestures, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
KR20210145271A (en) 2021-12-01
TW202135002A (en) 2021-09-16
CN111401205B (en) 2022-09-23
JP2022529299A (en) 2022-06-20
CN111401205A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
WO2021179898A1 (en) Action recognition method and apparatus, electronic device, and computer-readable storage medium
CN110569721B (en) Recognition model training method, image recognition method, device, equipment and medium
CN105243395B (en) A kind of human body image comparison method and device
WO2016127478A1 (en) Image processing method and device, and terminal
WO2016187888A1 (en) Keyword notification method and device based on character recognition, and computer program product
KR102087882B1 (en) Device and method for media stream recognition based on visual image matching
US9213898B2 (en) Object detection and extraction from image sequences
US11871125B2 (en) Method of processing a series of events received asynchronously from an array of pixels of an event-based light sensor
JP7419080B2 (en) computer systems and programs
JP2018170003A (en) Detection device and method for event in video, and image processor
KR101821989B1 (en) Method of providing detection of moving objects in the CCTV video data by reconstructive video processing
US10296539B2 (en) Image extraction system, image extraction method, image extraction program, and recording medium storing program
US20210084198A1 (en) Method and apparatus for removing video jitter
US20160110909A1 (en) Method and apparatus for creating texture map and method of creating database
CN108921150B (en) Face recognition system based on network hard disk video recorder
JP2010257267A (en) Device, method and program for detecting object area
US9392146B2 (en) Apparatus and method for extracting object
JPWO2018179119A1 (en) Video analysis device, video analysis method, and program
CN111860559A (en) Image processing method, image processing device, electronic equipment and storage medium
KR20210067824A (en) Method for single image dehazing based on deep learning, recording medium and device for performing the method
CN112232113B (en) Person identification method, person identification device, storage medium, and electronic apparatus
US10339660B2 (en) Video fingerprint system and method thereof
CN103268606B (en) A kind of depth information compensation method of motion blur image and device
CN109492755B (en) Image processing method, image processing apparatus, and computer-readable storage medium
KR101826463B1 (en) Method and apparatus for synchronizing time line of videos

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021562324

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21768369

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20217036106

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.01.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21768369

Country of ref document: EP

Kind code of ref document: A1