WO2021179898A1 - Action recognition method and apparatus, electronic device, and computer-readable storage medium - Google Patents

Action recognition method and apparatus, electronic device, and computer-readable storage medium Download PDF

Info

Publication number
WO2021179898A1
WO2021179898A1 (PCT/CN2021/077268)
Authority
WO
WIPO (PCT)
Prior art keywords
feature information
target object
action
key frame
frame image
Prior art date
Application number
PCT/CN2021/077268
Other languages
French (fr)
Chinese (zh)
Inventor
吴建超
段佳琦
旷章辉
张伟
Original Assignee
深圳市商汤科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市商汤科技有限公司
Priority to JP2021562324A priority Critical patent/JP2022529299A/en
Priority to KR1020217036106A priority patent/KR20210145271A/en
Publication of WO2021179898A1 publication Critical patent/WO2021179898A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Definitions

  • the present disclosure relates to the fields of computer technology and image processing, and in particular, to an action recognition method and device, electronic equipment, and computer-readable storage media.
  • Motion detection and recognition are widely used in robotics, safety and health and other fields.
  • In the related art, when performing action recognition, factors such as the limited data processing capability of the recognition device and the single type of data used for the recognition lead to low action recognition accuracy.
  • the present disclosure at least provides an action recognition method and device, electronic equipment, and computer-readable storage medium.
  • an action recognition method including:
  • In the embodiments of the present disclosure, the object frame corresponding to the target object is used to determine the action feature information, instead of the entire frame of the image. This effectively reduces the amount of data used for action recognition in each frame, which in turn allows more images to be used for action recognition and helps improve its accuracy. In addition, this aspect not only uses the action feature information of the target object for action classification and recognition, but also uses the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time sequence feature information associated with the target object's actions. On the basis of the action feature information, combining the scene feature information and the time sequence feature information can further improve the accuracy of action recognition.
  • the above-mentioned action recognition method further includes the step of determining an object frame in the key frame image:
  • According to preset expansion size information, the initial object bounding box is expanded to obtain the object frame of the target object in the key frame image.
  • In the embodiments of the present disclosure, an object detection method is used to determine the bounding box of the target object in the image, which reduces the amount of data that needs to be processed for action recognition. After a smaller initial object bounding box is determined, it is expanded, so that the object frame used for action recognition includes more complete target object information, more environmental information, and more spatial detail, thereby helping to improve the accuracy of action recognition.
  • the determining the action feature information of the target object based on the object frame in the key frame image of the video clip includes:
  • according to the object frame corresponding to the key frame image, respectively crop partial images from at least some of the associated images corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image;
  • the action characteristic information of the target object is determined.
  • In the embodiments of the present disclosure, the object frame in the key frame image is used to locate the target object, and the target object images used to determine the action feature information are cropped from multiple associated images associated with the key frame image. This improves the accuracy of the images used to determine the action feature information and increases the number of such images, so that the accuracy of action recognition can be improved.
  • filtering out multiple associated images corresponding to key frame images from the video clip includes:
  • selecting, from the video clip, a first sub-video segment that includes the key frame image; the first sub-video segment also includes N images that are temporally adjacent to the key frame image, where N is a positive integer;
  • the multiple associated images are filtered from the first sub video segment.
  • In the embodiments of the present disclosure, the images associated with the key frame image are filtered from a sub-video segment shot close in time to the key frame image, so the images most closely associated with the key frame image can be selected; basing the determined action feature information on these most closely associated images improves its accuracy.
  • the method further includes:
  • the target object image is set as an image with a preset image resolution.
  • In the embodiments of the present disclosure, setting the target object image to a preset resolution can increase the amount of information it carries; that is, the cropped target object image can be magnified, which helps capture fine-grained details of the target object and thereby improves the accuracy of the determined action feature information.
  • the determining the scene characteristic information and time sequence characteristic information corresponding to the target object based on the video clip and the action characteristic information includes:
  • the time sequence feature information corresponding to the target object is determined.
  • In the embodiments of the present disclosure, by extracting scene features from the associated images associated with the key frame image, relatively complete scene feature information can be obtained, and the accuracy of action recognition can be improved on that basis. In addition, the embodiments of the present disclosure extract the time sequence features of objects other than the target object, that is, the above-mentioned initial time sequence feature information, and determine the time sequence feature information associated with the target object based on those features and the action feature information of the target object. Using the time sequence feature information associated with the target object can further improve the accuracy of action recognition.
  • the performing a time-series feature extraction operation on objects other than the target object in the video clip to obtain initial time-series feature information includes:
  • a second sub-video segment including the key frame image is selected from the video clip; the second sub-video segment further includes P images that are temporally adjacent to the key frame image, where P is a positive integer;
  • In the embodiments of the present disclosure, sub-video segments shot close in time to the key frame image are selected from the video clip for time sequence feature extraction, which reduces the amount of feature data to extract and improves the relevance of the extracted features to the key frame image, benefiting recognition accuracy. In addition, the action features of other objects are used as the time sequence features, which makes the time sequence features used in action recognition more targeted and further helps improve its accuracy.
  • the determining the time sequence characteristic information corresponding to the target object based on the initial time sequence characteristic information and the action characteristic information includes:
  • the initial time series feature information after the mean pooling operation and the action feature information after the dimensionality reduction process are combined to obtain the time series feature information corresponding to the target object.
  • In the embodiments of the present disclosure, the initial time sequence feature information and the action feature information are reduced in dimensionality, which reduces the amount of data to be processed and improves the efficiency of action recognition. In addition, a mean pooling operation is performed on the dimensionality-reduced initial time sequence feature information, which simplifies the time sequence feature extraction steps and further improves the efficiency of action recognition.
  • the determining the time sequence characteristic information corresponding to the target object based on the initial time sequence characteristic information and the action characteristic information further includes:
  • the time sequence feature extraction operation for determining the time sequence feature information corresponding to the target object is repeatedly executed, which can improve the accuracy of the determined time sequence feature information.
  • an action recognition device including:
  • Video acquisition module configured to acquire video clips
  • An action feature determining module configured to determine the action feature information of the target object based on the object frame in the key frame image of the target object in the video clip;
  • a scene timing feature determining module configured to determine the scene feature information and timing feature information corresponding to the target object based on the video clip and the action feature information;
  • the action recognition module is configured to determine the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
  • the action feature determining module is further configured to determine the object frame in the key frame image:
  • the initial object bounding box is expanded to obtain the object bounding box of the target object in the key frame image.
  • When the action feature determination module determines the action feature information of the target object based on the object frame in the key frame image of the video clip, it is configured to:
  • according to the object frame corresponding to the key frame image, respectively crop partial images from at least some of the associated images corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image;
  • the action characteristic information of the target object is determined.
  • When the action feature determining module filters out multiple associated images corresponding to the key frame image from the video clip, it is configured to:
  • select, from the video clip, a first sub-video segment that includes the key frame image; the first sub-video segment also includes N images that are temporally adjacent to the key frame image, where N is a positive integer;
  • the motion characteristic determination module is further configured to:
  • the target object image is set as an image with a preset image resolution.
  • When the scene time sequence feature determination module determines the scene feature information and time sequence feature information corresponding to the target object based on the video segment and the action feature information, it is configured to:
  • the time sequence feature information corresponding to the target object is determined.
  • When the scene timing feature determination module performs a timing feature extraction operation on objects other than the target object in the video clip to obtain initial timing feature information, it is configured to:
  • a second sub video segment including a key frame image is selected from the video segments; the second sub video segment further includes P images that are temporally adjacent to the key frame image; wherein, P is a positive integer;
  • When the scene time sequence feature determination module determines the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, it is configured to:
  • the initial time series feature information after the mean pooling operation and the action feature information after the dimensionality reduction process are combined to obtain the time series feature information corresponding to the target object.
  • When the scene sequence feature determination module determines the sequence feature information corresponding to the target object based on the initial sequence feature information and the action feature information, it is further configured to:
  • the present disclosure provides an electronic device including: a processor and a storage medium connected to each other.
  • the storage medium stores machine-readable instructions executable by the processor.
  • The processor executes the machine-readable instructions to perform the steps of the above-mentioned action recognition method.
  • the present disclosure also provides a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and the computer program executes the steps of the above-mentioned action recognition method when the computer program is run by a processor.
  • The present disclosure also provides a computer program, including computer-readable code; when the code runs in an electronic device, a processor in the electronic device executes the steps of the above-mentioned action recognition method.
  • The foregoing apparatus, electronic device, and computer-readable storage medium of the present disclosure contain technical features that are substantially the same as, or similar to, the technical features of the foregoing method or of any embodiment of any aspect of the present disclosure. Therefore, for the effect descriptions of the apparatus, the electronic device, and the computer-readable storage medium, please refer to the effect description of the above method, which will not be repeated here.
  • FIG. 1A shows a flowchart of an action recognition method provided by an embodiment of the present disclosure
  • FIG. 1B shows a schematic diagram of a network architecture provided by an embodiment of the present disclosure
  • FIG. 2 shows a flowchart of determining the action feature information of a target object in another action recognition method provided by an embodiment of the present disclosure
  • FIG. 3 shows a flowchart of determining scene feature information and time sequence feature information corresponding to the target object in yet another action recognition method provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic structural diagram of a simplified timing feature extraction module in an embodiment of the present disclosure
  • FIG. 5 shows a flowchart of still another method for action recognition provided by an embodiment of the present disclosure
  • FIG. 6 shows a schematic structural diagram of an action recognition device provided by an embodiment of the present disclosure
  • Fig. 7 shows a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the present disclosure provides an action recognition method and device, electronic equipment, and computer-readable storage medium.
  • The present disclosure uses the object frame corresponding to the target object to determine the action feature information, instead of the entire frame of the image, which effectively reduces the amount of data used for action recognition in each frame and thus allows more images to be used, helping improve recognition accuracy. In addition, the present disclosure not only uses the action feature information of the target object for action classification and recognition, but also uses the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time sequence feature information associated with the target object's actions. On the basis of the action feature information, combining the scene feature information and the time sequence feature information can further improve the accuracy of action recognition.
  • the embodiments of the present disclosure provide an action recognition method, which is applied to a hardware device such as a terminal device that performs action recognition, and the method may also be implemented by a processor executing a computer program.
  • the action recognition method provided by the embodiment of the present disclosure includes the following steps:
  • The video clip is a piece of video used for action recognition, and includes multiple frames of images.
  • the images include a target object that needs to be motion recognized, and the target object may be a human or an animal.
  • The above-mentioned video clip can be shot by the terminal device that performs action recognition, using its own camera or other shooting equipment, or it can be shot by another shooting device and then passed to the terminal device for action recognition.
  • S120 Determine the motion feature information of the target object based on the object frame in the key frame image of the target object in the video clip.
  • the object frame is the bounding box surrounding the target object.
  • the image information in the bounding box is used to determine the motion characteristic information of the target object, the amount of data processed by the terminal device can be reduced.
  • the present disclosure does not limit the method of filtering key frame images from video clips.
  • After the key frame images are filtered out, the object frame in each key frame image can be used to determine the action feature information of the target object.
  • Object detection methods can be used; for example, a human body detector can determine the object frame using a human body detection method.
  • Other methods can also be used to determine the object frame; the present disclosure does not limit the method for determining the object frame.
  • the object frame detected by the human body detector may be used as the final object frame used to determine the action feature information.
  • Since the object frame detected by the human body detector may be a small frame that just encloses the target object, in order to obtain more complete target object information and more environmental information, after detection the object frame in each key frame image may be expanded according to preset expansion size information to obtain the final object frame of the target object in that key frame image. The determined final object frame is then used to determine the action feature information of the target object.
  • the expansion size information for expanding the object frame is preset.
  • the expansion size information includes a first extension length of the object frame in the length direction and a second extension length of the object frame in the width direction.
  • the length of the object frame is extended to both sides according to the first extension length, and the two sides in the length direction are respectively extended by half of the first extension length.
  • the width of the object frame is extended to both sides according to the second extension length, and both sides in the width direction are respectively extended by half of the second extension length.
  • the first extension length and the second extension length may be preset specific values, or may be determined based on the length and width of the object frame directly detected by the human body detector.
  • the first extension length may be equal to the length of the object frame directly detected by the human body detector
  • the second extension length may be equal to the width of the object frame directly detected by the human body detector.
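  • For illustration, the expansion rule above can be written in a few lines. The following is a minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) pixel coordinates and the example where each extension equals the detected length or width; the function name and the clamping to the image bounds are illustrative choices, not taken from the patent.

```python
def expand_box(box, img_w, img_h, scale=1.0):
    """Symmetrically expand a detected bounding box.

    box: (x1, y1, x2, y2) initial box from the detector.
    scale=1.0 extends the box by its own width/height in total,
    half on each side, matching the example above.
    """
    x1, y1, x2, y2 = box
    dx = scale * (x2 - x1) / 2  # half of the first extension length per side
    dy = scale * (y2 - y1) / 2  # half of the second extension length per side
    return (max(0, x1 - dx), max(0, y1 - dy),      # clamp to the image bounds
            min(img_w, x2 + dx), min(img_h, y2 + dy))

# A 100x200 detection inside a 1280x720 frame doubles in each dimension:
print(expand_box((500, 200, 600, 400), 1280, 720))  # (450.0, 100.0, 650.0, 500.0)
```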
  • In the embodiments of the present disclosure, an object detection method is used to determine the bounding box of the target object in the image, which reduces the amount of data that needs to be processed for action recognition. After a smaller initial bounding box is determined, it is expanded, so that the object frame used for action recognition includes more complete target object information and more environmental information, thereby helping to improve the accuracy of action recognition.
  • the aforementioned action feature information is extracted from the image in the video clip, and can characterize the action feature of the target object.
  • the scene feature information is used to characterize the scene feature of the scene in which the target object is located, and may be obtained by performing scene feature extraction from at least part of the associated images associated with the key frame image.
  • The time sequence feature information is feature information temporally related to the target object's actions; for example, it can be the action feature information of objects in the video clip other than the target object. In specific implementations, it can be determined based on the video clip and the target object's action feature information.
  • S140 Determine the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
  • In specific implementations, the above three types of information can be combined, for example by splicing; the combined information is then classified to obtain the action type of the target object, thereby realizing action recognition of the target object.
  • In the embodiments of the present disclosure, the object frame corresponding to the target object is used to determine the action feature information instead of the entire frame of the image, which effectively reduces the amount of data used for action recognition in each frame and thus allows more images to be used, helping improve recognition accuracy. In addition, the embodiments of the present disclosure not only use the action feature information of the target object for action classification and recognition, but also use the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time sequence feature information associated with the target object's actions. On the basis of the action feature information, combining the scene feature information and the time sequence feature information can further improve the accuracy of action recognition.
  • FIG. 1B shows a schematic diagram of a network architecture provided by an embodiment of the present disclosure. The network architecture includes: a user terminal 201, a network 202, and an action recognition terminal device 203.
  • the user terminal 201 and the action recognition terminal device 203 establish a communication connection through the network 202.
  • In a possible implementation, the user terminal 201 sends request information for determining an action type to the action recognition terminal device 203 through the network 202. The action recognition terminal device 203 then obtains the video clip and uses the object frame corresponding to the target object to determine the action feature information; it also uses the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time sequence feature information associated with the target object's actions. Finally, it considers the scene feature information and the time sequence feature information together to determine the target object's action type with higher accuracy, and feeds the determined action type back to the user terminal 201.
  • the user terminal 201 may include a device with data processing capabilities
  • the motion recognition terminal device 203 may include an image acquisition device, and a processing device with data processing capabilities or a remote server.
  • the network 202 may adopt a wired connection or a wireless connection.
  • When the action recognition terminal device 203 is a processing device with data processing capabilities, the user terminal 201 can communicate with it through a wired connection, such as data communication via a bus; when the action recognition terminal device 203 is a remote server, the user terminal can interact with the remote server through a wireless network.
  • the foregoing determination of the motion feature information of the target object based on the object frame of the target object in the key frame image in the video clip can be specifically implemented by using the following steps:
  • An associated image of the key frame image is an image whose features are similar to those of the key frame image; for example, it may be an image shot at a time close to that of the key frame image.
  • the following sub-steps can be used to filter the associated images corresponding to the key frame images:
  • Sub-step 1 Select a first sub-video segment that includes the key frame image from the video clip; the first sub-video segment also includes N images that are temporally adjacent to the key frame image, where N is a positive integer.
  • The key frame image may be located in the first half of the first sub-video segment, in the second half, or in the middle of the segment or near the middle.
  • a sub-video segment including a key frame image can be intercepted from the video segment, for example, a 64-frame sub-video segment can be intercepted.
  • the key frame image is in the middle of the sub-video segment or a position close to the middle.
  • For example, the sub-video segment includes the 32 frames before the key frame image, the key frame image, and the 31 frames after it. In another example, the key frame image is in the first half of the sub-video segment; in yet another, the key frame image is in the second half, and the sub-video segment includes the 50 frames before the key frame image, the key frame image, and the 13 frames after it.
  • In specific implementations, the key frame image may also be located at either end of the first sub-video segment; that is, the above-mentioned N images temporally adjacent to the key frame image are the N images before, or the N images after, the key frame image.
  • the present disclosure does not limit the position of the key frame image in the first sub-video segment.
  • Sub-step 2 Filter the multiple associated images from the first sub-video segment.
  • In specific implementations, the associated images may be filtered from the first sub-video segment based on a preset time interval; for example, T frames of associated images can be obtained by sparsely sampling the first sub-video segment with a time span τ.
  • the related images obtained by screening may include or may not include key frame images, and have a certain degree of randomness. The present disclosure does not limit whether the related images include key frame images.
  • the image similarity between each frame of the image in the first sub-video segment and the key frame image may be calculated first, and then multiple images with the highest image similarity are selected as the associated images associated with the key frame image.
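  • As a sketch of the stride-based option, the following picks T frame indices from a window around the key frame; the window size, T, and stride are illustrative values, not prescribed by the patent.

```python
def sample_associated_frames(clip_len, key_idx, window=64, T=8, stride=8):
    """Sparsely sample T frame indices from a window containing the key frame.

    The window is shifted where necessary so it stays inside the clip,
    then every `stride`-th frame within it is kept.
    """
    start = max(0, min(key_idx - window // 2, clip_len - window))
    return [start + i * stride for i in range(T)]

print(sample_associated_frames(clip_len=300, key_idx=150))
# [118, 126, 134, 142, 150, 158, 166, 174]  (here the key frame happens to be kept)
```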
  • S220 According to the object frame corresponding to the key frame image, respectively intercept partial images from at least part of the associated images corresponding to the key frame image to obtain multiple target object images corresponding to the key frame image.
  • In specific implementations, the object frame corresponding to the key frame image is used to crop partial images from some or all of the associated images associated with it. When cropping target object images from only part of the associated images, the associated images closest in shooting time to the key frame image can be selected from all the associated images; of course, other selection methods can also be used, for example selecting some associated images from all of them at a certain time interval.
  • The specific process includes: first copying the object frame onto all or some of the associated images in chronological order.
  • In specific implementations, the coordinate information of the object frame on the key frame image is used to replicate the frame on the associated images; for example, according to that coordinate information, the frame position can be offset in time order or copied directly to obtain the object frame on each associated image.
  • the associated image is cropped according to the object frame to obtain the target object image, that is, the image within the object frame in the associated image is intercepted as the target object image.
  • It should be noted that the function of the key frame image here is to locate the target object image; it is not necessarily used to determine the action feature information directly. For example, when the associated images do not include the key frame image, no target object image used to determine the action feature information is cropped from the key frame image.
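  • A small sketch of the copy-and-crop step described above, assuming the simpler of the two options mentioned: the expanded key-frame box is copied unchanged onto every sampled associated frame before cropping.

```python
import numpy as np

def crop_target_images(frames, box):
    """Crop the same object box out of every associated frame.

    frames: list of HxWx3 uint8 arrays (the sampled associated images).
    box: (x1, y1, x2, y2) copied from the key frame image.
    """
    x1, y1, x2, y2 = (int(v) for v in box)
    return [f[y1:y2, x1:x2] for f in frames]

frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(8)]
crops = crop_target_images(frames, (450, 100, 650, 500))
print(crops[0].shape)  # (400, 200, 3)
```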
  • S230 Determine the action feature information of the target object based on the multiple target object images corresponding to the key frame.
  • action feature extraction can be performed on multiple target object images.
  • In specific implementations, a three-dimensional (3D) convolutional neural network can be used to process the target object images and extract the action features in them, so as to obtain the action feature information of the target object.
  • the following steps may be used to process the target object image:
  • the target object image is set as an image with a preset image resolution.
  • the aforementioned preset image resolution is higher than the original image resolution of the target object image.
  • existing methods or tools can be used to set the image resolution of the target object image, for example, interpolation and other methods can be used to adjust the image resolution of the target object image.
  • Setting the target object image to the preset resolution can increase the amount of information it includes; that is, the cropped target object image can be enlarged so that finer-grained details of the target object are retained, which can improve the accuracy of the determined action feature information.
  • The above-mentioned preset image resolution can be set to H×W; the number of target object images cropped for each key frame image is T, and each target object image has 3 channels, so the input to the 3D convolutional neural network is a T×H×W×3 image block. After feature extraction, a 2048-dimensional feature vector can be obtained, which is the aforementioned action feature information.
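  • Putting the crop, resize, and extraction steps together, the following hedged sketch resizes the T crops to the preset H×W, stacks them into a T×H×W×3 block, and runs a 3D convolutional network. The patent names no backbone; torchvision's r3d_18 is used purely as a stand-in, with its classifier head removed so a single pooled feature vector is produced (2048-dimensional in the text above, 512-dimensional for this particular stand-in).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models.video import r3d_18  # stand-in 3D CNN, not the patent's network

T, H, W = 8, 224, 224
crops = [torch.randint(0, 256, (400, 200, 3), dtype=torch.uint8) for _ in range(T)]

# Resize each crop to the preset resolution (interpolation, as mentioned above).
block = torch.stack([
    F.interpolate(c.permute(2, 0, 1).unsqueeze(0).float() / 255.0,
                  size=(H, W), mode="bilinear", align_corners=False).squeeze(0)
    for c in crops
])                                     # T x 3 x H x W

model = r3d_18(weights=None)
model.fc = nn.Identity()               # drop the classifier, keep the pooled feature
model.eval()

with torch.no_grad():                  # 3D CNNs expect N x C x T x H x W
    action_feat = model(block.permute(1, 0, 2, 3).unsqueeze(0))
print(action_feat.shape)               # torch.Size([1, 512])
```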
  • In the embodiments of the present disclosure, the object frame in the key frame image is used for positioning, and the target object images used to determine the action feature information are cropped from multiple associated images associated with the key frame image. This improves the accuracy of the images used for determining the action feature information and increases their number, so that the accuracy of action recognition can be improved.
  • the foregoing determination of scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information includes:
  • For the key frame image, filter multiple associated images corresponding to it from the video clip, and perform a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information.
  • a 3D convolutional neural network can be used to perform video scene feature extraction and global average pooling on part or all of the associated images to obtain a 2048-dimensional feature vector, which is the aforementioned scene feature information.
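  • The scene branch can be sketched the same way: the same kind of 3D backbone is applied to the full, uncropped associated frames, and its internal global average pooling yields one scene feature vector. Again, the network is a stand-in assumption, not the patent's.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18  # stand-in backbone, as before

model = r3d_18(weights=None)
model.fc = nn.Identity()     # global average pooling happens inside; classifier dropped
model.eval()

full_frames = torch.rand(1, 3, 8, 224, 224)  # whole associated images, not crops
with torch.no_grad():
    scene_feat = model(full_frames)
print(scene_feat.shape)      # torch.Size([1, 512]) scene feature vector
```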
  • S320 Perform a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information.
  • the initial time sequence feature information is the time sequence feature of other objects other than the target object, such as the action feature of other objects, which can be determined through the following sub-steps during specific implementation:
  • Sub-step 1 For the key frame image, select a second sub-video segment that includes the key frame image from the video clip; the second sub-video segment also includes P images that are temporally adjacent to the key frame image, where P is a positive integer.
  • The key frame image may be located in the first half of the second sub-video segment, in the second half, or in the middle of the segment or near the middle.
  • In specific implementations, the key frame image may also be located at either end of the second sub-video segment; that is, the P images temporally adjacent to the key frame image are the P images before, or the P images after, the key frame image.
  • the present disclosure does not limit the position of the key frame image in the second sub-video segment.
  • During specific implementation, a sub-video segment including the key frame image can be taken from the video clip; for example, a 2-second sub-video segment can be taken. This sub-video covers a longer time and is used to determine long-term time sequence features.
  • Sub-step 2 Extract the action features of objects other than the target object from each image in the second sub-video segment, and use the obtained action features as the initial time sequence feature information.
  • In specific implementations, a 3D convolutional neural network can be used to extract the action features of objects other than the target object in the sub-video segment, and the obtained initial time sequence feature information can be stored and used in the form of a long-term feature bank (LFB).
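  • As an illustration of the feature-bank pattern just described, a minimal sketch of storing per-frame features of the other objects and reading back a window around a key frame; the actual LFB design is considerably more elaborate, and all names and sizes here are assumptions.

```python
import torch

class FeatureBank:
    """Minimal long-term feature bank: per-frame features of objects
    other than the target, retrievable as a window around a key frame."""

    def __init__(self):
        self.bank = {}  # frame index -> list of feature vectors

    def add(self, frame_idx, feats):
        self.bank.setdefault(frame_idx, []).extend(feats)

    def window(self, key_idx, half_span):
        out = []
        for i in range(key_idx - half_span, key_idx + half_span + 1):
            out.extend(self.bank.get(i, []))
        return torch.stack(out) if out else torch.empty(0, 512)

bank = FeatureBank()
for i in range(100):  # pretend two other people were detected in each frame
    bank.add(i, [torch.rand(512), torch.rand(512)])
print(bank.window(key_idx=50, half_span=30).shape)  # torch.Size([122, 512])
```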
  • In the embodiments of the present disclosure, sub-video segments shot close in time to the key frame image are selected from the video clip for time sequence feature extraction, which reduces the amount of extracted feature data and improves the relevance of the features to the key frame image, benefiting recognition accuracy. In addition, the action features of other objects are used as the time sequence features, which makes the time sequence features used in action recognition more targeted and further helps improve its accuracy.
  • time sequence feature extraction on the initial time sequence feature information and the action feature information to obtain the time sequence feature information corresponding to the target object.
  • the following sub-steps can be used to perform time-series feature extraction on the initial time-series feature information and action feature information to obtain the time-series feature information corresponding to the target object:
  • Sub-step 1 Perform dimensionality reduction processing on the initial time series feature information and the action feature information respectively.
  • In specific implementations, the initial time sequence feature information and the action feature information can be reduced in dimensionality, which reduces the amount of data that needs to be processed and helps improve the efficiency of action recognition.
  • In specific implementations, dropout (random deactivation) processing can also be performed on the initial time sequence feature information and the action feature information. The dropout processing can be applied at the final network layer of the neural networks used to extract the initial time sequence feature information and the action feature information, or it can be applied at each network layer of those networks.
  • Sub-step 2 Perform the mean pooling operation on the dimensionality-reduced initial time sequence feature information.
  • Sub-step 3 The initial time series feature information after the mean pooling operation and the action feature information after the dimensionality reduction process are combined to obtain the time series feature information corresponding to the target object.
  • The above merging operation can specifically be channel splicing, that is, appending the channels of one piece of feature information to the channels of the other; the merging operation can also be an addition operation, that is, adding the mean-pooled initial time sequence feature information and the dimensionality-reduced action feature information together.
  • Sub-step two and sub-step three are essentially to perform a sequence feature extraction operation on the initial sequence feature information and the action feature information, and can be specifically implemented by using the simplified sequence feature extraction module as shown in FIG. 4.
  • The simplified time sequence feature extraction module 400 shown in FIG. 4 is used to extract the above-mentioned time sequence feature information, and may specifically include a linear layer 401, an average pooling layer 402, a normalization and activation (LN+ReLU) layer 403, and a dropout layer 404.
  • In the embodiment of the present disclosure, the time sequence feature extraction operation is simplified: only the average pooling layer is used to perform the mean pooling operation on the dimensionality-reduced initial time sequence feature information, and no normalization (softmax) operation is performed. This simplifies the operation steps of time sequence feature extraction, that is, it simplifies the existing time sequence feature extraction module, and can improve the efficiency of action recognition.
  • The existing time sequence feature extraction module does not include an average pooling layer but includes a classification normalization (softmax) layer, whose processing complexity is higher than that of the average pooling operation. The existing module also includes a linear layer before the dropout layer, which the simplified module in the present disclosure omits, so the efficiency of action recognition can be further improved.
  • the time series feature information output by the time series feature extraction module may be a 512-dimensional feature vector, and the 512-dimensional feature vector is the time series feature information of the aforementioned target object.
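  • A sketch of the simplified module of FIG. 4: linear reduction of both inputs (layer 401), mean pooling over the bank features (layer 402), LN+ReLU (layer 403), dropout (layer 404), then channel splicing. The exact placement of LN+ReLU and dropout, and all dimensions, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimplifiedTemporalModule(nn.Module):
    """No softmax attention and no extra linear layer before dropout:
    just reduce, mean-pool, normalize, and merge, as described above."""

    def __init__(self, bank_dim=512, action_dim=2048, out_dim=512):
        super().__init__()
        self.reduce_bank = nn.Linear(bank_dim, out_dim // 2)             # layer 401
        self.reduce_action = nn.Linear(action_dim, out_dim // 2)
        self.norm = nn.Sequential(nn.LayerNorm(out_dim // 2), nn.ReLU()) # layer 403
        self.drop = nn.Dropout(0.3)                                      # layer 404

    def forward(self, bank_feats, action_feat):
        # bank_feats: M x bank_dim (other objects); action_feat: action_dim.
        pooled = self.reduce_bank(bank_feats).mean(dim=0)                # layer 402
        pooled = self.drop(self.norm(pooled))
        return torch.cat([pooled, self.reduce_action(action_feat)], dim=-1)

m = SimplifiedTemporalModule()
print(m(torch.rand(122, 512), torch.rand(2048)).shape)  # torch.Size([512])
```

  • To connect several such modules in series, as described below, the 512-dimensional output can be fed back in as the new initial time sequence feature information, with the module dimensions chosen to match.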
  • In the embodiments of the present disclosure, extracting scene features from some or all of the associated images associated with the key frame image yields relatively complete scene feature information, on the basis of which the accuracy of action recognition can be improved. In addition, the time sequence features of objects other than the target object, that is, the above-mentioned initial time sequence feature information, are extracted, and the time sequence feature information associated with the target object is determined from them together with the action feature information; using this time sequence feature information can further improve the accuracy of action recognition.
  • In specific implementations, multiple time sequence feature extraction modules can be connected in series to extract the above time sequence feature information, with the time sequence feature information extracted by one module used as the input of the next.
  • That is, the time sequence feature information corresponding to the target object extracted by the previous module may be used as the new initial time sequence feature information, and the process returns to the above step of respectively reducing the dimensionality of the initial time sequence feature information and the action feature information.
  • three simplified timing feature extraction modules can be connected in series to determine the final timing feature information.
  • the embodiment of the present disclosure uses a person as a target object for action recognition.
  • the action recognition method of the embodiment of the present disclosure may include:
  • Step 1 Obtain a video clip 501, and filter key frame images from the above video clip;
  • Step 2 Use the human body detector 502 to locate the person in each key frame image, obtaining the initial object bounding box of the person, that is, of the target object;
  • Step 3 Expand the initial object bounding box according to the preset expansion size information to obtain the final object bounding box; then use the object bounding box to crop partial images from the associated images associated with the key frame image, obtaining the target object images corresponding to each key frame image;
  • Step 4 Input the target object images corresponding to all the obtained key frame images into the 3D convolutional neural network 503, and use the 3D convolutional neural network 503 to extract the action features of the target object, obtaining the action feature information corresponding to the target object.
  • Step 5 Input the associated image associated with the key frame image into the aforementioned 3D convolutional neural network 503, and use the 3D convolutional neural network 503 to extract the video scene features of the scene where the target object is located, to obtain scene feature information.
  • Step 6 Use another 3D convolutional neural network 503 to perform time sequence feature extraction on the video clip, that is, extract the action features of objects other than the target object to obtain the initial time sequence feature information. The initial time sequence feature information can exist in the form of a time sequence feature bank; the features can be extracted from the entire video clip, or from a longer sub-video segment of the video clip that includes the key frame image.
  • Step 7 Using the simplified time sequence feature extraction module 504, perform a time sequence feature extraction operation on the initial time sequence feature information and the action feature information to obtain the time sequence feature information corresponding to the target object.
  • Step 8 Perform splicing processing on the time series feature information, action feature information, and scene feature information, and use the action classifier 505 to classify the spliced information to obtain the action type of the target object.
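  • Tying steps 1 to 8 together, a high-level sketch of the recognition head: the three feature vectors are spliced and classified. The patent does not specify the classifier, so the plain linear layer and the class count below are illustrative assumptions.

```python
import torch
import torch.nn as nn

temporal_feat = torch.rand(512)   # step 7 output
action_feat = torch.rand(2048)    # step 4 output
scene_feat = torch.rand(2048)     # step 5 output

classifier = nn.Linear(512 + 2048 + 2048, 80)  # e.g. 80 hypothetical action classes
logits = classifier(torch.cat([temporal_feat, action_feat, scene_feat]))
print(logits.shape, logits.argmax().item())    # torch.Size([80]) and the action type
```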
  • Based on the same inventive concept, the present disclosure also provides an action recognition device, which is applied to hardware devices such as terminal equipment that performs action recognition on a target object. Each module can implement the same method steps as in the above-mentioned method and achieve the same beneficial effects, so the same parts will not be repeated in this disclosure.
  • As shown in FIG. 6, an action recognition device provided by the present disclosure may include:
  • the video acquisition module 610 is configured to acquire video clips.
  • the action feature determining module 620 is configured to determine the action feature information of the target object based on the object frame in the key frame image of the target object in the video clip.
  • the scene timing feature determining module 630 is configured to determine the scene feature information and timing feature information corresponding to the target object based on the video clip and the action feature information.
  • the action recognition module 640 is configured to determine the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
  • the action feature determining module 620 is further configured to determine the object frame in the key frame image:
  • the initial object bounding box is expanded to obtain the object bounding box of the target object in the key frame image.
  • When the action feature determination module 620 determines the action feature information of the target object based on the object frame in the key frame image of the target object in the video clip, it is configured to:
  • according to the object frame corresponding to the key frame image, respectively crop partial images from at least some of the associated images corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image;
  • the action characteristic information of the target object is determined.
  • When the action feature determining module 620 filters out multiple associated images corresponding to the key frame images from the video clip, it is configured to:
  • select, from the video clip, a first sub-video segment that includes the key frame image; the first sub-video segment also includes N images that are temporally adjacent to the key frame image, where N is a positive integer;
  • the motion characteristic determination module 620 is further configured to:
  • the target object image is set as an image with a preset image resolution.
  • When the scene timing feature determination module 630 determines the scene feature information and timing feature information corresponding to the target object based on the video clip and the action feature information, it is configured to:
  • the time sequence feature information corresponding to the target object is determined.
  • When the scene timing feature determination module 630 performs a timing feature extraction operation on objects other than the target object in the video clip to obtain initial timing feature information, it is configured to:
  • a second sub video segment including a key frame image is selected from the video segments; the second sub video segment further includes P images that are temporally adjacent to the key frame image; wherein, P is a positive integer;
  • When the scene timing feature determining module 630 determines the timing feature information corresponding to the target object based on the initial timing feature information and the action feature information, it is configured to:
  • the initial time series feature information after the mean pooling operation and the action feature information after the dimensionality reduction process are combined to obtain the time series feature information corresponding to the target object.
  • When the scene timing feature determining module 630 determines the timing feature information corresponding to the target object based on the initial timing feature information and the action feature information, it is further configured to:
  • the embodiment of the present disclosure discloses an electronic device, as shown in FIG. 7, comprising: a processor 701 and a storage medium 702 connected to each other, and the storage medium stores machine-readable instructions executable by the processor.
  • the processor executes the machine-readable instructions to execute the steps of the above-mentioned action recognition method.
  • the processor 701 and the storage medium 702 may be connected through a bus 703.
  • The embodiments of the present disclosure also provide a computer program product corresponding to the above method and device, including a computer-readable storage medium storing program code. The instructions included in the program code can be used to execute the method in the preceding method embodiments; for specific implementation, please refer to the method embodiments, which will not be repeated here.
  • the computer-readable storage medium may be a volatile or non-volatile storage medium.
  • the modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present disclosure.
  • the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile computer-readable storage medium executable by a processor.
  • Based on this understanding, the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • The aforementioned storage media include: USB flash drives, mobile hard disks, ROM, RAM, magnetic disks, optical disks, and other media that can store program code.
  • The present disclosure provides an action recognition method and device, an electronic device, and a computer-readable storage medium. The method includes: obtaining a video clip; determining the action feature information of the target object based on the object frame in the key frame image in the video clip; determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information; and determining the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Provided are an action recognition method and apparatus, an electronic device and a computer-readable storage medium. In the method, action feature information is determined by using an object border frame corresponding to a target object instead of using the whole frame of an image, such that the data volume for action recognition in each frame of the image can be effectively reduced, thereby being able to increase the number of images for action recognition and facilitating the improvement in the accuracy of action recognition. In addition, in the method, the action feature information of the target object is used to perform action classification and recognition, and scenario feature information of a scenario where the target object is located and time sequence feature information associated with an action of the target object are extracted by using a video clip and the determined action feature information, and the accuracy of action recognition can be further improved by combining the scenario information and the time sequence feature information on the basis of the action feature information.

Description

Action recognition method and device, electronic equipment, and computer-readable storage medium
Cross-Reference to Related Applications
The present disclosure is filed based on the Chinese patent application with application number 202010166148.8, filed on March 11, 2020, and claims priority to that Chinese patent application, the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the fields of computer technology and image processing, and in particular, to an action recognition method and device, an electronic device, and a computer-readable storage medium.
Background
Motion detection and recognition are widely used in fields such as robotics, security, and health. In the related art, factors such as the limited data processing capability of the recognition device and the single type of data used for action recognition lead to low action recognition accuracy.
Summary of the Invention
In view of this, the present disclosure provides at least an action recognition method and device, an electronic device, and a computer-readable storage medium.
In a first aspect, the present disclosure provides an action recognition method, including:
obtaining a video clip;
determining action feature information of a target object based on an object frame of the target object in a key frame image of the video clip;
determining, based on the video clip and the action feature information, scene feature information and time sequence feature information corresponding to the target object; and
determining an action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
In the embodiments of the present disclosure, the object frame corresponding to the target object, rather than the whole image frame, is used to determine the action feature information. This effectively reduces the amount of data used for action recognition in each image frame, so that the number of images used for action recognition can be increased, which helps improve the accuracy of action recognition. In addition, this aspect not only uses the action feature information of the target object for action classification and recognition, but also uses the video clip and the determined action feature information to extract scene feature information of the scene where the target object is located and time sequence feature information associated with the action of the target object. Combining the scene feature information and the time sequence feature information with the action feature information can further improve the accuracy of action recognition.
In a possible implementation, the action recognition method further includes a step of determining the object frame in the key frame image:
filtering a key frame image from the video clip;
performing object detection on the filtered key frame image to determine an initial object bounding box of the target object in the key frame image; and
expanding the initial object bounding box according to preset expansion size information to obtain the object frame of the target object in the key frame image.
In the embodiments of the present disclosure, object detection is used to determine the frame of the target object in the image, which reduces the amount of data that needs to be processed for action recognition. Moreover, after a smaller initial object bounding box is determined, it is expanded, so that the object frame used for action recognition can include more complete information about the target object and more environmental information and retain more spatial details, which helps improve the accuracy of action recognition.
In a possible implementation, the determining the action feature information of the target object based on the object frame of the target object in the key frame image of the video clip includes:
for the key frame image, filtering out multiple associated images corresponding to the key frame image from the video clip;
cropping partial images from at least some of the associated images corresponding to the key frame image according to the object frame corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image; and
determining the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
In the embodiments of the present disclosure, the object frame of the target object in the key frame image is used for positioning, and the target object images used to determine the action feature information are cropped from multiple associated images associated with the key frame image. This improves the precision of the images used to determine the action feature information and can increase the number of such images, thereby improving the accuracy of action recognition.
In a possible implementation, filtering out multiple associated images corresponding to the key frame image from the video clip includes:
selecting, from the video clip, a first sub-video clip that includes the key frame image, the first sub-video clip further including N images temporally adjacent to the key frame image, where N is a positive integer; and
filtering the multiple associated images from the first sub-video clip.
In the embodiments of the present disclosure, images associated with the key frame image are filtered from a sub-video clip captured close in time to the key frame image, so that the images most closely associated with the key frame image can be selected. Based on these most closely associated images, the accuracy of the determined action feature information can be improved.
In a possible implementation, after obtaining the multiple target object images and before determining the action feature information of the target object, the method further includes:
setting the target object images to images with a preset image resolution.
In the embodiments of the present disclosure, after the target object images are cropped, setting them to a preset resolution can increase the amount of information included in each target object image; that is, the cropped target object images can be enlarged, which helps capture fine-grained details of the target object and thereby improves the accuracy of the determined action feature information.
In a possible implementation, the determining the scene feature information and the time sequence feature information corresponding to the target object based on the video clip and the action feature information includes:
for the key frame image, filtering out multiple associated images corresponding to the key frame image from the video clip;
performing a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information;
performing a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information; and
determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
In the embodiments of the present disclosure, extracting scene features from the associated images associated with the key frame image yields relatively complete scene feature information, which can improve the accuracy of action recognition. In addition, the time sequence features of objects other than the target object, i.e., the above initial time sequence feature information, are extracted, and the time sequence feature information associated with the target object is determined based on the time sequence features of the other objects and the action feature information of the target object. Using this time sequence feature information associated with the target object can further improve the accuracy of action recognition.
In a possible implementation, the performing a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information includes:
for the key frame image, selecting, from the video clip, a second sub-video clip that includes the key frame image, the second sub-video clip further including P images temporally adjacent to the key frame image, where P is a positive integer; and
extracting action features of objects other than the target object from the images in the second sub-video clip, and using the obtained action features as the initial time sequence feature information.
In the embodiments of the present disclosure, a sub-video clip captured close in time to the key frame image is selected from the video clip to extract time sequence features, which reduces the amount of extracted time sequence feature data and strengthens the association between the determined time sequence features and the key frame image, thereby helping improve the accuracy of action recognition. In addition, in the embodiments of the present disclosure, the action features of other objects are used as time sequence features, which makes the time sequence features used for action recognition more targeted and helps improve the accuracy of action recognition.
In a possible implementation, the determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information includes:
performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively;
performing a mean pooling operation on the dimensionality-reduced initial time sequence feature information; and
merging the mean-pooled initial time sequence feature information with the dimensionality-reduced action feature information to obtain the time sequence feature information corresponding to the target object.
In the embodiments of the present disclosure, when the time sequence feature information is determined based on the initial time sequence feature information and the action feature information, dimensionality reduction is applied to both, which reduces the amount of data to be processed and helps improve the efficiency of action recognition. In addition, the mean pooling operation on the dimensionality-reduced initial time sequence feature information simplifies the time sequence feature extraction steps and can improve the efficiency of action recognition.
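For concreteness, the three sub-steps above can be sketched as follows in Python; this is an illustrative sketch only, in which the feature dimensions, the use of linear projections as the dimensionality-reduction step, and concatenation as the merge operation are assumptions not fixed by the disclosure.

```python
import torch
import torch.nn as nn

class TemporalFeatureFusion(nn.Module):
    """Illustrative sketch: reduce the dimensionality of the initial time
    sequence features (other objects, many clips) and of the target's action
    feature, mean-pool the former over clips, then merge by concatenation."""

    def __init__(self, in_dim=2048, reduced_dim=512):
        super().__init__()
        # Linear projections serve as the assumed dimensionality-reduction step.
        self.reduce_seq = nn.Linear(in_dim, reduced_dim)
        self.reduce_act = nn.Linear(in_dim, reduced_dim)

    def forward(self, initial_seq_feats, action_feat):
        # initial_seq_feats: (num_clips, in_dim) features of other objects
        # action_feat:       (in_dim,) action feature of the target object
        seq = self.reduce_seq(initial_seq_feats)   # (num_clips, reduced_dim)
        act = self.reduce_act(action_feat)         # (reduced_dim,)
        pooled = seq.mean(dim=0)                   # mean pooling over clips
        return torch.cat([pooled, act], dim=-1)    # merged time sequence feature

fusion = TemporalFeatureFusion()
seq_info = fusion(torch.rand(10, 2048), torch.rand(2048))  # shape (1024,)
```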
In a possible implementation, the determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information further includes:
using the obtained time sequence feature information corresponding to the target object as new initial time sequence feature information, and returning to the step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
In the embodiments of the present disclosure, repeatedly performing the time sequence feature extraction operation of determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information can improve the accuracy of the determined time sequence feature information.
In a second aspect, the present disclosure provides an action recognition device, including:
a video obtaining module, configured to obtain a video clip;
an action feature determination module, configured to determine action feature information of a target object based on an object frame of the target object in a key frame image of the video clip;
a scene and time sequence feature determination module, configured to determine scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information; and
an action recognition module, configured to determine an action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
In a possible implementation, the action feature determination module is further configured to determine the object frame in the key frame image by:
filtering a key frame image from the video clip;
performing object detection on the filtered key frame image to determine an initial object bounding box of the target object in the key frame image; and
expanding the initial object bounding box according to preset expansion size information to obtain the object frame of the target object in the key frame image.
In a possible implementation, when determining the action feature information of the target object based on the object frame of the target object in the key frame image of the video clip, the action feature determination module is configured to:
for the key frame image, filter out multiple associated images corresponding to the key frame image from the video clip;
crop partial images from at least some of the associated images corresponding to the key frame image according to the object frame corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image; and
determine the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
In a possible implementation, when filtering out multiple associated images corresponding to the key frame image from the video clip, the action feature determination module is configured to:
select, from the video clip, a first sub-video clip that includes the key frame image, the first sub-video clip further including N images temporally adjacent to the key frame image, where N is a positive integer; and
filter the multiple associated images from the first sub-video clip.
In a possible implementation, after obtaining the multiple target object images and before determining the action feature information of the target object, the action feature determination module is further configured to:
set the target object images to images with a preset image resolution.
In a possible implementation, when determining the scene feature information and the time sequence feature information corresponding to the target object based on the video clip and the action feature information, the scene and time sequence feature determination module is configured to:
for the key frame image, filter out multiple associated images corresponding to the key frame image from the video clip;
perform a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information;
perform a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information; and
determine the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
In a possible implementation, when performing the time sequence feature extraction operation on objects other than the target object in the video clip to obtain the initial time sequence feature information, the scene and time sequence feature determination module is configured to:
for the key frame image, select, from the video clip, a second sub-video clip that includes the key frame image, the second sub-video clip further including P images temporally adjacent to the key frame image, where P is a positive integer; and
extract action features of objects other than the target object from the images in the second sub-video clip, and use the obtained action features as the initial time sequence feature information.
In a possible implementation, when determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, the scene and time sequence feature determination module is configured to:
perform dimensionality reduction on the initial time sequence feature information and the action feature information respectively;
perform a mean pooling operation on the dimensionality-reduced initial time sequence feature information; and
merge the mean-pooled initial time sequence feature information with the dimensionality-reduced action feature information to obtain the time sequence feature information corresponding to the target object.
In a possible implementation, when determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, the scene and time sequence feature determination module is further configured to:
use the obtained time sequence feature information corresponding to the target object as new initial time sequence feature information, and return to the step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
In a third aspect, the present disclosure provides an electronic device, including a processor and a storage medium connected to each other, the storage medium storing machine-readable instructions executable by the processor. When the electronic device runs, the processor executes the machine-readable instructions to perform the steps of the above action recognition method.
In a fourth aspect, the present disclosure further provides a computer-readable storage medium having a computer program stored thereon, the computer program performing the steps of the above action recognition method when run by a processor.
In a fifth aspect, the present disclosure further provides a computer program, including computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device executes steps for implementing the above action recognition method.
The above device, electronic device, and computer-readable storage medium of the present disclosure contain at least technical features substantially the same as or similar to the technical features of any aspect of the above method of the present disclosure or of any implementation of any aspect. Therefore, for the description of the effects of the above device, electronic device, and computer-readable storage medium, reference may be made to the description of the effects of the above method, which is not repeated here.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present disclosure more clearly, the following briefly introduces the drawings needed in the embodiments. It should be understood that the following drawings show only certain embodiments of the present disclosure and therefore should not be regarded as limiting the scope. For those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.
FIG. 1A shows a flowchart of an action recognition method provided by an embodiment of the present disclosure;
FIG. 1B shows a schematic diagram of a network architecture provided by an embodiment of the present disclosure;
FIG. 2 shows a flowchart of determining action feature information of a target object in another action recognition method provided by an embodiment of the present disclosure;
FIG. 3 shows a flowchart of determining scene feature information and time sequence feature information corresponding to the target object in yet another action recognition method provided by an embodiment of the present disclosure;
FIG. 4 shows a schematic structural diagram of a simplified time sequence feature extraction module in an embodiment of the present disclosure;
FIG. 5 shows a flowchart of still another action recognition method provided by an embodiment of the present disclosure;
FIG. 6 shows a schematic structural diagram of an action recognition device provided by an embodiment of the present disclosure;
FIG. 7 shows a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. It should be understood that the drawings in the present disclosure are for illustration and description only and are not used to limit the protection scope of the present disclosure. In addition, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in the present disclosure show operations implemented according to some embodiments of the present disclosure. It should be understood that the operations of the flowcharts may be implemented out of order, and steps without a logical contextual relationship may be implemented in reverse order or at the same time. In addition, under the guidance of the content of the present disclosure, those skilled in the art may add one or more other operations to a flowchart, or remove one or more operations from a flowchart.
In addition, the described embodiments are only some of the embodiments of the present disclosure, rather than all of them. The components of the embodiments of the present disclosure generally described and illustrated in the drawings herein may be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present disclosure provided in the drawings is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
It should be noted that the term "include" will be used in the embodiments of the present disclosure to indicate the presence of the features declared thereafter, without excluding the addition of other features.
In view of the technical problem of low recognition accuracy in current action recognition, the present disclosure provides an action recognition method and device, an electronic device, and a computer-readable storage medium. The present disclosure uses the object frame corresponding to the target object, rather than the whole image frame, to determine the action feature information, which effectively reduces the amount of data used for action recognition in each image frame, so that the number of images used for action recognition can be increased, which helps improve the accuracy of action recognition. In addition, the present disclosure not only uses the action feature information of the target object for action classification and recognition, but also uses the video clip and the determined action feature information to extract scene feature information of the scene where the target object is located and time sequence feature information associated with the action of the target object. Combining the scene feature information and the time sequence feature information with the action feature information can further improve the accuracy of action recognition.
The action recognition method and device, electronic device, and computer-readable storage medium of the present disclosure are described below through specific embodiments.
An embodiment of the present disclosure provides an action recognition method, which is applied to a hardware device such as a terminal device that performs action recognition; the method may also be implemented by a processor executing a computer program. Specifically, as shown in FIG. 1A, the action recognition method provided by the embodiment of the present disclosure includes the following steps:
S110. Obtain a video clip.
Here, the video clip is a video clip used for action recognition and includes multiple images. The images include a target object on which action recognition needs to be performed, and the target object may be a human, an animal, or the like.
The above video clip may be shot by the terminal device performing action recognition using its own camera or other shooting equipment, or may be shot by another shooting device; after shooting, the other shooting device transfers the video clip to the terminal device performing action recognition.
S120. Determine action feature information of the target object based on an object frame of the target object in a key frame image of the video clip.
Here, the object frame is the bounding box surrounding the target object. Using the image information within the bounding box to determine the action feature information of the target object can reduce the amount of data processed by the terminal device.
Before the action feature information is determined based on the object frame, key frame images first need to be filtered from the video clip, and the object frame of the target object in each key frame image needs to be determined.
In specific implementation, key frame images may be filtered from the video clip at a preset time interval. Of course, other methods may also be used to filter key frame images from the video clip, for example, dividing the video clip into multiple sub-clips and extracting one image frame from each sub-clip as a key frame image. The present disclosure does not limit the method of filtering key frame images from the video clip.
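As a minimal sketch of the two key frame selection strategies just described (the interval and frame-count values below are assumed examples, not mandated by the disclosure):

```python
def keyframes_by_interval(num_frames, interval):
    """Select key frame indices at a preset frame interval."""
    return list(range(0, num_frames, interval))

def keyframes_by_subclips(num_frames, num_subclips):
    """Split the clip into sub-clips and take the middle frame of each."""
    subclip_len = num_frames // num_subclips
    return [i * subclip_len + subclip_len // 2 for i in range(num_subclips)]

# Example: a 256-frame clip.
print(keyframes_by_interval(256, 64))   # [0, 64, 128, 192]
print(keyframes_by_subclips(256, 4))    # [32, 96, 160, 224]
```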
After multiple key frame images are filtered from the video clip, the object frame in each key frame image may be used to determine the action feature information of the target object; of course, the object frames in only some of the filtered key frame images may also be used. When the object frames in some of the key frame images are used to determine the action feature information of the target object, the object frames in those key frame images need to be extracted or determined first, and the extracted or determined frames are then used to determine the action feature information of the target object.
In specific implementation, an object detection method may be used to determine the object frame, for example, using a human body detector to determine the object frame through human body detection. Of course, other methods may also be used to determine the object frame; the present disclosure does not limit the method of determining the object frame.
In specific implementation, the object frame detected by the human body detector may be used as the final object frame for determining the action feature information. However, since the object frame detected by the human body detector may be a small frame that just encloses the target object, in order to obtain more complete information about the target object and more environmental information, after the human body detector detects the object frames, each detected object frame may also be expanded according to preset expansion size information to obtain the final object frame of the target object in each key frame image. The determined final object frame is then used to determine the action feature information of the target object.
The expansion size information for expanding the object frame is set in advance. For example, the expansion size information includes a first extension length of the object frame in the length direction and a second extension length of the object frame in the width direction. The length of the object frame is extended toward both sides according to the first extension length, with each side in the length direction extended by half of the first extension length; the width of the object frame is extended toward both sides according to the second extension length, with each side in the width direction extended by half of the second extension length.
The first extension length and the second extension length may be preset specific values, or values determined based on the length and width of the object frame directly detected by the human body detector. For example, the first extension length may be equal to the length of the object frame directly detected by the human body detector, and the second extension length may be equal to the width of the object frame directly detected by the human body detector.
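A minimal sketch of this expansion rule follows, assuming boxes are given as (x1, y1, x2, y2) pixel coordinates and, per the example above, defaulting each extension length to the detected box's own size (so the box is doubled in each dimension); the clipping to image bounds is an added practical assumption.

```python
def expand_box(box, image_w, image_h, ext_len=None, ext_wid=None):
    """Expand a detected box by half of each extension length on each side,
    clipped to the image bounds. If no extension is given, use the box's
    own height/width as the extension lengths, as in the example above."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    ext_len = h if ext_len is None else ext_len   # length direction (assumed vertical)
    ext_wid = w if ext_wid is None else ext_wid   # width direction
    return (max(0, x1 - ext_wid / 2), max(0, y1 - ext_len / 2),
            min(image_w, x2 + ext_wid / 2), min(image_h, y2 + ext_len / 2))

# Example: a 100x200 detection in a 1280x720 frame.
print(expand_box((300, 100, 400, 300), 1280, 720))  # (250.0, 0.0, 450.0, 400.0)
```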
In the above manner, object detection is used to determine the frame of the target object in the image, which reduces the amount of data that needs to be processed for action recognition. After a smaller initial object bounding box is determined, it is expanded, so that the object frame configured for action recognition can include more complete information about the target object and more environmental information, which helps improve the accuracy of action recognition.
The above action feature information is extracted from the images in the video clip and can characterize the action features of the target object.
S130. Determine, based on the video clip and the action feature information, scene feature information and time sequence feature information corresponding to the target object.
Here, the scene feature information is used to characterize the scene features of the scene where the target object is located, and may be obtained by performing scene feature extraction on at least some of the associated images associated with the key frame image.
The time sequence feature information is feature information temporally associated with the action of the target object; for example, it may be the action feature information of objects other than the target object in the video clip. In specific implementation, it may be determined based on the video clip and the action feature information of the target object.
S140. Determine an action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
After the action feature information, the scene feature information, and the time sequence feature information are determined, the three types of information may be merged, for example, by concatenation; the merged information is then classified to obtain the action type of the target object, thereby realizing action recognition of the target object.
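A minimal sketch of this merge-and-classify step in PyTorch; the feature dimensions, the number of action classes, and the single linear classification head are illustrative assumptions rather than the disclosed design.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Concatenate action, scene, and time sequence features, then classify."""

    def __init__(self, action_dim=2048, scene_dim=2048, seq_dim=1024, num_classes=80):
        super().__init__()
        self.head = nn.Linear(action_dim + scene_dim + seq_dim, num_classes)

    def forward(self, action_feat, scene_feat, seq_feat):
        merged = torch.cat([action_feat, scene_feat, seq_feat], dim=-1)
        return self.head(merged)  # per-class action scores

clf = ActionClassifier()
scores = clf(torch.rand(2048), torch.rand(2048), torch.rand(1024))
action_type = scores.argmax().item()  # index of the recognized action type
```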
In the embodiments of the present disclosure, the object frame corresponding to the target object, rather than the whole image frame, is used to determine the action feature information, which effectively reduces the amount of data used for action recognition in each image frame, so that the number of images used for action recognition can be increased, which helps improve the accuracy of action recognition. In addition, the embodiments of the present disclosure not only use the action feature information of the target object for action classification and recognition, but also use the video clip and the determined action feature information to extract scene feature information of the scene where the target object is located and time sequence feature information associated with the action of the target object. Combining the scene feature information and the time sequence feature information with the action feature information can further improve the accuracy of action recognition.
In the embodiments of the present disclosure, action recognition of the target object may be implemented through the network architecture shown in FIG. 1B. FIG. 1B shows a schematic diagram of a network architecture provided by an embodiment of the present disclosure. The network architecture includes: a user terminal 201, a network 202, and an action recognition terminal device 203. To support an exemplary application, the user terminal 201 and the action recognition terminal device 203 establish a communication connection through the network 202. When the user terminal 201 needs to obtain the action type of a target object, it first sends request information configured to determine the action type to the action recognition terminal device 203 through the network 202. Then, the action recognition terminal device 203 obtains a video clip and uses the object frame corresponding to the target object to determine the action feature information, and uses the video clip and the determined action feature information to extract the scene feature information of the scene where the target object is located and the time sequence feature information associated with the action of the target object. Finally, it comprehensively considers the scene feature information and the time sequence feature information to determine the action type of the target object with higher accuracy, and feeds the determined action type back to the user terminal 201.
As an example, the user terminal 201 may include a device with data processing capability, and the action recognition terminal device 203 may include an image acquisition apparatus and a processing device with data processing capability or a remote server. The network 202 may use a wired or wireless connection. When the action recognition terminal device is a processing device with data processing capability, the user terminal 201 may communicate with the processing device through a wired connection, for example, data communication over a bus; when the action recognition terminal device 203 is a remote server, the user terminal may exchange data with the remote server through a wireless network.
In some embodiments, as shown in FIG. 2, the above determining the action feature information of the target object based on the object frame of the target object in the key frame image of the video clip may specifically be implemented through the following steps:
S210. For a key frame image, filter out multiple associated images corresponding to the key frame image from the video clip.
Here, the associated images associated with the key frame image are images whose image features are similar to those of the key frame image, for example, images captured close in time to the key frame image.
In specific implementation, the associated images corresponding to the key frame image may be filtered using the following sub-steps:
Sub-step 1: Select, from the video clip, a first sub-video clip that includes the key frame image; the first sub-video clip further includes N images temporally adjacent to the key frame image, where N is a positive integer.
In the first sub-video clip, the key frame image may be located in the first half of the clip, in the second half of the clip, or, of course, at or near the middle of the clip.
In a possible implementation, a sub-video clip including the key frame image may be cut from the video clip, for example, a 64-frame sub-video clip in which the key frame image is at or near the middle. For example, the sub-video clip includes the 32 frames before the key frame image, the key frame image, and the 31 frames after the key frame image. As another example, the key frame image may be in the first half of the sub-video clip, with the sub-video clip including the 10 frames before the key frame image, the key frame image, and the 53 frames after the key frame image. As yet another example, the key frame image may be in the second half of the sub-video clip, with the sub-video clip including the 50 frames before the key frame image, the key frame image, and the 13 frames after the key frame image.
In addition, in the first sub-video clip, the key frame image may also be located at either end of the clip; that is, the N images temporally adjacent to the key frame image are the N images before or the N images after the key frame image. The present disclosure does not limit the position of the key frame image in the first sub-video clip.
Sub-step 2: Filter the multiple associated images from the first sub-video clip.
In a possible implementation, the associated images may be filtered from the first sub-video clip based on a preset time interval, for example, sparsely sampling the first sub-video clip with a time span τ to obtain T frames of associated images. The filtered associated images may or may not include the key frame image; there is a certain randomness, and the present disclosure does not limit whether the associated images include the key frame image.
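A minimal sketch of this sparse sampling, assuming the 64-frame sub-clip from the example above and illustrative values of τ and T:

```python
def sample_associated_frames(subclip_start, subclip_len, tau, T):
    """Sparsely sample up to T frame indices from a sub-clip, spaced by tau."""
    indices = [subclip_start + i * tau for i in range(T)]
    # Keep only indices that fall inside the sub-clip.
    return [i for i in indices if i < subclip_start + subclip_len]

# Example: a 64-frame sub-clip starting at frame 96, tau = 8, T = 8.
print(sample_associated_frames(96, 64, 8, 8))
# [96, 104, 112, 120, 128, 136, 144, 152]
```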
Based on a predetermined time interval, images associated with the key frame image are filtered from a sub-video clip captured close in time to the key frame image, so that the images most closely associated with the key frame image can be selected; based on these images, the accuracy of the determined action feature information can be improved.
In addition, other methods may also be used to filter the associated images associated with the key frame image. For example, the image similarity between each image frame in the first sub-video clip and the key frame image may be computed first, and then the multiple images with the highest image similarity are selected as the associated images associated with the key frame image.
S220. Crop partial images from at least some of the associated images corresponding to the key frame image, according to the object frame corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image.
Here, the object frame corresponding to the key frame image is used to crop partial images from some or all of the associated images associated with the key frame image. If the target object images are cropped from only some of the associated images, specifically, the associated images captured closest in time to the key frame image may be selected from all associated images for cropping; of course, other methods may also be used to select some of the associated images, for example, selecting some of the associated images from all associated images at a certain time interval.
When the target object images are cropped according to the object frame corresponding to the key frame image, the specific process includes: first, copying the object frame onto all or some of the associated images in chronological order. When copying the object frame onto an associated image, the coordinate information of the object frame on the key frame image is used to reproduce the frame on the associated image, for example, by offsetting the frame position over time according to that coordinate information, or by directly copying the frame position, to obtain the object frame on the associated image. Then, after the object frame is copied, each associated image is cropped according to the object frame to obtain a target object image; that is, the image within the object frame in the associated image is cropped out as the target object image.
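A minimal sketch of the direct-copy variant, in which the key frame's box coordinates are copied unchanged onto each associated frame before cropping (the NumPy (H, W, 3) frame layout is an assumption; the time-offset variant is not shown):

```python
import numpy as np

def crop_target_object_images(frames, box):
    """Copy the key frame's object frame onto each associated frame and crop.
    frames: list of (H, W, 3) arrays; box: (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = [int(v) for v in box]
    return [frame[y1:y2, x1:x2] for frame in frames]

# Example: 8 associated frames of size 720x1280, box copied from the key frame.
frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(8)]
crops = crop_target_object_images(frames, (250, 0, 450, 400))
print(crops[0].shape)  # (400, 200, 3)
```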
The role of the key frame image is to locate the target object image; it is not necessarily used to directly determine the action feature information. For example, when the associated images do not include the key frame image, no target object image used to determine the action feature information is cropped from the key frame image.
S230. Determine the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
After the above target object images are cropped, action feature extraction may be performed on the multiple target object images. Specifically, a three-dimensional (3D) convolutional neural network may be used to process the target object images and extract the action features in them, obtaining the action feature information of the target object.
In addition, in the embodiments of the present disclosure, after the multiple target object images are obtained and before the action feature information of the target object is determined, the target object images may also be processed using the following step:
setting the target object images to images with a preset image resolution. The preset image resolution is higher than the original image resolution of the target object images. In specific implementation, existing methods or tools may be used to set the image resolution of the target object images; for example, interpolation may be used to adjust the image resolution.
Here, after the target object images are cropped, setting them to the preset resolution can increase the amount of information included in each target object image; that is, the cropped target object images can be enlarged, retaining more fine-grained details of the target object, thereby improving the accuracy of the determined action feature information.
In specific implementation, the preset image resolution may be set to H×W, with T target object images cropped for each key frame image and 3 channels per target object image; the input to the 3D convolutional neural network for action feature extraction is then a T×H×W×3 image block. After the 3D convolutional neural network performs global average pooling on the input image block, a 2048-dimensional feature vector is obtained, which is the above action feature information.
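The resize-and-extract step can be sketched as follows in PyTorch. The shallow stand-in network below is not the disclosed 3D convolutional neural network; only the T×H×W×3 input layout and the 2048-dimensional globally pooled output follow the description above, and H = W = 224 and T = 8 are assumed example values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionFeatureExtractor(nn.Module):
    """Stand-in 3D CNN: a real system would use a deep 3D backbone."""

    def __init__(self, out_dim=2048):
        super().__init__()
        self.conv = nn.Conv3d(3, 64, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)   # global average pooling
        self.proj = nn.Linear(64, out_dim)    # to the 2048-d feature vector

    def forward(self, clip):
        # clip: (T, H, W, 3) -> (1, 3, T, H, W) as Conv3d expects
        x = clip.permute(3, 0, 1, 2).unsqueeze(0)
        x = self.pool(self.conv(x)).flatten(1)  # (1, 64)
        return self.proj(x)                     # (1, 2048) action feature

# Upsample the T cropped images to the preset resolution H x W, then extract.
T, H, W = 8, 224, 224
crops = torch.rand(T, 3, 400, 200)              # cropped target object images
resized = F.interpolate(crops, size=(H, W), mode="bilinear", align_corners=False)
clip = resized.permute(0, 2, 3, 1)              # (T, H, W, 3) image block
feat = ActionFeatureExtractor()(clip)
print(feat.shape)  # torch.Size([1, 2048])
```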
In the embodiments of the present disclosure, the object frame of the target object in the key frame image is used for positioning, and the target object images used to determine the action feature information are cropped from multiple associated images associated with the key frame image, which improves the precision of the images used to determine the action feature information and can increase their number, thereby improving the accuracy of action recognition.
In some embodiments, as shown in FIG. 3, the above determining the scene feature information and the time sequence feature information corresponding to the target object based on the video clip and the action feature information includes:
S310. For a key frame image, filter out multiple associated images corresponding to the key frame image from the video clip, and perform a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information.
Here, specifically, a 3D convolutional neural network may be used to perform video scene feature extraction and global average pooling on some or all of the associated images to obtain a 2048-dimensional feature vector, which is the above scene feature information.
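Continuing the sketch above under the same assumptions (and reusing the ActionFeatureExtractor stand-in defined there), the scene feature is obtained in the same way but from the full associated frames rather than the cropped object regions; in practice a separately trained backbone would be used:

```python
# Whole associated frames (not cropped to the object frame) carry the scene.
scene_frames = torch.rand(8, 224, 224, 3)   # (T, H, W, 3), assumed sizes
scene_extractor = ActionFeatureExtractor()  # separate weights in practice
scene_feat = scene_extractor(scene_frames)  # (1, 2048) scene feature vector
```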
S320. Perform a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information.
Here, the initial time sequence feature information consists of the time sequence features of objects other than the target object, for example, the action features of other objects. In specific implementation, it may be determined through the following sub-steps:
Sub-step 1: For the key frame image, select, from the video clip, a second sub-video clip that includes the key frame image; the second sub-video clip further includes P images temporally adjacent to the key frame image, where P is a positive integer.
In the second sub-video clip, the key frame image may be located in the first half of the clip, in the second half of the clip, or, of course, at or near the middle of the clip.
In addition, in the second sub-video clip, the key frame image may also be located at either end of the clip; that is, the P images temporally adjacent to the key frame image are the P images before or the P images after the key frame image. The present disclosure does not limit the position of the key frame image in the second sub-video clip.
一种可能的实现方式中,从视频片段中截取一段包括关键帧图像的子视频片段,例如可以截取一段2秒钟的子视频片段,该子视频的时间较长用于确定一个长时序的时序特征。In a possible implementation manner, a sub-video segment including a key frame image can be intercepted from a video segment. For example, a 2-second sub-video segment can be intercepted. The sub-video has a longer time and is used to determine the timing of a long time sequence. feature.
子步骤二、提取所述第二子视频片段中的每张图像中,除所述目标对象以外的其他对象的动作特征,并将得到动作特征作为所述初始时序特征信息。The second sub-step is to extract the action features of objects other than the target object in each image in the second sub-video segment, and use the obtained action features as the initial time sequence feature information.
Here, a 3D convolutional neural network may be used to extract the action features of objects other than the target object in the sub-video clip, and the obtained initial time sequence feature information may be stored and used in the form of a long-term feature bank (LFB).
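A minimal sketch of such a feature bank follows. The dictionary-keyed-by-frame layout and the 2048-dimensional per-object features are assumptions for illustration; the disclosure specifies only that the extracted features are stored and used as a long-term feature bank.

    import torch

    class LongTermFeatureBank:
        def __init__(self):
            self.bank = {}                              # frame index -> (K, 2048) features

        def add(self, frame_idx: int, other_object_feats: torch.Tensor) -> None:
            # other_object_feats: (K, 2048) action features of the K non-target objects
            self.bank[frame_idx] = other_object_feats

        def window(self, center: int, p: int) -> torch.Tensor:
            # gather the features stored for the key frame and its P temporal neighbours
            feats = [self.bank[i] for i in range(center - p, center + p + 1) if i in self.bank]
            return torch.cat(feats, dim=0)              # initial time sequence feature information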
In the embodiments of the present disclosure, a sub-video clip whose capture time is close to that of the key frame image is selected from the video clip for time sequence feature extraction. This reduces the data volume of the extracted time sequence features and strengthens their association with the key frame image, which helps improve the accuracy of action recognition. In addition, in the embodiments of the present disclosure, the action features of other objects are used as the time sequence features, which makes the time sequence features used for action recognition more targeted and likewise helps improve the accuracy of action recognition.
S330: Determine the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
Here, time sequence feature extraction may be performed on the initial time sequence feature information and the action feature information to obtain the time sequence feature information corresponding to the target object.
In one possible implementation, the following sub-steps may be used to perform this extraction:
Sub-step 1: Perform dimensionality reduction on the initial time sequence feature information and the action feature information, respectively.
After the initial time sequence feature information of objects other than the target object and the action feature information of the target object are obtained, applying dimensionality reduction to both reduces the amount of data to be processed, which helps improve the efficiency of action recognition.
In one possible implementation, after the initial time sequence feature information and the action feature information are obtained, dropout may additionally be applied to them. The dropout may be implemented at the last network layer of the neural network used to extract the initial time sequence feature information and the action feature information, or at each of its network layers.
Sub-step 2: Perform a mean pooling operation on the dimensionality-reduced initial time sequence feature information.
Sub-step 3: Merge the mean-pooled initial time sequence feature information with the dimensionality-reduced action feature information to obtain the time sequence feature information corresponding to the target object. The merge operation may specifically be channel concatenation, in which the channels of one piece of feature information are appended to those of the other; it may also be an addition operation, in which the mean-pooled initial time sequence feature information and the dimensionality-reduced action feature information are added element-wise.
Sub-steps 2 and 3 essentially perform a time sequence feature extraction operation on the initial time sequence feature information and the action feature information, which may be implemented by the simplified time sequence feature extraction module shown in FIG. 4. The simplified module 400 of FIG. 4 is used to extract the above time sequence feature information and may specifically include a linear layer 401, an average pooling layer 402, a layer normalization and activation (LN+ReLU) layer 403, and a dropout layer 404. In sub-step 2 above, the time sequence feature extraction operation is simplified: only the average pooling layer is used to mean-pool the dimensionality-reduced initial time sequence feature information, and no softmax normalization is performed. This simplifies the operation steps of time sequence feature extraction, i.e., simplifies the existing time sequence feature extraction module, and improves the efficiency of action recognition. The existing time sequence feature extraction module includes a classification softmax layer instead of an average pooling layer, and the processing performed by the softmax layer is more complex than the average pooling operation. In addition, the existing module includes a further linear layer before the dropout layer; the simplified module of the present disclosure omits that linear layer, which further improves the efficiency of action recognition.
In a specific implementation, the time sequence feature information output by the time sequence feature extraction module may be a 512-dimensional feature vector; this 512-dimensional vector is the time sequence feature information of the target object.
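The following is a sketch of the simplified module of FIG. 4 (linear dimensionality reduction, average pooling, LN+ReLU, dropout). PyTorch is assumed; the 2048-dimensional inputs, the 512-dimensional reduced width, the dropout rate, and the choice of the addition variant of the merge operation (which keeps the output at 512 dimensions) are illustrative assumptions consistent with the dimensions described above.

    import torch
    import torch.nn as nn

    class SimplifiedTemporalModule(nn.Module):
        def __init__(self, bank_dim: int = 2048, action_dim: int = 2048,
                     dim: int = 512, p_drop: float = 0.2):
            super().__init__()
            self.reduce_bank = nn.Linear(bank_dim, dim)       # linear layer 401: dim reduction
            self.reduce_action = nn.Linear(action_dim, dim)
            self.norm_act = nn.Sequential(nn.LayerNorm(dim),  # LN+ReLU layer 403
                                          nn.ReLU(inplace=True))
            self.dropout = nn.Dropout(p_drop)                 # dropout layer 404

        def forward(self, bank_feats: torch.Tensor, action_feat: torch.Tensor) -> torch.Tensor:
            # bank_feats: (K, bank_dim) initial time sequence features from the feature bank
            # action_feat: (action_dim,) action feature information of the target object
            bank = self.reduce_bank(bank_feats).mean(dim=0)   # average pooling 402; no softmax
            act = self.reduce_action(action_feat)
            merged = bank + act                               # addition variant of the merge
            return self.dropout(self.norm_act(merged))        # (dim,) time sequence feature info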
In the embodiments of the present disclosure, extracting scene features from some or all of the associated images associated with the key frame image yields relatively complete scene feature information, which in turn improves the accuracy of action recognition. In addition, the embodiments of the present disclosure extract the time sequence features of objects other than the target object, i.e., the above initial time sequence feature information, and determine the time sequence feature information associated with the target object based on these features together with the action feature information of the target object; using this time sequence feature information associated with the target object further improves the accuracy of action recognition.
To further improve the accuracy of the extracted time sequence feature information, multiple time sequence feature extraction modules may be connected in series, the time sequence feature information output by one module serving as the input of the next. Specifically, the time sequence feature information corresponding to the target object extracted by the previous module may be used as the new initial time sequence feature information, and the process returns to the above step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
In a specific implementation, three simplified time sequence feature extraction modules may be connected in series to determine the final time sequence feature information, as in the sketch below.
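Continuing the sketch above, a cascade of three such modules could look as follows. The layer sizing of the second and third modules (512-dimensional bank inputs) is an assumption: the disclosure specifies only that the output of one module becomes the new initial time sequence feature information of the next, while the action feature information is reused at each stage.

    modules = [
        SimplifiedTemporalModule(bank_dim=2048, action_dim=2048, dim=512),
        SimplifiedTemporalModule(bank_dim=512, action_dim=2048, dim=512),
        SimplifiedTemporalModule(bank_dim=512, action_dim=2048, dim=512),
    ]

    def cascade(bank_feats: torch.Tensor, action_feat: torch.Tensor) -> torch.Tensor:
        feats = bank_feats                        # (K, 2048) from the long-term feature bank
        out = None
        for module in modules:
            out = module(feats, action_feat)      # (512,)
            feats = out.unsqueeze(0)              # becomes the new initial features
        return out                                # final time sequence feature information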
The action recognition method of the present disclosure is described below through a specific embodiment.
As shown in FIG. 5, the embodiment of the present disclosure takes a person as the target object for action recognition. Specifically, the action recognition method of this embodiment may include:
Step 1: Obtain a video clip 501, and filter key frame images from the video clip.
Step 2: Use the human body detector 502 to locate the person in each key frame image, obtaining the initial object bounding box of the person, i.e., of the target object.
Step 3: Expand the initial object bounding box according to the preset expansion size information to obtain the final object border; then use the object border to crop partial images from the associated images associated with the key frame image, obtaining the target object images corresponding to each key frame image.
Step 4: Input the target object images corresponding to all key frame images into the 3D convolutional neural network 503, and use it to extract the action features of the target object, obtaining the action feature information corresponding to the target object.
Step 5: Input the associated images associated with the key frame images into the 3D convolutional neural network 503, and use it to extract the video scene features of the scene in which the target object is located, obtaining the scene feature information.
Step 6: Use another 3D convolutional neural network 503 to perform time sequence feature extraction on the video clip, i.e., extract the action features of objects other than the target object, obtaining the initial time sequence feature information, which may exist in the form of a time sequence feature bank. Here, the time sequence features may be extracted either from the entire video clip or from a longer sub-video clip of it that includes the key frame image.
Step 7: Use the simplified time sequence feature extraction module 504 to perform a time sequence feature extraction operation on the initial time sequence feature information and the action feature information, obtaining the time sequence feature information corresponding to the target object.
Step 8: Concatenate the time sequence feature information, the action feature information, and the scene feature information, and use the action classifier 505 to classify the concatenated information, obtaining the action type of the target object; a sketch of this step follows the list.
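A sketch of this final fusion and classification step follows. The single linear classifier and the feature widths (512 + 2048 + 2048) are assumptions for illustration; the disclosure specifies concatenation of the three kinds of feature information followed by an action classifier.

    import torch
    import torch.nn as nn

    class ActionClassifier(nn.Module):
        def __init__(self, num_actions: int, dims=(512, 2048, 2048)):
            super().__init__()
            self.fc = nn.Linear(sum(dims), num_actions)

        def forward(self, temporal_feat, action_feat, scene_feat):
            # concatenate time sequence, action, and scene feature information
            fused = torch.cat([temporal_feat, action_feat, scene_feat], dim=-1)
            return self.fc(fused)                 # scores over the candidate action types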
Corresponding to the above action recognition method, the present disclosure further provides an action recognition apparatus. The apparatus is applied to hardware devices, such as terminal devices, that perform action recognition on a target object, and each of its modules can implement the same method steps and achieve the same beneficial effects as the above method; the parts that are the same are therefore not repeated in this disclosure.
Specifically, as shown in FIG. 6, the action recognition apparatus provided by the present disclosure may include:
a video acquisition module 610, configured to acquire a video clip;
an action feature determination module 620, configured to determine the action feature information of the target object based on the object border of the target object in a key frame image of the video clip;
a scene and time sequence feature determination module 630, configured to determine the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information;
an action recognition module 640, configured to determine the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
In some embodiments, the action feature determination module 620 is further configured to determine the object border in the key frame image by:
filtering key frame images from the video clip;
performing object detection on the filtered key frame images to determine the initial object bounding box of the target object in the key frame image;
expanding the initial object bounding box according to preset expansion size information to obtain the object border of the target object in the key frame image.
In some embodiments, when determining the action feature information of the target object based on the object border of the target object in the key frame image of the video clip, the action feature determination module 620 is configured to:
filter out, for the key frame image, multiple associated images corresponding to the key frame image from the video clip;
crop, according to the object border corresponding to the key frame image, partial images from at least some of the associated images corresponding to the key frame image, obtaining multiple target object images corresponding to the key frame image;
determine the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
In some embodiments, when filtering out the multiple associated images corresponding to a key frame image from the video clip, the action feature determination module 620 is configured to:
select from the video clip a first sub-video clip that includes the key frame image, the first sub-video clip further including N images temporally adjacent to the key frame image, where N is a positive integer;
filter the multiple associated images from the first sub-video clip.
In some embodiments, after the multiple target object images are obtained and before the action feature information of the target object is determined, the action feature determination module 620 is further configured to:
set each target object image to an image with a preset image resolution.
In some embodiments, when determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information, the scene and time sequence feature determination module 630 is configured to:
filter out, for the key frame image, multiple associated images corresponding to the key frame image from the video clip;
perform a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information;
perform a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information;
determine the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
In some embodiments, when performing the time sequence feature extraction operation on objects other than the target object in the video clip to obtain the initial time sequence feature information, the scene and time sequence feature determination module 630 is configured to:
select from the video clip, for the key frame image, a second sub-video clip that includes the key frame image, the second sub-video clip further including P images temporally adjacent to the key frame image, where P is a positive integer;
extract, from the images in the second sub-video clip, the action features of objects other than the target object, and use the obtained action features as the initial time sequence feature information.
In some embodiments, when determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, the scene and time sequence feature determination module 630 is configured to:
perform dimensionality reduction on the initial time sequence feature information and the action feature information, respectively;
perform a mean pooling operation on the dimensionality-reduced initial time sequence feature information;
merge the mean-pooled initial time sequence feature information with the dimensionality-reduced action feature information to obtain the time sequence feature information corresponding to the target object.
In some embodiments, when determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, the scene and time sequence feature determination module 630 is further configured to:
use the obtained time sequence feature information corresponding to the target object as new initial time sequence feature information, and return to the step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
An embodiment of the present disclosure discloses an electronic device. As shown in FIG. 7, it includes a processor 701 and a storage medium 702 connected to each other, the storage medium storing machine-readable instructions executable by the processor. When the electronic device runs, the processor executes the machine-readable instructions to perform the steps of the above action recognition method. Specifically, the processor 701 and the storage medium 702 may be connected through a bus 703.
When the machine-readable instructions are executed by the processor 701, the following steps of the action recognition method are performed:
obtaining a video clip;
determining the action feature information of the target object based on the object border of the target object in a key frame image of the video clip;
determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information;
determining the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
In addition, when executed by the processor 701, the machine-readable instructions may also carry out the method content of any of the embodiments described in the method section above, which is not repeated here.
An embodiment of the present disclosure further provides a computer program product corresponding to the above method and apparatus, comprising a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the method in the foregoing method embodiments. For the specific implementation, refer to the method embodiments, which are not repeated here. The computer-readable storage medium may be a volatile or non-volatile storage medium.
The above descriptions of the various embodiments tend to emphasize the differences between them; for their common or similar aspects, the embodiments may be referred to one another, and for brevity these are not repeated herein.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system and apparatus described above, reference may be made to the corresponding processes in the method embodiments, which are not repeated in this disclosure. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into modules is only a division by logical function, and there may be other ways of division in actual implementation; as another example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or modules, and may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present disclosure.
In addition, the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any person skilled in the art could readily conceive of changes or substitutions within the technical scope disclosed by the present disclosure, and all such changes or substitutions shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Industrial applicability
The present disclosure provides an action recognition method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: obtaining a video clip; determining the action feature information of a target object based on the object border of the target object in a key frame image of the video clip; determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information; and determining the action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.

Claims (21)

  1. An action recognition method, comprising:
    obtaining a video clip;
    determining action feature information of a target object based on an object border of the target object in a key frame image of the video clip;
    determining scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information;
    determining an action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
  2. The action recognition method according to claim 1, further comprising the step of determining the object border in the key frame image:
    filtering a key frame image from the video clip;
    performing object detection on the filtered key frame image to determine an initial object bounding box of the target object in the key frame image;
    expanding the initial object bounding box according to preset expansion size information to obtain the object border of the target object in the key frame image.
  3. The action recognition method according to claim 1 or 2, wherein determining the action feature information of the target object based on the object border of the target object in the key frame image of the video clip comprises:
    filtering out, for the key frame image, multiple associated images corresponding to the key frame image from the video clip;
    cropping, according to the object border corresponding to the key frame image, partial images from at least some of the associated images corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image;
    determining the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
  4. The action recognition method according to claim 3, wherein filtering out the multiple associated images corresponding to the key frame image from the video clip comprises:
    selecting from the video clip a first sub-video clip that includes the key frame image, the first sub-video clip further including N images temporally adjacent to the key frame image, where N is a positive integer;
    filtering the multiple associated images from the first sub-video clip.
  5. The action recognition method according to claim 3, wherein after the multiple target object images are obtained and before the action feature information of the target object is determined, the method further comprises:
    setting each target object image to an image with a preset image resolution.
  6. The action recognition method according to any one of claims 1 to 5, wherein determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information comprises:
    filtering out, for the key frame image, multiple associated images corresponding to the key frame image from the video clip;
    performing a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information;
    performing a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information;
    determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
  7. The action recognition method according to claim 6, wherein performing the time sequence feature extraction operation on objects other than the target object in the video clip to obtain the initial time sequence feature information comprises:
    selecting from the video clip, for the key frame image, a second sub-video clip that includes the key frame image, the second sub-video clip further including P images temporally adjacent to the key frame image, where P is a positive integer;
    extracting, from the images in the second sub-video clip, action features of objects other than the target object, and using the obtained action features as the initial time sequence feature information.
  8. The action recognition method according to claim 6 or 7, wherein determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information comprises:
    performing dimensionality reduction on the initial time sequence feature information and the action feature information, respectively;
    performing a mean pooling operation on the dimensionality-reduced initial time sequence feature information;
    merging the mean-pooled initial time sequence feature information with the dimensionality-reduced action feature information to obtain the time sequence feature information corresponding to the target object.
  9. The action recognition method according to claim 8, wherein determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information further comprises:
    using the obtained time sequence feature information corresponding to the target object as new initial time sequence feature information, and returning to the step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
  10. An action recognition apparatus, comprising:
    a video acquisition module, configured to acquire a video clip;
    an action feature determination module, configured to determine action feature information of a target object based on an object border of the target object in a key frame image of the video clip;
    a scene and time sequence feature determination module, configured to determine scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information;
    an action recognition module, configured to determine an action type of the target object based on the action feature information, the scene feature information, and the time sequence feature information.
  11. The action recognition apparatus according to claim 10, wherein the action feature determination module is further configured to determine the object border in the key frame image by:
    filtering a key frame image from the video clip;
    performing object detection on the filtered key frame image to determine an initial object bounding box of the target object in the key frame image;
    expanding the initial object bounding box according to preset expansion size information to obtain the object border of the target object in the key frame image.
  12. The action recognition apparatus according to claim 10 or 11, wherein, when determining the action feature information of the target object based on the object border of the target object in the key frame image of the video clip, the action feature determination module is configured to:
    filter out, for the key frame image, multiple associated images corresponding to the key frame image from the video clip;
    crop, according to the object border corresponding to the key frame image, partial images from at least some of the associated images corresponding to the key frame image, to obtain multiple target object images corresponding to the key frame image;
    determine the action feature information of the target object based on the multiple target object images corresponding to the key frame image.
  13. The action recognition apparatus according to claim 12, wherein, when filtering out the multiple associated images corresponding to the key frame image from the video clip, the action feature determination module is configured to:
    select from the video clip a first sub-video clip that includes the key frame image, the first sub-video clip further including N images temporally adjacent to the key frame image, where N is a positive integer;
    filter the multiple associated images from the first sub-video clip.
  14. The action recognition apparatus according to claim 12, wherein, after the multiple target object images are obtained and before the action feature information of the target object is determined, the action feature determination module is further configured to:
    set each target object image to an image with a preset image resolution.
  15. The action recognition apparatus according to any one of claims 10 to 14, wherein, when determining the scene feature information and time sequence feature information corresponding to the target object based on the video clip and the action feature information, the scene and time sequence feature determination module is configured to:
    filter out, for the key frame image, multiple associated images corresponding to the key frame image from the video clip;
    perform a video scene feature extraction operation on at least some of the associated images to obtain the scene feature information;
    perform a time sequence feature extraction operation on objects other than the target object in the video clip to obtain initial time sequence feature information;
    determine the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information.
  16. The action recognition apparatus according to claim 15, wherein, when performing the time sequence feature extraction operation on objects other than the target object in the video clip to obtain the initial time sequence feature information, the scene and time sequence feature determination module is configured to:
    select from the video clip, for the key frame image, a second sub-video clip that includes the key frame image, the second sub-video clip further including P images temporally adjacent to the key frame image, where P is a positive integer;
    extract, from the images in the second sub-video clip, action features of objects other than the target object, and use the obtained action features as the initial time sequence feature information.
  17. The action recognition apparatus according to claim 15 or 16, wherein, when determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, the scene and time sequence feature determination module is configured to:
    perform dimensionality reduction on the initial time sequence feature information and the action feature information, respectively;
    perform a mean pooling operation on the dimensionality-reduced initial time sequence feature information;
    merge the mean-pooled initial time sequence feature information with the dimensionality-reduced action feature information to obtain the time sequence feature information corresponding to the target object.
  18. The action recognition apparatus according to claim 17, wherein, when determining the time sequence feature information corresponding to the target object based on the initial time sequence feature information and the action feature information, the scene and time sequence feature determination module is further configured to:
    use the obtained time sequence feature information corresponding to the target object as new initial time sequence feature information, and return to the step of performing dimensionality reduction on the initial time sequence feature information and the action feature information respectively.
  19. An electronic device, comprising a processor and a storage medium connected to each other, the storage medium storing machine-readable instructions executable by the processor, wherein, when the electronic device runs, the processor executes the machine-readable instructions to perform the action recognition method according to any one of claims 1 to 9.
  20. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is run by a processor, the action recognition method according to any one of claims 1 to 9 is performed.
  21. A computer program, comprising computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device executes instructions for implementing the method according to any one of claims 1 to 9.
PCT/CN2021/077268 2020-03-11 2021-02-22 Action recognition method and apparatus, electronic device, and computer-readable storage medium WO2021179898A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021562324A JP2022529299A (en) 2020-03-11 2021-02-22 Operation identification methods and devices, electronic devices, computer readable storage media
KR1020217036106A KR20210145271A (en) 2020-03-11 2021-02-22 Motion recognition method and apparatus, electronic device, computer readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010166148.8 2020-03-11
CN202010166148.8A CN111401205B (en) 2020-03-11 2020-03-11 Action recognition method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2021179898A1 true WO2021179898A1 (en) 2021-09-16

Family

ID=71432295

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/077268 WO2021179898A1 (en) 2020-03-11 2021-02-22 Action recognition method and apparatus, electronic device, and computer-readable storage medium

Country Status (5)

Country Link
JP (1) JP2022529299A (en)
KR (1) KR20210145271A (en)
CN (1) CN111401205B (en)
TW (1) TW202135002A (en)
WO (1) WO2021179898A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401205B (en) * 2020-03-11 2022-09-23 深圳市商汤科技有限公司 Action recognition method and device, electronic equipment and computer readable storage medium
US11270147B1 (en) 2020-10-05 2022-03-08 International Business Machines Corporation Action-object recognition in cluttered video scenes using text
CN112800278B (en) * 2021-03-30 2021-07-09 腾讯科技(深圳)有限公司 Video type determination method and device and electronic equipment
US11423252B1 (en) 2021-04-29 2022-08-23 International Business Machines Corporation Object dataset creation or modification using labeled action-object videos
CN114120180B (en) * 2021-11-12 2023-07-21 北京百度网讯科技有限公司 Time sequence nomination generation method, device, equipment and medium
TWI797014B (en) * 2022-05-16 2023-03-21 國立虎尾科技大學 Table tennis pose classifying method and table tennis interaction system
CN116824641B (en) * 2023-08-29 2024-01-09 卡奥斯工业智能研究院(青岛)有限公司 Gesture classification method, device, equipment and computer storage medium

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101334845B (en) * 2007-06-27 2010-12-22 中国科学院自动化研究所 Video frequency behaviors recognition method based on track sequence analysis and rule induction
CN101236656B (en) * 2008-02-29 2011-06-15 上海华平信息技术股份有限公司 Movement target detection method based on block-dividing image
CN101826155B (en) * 2010-04-02 2012-07-25 浙江大学 Method for identifying act of shooting based on Haar characteristic and dynamic time sequence matching
US8855369B2 (en) * 2012-06-22 2014-10-07 Microsoft Corporation Self learning face recognition using depth based tracking for database generation and update
JP6393495B2 (en) * 2014-03-20 2018-09-19 日本ユニシス株式会社 Image processing apparatus and object recognition method
EP3107069A4 (en) * 2014-03-24 2017-10-04 Hitachi, Ltd. Object detection apparatus, object detection method, and mobile robot
JP6128501B2 (en) * 2016-03-17 2017-05-17 ヤフー株式会社 Time-series data analysis device, time-series data analysis method, and program
JP2017187850A (en) * 2016-04-01 2017-10-12 株式会社リコー Image processing system, information processing device, and program
US10997421B2 (en) * 2017-03-30 2021-05-04 Hrl Laboratories, Llc Neuromorphic system for real-time visual activity recognition
JP6773061B2 (en) * 2018-02-16 2020-10-21 新東工業株式会社 Evaluation system, evaluation device, evaluation method, evaluation program, and recording medium
CN108537829B (en) * 2018-03-28 2021-04-13 哈尔滨工业大学 Monitoring video personnel state identification method
CN110147711B (en) * 2019-02-27 2023-11-14 腾讯科技(深圳)有限公司 Video scene recognition method and device, storage medium and electronic device
CN110414335A (en) * 2019-06-20 2019-11-05 北京奇艺世纪科技有限公司 Video frequency identifying method, device and computer readable storage medium
CN110826447A (en) * 2019-10-29 2020-02-21 北京工商大学 Restaurant kitchen staff behavior identification method based on attention mechanism

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105183758A (en) * 2015-07-22 2015-12-23 深圳市万姓宗祠网络科技股份有限公司 Content recognition method for continuously recorded video or image
CN109492581A (en) * 2018-11-09 2019-03-19 中国石油大学(华东) A kind of human motion recognition method based on TP-STG frame
CN109800689A (en) * 2019-01-04 2019-05-24 西南交通大学 A kind of method for tracking target based on space-time characteristic fusion study
CN110427807A (en) * 2019-06-21 2019-11-08 诸暨思阔信息科技有限公司 A kind of temporal events motion detection method
CN110309784A (en) * 2019-07-02 2019-10-08 北京百度网讯科技有限公司 Action recognition processing method, device, equipment and storage medium
CN111401205A (en) * 2020-03-11 2020-07-10 深圳市商汤科技有限公司 Action recognition method and device, electronic equipment and computer readable storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114268849A (en) * 2022-01-29 2022-04-01 北京卡路里信息技术有限公司 Video processing method and device
CN115229804A (en) * 2022-09-21 2022-10-25 荣耀终端有限公司 Method and device for attaching component
CN115229804B (en) * 2022-09-21 2023-02-17 荣耀终端有限公司 Method and device for attaching component
WO2024082943A1 (en) * 2022-10-20 2024-04-25 腾讯科技(深圳)有限公司 Video detection method and apparatus, storage medium, and electronic device
CN117711014A (en) * 2023-07-28 2024-03-15 荣耀终端有限公司 Method and device for identifying space-apart gestures, electronic equipment and readable storage medium
CN117711014B (en) * 2023-07-28 2024-08-27 荣耀终端有限公司 Method and device for identifying space-apart gestures, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
KR20210145271A (en) 2021-12-01
TW202135002A (en) 2021-09-16
CN111401205B (en) 2022-09-23
JP2022529299A (en) 2022-06-20
CN111401205A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
WO2021179898A1 (en) Action recognition method and apparatus, electronic device, and computer-readable storage medium
CN110569721B (en) Recognition model training method, image recognition method, device, equipment and medium
CN105243395B (en) A kind of human body image comparison method and device
WO2016127478A1 (en) Image processing method and device, and terminal
WO2016187888A1 (en) Keyword notification method and device based on character recognition, and computer program product
KR102087882B1 (en) Device and method for media stream recognition based on visual image matching
US9213898B2 (en) Object detection and extraction from image sequences
US11871125B2 (en) Method of processing a series of events received asynchronously from an array of pixels of an event-based light sensor
JP7419080B2 (en) computer systems and programs
JP2018170003A (en) Detection device and method for event in video, and image processor
KR101821989B1 (en) Method of providing detection of moving objects in the CCTV video data by reconstructive video processing
US10296539B2 (en) Image extraction system, image extraction method, image extraction program, and recording medium storing program
US20210084198A1 (en) Method and apparatus for removing video jitter
US20160110909A1 (en) Method and apparatus for creating texture map and method of creating database
CN108921150B (en) Face recognition system based on network hard disk video recorder
JP2010257267A (en) Device, method and program for detecting object area
US9392146B2 (en) Apparatus and method for extracting object
JPWO2018179119A1 (en) Video analysis device, video analysis method, and program
CN111860559A (en) Image processing method, image processing device, electronic equipment and storage medium
KR20210067824A (en) Method for single image dehazing based on deep learning, recording medium and device for performing the method
CN112232113B (en) Person identification method, person identification device, storage medium, and electronic apparatus
US10339660B2 (en) Video fingerprint system and method thereof
CN103268606B (en) A kind of depth information compensation method of motion blur image and device
CN109492755B (en) Image processing method, image processing apparatus, and computer-readable storage medium
KR101826463B1 (en) Method and apparatus for synchronizing time line of videos

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021562324

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21768369

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20217036106

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.01.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21768369

Country of ref document: EP

Kind code of ref document: A1