CN114360053A - Action recognition method, terminal and storage medium - Google Patents

Action recognition method, terminal and storage medium

Info

Publication number
CN114360053A
CN114360053A CN202111534245.9A
Authority
CN
China
Prior art keywords
attention
feature vector
target
image frame
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111534245.9A
Other languages
Chinese (zh)
Inventor
闫浩
张锲石
程俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202111534245.9A priority Critical patent/CN114360053A/en
Publication of CN114360053A publication Critical patent/CN114360053A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application is applicable to the technical field of machine vision, and provides an action recognition method, a terminal and a storage medium, wherein the method comprises the following steps: acquiring feature vector representation of a target video; determining an attention weight of a foreground image frame in the target video according to a preset attention dynamic threshold, wherein the attention weight is used for indicating the probability that the image frame contains actions; obtaining a target feature vector according to the attention weight of the foreground image frame and the feature vector representation; and inputting the target characteristic vector into a motion recognition network for motion recognition to obtain a motion recognition result in the target video. The scheme can improve the reliability of image action recognition and ensure the accuracy of the video action recognition result.

Description

Action recognition method, terminal and storage medium
Technical Field
The application belongs to the technical field of machine vision, and particularly relates to a motion recognition method, a terminal and a storage medium.
Background
With the rapid growth in the number of videos, computer vision technology has received extensive attention from researchers. Techniques that extract the frames containing an action from an untrimmed video to accomplish the action localization task have wide application in video surveillance, autonomous driving, video description, video search, human-computer interaction, automatic sports commentary, patient monitoring and other fields.
In the prior art, strongly supervised action localization methods have made great progress, but they require manually labeling each action and the time at which it occurs, which is very time-consuming and labor-intensive and difficult to adapt to most real-world scenarios. Weakly supervised methods have therefore been proposed: weakly supervised action localization requires only video-level labels, which greatly reduces the cost of annotating video data and avoids human annotation bias.
However, existing weakly supervised action localization methods usually select a certain number of video frames for each video and implement the localization task by judging, through a classification task, whether a video segment is an action segment. Such methods can only identify the most easily recognized action and background segments in a video; apart from these segments, the large number of ambiguous and noisy video segments arising in real life cannot be correctly recognized.
Disclosure of Invention
The embodiments of the application provide an action recognition method, a terminal and a storage medium, which are used to solve the problems that existing weakly supervised action localization methods can only identify the most easily recognized action and background segments in a video, while a large number of ambiguous video segments cannot be correctly recognized.
A first aspect of an embodiment of the present application provides an action recognition method, including:
acquiring feature vector representation of a target video;
determining an attention weight of a foreground image frame in the target video according to a preset attention dynamic threshold, wherein the attention weight is used for indicating the probability that the image frame contains actions;
obtaining a target feature vector according to the attention weight of the foreground image frame and the feature vector representation;
and inputting the target characteristic vector into a motion recognition network for motion recognition to obtain a motion recognition result in the target video.
A second aspect of an embodiment of the present application provides an action recognition apparatus, including:
the first acquisition module is used for acquiring the feature vector representation of the target video;
the determining module is used for determining the attention weight of a foreground image frame in the target video according to a preset attention dynamic threshold, wherein the attention weight is used for indicating the probability that the image frame contains actions;
the second acquisition module is used for obtaining a target feature vector according to the attention weight of the foreground image frame and the feature vector representation;
and the action recognition module is used for inputting the target characteristic vector into an action recognition network for action recognition to obtain an action recognition result in the target video.
A third aspect of embodiments of the present application provides a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method according to the first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, which, when executed by a processor, performs the steps of the method according to the first aspect.
A fifth aspect of the present application provides a computer program product, which, when run on a terminal, causes the terminal to perform the steps of the method of the first aspect described above.
As can be seen from the above, in the embodiment of the present application, a feature vector representation of a target video is obtained; the attention weight of the foreground image frames in the target video is determined according to a preset attention dynamic threshold; a target feature vector is obtained according to the attention weight of the foreground image frames and the feature vector representation; and the target feature vector is input into a motion recognition network for motion recognition to obtain the motion recognition result in the target video. This process combines an attention mechanism with an attention dynamic threshold to recognize the foreground image frames in the video, thereby distinguishing foreground from background, and combines the attention weight of the foreground image frames with the feature vector representation to perform attention weighting, so that the attention-weighted target feature vector is obtained and the final motion recognition is realized. By fusing the attention mechanism and the attention weighting mechanism on the basis of the feature sequence, irrelevant image frames are pruned and the image features are highlighted, which improves the reliability of image action recognition and ensures the accuracy of the video action recognition result.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a first flowchart of a motion recognition method according to an embodiment of the present application;
fig. 2 is a block diagram of a framework for performing motion recognition based on a target video according to an embodiment of the present disclosure;
fig. 3 is a flowchart ii of an action recognition method according to an embodiment of the present application;
fig. 4 is a structural diagram of a motion recognition apparatus according to an embodiment of the present application;
fig. 5 is a structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In particular implementations, the terminals described in embodiments of the present application include, but are not limited to, other portable devices such as mobile phones, laptop computers, or tablet computers having touch sensitive surfaces (e.g., touch screen displays and/or touch pads). It should also be understood that in some embodiments, the device is not a portable communication device, but is a desktop computer having a touch-sensitive surface (e.g., a touch screen display and/or touchpad).
In the discussion that follows, a terminal that includes a display and a touch-sensitive surface is described. However, it should be understood that the terminal may include one or more other physical user interface devices such as a physical keyboard, mouse, and/or joystick.
The terminal supports various applications, such as one or more of the following: a drawing application, a presentation application, a word processing application, a website creation application, a disc burning application, a spreadsheet application, a gaming application, a telephone application, a video conferencing application, an email application, an instant messaging application, an exercise support application, a photo management application, a digital camera application, a web browsing application, a digital music player application, and/or a digital video player application.
Various applications that may be executed on the terminal may use at least one common physical user interface device, such as a touch-sensitive surface. One or more functions of the touch-sensitive surface and corresponding information displayed on the terminal can be adjusted and/or changed between applications and/or within respective applications. In this way, a common physical architecture (e.g., touch-sensitive surface) of the terminal can support various applications with user interfaces that are intuitive and transparent to the user.
It should be understood that, the sequence numbers of the steps in this embodiment do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation to the implementation process of the embodiment of the present application.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
Referring to fig. 1, fig. 1 is a first flowchart of a motion recognition method according to an embodiment of the present application. As shown in fig. 1, a motion recognition method includes the steps of:
step 101, obtaining a feature vector representation of a target video.
For different applications, the objects to be recognized differ: some are speech, some are images, and some are sensor data. They all have corresponding digital representations in a computer, and usually we convert them into a feature vector and then input it into a neural network.
Each piece of data input to the neural network is called a feature, the image features are represented by a vector with set dimensions, a feature vector representation is obtained, and subsequent processing operations are performed based on the feature vector representation.
Specifically, when the feature vector representation of the target video is obtained, the feature vector representation of each image frame in the video may be obtained first, and the feature vector representation corresponding to the target video as a whole may be obtained based on the feature vector representation of each image frame.
Or, feature processing may be performed from different video feature angles, so as to obtain feature vector representations of the target video in different feature dimensions.
For example, in one embodiment, as shown in conjunction with fig. 2, the feature vector representation includes a first feature vector and a second feature vector; correspondingly, obtaining a feature vector representation of a target video comprises:
dividing the target video into T non-overlapping segments; t is an integer greater than 1;
extracting a feature sequence of the image frame in each segment through a feature extractor to obtain RGB (red, green, blue) features and optical flow features of each segment; and respectively outputting the RGB features and the optical flow features to a feature embedding module to obtain first feature vectors corresponding to the RGB features and second feature vectors corresponding to the optical flow features.
In this process, a frame-level feature sequence extracted from the video needs to be obtained. Specifically, the target video is divided into T non-overlapping segments, each of which is composed of, for example, 16 frames of images, so as to segment a video with a large number of frames reasonably. T may take different values for different videos. The RGB frames and optical flow contained in the T segments of the video are obtained to form an RGB stream and an optical flow stream; each stream is then input into a pre-trained I3D network for feature extraction, yielding 1 × D-dimensional RGB and optical flow segment-level features.
In order to make the method suitable for the weakly supervised temporal action localization task, the features may be further passed into a new convolutional layer (specifically, for example, a temporal convolutional layer) for convolution processing, so as to realize feature embedding and obtain a new set of features x_i, likewise of dimension 1 × D for each of the T segments. In the subsequent processing, the first feature vector of the RGB features and the second feature vector of the optical flow features are sent to two different streams for independent action recognition, producing two independent action recognition results, which can then be fused to obtain the action recognition result for the whole video.
Wherein optionally the neural network structures involved in the two processing flows have the same design, but they do not share parameters.
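To make this two-stream feature-embedding step concrete, the following is a minimal PyTorch-style sketch. It assumes segment-level I3D features of dimension D = 1024 have already been extracted; the kernel size of the temporal convolution, the ReLU activation, and the example value of T are illustrative assumptions, not details fixed by this description.
# Hedged sketch: feature embedding for the RGB and optical-flow streams.
# Assumes pre-extracted I3D segment features of shape (T, D); kernel size,
# activation and D = 1024 are illustrative choices, not specified above.
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        # temporal (1-D) convolution over the T segments, as described above
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, feats):            # feats: (T, D) segment-level features
        x = feats.t().unsqueeze(0)       # reshape to (1, D, T) for Conv1d
        x = self.relu(self.conv(x))
        return x.squeeze(0).t()          # back to (T, D): embedded features x_i

# two streams with identical structure but unshared parameters
embed_rgb, embed_flow = FeatureEmbedding(), FeatureEmbedding()
rgb_feats = embed_rgb(torch.randn(20, 1024))   # T = 20 segments (example)
flow_feats = embed_flow(torch.randn(20, 1024))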
And step 102, determining the attention weight of a foreground image frame in the target video according to a preset attention dynamic threshold.
Wherein the attention weight is used to indicate a probability that the image frame contains an action.
An unclipped video contains background and motion, and we need to focus on segments that may contain motion, and delete segments that may contain background.
In this step, an attention mechanism and an attention dynamic threshold need to be introduced to perform foreground and background recognition on image frames in the target video, and specifically, foreground image frames meeting a threshold condition are selected from each image frame through the attention dynamic threshold so as to realize preliminary recognition and screening of the possibility of whether the image frames contain actions.
The attention dynamic threshold here is an adjustable decision value. By using the dynamic threshold, the number of frames selected as foreground image frames through the attention mechanism differs from video to video, which gives the method better adaptability to different videos and improves the accuracy of image localization and action recognition.
By way of implementation and not limitation, the determining the attention weight of the foreground image frame in the target video according to the preset attention dynamic threshold includes:
inputting the feature vector representation to an attention mechanism module to obtain the attention weight corresponding to each image frame in the target video; and selecting foreground image frames from the target video and determining their attention weights by using a preset attention dynamic threshold in combination with the magnitude relationship between the attention weight and the attention dynamic threshold.
Here, the attention mechanism module processes the feature vector representation: the feature vector representation is passed into a fully connected layer network, which applies the attention mechanism to it. Specifically, the following formula may be adopted:
A_i = σ(w_A · x_i + b_A);
where σ(·) denotes the sigmoid function, w_A is the weight vector used in the attention mechanism, b_A is the attention bias used in the attention mechanism, and A_i is the generated attention weight. A_i takes a value between 0 and 1 and indicates the likelihood that the i-th image frame contains an action: a value of 0 represents background, a value of 1 represents an action, and values between 0 and 1 represent the probability that an action is present. The fully connected layer network serves as the attention mechanism module to process the feature vector representation and obtains the attention weight of each image frame.
The attention weight of the image frames in the target video may be calculated by the attention mechanism module for the whole target video, or calculated by the attention mechanism module for the image frames in each segment based on the segment selected from the target video.
After the attention weight corresponding to the image frame in the target video is obtained, the attention weight of the foreground image frame in the target video can be determined according to a preset attention dynamic threshold value and by combining the magnitude relation between the attention weight and the attention dynamic threshold value.
Specifically, the image frames with the attention weight larger than the attention dynamic threshold are selected as foreground image frames, the image frames with the attention weight smaller than or equal to the attention dynamic threshold are selected as background image frames, and meanwhile the attention weight corresponding to the foreground image frames is determined, so that the foreground and the background are accurately distinguished.
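A minimal sketch of this foreground/background split is given below; it assumes the attention weights A from the sketch above and an already-determined dynamic threshold k_n (the value 0.5 is only an example).
# Hedged sketch: split image frames into foreground and background
# by comparing each attention weight with the dynamic threshold.
import torch

def select_foreground(A, k_n):
    # A: (T,) attention weights; k_n: attention dynamic threshold (scalar)
    fg_mask = A > k_n                              # frames above the threshold
    fg_indices = fg_mask.nonzero(as_tuple=True)[0] # indices of foreground frames
    fg_weights = A[fg_mask]                        # their attention weights
    return fg_indices, fg_weights, fg_mask

idx, w, mask = select_foreground(torch.rand(20), k_n=0.5)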
And 103, obtaining a target feature vector according to the attention weight and feature vector representation of the foreground image frame.
After the foreground image frames with higher action probability are selected, their attention weights can be used to weight the feature vector representation of the target video, so as to obtain an attention-weighted target feature vector. This target feature vector constitutes the foreground feature vector of the target video, allowing the foreground features to be focused on more accurately and the actions in the video to be recognized and predicted.
The target feature vector may be obtained by performing weighting calculation on the attention weight of the foreground image frame and then multiplying the result by the feature vector representation.
Specifically, in an alternative embodiment, obtaining the target feature vector according to the attention weight and the feature vector representation of the foreground image frame includes:
averaging the attention weights of the foreground image frames to obtain a target attention weight corresponding to the target video; and multiplying the target attention weight by the feature vector expression to obtain a target feature vector.
For example, when the attention dynamic threshold is set to 0.5, the frames whose attention weight exceeds 0.5 are taken from the image frames of the target video as foreground image frames that are preliminarily considered likely to contain an action. Each frame corresponds to a time, so the start time and end time of the action can be obtained. The average of the action likelihoods of the image frames within the action segment defined by the start and end times is then taken as the attention weight corresponding to the whole video containing the coherent action, so as to improve the accuracy of subsequent action recognition.
Here, the foreground feature x_fg may be generated by applying an attention-weighted pooling layer to the feature vectors. The calculation equation is:
x_fg = (1/T) · Σ_{i=1}^{T} A_i · x_i;
where A_i is the attention weight of the i-th foreground image frame, T is the number of foreground image frames, and x_i is the feature vector representation of the target video.
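Combining the two previous sketches, the attention-weighted pooling above can be approximated as follows. This is a sketch under the stated assumptions; the averaging over the foreground frames follows the description and equation given above, and the empty-foreground fallback is an added assumption.
# Hedged sketch: attention-weighted pooling of foreground frames into x_fg.
import torch

def attention_pooling(feats, A, k_n):
    # feats: (T, D) feature vectors x_i; A: (T,) attention weights; k_n: threshold
    fg_mask = A > k_n
    if fg_mask.sum() == 0:                  # no frame exceeds the threshold
        return feats.mean(dim=0)            # fallback (an assumption, not from the text)
    fg_A = A[fg_mask]                       # attention weights of the foreground frames
    fg_x = feats[fg_mask]                   # their feature vectors
    # average of the attention-weighted foreground features
    return (fg_A.unsqueeze(-1) * fg_x).sum(dim=0) / fg_mask.sum()

x_fg = attention_pooling(torch.randn(20, 1024), torch.rand(20), k_n=0.5)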
Corresponding to this process, the determination of the attention dynamic threshold preset in step 102 may be implemented by using the target attention weight calculated in the process.
After model training is finished, the preset attention dynamic threshold can be selected, during model application, according to the optimal threshold obtained in training, or dynamically selected according to the optimal thresholds obtained in model training for different video lengths, different action objects to be recognized, and the like.
In the model training process, the target video is a video in a training sample set, and the video in the training sample set has an action category label; the preset attention dynamic threshold value is obtained by the following method:
The attention dynamic threshold is calculated by using
k_n = α_n / Σ_{m=1}^{M} α_m;
where k_n is the attention dynamic threshold, α_n is the target attention weight generated in the previous model training pass, M is the number of model training iterations already executed (M is a positive integer), and Σ_{m=1}^{M} α_m is the sum of the target attention weights generated over the M executed model training iterations.
That is, the attention dynamic threshold value adopted by the current target video in the motion recognition process is calculated based on the target attention weight (i.e., the average attention value) corresponding to the video that has been previously subjected to sample training in the model.
In the iterative training of the model, the attention dynamic threshold for the current training pass is determined from the target attention weight generated in the previous model training pass and the sum of the target attention weights generated in the executed historical training iterations (specifically, M iterations); specifically, the ratio of the two is used as the attention dynamic threshold for the current training pass so as to carry out the current training of the model.
In this process, the attention dynamic threshold in the model is adjusted and optimized as the videos serving as training samples in the training sample set are fed into the model for training.
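A sketch of how the attention dynamic threshold could be maintained across training iterations, following the ratio described above; the initial threshold value and the update schedule are assumptions.
# Hedged sketch: maintain the attention dynamic threshold k_n across training iterations.
class DynamicThreshold:
    def __init__(self, init=0.5):
        self.history = []           # target attention weights of executed iterations
        self.k_n = init             # threshold before any history exists (assumed)

    def update(self, alpha_n):
        # alpha_n: target attention weight generated by the previous training pass
        self.history.append(float(alpha_n))
        # ratio of the previous target attention weight to the sum over M iterations
        self.k_n = float(alpha_n) / sum(self.history)
        return self.k_n

thresh = DynamicThreshold()
for alpha in (0.62, 0.58, 0.71):    # example target attention weights
    k_n = thresh.update(alpha)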
Further, during the training process of the model, the model can be optimized in parameters in various ways.
In an optional embodiment, after inputting the feature vector representation to the attention mechanism module and obtaining the attention weight corresponding to the image frame in the target video, the method further includes:
performing parameter optimization on the attention mechanism module by using an attention loss function;
wherein the attention loss function is:
L_att = (k_n / m) · [ Σ_{a ∈ A_top} (1 − a) + Σ_{a ∈ A_bottom} a ];
where m is a hyper-parameter, a is an attention weight corresponding to an image frame in the target video, A is the set of attention weights corresponding to the image frames in the target video, A_top is the subset of A containing the m/k_n largest attention weights, and A_bottom is the subset of A containing the m/k_n smallest attention weights.
In the attention loss function, the attention weights corresponding to the image frames in the target video are sorted in descending order of value; the m/k_n attention weights ranked first (i.e., the larger values, A_top) and the m/k_n attention weights ranked last (i.e., the smaller values, A_bottom) are selected and jointly used to impose the optimization constraint on the model parameters.
The proposed attention loss function, combined with the attention dynamic threshold, forces the top attention weights toward 1 and the bottom attention weights toward 0. The number of maximum and minimum attention weights entering the loss is adjusted dynamically according to the attention values of each video, so that the selection of the maximum and minimum attention values is determined dynamically, actions and background are better distinguished, and the accuracy of the model's action recognition is improved.
In the above processing, in order to improve the flexibility of the attention mechanism, a dynamic attention threshold is introduced to automatically adjust, for each video in the training sample set, the threshold separating foreground image frames from background image frames, so that the attention mechanism can better approach the extreme attention values of different videos.
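Under the reconstruction of the attention loss given above, a minimal sketch follows; the exact normalization and the value of the hyper-parameter m are assumptions.
# Hedged sketch of the attention loss: push the m/k_n largest attention weights
# toward 1 and the m/k_n smallest attention weights toward 0.
import torch

def attention_loss(A, k_n, m=8):
    # A: (T,) attention weights of one video; k_n: dynamic threshold; m: hyper-parameter
    s = max(1, min(int(m / k_n), A.numel()))   # number of weights taken at each end
    top_vals, _ = A.topk(s)                    # the largest attention weights
    bot_vals, _ = (-A).topk(s)                 # negate to pick the smallest weights
    return (1.0 - top_vals).mean() + (-bot_vals).mean()

loss = attention_loss(torch.rand(20), k_n=0.5)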
And 104, inputting the target characteristic vector into a motion recognition network for motion recognition to obtain a motion recognition result in the target video.
The action recognition network is a multi-class classifier that predicts, based on the input attention-weighted target feature vector, the probability that the target video contains each type of action. For example, if there are c action classes, the network outputs a 1 × c prediction vector such as [0.1, 0.2, 0.1, 0.7, 0.4] (here c = 5), where each value between 0 and 1 represents the probability of the corresponding action, thereby realizing multi-action classification of the target video. Alternatively, the action recognition network is a classifier that predicts, based on the input attention-weighted target feature vector, the probability that the target video contains an action of a specific type, so as to recognize that specific action.
Referring to fig. 2, the motion recognition network is specifically the softmax layer of a fully connected layer (FC): the target feature vector (i.e., the foreground feature x_fg) is input into the softmax layer of the fully connected layer to obtain the final video-level classification result Y.
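A sketch of this classification head; the number of action classes and the feature dimension are assumptions.
# Hedged sketch: video-level classification from the foreground feature x_fg.
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, dim=1024, num_classes=20):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)          # fully connected (FC) layer

    def forward(self, x_fg):                           # x_fg: (D,) foreground feature
        return torch.softmax(self.fc(x_fg), dim=-1)    # video-level class scores Y

classifier = ActionClassifier()
Y = classifier(torch.randn(1024))                      # probabilities over action classes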
As a specific embodiment, the target feature vector includes a first target feature vector generated based on the first feature vector and a second target feature vector generated based on the second feature vector; correspondingly, the target feature vector is input into a motion recognition network for motion recognition, and a motion recognition result in the target video is obtained, wherein the motion recognition result comprises the following steps:
inputting the first target feature vector into a motion recognition network for motion recognition to obtain a first recognition result;
inputting the second target feature vector into a motion recognition network for motion recognition to obtain a second recognition result;
and fusing the first recognition result and the second recognition result according to a set proportion to obtain a motion recognition result in the target video.
In the case where, when obtaining the feature vectors of the target video, a first feature vector corresponding to the RGB features and a second feature vector corresponding to the optical flow features are obtained, step 103 needs to produce target feature vectors both from the first feature vector together with the attention weights of the foreground image frames and from the second feature vector together with the attention weights of the foreground image frames; that is, a first target feature vector generated based on the first feature vector and a second target feature vector generated based on the second feature vector are obtained (for the specific generation process, refer to the description of generating the target feature vector in step 103).
The first target feature vector and the second target feature vector are each input into the action recognition network (FC) for action recognition to obtain their respective recognition results, and then the video-level action recognition results of the RGB stream and the optical flow stream are fused in proportion to obtain the final action recognition result Y.
When the fusion is performed according to the set ratio, the prediction probability value corresponding to the first recognition result and the prediction probability value corresponding to the second recognition result may be given different weights and then summed. The set ratio is, for example, 1:1, that is, the two recognition results are given a weight value of 0.5 and then added and summed, so that a final motion recognition result corresponding to the target video can be calculated.
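The proportional fusion of the two streams can be sketched as a weighted sum; the 1:1 ratio follows the example above, and the per-stream results Y_rgb and Y_flow are assumed to come from the classifier sketch.
# Hedged sketch: fuse the RGB-stream and optical-flow-stream recognition results.
import torch

def fuse_results(Y_rgb, Y_flow, w_rgb=0.5, w_flow=0.5):
    # Y_rgb, Y_flow: (C,) per-stream class probabilities; the weights set the fusion ratio
    return w_rgb * Y_rgb + w_flow * Y_flow

Y = fuse_results(torch.rand(20), torch.rand(20))   # final video-level result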
Further, after the target feature vector is input into a motion recognition network for motion recognition to obtain a motion recognition result in the target video, the method further includes:
and performing parameter optimization on the action recognition network by using a cross entropy loss function based on the action recognition result and the action category label set for the video in the training sample set.
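For the video-level supervision described here, a cross-entropy objective can be sketched as follows; the label format and the use of unnormalized logits are assumptions.
# Hedged sketch: cross-entropy loss between the video-level prediction and the action label.
import torch
import torch.nn.functional as F

logits = torch.randn(1, 20, requires_grad=True)   # unnormalized class scores for one video
label = torch.tensor([3])                         # video-level action category label
ce_loss = F.cross_entropy(logits, label)
ce_loss.backward()                                # gradients flow back into the recognition network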
In the embodiment of the application, the feature vector representation of a target video is obtained; the attention weight of the foreground image frames in the target video is determined according to a preset attention dynamic threshold; a target feature vector is obtained according to the attention weight of the foreground image frames and the feature vector representation; and the target feature vector is input into a motion recognition network for motion recognition to obtain the motion recognition result in the target video. This process combines an attention mechanism with an attention dynamic threshold to recognize the foreground image frames in the video, thereby distinguishing foreground from background, and combines the attention weight of the foreground image frames with the feature vector representation to perform attention weighting, so that the attention-weighted target feature vector is obtained and the final motion recognition is realized. By fusing the attention mechanism and the attention weighting mechanism on the basis of the feature sequence, irrelevant image frames are pruned and the image features are highlighted, which improves the reliability of image action recognition and ensures the accuracy of the video action recognition result.
The embodiment of the application also provides different implementation modes of the action recognition method.
Referring to fig. 3, fig. 3 is a second flowchart of a motion recognition method according to an embodiment of the present application. As shown in fig. 3, a motion recognition method includes the steps of:
step 301, obtaining a feature vector representation of a target video;
the implementation process of this step is the same as that of step 101 in the foregoing embodiment, and is not described here again.
Step 302, determining an attention weight of a foreground image frame in a target video according to a preset attention dynamic threshold, wherein the attention weight is used for indicating the probability that the image frame contains an action;
the implementation process of this step is the same as that of step 102 in the foregoing embodiment, and is not described here again.
And step 303, obtaining a target feature vector according to the attention weight and feature vector representation of the foreground image frame.
The implementation process of this step is the same as the implementation process of step 103 in the foregoing embodiment, and is not described here again.
And step 304, inputting the target characteristic vector into a motion recognition network for motion recognition to obtain a motion recognition result in the target video.
The implementation process of this step is the same as that of step 104 in the foregoing embodiment, and is not described here again.
Further, after determining the attention weight of the foreground image frame in the target video according to a preset attention dynamic threshold, the method further includes:
step 305, taking the video time corresponding to the first foreground image frame in the target video as the action start time, and taking the video time corresponding to the last foreground image frame in the target video as the action end time.
Specifically, in the case of dividing the target video into T non-overlapping segments, a video time corresponding to a first foreground image frame in each segment of the target video may be specifically used as an action start time, and a video time corresponding to a last foreground image frame may be used as an action end time corresponding to a real time of the action.
And under the condition that the target video is not divided into T non-overlapping segments, directly taking the video time corresponding to the first foreground image frame in the whole target video as the action starting time, and taking the video time corresponding to the last foreground image frame in the whole target video as the action ending time.
This process determines the start and end times of the action in the video and, combined with the action recognition result in the video, forms an action proposal {(ts, te, c, s)} for the target video, where ts is the start time of the action, te is the end time of the action, c is the predicted action class, and s is the confidence value of the action proposal. This completes the full implementation of the temporal action localization method, accurately determining both the action and the time at which it occurs.
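A sketch of assembling the action proposal (ts, te, c, s) from the foreground frames and the recognition result; the mapping from segment index to time (16 frames at 25 fps) and the use of the maximum class probability as the confidence s are assumptions.
# Hedged sketch: build an action proposal (ts, te, c, s) from the foreground frames.
import torch

def make_proposal(fg_indices, Y, seconds_per_segment=16 / 25.0):
    # fg_indices: indices of foreground segments; Y: (C,) video-level class probabilities
    ts = fg_indices.min().item() * seconds_per_segment         # action start time
    te = (fg_indices.max().item() + 1) * seconds_per_segment   # action end time
    score, c = Y.max(dim=0)                                    # confidence and predicted class
    return ts, te, int(c), float(score)

proposal = make_proposal(torch.tensor([3, 4, 5, 6]), torch.rand(20))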
In the embodiment of the application, the feature vector representation of the target video is obtained; the attention weight of the foreground image frames in the target video is determined according to a preset attention dynamic threshold; the target feature vector is obtained according to the attention weight of the foreground image frames and the feature vector representation; and the target feature vector is input into the action recognition network for action recognition to obtain the action recognition result in the target video. Combined with the determination of the action start and end times, the accuracy of the video action recognition result is ensured.
Referring to fig. 4, fig. 4 is a structural diagram of a motion recognition device according to an embodiment of the present application, and for convenience of explanation, only portions related to the embodiment of the present application are shown.
The motion recognition device 400 includes:
a first obtaining module 401, configured to obtain a feature vector representation of a target video;
a determining module 402, configured to determine an attention weight of a foreground image frame in the target video according to a preset attention dynamic threshold, where the attention weight is used to indicate a probability that an image frame includes an action;
a second obtaining module 403, configured to obtain a target feature vector according to the attention weight of the foreground image frame and the feature vector representation;
and the action recognition module 404 is configured to input the target feature vector into an action recognition network for action recognition, so as to obtain an action recognition result in the target video.
The determining module 402 is specifically configured to:
inputting the feature vector representation to an attention mechanism module to obtain an attention weight corresponding to an image frame in the target video;
and selecting a foreground image frame from the target video and determining the attention weight of the foreground image frame by utilizing the preset attention dynamic threshold and combining the magnitude relation between the attention weight and the attention dynamic threshold.
The second obtaining module 403 is specifically configured to:
averaging the attention weights of the foreground image frames to obtain a target attention weight corresponding to the target video;
and multiplying the target attention weight by the feature vector representation to obtain the target feature vector.
The target video is a video in a training sample set, and the video in the training sample set is provided with an action category label; the preset attention dynamic threshold is obtained by the following method:
The attention dynamic threshold is calculated by using
k_n = α_n / Σ_{m=1}^{M} α_m;
where k_n is the attention dynamic threshold, α_n is the target attention weight generated in the previous model training pass, M is the number of model training iterations already executed (M is a positive integer), and Σ_{m=1}^{M} α_m is the sum of the target attention weights generated over the M executed model training iterations.
Wherein, the device still includes: an optimization module to:
performing parameter optimization on the attention mechanism module by using an attention loss function;
wherein the attention loss function is:
L_att = (k_n / m) · [ Σ_{a ∈ A_top} (1 − a) + Σ_{a ∈ A_bottom} a ];
where m is a hyper-parameter, a is the attention weight corresponding to an image frame in the target video, A is the set of the attention weights corresponding to image frames in the target video, A_top is the subset of A containing the m/k_n largest attention weights, and A_bottom is the subset of A containing the m/k_n smallest attention weights.
Wherein the feature vector representation comprises a first feature vector and a second feature vector; the first obtaining module 401 is specifically configured to:
dividing the target video into T non-overlapping segments; t is an integer greater than 1;
extracting a feature sequence of the image frame in each segment through a feature extractor to obtain RGB (red, green, blue) features and optical flow features of each segment;
and respectively outputting the RGB features and the optical flow features to a feature embedding module to obtain the first feature vector corresponding to the RGB features and the second feature vector corresponding to the optical flow features.
Wherein the target feature vector comprises a first target feature vector generated based on the first feature vector and a second target feature vector generated based on the second feature vector; the action recognition module 404 is specifically configured to:
inputting the first target feature vector into a motion recognition network for motion recognition to obtain a first recognition result;
inputting the second target feature vector into the action recognition network for action recognition to obtain a second recognition result;
and fusing the first recognition result and the second recognition result according to a set proportion to obtain the action recognition result in the target video.
Wherein the apparatus further comprises:
and the time determining module is used for taking the video time corresponding to the first foreground image frame in the target video as the action starting time and taking the video time corresponding to the last foreground image frame in the target video as the action ending time.
The motion recognition device provided by the embodiment of the application can realize each process of the embodiment of the motion recognition method, can achieve the same technical effect, and is not repeated here to avoid repetition.
Fig. 5 is a structural diagram of a terminal according to an embodiment of the present application. As shown in fig. 5, the terminal 5 of this embodiment includes: at least one processor 50 (only one shown in fig. 5), a memory 51, and a computer program 52 stored in the memory 51 and executable on the at least one processor 50, the steps of any of the various method embodiments described above being implemented when the computer program 52 is executed by the processor 50.
The terminal 5 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal 5 may include, but is not limited to, a processor 50, a memory 51. It will be appreciated by those skilled in the art that fig. 5 is only an example of a terminal 5 and does not constitute a limitation of the terminal 5 and may include more or less components than those shown, or some components in combination, or different components, for example the terminal may also include input output devices, network access devices, buses, etc.
The Processor 50 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 51 may be an internal storage unit of the terminal 5, such as a hard disk or a memory of the terminal 5. The memory 51 may also be an external storage device of the terminal 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) and the like provided on the terminal 5. Further, the memory 51 may also include both an internal storage unit and an external storage device of the terminal 5. The memory 51 is used for storing the computer program and other programs and data required by the terminal. The memory 51 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal and method may be implemented in other ways. For example, the above-described apparatus/terminal embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and can realize the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The present application realizes all or part of the processes in the method of the above embodiments, and may also be implemented by a computer program product, when the computer program product runs on a terminal, the steps in the above method embodiments may be implemented when the terminal executes the computer program product.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A motion recognition method, comprising:
acquiring feature vector representation of a target video;
determining an attention weight of a foreground image frame in the target video according to a preset attention dynamic threshold, wherein the attention weight is used for indicating the probability that the image frame contains actions;
obtaining a target feature vector according to the attention weight of the foreground image frame and the feature vector representation;
and inputting the target characteristic vector into a motion recognition network for motion recognition to obtain a motion recognition result in the target video.
2. The method according to claim 1, wherein the determining the attention weight of the foreground image frame in the target video according to a preset attention dynamic threshold comprises:
inputting the feature vector representation to an attention mechanism module to obtain an attention weight corresponding to an image frame in the target video;
and selecting a foreground image frame from the target video and determining the attention weight of the foreground image frame by utilizing the preset attention dynamic threshold and combining the magnitude relation between the attention weight and the attention dynamic threshold.
3. The method of claim 2, wherein deriving a target feature vector from the attention weight of the foreground image frame and the feature vector representation comprises:
averaging the attention weights of the foreground image frames to obtain a target attention weight corresponding to the target video;
and multiplying the target attention weight by the feature vector representation to obtain the target feature vector.
4. The method of claim 3, wherein the target video is a video in a training sample set, and the video in the training sample set has an action category label; the preset attention dynamic threshold is obtained by the following method:
by using
k_n = α_n / Σ_{m=1}^{M} α_m
to calculate the attention dynamic threshold;
wherein k_n is the attention dynamic threshold, α_n is the target attention weight generated in the previous model training pass, M is the number of model training iterations already executed, M is a positive integer, and Σ_{m=1}^{M} α_m is the sum of the target attention weights generated over the M executed model training iterations.
5. The method of claim 4, wherein after inputting the feature vector representation to an attention mechanism module to obtain attention weights corresponding to image frames in the target video, further comprising:
performing parameter optimization on the attention mechanism module by using an attention loss function;
wherein the attention loss function is:
L_att = (k_n / m) · [ Σ_{a ∈ A_top} (1 − a) + Σ_{a ∈ A_bottom} a ];
wherein m is a hyper-parameter, a is the attention weight corresponding to an image frame in the target video, A is the set of the attention weights corresponding to image frames in the target video, A_top is the subset of A containing the m/k_n largest attention weights, and A_bottom is the subset of A containing the m/k_n smallest attention weights.
6. The method of claim 1, wherein the feature vector representation comprises a first feature vector and a second feature vector; the obtaining of the feature vector representation of the target video comprises:
dividing the target video into T non-overlapping segments; t is an integer greater than 1;
extracting a feature sequence of the image frame in each segment through a feature extractor to obtain RGB (red, green, blue) features and optical flow features of each segment;
and respectively outputting the RGB features and the optical flow features to a feature embedding module to obtain the first feature vector corresponding to the RGB features and the second feature vector corresponding to the optical flow features.
7. The method of claim 6, wherein the target feature vector comprises a first target feature vector generated based on the first feature vector and a second target feature vector generated based on the second feature vector; the step of inputting the target feature vector into a motion recognition network for motion recognition to obtain a motion recognition result in the target video includes:
inputting the first target feature vector into a motion recognition network for motion recognition to obtain a first recognition result;
inputting the second target feature vector into the action recognition network for action recognition to obtain a second recognition result;
and fusing the first recognition result and the second recognition result according to a set proportion to obtain the action recognition result in the target video.
8. The method according to claim 1, wherein after determining the attention weight of the foreground image frame in the target video according to the preset attention dynamic threshold, further comprising:
and taking the video time corresponding to the first foreground image frame in the target video as the action starting time, and taking the video time corresponding to the last foreground image frame in the target video as the action ending time.
9. A terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202111534245.9A 2021-12-15 2021-12-15 Action recognition method, terminal and storage medium Pending CN114360053A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111534245.9A CN114360053A (en) 2021-12-15 2021-12-15 Action recognition method, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111534245.9A CN114360053A (en) 2021-12-15 2021-12-15 Action recognition method, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN114360053A true CN114360053A (en) 2022-04-15

Family

ID=81098672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111534245.9A Pending CN114360053A (en) 2021-12-15 2021-12-15 Action recognition method, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN114360053A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115171217A (en) * 2022-07-27 2022-10-11 北京拙河科技有限公司 Action recognition method and system under dynamic background


Similar Documents

Publication Publication Date Title
CN109598231B (en) Video watermark identification method, device, equipment and storage medium
Lu et al. Dense and sparse reconstruction error based saliency descriptor
CN109345553B (en) Palm and key point detection method and device thereof, and terminal equipment
WO2021196389A1 (en) Facial action unit recognition method and apparatus, electronic device, and storage medium
CN111209970B (en) Video classification method, device, storage medium and server
CN111598164B (en) Method, device, electronic equipment and storage medium for identifying attribute of target object
CN112052186B (en) Target detection method, device, equipment and storage medium
CN106874826A (en) Face key point-tracking method and device
CN110321845B (en) Method and device for extracting emotion packets from video and electronic equipment
GB2555136A (en) A method for analysing media content
WO2023010758A1 (en) Action detection method and apparatus, and terminal device and storage medium
CN110222582B (en) Image processing method and camera
CN113361593B (en) Method for generating image classification model, road side equipment and cloud control platform
CN109215037A (en) Destination image partition method, device and terminal device
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN113239807B (en) Method and device for training bill identification model and bill identification
CN114612743A (en) Deep learning model training method, target object identification method and device
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN110738070A (en) Behavior identification method and behavior identification device based on video and terminal equipment
CN111984803B (en) Multimedia resource processing method and device, computer equipment and storage medium
CN112668577A (en) Method, terminal and device for detecting target object in large-scale image
CN111898675A (en) Credit wind control model generation method and device, scoring card generation method, machine readable medium and equipment
CN111538852A (en) Multimedia resource processing method, device, storage medium and equipment
CN111241961A (en) Face detection method and device and electronic equipment
US20210406568A1 (en) Utilizing multiple stacked machine learning models to detect deepfake content

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination