CN115035605B - Action recognition method, device and equipment based on deep learning and storage medium - Google Patents

Action recognition method, device and equipment based on deep learning and storage medium

Info

Publication number
CN115035605B
Authority
CN
China
Prior art keywords
dimension
sample
video
sample image
feature map
Prior art date
Legal status
Active
Application number
CN202210953176.3A
Other languages
Chinese (zh)
Other versions
CN115035605A
Inventor
杨政华
吴志伟
杨海军
陈丽珍
Current Assignee
Guangdong Lvan Industry And Commerce Co ltd
Original Assignee
Guangdong Lvan Industry And Commerce Co ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Lvan Industry And Commerce Co ltd filed Critical Guangdong Lvan Industry And Commerce Co ltd
Priority to CN202210953176.3A priority Critical patent/CN115035605B/en
Publication of CN115035605A publication Critical patent/CN115035605A/en
Application granted granted Critical
Publication of CN115035605B publication Critical patent/CN115035605B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an action recognition method, device, equipment and storage medium based on deep learning. The method comprises the following steps: acquiring a motion recognition model and a fixed video frame number; determining a first sample image set of the fixed video frame number according to a sample video of the motion recognition model; erasing at least one frame of sample image from the first sample image set in the time dimension to obtain a second sample image set; performing average pooling and dimension compression on the channel dimension and the spatial dimension of the feature map corresponding to the second sample image set to obtain a compression result; performing feature extraction on the compression result through a first convolution unit to obtain a first sample feature; performing feature extraction on the first sample feature through a second convolution unit to obtain an attention weight; and training the motion recognition model according to the attention weight, the feature map and the action type of the sample video. The embodiment of the invention can improve the accuracy of the motion recognition model.

Description

Action recognition method, device and equipment based on deep learning and storage medium
Technical Field
The invention relates to the technical field of action recognition, in particular to an action recognition method, an action recognition device, action recognition equipment and a storage medium based on deep learning.
Background
Action recognition means identifying the action contained in an input video so that, based on the action of a person, a system can issue instructions for the next step. For example, surveillance video can be used to raise an alarm in advance when the actions of the elderly or children are abnormal, to identify whether the actions of pedestrians in night-time monitoring are abnormal, or to judge whether a pedestrian is a potential criminal.
The field of action recognition still faces many problems. For example, the number of positive samples is often small, whereas deep-learning-based methods have high requirements on data volume. Moreover, video information contains not only spatial-dimension information but also, more importantly, temporal-dimension information, and how to extract and recognize actions from temporally ordered frames is a difficulty in action recognition.
At present, video data from network streams may contain only a small number of frames, whereas a deep learning model trained offline requires a fixed number of input frames. A common approach is simply not to process data with few frames, but this direct filtering increases the chance that videos of potential crimes pass through unexamined, reduces the early-warning effect for dangerous persons, and carries greater risk.
Disclosure of Invention
The invention provides a method, a device, equipment and a storage medium for recognizing actions based on deep learning, which can improve the accuracy of an action recognition model.
According to an aspect of the present invention, there is provided a deep learning-based motion recognition method, the method including:
acquiring a motion recognition model to be trained based on deep learning and a fixed video frame number associated with the motion recognition model;
determining a first sample image set of the fixed video frame number according to the sample video of the motion recognition model;
erasing at least one frame of sample image from the first sample image set in the time dimension to obtain a second sample image set of the action recognition model;
performing feature extraction on the second sample image set to obtain a feature map and a channel dimension and a space dimension of the feature map, and performing average pooling and dimension compression on the channel dimension and the space dimension of the feature map to obtain a compression result;
performing feature extraction on the compression result through a first convolution unit to obtain a first sample feature;
performing feature extraction on the first sample feature through a second convolution unit to obtain attention weight;
training the motion recognition model according to the attention weight, the feature map and the motion type of the sample video;
the dimension of the first sample feature is λ T ', T' is the time dimension of the feature map input by the current convolution, λ is an amplification coefficient, the convolution kernel size of the first convolution unit is 1 × 1, and the convolution kernel size of the second convolution unit is 3 × 3.
According to another aspect of the present invention, there is provided a deep learning-based motion recognition apparatus, the apparatus including:
the model and frame number acquisition module is used for acquiring a motion recognition model to be trained based on deep learning and a fixed video frame number associated with the motion recognition model;
the first sample image set module is used for determining a first sample image set with the fixed video frame number according to the sample video of the motion recognition model;
a second sample image set module, configured to erase at least one frame of sample images in the first sample image set in a time dimension, so as to obtain a second sample image set of the motion recognition model;
the compression result determining module is used for performing feature extraction on the second sample image set to obtain a feature map and a channel dimension and a space dimension of the feature map, and performing average pooling and dimension compression on the channel dimension and the space dimension of the feature map to obtain a compression result;
the first sample characteristic module is used for carrying out characteristic extraction on the compression result through a first convolution unit to obtain a first sample characteristic;
the attention weight determining module is used for extracting the features of the first sample feature through a second convolution unit to obtain an attention weight;
the model training module is used for training the action recognition model according to the attention weight, the feature map and the action type of the sample video;
the dimension of the first sample feature is λ T ', T' is the time dimension of the feature map input by the current convolution, λ is an amplification coefficient, the convolution kernel size of the first convolution unit is 1 × 1, and the convolution kernel size of the second convolution unit is 3 × 3.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform a method for deep learning based action recognition according to any of the embodiments of the present invention.
According to another aspect of the present invention, a computer-readable storage medium is provided, and computer instructions are stored in the computer-readable storage medium, and when the computer instructions are executed, a processor implements the deep learning based action recognition method according to any embodiment of the present invention.
According to the technical scheme of the embodiment of the invention, a first sample image set with a fixed video frame number is determined for a deep learning-based motion recognition model according to a sample video of a motion recognition model, a second sample image set is obtained by erasing the first sample image set in a time dimension, an attention weight is obtained by processing a feature map of the second sample image set through cooperation of a first convolution unit and a second convolution unit, the motion recognition model is trained by adopting the attention weight, the feature map and the motion type of the sample video, the problem that the motion recognition of a video with a small frame number in the prior art is difficult is solved, and the accuracy of the motion recognition model can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a deep learning-based action recognition method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a deep learning-based action recognition method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a deep learning-based motion recognition apparatus according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device implementing a deep learning-based motion recognition method according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a deep-learning-based action recognition method according to an embodiment of the present invention. The embodiment is applicable to the case of training an action recognition model. The method may be performed by a deep-learning-based action recognition apparatus, which may be implemented in hardware and/or software, and the apparatus may be configured in an electronic device. As shown in fig. 1, the method includes:
s110, obtaining a motion recognition model to be trained based on deep learning and a fixed video frame number associated with the motion recognition model.
The action recognition model based on deep learning is used for carrying out action recognition on an input video, and the deep learning network can be a convolutional neural network or the like. The fixed video frame number associated with the motion recognition model is the input frame number of the motion recognition model, and can be preset according to requirements, and can be 8, 16, 32 and the like.
And S120, determining a first sample image set with the fixed video frame number according to the sample video of the motion recognition model.
In the training process of the action recognition model based on deep learning, the predicted action type of the sample video is obtained by carrying out action recognition on the sample video, a loss function of the action recognition model is constructed by combining the predicted action type of the sample video and the real action type marked by the sample video, and the action recognition model is trained by adopting the loss function.
The embodiment of the present disclosure does not limit the frame number of the sample video, and the frame number of each sample video may differ. The frame number of the sample video may be equal to or greater than the fixed video frame number, or may be less than it. When the frame number of the sample video is equal to or greater than the fixed video frame number, sample images amounting to the fixed video frame number can be extracted from the sample video to serve as the first sample image set; when the frame number of the sample video is less than the fixed video frame number, frame interpolation and filling can be applied to the sample video to obtain the first sample image set.
Optionally, determining the first sample image set with the fixed video frame number according to the sample video of the motion recognition model includes:
if the frame number of the sample video is smaller than the fixed video frame number, determining the copy times and the frame extraction times by the following formula:
[Equation images: three formulas relating the copy times C, the frame extraction times E, the maximum frame interpolation times K, the sample video frame number N and the fixed video frame number T]
wherein C is the copy times, E is the frame extraction times, K is the maximum frame interpolation times, N and T are respectively the frame number of the sample video and the fixed video frame number, and ceil() rounds up;
and copying each frame image in the sample video by the copy times, and randomly extracting frames from the copied sample video by the frame extraction times, so that the video frame number equals the fixed video frame number, the input requirement of the action recognition model is met, and the first sample image set of the fixed video frame number is obtained.
When the frame number of a sample video is judged to be smaller than the fixed video frame number, the copy times are calculated from the maximum frame interpolation times, each frame image in the sample video is copied by the calculated copy times, and frame interpolation and completion are performed. This ensures that the interpolation counts of individual frames do not differ too much (specifically, the interpolation counts of single frames in the original video differ by at most 1) and that the frame number of the copied sample video meets the input frame number requirement of the action recognition model. The frame extraction times are then calculated from the frame number of the sample video, the fixed video frame number and the maximum frame interpolation times to obtain the first sample image set of the fixed video frame number. Training the action model on a first sample image set obtained by random frame extraction can improve the accuracy of the action recognition model.
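For illustration only, the following Python sketch shows one way the short-clip branch could be implemented. The patent's exact formulas are embedded in images that are not reproduced here, so the sketch assumes a copy count of ceil(T / N) per frame and random removal of the surplus copies, which matches the described behaviour (per-frame interpolation counts differing by at most 1); all function and variable names are illustrative.

    import math
    import random

    def pad_short_video(frames, T):
        # Expand a clip with N < T frames to exactly T frames.
        # Assumed reading of the unreproduced formulas: copy each frame
        # C = ceil(T / N) times, then randomly drop E = C * N - T copies,
        # so per-frame copy counts differ by at most 1.
        N = len(frames)
        assert 0 < N < T
        C = math.ceil(T / N)                                # copy count per frame
        E = C * N - T                                       # surplus frames to remove
        expanded = [f for f in frames for _ in range(C)]    # N * C frames
        drop = set(random.sample(range(len(expanded)), E))  # random frame extraction
        return [f for i, f in enumerate(expanded) if i not in drop]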
Optionally, determining the first sample image set with the fixed video frame number according to the sample video of the motion recognition model includes:
if the frame number of the sample video is equal to or greater than the fixed video frame number, determining a video frame-taking interval by the following formula:
L = floor(N / T)
wherein L is the video frame-taking interval, N and T are in turn the frame number of the sample video and the fixed video frame number, and floor() rounds down;
determining a starting frame by the following formulas:
[Equation images: two formulas giving the starting frame from N, T, L, random() and sqrt()]
wherein random() denotes a random number and sqrt denotes the square root.
When the frame number of the sample video is judged to be equal to or greater than the fixed video frame number, the video frame-taking interval is calculated from the frame number of the sample video and the fixed video frame number. Taking frames at this interval preserves the integrity of the action label, widens the coverage of the sampled frames and reduces overfitting in the time dimension. The position of the starting frame is then calculated from the frame number of the sample video, the fixed video frame number and the video frame-taking interval, which increases the randomness of the extraction and can improve the accuracy of the action recognition model.
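A companion sketch for the long-clip branch follows. The interval L = floor(N / T) is taken from the reconstructed formula above, while the starting-frame formulas are only available as images, so a uniformly random start within the admissible range stands in for them; this is an assumption, not the patent's exact rule.

    import math
    import random

    def sample_long_video(frames, T):
        # Pick T frames at a fixed stride from a clip with N >= T frames.
        N = len(frames)
        assert N >= T
        L = max(1, math.floor(N / T))        # video frame-taking interval
        slack = N - L * (T - 1)              # admissible starting positions
        start = random.randrange(slack)      # assumed: uniformly random starting frame
        return [frames[start + i * L] for i in range(T)]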
S130, erasing at least one frame of sample image from the first sample image set in the time dimension to obtain a second sample image set of the action recognition model.
Video data includes not only spatial dimension information, but also temporal dimension information, which is more important in the field of motion recognition, and spatial information appears redundant in many cases. According to the embodiment of the disclosure, random erasure of a time dimension is introduced, at least one frame of sample image in the first sample image set is erased, and the second sample image set is obtained, so that redundancy of the time dimension can be reduced, and the effect of reducing overfitting is achieved. Specifically, at least one frame of sample image to be erased is selected from the first sample image set, and the sample image to be erased is subjected to overall processing, so that the influence of the moment to be erased corresponding to the sample image to be erased is removed.
Optionally, erasing at least one frame of sample image in the first sample image set in a time dimension includes:
selecting a sample image to be erased from the first sample image set according to the fixed video frame number and the erasing proportion;
and setting the pixel value of the image to be erased as a preset pixel value.
The erasure proportion can be selected randomly; a preferred erasure proportion is 0.1. The preset pixel value is a predetermined pixel value used to overwrite the pixel values of the images to be erased.
By selecting the sample image to be erased from the first sample image set according to the fixed video frame number and the erasing proportion and setting the pixel value of the sample image to be erased as the preset pixel value, the redundancy of the time dimension can be reduced, and the erasing flexibility of the time dimension can be improved.
Optionally, the selecting a sample image to be erased from the first sample image set according to the fixed video frame number and the erasure proportion includes:
the erased frame interval is determined according to the following formula:
q=floor(1/τ);
wherein q is an erasure frame interval, floor () is rounded down, τ is an erasure proportion;
randomly selecting an initial erased frame from the first q frames of the first sample image set;
and selecting the sample image with the number of the frames to be erased from the first sample image set as the sample image to be erased according to the initial erased frame and the erased frame interval.
And selecting the sample images to be erased according to the initial erasing frame and the erasing frame interval, and uniformly selecting the sample images to be erased from the first sample image set, so that the second sample images are wide in coverage and uniform in distribution in the time dimension.
Selecting a sample image to be erased from the first sample image set according to the fixed video frame number and the erasure proportion may further include: and randomly selecting at least one frame of image from the first sample image set based on the random number as a sample image to be erased.
These two time-dimension data erasing algorithms (selecting the sample images to be erased from an initial erased frame and an erased-frame interval, or selecting them from the first sample image set based on random numbers) can increase the generalization capability of the action recognition model and help reduce overfitting.
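A minimal sketch of the interval-based erasing scheme described above, assuming the first sample image set is held as a tensor of shape (T, C, H, W); the erasure proportion τ, the interval q = floor(1/τ) and the random initial erased frame within the first q frames follow the text, while the preset pixel value of 0 is an illustrative choice.

    import math
    import random
    import torch

    def erase_time_dimension(frames, tau=0.1, preset_value=0.0):
        # frames: tensor of shape (T, C, H, W) holding the first sample image set.
        # tau: erasure proportion; q = floor(1 / tau) is the erased-frame interval.
        T = frames.shape[0]
        q = math.floor(1.0 / tau)             # erased-frame interval
        start = random.randrange(min(q, T))   # initial erased frame within the first q frames
        erased = frames.clone()
        erased[start::q] = preset_value       # set selected frames to the preset pixel value
        return erased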
S140, performing feature extraction on the second sample image set to obtain a feature map and a channel dimension and a space dimension of the feature map, and performing average pooling and dimension compression on the channel dimension and the space dimension of the feature map to obtain a compression result.
The channel dimension can be determined by the number of channels of the input matrix, and the feature map space dimension comprises a feature map width dimension and a feature map height dimension. And performing feature extraction on the second sample image set to obtain a feature map and the channel dimension and the space dimension of the feature map, and performing average pooling on the channel dimension and the space dimension of the feature map to obtain a one-dimensional vector as an average pooling result. And carrying out dimension transformation on the average pooling result to obtain a dimension compression result.
S150, performing feature extraction on the compression result through a first convolution unit to obtain a first sample feature.
And S160, performing feature extraction on the first sample feature through a second convolution unit to obtain attention weight.
S170, training the motion recognition model according to the attention weight, the feature map and the motion type of the sample video;
the dimension of the first sample feature is λ T ', T' is the time dimension of the feature map input by the current convolution, λ is an amplification coefficient, the convolution kernel size of the first convolution unit is 1 × 1, and the convolution kernel size of the second convolution unit is 3 × 3.
The motion type of the sample video may be preset according to a real motion, and for example, the motion type of the sample video may be abnormal motion of an old person and a child in a surveillance video, abnormal motion of a pedestrian in night surveillance, abnormal motion of a potential criminal, and the like. The convolution kernel of the first convolution unit of 1 × 1 can improve the processing efficiency, and the convolution kernel of the second convolution unit of 3 × 3 can improve the information interaction between different dimensions.
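To make steps S140 to S160 concrete, here is a hedged PyTorch sketch of the attention branch: the feature map of shape (B, C, T', W, H) is average-pooled over the channel and spatial dimensions, compressed to (B, T'), expanded to dimension λT' by the first convolution unit (kernel size 1) with ReLU, and mapped by the second convolution unit (kernel size 3) with a sigmoid to the attention weight. Treating the two kernels as one-dimensional convolutions along the time axis and setting λ = 2 are assumptions of this sketch, not details fixed by the text.

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        # Sketch of the attention branch (steps S140-S160).
        # Assumed reading: the compressed (B, T') feature is a 1-channel
        # signal of length T', so the kernel-3 convolution mixes
        # neighbouring time steps; lam is the amplification coefficient.

        def __init__(self, lam=2):
            super().__init__()
            # first convolution unit: kernel size 1, expands to lam * T' values
            self.conv1 = nn.Conv1d(1, lam, kernel_size=1)
            # second convolution unit: kernel size 3, one attention value per time step
            self.conv2 = nn.Conv1d(lam, 1, kernel_size=3, padding=1)

        def forward(self, feat):
            b, c, t, w, h = feat.shape
            x = feat.permute(0, 2, 1, 3, 4)       # (B, C, T', W, H) -> (B, T', C, W, H)
            x = x.mean(dim=(2, 3, 4))             # average pooling over C, W, H -> (B, T')
            x = x.unsqueeze(1)                    # (B, 1, T')
            x = torch.relu(self.conv1(x))         # first sample feature, lam * T' values
            y = torch.sigmoid(self.conv2(x))      # attention weight, (B, 1, T')
            return y.view(b, 1, t, 1, 1)          # broadcastable over (B, C, T', W, H)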
According to the technical scheme of the embodiment of the invention, a first sample image set with a fixed video frame number is determined for a deep learning-based motion recognition model according to a sample video of the motion recognition model, a second sample image set is obtained by erasing the first sample image set in a time dimension, an attention weight is obtained by processing a feature map of the second sample image set through cooperation of a first convolution unit and a second convolution unit, and the motion recognition model is trained by adopting the attention weight, the feature map of the second sample image set and the motion type of the sample video, so that the accuracy of the motion recognition model can be improved.
Example two
Fig. 2 is a flowchart of an action recognition method based on deep learning according to a second embodiment of the present invention, and in this embodiment, on the basis of the foregoing embodiments, average pooling and dimension compression are performed on the channel dimension and the space dimension of the feature map, and a compression result is further optimized and expanded, and may be combined with the foregoing optional embodiments. The average pooling and dimension compression of the channel dimension and the space dimension of the feature map to obtain a compression result comprises the following steps: acquiring input information of the action recognition model, wherein the input information sequentially comprises a batch size, a channel dimension of a second sample image set feature map, a video time dimension, a feature map width dimension and a feature map height dimension; converting the channel dimension and the video time dimension in the input information to obtain converted input information; the converted input information sequentially comprises the batch size, the video time dimension, the channel dimension of a second sample image set feature map, the feature map width dimension and the feature map height dimension; performing average pooling on the channel dimension, the feature map width dimension and the feature map height dimension in the transformed input information to obtain an average pooling result; performing dimensionality compression on the average pooling result to obtain a compression result; wherein the compression result comprises the batch size and the video time dimension in sequence. As shown in fig. 2, the method includes:
s201, obtaining a motion recognition model to be trained based on deep learning and a fixed video frame number associated with the motion recognition model.
S202, determining a first sample image set with the fixed video frame number according to the sample video of the motion recognition model.
S203, erasing at least one frame of sample image from the first sample image set in the time dimension to obtain a second sample image set of the motion recognition model.
And S204, acquiring input information of the motion recognition model.
The input information sequentially comprises the batch size, the channel dimension of the second sample image set feature map, the video time dimension, the feature map width dimension and the feature map height dimension.
S205, converting the channel dimension and the video time dimension in the input information to obtain converted input information.
The transformed input information sequentially comprises a batch size, a video time dimension, a channel dimension of a second sample image set feature map, a feature map width dimension and a feature map height dimension.
S206, carrying out average pooling on the channel dimension, the feature diagram width dimension and the feature diagram height dimension in the transformed input information to obtain an average pooling result.
And S207, performing dimensionality compression on the average pooling result to obtain a compression result.
Wherein the compression result comprises the batch size and the video time dimension in sequence.
And S208, performing feature extraction on the compression result through a first convolution unit to obtain a first sample feature.
And S209, performing feature extraction on the first sample feature through a second convolution unit to obtain the attention weight.
S210, training the motion recognition model according to the attention weight, the feature diagram and the motion type of the sample video.
The dimension of the first sample feature is λ T ', T' is the time dimension of the feature map input by the current convolution, λ is an amplification coefficient, the convolution kernel size of the first convolution unit is 1 × 1, and the convolution kernel size of the second convolution unit is 3 × 3.
The above may specifically be as follows: when a model based on a convolutional neural network is trained, the input dimension information is generally (B, C, T', W, H), where B represents the batch size of model training, C represents the number of channels, T' represents the number of time steps, and W and H represent the width and height of the feature map, respectively.
(1) Input = (B, C, T', W, H);
(2) Converting (B, C, T ', W, H) to (B, T', C, W, H);
(3) Reducing the input to (B, T', 1) by 3-dimensional average pooling and compressing the dimensions into input' = (B, T'), i.e., converting B pieces of T' × 1 information into B pieces of T'-dimensional information;
(4) Performing a convolution operation on the feature input' followed by ReLU (Rectified Linear Unit) non-linear activation. Because the time-dimension feature is generally small after downsampling, the convolution kernel size is designed to be 1 and the output dimension is λT', where λ is greater than 1 (for example 2 or 3), so that the dimension of the hidden space is increased and the feature is easier to express;
(5) Performing a further convolution operation on the convolved features, with a convolution kernel size of 3 and an output dimension of T'; the kernel size of 3 facilitates information interaction among channels. The result is fed into a sigmoid activation function, which outputs the attention weight y;
(6) Processing the input with the attention weight y to obtain the feature y ⊗ input based on the time-dimension attention mechanism (element-wise multiplication of the attention weight with the input).
According to the technical scheme of this embodiment of the invention, the time-dimension feature T' is obtained by global average pooling over the W, H and C dimensions; in general, after the time-dimension feature has been downsampled several times, it carries very little information. Therefore, a convolution with kernel size 1 increases the channel information so that the output is λT', and a non-linear activation function improves the non-linear expression capability of the module; a convolution with kernel size 3 then performs time-dimension information interaction to ensure interaction between the pieces of information and outputs T' values. The time dimension is thus modelled with an attention mechanism that outputs the attention weight; the attention weight is combined with the input to obtain the time-dimension feature, a loss function is constructed from this feature and the label of the sample, gradient descent is performed, and the parameters of the action recognition model are updated, which improves the accuracy of the action recognition model.
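Building on the TemporalAttention sketch above, the following fragment illustrates how the attention weight might be combined with the feature map and used in a training step; the backbone, the linear classification head, the cross-entropy loss and the optimizer are illustrative assumptions rather than details specified by the text.

    import torch
    import torch.nn as nn

    class ActionRecognizer(nn.Module):
        # Illustrative wrapper: backbone feature map, temporal attention, classifier.

        def __init__(self, backbone, channels, num_actions, lam=2):
            super().__init__()
            self.backbone = backbone                     # assumed to yield (B, C, T', W, H)
            self.attention = TemporalAttention(lam=lam)  # sketch defined earlier
            self.head = nn.Linear(channels, num_actions)

        def forward(self, clip):
            feat = self.backbone(clip)                   # feature map (B, C, T', W, H)
            weighted = self.attention(feat) * feat       # time-dimension attention: y * input
            pooled = weighted.mean(dim=(2, 3, 4))        # (B, C)
            return self.head(pooled)                     # action-type logits

    def train_step(model, optimizer, clip, action_label):
        # One assumed gradient-descent step on a (clip, action type) pair.
        logits = model(clip)
        loss = nn.functional.cross_entropy(logits, action_label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()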
EXAMPLE III
Fig. 3 is a schematic structural diagram of an action recognition device based on deep learning according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes:
a model and frame number obtaining module 310, configured to obtain a deep learning-based motion recognition model to be trained, and a fixed video frame number associated with the motion recognition model;
a first sample image set module 320, configured to determine a first sample image set of the fixed video frame number according to the sample video of the motion recognition model;
a second sample image set module 330, configured to erase at least one frame of sample images in the first sample image set in a time dimension, so as to obtain a second sample image set of the motion recognition model;
the compression result determining module 340 is configured to perform feature extraction on the second sample image set to obtain a feature map and a channel dimension and a space dimension of the feature map, and perform average pooling and dimension compression on the channel dimension and the space dimension of the feature map to obtain a compression result;
a first sample feature module 350, configured to perform feature extraction on the compression result through a first convolution unit to obtain a first sample feature;
an attention weight determining module 360, configured to perform feature extraction on the first sample feature through a second convolution unit to obtain an attention weight;
a model training module 370, configured to train the motion recognition model according to the attention weight, the feature map, and the motion type of the sample video;
the dimension of the first sample feature is λ T ', T' is the time dimension of the feature map input by the current convolution, λ is an amplification coefficient, the convolution kernel size of the first convolution unit is 1 × 1, and the convolution kernel size of the second convolution unit is 3 × 3.
According to the technical scheme of the embodiment of the invention, a first sample image set with a fixed video frame number is determined for a deep learning-based motion recognition model according to a sample video of the motion recognition model, a second sample image set is obtained by erasing the time dimension of the first sample image set, an attention weight is obtained by processing a feature map of the second sample image set through the cooperation of a first convolution unit and a second convolution unit, and the motion recognition model is trained by adopting the attention weight, the feature map of the second sample image set and the motion type of the sample video, so that the accuracy of the motion recognition model can be improved.
Further, the compression result determining module 340 includes:
the input information acquisition unit is used for acquiring input information of the action recognition model, wherein the input information sequentially comprises a batch size, a channel dimension of a second sample image set feature map, a video time dimension, a feature map width dimension and a feature map height dimension;
the conversion input information determining unit is used for converting the channel dimension and the video time dimension in the input information to obtain converted input information; the converted input information sequentially comprises the batch size, the video time dimension, the channel dimension of a second sample image set feature map, the feature map width dimension and the feature map height dimension;
an average pooling result determining unit, configured to perform average pooling on a channel dimension, a feature map width dimension, and a feature map height dimension in the converted input information to obtain an average pooling result;
a compression result determining unit, configured to perform dimension compression on the average pooling result to obtain a compression result; wherein the compression result comprises the batch size and the video time dimension in sequence.
Further, the second sample image set module 330 includes:
a sample image to be erased selecting unit, configured to select a sample image to be erased from the first sample image set according to the fixed video frame number and the erasure proportion;
and the preset pixel value setting unit is used for setting the pixel value of the image to be erased to be a preset pixel value.
Further, the to-be-erased sample image selecting unit is specifically configured to:
the erased frame interval is determined according to the following formula:
q=floor(1/τ);
wherein q is an erasure frame interval, floor () is rounded down, τ is an erasure proportion;
randomly selecting an initial erased frame from the first q frames of the first sample image set;
and selecting the sample image with the number of the frames to be erased from the first sample image set as the sample image to be erased according to the initial erased frame and the erased frame interval.
Further, the first sample image set module 320 is specifically configured to:
if the frame number of the sample video is smaller than the fixed video frame number, determining the copy times and the frame extraction times by the following formula:
[Equation images: three formulas relating the copy times C, the frame extraction times E, the maximum frame interpolation times K, the sample video frame number N and the fixed video frame number T]
wherein C is the copy frequency, E is the frame extraction frequency, K is the maximum frame interpolation frequency, N and T are the frame number of the sample video and the frame number of the fixed video respectively, and ceil () is rounding up;
and copying each frame image in the sample video by adopting the copying times, and randomly extracting frames from the copied sample video by adopting the frame extracting times to obtain a first sample image set with a fixed video frame number.
Further, the first sample image set module 320 is specifically configured to:
if the frame number of the sample video is equal to or greater than the fixed video frame number, determining a video frame taking interval through the following formula;
L = floor(N / T)
wherein L is the video frame-taking interval, N and T are in turn the frame number of the sample video and the fixed video frame number, and floor() rounds down;
determining a starting frame by the following formulas:
[Equation images: two formulas giving the starting frame from N, T, L, random() and sqrt()]
wherein random() denotes a random number and sqrt denotes the square root.
The action recognition device based on deep learning provided by the embodiment of the invention can execute the action recognition method based on deep learning provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
FIG. 4 illustrates a schematic diagram of an electronic device 40 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 40 includes at least one processor 41, and a Memory communicatively connected to the at least one processor 41, such as a Read Only Memory (ROM) 42, a Random Access Memory (RAM) 43, and the like, wherein the Memory stores a computer program executable by the at least one processor, and the processor 41 can execute various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 42 or the computer program loaded from the storage unit 48 into the Random Access Memory (RAM) 43. In the RAM43, various programs and data necessary for the operation of the electronic apparatus 40 can also be stored. The processor 41, the ROM42, and the RAM43 are connected to each other via a bus 44. An Input/Output (I/O) interface 45 is also connected to bus 44.
A number of components in the electronic device 40 are connected to the I/O interface 45, including: an input unit 46 such as a keyboard, a mouse, etc.; an output unit 47 such as various types of displays, speakers, and the like; a storage unit 48 such as a magnetic disk, optical disk, or the like; and a communication unit 49 such as a network card, modem, wireless communication transceiver, etc. The communication unit 49 allows the electronic device 40 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Processor 41 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of processor 41 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. Processor 41 performs the various methods and processes described above, such as a deep learning based action recognition method.
In some embodiments, deep learning based action recognition may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 48. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 40 via the ROM42 and/or the communication unit 49. When the computer program is loaded into RAM43 and executed by processor 41, one or more steps of the deep learning based action recognition described above may be performed. Alternatively, in other embodiments, processor 41 may be configured to perform the deep learning based action recognition method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Parts (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Computer programs for implementing the methods of the present invention can be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable Read-Only Memory (EPROM or flash Memory), an optical fiber, a Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a Display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server may be a cloud Server, which is also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in the conventional physical host and VPS (Virtual Private Server) service.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A motion recognition method based on deep learning is characterized by comprising the following steps:
acquiring a motion recognition model to be trained based on deep learning and a fixed video frame number associated with the motion recognition model;
determining a first sample image set of the fixed video frame number according to the sample video of the motion recognition model;
erasing at least one frame of sample image in the first sample image set in a time dimension manner to obtain a second sample image set of the action recognition model;
performing feature extraction on the second sample image set to obtain a feature map and a channel dimension and a space dimension of the feature map, and performing average pooling and dimension compression on the channel dimension and the space dimension of the feature map to obtain a compression result;
performing feature extraction on the compression result through a first convolution unit to obtain a first sample feature;
performing feature extraction on the first sample feature through a second convolution unit to obtain attention weight;
training the motion recognition model according to the attention weight, the feature map and the motion type of the sample video;
the dimension of the first sample feature is lambda T ', T' is the time dimension of the feature map input by the current convolution, lambda is an amplification coefficient, the convolution kernel size of the first convolution unit is 1 multiplied by 1, and the convolution kernel size of the second convolution unit is 3 multiplied by 3;
wherein, the average pooling and dimension compression are performed on the channel dimension and the space dimension of the feature map to obtain a compression result, and the method comprises the following steps:
acquiring input information of the action recognition model, wherein the input information sequentially comprises a batch size, a channel dimension of a second sample image set feature map, a video time dimension, a feature map width dimension and a feature map height dimension;
converting the channel dimension and the video time dimension in the input information to obtain converted input information; the converted input information sequentially comprises the batch size, the video time dimension, the channel dimension of a second sample image set feature map, the feature map width dimension and the feature map height dimension;
performing average pooling on the channel dimension, the feature map width dimension and the feature map height dimension in the transformed input information to obtain an average pooling result;
performing dimensionality compression on the average pooling result to obtain a compression result; wherein the compression result comprises in sequence a batch size and a video time dimension.
2. The method according to claim 1, wherein the erasing at least one frame of the sample image in the first sample image set in a time dimension comprises:
selecting a sample image to be erased from the first sample image set according to the fixed video frame number and the erasing proportion;
and setting the pixel value of the image to be erased as a preset pixel value.
3. The method of claim 2, wherein said selecting a sample image to be erased from said first sample image set based on said fixed number of video frames and an erasure proportion comprises:
the erased frame interval is determined according to the following formula:
q=floor(1/τ)
wherein q is an erasure frame interval, floor () is rounded down, τ is an erasure proportion;
randomly selecting an initial erased frame from the first q frames of the first sample image set;
and selecting a sample image with the frame number to be erased from the first sample image set as a sample image to be erased according to the initial erased frame and the erased frame interval.
4. The method of claim 1, wherein determining the first set of sample images for the fixed number of video frames from the sample videos for the motion recognition model comprises:
if the frame number of the sample video is smaller than the fixed video frame number, determining the copy times and the frame extraction times by the following formula:
[Equation images: three formulas relating the copy times C, the frame extraction times E, the maximum frame interpolation times K, the sample video frame number N and the fixed video frame number T]
wherein C is the copy times, E is the frame extraction times, K is the maximum frame interpolation times, N is the frame number of the sample video, T is the fixed video frame number, and ceil() rounds up;
and copying each frame image in the sample video by adopting the copying times, and randomly extracting frames from the sample video by adopting the frame extracting times to obtain a first sample image set with a fixed video frame number.
5. The method of claim 1, wherein determining the first set of sample images for the fixed number of video frames from the sample videos for the motion recognition model comprises:
if the frame number of the sample video is equal to or greater than the fixed video frame number, determining a video frame taking interval through the following formula;
Figure 968267DEST_PATH_IMAGE008
wherein L is the video frame-sampling interval, N is the number of frames of the sample video, T is the fixed video frame number, and floor() denotes rounding down;
determining a start frame through the following formulas:
[start-frame formulas, shown only as images in the original and involving random() and sqrt(), are not reproduced here]
wherein the quantity defined by the formulas is the start frame, random() denotes a random number, and sqrt() denotes the square root.
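A sketch of the case where N ≥ T: the sampling interval follows L = floor(N/T); since the claim's start-frame formulas (involving random() and sqrt()) are only shown as images, the uniform random start used below is an assumption.

```python
import random
from typing import List

def sample_long_video(frames: List, T: int) -> List:
    """frames: list of N >= T frame images; returns T frames taken at interval L."""
    N = len(frames)
    L = N // T                                   # video frame-sampling interval, floor(N/T)
    last_valid_start = N - L * (T - 1) - 1       # last index from which T strided frames still fit
    start = random.randint(0, last_valid_start)  # assumed uniform random start frame
    return [frames[start + i * L] for i in range(T)]

print(len(sample_long_video(list(range(100)), T=16)))  # 16
```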
6. An action recognition device based on deep learning, characterized by comprising:
a model and frame number acquisition module, configured to acquire an action recognition model to be trained based on deep learning and a fixed video frame number associated with the action recognition model;
a first sample image set module, configured to determine a first sample image set with the fixed video frame number according to a sample video of the action recognition model;
a second sample image set module, configured to erase at least one frame of sample image in the first sample image set in the time dimension to obtain a second sample image set of the action recognition model;
a compression result determining module, configured to perform feature extraction on the second sample image set to obtain a feature map and the channel dimension and spatial dimension of the feature map, and to perform average pooling and dimension compression on the channel dimension and spatial dimension of the feature map to obtain a compression result;
a first sample feature module, configured to perform feature extraction on the compression result through a first convolution unit to obtain a first sample feature;
an attention weight determining module, configured to perform feature extraction on the first sample feature through a second convolution unit to obtain an attention weight;
a model training module, configured to train the action recognition model according to the attention weight, the feature map and the action type of the sample video;
wherein the dimension of the first sample feature is λT', T' is the time dimension of the feature map input to the current convolution, λ is an amplification coefficient, the convolution kernel size of the first convolution unit is 1 × 1, and the convolution kernel size of the second convolution unit is 3 × 3;
wherein the compression result determining module comprises:
an input information acquisition unit, configured to acquire input information of the action recognition model, wherein the input information sequentially comprises a batch size, a channel dimension of a feature map of the second sample image set, a video time dimension, a feature map width dimension and a feature map height dimension;
a converted input information determining unit, configured to swap the channel dimension and the video time dimension in the input information to obtain converted input information, wherein the converted input information sequentially comprises the batch size, the video time dimension, the channel dimension of the feature map of the second sample image set, the feature map width dimension and the feature map height dimension;
an average pooling result determining unit, configured to perform average pooling on the channel dimension, the feature map width dimension and the feature map height dimension in the converted input information to obtain an average pooling result;
a compression result determining unit, configured to perform dimension compression on the average pooling result to obtain a compression result, wherein the compression result sequentially comprises the batch size and the video time dimension.
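The attention branch formed by the modules above could be read, for example, as two 1-D convolutions applied over the time axis; the sketch below makes that assumption, along with the ReLU, the sigmoid, and the way the weight is multiplied back onto the feature map. Only the kernel sizes (1 and 3) and the λT′ dimension of the first sample feature come from the claims.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, lam: int = 4):
        super().__init__()
        self.conv1 = nn.Conv1d(1, lam, kernel_size=1)             # first convolution unit (1x1)
        self.conv2 = nn.Conv1d(lam, 1, kernel_size=3, padding=1)  # second convolution unit (3x3)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: (B, C, T, H, W)
        b, c, t, h, w = feature_map.shape
        # compression result (B, T): swap channel/time, then pool channel and spatial dims
        compressed = feature_map.permute(0, 2, 1, 3, 4).mean(dim=(2, 3, 4))
        x = compressed.unsqueeze(1)                        # (B, 1, T)
        first_feature = torch.relu(self.conv1(x))          # (B, lam, T): dimension lam * T'
        weight = torch.sigmoid(self.conv2(first_feature))  # (B, 1, T) attention weight
        # reweight the feature map along the time dimension (assumed usage of the weight)
        return feature_map * weight.reshape(b, 1, t, 1, 1)

attn = TemporalAttention(lam=4)
out = attn(torch.randn(2, 64, 8, 7, 7))
print(out.shape)  # torch.Size([2, 64, 8, 7, 7])
```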
7. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, to enable the at least one processor to perform the deep learning based action recognition method according to any one of claims 1-5.
8. A computer-readable storage medium storing computer instructions which, when executed, cause a processor to implement the deep learning based action recognition method according to any one of claims 1-5.
CN202210953176.3A 2022-08-10 2022-08-10 Action recognition method, device and equipment based on deep learning and storage medium Active CN115035605B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210953176.3A CN115035605B (en) 2022-08-10 2022-08-10 Action recognition method, device and equipment based on deep learning and storage medium

Publications (2)

Publication Number Publication Date
CN115035605A CN115035605A (en) 2022-09-09
CN115035605B true CN115035605B (en) 2023-04-07

Family

ID=83130823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210953176.3A Active CN115035605B (en) 2022-08-10 2022-08-10 Action recognition method, device and equipment based on deep learning and storage medium

Country Status (1)

Country Link
CN (1) CN115035605B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115527152A (en) * 2022-11-10 2022-12-27 南京恩博科技有限公司 Small sample video motion analysis method, system and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111317468A (en) * 2020-02-27 2020-06-23 腾讯科技(深圳)有限公司 Electroencephalogram signal classification method and device, computer equipment and storage medium
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution
CN113111814A (en) * 2021-04-20 2021-07-13 合肥学院 Regularization constraint-based semi-supervised pedestrian re-identification method and device
CN114492657A (en) * 2022-02-09 2022-05-13 深延科技(北京)有限公司 Plant disease classification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115035605A (en) 2022-09-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Action recognition methods, devices, equipment, and storage media based on deep learning

Granted publication date: 20230407

Pledgee: Bank of China Limited by Share Ltd. Guangzhou Tianhe branch

Pledgor: GUANGDONG LVAN INDUSTRY AND COMMERCE CO.,LTD.

Registration number: Y2024980001239
