WO2023159898A1 - Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium - Google Patents


Info

Publication number
WO2023159898A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
layer
action recognition
displacement
static
Prior art date
Application number
PCT/CN2022/114819
Other languages
French (fr)
Chinese (zh)
Inventor
张国梁
杜泽旭
张屹
吴鹏
郑晓崑
Original Assignee
国网智能电网研究院有限公司
国网山东省电力公司枣庄供电公司
国家电网有限公司
Priority date
Filing date
Publication date
Application filed by 国网智能电网研究院有限公司, 国网山东省电力公司枣庄供电公司, 国家电网有限公司
Publication of WO2023159898A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Definitions

  • the present invention is based on a Chinese patent application with application number 202210179444.0 and a filing date of February 25, 2022, and claims the priority of this Chinese patent application.
  • the entire content of this Chinese patent application is hereby incorporated by reference.
  • The present invention relates to the technical field of image processing, and in particular to an action recognition system, method, and apparatus, a model training method and apparatus, computer device, and computer-readable storage medium.
  • Surveillance cameras are now ubiquitous in companies, factories, shopping malls, roads, and train stations. However, cameras alone make it difficult to monitor violations and abnormal behaviors in real time: when an abnormal behavior occurs, searching the surveillance video frame by frame is time-consuming, labor-intensive, and error-prone. If action recognition technology can detect specific abnormal behaviors in real time, it can greatly save manpower and material resources and improve efficiency. Action recognition therefore has important practical value.
  • A video action recognition algorithm needs to extract the temporal information between video frames, so the network model must have temporal modeling capability.
  • Deep-learning-based action recognition methods fall mainly into two categories: methods based on two-stream networks and methods based on three-dimensional (3D) convolutional networks.
  • Two-stream methods use optical flow as the temporal information; the optical flow must be computed in advance and stored on the local hard disk, which often requires a large amount of storage for large data sets, and the real-time performance of two-stream methods is also poor.
  • The technical problem to be solved by the present invention is to overcome the defect in the prior art that a large amount of data is required to fit a model for recognizing actions, and thereby to provide an action recognition system, method, and apparatus, a model training method and apparatus, computer device, and computer-readable storage medium.
  • the first aspect of the embodiment of the present invention provides an action recognition system, including: an information separation network, a static feature network, a dynamic feature network, and a classification network.
  • The information separation network includes a band-pass filtering module and a static feature extraction module. The band-pass filtering module is configured to extract a dynamic feature map based on multiple frames of continuous images in a segment; the static feature extraction module is configured to perform temporal average pooling on the multiple frames of continuous images in the segment to obtain a feature map, and to subtract the dynamic feature map from that feature map to obtain a static feature map. The static feature network is configured to perform feature displacement operations on the static feature maps corresponding to multiple segments, and to calculate the difference features between the static feature maps corresponding to the segments to obtain static classification features.
  • the dynamic feature network is configured to perform feature displacement operations on the dynamic feature maps corresponding to multiple segments, and calculate the difference between the dynamic feature maps corresponding to each segment to obtain dynamic classification features;
  • The classification network is configured to obtain the action recognition result according to the static classification features and the dynamic classification features.
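  • As an illustration only, the following PyTorch-style sketch shows how the four networks described above could be wired together; the class and argument names are placeholders and not taken from the filing, and the concrete sub-networks are assumed to be supplied by the caller.

```python
import torch.nn as nn

class ActionRecognitionSystem(nn.Module):
    """Sketch: information separation -> static/dynamic feature networks -> classification."""
    def __init__(self, info_sep, static_net, dynamic_net, classifier):
        super().__init__()
        self.info_sep = info_sep          # information separation network
        self.static_net = static_net      # static feature network
        self.dynamic_net = dynamic_net    # dynamic feature network
        self.classifier = classifier      # classification network

    def forward(self, segments):
        # segments: (B, N, K, C, H, W) -- N segments of K consecutive frames each
        static_feats, dynamic_feats = [], []
        for n in range(segments.shape[1]):
            dynamic_map, static_map = self.info_sep(segments[:, n])  # separate one segment
            dynamic_feats.append(dynamic_map)
            static_feats.append(static_map)
        static_cls = self.static_net(static_feats)      # feature shift + difference across segments
        dynamic_cls = self.dynamic_net(dynamic_feats)
        return self.classifier(static_cls, dynamic_cls)  # weighted fusion of the two classifiers
```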
  • the bandpass filter module includes a spatial convolution layer and a temporal convolution layer.
  • At least one of the static feature network and the dynamic feature network includes: an image segmentation module, an initial feature extraction module, and at least one intermediate feature extraction module. The image segmentation module is configured to segment the input feature map according to a first preset size to obtain a first feature vector.
  • The initial feature extraction module includes a linear embedding sub-module and at least one feature difference and feature displacement sub-module. The linear embedding sub-module is configured to convert the first feature vector according to a preset number of channels to obtain a second feature vector; the feature difference and feature displacement sub-module is configured to perform a feature displacement operation on the second feature vector and to calculate the difference features between the second feature vectors corresponding to the segments to obtain initial classification features.
  • The intermediate feature extraction module includes a feature merging sub-module and at least one feature difference and feature displacement sub-module. The feature merging sub-module is configured to merge the initial classification features according to a second preset size to obtain a third feature vector; the feature difference and feature displacement sub-module is configured to perform a feature displacement operation on the third feature vector and to calculate the difference features between the third feature vectors corresponding to the segments to obtain classification features.
  • At least one of the static feature network and the dynamic feature network includes three intermediate feature extraction modules: a first intermediate feature extraction module, a second intermediate feature extraction module, and a third intermediate feature extraction module. The image segmentation module, the initial feature extraction module, the first intermediate feature extraction module, the second intermediate feature extraction module, and the third intermediate feature extraction module are connected in sequence. The number of feature difference and feature displacement sub-modules is the same in the initial feature extraction module, the first intermediate feature extraction module, and the third intermediate feature extraction module; the number of feature difference and feature displacement sub-modules in the second intermediate feature extraction module is greater than the number in the initial feature extraction module, the first intermediate feature extraction module, and the third intermediate feature extraction module.
  • The feature difference and feature displacement sub-module includes a first normalization layer, a feature displacement unit, a feature difference unit, a second normalization layer, a first fully connected layer, a first GELU function layer, and a second fully connected layer, connected in sequence. The input data of the second normalization layer is a first residual calculation result, which is calculated from the input data of the first normalization layer and the output data of the feature difference unit. The output data of the feature difference and feature displacement sub-module is a second residual calculation result, which is calculated from the input data of the second normalization layer and the output data of the second fully connected layer.
  • The feature displacement unit includes a first channel fully connected layer, a horizontal feature displacement layer, a second channel fully connected layer, a vertical feature displacement layer, a third channel fully connected layer, and a fourth channel fully connected layer. The first channel fully connected layer is configured to fully connect the channels of the input data to obtain a fully connected result, and to input the fully connected result into the horizontal feature displacement layer and the vertical feature displacement layer respectively; the horizontal feature displacement layer is configured to horizontally displace the fully connected result to obtain a horizontal displacement result, and to input the horizontal displacement result into the second channel fully connected layer; the vertical feature displacement layer is configured to vertically displace the fully connected result to obtain a vertical displacement result, and to input the vertical displacement result into the third channel fully connected layer; the fourth channel fully connected layer is configured to process the sum of the output results of the second channel fully connected layer and the third channel fully connected layer to obtain the output result of the feature displacement unit.
  • The feature difference unit includes: an input layer, a maximum pooling layer, a third fully connected layer, a second GELU function layer, a fourth fully connected layer, an upsampling layer, a fifth fully connected layer, a third GELU function layer, a sixth fully connected layer, and a feature difference output layer. The input layer, the maximum pooling layer, the third fully connected layer, the second GELU function layer, the fourth fully connected layer, the upsampling layer, and the feature difference output layer are connected in sequence; the input layer, the fifth fully connected layer, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer are also connected in sequence. The input layer is configured to compute the difference between the input features corresponding to the segment at the current moment and the input features corresponding to the segment at the previous moment, and to input the difference features into the maximum pooling layer and the fifth fully connected layer respectively; the feature difference output layer is configured to combine the output results of the upsampling layer and the sixth fully connected layer to obtain the output result of the feature difference unit.
  • The classification network includes a first temporal average pooling layer, a second temporal average pooling layer, a static feature classifier, a dynamic feature classifier, and a recognition result output layer. The first temporal average pooling layer is configured to perform temporal average pooling on the static classification features corresponding to multiple segments and to input the pooling result into the static feature classifier; the second temporal average pooling layer is configured to perform temporal average pooling on the dynamic classification features corresponding to multiple segments and to input the pooling result into the dynamic feature classifier.
  • the static feature classifier is configured to obtain the first classification result according to the static classification feature;
  • the dynamic feature classifier is configured to obtain the second classification result according to the dynamic classification feature;
  • the recognition result output layer is configured to take the weighted average result of the first classification result and the second classification result as the output result.
  • the second aspect of the embodiment of the present invention provides an action recognition model training method, including: acquiring multiple image sequences, in which the types of pedestrian actions are marked; dividing each image sequence into multiple sub-sequences to obtain a training data set; The training data set is input into the neural network system, and the neural network system is trained until the loss value of the loss function satisfies the loss condition, and an action recognition model is obtained.
  • the neural network system is the action recognition system provided in the first aspect of the embodiment of the present invention.
  • the loss function is jointly obtained by using an orthogonal projection loss function and a cross-entropy loss function.
  • the loss function is obtained by adding the product of the orthogonal projection loss function and a hyperparameter controlling the weight of the orthogonal projection loss to the cross-entropy loss function.
  • The third aspect of the embodiment of the present invention provides an action recognition method, including: acquiring an image sequence of a target object and dividing the image sequence into multiple sub-sequences; and inputting the sub-sequences into an action recognition model to generate an action recognition result, where the action recognition model is obtained through training by the action recognition model training method provided in the second aspect of the embodiment of the present invention.
  • The method further includes: proportionally scaling the images in the image sequence to obtain scaled images, where the size of the short side of each scaled image is within a preset range; randomly cropping the scaled images to obtain cropped images, where the size of each cropped image meets a preset condition; and using the cropped images as the images in the sub-sequences before executing the step of inputting the sub-sequences into the action recognition model.
  • The fourth aspect of the embodiment of the present invention provides an action recognition model training device, including: an image acquisition module configured to acquire a plurality of image sequences in which pedestrian action types are marked; a training data acquisition module configured to divide each image sequence into multiple sub-sequences to obtain a training data set; and a model training module configured to input the training data set into a neural network system and train the neural network system to obtain an action recognition model, where the neural network system is the action recognition system provided in the first aspect of the embodiment of the present invention.
  • The fifth aspect of the embodiment of the present invention provides an action recognition device, including: an image acquisition module configured to acquire an image sequence of a target object and divide the image sequence into multiple sub-sequences; and an action recognition module configured to input the sub-sequences into an action recognition model to generate an action recognition result, where the action recognition model is trained by the action recognition model training method provided in the second aspect of the embodiment of the present invention.
  • The sixth aspect of the embodiment of the present invention provides a computer device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so as to run the action recognition system provided in the first aspect of the embodiment of the present invention, or execute the action recognition model training method provided in the second aspect of the embodiment of the present invention, or execute the action recognition method provided in the third aspect of the embodiment of the present invention.
  • The seventh aspect of the embodiment of the present invention provides a computer-readable storage medium storing computer instructions, where the computer instructions are configured to cause a computer to run the action recognition system provided in the first aspect of the embodiment of the present invention, or execute the action recognition model training method provided in the second aspect of the embodiment of the present invention, or execute the action recognition method provided in the third aspect of the embodiment of the present invention.
  • the action recognition system includes an information separation network, a static feature network, a dynamic feature network, and a classification network.
  • The dynamic feature map and the static feature map in the images are first separated by the information separation network, and are then input into the dynamic feature network and the static feature network respectively.
  • The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network.
  • Calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition.
  • the action recognition system provided by the embodiment of the present invention can use less data for training to obtain an action recognition model, and the action recognition model trained by the action recognition system can realize accurate recognition of actions.
  • In the action recognition model training method and device, after the training data set is obtained, the training data set is input into the action recognition system provided in the first aspect of the embodiment of the present invention, and the action recognition system is trained to obtain the action recognition model.
  • The action recognition system provided by the first aspect of the embodiment of the present invention includes an information separation network, a static feature network, a dynamic feature network, and a classification network. The system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video. The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features; the displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network.
  • Therefore, the action recognition model training method provided by the embodiment of the present invention can obtain an action recognition model with less training data, and the action recognition model trained by the method and device provided by the embodiment of the present invention can recognize actions accurately.
  • The action recognition method and device provided by the embodiments of the present invention divide the image sequence of a target object into multiple subsequences after acquiring it, and input the subsequences into the action recognition model trained by the action recognition model training method provided in the second aspect of the embodiment of the present invention, which obtains the action recognition model by training the action recognition system.
  • The action recognition system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video.
  • The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition model trained by the training method provided in the second aspect of the embodiment of the present invention can recognize actions accurately; therefore, implementing the embodiment of the present invention achieves accurate recognition of actions.
  • Fig. 1 is a functional block diagram of an example of an action recognition system in an embodiment of the present invention
  • FIG. 2 is a functional block diagram of an example of a static feature network and/or a dynamic feature network in an embodiment of the present invention
  • Fig. 3 is a functional block diagram of an example of the feature difference and feature displacement sub-module in the embodiment of the present invention.
  • Fig. 4 is a functional block diagram of an example of a feature displacement unit in an embodiment of the present invention.
  • FIG. 5 is a functional block diagram of an example of a feature difference unit in an embodiment of the present invention.
  • Fig. 6 is a flowchart of an example of an action recognition model training method in an embodiment of the present invention.
  • FIG. 7 is a flowchart of an example of an action recognition method in an embodiment of the present invention.
  • Fig. 8 is a functional block diagram of an example of an action recognition model training device in an embodiment of the present invention.
  • FIG. 9 is a functional block diagram of an example of an action recognition device in an embodiment of the present invention.
  • FIG. 10 is a functional block diagram of an example of computer equipment in an embodiment of the present invention.
  • An embodiment of the present invention provides an action recognition system, as shown in FIG. 1 , including: an information separation network 11, a static feature network 12, a dynamic feature network 13, and a classification network 14.
  • The information separation network 11 includes a bandpass filter module 111 and a static feature extraction module 112.
  • the band-pass filtering module 111 is configured to extract a dynamic feature map according to the acquired multi-frame continuous images in a segment.
  • the bandpass filter module 111 sequentially extracts the dynamic feature map corresponding to each segment based on the continuous images in each segment.
  • the static feature extraction module 112 is configured to perform temporal average pooling on multiple frames of continuous images in a segment to obtain a feature map, and make a difference between the feature map and the dynamic feature map to obtain a static feature map.
  • the static feature extraction module 112 first performs temporal average pooling on multiple frames of continuous images in a segment through the temporal average pooling layer to obtain a feature map with a time dimension of 1, and combines the feature map with the segment The dynamic feature map corresponding to the segment is subtracted to obtain the static feature map of the segment.
  • the static feature extraction module 112 sequentially extracts the static feature maps corresponding to each segment based on the continuous images in each segment.
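  • As a minimal PyTorch-style sketch of the separation step described above (assumptions: a segment tensor of shape (B, C, K, H, W), and a `bandpass` callable standing in for the band-pass filtering module 111 whose output has a collapsed, broadcastable time dimension):

```python
import torch

def separate_information(segment: torch.Tensor, bandpass):
    """Split one segment into a dynamic feature map and a static feature map."""
    dynamic = bandpass(segment)                   # dynamic feature map from the band-pass module
    pooled = segment.mean(dim=2, keepdim=True)    # temporal average pooling -> time dimension of 1
    static = pooled - dynamic                     # static map = pooled frames minus dynamic map
    return dynamic, static
```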
  • the static feature network 12 is configured to perform feature displacement operations on the static feature maps corresponding to multiple segments, and calculate the difference features between the static feature maps corresponding to each segment to obtain static classification features.
  • When action recognition is performed through the action recognition system provided by the embodiment of the present invention, the static feature network 12 performs the feature displacement operation on the static feature map corresponding to each segment independently; that is, performing the feature displacement operation on the static feature map corresponding to any segment is not affected by the static feature maps corresponding to the other segments. However, when the static feature network 12 calculates the difference features between the static feature maps corresponding to the segments, it needs to combine the static feature maps corresponding to two adjacent segments to calculate the difference features.
  • The dynamic feature network 13 is configured to perform a feature displacement operation on the dynamic feature maps corresponding to multiple segments, and to calculate the difference features between the dynamic feature maps corresponding to the segments to obtain dynamic classification features.
  • The dynamic feature network 13 performs the feature displacement operation on the dynamic feature map corresponding to each segment independently; that is, performing the feature displacement operation on the dynamic feature map corresponding to any segment is not affected by the dynamic feature maps corresponding to the other segments. However, when the dynamic feature network 13 calculates the difference features between the dynamic feature maps corresponding to the segments, it needs to combine the dynamic feature maps corresponding to two adjacent segments to calculate the difference features.
  • the classification network 14 is configured to obtain action recognition results according to static classification features and dynamic classification features.
  • the classification network includes a classifier, and the static classification feature and the dynamic classification feature are analyzed by the classifier to obtain an action recognition result.
  • the action recognition system includes an information separation network 11, a static feature network 12, a dynamic feature network 13, and a classification network 14.
  • The dynamic feature map and the static feature map in the images are separated through the information separation network 11, and are then input into the dynamic feature network 13 and the static feature network 12 respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video.
  • The static feature network 12 and the dynamic feature network 13 perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition system provided by the embodiment of the present invention can be trained with less data to obtain an action recognition model, and the action recognition model trained by this system can recognize actions accurately.
  • the bandpass filtering module 111 includes a spatial convolution layer and a temporal convolution layer.
  • LoG_σ(x, y) denotes the Laplacian-of-Gaussian operator with parameter σ, i.e. LoG_σ(x, y) = -(1/(πσ⁴)) · (1 − (x² + y²)/(2σ²)) · exp(−(x² + y²)/(2σ²)); K denotes the number of image frames in a segment.
  • The spatial convolution layer has a convolution kernel of size k × k and is initialized with the Laplacian-of-Gaussian operator with parameter σ, with the sum of the convolution kernel values normalized to 1. The temporal convolution layer has a time step of s, and its convolution kernel values are initialized to preset values.
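  • One way the band-pass filtering module could be realised is sketched below in PyTorch; kernel size k, parameter σ, and time step s follow the description, while the temporal-kernel initialisation value is not reproduced in the source, so a uniform value is used purely as a placeholder.

```python
import math
import torch
import torch.nn as nn

def log_kernel(k: int, sigma: float) -> torch.Tensor:
    """k x k Laplacian-of-Gaussian kernel."""
    ax = torch.arange(k, dtype=torch.float32) - (k - 1) / 2.0
    y, x = torch.meshgrid(ax, ax, indexing="ij")
    r2 = x ** 2 + y ** 2
    log = -(1.0 / (math.pi * sigma ** 4)) * (1 - r2 / (2 * sigma ** 2)) * torch.exp(-r2 / (2 * sigma ** 2))
    # the description normalizes the sum of the kernel values to 1; since the raw LoG
    # sum is close to zero, the absolute sum is used here as one possible reading
    return log / log.abs().sum()

class BandPassFilter(nn.Module):
    """Sketch: LoG-initialised spatial convolution followed by a temporal convolution."""
    def __init__(self, channels: int = 3, k: int = 5, sigma: float = 1.0, s: int = 2):
        super().__init__()
        # depthwise spatial convolution initialised with the LoG kernel
        self.spatial = nn.Conv3d(channels, channels, (1, k, k),
                                 padding=(0, k // 2, k // 2), groups=channels, bias=False)
        self.spatial.weight.data.copy_(
            log_kernel(k, sigma).view(1, 1, 1, k, k).repeat(channels, 1, 1, 1, 1))
        # temporal convolution with time step s (placeholder uniform initialisation)
        self.temporal = nn.Conv3d(channels, channels, (s, 1, 1), stride=(s, 1, 1),
                                  groups=channels, bias=False)
        nn.init.constant_(self.temporal.weight, 1.0 / s)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, K, H, W) -- K consecutive frames of one segment
        return self.temporal(self.spatial(x))
```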
  • The static feature network 12 and the dynamic feature network 13 may have the same network structure or different network structures. When the static feature network 12 and the dynamic feature network 13 have the same network structure, the values of their network parameters are different.
  • At least one of the static feature network 12 and the dynamic feature network 13 includes: an image segmentation module 121 , an initial feature extraction module 122 , and at least one intermediate feature extraction module 123 .
  • the image segmentation module 121 is configured to segment the input feature map according to a first preset size to obtain a first feature vector.
  • The input data of the static feature network 12 is the static feature map, and the image segmentation module 121 in the static feature network 12 segments the static feature map; the input data of the dynamic feature network 13 is the dynamic feature map, and the image segmentation module 121 in the dynamic feature network 13 segments the dynamic feature map.
  • Assume the input feature map has a spatial size of H × W, where H and W represent the height and width of the image and 3 represents the number of channels of the image. The image segmentation module 121 divides the image into blocks of size 4 × 4 and flattens each 4 × 4 block obtained by segmentation into a vector, yielding a feature of size (H/4) × (W/4) × 48, where (H/4) × (W/4) indicates the number of blocks and 48 (= 4 × 4 × 3) is the number of channels.
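  • A small PyTorch-style sketch of this segmentation step (the function name and tensor layout are illustrative):

```python
import torch

def partition_patches(x: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """x: (B, 3, H, W) -> (B, (H/patch) * (W/patch), 3 * patch * patch)."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)            # (B, H/4, W/4, C, 4, 4)
    return x.reshape(b, (h // patch) * (w // patch), c * patch * patch)  # 48 = 4*4*3 channels
```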
  • the initial feature extraction module 122 includes a linear embedding submodule 1221 and at least one feature difference and feature displacement submodule 1222.
  • the linear embedding submodule 1221 is configured to convert the first feature vector according to the preset number of channels to obtain the second feature vector;
  • the feature difference and feature displacement sub-module 1222 is configured to perform a feature displacement operation on the second feature vector, and calculate the difference feature between the second feature vectors corresponding to each segment to obtain the initial classification feature.
  • The linear embedding sub-module 1221 projects the first feature vector to a feature of size (H/4) × (W/4) × C, where C represents the number of channels. The initial feature extraction module 122 includes two consecutive feature difference and feature displacement sub-modules 1222: after the first feature vector is processed by the linear embedding sub-module 1221 to obtain the second feature vector, the two consecutive feature difference and feature displacement sub-modules 1222 process the second feature vector to obtain the initial classification features.
  • the intermediate feature extraction module 123 includes a feature merging submodule 1231 and at least one feature difference and feature displacement submodule 1222.
  • the feature merging submodule 1231 is configured to merge the initial classification features according to the second preset size to obtain a third feature vector;
  • the feature difference and feature displacement sub-module 1222 is configured to perform a feature displacement operation on the third feature vector, and calculate a difference feature between the third feature vectors corresponding to each segment to obtain a classification feature.
  • The feature merging sub-module 1231 in the intermediate feature extraction module 123 merges the initial classification features obtained in the previous stage by combining blocks according to a size of 2 × 2 and flattening them into vectors; the merged feature then passes through at least one feature difference and feature displacement sub-module 1222 before being output.
  • the feature difference and feature displacement sub-modules 1222 in different intermediate feature extraction modules 123 may be the same or different.
  • the number of feature difference and feature displacement sub-modules 1222 in the intermediate feature extraction module 123 may be 2, 6, and so on.
  • At least one of the static feature network 12 and the dynamic feature network 13 includes three intermediate feature extraction modules 123: a first intermediate feature extraction module 123, a second intermediate feature extraction module 123, and a third intermediate feature extraction module 123.
  • the image segmentation module 121, the initial feature extraction module 122, the first intermediate feature extraction module 123, the second intermediate feature extraction module 123, and the third intermediate feature extraction module 123 are connected in sequence. That is, in the embodiment of the present invention, the output data of the image segmentation module 121 is the input data of the initial feature extraction module 122, the output data of the initial feature extraction module 122 is the input data of the first intermediate feature extraction module 123, and the first intermediate feature The output data of the extraction module 123 is the input data of the second intermediate feature extraction module 123 , and the output data of the second intermediate feature extraction module 123 is the input data of the third intermediate feature extraction module 123 .
  • The number of feature difference and feature displacement sub-modules 1222 is the same in the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123.
  • the number of feature difference and feature displacement submodules 1222 in the second intermediate feature extraction module 123 is greater than the feature difference and feature displacement submodules in the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123 The number of 1222.
  • the number of feature difference and feature displacement sub-modules 1222 in the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123 is 2, and the second intermediate feature extraction module The number of feature difference and feature displacement sub-modules 1222 in 123 is six.
  • The static feature network 12 and the dynamic feature network 13 produce final output features of the same size, determined by the above segmentation and merging operations.
  • the feature difference and feature displacement submodule 1222 includes a first normalization layer, a feature displacement unit, a feature difference unit, a second normalization layer, and a first full connection Layer, the first GELU function layer, the second fully connected layer, wherein, the first normalization layer, feature displacement unit, feature difference unit, the second normalization layer, the first fully connected layer, the first GELU function layer, The second fully connected layer is connected sequentially.
  • the input data of the second normalization layer is the first residual calculation result, and the first residual calculation result is calculated through the input data of the first normalization layer and the output data of the feature difference unit.
  • the output data of the feature difference and feature displacement sub-module 1222 is the second residual calculation result, which is calculated through the input data of the second normalization layer and the output data of the second fully connected layer.
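  • The layer order and the two residual connections described above can be sketched as follows in PyTorch; the MLP expansion ratio is an illustrative assumption, and the shift and difference units are passed in as callables (their internals follow the descriptions in the next paragraphs, with the difference unit assumed to compare adjacent segments internally).

```python
import torch.nn as nn

class DiffShiftBlock(nn.Module):
    """Sketch of one feature difference and feature displacement sub-module."""
    def __init__(self, dim, shift_unit, diff_unit, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)            # first normalization layer
        self.shift_unit = shift_unit              # feature displacement unit
        self.diff_unit = diff_unit                # feature difference unit
        self.norm2 = nn.LayerNorm(dim)            # second normalization layer
        self.mlp = nn.Sequential(                 # first FC -> GELU -> second FC
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # x: (B, N, H, W, C) features of N segments
        r1 = x + self.diff_unit(self.shift_unit(self.norm1(x)))   # first residual calculation result
        return r1 + self.mlp(self.norm2(r1))                      # second residual calculation result
```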
  • The feature displacement unit includes a first channel fully connected layer, a horizontal feature displacement layer, a second channel fully connected layer, a vertical feature displacement layer, a third channel fully connected layer, and a fourth channel fully connected layer. The horizontal feature displacement layer is connected to the second channel fully connected layer, the vertical feature displacement layer is connected to the third channel fully connected layer, and the structure formed by the horizontal feature displacement layer and the second channel fully connected layer is parallel to the structure formed by the vertical feature displacement layer and the third channel fully connected layer.
  • the first channel full connection layer is configured to perform full connection on the channels of the input data to obtain a full connection result, and input the full connection result to the horizontal feature displacement layer and the vertical feature displacement layer respectively.
  • the horizontal feature displacement layer is configured to perform horizontal displacement on the fully connected result to obtain the horizontal displacement result, and input the horizontal displacement result into the second channel fully connected layer.
  • the full connection result has three dimensions: height, width, and channel.
  • For example, if the number of displacement groups is 3 and the displacement size is 1, the channel feature maps of the groups are shifted horizontally according to [+1, 0, -1], and the vacated positions are filled with zeros. That is, the fully connected result is divided into 3 groups of data along the channel dimension; the first group is moved one unit in one horizontal direction and the vacated positions are filled with zeros, the second group remains unchanged, and the third group is moved one unit in the opposite horizontal direction and the vacated positions are filled with zeros.
  • If the number of displacement groups is 5 and the displacement size is 2, the channel feature maps of the groups are shifted horizontally according to [+4, +2, 0, -2, -4], and the vacated positions are filled with zeros.
  • the vertical feature displacement layer is configured to perform vertical displacement on the fully connected result to obtain the vertical displacement result, and input the vertical displacement result into the third channel fully connected layer.
  • the only difference between performing vertical displacement on the fully connected result and horizontal displacement on the fully connected result is that the moving direction of the vertical displacement is the vertical direction, and the moving direction of the horizontal displacement is the horizontal direction.
  • the fourth channel fully connected layer is configured to process the sum of the output results of the second channel fully connected layer and the third channel fully connected layer to obtain the output result of the feature displacement unit.
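  • A PyTorch-style sketch of the feature displacement unit follows: a channel fully connected layer, parallel horizontal and vertical grouped shifts with zero filling, per-branch channel fully connected layers, and a final channel fully connected layer on the sum of the two branches. The group count and shift size default to the 3-group, size-1 example above; the (B, H, W, C) layout and function names are assumptions.

```python
import torch
import torch.nn as nn

def grouped_shift(x, dim, groups=3, size=1):
    """Shift channel groups of x along `dim` by e.g. [+1, 0, -1], zero-filling vacated positions."""
    chunks = torch.chunk(x, groups, dim=-1)
    offsets = [size * (groups // 2 - i) for i in range(groups)]   # 3 groups, size 1 -> [+1, 0, -1]
    out = []
    for chunk, off in zip(chunks, offsets):
        shifted = torch.zeros_like(chunk)
        if off == 0:
            shifted = chunk
        elif off > 0:   # move content towards larger indices
            shifted.narrow(dim, off, chunk.size(dim) - off).copy_(chunk.narrow(dim, 0, chunk.size(dim) - off))
        else:           # move content towards smaller indices
            shifted.narrow(dim, 0, chunk.size(dim) + off).copy_(chunk.narrow(dim, -off, chunk.size(dim) + off))
        out.append(shifted)
    return torch.cat(out, dim=-1)

class FeatureShiftUnit(nn.Module):
    """Sketch of the feature displacement unit."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)   # first channel fully connected layer
        self.fc2 = nn.Linear(dim, dim)   # second channel fully connected layer (horizontal branch)
        self.fc3 = nn.Linear(dim, dim)   # third channel fully connected layer (vertical branch)
        self.fc4 = nn.Linear(dim, dim)   # fourth channel fully connected layer (fusion)

    def forward(self, x):                # x: (B, H, W, C)
        y = self.fc1(x)
        h = self.fc2(grouped_shift(y, dim=2))   # horizontal shift along the width dimension
        v = self.fc3(grouped_shift(y, dim=1))   # vertical shift along the height dimension
        return self.fc4(h + v)
```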
  • The feature difference unit includes: an input layer, a maximum pooling layer, a third fully connected layer, a second GELU function layer, a fourth fully connected layer, an upsampling layer, a fifth fully connected layer, a third GELU function layer, a sixth fully connected layer, and a feature difference output layer.
  • the input layer, the maximum pooling layer, the third fully connected layer, the second GELU function layer, the fourth fully connected layer, the upsampling layer, and the feature difference output layer are connected in sequence.
  • the input layer, the fifth fully connected layer, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer are connected in sequence.
  • the input layer is configured to make a difference between the input feature corresponding to the segment at the current moment and the input feature corresponding to the segment at the previous moment, and input the difference feature into the maximum pooling layer and the fifth fully connected layer respectively.
  • The feature difference output layer is configured to sum the output results of the upsampling layer and the sixth fully connected layer to obtain a summation result, to multiply the summation result point by point with the input feature corresponding to the segment at the previous moment to obtain a multiplication result, and to add the multiplication result to the input feature corresponding to the segment at the previous moment to obtain the output result of the feature difference unit.
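  • The two branches and the output combination described above can be sketched as follows for one pair of adjacent segments; the (B, H, W, C) layout, the pooling factor, and the channel reduction ratio are illustrative assumptions, and applying the unit across all adjacent segment pairs is left to the caller.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDiffUnit(nn.Module):
    """Sketch of the feature difference unit for one pair of adjacent segments."""
    def __init__(self, dim, reduction=4, pool=2):
        super().__init__()
        self.pool = pool
        self.fc3 = nn.Linear(dim, dim // reduction)   # third fully connected layer
        self.fc4 = nn.Linear(dim // reduction, dim)   # fourth fully connected layer
        self.fc5 = nn.Linear(dim, dim // reduction)   # fifth fully connected layer
        self.fc6 = nn.Linear(dim // reduction, dim)   # sixth fully connected layer
        self.act = nn.GELU()

    def forward(self, cur, prev):
        # cur, prev: (B, H, W, C) features of the current and previous segment
        diff = cur - prev                                            # input layer: difference features
        # pooled branch: max pooling -> FC -> GELU -> FC -> upsampling
        d = F.max_pool2d(diff.permute(0, 3, 1, 2), self.pool)        # (B, C, H/pool, W/pool)
        d = self.fc4(self.act(self.fc3(d.permute(0, 2, 3, 1))))      # back to channel-last
        d = F.interpolate(d.permute(0, 3, 1, 2), size=diff.shape[1:3], mode="nearest")
        branch_a = d.permute(0, 2, 3, 1)
        # direct branch: FC -> GELU -> FC on the raw difference
        branch_b = self.fc6(self.act(self.fc5(diff)))
        combined = branch_a + branch_b                               # feature difference output layer: sum
        return prev + prev * combined                                # point-wise multiply, then add back
```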
  • The classification network 14 includes a first temporal average pooling layer 141, a second temporal average pooling layer 143, a static feature classifier 142, a dynamic feature classifier 144, and a recognition result output layer 145.
  • the first temporal average pooling layer 141 is configured to perform temporal average pooling on static classification features corresponding to multiple segments, and input the pooling result to the static feature classifier 142 .
  • the second temporal average pooling layer 143 is configured to perform temporal average pooling on the dynamic classification features corresponding to multiple segments, and input the pooling result to the dynamic feature classifier 144 .
  • When the action recognition system provided by the embodiment of the present invention is used for action recognition, assuming the video is divided into N segments, the information separation network 11, the static feature network 12, and the dynamic feature network 13 sequentially process the N segments to obtain N static classification features and N dynamic classification features. The first temporal average pooling layer 141 performs temporal average pooling on the N static classification features to obtain a time-averaged static classification feature, and the second temporal average pooling layer 143 performs temporal average pooling on the N dynamic classification features to obtain a time-averaged dynamic classification feature; with these time-averaged static and dynamic classification features, more accurate recognition of actions can be achieved.
  • the static feature classifier 142 is configured to obtain a first classification result according to the static classification feature.
  • the dynamic feature classifier 144 is configured to obtain a second classification result according to the dynamic classification feature.
  • the recognition result output layer 145 is configured to take the weighted average result of the first classification result and the second classification result as an output result.
  • the static feature classifier 142 and the dynamic feature classifier 144 are Softmax classifiers.
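  • A compact PyTorch-style sketch of the classification network follows; the fusion weight `alpha` is an illustrative assumption, since the text only specifies a weighted average of the two classification results.

```python
import torch
import torch.nn as nn

class ClassificationNetwork(nn.Module):
    """Sketch: temporal average pooling, two Softmax classifiers, weighted fusion."""
    def __init__(self, static_dim, dynamic_dim, num_classes, alpha=0.5):
        super().__init__()
        self.static_head = nn.Linear(static_dim, num_classes)    # static feature classifier 142
        self.dynamic_head = nn.Linear(dynamic_dim, num_classes)  # dynamic feature classifier 144
        self.alpha = alpha

    def forward(self, static_feats, dynamic_feats):
        # static_feats: (B, N, D_s), dynamic_feats: (B, N, D_d) -- one feature per segment
        s = static_feats.mean(dim=1)                              # first temporal average pooling layer
        d = dynamic_feats.mean(dim=1)                             # second temporal average pooling layer
        p_static = torch.softmax(self.static_head(s), dim=-1)     # first classification result
        p_dynamic = torch.softmax(self.dynamic_head(d), dim=-1)   # second classification result
        return self.alpha * p_static + (1 - self.alpha) * p_dynamic  # weighted average output
```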
  • An embodiment of the present invention provides a method for training an action recognition model, as shown in FIG. 6 , including:
  • Step S21 Obtain multiple image sequences, in which the types of pedestrian actions are marked.
  • Step S22 Divide each image sequence into multiple subsequences to obtain a training data set.
  • Step S23 Input the training data set into the neural network system, train the neural network system until the loss value of the loss function meets the loss condition, and obtain the action recognition model, the neural network system is the action recognition system provided in any of the above-mentioned embodiments, For details about the action recognition system, refer to the above-mentioned embodiments.
  • the initialized action recognition system is trained to obtain an action recognition model.
  • The Gaussian Laplacian operator is used to initialize the bandpass filter module 111, and a pre-trained model is used to initialize the feature difference and feature displacement networks.
  • the pre-training model is obtained through pre-training on ImageNet or other large data sets.
  • the action recognition system can be fully trained with a large-scale data set, or fine-tuned based on a pre-trained model.
  • the training data set is input into the action recognition system provided in the above embodiment, and the action recognition system is trained to obtain the action recognition model.
  • The action recognition system provided in the above embodiment includes an information separation network, a static feature network, a dynamic feature network, and a classification network. The system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively.
  • The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition model training method provided by the embodiment of the invention can obtain an action recognition model with less training data, and the action recognition model so trained can recognize actions accurately.
  • the loss function is jointly obtained by using an orthogonal projection loss function and a cross-entropy loss function.
  • The loss function is obtained by adding the product of the orthogonal projection loss function and a hyperparameter controlling the weight of the orthogonal projection loss to the cross-entropy loss function. Combining the orthogonal projection loss L_opl with the cross-entropy loss L_ce gives the final loss function L = L_ce + λ · L_opl, where λ is the hyperparameter controlling the weight of the orthogonal projection loss.
  • The orthogonal projection loss function is combined with the cross-entropy loss to obtain the final loss function. Introducing the orthogonal projection loss orthogonalizes the intermediate-layer features, achieving inter-class separation and intra-class clustering, so that the trained action recognition model can recognize actions more accurately.
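  • The following sketch shows the combined loss L = L_ce + λ · L_opl in PyTorch. The filing does not spell out the exact form of the orthogonal projection loss; the version below (same-class features pulled towards cosine similarity 1, different-class features towards 0) is a common formulation used here only as an illustration, and the function and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def orthogonal_projection_loss(features, labels, eps=1e-8):
    # features: (B, D) intermediate-layer features, labels: (B,) class indices
    f = F.normalize(features, dim=1)
    sim = f @ f.t()                                   # pairwise cosine similarities
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    same.fill_diagonal_(0)                            # ignore self-similarity
    diff = 1.0 - same
    diff.fill_diagonal_(0)
    s = (sim * same).sum() / (same.sum() + eps)       # mean same-class similarity
    d = (sim * diff).sum() / (diff.sum() + eps)       # mean cross-class similarity
    return (1.0 - s) + d.abs()                        # push s towards 1 and d towards 0

def total_loss(logits, features, labels, lam=0.1):
    # lam is the hyperparameter controlling the weight of the orthogonal projection loss
    return F.cross_entropy(logits, labels) + lam * orthogonal_projection_loss(features, labels)
```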
  • step S21 includes:
  • the images in the image sequence are proportionally scaled to obtain a scaled image, and the size of the short side of the scaled image is within a preset range.
  • the preset interval may be [256, 320].
  • the zoomed image is randomly cropped to obtain a cropped image, and the size of the cropped image satisfies a preset condition.
  • the size of the cropped image is 224 ⁇ 224.
  • The image sequence formed by the cropped images is divided into multiple subsequences to obtain the training data set.
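  • A short torchvision-based sketch of this preprocessing is given below; picking a random short-side target within [256, 320] is one way to satisfy the stated condition, and applying the identical crop to every frame of a subsequence is left to the caller.

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as TF

def preprocess_frame(img):
    short_side = random.randint(256, 320)   # target size of the short side within [256, 320]
    img = TF.resize(img, short_side)        # proportional scaling: short side becomes the target size
    return transforms.RandomCrop(224)(img)  # random 224 x 224 crop
```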
  • An embodiment of the present invention provides an action recognition method, as shown in FIG. 7 , including:
  • Step S31 Acquire the image sequence of the target object, and divide the image sequence into multiple subsequences.
  • Step S32 Input the subsequence into the action recognition model to generate an action recognition result.
  • the action recognition model is trained by the action recognition model training method provided in the above-mentioned embodiment. For details about the action recognition model, refer to the description in the above-mentioned embodiment. This will not be repeated here.
  • the action recognition model training method provided by the above embodiment obtains the action recognition model by training the action recognition system.
  • The action recognition system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively.
  • The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition model trained by the training method provided in the above embodiment can recognize actions accurately; therefore, implementing the embodiments of the present invention achieves accurate recognition of actions.
  • the method further includes:
  • the images in the image sequence are proportionally scaled to obtain a scaled image, and the size of the short side of the scaled image is within a preset range.
  • the zoomed image is randomly cropped to obtain a cropped image, the size of the cropped image satisfies a preset condition.
  • the cropped image is used as an image in the subsequence, and the step of inputting the subsequence into the action recognition model is performed.
  • The image is proportionally scaled so that the size of its short side falls within the preset interval, and is then randomly cropped to the input size accepted by the network. This achieves data augmentation, and the scaled and cropped data allow the action recognition model to analyze the effective information in the image, thereby improving the analysis efficiency and the accuracy of the analysis results.
  • An embodiment of the present invention provides an action recognition model training device, as shown in FIG. 8 , including:
  • the image acquisition module 21 is configured to acquire a plurality of image sequences, in which the types of pedestrian actions are marked. For details, refer to the description of step S21 in the above embodiment, which will not be repeated here.
  • the training data acquisition module 22 is configured to divide each image sequence into multiple subsequences to obtain a training data set. For details, refer to the description of step S22 in the above-mentioned embodiment, and details will not be repeated here.
  • The model training module 23 is configured to input the training data set into the neural network system and train the neural network system to obtain an action recognition model, where the neural network system is the action recognition system provided in the above-mentioned embodiment. For details, refer to the description of step S23 in the above-mentioned embodiment, which will not be repeated here.
  • the action recognition model training device provided by the embodiment of the present invention, after obtaining the training data set, inputs the training data set into the action recognition system provided in the above embodiment, and trains the action recognition system to obtain the action recognition model.
  • The action recognition system provided in the above embodiment includes an information separation network, a static feature network, a dynamic feature network, and a classification network. The system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively.
  • The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition model training device provided by the embodiment of the invention can obtain an action recognition model with less training data, and the action recognition model so trained can recognize actions accurately.
  • An embodiment of the present invention provides an action recognition device, as shown in FIG. 9 , including:
  • the image acquisition module 31 is configured to acquire an image sequence of the target object, and divide the image sequence into multiple subsequences. For details, refer to the description of step S31 in the above-mentioned embodiment, which will not be repeated here.
  • the action recognition module 32 is configured to input the subsequence into the action recognition model to generate an action recognition result.
  • The action recognition model is trained by the action recognition model training method provided in the above embodiment. For details, refer to the description of step S32 in the above embodiment, which will not be repeated here.
  • The action recognition device provided by the embodiment of the present invention divides the image sequence of the target object into multiple subsequences after acquiring it, and inputs the subsequences into the action recognition model trained by the action recognition model training method provided in the above embodiment, which obtains the action recognition model by training the action recognition system.
  • The action recognition system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively. The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features.
  • The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition model trained by the training method provided in the above embodiments can recognize actions accurately; therefore, implementing the embodiments of the present invention achieves accurate recognition of actions.
  • An embodiment of the present invention provides a computer device.
  • the computer device mainly includes one or more processors 41 and a memory 42 , and one processor 41 is taken as an example in FIG. 10 .
  • the computer device may also include: an input device 43 and an output device 44 .
  • the processor 41 , the memory 42 , the input device 43 and the output device 44 may be connected through a bus or in other ways. In FIG. 10 , connection through a bus is taken as an example.
  • the processor 41 may be a central processing unit (Central Processing Unit, CPU).
  • The processor 41 can also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination of the above-mentioned types of chips.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • The memory 42 may include a program storage area and a data storage area, wherein the program storage area may store the operating system and the application program required by at least one function, and the data storage area may store data created according to the use of the motion recognition device, etc.
  • the memory 42 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • the memory 42 may optionally include a memory that is remotely located relative to the processor 41, and these remote memories may be connected to the motion recognition system, or the motion recognition model training device, or the motion recognition device through a network.
  • the input device 43 can receive calculation requests (or other digital or character information) input by the user, and generate key signal input related to the motion recognition system, or the motion recognition model training device, or the motion recognition device.
  • the output device 44 may include a display device such as a display screen for outputting calculation results.
  • An embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions can execute the action recognition model training method or the action recognition method in any of the above-mentioned method embodiments.
  • The storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD) or a solid-state drive (Solid-State Drive, SSD), etc.; the storage medium may also include a combination of the above-mentioned types of memory.
  • An embodiment of the present invention provides an action recognition system, including: a bandpass filter module configured to extract dynamic feature maps based on multiple frames of continuous images in a segment; a static feature extraction module configured to obtain static feature maps based on the multiple frames of continuous images in a segment; a static feature network configured to perform feature displacement operations on the static feature maps and calculate the difference features between the static feature maps corresponding to the segments to obtain static classification features; a dynamic feature network configured to perform feature displacement operations on the dynamic feature maps and calculate the difference features between the dynamic feature maps corresponding to the segments to obtain dynamic classification features; and a classification network configured to obtain the action recognition result based on the static classification features and the dynamic classification features.

Abstract

The present invention provides an action recognition system, method, and apparatus, a model training method and apparatus, a computer device, and a computer readable storage medium. The action recognition system comprises: a band-pass filtering module, configured to extract a dynamic feature map according to a plurality of frames of continuous images in one segment; a static feature extraction module, configured to obtain a static feature map according to the plurality of frames of continuous images in one segment; a static feature network, configured to perform feature displacement operation on the static feature maps, and calculate difference features between the static feature maps corresponding to the segments so as to obtain static classification features; a dynamic feature network, configured to perform feature displacement operation on the dynamic feature maps, and calculate difference features between the dynamic feature maps corresponding to the segments so as to obtain dynamic classification features; and a classification network, configured to obtain an action recognition result according to the static classification features and the dynamic classification features. By implementing the present invention, an action recognition model can be obtained by using less data for training, and accurate recognition of an action can be realized.

Description

一种动作识别系统、方法、装置及模型训练方法、装置、计算机设备及计算机可读存储介质An action recognition system, method, device, and model training method, device, computer equipment, and computer-readable storage medium
相关申请的交叉引用Cross References to Related Applications
本发明基于申请号为202210179444.0、申请日为2022年02月25日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本发明作为参考。The present invention is based on a Chinese patent application with application number 202210179444.0 and a filing date of February 25, 2022, and claims the priority of this Chinese patent application. The entire content of this Chinese patent application is hereby incorporated by reference.
技术领域technical field
本发明涉及图像处理技术领域,涉及一种动作识别系统、方法、装置及模型训练方法、装置、计算机设备及计算机可读存储介质。The present invention relates to the technical field of image processing, and relates to an action recognition system, method, device, model training method, device, computer equipment, and computer-readable storage medium.
背景技术Background technique
当前监控摄像头非常普及,无论是在公司、工厂、商场还是马路、火车站,随处都可以看到监控摄像头的存在。然而单纯依靠摄像头难以达到实时监测违规、异常行为的目的,当有异常行为发生后,逐帧翻看监控视频也非常耗时耗力,并且容易遗漏。如果可以通过动作识别技术,进行实时的特定异常行为检测,可以大大节省人力物力,并且提高效率。因此动作识别有着重要的实用价值。Surveillance cameras are very popular at present, whether in companies, factories, shopping malls, roads, or train stations, you can see the existence of surveillance cameras everywhere. However, relying solely on the camera is difficult to achieve the purpose of real-time monitoring of violations and abnormal behaviors. When abnormal behaviors occur, it is very time-consuming and labor-intensive to flip through the surveillance video frame by frame, and it is easy to miss. If motion recognition technology can be used to detect specific abnormal behaviors in real time, it can greatly save manpower and material resources and improve efficiency. Therefore, action recognition has important practical value.
视频的动作识别算法需要提取出视频帧之间的时间信息,需要网络模型具有时间建模的能力。基于深度学习的动作识别技术主要分为:基于双流网络的方法、基于(three-dimensional)3D卷积网络的方法。其中,基于双流网络的方法采用光流作为时间信息,需要提前计算光流并将其储存在本地硬盘,对于大数据集来说往往需要占用很大的内存。同时,由于需要提前计算光流,基于双流网络的方法实时性的效果也较差。而基于3D卷积网络的方法依靠3D卷积来达到时间建模的效果,需要更多的视频帧作为输入,参数量很大,训练网络需要花费大量的算力,也很难部署。近年来,随着Transformer在自然语言处理和计算机视觉领域的成功,许多研究者也将Transformer应用在了视频的动作识别领域中,并取得了很好的效果,然而Transformer参数量巨大,且往往需要大量的数据才能拟合,因此其实时性和实际应用也难以保证。The video action recognition algorithm needs to extract the time information between video frames, and the network model needs to have the ability of time modeling. The action recognition technology based on deep learning is mainly divided into: a method based on a two-stream network, and a method based on a (three-dimensional) 3D convolutional network. Among them, the method based on the dual-stream network uses optical flow as time information, and needs to calculate the optical flow in advance and store it in the local hard disk, which often requires a large amount of memory for large data sets. At the same time, due to the need to calculate the optical flow in advance, the real-time effect of the method based on the dual-stream network is also poor. However, the method based on 3D convolutional network relies on 3D convolution to achieve the effect of temporal modeling, which requires more video frames as input, and has a large number of parameters. Training the network requires a lot of computing power and is difficult to deploy. In recent years, with the success of Transformer in the fields of natural language processing and computer vision, many researchers have also applied Transformer in the field of video action recognition and achieved good results. However, Transformer has a huge number of parameters and often requires A large amount of data can be fitted, so its real-time and practical application are difficult to guarantee.
发明内容Contents of the invention
因此,本发明要解决的技术问题在于克服现有技术中需要大量的数据才能拟合形成用于识别动作的模型的缺陷,从而提供一种动作识别系统、方法、装置及模型训练方法、装置、计算机设备及计算机可读存储介质。Therefore, the technical problem to be solved by the present invention is to overcome the defect in the prior art that a large amount of data is required to fit and form a model for recognizing actions, thereby providing an action recognition system, method, device, and model training method, device, Computer equipment and computer-readable storage media.
本发明实施例第一方面提供了一种动作识别系统,包括:信息分离网络、静态特征网络、动态特征网络、分类网络,信息分离网络包括带通滤波模块、静态特征提取模块,带通滤波模块被配置为根据一个分段中的多帧连续图像提取动态特征图;静态特征提取模块被配置为对一个分段中的多帧连续图像进行时间平均池化,得到特征图,并将特征图与动态特征图作差,得到静态特征图;静态特征网络被配置为对多个分段对应的静态特征图进行特征位移操作,以及计算各分段对应的静态特征图之间的差异特征,得到静态分类特征;动态特征网络被配置为对多个分段对应的动态特征图进行特征位移操作,以及计算各分段对应的动态特征图之间的差异特征,得到动态分类特征;分类网络被配置为根据静态分类特征和动态分类特征得到动作识别结果。The first aspect of the embodiment of the present invention provides an action recognition system, including: an information separation network, a static feature network, a dynamic feature network, and a classification network. The information separation network includes a band-pass filter module, a static feature extraction module, and a band-pass filter module. It is configured to extract dynamic feature maps based on multiple frames of continuous images in a segment; the static feature extraction module is configured to perform temporal average pooling on multiple frames of continuous images in a segment to obtain feature maps, and combine the feature maps with The dynamic feature map is subtracted to obtain the static feature map; the static feature network is configured to perform feature displacement operations on the static feature maps corresponding to multiple segments, and calculate the difference between the static feature maps corresponding to each segment to obtain the static feature map. Classification features; the dynamic feature network is configured to perform feature displacement operations on the dynamic feature maps corresponding to multiple segments, and calculate the difference between the dynamic feature maps corresponding to each segment to obtain dynamic classification features; the classification network is configured as The action recognition result is obtained according to the static classification feature and the dynamic classification feature.
其中,在本发明实施例提供的动作识别系统中,带通滤波模块包括空间卷积层和时间卷积层。Wherein, in the action recognition system provided by the embodiment of the present invention, the bandpass filter module includes a spatial convolution layer and a temporal convolution layer.
其中,在本发明实施例提供的动作识别系统中,静态特征网络和动态特征网络中的至少之一包括:图像分割模块、初始特征提取模块、至少一个中间特征提取模块,图像分割模块被配置为按照第一预设大小对输入特征图进行分割,得到第一特征向量;初始特征提取模块包括线性嵌入子模块和至少一个特征差异与特征位移子模块,线性嵌入子模块被配置为按照预设通道数对第一特征向量进行转换,得到第二特征向量;特征差异与特征位移子模块被配置为对第二特征向量进行特征位移操作,以及计算各分段对应的第二特征向量之间的差异特征,得到初始分类特征;中间特征提取模块包括特征合并子模块和至少一个特征差异与特征位移子模块,特征合并子模块被配置为按照第二预设大小对初始分类特征进行合并,得到第三特征向量;特征差异与特征位移子模块被配置为对第三特征向量进行特征位移操作,以及计算各分段对应的第三特征向量之间的差异特征,得到分类特征。Wherein, in the action recognition system provided by the embodiment of the present invention, at least one of the static feature network and the dynamic feature network includes: an image segmentation module, an initial feature extraction module, and at least one intermediate feature extraction module, and the image segmentation module is configured as Segment the input feature map according to the first preset size to obtain the first feature vector; the initial feature extraction module includes a linear embedding sub-module and at least one feature difference and feature displacement sub-module, and the linear embedding sub-module is configured to follow the preset channel Convert the first eigenvector to obtain the second eigenvector; the feature difference and feature displacement sub-module is configured to perform a feature displacement operation on the second eigenvector, and calculate the difference between the second eigenvectors corresponding to each segment features to obtain the initial classification features; the intermediate feature extraction module includes a feature merging submodule and at least one feature difference and feature displacement submodule, and the feature merging submodule is configured to merge the initial classification features according to the second preset size to obtain the third The feature vector; feature difference and feature displacement sub-module is configured to perform a feature displacement operation on the third feature vector, and calculate a difference feature between the third feature vectors corresponding to each segment to obtain a classification feature.
其中,在本发明实施例提供的动作识别系统中,静态特征网络和动态特征网络中的至少之一包括三个中间特征提取模块:第一中间特征提取模块、第二中间特征提取模块、第三中间特征提取模块,图像分割模块、初始特征提取模块、第一中间特征提取模块、第二中间特征提取模块、第三中间特征提取模块依次连接;初始特征提取模块、第一中间特征提取模块、第三中间特征提取模块中的特征差异与特征位移子模块的数量相同;第二中间特征提取模块中的特征差异与特征位移子模块的数量大于初始特征提取模块、第一中间特征提取模块、第三中间特征提取模块中的特征差异与特征位移子模块的数量。Wherein, in the action recognition system provided by the embodiment of the present invention, at least one of the static feature network and the dynamic feature network includes three intermediate feature extraction modules: a first intermediate feature extraction module, a second intermediate feature extraction module, a third intermediate feature extraction module, and a third intermediate feature extraction module. The intermediate feature extraction module, the image segmentation module, the initial feature extraction module, the first intermediate feature extraction module, the second intermediate feature extraction module, and the third intermediate feature extraction module are sequentially connected; the initial feature extraction module, the first intermediate feature extraction module, the second The number of feature differences and feature displacement sub-modules in the three intermediate feature extraction modules is the same; the number of feature differences and feature displacement sub-modules in the second intermediate feature extraction module is greater than that of the initial feature extraction module, the first intermediate feature extraction module, and the third The number of feature difference and feature displacement sub-modules in the intermediate feature extraction module.
其中,在本发明实施例提供的动作识别系统中,特征差异与特征位移子模块包括,第一归一化层、特征位移单元、特征差异单元、第二归一化层、第一全连接层、第一GELU函数层、第二全连接层,其中,第一归一化层、特征位移单元、特征差异单元、第二归一化层、第一全连接层、第一GELU函数层、第二全连接层依次连接;第二归一化层的输入数据为第一残差计算结果;所述第一残差计算结果是通过所述第一归一化层的输入数据与所述特征差异单元的输出数据计算的;特征差异与特征位移子模块的输出数据为第二残差计算结果;所述第二残差计算结果是通过所述第二归一化层的输入数据与所述第二全连接层的输出数据计算的。Among them, in the action recognition system provided by the embodiment of the present invention, the feature difference and feature displacement sub-module includes a first normalization layer, a feature displacement unit, a feature difference unit, a second normalization layer, and a first fully connected layer , the first GELU function layer, the second fully connected layer, wherein, the first normalization layer, the feature displacement unit, the feature difference unit, the second normalization layer, the first fully connected layer, the first GELU function layer, the second The two fully connected layers are connected in turn; the input data of the second normalization layer is the first residual calculation result; the first residual calculation result is the difference between the input data of the first normalization layer and the feature The output data of the unit is calculated; the output data of the feature difference and feature displacement sub-module is the second residual calculation result; the second residual calculation result is obtained through the input data of the second normalization layer and the first The output data of the second fully connected layer is calculated.
其中,在本发明实施例提供的动作识别系统中,特征位移单元中包括第一信道全连接层、水平特征位移层、第二信道全连接层、竖直特征位移层、第三信道全连接层、第四信道全连接层,其中,第一信道全连接层被配置为对输入数据的信道进行全连接,得到全连接结果,并将全连接结果分别输入至水平特征位移层和竖直特征位移层中;水平特征位移层被配置为对全连接结果进行水平位移,得到水平位移结果,并将水平位移结果输入至第二信道全连接层中;竖直特征位移层被配置为对全连接结果进行竖直位移,得到竖直位移结果,并将竖直位移结果输入至第三信道全连接层中;第四信道全连接层被配置为对第二信道全连接层和第三信道全连接层的输出结果的和进行处理,得到特征位移单元的输出结果。Among them, in the action recognition system provided by the embodiment of the present invention, the feature displacement unit includes a first channel fully connected layer, a horizontal feature displacement layer, a second channel fully connected layer, a vertical feature displacement layer, and a third channel fully connected layer , the fourth channel fully connected layer, wherein the first channel fully connected layer is configured to fully connect the channels of the input data to obtain a fully connected result, and input the fully connected result to the horizontal feature displacement layer and the vertical feature displacement layer respectively layer; the horizontal feature displacement layer is configured to perform horizontal displacement on the fully connected result to obtain the horizontal displacement result, and input the horizontal displacement result to the second channel fully connected layer; the vertical feature displacement layer is configured to perform a horizontal displacement on the fully connected result Perform vertical displacement to obtain the vertical displacement result, and input the vertical displacement result into the third channel fully connected layer; the fourth channel fully connected layer is configured to perform the second channel fully connected layer and the third channel fully connected layer The sum of the output results is processed to obtain the output result of the characteristic displacement unit.
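As an illustration of the data flow just described, the following is a minimal sketch of the feature displacement unit. PyTorch, a batch×height×width×channel tensor layout, a shift distance of one position, and the use of torch.roll for the horizontal and vertical displacements are assumptions made for the sketch and are not specified in the passage above.

```python
import torch
import torch.nn as nn

class FeatureShiftUnit(nn.Module):
    """Sketch of the feature displacement unit: channel fully connected layer ->
    parallel horizontal / vertical shifts -> channel FCs -> sum -> channel FC."""

    def __init__(self, channels: int, shift: int = 1):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels)     # first channel fully connected layer
        self.fc_h = nn.Linear(channels, channels)    # second channel FC (after horizontal shift)
        self.fc_v = nn.Linear(channels, channels)    # third channel FC (after vertical shift)
        self.fc_out = nn.Linear(channels, channels)  # fourth channel FC (applied to the sum)
        self.shift = shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, channels); the layout is an assumption of this sketch
        y = self.fc1(x)
        y_h = torch.roll(y, shifts=self.shift, dims=2)  # horizontal displacement (width axis)
        y_v = torch.roll(y, shifts=self.shift, dims=1)  # vertical displacement (height axis)
        return self.fc_out(self.fc_h(y_h) + self.fc_v(y_v))
```

Shifting with torch.roll wraps features around the border; a real implementation might instead shift different channel groups in different directions or pad with zeros, which the passage above does not specify.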
其中,在本发明实施例提供的动作识别系统中,特征差异单元包括:输入层、最大池化层、第三全连接层、第二GELU函数层、第四全连接层、上采样层、第五全连接层、第三GELU函数层、第六全连接层、特征差异输出层,其中,输入层、最大池化层、第三全连接层、第二GELU函数层、第四全连接层、上采样层、特征差异输出层依次连接;输入层、第五全连接层、第三GELU函数层、第六全连接层、特征差异输出层依次连接;输入层被配置为将当前时刻分段对应的输入特征与上一时刻分段对应的输入特征作差,并将差值特征分别输入到最大池化层和第五全连接层中;特征差异输出层被配置为将上采样层和第六全连接层的输出结果进行求和,得到求和结果;将求和结果与上一时刻分段对应的输入特征进行逐点相乘,得到相乘结果;将相乘结果与上一时刻分段对应的输入特征进行相 加,得到特征差异单元的输出结果。Wherein, in the action recognition system provided by the embodiment of the present invention, the feature difference unit includes: an input layer, a maximum pooling layer, a third fully connected layer, a second GELU function layer, a fourth fully connected layer, an upsampling layer, a Five fully connected layers, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer, wherein, the input layer, the maximum pooling layer, the third fully connected layer, the second GELU function layer, the fourth fully connected layer, The upsampling layer and the feature difference output layer are connected in sequence; the input layer, the fifth fully connected layer, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer are connected in sequence; the input layer is configured to segment the current moment into corresponding The input features of the previous segment are compared with the input features corresponding to the previous segment, and the difference features are respectively input into the maximum pooling layer and the fifth fully connected layer; the feature difference output layer is configured to combine the upsampling layer and the sixth The output results of the fully connected layer are summed to obtain the summation result; the summation result is multiplied point by point with the input feature corresponding to the previous time segment to obtain the multiplication result; the multiplication result is divided into the previous time segment The corresponding input features are added to obtain the output result of the feature difference unit.
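The two branches and the output combination of the feature difference unit can be sketched as follows. PyTorch, the batch×height×width×channel layout, a pooling size of 2, and nearest-neighbour upsampling are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDifferenceUnit(nn.Module):
    """Sketch of the feature difference unit: the difference between the features of
    adjacent segments is processed by two branches and then modulates the previous
    segment's features."""

    def __init__(self, channels: int, pool_size: int = 2):
        super().__init__()
        self.pool = nn.MaxPool2d(pool_size)              # max pooling layer
        self.branch_pooled = nn.Sequential(              # third FC -> GELU -> fourth FC
            nn.Linear(channels, channels), nn.GELU(), nn.Linear(channels, channels))
        self.branch_full = nn.Sequential(                # fifth FC -> GELU -> sixth FC
            nn.Linear(channels, channels), nn.GELU(), nn.Linear(channels, channels))

    def forward(self, x_prev: torch.Tensor, x_cur: torch.Tensor) -> torch.Tensor:
        # x_prev, x_cur: (batch, height, width, channels) features of adjacent segments
        diff = x_cur - x_prev                            # input layer: difference of adjacent segments

        # pooled branch: spatial max pooling, channel FCs, upsampling back to full size
        d = self.pool(diff.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        d = self.branch_pooled(d).permute(0, 3, 1, 2)
        d = F.interpolate(d, size=tuple(diff.shape[1:3]), mode="nearest").permute(0, 2, 3, 1)

        # full-resolution branch: channel FCs on the raw difference
        f = self.branch_full(diff)

        # output layer: sum the branches, multiply point-wise with the previous
        # segment's features, then add the previous segment's features back
        return (d + f) * x_prev + x_prev
```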
其中,在本发明实施例提供的动作识别系统中,分类网络包括第一时间平均池化层、第二时间平均池化层、静态特征分类器、动态特征分类器、输出层,第一时间平均池化层被配置为对多个分段对应的静态分类特征进行时间平均池化,并将池化结果输入至静态特征分类器中;第二时间平均池化层被配置为对多个分段对应的动态分类特征进行时间平均池化,并将池化结果输入至动态特征分类器中;静态特征分类器被配置为根据静态分类特征得到第一分类结果;动态特征分类器被配置为根据动态分类特征得到第二分类结果;识别结果输出层被配置为将第一分类结果和第二分类结果的加权平均结果作为输出结果。Among them, in the action recognition system provided by the embodiment of the present invention, the classification network includes a first time average pooling layer, a second time average pooling layer, a static feature classifier, a dynamic feature classifier, an output layer, and a first time average pooling layer. The pooling layer is configured to perform temporal average pooling on the static classification features corresponding to multiple segments, and input the pooling result to the static feature classifier; the second temporal average pooling layer is configured to perform multiple segmental average pooling The corresponding dynamic classification features are time-average pooled, and the pooling results are input into the dynamic feature classifier; the static feature classifier is configured to obtain the first classification result according to the static classification feature; the dynamic feature classifier is configured to be based on the dynamic The classification feature obtains the second classification result; the recognition result output layer is configured to take the weighted average result of the first classification result and the second classification result as the output result.
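A minimal sketch of this classification network is given below. PyTorch, linear classifiers, and an equal weighting of the two classification results are assumptions; the passage only specifies temporal average pooling over the segments, two classifiers, and a weighted average of their outputs.

```python
import torch
import torch.nn as nn

class ClassificationNetwork(nn.Module):
    """Sketch of the classification network: temporal average pooling over the
    per-segment classification features, two classifiers, weighted-average output."""

    def __init__(self, feat_dim: int, num_classes: int, static_weight: float = 0.5):
        super().__init__()
        self.static_classifier = nn.Linear(feat_dim, num_classes)
        self.dynamic_classifier = nn.Linear(feat_dim, num_classes)
        self.static_weight = static_weight  # relative weight of the static result (assumed value)

    def forward(self, static_feats: torch.Tensor, dynamic_feats: torch.Tensor) -> torch.Tensor:
        # static_feats, dynamic_feats: (batch, num_segments, feat_dim)
        static_score = self.static_classifier(static_feats.mean(dim=1))     # first temporal average pooling
        dynamic_score = self.dynamic_classifier(dynamic_feats.mean(dim=1))  # second temporal average pooling
        w = self.static_weight
        return w * static_score + (1.0 - w) * dynamic_score                 # weighted-average recognition result
```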
本发明实施例第二方面提供了一种动作识别模型训练方法,包括:获取多个图像序列,图像序列中标注有行人动作类型;将各图像序列分为多段子序列,得到训练数据集;将训练数据集输入神经网络系统,对神经网络系统进行训练,直到损失函数的损失值满足损失条件,得到动作识别模型,神经网络系统为本发明实施例第一方面提供的动作识别系统。The second aspect of the embodiment of the present invention provides an action recognition model training method, including: acquiring multiple image sequences, in which the types of pedestrian actions are marked; dividing each image sequence into multiple sub-sequences to obtain a training data set; The training data set is input into the neural network system, and the neural network system is trained until the loss value of the loss function satisfies the loss condition, and an action recognition model is obtained. The neural network system is the action recognition system provided in the first aspect of the embodiment of the present invention.
其中,在本发明实施例提供的动作识别模型训练方法中,损失函数采用正交投影损失函数和交叉熵损失函数联合得到。Wherein, in the action recognition model training method provided in the embodiment of the present invention, the loss function is jointly obtained by using an orthogonal projection loss function and a cross-entropy loss function.
其中,在本发明实施例提供的动作识别模型训练方法中,损失函数通过所述正交投影损失函数与控制正交投影损失权重的超参数之积,与所述交叉熵损失函数相加得到。本发明实施例第三方面提供了一种动作识别方法,包括:获取目标对象的图像序列,将图像序列分为多段子序列;将子序列输入动作识别模型,生成动作识别结果,动作识别模型通过如本发明实施例第二方面提供的动作识别模型训练方法训练得到。Wherein, in the action recognition model training method provided in the embodiment of the present invention, the loss function is obtained by adding the product of the orthogonal projection loss function and a hyperparameter controlling the weight of the orthogonal projection loss to the cross-entropy loss function. The third aspect of the embodiment of the present invention provides an action recognition method, including: acquiring an image sequence of a target object, dividing the image sequence into multiple sub-sequences; inputting the sub-sequences into an action recognition model to generate an action recognition result, and the action recognition model passes It is obtained through training by the action recognition model training method provided in the second aspect of the embodiment of the present invention.
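Written out, the training loss described above takes the following form, where L_CE denotes the cross-entropy loss, L_OPL the orthogonal projection loss, and γ is an illustrative symbol (not from the original) for the hyperparameter controlling the weight of the orthogonal projection loss:

L_{total} = L_{CE} + \gamma \cdot L_{OPL}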
其中,在本发明实施例提供的动作识别方法中,在将图像序列分为多段子序列的步骤之后,将子序列输入动作识别模型的步骤之前,方法还包括:对图像序列中的图像进行等比例缩放,得到缩放图像,缩放图像的短边大小位于预设区间内;对缩放图像进行随机剪裁,得到裁剪图像,裁剪图像的大小满足预设条件;将裁剪图像作为子序列中的图像,执行将子序列输入动作识别模型的步骤。Wherein, in the action recognition method provided in the embodiment of the present invention, after the step of dividing the image sequence into multiple sub-sequences, and before the step of inputting the sub-sequences into the action recognition model, the method further includes: performing, etc. Proportionally zoom to get a zoomed image, the size of the short side of the zoomed image is within the preset range; randomly crop the zoomed image to get a cropped image, the size of the cropped image meets the preset conditions; use the cropped image as an image in the subsequence, execute The step of feeding a subsequence into an action recognition model.
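The scaling-and-cropping step described above can be sketched as follows. The use of PIL, the short-side interval (256, 320), and the 224×224 crop size are illustrative assumptions; the passage only states that the short side must fall within a preset interval and that the cropped image must satisfy a preset size condition.

```python
import random
from PIL import Image

def preprocess_frame(img: Image.Image,
                     short_side_range=(256, 320),
                     crop_size=(224, 224)) -> Image.Image:
    """Sketch: proportional scaling so the short side falls in a preset interval,
    followed by a random crop of a preset size (all numbers are assumptions)."""
    # proportional scaling: pick a target short side inside the preset interval
    target_short = random.randint(*short_side_range)
    w, h = img.size
    scale = target_short / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)

    # random crop of the preset size
    cw, ch = crop_size
    x0 = random.randint(0, img.width - cw)
    y0 = random.randint(0, img.height - ch)
    return img.crop((x0, y0, x0 + cw, y0 + ch))
```

In practice the same scale and crop window would normally be applied to every frame of a subsequence so that the frames stay spatially aligned; the sketch handles a single frame for brevity.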
本发明实施例第四方面提供了一种动作识别模型训练装置,包括:图像获取模块,被配置为获取多个图像序列,图像序列中标注有行人动作类型;训练数据获取模块,被配置为将各图像序列分为多段子序列,得到训练数据集;模型训练模块,被配置为将训练数据集输入神经网络系统,对神经网络系统进行训练,得到动作识别模型,神经网络系统为本发明实施例第一方面提供的动作识别系统。The fourth aspect of the embodiment of the present invention provides an action recognition model training device, including: an image acquisition module configured to acquire a plurality of image sequences in which pedestrian action types are marked; a training data acquisition module configured to Each image sequence is divided into multiple sub-sequences to obtain a training data set; the model training module is configured to input the training data set into a neural network system, train the neural network system, and obtain an action recognition model, and the neural network system is an embodiment of the present invention The action recognition system provided in the first aspect.
本发明实施例第五方面提供了一种动作识别装置,包括:图像采集模块,被配置为获取目标对象的图像序列,将图像序列分为多段子序列;动作识别模块,被配置为将子序列输入动作识别模型,生成动作识别结果,动作识别模型通过如本发明实施例第二方面提供的动作识别模型训练方法训练得到。The fifth aspect of the embodiment of the present invention provides an action recognition device, including: an image acquisition module configured to acquire an image sequence of a target object, and divide the image sequence into multiple sub-sequences; an action recognition module configured to divide the sub-sequence An action recognition model is input to generate an action recognition result, and the action recognition model is trained by the action recognition model training method provided in the second aspect of the embodiment of the present invention.
本发明实施例第六方面提供了一种计算机设备,包括:至少一个处理器;以及与至少一个处理器通信连接的存储器;其中,存储器存储有可被至少一个处理器执行的指令,指令被至少一个处理器执行,从而执行如本发明实施例第一方面提供的动作识别系统,或,执行如本发明实施例第二方面提供的动作识别模型训练方法,或,执行如本发明实施例第三方面提供的动作识别方法。The sixth aspect of the embodiment of the present invention provides a computer device, including: at least one processor; and a memory connected to the at least one processor in communication; wherein, the memory stores instructions that can be executed by the at least one processor, and the instructions are executed by at least one processor. A processor executes to execute the action recognition system provided in the first aspect of the embodiment of the present invention, or executes the action recognition model training method provided in the second aspect of the embodiment of the present invention, or executes the action recognition model training method provided in the third aspect of the embodiment of the present invention. The action recognition method provided by the aspect.
本发明实施例第七方面提供了一种计算机可读存储介质,计算机可读存储介质存储有计算机指令,计算机指令被配置为使计算机执行如本发明实施例第一方面提供的动作识别系统,或,执行如本发明实施例第二方面提供的动作识别模型训练方法,或,执行如本发明实施例第三方面提供的动作识别方法。The seventh aspect of the embodiment of the present invention provides a computer-readable storage medium, the computer-readable storage medium stores computer instructions, and the computer instructions are configured to cause the computer to execute the action recognition system provided by the first aspect of the embodiment of the present invention, or Execute the action recognition model training method provided in the second aspect of the embodiment of the present invention, or execute the action recognition method provided in the third aspect of the embodiment of the present invention.
本发明实施例技术方案,具有如下优点:The technical scheme of the embodiment of the present invention has the following advantages:
1.本发明实施例提供的动作识别系统,包括信息分离网络、静态特征网络、动态特征网络、分类网络,首先通过信息分离网络分离图像中的动态特征图和静态特征图,然后分 别将动态特征图和静态特征图输入至动态特征网络以及静态特征网络中,通过对动态特征图进行单独分析,能够捕捉到视频的短期时间信息,通过对静态特征图进行单独分析,能够识别视频中的静态场景,静态特征网络和动态特征网络对特征图进行位移操作,并计算各分段对应的特征图之间的差异特征,得到分类特征,通过对特征图进行位移操作能够以较少的运算量使网络获取空间局部信息,保证了网络的运行速度,通过计算特征图之间的差异特征,能够捕捉视频中的长期时间关系,使得网络具有时间建模的能力,从而保证网络动作识别的精度,由此可见,通过本发明实施例提供的动作识别系统能够采用更少的数据训练得到动作识别模型,并且,通过动作识别系统训练得到的动作识别模型能够实现对动作的精准识别。1. The action recognition system provided by the embodiment of the present invention includes an information separation network, a static feature network, a dynamic feature network, and a classification network. First, the dynamic feature map and the static feature map in the image are separated through the information separation network, and then the dynamic feature The graph and the static feature map are input into the dynamic feature network and the static feature network. By analyzing the dynamic feature map separately, the short-term time information of the video can be captured, and the static scene in the video can be identified by separately analyzing the static feature map. , the static feature network and the dynamic feature network perform displacement operations on the feature maps, and calculate the difference features between the feature maps corresponding to each segment, and obtain classification features. By performing displacement operations on the feature maps, the network can be made with less computation. Obtaining spatial local information ensures the speed of network operation. By calculating the difference between feature maps, the long-term time relationship in the video can be captured, so that the network has the ability of time modeling, thereby ensuring the accuracy of network action recognition. It can be seen that the action recognition system provided by the embodiment of the present invention can use less data for training to obtain an action recognition model, and the action recognition model trained by the action recognition system can realize accurate recognition of actions.
2.本发明实施例提供的动作识别模型训练方法及装置,在获取到训练数据集后,将训练数据集输入至本发明实施例第一方面提供的动作识别系统中,对动作识别系统进行训练得到动作识别模型,本发明实施例第一方面提供的动作识别系统包括信息分离网络、静态特征网络、动态特征网络、分类网络,动作识别系统首先通过信息分离网络分离图像中的动态特征图和静态特征图,然后分别将动态特征图和静态特征图输入至动态特征网络以及静态特征网络中,通过对动态特征图进行单独分析,能够捕捉到视频的短期时间信息,通过对静态特征图进行单独分析,能够识别视频中的静态场景,静态特征网络和动态特征网络对特征图进行位移操作,并计算各分段对应的特征图之间的差异特征,得到分类特征,通过对特征图进行位移操作能够以较少的运算量使网络获取空间局部信息,保证了网络的运行速度,通过计算特征图之间的差异特征,能够捕捉视频中的长期时间关系,使得网络具有时间建模的能力,从而保证网络动作识别的精度,由此可见,通过本发明实施例提供的动作识别模型训练方法能够采用更少的数据训练得到动作识别模型,并且,通过本发明实施例提供的动作识别训练方法及装置训练得到的动作识别模型能够实现对动作的精准识别。2. The motion recognition model training method and device provided in the embodiments of the present invention, after obtaining the training data set, input the training data set into the motion recognition system provided in the first aspect of the embodiment of the present invention, and train the motion recognition system The action recognition model is obtained. The action recognition system provided by the first aspect of the embodiment of the present invention includes an information separation network, a static feature network, a dynamic feature network, and a classification network. The action recognition system first separates the dynamic feature map and the static feature map in the image through the information separation network. The feature map, and then input the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively. By analyzing the dynamic feature map separately, the short-term time information of the video can be captured. By analyzing the static feature map separately , can identify the static scene in the video, the static feature network and the dynamic feature network perform displacement operations on the feature maps, and calculate the difference between the feature maps corresponding to each segment, and obtain classification features. By performing displacement operations on the feature maps, it can The network acquires spatial local information with a small amount of calculation, which ensures the running speed of the network. By calculating the difference between the feature maps, it can capture the long-term time relationship in the video, so that the network has the ability of time modeling, thus ensuring The accuracy of network action recognition, it can be seen that the action recognition model training method provided by the embodiment of the present invention can use less data training to obtain the action recognition model, and, through the action recognition training method and device training provided by the embodiment of the present invention The obtained action recognition model can realize accurate recognition of actions.
3.本发明实施例提供的动作识别方法及装置,在获取到目标对象的图像序列后将图像序列分为多段子序列,并将子序列输入通过本发明实施例第二方面提供的动作识别模型训练方法训练得到的动作识别模型中,本发明实施例第二方面提供的动作识别模型训练方法通过对动作识别系统进行训练得到动作识别模型,动作识别系统首先通过信息分离网络分离图像中的动态特征图和静态特征图,然后分别将动态特征图和静态特征图输入至动态特征网络以及静态特征网络中,通过对动态特征图进行单独分析,能够捕捉到视频的短期时间信息,通过对静态特征图进行单独分析,能够识别视频中的静态场景,静态特征网络和动态特征网络对特征图进行位移操作,并计算各分段对应的特征图之间的差异特征,得到分类特征,通过对特征图进行位移操作能够以较少的运算量使网络获取空间局部信息,保证了网络的运行速度,通过计算特征图之间的差异特征,能够捕捉视频中的长期时间关系,使得网络具有时间建模的能力,从而保证网络动作识别的精度,由此可见,通过本发明实施例第二方面提供的动作识别训练方法训练得到的动作识别模型能够实现对动作的精准识别,因此,通过实施本发明实施例能够实现对动作的精准识别。3. The action recognition method and device provided by the embodiments of the present invention divide the image sequence into multiple subsequences after acquiring the image sequence of the target object, and input the subsequences into the action recognition model provided by the second aspect of the embodiment of the present invention Among the action recognition models obtained through training by the training method, the action recognition model training method provided by the second aspect of the embodiment of the present invention obtains the action recognition model by training the action recognition system. The action recognition system first separates the dynamic features in the image through the information separation network and static feature maps, and then input the dynamic feature maps and static feature maps into the dynamic feature network and the static feature network respectively. By analyzing the dynamic feature maps separately, the short-term time information of the video can be captured. By analyzing the static feature maps A separate analysis can identify the static scene in the video. The static feature network and the dynamic feature network perform displacement operations on the feature map, and calculate the difference between the feature maps corresponding to each segment to obtain classification features. The displacement operation can enable the network to obtain spatial local information with a small amount of calculation, ensuring the running speed of the network. By calculating the difference between feature maps, it can capture the long-term time relationship in the video, so that the network has the ability of time modeling , so as to ensure the accuracy of network action recognition. It can be seen that the action recognition model trained by the action recognition training method provided by the second aspect of the embodiment of the present invention can realize accurate recognition of actions. Therefore, by implementing the embodiment of the present invention, it can Accurate recognition of actions is achieved.
附图说明Description of drawings
为了更清楚地说明本发明实施方式或现有技术中的技术方案,下面将对实施方式或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施方式,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are For some implementations of the present invention, those skilled in the art can also obtain other drawings based on these drawings without making creative efforts.
图1为本发明实施例中动作识别系统的一个示例的原理框图;Fig. 1 is a functional block diagram of an example of an action recognition system in an embodiment of the present invention;
图2为本发明实施例中静态特征网络,和/或,动态特征网络的一个示例的原理框图;FIG. 2 is a functional block diagram of an example of a static feature network and/or a dynamic feature network in an embodiment of the present invention;
图3为本发明实施例中特征差异与特征位移子模块的一个示例的原理框图;Fig. 3 is a functional block diagram of an example of the feature difference and feature displacement sub-module in the embodiment of the present invention;
图4为本发明实施例中特征位移单元的一个示例的原理框图;Fig. 4 is a functional block diagram of an example of a feature displacement unit in an embodiment of the present invention;
图5为本发明实施例中特征差异单元的一个示例的原理框图;FIG. 5 is a functional block diagram of an example of a feature difference unit in an embodiment of the present invention;
图6为本发明实施例中动作识别模型训练方法的一个示例的流程图;Fig. 6 is a flowchart of an example of an action recognition model training method in an embodiment of the present invention;
图7为本发明实施例中动作识别方法的一个示例的流程图;FIG. 7 is a flowchart of an example of an action recognition method in an embodiment of the present invention;
图8为本发明实施例中动作识别模型训练装置的一个示例的原理框图;Fig. 8 is a functional block diagram of an example of an action recognition model training device in an embodiment of the present invention;
图9为本发明实施例中动作识别装置的一个示例的原理框图;FIG. 9 is a functional block diagram of an example of an action recognition device in an embodiment of the present invention;
图10为本发明实施例中计算机设备的一个示例的原理框图。FIG. 10 is a functional block diagram of an example of computer equipment in an embodiment of the present invention.
具体实施方式Detailed ways
下面将结合附图对本发明的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。The technical solutions of the present invention will be clearly and completely described below in conjunction with the accompanying drawings. Apparently, the described embodiments are part of the embodiments of the present invention, but not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.
在本发明的描述中,需要说明的是,术语“第一”、“第二”、“第三”仅用于描述目的,而不能理解为指示或暗示相对重要性。In the description of the present invention, it should be noted that the terms "first", "second", and "third" are used for description purposes only, and should not be understood as indicating or implying relative importance.
此外,下面所描述的本发明不同实施方式中所涉及的技术特征只要彼此之间未构成冲突就可以相互结合。In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as there is no conflict with each other.
本发明实施例提供了一种动作识别系统,如图1所示,包括:信息分离网络11、静态特征网络12、动态特征网络13、分类网络14,信息分离网络11包括带通滤波模块111、静态特征提取模块112,An embodiment of the present invention provides an action recognition system, as shown in FIG. 1 , including: an information separation network 11, a static feature network 12, a dynamic feature network 13, and a classification network 14. The information separation network 11 includes a bandpass filter module 111, static feature extraction module 112,
带通滤波模块111被配置为根据获取的一个分段中的多帧连续图像提取动态特征图。The band-pass filtering module 111 is configured to extract a dynamic feature map according to the acquired multi-frame continuous images in a segment.
在一可选实施例中,在通过本发明实施例提供的动作识别系统进行动作识别前,需要先将采集的含有目标对象的视频数据转换为图像序列,然后并将图像序列平均分成N段{P 1,P 2,......,P N},最后将N段图像输入至动作识别系统中,在每一段中随机采样连续的K帧图像,共得到T=N×K帧图像,其中,一个分段中的图像的帧数可以根据实际需求进行设置,示例性地,N的值为8或16,K的值为3。 In an optional embodiment, before performing action recognition through the action recognition system provided by the embodiment of the present invention, it is necessary to first convert the collected video data containing the target object into an image sequence, and then divide the image sequence into N segments on average { P 1 , P 2 ,...,P N }, finally input N segments of images into the action recognition system, randomly sample consecutive K frames of images in each segment, and obtain a total of T=N×K frames of images , where the number of frames of images in a segment can be set according to actual needs, for example, the value of N is 8 or 16, and the value of K is 3.
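A minimal sketch of this sampling scheme is given below; the function name is illustrative and it is assumed that every segment contains at least K frames.

```python
import random

def sample_segments(frames: list, num_segments: int = 8, frames_per_segment: int = 3) -> list:
    """Sketch: split the frame list into N equal segments and randomly pick K
    consecutive frames from each, giving T = N * K sampled frames in total."""
    seg_len = len(frames) // num_segments
    sampled = []
    for i in range(num_segments):
        start, end = i * seg_len, (i + 1) * seg_len
        # random starting point of K consecutive frames inside this segment
        offset = random.randint(start, max(start, end - frames_per_segment))
        sampled.extend(frames[offset:offset + frames_per_segment])
    return sampled
```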
在一可选实施例中,在通过本发明实施例提供的动作识别系统进行动作识别时,带通滤波模块111依次基于各分段中的连续图像提取各分段对应的动态特征图。In an optional embodiment, when the action recognition system provided by the embodiment of the present invention performs action recognition, the bandpass filter module 111 sequentially extracts the dynamic feature map corresponding to each segment based on the continuous images in each segment.
静态特征提取模块112被配置为对一个分段中的多帧连续图像进行时间平均池化,得到特征图,并将特征图与动态特征图作差,得到静态特征图。The static feature extraction module 112 is configured to perform temporal average pooling on multiple frames of continuous images in a segment to obtain a feature map, and make a difference between the feature map and the dynamic feature map to obtain a static feature map.
在本发明实施例中,静态特征提取模块112先通过时间平均池化层对一个分段中的多帧连续图像进行时间平均池化,得到时间维度为1的特征图,将特征图与该分段对应的动态特征图作差,得到该分段的静态特征图。In the embodiment of the present invention, the static feature extraction module 112 first performs temporal average pooling on multiple frames of continuous images in a segment through the temporal average pooling layer to obtain a feature map with a time dimension of 1, and combines the feature map with the segment The dynamic feature map corresponding to the segment is subtracted to obtain the static feature map of the segment.
在一可选实施例中,在通过本发明实施例提供的动作识别系统进行动作识别时,静态特征提取模块112依次基于各分段中的连续图像提取各分段对应的静态特征图。In an optional embodiment, when the action recognition system provided by the embodiment of the present invention performs action recognition, the static feature extraction module 112 sequentially extracts the static feature maps corresponding to each segment based on the continuous images in each segment.
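The pooling-and-subtraction step performed by the static feature extraction module 112 can be sketched as follows, assuming PyTorch tensors with layout (K, C, H, W) for the K frames of one segment:

```python
import torch

def extract_static_features(segment_frames: torch.Tensor,
                            dynamic_map: torch.Tensor) -> torch.Tensor:
    """Sketch of the static feature extraction module: temporal average pooling over
    the K frames of one segment, then subtraction of the segment's dynamic feature map."""
    # segment_frames: (K, C, H, W) consecutive frames of one segment
    # dynamic_map:    (C, H, W)    dynamic feature map of the same segment
    pooled = segment_frames.mean(dim=0)  # temporal average pooling, time dimension reduced to 1
    return pooled - dynamic_map          # static feature map of the segment
```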
静态特征网络12被配置为对多个分段对应的静态特征图进行特征位移操作,以及计算各分段对应的静态特征图之间的差异特征,得到静态分类特征。The static feature network 12 is configured to perform feature displacement operations on the static feature maps corresponding to multiple segments, and calculate the difference features between the static feature maps corresponding to each segment to obtain static classification features.
在一可选实施例中,在通过本发明实施例提供的动作识别系统进行动作识别时,静态特征网络12分别对各分段对应的静态特征图进行特征位移操作,即,对任一分段对应的静态特征图进行特征位移操作时,不会受到其他分段对应的静态特征图的影响。但是静态特征网络12计算各分段对应的静态特征图之间的差异特征时,需要结合两个相邻分段对应的静态特征图计算差异特征。In an optional embodiment, when performing action recognition through the action recognition system provided by the embodiment of the present invention, the static feature network 12 performs feature displacement operations on the static feature maps corresponding to each segment, that is, for any segment When the corresponding static feature map performs feature displacement operation, it will not be affected by the static feature maps corresponding to other segments. However, when the static feature network 12 calculates the difference features between the static feature maps corresponding to each segment, it needs to combine the static feature maps corresponding to two adjacent segments to calculate the difference features.
动态特征网络13被配置为对多个分段对应的动态特征图进行特征位移操作,以及计算各分段对应的动态特征图值之间的差异特征,得到动态分类特征。The dynamic feature network 13 is configured to perform a feature displacement operation on the dynamic feature maps corresponding to multiple segments, and calculate the difference feature between values of the dynamic feature maps corresponding to each segment to obtain dynamic classification features.
在一可选实施例中,与静态特征网络12相同,动态特征网络13分别对各分段对应的动态特征图进行特征位移操作,即,对任一分段对应的动态特征图进行特征位移操作时, 不会受到其他分段对应的静态特征图的影响。但是动态特征网络13计算各分段对应的动态特征图之间的差异特征时,需要结合两个相邻分段对应的动态特征图计算差异特征。In an optional embodiment, the same as the static feature network 12, the dynamic feature network 13 performs a feature displacement operation on the dynamic feature map corresponding to each segment, that is, performs a feature displacement operation on the dynamic feature map corresponding to any segment , will not be affected by the static feature maps corresponding to other segments. However, when the dynamic feature network 13 calculates the difference features between the dynamic feature maps corresponding to each segment, it needs to combine the dynamic feature maps corresponding to two adjacent segments to calculate the difference features.
分类网络14被配置为根据静态分类特征和动态分类特征得到动作识别结果。The classification network 14 is configured to obtain action recognition results according to static classification features and dynamic classification features.
在一可选实施例中,分类网中包括分类器,通过分类器对静态分类特征和动态分类特征进行分析得到动作识别结果。In an optional embodiment, the classification network includes a classifier, and the static classification feature and the dynamic classification feature are analyzed by the classifier to obtain an action recognition result.
本发明实施例提供的动作识别系统,包括信息分离网络11、静态特征网络12、动态特征网络13、分类网络14,首先通过信息分离网络11分离图像中的动态特征图和静态特征图,然后分别将动态特征图和静态特征图输入至动态特征网络13以及静态特征网络12中,通过对动态特征图进行单独分析,能够捕捉到视频的短期时间信息,通过对静态特征图进行单独分析,能够识别视频中的静态场景,静态特征网络12和动态特征网络13对特征图进行位移操作,并计算各分段对应的特征图之间的差异特征,得到分类特征,通过对特征图进行位移操作能够以较少的运算量使网络获取空间局部信息,保证了网络的运行速度,通过计算特征图之间的差异特征,能够捕捉视频中的长期时间关系,使得网络具有时间建模的能力,从而保证网络动作识别的精度,由此可见,通过本发明实施例提供的动作识别系统能够采用更少的数据训练得到动作识别模型,并且,通过动作识别系统训练得到的动作识别模型能够实现对动作的精准识别。The action recognition system provided by the embodiment of the present invention includes an information separation network 11, a static feature network 12, a dynamic feature network 13, and a classification network 14. First, the dynamic feature map and the static feature map in the image are separated through the information separation network 11, and then respectively The dynamic feature map and the static feature map are input into the dynamic feature network 13 and the static feature network 12, and by analyzing the dynamic feature map separately, short-term time information of the video can be captured, and by separately analyzing the static feature map, it is possible to identify In the static scene in the video, the static feature network 12 and the dynamic feature network 13 perform displacement operations on the feature maps, and calculate the difference between the feature maps corresponding to each segment to obtain classification features. By performing displacement operations on the feature maps, the The less computational load enables the network to obtain spatial local information, which ensures the speed of the network. By calculating the difference between the feature maps, it can capture the long-term time relationship in the video, so that the network has the ability of time modeling, thus ensuring the network The accuracy of action recognition, it can be seen that the action recognition system provided by the embodiment of the present invention can use less data training to obtain the action recognition model, and the action recognition model trained by the action recognition system can realize accurate recognition of actions .
在一可选实施例中,带通滤波模块111包括空间卷积层和时间卷积层。In an optional embodiment, the bandpass filtering module 111 includes a spatial convolution layer and a temporal convolution layer.
对于一个分段中的每K帧连续图像,定义P(t,x,y)为像素值,其中x,y代表空间维度,而t则代表时间维度,P(t,x,y)就对应于第t帧的(x,y)处的像素值。那么该带通滤波器的输出F(t,x,y)为:For each K frame of continuous images in a segment, define P(t,x,y) as the pixel value, where x, y represent the spatial dimension, and t represents the time dimension, and P(t,x,y) corresponds to The pixel value at (x,y) at frame t. Then the output F(t,x,y) of the bandpass filter is:
F(t,x,y) = \frac{\partial^{2}}{\partial t^{2}}\left[\mathrm{LoG}_{\mu}(x,y) * P(t,x,y)\right] \quad (1)
其中，*代表卷积操作，LoG_μ(x,y)表示参数为μ的高斯拉普拉斯算子：Where * represents the convolution operation, and LoG_μ(x,y) represents the Laplacian-of-Gaussian operator with parameter μ:
\mathrm{LoG}_{\mu}(x,y) = -\frac{1}{\pi\mu^{4}}\left(1-\frac{x^{2}+y^{2}}{2\mu^{2}}\right)e^{-\frac{x^{2}+y^{2}}{2\mu^{2}}} \quad (2)
关于t的二阶导数采用有限差分数值h(i)近似,如下所示:The second derivative with respect to t is approximated with a finite difference value h(i) as follows:
\frac{\partial^{2} f(t)}{\partial t^{2}} \approx \sum_{i} h(i)\, f(t+i) \quad (3)
则公式(1)可表述为:Then formula (1) can be expressed as:
F(t,x,y) = \sum_{i=1}^{K} h(i)\cdot\left[\mathrm{LoG}_{\mu}(x,y) * P(i,x,y)\right] \quad (4)
其中,K表示一个分段中的图像帧数,“·”表示相乘。Among them, K represents the number of image frames in a segment, and “·” represents multiplication.
由公式(4)可以看出,该带通滤波器函数完全可微。为了提升鲁棒性,本发明实施例中使用两个连续的卷积层实现该带通滤波模块111,因此,本发明实施例中的带通滤波模块111是可训练的。采用卷积层实现后将公式(4)重新表述为如下形式:It can be seen from formula (4) that the bandpass filter function is completely differentiable. In order to improve robustness, the embodiment of the present invention uses two consecutive convolutional layers to implement the band-pass filter module 111 , therefore, the band-pass filter module 111 in the embodiment of the present invention is trainable. After implementing the convolutional layer, formula (4) is re-expressed as follows:
F(t,x,y) = \mathrm{Conv}_{t}^{s}\left(\mathrm{Conv}_{s}^{k\times k}(P)\right)(t,x,y) \quad (5)
其中，Conv_s^{k×k}是采用了卷积核大小为k×k的空间卷积层，并且采用了参数为μ的拉普拉斯算子进行了参数初始化，卷积核的参数值之和被规范化为1；Conv_t^{s}是采用了时间步长为s的时间卷积层，并且将卷积核值初始化为h(i)。Where Conv_s^{k×k} is a spatial convolution layer with a k×k convolution kernel, initialized with the Laplacian-of-Gaussian operator with parameter μ and normalized so that the kernel values sum to 1, and Conv_t^{s} is a temporal convolution layer with time step s whose kernel values are initialized with the finite-difference coefficients h(i).
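To make the two-convolution implementation concrete, the following is a minimal sketch. PyTorch, a 5×5 spatial kernel, a single input channel, and central-difference coefficients [1, -2, 1] for the temporal kernel are assumptions made for illustration and are not prescribed by the passage above.

```python
import torch
import torch.nn as nn

def log_kernel(size: int, mu: float) -> torch.Tensor:
    """Discrete Laplacian-of-Gaussian style kernel with parameter mu,
    normalized so that its values sum to 1 (an assumed discretization)."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    r2 = xx ** 2 + yy ** 2
    k = (1 - r2 / (2 * mu ** 2)) * torch.exp(-r2 / (2 * mu ** 2))
    return k / k.sum()

class BandpassFilterModule(nn.Module):
    """Sketch of the trainable band-pass filter: a spatial convolution initialized
    with a LoG kernel, followed by a temporal convolution initialized with
    finite-difference (second-derivative) coefficients."""

    def __init__(self, k: int = 5, mu: float = 1.0, frames_per_segment: int = 3):
        super().__init__()
        self.spatial = nn.Conv2d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.spatial.weight.data.copy_(log_kernel(k, mu).view(1, 1, k, k))
        # temporal kernel length equals the number of frames per segment (assumed = 3 here)
        self.temporal = nn.Conv1d(1, 1, kernel_size=frames_per_segment, bias=False)
        self.temporal.weight.data.copy_(torch.tensor([[[1.0, -2.0, 1.0]]]))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (K, H, W) grayscale frames of one segment (single channel for brevity)
        K, H, W = frames.shape
        s = self.spatial(frames.unsqueeze(1))                # (K, 1, H, W) spatially filtered
        s = s.squeeze(1).permute(1, 2, 0).reshape(-1, 1, K)  # (H*W, 1, K) per-pixel temporal sequences
        f = self.temporal(s)                                 # (H*W, 1, 1) second temporal difference
        return f.view(H, W)                                  # dynamic feature map of the segment
```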
在一可选实施例中,静态特征网络12与动态特征网络13可以具有相同的网络结构,也可以具有不同的网络结构,当静态特征网络12与动态特征网络13有相同的网络结构时,二者的网络参数值不同。In an optional embodiment, the static feature network 12 and the dynamic feature network 13 can have the same network structure, and can also have different network structures. When the static feature network 12 and the dynamic feature network 13 have the same network structure, the two The values of the network parameters are different.
在一可选实施例中,如图2所示,静态特征网络12和动态特征网络13中的至少之一包括:图像分割模块121、初始特征提取模块122、至少一个中间特征提取模块123。In an optional embodiment, as shown in FIG. 2 , at least one of the static feature network 12 and the dynamic feature network 13 includes: an image segmentation module 121 , an initial feature extraction module 122 , and at least one intermediate feature extraction module 123 .
图像分割模块121被配置为按照第一预设大小对输入特征图进行分割,得到第一特征向量。The image segmentation module 121 is configured to segment the input feature map according to a first preset size to obtain a first feature vector.
在一可选实施例中,静态特征网络12的输入数据为静态特征图,静态特征网络12中的图像分割模块121对静态特征图进行分割;动态特征网络13的输入数据为动态特征图,动态特征网络13中的图像分割模块121对动态特征图进行分割。In an optional embodiment, the input data of the static feature network 12 is a static feature map, and the image segmentation module 121 in the static feature network 12 segments the static feature map; the input data of the dynamic feature network 13 is a dynamic feature map, and the dynamic The image segmentation module 121 in the feature network 13 segments the dynamic feature map.
在一可选实施例中，对于一张大小为H×W×3的输入RGB图像，其中H、W表示图像的大小，3表示图像的通道数，图像分割模块121将图像按照4×4的块大小进行分割，并将分割得到的每个4×4的块合成一个向量，得到特征大小为(H/4)×(W/4)×48，其中(H/4)×(W/4)表示块的数量，48为通道数。In an optional embodiment, for an input RGB image of size H×W×3, where H and W represent the size of the image and 3 represents the number of channels, the image segmentation module 121 partitions the image into blocks of size 4×4 and combines each 4×4 block obtained by the partition into one vector, obtaining a feature of size (H/4)×(W/4)×48, where (H/4)×(W/4) indicates the number of blocks and 48 is the number of channels.
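A minimal sketch of this 4×4 block partition, assuming a PyTorch tensor with layout H×W×C, is given below:

```python
import torch

def partition_into_patches(image: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """Sketch of the image segmentation module: split an H x W x C image into
    non-overlapping patch x patch blocks and flatten each block into one vector."""
    H, W, C = image.shape                 # e.g. C = 3 for an RGB input
    x = image.view(H // patch, patch, W // patch, patch, C)
    x = x.permute(0, 2, 1, 3, 4).reshape(H // patch, W // patch, patch * patch * C)
    return x                              # (H/4, W/4, 48) when patch = 4 and C = 3
```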
初始特征提取模块122包括线性嵌入子模块1221和至少一个特征差异与特征位移子模块1222,线性嵌入子模块1221被配置为按照预设通道数对第一特征向量进行转换,得到第二特征向量;特征差异与特征位移子模块1222被配置为对第二特征向量进行特征位移操作,以及计算各分段对应的第二特征向量之间的差异特征,得到初始分类特征。The initial feature extraction module 122 includes a linear embedding submodule 1221 and at least one feature difference and feature displacement submodule 1222. The linear embedding submodule 1221 is configured to convert the first feature vector according to the preset number of channels to obtain the second feature vector; The feature difference and feature displacement sub-module 1222 is configured to perform a feature displacement operation on the second feature vector, and calculate the difference feature between the second feature vectors corresponding to each segment to obtain the initial classification feature.
在一可选实施例中，当图像分割模块121将输入特征图按照4×4的块大小进行分割，并得到大小为(H/4)×(W/4)×48的第一特征向量时，线性嵌入子模块1221将第一特征向量投影至(H/4)×(W/4)×C，其中，C代表通道数。In an optional embodiment, when the image segmentation module 121 partitions the input feature map into blocks of size 4×4 and obtains a first feature vector of size (H/4)×(W/4)×48, the linear embedding sub-module 1221 projects the first feature vector to (H/4)×(W/4)×C, where C represents the number of channels.
在一可选实施例中,初始特征提取模块122中包括两个连续的特征差异与特征位移子模块1222,在通过线性嵌入子模块1221对第一特征向量进行处理得到第二特征向量后,通过两个连续的特征差异与特征位移子模块1222对第二特征向量进行处理,得到初始分类特征。In an optional embodiment, the initial feature extraction module 122 includes two continuous feature difference and feature displacement sub-modules 1222, after processing the first feature vector through the linear embedding sub-module 1221 to obtain the second feature vector, through Two consecutive feature difference and feature displacement sub-modules 1222 process the second feature vector to obtain initial classification features.
中间特征提取模块123包括特征合并子模块1231和至少一个特征差异与特征位移子模块1222,特征合并子模块1231被配置为按照第二预设大小对初始分类特征进行合并,得到第三特征向量;特征差异与特征位移子模块1222被配置为对第三特征向量进行特征位移操作,以及计算各分段对应的第三特征向量之间的差异特征,得到分类特征。The intermediate feature extraction module 123 includes a feature merging submodule 1231 and at least one feature difference and feature displacement submodule 1222. The feature merging submodule 1231 is configured to merge the initial classification features according to the second preset size to obtain a third feature vector; The feature difference and feature displacement sub-module 1222 is configured to perform a feature displacement operation on the third feature vector, and calculate a difference feature between the third feature vectors corresponding to each segment to obtain a classification feature.
在一可选实施例中，中间特征提取模块123中的特征合并子模块1231将上一阶段得到的初始分类特征按照2×2的大小对块进行合并，合成1个向量，得到特征大小为(H/8)×(W/8)×4C，然后通过至少一个特征差异与特征位移子模块1222后输出。In an optional embodiment, the feature merging sub-module 1231 in the intermediate feature extraction module 123 merges the blocks of the initial classification features obtained in the previous stage in 2×2 groups, combining each group into one vector to obtain a feature of size (H/8)×(W/8)×4C, which is then passed through at least one feature difference and feature displacement sub-module 1222 and output.
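The 2×2 merging can be sketched in the same way as the block partition above; the H×W×C layout and the absence of any subsequent channel reduction are assumptions made for illustration.

```python
import torch

def merge_patches(x: torch.Tensor) -> torch.Tensor:
    """Sketch of the feature merging sub-module: group blocks in 2 x 2 neighbourhoods
    and concatenate each group into one vector, halving the spatial size and
    multiplying the channel dimension by four."""
    H, W, C = x.shape
    x = x.view(H // 2, 2, W // 2, 2, C)
    x = x.permute(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4 * C)
    return x
```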
在一可选实施例中,当静态特征网络12,和/或,动态特征网络13中包括有多个中间特征提取模块123时,不同中间特征提取模块123中特征差异与特征位移子模块1222的数量可以相同,也可以不同,示例性地,中间特征提取模块123中特征差异与特征位移子模块1222的数量可以为2、6等。In an optional embodiment, when the static feature network 12, and/or, the dynamic feature network 13 includes multiple intermediate feature extraction modules 123, the feature difference and feature displacement sub-modules 1222 in different intermediate feature extraction modules 123 The numbers may be the same or different. Exemplarily, the number of feature difference and feature displacement sub-modules 1222 in the intermediate feature extraction module 123 may be 2, 6, and so on.
In an optional embodiment, at least one of the static feature network 12 and the dynamic feature network 13 includes three intermediate feature extraction modules 123: a first intermediate feature extraction module 123, a second intermediate feature extraction module 123, and a third intermediate feature extraction module 123.
The image segmentation module 121, the initial feature extraction module 122, the first intermediate feature extraction module 123, the second intermediate feature extraction module 123, and the third intermediate feature extraction module 123 are connected in sequence. That is, in this embodiment of the present invention, the output data of the image segmentation module 121 is the input data of the initial feature extraction module 122, the output data of the initial feature extraction module 122 is the input data of the first intermediate feature extraction module 123, the output data of the first intermediate feature extraction module 123 is the input data of the second intermediate feature extraction module 123, and the output data of the second intermediate feature extraction module 123 is the input data of the third intermediate feature extraction module 123.
The initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123 contain the same number of feature difference and feature displacement sub-modules 1222.
The number of feature difference and feature displacement sub-modules 1222 in the second intermediate feature extraction module 123 is greater than the number in the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123.
In an optional embodiment, the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123 each contain 2 feature difference and feature displacement sub-modules 1222, and the second intermediate feature extraction module 123 contains 6.
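For orientation, the stage depths named in this example (2, 2, 6, 2 sub-modules) can be collected in a small, purely hypothetical configuration object; the key names are illustrative and do not appear in the patent.

STAGE_DEPTHS = {
    "initial_feature_extraction": 2,   # initial feature extraction module 122
    "intermediate_stage_1": 2,         # first intermediate feature extraction module 123
    "intermediate_stage_2": 6,         # second intermediate feature extraction module 123
    "intermediate_stage_3": 2,         # third intermediate feature extraction module 123
}

def total_blocks(depths=STAGE_DEPTHS):
    """Total number of feature difference and feature displacement sub-modules."""
    return sum(depths.values())        # 12 for the example configuration above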
In an optional embodiment, when the static feature network 12 and the dynamic feature network 13 each include three intermediate feature extraction modules 123, the final output feature of the static feature network 12 and the dynamic feature network 13 has size [Figure PCTCN2022114819-appb-000015].
In an optional embodiment, as shown in FIG. 3, the feature difference and feature displacement sub-module 1222 includes a first normalization layer, a feature displacement unit, a feature difference unit, a second normalization layer, a first fully connected layer, a first GELU function layer, and a second fully connected layer, connected in that order.
The input data of the second normalization layer is a first residual calculation result, which is computed from the input data of the first normalization layer and the output data of the feature difference unit.
The output data of the feature difference and feature displacement sub-module 1222 is a second residual calculation result, which is computed from the input data of the second normalization layer and the output data of the second fully connected layer.
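Read as pseudocode, the sub-module is a block with two residual connections. The sketch below is an assumed PyTorch-style rendering, not the patent's implementation: shift_unit and diff_unit stand for the feature displacement unit and feature difference unit sketched further below, and the hidden width of the two fully connected layers (mlp_ratio) is an assumption.

import torch.nn as nn

class FeatureDiffShiftBlock(nn.Module):
    """Sketch of one feature difference and feature displacement sub-module:
    norm -> shift unit -> difference unit (+ residual), then
    norm -> FC -> GELU -> FC (+ residual)."""
    def __init__(self, dim, shift_unit, diff_unit, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.shift = shift_unit           # feature displacement unit
        self.diff = diff_unit             # feature difference unit
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        # first residual: input of the first norm + output of the difference unit
        x = x + self.diff(self.shift(self.norm1(x)))
        # second residual: input of the second norm + output of the second FC layer
        return x + self.mlp(self.norm2(x))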
In an optional embodiment, as shown in FIG. 4, the feature displacement unit includes a first channel fully connected layer, a horizontal feature displacement layer, a second channel fully connected layer, a vertical feature displacement layer, a third channel fully connected layer, and a fourth channel fully connected layer. The horizontal feature displacement layer is connected to the second channel fully connected layer, and the vertical feature displacement layer is connected to the third channel fully connected layer; the structure formed by the horizontal feature displacement layer and the second channel fully connected layer and the structure formed by the vertical feature displacement layer and the third channel fully connected layer are arranged in parallel.
The first channel fully connected layer is configured to fully connect the channels of the input data to obtain a full connection result, and to input the full connection result into the horizontal feature displacement layer and the vertical feature displacement layer respectively.
The horizontal feature displacement layer is configured to horizontally displace the full connection result to obtain a horizontal displacement result, and to input the horizontal displacement result into the second channel fully connected layer.
In an optional embodiment, the full connection result has three dimensions: height, width, and channel. When the full connection result is horizontally displaced with 3 displacement groups and a displacement size of 1, the groups of channel feature maps are shifted along the horizontal direction according to [+1, 0, -1], and the vacated positions are filled with zeros. For example, the full connection result is divided into 3 groups of data along the channel dimension: the first group is shifted horizontally by one unit length in one direction and the vacated positions are filled with zeros, the second group remains unchanged, and the third group is shifted horizontally by one unit length in the opposite direction and the vacated positions are filled with zeros. If the number of displacement groups is 5 and the displacement size is 2, the groups of channel feature maps are shifted along the horizontal direction according to [+4, +2, 0, -2, -4], and the vacated positions are filled with zeros.
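The grouped, zero-padded shift can be sketched as follows. The (B, H, W, C) tensor layout and the way the channels are chunked into groups are assumptions made only for illustration.

import torch

def grouped_shift(x, offsets=(+1, 0, -1), dim=2):
    """Illustrative grouped shift: the channels of x (shape B, H, W, C) are split
    into len(offsets) groups; group g is shifted along `dim` (1 = height for a
    vertical shift, 2 = width for a horizontal shift) by offsets[g], and the
    vacated positions are filled with zeros."""
    groups = torch.chunk(x, len(offsets), dim=3)
    shifted = []
    for g, off in zip(groups, offsets):
        if off == 0:
            shifted.append(g)
            continue
        s = torch.roll(g, shifts=off, dims=dim)
        idx = [slice(None)] * 4
        idx[dim] = slice(0, off) if off > 0 else slice(off, None)
        s[tuple(idx)] = 0          # zero-fill the region exposed by the shift
        shifted.append(s)
    return torch.cat(shifted, dim=3)

# e.g. three groups shifted by [+1, 0, -1] along the width, as in the example above:
# y = grouped_shift(x, offsets=(+1, 0, -1), dim=2)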
The vertical feature displacement layer is configured to vertically displace the full connection result to obtain a vertical displacement result, and to input the vertical displacement result into the third channel fully connected layer.
Vertical displacement of the full connection result differs from horizontal displacement only in that the shift is applied along the vertical direction rather than the horizontal direction.
The fourth channel fully connected layer is configured to process the sum of the output results of the second channel fully connected layer and the third channel fully connected layer to obtain the output result of the feature displacement unit.
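Putting the pieces together, a hedged sketch of the feature displacement unit follows; it reuses the grouped_shift helper above, and all layer widths and the shift offsets are assumptions.

import torch.nn as nn

class ShiftUnit(nn.Module):
    """Sketch of the feature displacement unit: a channel FC feeds two parallel
    branches (horizontal shift -> channel FC, vertical shift -> channel FC);
    their sum passes through a final channel FC."""
    def __init__(self, dim, offsets=(+1, 0, -1)):
        super().__init__()
        self.offsets = offsets
        self.fc_in = nn.Linear(dim, dim)    # first channel fully connected layer
        self.fc_h = nn.Linear(dim, dim)     # second channel FC (after horizontal shift)
        self.fc_v = nn.Linear(dim, dim)     # third channel FC (after vertical shift)
        self.fc_out = nn.Linear(dim, dim)   # fourth channel FC (after the sum)

    def forward(self, x):                   # x: (B, H, W, C)
        x = self.fc_in(x)
        h = self.fc_h(grouped_shift(x, self.offsets, dim=2))  # horizontal branch
        v = self.fc_v(grouped_shift(x, self.offsets, dim=1))  # vertical branch
        return self.fc_out(h + v)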
In an optional embodiment, as shown in FIG. 5, the feature difference unit includes an input layer, a max pooling layer, a third fully connected layer, a second GELU function layer, a fourth fully connected layer, an upsampling layer, a fifth fully connected layer, a third GELU function layer, a sixth fully connected layer, and a feature difference output layer.
The input layer, the max pooling layer, the third fully connected layer, the second GELU function layer, the fourth fully connected layer, the upsampling layer, and the feature difference output layer are connected in sequence.
The input layer, the fifth fully connected layer, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer are connected in sequence.
The input layer is configured to compute the difference between the input feature corresponding to the segment at the current moment and the input feature corresponding to the segment at the previous moment, and to input the difference feature into the max pooling layer and the fifth fully connected layer respectively.
The feature difference output layer is configured to sum the output results of the upsampling layer and the sixth fully connected layer to obtain a summation result; to multiply the summation result point by point with the input feature corresponding to the segment at the previous moment to obtain a multiplication result; and to add the multiplication result to the input feature corresponding to the segment at the previous moment to obtain the output result of the feature difference unit.
Exemplarily, for input features [F_1, F_2, ..., F_N], where [Figure PCTCN2022114819-appb-000016] denotes the feature at time t, the difference between the features at time t and time t+1 is computed and then split into two paths: one path is downsampled by the max pooling layer, passes through two fully connected layers with a GELU function between them, and is then upsampled by the upsampling layer; the other path passes directly through two fully connected layers with a GELU function between them. Finally, the feature maps output by the two paths are summed, multiplied point by point with the original input feature, and then added to it to obtain the final output.
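A possible PyTorch-style reading of this feature difference unit is sketched below; the pooling size, the reduction ratio inside the fully connected layers, and nearest-neighbour upsampling are assumptions not specified in the text.

import torch.nn as nn
import torch.nn.functional as F

class FeatureDiffUnit(nn.Module):
    """Sketch of the feature difference unit: the difference between the features
    of two adjacent segments is processed by two parallel paths (one with max-pool
    downsampling and upsampling, one without); the summed result gates the original
    feature by point-wise multiplication before a final residual addition."""
    def __init__(self, dim, reduction=4, pool=2):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=pool)
        self.path_pooled = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.GELU(), nn.Linear(dim // reduction, dim))
        self.path_direct = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.GELU(), nn.Linear(dim // reduction, dim))

    def forward(self, feat_t, feat_next):       # (B, H, W, C) features of segments t, t+1
        diff = feat_next - feat_t
        B, H, W, C = diff.shape
        # path 1: max-pool downsample -> FC / GELU / FC -> upsample back to (H, W)
        d = self.pool(diff.permute(0, 3, 1, 2))                  # (B, C, H/2, W/2)
        d = self.path_pooled(d.permute(0, 2, 3, 1))              # (B, H/2, W/2, C)
        d = F.interpolate(d.permute(0, 3, 1, 2), size=(H, W),
                          mode="nearest").permute(0, 2, 3, 1)
        # path 2: FC / GELU / FC applied directly to the difference
        g = self.path_direct(diff)
        attn = d + g
        return feat_t + feat_t * attn           # gate the original feature, then add it back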
In an optional embodiment, as shown in FIG. 1, in the action recognition system provided by this embodiment of the present invention, the classification network 14 includes a first temporal average pooling layer 141, a second temporal average pooling layer 143, a static feature classifier 142, a dynamic feature classifier 144, and an output layer 145.
The first temporal average pooling layer 141 is configured to perform temporal average pooling on the static classification features corresponding to the multiple segments, and to input the pooling result into the static feature classifier 142.
The second temporal average pooling layer 143 is configured to perform temporal average pooling on the dynamic classification features corresponding to the multiple segments, and to input the pooling result into the dynamic feature classifier 144.
In an optional embodiment, when action recognition is performed by the action recognition system provided by this embodiment of the present invention, if the image sequence corresponding to a video is divided into N segments, the information separation network 11, the static feature network 12, and the dynamic feature network 13 process the N segments in turn to obtain N static classification features and N dynamic classification features. The first temporal average pooling layer 141 performs temporal average pooling on the N static classification features to obtain one static classification feature with a temporal attribute, and the second temporal average pooling layer 143 performs temporal average pooling on the N dynamic classification features to obtain one dynamic classification feature with a temporal attribute. The static and dynamic classification features with temporal attributes enable more accurate recognition of actions.
The static feature classifier 142 is configured to obtain a first classification result according to the static classification feature.
The dynamic feature classifier 144 is configured to obtain a second classification result according to the dynamic classification feature.
The recognition result output layer 145 is configured to take the weighted average of the first classification result and the second classification result as the output result.
In an optional embodiment, the static feature classifier 142 and the dynamic feature classifier 144 are Softmax classifiers.
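The classification network can be summarised in a short sketch. The equal weighting of the two streams and the feature dimension are assumptions, since the text only states that a weighted average of the two classification results is taken.

import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch of the classification network: temporal average pooling over the N
    segment features, one Softmax classifier per stream, and a weighted average
    of the two classification results as the output."""
    def __init__(self, feat_dim, num_classes, static_weight=0.5):
        super().__init__()
        self.static_fc = nn.Linear(feat_dim, num_classes)
        self.dynamic_fc = nn.Linear(feat_dim, num_classes)
        self.w = static_weight

    def forward(self, static_feats, dynamic_feats):   # each: (B, N, feat_dim)
        s = self.static_fc(static_feats.mean(dim=1)).softmax(dim=-1)    # temporal avg pool
        d = self.dynamic_fc(dynamic_feats.mean(dim=1)).softmax(dim=-1)
        return self.w * s + (1.0 - self.w) * d        # weighted average of the two results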
An embodiment of the present invention provides an action recognition model training method, as shown in FIG. 6, including:
Step S21: acquiring multiple image sequences, in which pedestrian action types are annotated.
Step S22: dividing each image sequence into multiple sub-sequences to obtain a training data set.
Step S23: inputting the training data set into a neural network system and training the neural network system until the loss value of the loss function satisfies a loss condition, to obtain the action recognition model, where the neural network system is the action recognition system provided in any of the above embodiments; for details of the action recognition system, refer to the above embodiments.
In an optional embodiment, the initialized action recognition system is trained to obtain the action recognition model. In the action recognition system, the band-pass filter module 111 is initialized with a Laplacian-of-Gaussian operator, and the feature difference and feature displacement networks are initialized with a pre-trained model obtained by pre-training on ImageNet or another large data set.
In an optional embodiment, the action recognition system may be fully trained on a large-scale data set, or fine-tuned based on a pre-trained model.
In the action recognition model training method provided by this embodiment of the present invention, after the training data set is obtained, it is input into the action recognition system provided in the above embodiments, and the action recognition system is trained to obtain the action recognition model. The action recognition system provided in the above embodiments includes an information separation network, a static feature network, a dynamic feature network, and a classification network. The action recognition system first separates the dynamic feature map and the static feature map of an image through the information separation network, and then inputs the dynamic feature map into the dynamic feature network and the static feature map into the static feature network. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video. The static feature network and the dynamic feature network perform displacement operations on the feature maps and compute the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operation lets the network acquire spatially local information with little computation, ensuring the running speed of the network, while computing the difference features between feature maps captures long-term temporal relationships in the video, giving the network temporal modeling capability and ensuring the accuracy of action recognition. It can thus be seen that the action recognition model training method provided by this embodiment of the present invention can train an action recognition model with less data, and the action recognition model obtained in this way can recognize actions accurately.
In an optional embodiment, the loss function is obtained by combining an orthogonal projection loss function and a cross-entropy loss function.
In an optional embodiment, the following are defined for a batch B:
[Figure PCTCN2022114819-appb-000017]
[Figure PCTCN2022114819-appb-000018]
where y_i and y_j denote the i-th and j-th ground-truth labels, with i, j ∈ B; F_i and F_j denote the features obtained by temporal average pooling of the network output features of the N segments of the i-th and j-th videos; ||·||_2 denotes l_2 regularization; and "·" denotes the vector dot product. s and d denote the results of the cosine similarity operation on the intermediate features when the ground-truth classes are the same and different, respectively. The final orthogonal projection loss function L_opl is:
L_opl = (1 - s) + α × |d|,  (4)
where α is a hyperparameter controlling the weight, and |·| denotes the absolute value operation.
In an optional embodiment, the loss function is obtained by adding the product of the orthogonal projection loss function and a hyperparameter controlling the orthogonal projection loss weight to the cross-entropy loss function. The orthogonal projection loss function is combined with the cross-entropy loss L_ce to obtain the final loss function L:
L = L_ce + β × L_opl,  (5)
where β is a hyperparameter controlling the weight of the orthogonal projection loss.
In this embodiment of the present invention, the orthogonal projection loss function is used together with the cross-entropy loss to obtain the final loss function. Introducing the orthogonal projection loss orthogonalizes the intermediate-layer features, achieving inter-class separation and intra-class clustering, so that the trained action recognition model can recognize actions more accurately.
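Since equations (2) and (3) are only available here as images, the sketch below reconstructs the combined loss from the surrounding description: l_2-normalised, temporally pooled features; cosine similarities s and d for same-class and different-class pairs; and L = L_ce + β·L_opl with L_opl = (1 - s) + α·|d|. The exact pair averaging is an assumption.

import torch
import torch.nn.functional as F

def opl_ce_loss(features, logits, labels, alpha=1.0, beta=1.0):
    """Hedged sketch of the combined loss L = L_ce + beta * L_opl."""
    f = F.normalize(features, p=2, dim=1)          # l2 normalisation of the pooled features
    sim = f @ f.t()                                # pairwise dot products (cosine similarity)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    s = sim[same & ~eye].mean() if (same & ~eye).any() else sim.new_tensor(1.0)
    d = sim[~same].mean() if (~same).any() else sim.new_tensor(0.0)
    l_opl = (1.0 - s) + alpha * d.abs()            # orthogonal projection loss, eq. (4)
    l_ce = F.cross_entropy(logits, labels)
    return l_ce + beta * l_opl                     # final loss, eq. (5)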
In an optional embodiment, the above step S21 includes:
First, the images in the image sequence are proportionally scaled to obtain scaled images whose short-side size lies within a preset interval. Exemplarily, the preset interval may be [256, 320].
Then, the scaled images are randomly cropped to obtain cropped images whose size satisfies a preset condition. In an optional embodiment, the size of the cropped images is 224×224.
Finally, the image sequence formed by the cropped images is divided into multiple sub-sequences to obtain the training data set.
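A hedged sketch of this preprocessing follows; the interval [256, 320] and the 224×224 crop are taken from the text, while the use of PIL and the helper name are illustrative assumptions.

import random
from PIL import Image

def preprocess_frame(img: Image.Image, short_side_range=(256, 320), crop=224):
    """Proportionally rescale so the short side falls in [256, 320],
    then take a random 224x224 crop."""
    target = random.randint(*short_side_range)
    w, h = img.size
    scale = target / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = random.randint(0, w - crop), random.randint(0, h - crop)
    return img.crop((left, top, left + crop, top + crop))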
An embodiment of the present invention provides an action recognition method, as shown in FIG. 7, including:
Step S31: acquiring an image sequence of a target object and dividing the image sequence into multiple sub-sequences.
Step S32: inputting the sub-sequences into an action recognition model to generate an action recognition result, where the action recognition model is trained by the action recognition model training method provided in the above embodiments; for details of the action recognition model, refer to the description in the above embodiments, which is not repeated here.
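As a usage illustration only, the recognition flow might look as follows; the number of segments, the number of frames sampled per segment, and the model's input layout are assumptions, not the patent's prescribed interface.

import torch

@torch.no_grad()
def recognize_action(model, frames, num_segments=8, frames_per_segment=5):
    """Minimal sketch: split the frame sequence into num_segments sub-sequences,
    take a few consecutive frames from each, and feed them to a trained model."""
    n = len(frames)
    seg_len = n // num_segments
    clips = []
    for s in range(num_segments):
        start = s * seg_len
        clip = frames[start:start + frames_per_segment]   # consecutive frames of one segment
        clips.append(torch.stack(clip))                   # (T, C, H, W)
    batch = torch.stack(clips).unsqueeze(0)               # (1, N, T, C, H, W)
    scores = model(batch)
    return scores.argmax(dim=-1)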
In the action recognition method provided by this embodiment of the present invention, after the image sequence of the target object is acquired, it is divided into multiple sub-sequences, and the sub-sequences are input into the action recognition model trained by the action recognition model training method provided in the above embodiments. That training method obtains the action recognition model by training the action recognition system, which first separates the dynamic feature map and the static feature map of an image through the information separation network and then inputs them into the dynamic feature network and the static feature network respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video. The static feature network and the dynamic feature network perform displacement operations on the feature maps and compute the difference features between the feature maps corresponding to the segments to obtain classification features; the displacement operation lets the network acquire spatially local information with little computation, ensuring the running speed of the network, while computing the difference features between feature maps captures long-term temporal relationships in the video, giving the network temporal modeling capability and ensuring the accuracy of action recognition. The action recognition model trained in this way can therefore recognize actions accurately, and implementing this embodiment of the present invention achieves accurate action recognition.
In an optional embodiment, in the action recognition method provided by this embodiment of the present invention, after the above step S31 and before step S32, the method further includes:
First, proportionally scaling the images in the image sequence to obtain scaled images whose short-side size lies within a preset interval; for details, refer to the description in the above embodiments, which is not repeated here.
Then, randomly cropping the scaled images to obtain cropped images whose size satisfies a preset condition; for details, refer to the description in the above embodiments, which is not repeated here.
Finally, using the cropped images as the images in the sub-sequences and performing the step of inputting the sub-sequences into the action recognition model; for details, refer to the description in the above embodiments, which is not repeated here.
In this embodiment of the present invention, the images are proportionally scaled so that their short-side size lies within the preset interval and are then randomly cropped to the input size accepted by the network, which achieves data augmentation. By analyzing the scaled and cropped data, the action recognition model can focus on the effective information in the images, improving analysis efficiency and the accuracy of the analysis results.
An embodiment of the present invention provides an action recognition model training apparatus, as shown in FIG. 8, including:
an image acquisition module 21, configured to acquire multiple image sequences in which pedestrian action types are annotated; for details, refer to the description of step S21 in the above embodiment, which is not repeated here;
a training data acquisition module 22, configured to divide each image sequence into multiple sub-sequences to obtain a training data set; for details, refer to the description of step S22 in the above embodiment, which is not repeated here;
a model training module 23, configured to input the training data set into a neural network system and train the neural network system to obtain the action recognition model, where the neural network system is the action recognition system provided in the above embodiments; for details, refer to the description of step S23 in the above embodiment, which is not repeated here.
In the action recognition model training apparatus provided by this embodiment of the present invention, after the training data set is obtained, it is input into the action recognition system provided in the above embodiments, and the action recognition system is trained to obtain the action recognition model. The action recognition system includes an information separation network, a static feature network, a dynamic feature network, and a classification network: the information separation network separates the dynamic feature map and the static feature map of an image, which are then input into the dynamic feature network and the static feature network respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video. The static feature network and the dynamic feature network perform displacement operations on the feature maps and compute the difference features between the feature maps corresponding to the segments to obtain classification features; the displacement operation lets the network acquire spatially local information with little computation, ensuring the running speed of the network, while computing the difference features captures long-term temporal relationships in the video, giving the network temporal modeling capability and ensuring the accuracy of action recognition. It can thus be seen that the action recognition model training apparatus provided by this embodiment of the present invention can train an action recognition model with less data, and the action recognition model obtained in this way can recognize actions accurately.
An embodiment of the present invention provides an action recognition apparatus, as shown in FIG. 9, including:
an image acquisition module 31, configured to acquire an image sequence of a target object and divide the image sequence into multiple sub-sequences; for details, refer to the description of step S31 in the above embodiment, which is not repeated here;
an action recognition module 32, configured to input the sub-sequences into an action recognition model to generate an action recognition result, where the action recognition model is trained by the action recognition model training method provided in the above embodiments; for details, refer to the description of step S32 in the above embodiment, which is not repeated here.
In the action recognition apparatus provided by this embodiment of the present invention, after the image sequence of the target object is acquired, it is divided into multiple sub-sequences, and the sub-sequences are input into the action recognition model trained by the action recognition model training method provided in the above embodiments. That training method obtains the action recognition model by training the action recognition system, which first separates the dynamic feature map and the static feature map of an image through the information separation network and then inputs them into the dynamic feature network and the static feature network respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video. The static feature network and the dynamic feature network perform displacement operations on the feature maps and compute the difference features between the feature maps corresponding to the segments to obtain classification features; the displacement operation lets the network acquire spatially local information with little computation, ensuring the running speed of the network, while computing the difference features captures long-term temporal relationships in the video, giving the network temporal modeling capability and ensuring the accuracy of action recognition. The action recognition model trained in this way can therefore recognize actions accurately, and implementing this embodiment of the present invention achieves accurate action recognition.
An embodiment of the present invention provides a computer device. As shown in FIG. 10, the computer device mainly includes one or more processors 41 and a memory 42; one processor 41 is taken as an example in FIG. 10.
The computer device may further include an input apparatus 43 and an output apparatus 44.
The processor 41, the memory 42, the input apparatus 43, and the output apparatus 44 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 10.
The processor 41 may be a central processing unit (CPU). The processor 41 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination of the above chips. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The memory 42 may include a program storage area and a data storage area, where the program storage area may store the operating system and the application programs required by at least one function, and the data storage area may store data created by the use of the action recognition system, the action recognition model training apparatus, or the action recognition apparatus. In addition, the memory 42 may include a high-speed random access memory, and may also include a non-transitory memory such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 42 may optionally include memories remotely located relative to the processor 41, and these remote memories may be connected through a network to the action recognition system, the action recognition model training apparatus, or the action recognition apparatus. The input apparatus 43 may receive calculation requests (or other digital or character information) input by a user and generate key signal inputs related to the action recognition system, the action recognition model training apparatus, or the action recognition apparatus. The output apparatus 44 may include a display device such as a display screen for outputting calculation results.
An embodiment of the present invention provides a computer-readable storage medium that stores computer instructions; the computer-executable instructions can execute the action recognition model training method or the action recognition method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the storage medium may also include a combination of the above kinds of memories. Apparently, the above embodiments are merely examples given for clear description and are not intended to limit the implementations. For those of ordinary skill in the art, other changes or variations in different forms can be made on the basis of the above description. It is neither necessary nor possible to exhaustively list all implementations here, and obvious changes or variations derived therefrom still fall within the protection scope of the present invention.
Industrial Applicability
An embodiment of the present invention provides an action recognition system including: a band-pass filter module configured to extract a dynamic feature map from multiple frames of consecutive images in a segment; a static feature extraction module configured to obtain a static feature map from the multiple frames of consecutive images in the segment; a static feature network configured to perform a feature displacement operation on the static feature maps and to compute the difference features between the static feature maps corresponding to the segments, obtaining static classification features; a dynamic feature network configured to perform a feature displacement operation on the dynamic feature maps and to compute the difference features between the dynamic feature maps corresponding to the segments, obtaining dynamic classification features; and a classification network configured to obtain an action recognition result according to the static classification features and the dynamic classification features. By implementing the present invention, an action recognition model can be trained with less data, and accurate recognition of actions can be achieved.

Claims (17)

  1. 一种动作识别系统,包括:信息分离网络、静态特征网络、动态特征网络、分类网络,所述信息分离网络包括带通滤波模块、静态特征提取模块,An action recognition system, comprising: an information separation network, a static feature network, a dynamic feature network, and a classification network, wherein the information separation network includes a bandpass filter module and a static feature extraction module,
    所述带通滤波模块被配置为根据获取的一个分段中的多帧连续图像提取动态特征图;The band-pass filtering module is configured to extract dynamic feature maps from multiple frames of continuous images obtained in a segment;
    所述静态特征提取模块被配置为对所述一个分段中的多帧连续图像进行时间平均池化,得到特征图,并将所述特征图与所述动态特征图作差,得到静态特征图;The static feature extraction module is configured to perform temporal average pooling on the multi-frame continuous images in the one segment to obtain a feature map, and make a difference between the feature map and the dynamic feature map to obtain a static feature map ;
    所述静态特征网络被配置为对多个分段对应的静态特征图进行特征位移操作,以及计算各分段对应的静态特征图之间的差异特征,得到静态分类特征;The static feature network is configured to perform a feature displacement operation on static feature maps corresponding to multiple segments, and calculate difference features between static feature maps corresponding to each segment to obtain static classification features;
    所述动态特征网络被配置为对多个分段对应的动态特征图进行特征位移操作,以及计算各分段对应的动态特征图之间的差异特征,得到动态分类特征;The dynamic feature network is configured to perform a feature displacement operation on dynamic feature maps corresponding to multiple segments, and calculate difference features between dynamic feature maps corresponding to each segment to obtain dynamic classification features;
    所述分类网络被配置为根据所述静态分类特征和所述动态分类特征得到动作识别结果。The classification network is configured to obtain an action recognition result according to the static classification feature and the dynamic classification feature.
  2. 根据权利要求1所述的动作识别系统,其中,The action recognition system according to claim 1, wherein,
    所述带通滤波模块包括空间卷积层和时间卷积层。The bandpass filtering module includes a spatial convolution layer and a temporal convolution layer.
  3. 根据权利要求1所述的动作识别系统,其中,所述静态特征网络和动态特征网络中的至少之一包括:图像分割模块、初始特征提取模块、至少一个中间特征提取模块,The action recognition system according to claim 1, wherein at least one of the static feature network and the dynamic feature network comprises: an image segmentation module, an initial feature extraction module, at least one intermediate feature extraction module,
    所述图像分割模块被配置为按照第一预设大小对输入特征图进行分割,得到第一特征向量;The image segmentation module is configured to segment the input feature map according to a first preset size to obtain a first feature vector;
    the initial feature extraction module includes a linear embedding sub-module and at least one feature difference and feature displacement sub-module, the linear embedding sub-module being configured to convert the first feature vector according to a preset number of channels to obtain a second feature vector; the feature difference and feature displacement sub-module being configured to perform a feature displacement operation on the second feature vector and to compute the difference features between the second feature vectors corresponding to the segments to obtain initial classification features;
    所述中间特征提取模块包括特征合并子模块和至少一个特征差异与特征位移子模块,所述特征合并子模块被配置为按照第二预设大小对所述初始分类特征进行合并,得到第三特征向量;所述特征差异与特征位移子模块被配置为对所述第三特征向量进行特征位移操作,以及计算各分段对应的第三特征向量之间的差异特征,得到分类特征。The intermediate feature extraction module includes a feature merging submodule and at least one feature difference and feature displacement submodule, and the feature merging submodule is configured to merge the initial classification features according to a second preset size to obtain a third feature vector; the feature difference and feature displacement sub-module is configured to perform a feature displacement operation on the third feature vector, and calculate a difference feature between the third feature vectors corresponding to each segment to obtain a classification feature.
  4. The action recognition system according to claim 3, wherein at least one of the static feature network and the dynamic feature network includes three intermediate feature extraction modules: a first intermediate feature extraction module, a second intermediate feature extraction module, and a third intermediate feature extraction module,
    所述图像分割模块、初始特征提取模块、第一中间特征提取模块、第二中间特征提取模块、第三中间特征提取模块依次连接;The image segmentation module, the initial feature extraction module, the first intermediate feature extraction module, the second intermediate feature extraction module, and the third intermediate feature extraction module are connected in sequence;
    the initial feature extraction module, the first intermediate feature extraction module, and the third intermediate feature extraction module contain the same number of feature difference and feature displacement sub-modules;
    the number of feature difference and feature displacement sub-modules in the second intermediate feature extraction module is greater than the number of feature difference and feature displacement sub-modules in the initial feature extraction module, the first intermediate feature extraction module, and the third intermediate feature extraction module.
  5. The action recognition system according to claim 3 or 4, wherein the feature difference and feature displacement sub-module comprises a first normalization layer, a feature displacement unit, a feature difference unit, a second normalization layer, a first fully connected layer, a first GELU function layer, and a second fully connected layer, wherein,
    第一归一化层、特征位移单元、特征差异单元、第二归一化层、第一全连接层、第一GELU函数层、第二全连接层依次连接;The first normalization layer, the feature displacement unit, the feature difference unit, the second normalization layer, the first fully connected layer, the first GELU function layer, and the second fully connected layer are sequentially connected;
    the input data of the second normalization layer is a first residual calculation result; the first residual calculation result is calculated from the input data of the first normalization layer and the output data of the feature difference unit;
    the output data of the feature difference and feature displacement sub-module is a second residual calculation result; the second residual calculation result is calculated from the input data of the second normalization layer and the output data of the second fully connected layer.
  6. The action recognition system according to claim 5, wherein the feature displacement unit includes a first channel fully connected layer, a horizontal feature displacement layer, a second channel fully connected layer, a vertical feature displacement layer, a third channel fully connected layer, and a fourth channel fully connected layer, wherein,
    所述第一信道全连接层被配置为对输入数据的信道进行全连接,得到全连接结果,并将所述全连接结果分别输入至所述水平特征位移层和所述竖直特征位移层中;The first channel full connection layer is configured to perform full connection on channels of input data to obtain a full connection result, and input the full connection result into the horizontal feature displacement layer and the vertical feature displacement layer respectively ;
    所述水平特征位移层被配置为对所述全连接结果进行水平位移,得到水平位移结果,并将所述水平位移结果输入至所述第二信道全连接层中;The horizontal feature displacement layer is configured to perform horizontal displacement on the fully connected result to obtain a horizontal displacement result, and input the horizontal displacement result into the second channel fully connected layer;
    所述竖直特征位移层被配置为对所述全连接结果进行竖直位移,得到竖直位移结果,并将所述竖直位移结果输入至所述第三信道全连接层中;The vertical characteristic displacement layer is configured to perform vertical displacement on the fully connected result to obtain a vertical displacement result, and input the vertical displacement result into the third channel fully connected layer;
    所述第四信道全连接层被配置为对所述第二信道全连接层和第三信道全连接层的输出结果的和进行处理,得到所述特征位移单元的输出结果。The fourth channel fully connected layer is configured to process the sum of the output results of the second channel fully connected layer and the third channel fully connected layer to obtain the output result of the feature displacement unit.
  7. The action recognition system according to claim 5, wherein the feature difference unit comprises: an input layer, a max pooling layer, a third fully connected layer, a second GELU function layer, a fourth fully connected layer, an upsampling layer, a fifth fully connected layer, a third GELU function layer, a sixth fully connected layer, and a feature difference output layer, wherein,
    输入层、最大池化层、第三全连接层、第二GELU函数层、第四全连接层、上采样层、特征差异输出层依次连接;The input layer, the maximum pooling layer, the third fully connected layer, the second GELU function layer, the fourth fully connected layer, the upsampling layer, and the feature difference output layer are connected in sequence;
    输入层、第五全连接层、第三GELU函数层、第六全连接层、特征差异输出层依次连接;The input layer, the fifth fully connected layer, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer are connected in sequence;
    所述输入层被配置为将当前时刻分段对应的输入特征与上一时刻分段对应的输入特征作差,并将差值特征分别输入到最大池化层和第五全连接层中;The input layer is configured to make a difference between the input feature corresponding to the segment at the current moment and the input feature corresponding to the segment at the previous moment, and input the difference feature into the maximum pooling layer and the fifth fully connected layer respectively;
    the feature difference output layer is configured to sum the output results of the upsampling layer and the sixth fully connected layer to obtain a summation result; multiply the summation result point by point with the input feature corresponding to the segment at the previous moment to obtain a multiplication result; and add the multiplication result to the input feature corresponding to the segment at the previous moment to obtain the output result of the feature difference unit.
  8. 根据权利要求1所述的动作识别系统,其中,所述分类网络包括第一时间平均池化层、第二时间平均池化层、静态特征分类器、动态特征分类器、输出层,The action recognition system according to claim 1, wherein the classification network comprises a first temporal average pooling layer, a second temporal average pooling layer, a static feature classifier, a dynamic feature classifier, an output layer,
    所述第一时间平均池化层被配置为对多个分段对应的静态分类特征进行时间平均池化,并将池化结果输入至所述静态特征分类器中;The first temporal average pooling layer is configured to perform temporal average pooling on static classification features corresponding to multiple segments, and input pooling results into the static feature classifier;
    所述第二时间平均池化层被配置为对多个分段对应的动态分类特征进行时间平均池化,并将池化结果输入至所述动态特征分类器中;The second temporal average pooling layer is configured to perform temporal average pooling on dynamic classification features corresponding to multiple segments, and input pooling results into the dynamic feature classifier;
    所述静态特征分类器被配置为根据所述静态分类特征得到第一分类结果;The static feature classifier is configured to obtain a first classification result according to the static classification feature;
    所述动态特征分类器被配置为根据所述动态分类特征得到第二分类结果;The dynamic feature classifier is configured to obtain a second classification result according to the dynamic classification feature;
    所述识别结果输出层被配置为将所述第一分类结果和所述第二分类结果的加权平均结果作为输出结果。The recognition result output layer is configured to use a weighted average result of the first classification result and the second classification result as an output result.
  9. 一种动作识别模型训练方法,包括:A method for training an action recognition model, comprising:
    获取多个图像序列,所述图像序列中标注有行人动作类型;Obtaining a plurality of image sequences, the image sequences are marked with pedestrian action types;
    将各所述图像序列分为多段子序列,得到训练数据集;Dividing each of the image sequences into multiple subsequences to obtain a training data set;
    inputting the training data set into a neural network system, and training the neural network system until the loss value of the loss function satisfies a loss condition, to obtain the action recognition model, wherein the neural network system is the action recognition system according to any one of claims 1-8.
  10. 根据权利要求9所述的动作识别模型训练方法,其中,The action recognition model training method according to claim 9, wherein,
    所述损失函数采用正交投影损失函数和交叉熵损失函数联合得到。The loss function is jointly obtained by using an orthogonal projection loss function and a cross-entropy loss function.
  11. 根据权利要求10所述的动作识别模型训练方法,其中,所述损失函数通过所述正交投影损失函数与控制正交投影损失权重的超参数之积,与所述交叉熵损失函数相加得到。The action recognition model training method according to claim 10, wherein the loss function is obtained by adding the product of the orthogonal projection loss function and a hyperparameter controlling the weight of the orthogonal projection loss to the cross-entropy loss function .
  12. 一种动作识别方法,包括:A method for action recognition, comprising:
    获取目标对象的图像序列,将所述图像序列分为多段子序列;acquiring an image sequence of the target object, and dividing the image sequence into multiple subsequences;
    将所述子序列输入动作识别模型,生成动作识别结果,所述动作识别模型通过如权利要求9-11中任一项所述的动作识别模型训练方法训练得到。The subsequence is input into an action recognition model to generate an action recognition result, and the action recognition model is trained by the action recognition model training method according to any one of claims 9-11.
  13. 根据权利要求12所述的动作识别方法,其中,在将所述图像序列分为多段子序列的步骤之后,将所述子序列输入动作识别模型的步骤之前,所述方法还包括:The action recognition method according to claim 12, wherein, after the step of dividing the image sequence into multiple subsequences, and before the step of inputting the subsequences into an action recognition model, the method further comprises:
    对所述图像序列中的图像进行等比例缩放,得到缩放图像,所述缩放图像的短边大小位于预设区间内;Performing proportional scaling on the images in the image sequence to obtain a scaled image, the size of the short side of the scaled image is within a preset interval;
    对所述缩放图像进行随机剪裁,得到裁剪图像,所述裁剪图像的大小满足预设条件;Randomly cropping the scaled image to obtain a cropped image, the size of the cropped image satisfies a preset condition;
    将所述裁剪图像作为子序列中的图像,执行将所述子序列输入动作识别模型的步骤。Using the cropped image as an image in a subsequence, the step of inputting the subsequence into an action recognition model is performed.
  14. 一种动作识别模型训练装置,包括:An action recognition model training device, comprising:
    图像获取模块,被配置为获取多个图像序列,所述图像序列中标注有行人动作类型;An image acquisition module configured to acquire a plurality of image sequences, wherein the image sequences are marked with pedestrian action types;
    训练数据获取模块,被配置为将各所述图像序列分为多段子序列,得到训练数据集;The training data acquisition module is configured to divide each of the image sequences into multiple subsequences to obtain a training data set;
    a model training module, configured to input the training data set into a neural network system and train the neural network system to obtain the action recognition model, wherein the neural network system is the action recognition system according to any one of claims 1-8.
  15. 一种动作识别装置,包括:An action recognition device, comprising:
    图像采集模块,被配置为获取目标对象的图像序列,将所述图像序列分为多段子序列;An image acquisition module configured to acquire an image sequence of a target object, and divide the image sequence into multiple subsequences;
    动作识别模块,被配置为将所述子序列输入动作识别模型,生成动作识别结果,所述动作识别模型通过如权利要求9-11中任一项所述的动作识别模型训练方法训练得到。The action recognition module is configured to input the subsequence into an action recognition model to generate an action recognition result, and the action recognition model is trained by the action recognition model training method according to any one of claims 9-11.
  16. 一种计算机设备,包括:A computer device comprising:
    at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so as to run the action recognition system according to any one of claims 1-8, or execute the action recognition model training method according to any one of claims 9-11, or execute the action recognition method according to claim 12 or 13.
  17. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机指令,所述计算机指令用于使所述计算机执行如权利要求1-8中任一项所述的动作识别系统,或,执行如权利要求9-11中任一项所述的动作识别模型训练方法,或,执行如权利要求12或13所述的 动作识别方法。A computer-readable storage medium, the computer-readable storage medium stores computer instructions, the computer instructions are used to make the computer execute the action recognition system according to any one of claims 1-8, or, Execute the action recognition model training method according to any one of claims 9-11, or execute the action recognition method according to claim 12 or 13.
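
The following Python sketch is provided for illustration only. It shows one possible way to realize the preprocessing and inference steps of claims 12 and 13: proportional rescaling so that the short side falls in a preset interval, random cropping to a fixed size, and feeding the segmented frame sequence to a trained model. The model object, segment count, interval bounds, and crop size used here are assumptions made for the example and are not specified by the patent.

import random
from typing import List

import torch
from PIL import Image
from torchvision import transforms


def rescale_short_side(img: Image.Image, short_min: int = 256, short_max: int = 320) -> Image.Image:
    # Proportional scaling: the short side is resized to a value inside the preset interval (assumed bounds).
    target_short = random.randint(short_min, short_max)
    w, h = img.size
    scale = target_short / min(w, h)
    return img.resize((round(w * scale), round(h * scale)))


def random_crop(img: Image.Image, crop_size: int = 224) -> Image.Image:
    # Random cropping to a fixed size that satisfies the preset condition (assumed crop size).
    w, h = img.size
    left = random.randint(0, w - crop_size)
    top = random.randint(0, h - crop_size)
    return img.crop((left, top, left + crop_size, top + crop_size))


def recognize_action(frames: List[Image.Image], model: torch.nn.Module, num_segments: int = 8) -> int:
    # Divide the frame sequence into segments, sample one preprocessed frame per segment,
    # and feed the resulting clip to a trained action recognition model.
    to_tensor = transforms.ToTensor()
    step = max(len(frames) // num_segments, 1)
    sampled = frames[::step][:num_segments]
    clip = torch.stack([to_tensor(random_crop(rescale_short_side(f))) for f in sampled])
    with torch.no_grad():
        scores = model(clip.unsqueeze(0))  # assumed to return class scores of shape (1, num_classes)
    return int(scores.argmax(dim=-1))

In practice the interval bounds, crop size, and number of segments would be chosen to match whatever input the trained action recognition model expects.
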
PCT/CN2022/114819 2022-02-25 2022-08-25 Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium WO2023159898A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210179444.0 2022-02-25
CN202210179444.0A CN114565973A (en) 2022-02-25 2022-02-25 Motion recognition system, method and device and model training method and device

Publications (1)

Publication Number Publication Date
WO2023159898A1 true WO2023159898A1 (en) 2023-08-31

Family

ID=81716472

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/114819 WO2023159898A1 (en) 2022-02-25 2022-08-25 Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN114565973A (en)
WO (1) WO2023159898A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115596A (en) * 2023-10-25 2023-11-24 腾讯科技(深圳)有限公司 Training method, device, equipment and medium of object action classification model

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565973A (en) * 2022-02-25 2022-05-31 全球能源互联网研究院有限公司 Motion recognition system, method and device and model training method and device
CN115115919B (en) * 2022-06-24 2023-05-05 国网智能电网研究院有限公司 Power grid equipment thermal defect identification method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681695A (en) * 2018-04-26 2018-10-19 北京市商汤科技开发有限公司 Video actions recognition methods and device, electronic equipment and storage medium
CN111931603A (en) * 2020-07-22 2020-11-13 北方工业大学 Human body action recognition system and method based on double-current convolution network of competitive combination network
CN113221694A (en) * 2021-04-29 2021-08-06 苏州大学 Action recognition method
CN114565973A (en) * 2022-02-25 2022-05-31 全球能源互联网研究院有限公司 Motion recognition system, method and device and model training method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115596A (en) * 2023-10-25 2023-11-24 腾讯科技(深圳)有限公司 Training method, device, equipment and medium of object action classification model
CN117115596B (en) * 2023-10-25 2024-02-02 腾讯科技(深圳)有限公司 Training method, device, equipment and medium of object action classification model

Also Published As

Publication number Publication date
CN114565973A (en) 2022-05-31

Similar Documents

Publication Publication Date Title
Zeng et al. Multi-scale convolutional neural networks for crowd counting
TWI750498B (en) Method and device for processing video stream
WO2023159898A1 (en) Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium
CN107529650B (en) Closed loop detection method and device and computer equipment
Ye et al. Dynamic texture based smoke detection using Surfacelet transform and HMT model
CN107330390B (en) People counting method based on image analysis and deep learning
CN105488812A (en) Motion-feature-fused space-time significance detection method
WO2022134655A1 (en) End-to-end video action detection and positioning system
CN111160295A (en) Video pedestrian re-identification method based on region guidance and space-time attention
WO2020233397A1 (en) Method and apparatus for detecting target in video, and computing device and storage medium
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN109859246B (en) Low-altitude slow unmanned aerial vehicle tracking method combining correlation filtering and visual saliency
Jiang et al. A self-attention network for smoke detection
US20230095533A1 (en) Enriched and discriminative convolutional neural network features for pedestrian re-identification and trajectory modeling
Patil et al. Multi-frame recurrent adversarial network for moving object segmentation
Angelo A novel approach on object detection and tracking using adaptive background subtraction method
CN108647605B (en) Human eye gaze point extraction method combining global color and local structural features
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
Hua et al. Dynamic scene deblurring with continuous cross-layer attention transmission
CN111079516B (en) Pedestrian gait segmentation method based on deep neural network
CN116645718A (en) Micro-expression recognition method and system based on multi-stream architecture
Toha et al. LC-Net: Localized Counting Network for extremely dense crowds
JP7253967B2 (en) Object matching device, object matching system, object matching method, and computer program
Kalboussi et al. A spatiotemporal model for video saliency detection
Chen et al. Early fire detection using HEP and space-time analysis

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22928164

Country of ref document: EP

Kind code of ref document: A1