WO2023159898A1 - Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium - Google Patents


Info

Publication number
WO2023159898A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
layer
action recognition
displacement
static
Prior art date
Application number
PCT/CN2022/114819
Other languages
French (fr)
Chinese (zh)
Inventor
张国梁
杜泽旭
张屹
吴鹏
郑晓崑
Original Assignee
国网智能电网研究院有限公司
国网山东省电力公司枣庄供电公司
国家电网有限公司
Priority date
Filing date
Publication date
Application filed by 国网智能电网研究院有限公司, 国网山东省电力公司枣庄供电公司, 国家电网有限公司
Publication of WO2023159898A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Definitions

  • the present invention is based on a Chinese patent application with application number 202210179444.0 and a filing date of February 25, 2022, and claims the priority of this Chinese patent application.
  • the entire content of this Chinese patent application is hereby incorporated by reference.
  • The present invention relates to the technical field of image processing, and in particular to an action recognition system, method, and apparatus, a model training method and apparatus, computer device, and computer-readable storage medium.
  • Surveillance cameras are now ubiquitous in companies, factories, shopping malls, roads, and train stations. However, cameras alone make it difficult to monitor violations and abnormal behaviors in real time: when an abnormal behavior occurs, searching the surveillance video frame by frame is time-consuming, labor-intensive, and error-prone. If action recognition technology can detect specific abnormal behaviors in real time, it can greatly save manpower and material resources and improve efficiency. Action recognition therefore has important practical value.
  • A video action recognition algorithm needs to extract the temporal information between video frames, so the network model must have temporal modeling capability.
  • Deep-learning-based action recognition methods fall mainly into two categories: methods based on two-stream networks and methods based on three-dimensional (3D) convolutional networks.
  • Two-stream methods use optical flow as the temporal information; the optical flow must be computed in advance and stored on the local hard disk, which often requires a large amount of storage for large data sets, and the real-time performance of two-stream methods is also poor.
  • The technical problem to be solved by the present invention is to overcome the defect in the prior art that a large amount of data is required to fit a model for recognizing actions, and thereby to provide an action recognition system, method, and apparatus, a model training method and apparatus, computer device, and computer-readable storage medium.
  • the first aspect of the embodiment of the present invention provides an action recognition system, including: an information separation network, a static feature network, a dynamic feature network, and a classification network.
  • The information separation network includes a band-pass filtering module and a static feature extraction module. The band-pass filtering module is configured to extract a dynamic feature map based on multiple frames of continuous images in a segment; the static feature extraction module is configured to perform temporal average pooling on the multiple frames of continuous images in the segment to obtain a feature map, and to subtract the dynamic feature map from that feature map to obtain a static feature map. The static feature network is configured to perform feature displacement operations on the static feature maps corresponding to multiple segments, and to calculate the difference features between the static feature maps corresponding to the segments to obtain static classification features.
  • the dynamic feature network is configured to perform feature displacement operations on the dynamic feature maps corresponding to multiple segments, and calculate the difference between the dynamic feature maps corresponding to each segment to obtain dynamic classification features;
  • The classification network is configured to obtain the action recognition result according to the static classification features and the dynamic classification features.
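  • As an illustration only, the following PyTorch-style sketch shows how the four networks described above could be wired together; the class and argument names are placeholders and not taken from the filing, and the concrete sub-networks are assumed to be supplied by the caller.

```python
import torch.nn as nn

class ActionRecognitionSystem(nn.Module):
    """Sketch: information separation -> static/dynamic feature networks -> classification."""
    def __init__(self, info_sep, static_net, dynamic_net, classifier):
        super().__init__()
        self.info_sep = info_sep          # information separation network
        self.static_net = static_net      # static feature network
        self.dynamic_net = dynamic_net    # dynamic feature network
        self.classifier = classifier      # classification network

    def forward(self, segments):
        # segments: (B, N, K, C, H, W) -- N segments of K consecutive frames each
        static_feats, dynamic_feats = [], []
        for n in range(segments.shape[1]):
            dynamic_map, static_map = self.info_sep(segments[:, n])  # separate one segment
            dynamic_feats.append(dynamic_map)
            static_feats.append(static_map)
        static_cls = self.static_net(static_feats)      # feature shift + difference across segments
        dynamic_cls = self.dynamic_net(dynamic_feats)
        return self.classifier(static_cls, dynamic_cls)  # weighted fusion of the two classifiers
```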
  • the bandpass filter module includes a spatial convolution layer and a temporal convolution layer.
  • At least one of the static feature network and the dynamic feature network includes: an image segmentation module, an initial feature extraction module, and at least one intermediate feature extraction module. The image segmentation module is configured to segment the input feature map according to a first preset size to obtain a first feature vector.
  • The initial feature extraction module includes a linear embedding sub-module and at least one feature difference and feature displacement sub-module. The linear embedding sub-module is configured to convert the first feature vector according to a preset number of channels to obtain a second feature vector; the feature difference and feature displacement sub-module is configured to perform a feature displacement operation on the second feature vector and to calculate the difference features between the second feature vectors corresponding to the segments to obtain initial classification features.
  • The intermediate feature extraction module includes a feature merging sub-module and at least one feature difference and feature displacement sub-module. The feature merging sub-module is configured to merge the initial classification features according to a second preset size to obtain a third feature vector; the feature difference and feature displacement sub-module is configured to perform a feature displacement operation on the third feature vector and to calculate the difference features between the third feature vectors corresponding to the segments to obtain classification features.
  • At least one of the static feature network and the dynamic feature network includes three intermediate feature extraction modules: a first intermediate feature extraction module, a second intermediate feature extraction module, and a third intermediate feature extraction module. The image segmentation module, the initial feature extraction module, the first intermediate feature extraction module, the second intermediate feature extraction module, and the third intermediate feature extraction module are connected in sequence. The number of feature difference and feature displacement sub-modules is the same in the initial feature extraction module, the first intermediate feature extraction module, and the third intermediate feature extraction module; the number of feature difference and feature displacement sub-modules in the second intermediate feature extraction module is greater than the number in the initial feature extraction module, the first intermediate feature extraction module, and the third intermediate feature extraction module.
  • The feature difference and feature displacement sub-module includes a first normalization layer, a feature displacement unit, a feature difference unit, a second normalization layer, a first fully connected layer, a first GELU function layer, and a second fully connected layer, connected in sequence. The input data of the second normalization layer is a first residual calculation result, which is calculated from the input data of the first normalization layer and the output data of the feature difference unit. The output data of the feature difference and feature displacement sub-module is a second residual calculation result, which is calculated from the input data of the second normalization layer and the output data of the second fully connected layer.
  • The feature displacement unit includes a first channel fully connected layer, a horizontal feature displacement layer, a second channel fully connected layer, a vertical feature displacement layer, a third channel fully connected layer, and a fourth channel fully connected layer. The first channel fully connected layer is configured to fully connect the channels of the input data to obtain a fully connected result, and to input the fully connected result into the horizontal feature displacement layer and the vertical feature displacement layer respectively; the horizontal feature displacement layer is configured to horizontally displace the fully connected result to obtain a horizontal displacement result, and to input the horizontal displacement result into the second channel fully connected layer; the vertical feature displacement layer is configured to vertically displace the fully connected result to obtain a vertical displacement result, and to input the vertical displacement result into the third channel fully connected layer; the fourth channel fully connected layer is configured to process the sum of the output results of the second channel fully connected layer and the third channel fully connected layer to obtain the output result of the feature displacement unit.
  • The feature difference unit includes: an input layer, a maximum pooling layer, a third fully connected layer, a second GELU function layer, a fourth fully connected layer, an upsampling layer, a fifth fully connected layer, a third GELU function layer, a sixth fully connected layer, and a feature difference output layer. The input layer, the maximum pooling layer, the third fully connected layer, the second GELU function layer, the fourth fully connected layer, the upsampling layer, and the feature difference output layer are connected in sequence; the input layer, the fifth fully connected layer, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer are also connected in sequence. The input layer is configured to compute the difference between the input features corresponding to the segment at the current moment and the input features corresponding to the segment at the previous moment, and to input the difference features into the maximum pooling layer and the fifth fully connected layer respectively; the feature difference output layer is configured to combine the output results of the upsampling layer and the sixth fully connected layer to obtain the output result of the feature difference unit.
  • The classification network includes a first temporal average pooling layer, a second temporal average pooling layer, a static feature classifier, a dynamic feature classifier, and a recognition result output layer. The first temporal average pooling layer is configured to perform temporal average pooling on the static classification features corresponding to multiple segments and to input the pooling result into the static feature classifier; the second temporal average pooling layer is configured to perform temporal average pooling on the dynamic classification features corresponding to multiple segments and to input the pooling result into the dynamic feature classifier.
  • the static feature classifier is configured to obtain the first classification result according to the static classification feature;
  • the dynamic feature classifier is configured to obtain the second classification result according to the dynamic classification feature;
  • the recognition result output layer is configured to take the weighted average result of the first classification result and the second classification result as the output result.
  • the second aspect of the embodiment of the present invention provides an action recognition model training method, including: acquiring multiple image sequences, in which the types of pedestrian actions are marked; dividing each image sequence into multiple sub-sequences to obtain a training data set; The training data set is input into the neural network system, and the neural network system is trained until the loss value of the loss function satisfies the loss condition, and an action recognition model is obtained.
  • the neural network system is the action recognition system provided in the first aspect of the embodiment of the present invention.
  • the loss function is jointly obtained by using an orthogonal projection loss function and a cross-entropy loss function.
  • the loss function is obtained by adding the product of the orthogonal projection loss function and a hyperparameter controlling the weight of the orthogonal projection loss to the cross-entropy loss function.
  • The third aspect of the embodiment of the present invention provides an action recognition method, including: acquiring an image sequence of a target object and dividing the image sequence into multiple sub-sequences; and inputting the sub-sequences into an action recognition model to generate an action recognition result, where the action recognition model is obtained through training by the action recognition model training method provided in the second aspect of the embodiment of the present invention.
  • The method further includes: proportionally scaling the images in the image sequence to obtain scaled images, where the size of the short side of each scaled image is within a preset range; randomly cropping the scaled images to obtain cropped images, where the size of each cropped image meets a preset condition; and using the cropped images as the images in the sub-sequences before executing the step of inputting the sub-sequences into the action recognition model.
  • The fourth aspect of the embodiment of the present invention provides an action recognition model training device, including: an image acquisition module configured to acquire a plurality of image sequences in which pedestrian action types are marked; a training data acquisition module configured to divide each image sequence into multiple sub-sequences to obtain a training data set; and a model training module configured to input the training data set into a neural network system and train the neural network system to obtain an action recognition model, where the neural network system is the action recognition system provided in the first aspect of the embodiment of the present invention.
  • The fifth aspect of the embodiment of the present invention provides an action recognition device, including: an image acquisition module configured to acquire an image sequence of a target object and divide the image sequence into multiple sub-sequences; and an action recognition module configured to input the sub-sequences into an action recognition model to generate an action recognition result, where the action recognition model is trained by the action recognition model training method provided in the second aspect of the embodiment of the present invention.
  • The sixth aspect of the embodiment of the present invention provides a computer device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so as to run the action recognition system provided in the first aspect of the embodiment of the present invention, or execute the action recognition model training method provided in the second aspect of the embodiment of the present invention, or execute the action recognition method provided in the third aspect of the embodiment of the present invention.
  • The seventh aspect of the embodiment of the present invention provides a computer-readable storage medium storing computer instructions, where the computer instructions are configured to cause a computer to run the action recognition system provided in the first aspect of the embodiment of the present invention, or execute the action recognition model training method provided in the second aspect of the embodiment of the present invention, or execute the action recognition method provided in the third aspect of the embodiment of the present invention.
  • the action recognition system includes an information separation network, a static feature network, a dynamic feature network, and a classification network.
  • The dynamic feature map and the static feature map in the images are first separated by the information separation network, and are then input into the dynamic feature network and the static feature network respectively.
  • The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network.
  • Calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition.
  • the action recognition system provided by the embodiment of the present invention can use less data for training to obtain an action recognition model, and the action recognition model trained by the action recognition system can realize accurate recognition of actions.
  • In the action recognition model training method and device, after the training data set is obtained, the training data set is input into the action recognition system provided in the first aspect of the embodiment of the present invention, and the action recognition system is trained to obtain the action recognition model.
  • The action recognition system provided by the first aspect of the embodiment of the present invention includes an information separation network, a static feature network, a dynamic feature network, and a classification network. The system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video. The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features; the displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network.
  • Therefore, the action recognition model training method provided by the embodiment of the present invention can obtain an action recognition model with less training data, and the action recognition model trained by the method and device provided by the embodiment of the present invention can recognize actions accurately.
  • The action recognition method and device provided by the embodiments of the present invention divide the image sequence of a target object into multiple subsequences after acquiring it, and input the subsequences into the action recognition model trained by the action recognition model training method provided in the second aspect of the embodiment of the present invention, which obtains the action recognition model by training the action recognition system.
  • The action recognition system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video.
  • The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition model trained by the training method provided in the second aspect of the embodiment of the present invention can recognize actions accurately; therefore, implementing the embodiment of the present invention achieves accurate recognition of actions.
  • Fig. 1 is a functional block diagram of an example of an action recognition system in an embodiment of the present invention
  • FIG. 2 is a functional block diagram of an example of a static feature network and/or a dynamic feature network in an embodiment of the present invention
  • Fig. 3 is a functional block diagram of an example of the feature difference and feature displacement sub-module in the embodiment of the present invention.
  • Fig. 4 is a functional block diagram of an example of a feature displacement unit in an embodiment of the present invention.
  • FIG. 5 is a functional block diagram of an example of a feature difference unit in an embodiment of the present invention.
  • Fig. 6 is a flowchart of an example of an action recognition model training method in an embodiment of the present invention.
  • FIG. 7 is a flowchart of an example of an action recognition method in an embodiment of the present invention.
  • Fig. 8 is a functional block diagram of an example of an action recognition model training device in an embodiment of the present invention.
  • FIG. 9 is a functional block diagram of an example of an action recognition device in an embodiment of the present invention.
  • FIG. 10 is a functional block diagram of an example of computer equipment in an embodiment of the present invention.
  • An embodiment of the present invention provides an action recognition system, as shown in FIG. 1 , including: an information separation network 11, a static feature network 12, a dynamic feature network 13, and a classification network 14.
  • The information separation network 11 includes a bandpass filter module 111 and a static feature extraction module 112.
  • the band-pass filtering module 111 is configured to extract a dynamic feature map according to the acquired multi-frame continuous images in a segment.
  • the bandpass filter module 111 sequentially extracts the dynamic feature map corresponding to each segment based on the continuous images in each segment.
  • the static feature extraction module 112 is configured to perform temporal average pooling on multiple frames of continuous images in a segment to obtain a feature map, and make a difference between the feature map and the dynamic feature map to obtain a static feature map.
  • the static feature extraction module 112 first performs temporal average pooling on multiple frames of continuous images in a segment through the temporal average pooling layer to obtain a feature map with a time dimension of 1, and combines the feature map with the segment The dynamic feature map corresponding to the segment is subtracted to obtain the static feature map of the segment.
  • the static feature extraction module 112 sequentially extracts the static feature maps corresponding to each segment based on the continuous images in each segment.
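  • As a minimal PyTorch-style sketch of the separation step described above (assumptions: a segment tensor of shape (B, C, K, H, W), and a `bandpass` callable standing in for the band-pass filtering module 111 whose output has a collapsed, broadcastable time dimension):

```python
import torch

def separate_information(segment: torch.Tensor, bandpass):
    """Split one segment into a dynamic feature map and a static feature map."""
    dynamic = bandpass(segment)                   # dynamic feature map from the band-pass module
    pooled = segment.mean(dim=2, keepdim=True)    # temporal average pooling -> time dimension of 1
    static = pooled - dynamic                     # static map = pooled frames minus dynamic map
    return dynamic, static
```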
  • the static feature network 12 is configured to perform feature displacement operations on the static feature maps corresponding to multiple segments, and calculate the difference features between the static feature maps corresponding to each segment to obtain static classification features.
  • When action recognition is performed through the action recognition system provided by the embodiment of the present invention, the static feature network 12 performs the feature displacement operation on the static feature map corresponding to each segment independently; that is, performing the feature displacement operation on the static feature map corresponding to any segment is not affected by the static feature maps corresponding to the other segments. However, when the static feature network 12 calculates the difference features between the static feature maps corresponding to the segments, it needs to combine the static feature maps corresponding to two adjacent segments to calculate the difference features.
  • The dynamic feature network 13 is configured to perform a feature displacement operation on the dynamic feature maps corresponding to multiple segments, and to calculate the difference features between the dynamic feature maps corresponding to the segments to obtain dynamic classification features.
  • The dynamic feature network 13 performs the feature displacement operation on the dynamic feature map corresponding to each segment independently; that is, performing the feature displacement operation on the dynamic feature map corresponding to any segment is not affected by the dynamic feature maps corresponding to the other segments. However, when the dynamic feature network 13 calculates the difference features between the dynamic feature maps corresponding to the segments, it needs to combine the dynamic feature maps corresponding to two adjacent segments to calculate the difference features.
  • the classification network 14 is configured to obtain action recognition results according to static classification features and dynamic classification features.
  • the classification network includes a classifier, and the static classification feature and the dynamic classification feature are analyzed by the classifier to obtain an action recognition result.
  • the action recognition system includes an information separation network 11, a static feature network 12, a dynamic feature network 13, and a classification network 14.
  • The dynamic feature map and the static feature map in the images are separated through the information separation network 11, and are then input into the dynamic feature network 13 and the static feature network 12 respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video.
  • The static feature network 12 and the dynamic feature network 13 perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition system provided by the embodiment of the present invention can be trained with less data to obtain an action recognition model, and the action recognition model trained by this system can recognize actions accurately.
  • the bandpass filtering module 111 includes a spatial convolution layer and a temporal convolution layer.
  • LoG_σ(x, y) denotes the Laplacian-of-Gaussian operator with parameter σ, i.e. LoG_σ(x, y) = -(1/(πσ⁴)) · (1 − (x² + y²)/(2σ²)) · exp(−(x² + y²)/(2σ²)); K denotes the number of image frames in a segment.
  • The spatial convolution layer has a convolution kernel of size k × k and is initialized with the Laplacian-of-Gaussian operator with parameter σ, with the sum of the convolution kernel values normalized to 1. The temporal convolution layer has a time step of s, and its convolution kernel values are initialized to preset values.
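  • One way the band-pass filtering module could be realised is sketched below in PyTorch; kernel size k, parameter σ, and time step s follow the description, while the temporal-kernel initialisation value is not reproduced in the source, so a uniform value is used purely as a placeholder.

```python
import math
import torch
import torch.nn as nn

def log_kernel(k: int, sigma: float) -> torch.Tensor:
    """k x k Laplacian-of-Gaussian kernel."""
    ax = torch.arange(k, dtype=torch.float32) - (k - 1) / 2.0
    y, x = torch.meshgrid(ax, ax, indexing="ij")
    r2 = x ** 2 + y ** 2
    log = -(1.0 / (math.pi * sigma ** 4)) * (1 - r2 / (2 * sigma ** 2)) * torch.exp(-r2 / (2 * sigma ** 2))
    # the description normalizes the sum of the kernel values to 1; since the raw LoG
    # sum is close to zero, the absolute sum is used here as one possible reading
    return log / log.abs().sum()

class BandPassFilter(nn.Module):
    """Sketch: LoG-initialised spatial convolution followed by a temporal convolution."""
    def __init__(self, channels: int = 3, k: int = 5, sigma: float = 1.0, s: int = 2):
        super().__init__()
        # depthwise spatial convolution initialised with the LoG kernel
        self.spatial = nn.Conv3d(channels, channels, (1, k, k),
                                 padding=(0, k // 2, k // 2), groups=channels, bias=False)
        self.spatial.weight.data.copy_(
            log_kernel(k, sigma).view(1, 1, 1, k, k).repeat(channels, 1, 1, 1, 1))
        # temporal convolution with time step s (placeholder uniform initialisation)
        self.temporal = nn.Conv3d(channels, channels, (s, 1, 1), stride=(s, 1, 1),
                                  groups=channels, bias=False)
        nn.init.constant_(self.temporal.weight, 1.0 / s)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, K, H, W) -- K consecutive frames of one segment
        return self.temporal(self.spatial(x))
```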
  • The static feature network 12 and the dynamic feature network 13 may have the same network structure or different network structures. When the static feature network 12 and the dynamic feature network 13 have the same network structure, the values of their network parameters are different.
  • At least one of the static feature network 12 and the dynamic feature network 13 includes: an image segmentation module 121 , an initial feature extraction module 122 , and at least one intermediate feature extraction module 123 .
  • the image segmentation module 121 is configured to segment the input feature map according to a first preset size to obtain a first feature vector.
  • The input data of the static feature network 12 is the static feature map, and the image segmentation module 121 in the static feature network 12 segments the static feature map; the input data of the dynamic feature network 13 is the dynamic feature map, and the image segmentation module 121 in the dynamic feature network 13 segments the dynamic feature map.
  • Assume the input feature map has a spatial size of H × W, where H and W represent the height and width of the image and 3 represents the number of channels of the image. The image segmentation module 121 divides the image into blocks of size 4 × 4 and flattens each 4 × 4 block obtained by segmentation into a vector, yielding a feature of size (H/4) × (W/4) × 48, where (H/4) × (W/4) indicates the number of blocks and 48 (= 4 × 4 × 3) is the number of channels.
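  • A small PyTorch-style sketch of this segmentation step (the function name and tensor layout are illustrative):

```python
import torch

def partition_patches(x: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """x: (B, 3, H, W) -> (B, (H/patch) * (W/patch), 3 * patch * patch)."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)            # (B, H/4, W/4, C, 4, 4)
    return x.reshape(b, (h // patch) * (w // patch), c * patch * patch)  # 48 = 4*4*3 channels
```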
  • the initial feature extraction module 122 includes a linear embedding submodule 1221 and at least one feature difference and feature displacement submodule 1222.
  • the linear embedding submodule 1221 is configured to convert the first feature vector according to the preset number of channels to obtain the second feature vector;
  • the feature difference and feature displacement sub-module 1222 is configured to perform a feature displacement operation on the second feature vector, and calculate the difference feature between the second feature vectors corresponding to each segment to obtain the initial classification feature.
  • The linear embedding sub-module 1221 projects the first feature vector to a feature of size (H/4) × (W/4) × C, where C represents the number of channels. The initial feature extraction module 122 includes two consecutive feature difference and feature displacement sub-modules 1222: after the first feature vector is processed by the linear embedding sub-module 1221 to obtain the second feature vector, the two consecutive feature difference and feature displacement sub-modules 1222 process the second feature vector to obtain the initial classification features.
  • the intermediate feature extraction module 123 includes a feature merging submodule 1231 and at least one feature difference and feature displacement submodule 1222.
  • the feature merging submodule 1231 is configured to merge the initial classification features according to the second preset size to obtain a third feature vector;
  • the feature difference and feature displacement sub-module 1222 is configured to perform a feature displacement operation on the third feature vector, and calculate a difference feature between the third feature vectors corresponding to each segment to obtain a classification feature.
  • The feature merging sub-module 1231 in the intermediate feature extraction module 123 merges the initial classification features obtained in the previous stage by combining blocks according to a size of 2 × 2 and flattening them into vectors; the merged feature then passes through at least one feature difference and feature displacement sub-module 1222 before being output.
  • the feature difference and feature displacement sub-modules 1222 in different intermediate feature extraction modules 123 may be the same or different.
  • the number of feature difference and feature displacement sub-modules 1222 in the intermediate feature extraction module 123 may be 2, 6, and so on.
  • At least one of the static feature network 12 and the dynamic feature network 13 includes three intermediate feature extraction modules 123: a first intermediate feature extraction module 123, a second intermediate feature extraction module 123, and a third intermediate feature extraction module 123.
  • the image segmentation module 121, the initial feature extraction module 122, the first intermediate feature extraction module 123, the second intermediate feature extraction module 123, and the third intermediate feature extraction module 123 are connected in sequence. That is, in the embodiment of the present invention, the output data of the image segmentation module 121 is the input data of the initial feature extraction module 122, the output data of the initial feature extraction module 122 is the input data of the first intermediate feature extraction module 123, and the first intermediate feature The output data of the extraction module 123 is the input data of the second intermediate feature extraction module 123 , and the output data of the second intermediate feature extraction module 123 is the input data of the third intermediate feature extraction module 123 .
  • The number of feature difference and feature displacement sub-modules 1222 is the same in the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123.
  • the number of feature difference and feature displacement submodules 1222 in the second intermediate feature extraction module 123 is greater than the feature difference and feature displacement submodules in the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123 The number of 1222.
  • the number of feature difference and feature displacement sub-modules 1222 in the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123 is 2, and the second intermediate feature extraction module The number of feature difference and feature displacement sub-modules 1222 in 123 is six.
  • The static feature network 12 and the dynamic feature network 13 produce final output features of the same size, determined by the above segmentation and merging operations.
  • the feature difference and feature displacement submodule 1222 includes a first normalization layer, a feature displacement unit, a feature difference unit, a second normalization layer, and a first full connection Layer, the first GELU function layer, the second fully connected layer, wherein, the first normalization layer, feature displacement unit, feature difference unit, the second normalization layer, the first fully connected layer, the first GELU function layer, The second fully connected layer is connected sequentially.
  • the input data of the second normalization layer is the first residual calculation result, and the first residual calculation result is calculated through the input data of the first normalization layer and the output data of the feature difference unit.
  • the output data of the feature difference and feature displacement sub-module 1222 is the second residual calculation result, which is calculated through the input data of the second normalization layer and the output data of the second fully connected layer.
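  • The layer order and the two residual connections described above can be sketched as follows in PyTorch; the MLP expansion ratio is an illustrative assumption, and the shift and difference units are passed in as callables (their internals follow the descriptions in the next paragraphs, with the difference unit assumed to compare adjacent segments internally).

```python
import torch.nn as nn

class DiffShiftBlock(nn.Module):
    """Sketch of one feature difference and feature displacement sub-module."""
    def __init__(self, dim, shift_unit, diff_unit, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)            # first normalization layer
        self.shift_unit = shift_unit              # feature displacement unit
        self.diff_unit = diff_unit                # feature difference unit
        self.norm2 = nn.LayerNorm(dim)            # second normalization layer
        self.mlp = nn.Sequential(                 # first FC -> GELU -> second FC
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # x: (B, N, H, W, C) features of N segments
        r1 = x + self.diff_unit(self.shift_unit(self.norm1(x)))   # first residual calculation result
        return r1 + self.mlp(self.norm2(r1))                      # second residual calculation result
```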
  • The feature displacement unit includes a first channel fully connected layer, a horizontal feature displacement layer, a second channel fully connected layer, a vertical feature displacement layer, a third channel fully connected layer, and a fourth channel fully connected layer. The horizontal feature displacement layer is connected to the second channel fully connected layer, the vertical feature displacement layer is connected to the third channel fully connected layer, and the structure formed by the horizontal feature displacement layer and the second channel fully connected layer is parallel to the structure formed by the vertical feature displacement layer and the third channel fully connected layer.
  • the first channel full connection layer is configured to perform full connection on the channels of the input data to obtain a full connection result, and input the full connection result to the horizontal feature displacement layer and the vertical feature displacement layer respectively.
  • the horizontal feature displacement layer is configured to perform horizontal displacement on the fully connected result to obtain the horizontal displacement result, and input the horizontal displacement result into the second channel fully connected layer.
  • the full connection result has three dimensions: height, width, and channel.
  • For example, if the number of displacement groups is 3 and the displacement size is 1, the channel feature maps of the groups are shifted horizontally according to [+1, 0, -1], and the vacated positions are filled with zeros. That is, the fully connected result is divided into 3 groups of data along the channel dimension; the first group is moved one unit in one horizontal direction and the vacated positions are filled with zeros, the second group remains unchanged, and the third group is moved one unit in the opposite horizontal direction and the vacated positions are filled with zeros.
  • If the number of displacement groups is 5 and the displacement size is 2, the channel feature maps of the groups are shifted horizontally according to [+4, +2, 0, -2, -4], and the vacated positions are filled with zeros.
  • the vertical feature displacement layer is configured to perform vertical displacement on the fully connected result to obtain the vertical displacement result, and input the vertical displacement result into the third channel fully connected layer.
  • the only difference between performing vertical displacement on the fully connected result and horizontal displacement on the fully connected result is that the moving direction of the vertical displacement is the vertical direction, and the moving direction of the horizontal displacement is the horizontal direction.
  • the fourth channel fully connected layer is configured to process the sum of the output results of the second channel fully connected layer and the third channel fully connected layer to obtain the output result of the feature displacement unit.
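  • A PyTorch-style sketch of the feature displacement unit follows: a channel fully connected layer, parallel horizontal and vertical grouped shifts with zero filling, per-branch channel fully connected layers, and a final channel fully connected layer on the sum of the two branches. The group count and shift size default to the 3-group, size-1 example above; the (B, H, W, C) layout and function names are assumptions.

```python
import torch
import torch.nn as nn

def grouped_shift(x, dim, groups=3, size=1):
    """Shift channel groups of x along `dim` by e.g. [+1, 0, -1], zero-filling vacated positions."""
    chunks = torch.chunk(x, groups, dim=-1)
    offsets = [size * (groups // 2 - i) for i in range(groups)]   # 3 groups, size 1 -> [+1, 0, -1]
    out = []
    for chunk, off in zip(chunks, offsets):
        shifted = torch.zeros_like(chunk)
        if off == 0:
            shifted = chunk
        elif off > 0:   # move content towards larger indices
            shifted.narrow(dim, off, chunk.size(dim) - off).copy_(chunk.narrow(dim, 0, chunk.size(dim) - off))
        else:           # move content towards smaller indices
            shifted.narrow(dim, 0, chunk.size(dim) + off).copy_(chunk.narrow(dim, -off, chunk.size(dim) + off))
        out.append(shifted)
    return torch.cat(out, dim=-1)

class FeatureShiftUnit(nn.Module):
    """Sketch of the feature displacement unit."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)   # first channel fully connected layer
        self.fc2 = nn.Linear(dim, dim)   # second channel fully connected layer (horizontal branch)
        self.fc3 = nn.Linear(dim, dim)   # third channel fully connected layer (vertical branch)
        self.fc4 = nn.Linear(dim, dim)   # fourth channel fully connected layer (fusion)

    def forward(self, x):                # x: (B, H, W, C)
        y = self.fc1(x)
        h = self.fc2(grouped_shift(y, dim=2))   # horizontal shift along the width dimension
        v = self.fc3(grouped_shift(y, dim=1))   # vertical shift along the height dimension
        return self.fc4(h + v)
```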
  • The feature difference unit includes: an input layer, a maximum pooling layer, a third fully connected layer, a second GELU function layer, a fourth fully connected layer, an upsampling layer, a fifth fully connected layer, a third GELU function layer, a sixth fully connected layer, and a feature difference output layer.
  • the input layer, the maximum pooling layer, the third fully connected layer, the second GELU function layer, the fourth fully connected layer, the upsampling layer, and the feature difference output layer are connected in sequence.
  • the input layer, the fifth fully connected layer, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer are connected in sequence.
  • the input layer is configured to make a difference between the input feature corresponding to the segment at the current moment and the input feature corresponding to the segment at the previous moment, and input the difference feature into the maximum pooling layer and the fifth fully connected layer respectively.
  • The feature difference output layer is configured to sum the output results of the upsampling layer and the sixth fully connected layer to obtain a summation result, to multiply the summation result point by point with the input feature corresponding to the segment at the previous moment to obtain a multiplication result, and to add the multiplication result to the input feature corresponding to the segment at the previous moment to obtain the output result of the feature difference unit.
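  • The two branches and the output combination described above can be sketched as follows for one pair of adjacent segments; the (B, H, W, C) layout, the pooling factor, and the channel reduction ratio are illustrative assumptions, and applying the unit across all adjacent segment pairs is left to the caller.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDiffUnit(nn.Module):
    """Sketch of the feature difference unit for one pair of adjacent segments."""
    def __init__(self, dim, reduction=4, pool=2):
        super().__init__()
        self.pool = pool
        self.fc3 = nn.Linear(dim, dim // reduction)   # third fully connected layer
        self.fc4 = nn.Linear(dim // reduction, dim)   # fourth fully connected layer
        self.fc5 = nn.Linear(dim, dim // reduction)   # fifth fully connected layer
        self.fc6 = nn.Linear(dim // reduction, dim)   # sixth fully connected layer
        self.act = nn.GELU()

    def forward(self, cur, prev):
        # cur, prev: (B, H, W, C) features of the current and previous segment
        diff = cur - prev                                            # input layer: difference features
        # pooled branch: max pooling -> FC -> GELU -> FC -> upsampling
        d = F.max_pool2d(diff.permute(0, 3, 1, 2), self.pool)        # (B, C, H/pool, W/pool)
        d = self.fc4(self.act(self.fc3(d.permute(0, 2, 3, 1))))      # back to channel-last
        d = F.interpolate(d.permute(0, 3, 1, 2), size=diff.shape[1:3], mode="nearest")
        branch_a = d.permute(0, 2, 3, 1)
        # direct branch: FC -> GELU -> FC on the raw difference
        branch_b = self.fc6(self.act(self.fc5(diff)))
        combined = branch_a + branch_b                               # feature difference output layer: sum
        return prev + prev * combined                                # point-wise multiply, then add back
```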
  • The classification network 14 includes a first temporal average pooling layer 141, a second temporal average pooling layer 143, a static feature classifier 142, a dynamic feature classifier 144, and a recognition result output layer 145.
  • the first temporal average pooling layer 141 is configured to perform temporal average pooling on static classification features corresponding to multiple segments, and input the pooling result to the static feature classifier 142 .
  • the second temporal average pooling layer 143 is configured to perform temporal average pooling on the dynamic classification features corresponding to multiple segments, and input the pooling result to the dynamic feature classifier 144 .
  • When the action recognition system provided by the embodiment of the present invention is used for action recognition, assuming the video is divided into N segments, the information separation network 11, the static feature network 12, and the dynamic feature network 13 sequentially process the N segments to obtain N static classification features and N dynamic classification features. The first temporal average pooling layer 141 performs temporal average pooling on the N static classification features to obtain a time-averaged static classification feature, and the second temporal average pooling layer 143 performs temporal average pooling on the N dynamic classification features to obtain a time-averaged dynamic classification feature; with these time-averaged static and dynamic classification features, more accurate recognition of actions can be achieved.
  • the static feature classifier 142 is configured to obtain a first classification result according to the static classification feature.
  • the dynamic feature classifier 144 is configured to obtain a second classification result according to the dynamic classification feature.
  • the recognition result output layer 145 is configured to take the weighted average result of the first classification result and the second classification result as an output result.
  • the static feature classifier 142 and the dynamic feature classifier 144 are Softmax classifiers.
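  • A compact PyTorch-style sketch of the classification network follows; the fusion weight `alpha` is an illustrative assumption, since the text only specifies a weighted average of the two classification results.

```python
import torch
import torch.nn as nn

class ClassificationNetwork(nn.Module):
    """Sketch: temporal average pooling, two Softmax classifiers, weighted fusion."""
    def __init__(self, static_dim, dynamic_dim, num_classes, alpha=0.5):
        super().__init__()
        self.static_head = nn.Linear(static_dim, num_classes)    # static feature classifier 142
        self.dynamic_head = nn.Linear(dynamic_dim, num_classes)  # dynamic feature classifier 144
        self.alpha = alpha

    def forward(self, static_feats, dynamic_feats):
        # static_feats: (B, N, D_s), dynamic_feats: (B, N, D_d) -- one feature per segment
        s = static_feats.mean(dim=1)                              # first temporal average pooling layer
        d = dynamic_feats.mean(dim=1)                             # second temporal average pooling layer
        p_static = torch.softmax(self.static_head(s), dim=-1)     # first classification result
        p_dynamic = torch.softmax(self.dynamic_head(d), dim=-1)   # second classification result
        return self.alpha * p_static + (1 - self.alpha) * p_dynamic  # weighted average output
```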
  • An embodiment of the present invention provides a method for training an action recognition model, as shown in FIG. 6 , including:
  • Step S21 Obtain multiple image sequences, in which the types of pedestrian actions are marked.
  • Step S22 Divide each image sequence into multiple subsequences to obtain a training data set.
  • Step S23 Input the training data set into the neural network system, train the neural network system until the loss value of the loss function meets the loss condition, and obtain the action recognition model, the neural network system is the action recognition system provided in any of the above-mentioned embodiments, For details about the action recognition system, refer to the above-mentioned embodiments.
  • the initialized action recognition system is trained to obtain an action recognition model.
  • The Gaussian Laplacian operator is used to initialize the bandpass filter module 111, and a pre-trained model is used to initialize the feature difference and feature displacement networks.
  • the pre-training model is obtained through pre-training on ImageNet or other large data sets.
  • the action recognition system can be fully trained with a large-scale data set, or fine-tuned based on a pre-trained model.
  • the training data set is input into the action recognition system provided in the above embodiment, and the action recognition system is trained to obtain the action recognition model.
  • The action recognition system provided in the above embodiment includes an information separation network, a static feature network, a dynamic feature network, and a classification network. The system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively.
  • The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition model training method provided by the embodiment of the invention can obtain an action recognition model with less training data, and the action recognition model so trained can recognize actions accurately.
  • the loss function is jointly obtained by using an orthogonal projection loss function and a cross-entropy loss function.
  • The loss function is obtained by adding the product of the orthogonal projection loss function and a hyperparameter controlling the weight of the orthogonal projection loss to the cross-entropy loss function. Combining the orthogonal projection loss L_opl with the cross-entropy loss L_ce gives the final loss function L = L_ce + λ · L_opl, where λ is the hyperparameter controlling the weight of the orthogonal projection loss.
  • The orthogonal projection loss function is combined with the cross-entropy loss to obtain the final loss function. Introducing the orthogonal projection loss orthogonalizes the intermediate-layer features, achieving inter-class separation and intra-class clustering, so that the trained action recognition model can recognize actions more accurately.
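  • The following sketch shows the combined loss L = L_ce + λ · L_opl in PyTorch. The filing does not spell out the exact form of the orthogonal projection loss; the version below (same-class features pulled towards cosine similarity 1, different-class features towards 0) is a common formulation used here only as an illustration, and the function and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def orthogonal_projection_loss(features, labels, eps=1e-8):
    # features: (B, D) intermediate-layer features, labels: (B,) class indices
    f = F.normalize(features, dim=1)
    sim = f @ f.t()                                   # pairwise cosine similarities
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    same.fill_diagonal_(0)                            # ignore self-similarity
    diff = 1.0 - same
    diff.fill_diagonal_(0)
    s = (sim * same).sum() / (same.sum() + eps)       # mean same-class similarity
    d = (sim * diff).sum() / (diff.sum() + eps)       # mean cross-class similarity
    return (1.0 - s) + d.abs()                        # push s towards 1 and d towards 0

def total_loss(logits, features, labels, lam=0.1):
    # lam is the hyperparameter controlling the weight of the orthogonal projection loss
    return F.cross_entropy(logits, labels) + lam * orthogonal_projection_loss(features, labels)
```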
  • step S21 includes:
  • the images in the image sequence are proportionally scaled to obtain a scaled image, and the size of the short side of the scaled image is within a preset range.
  • the preset interval may be [256, 320].
  • the zoomed image is randomly cropped to obtain a cropped image, and the size of the cropped image satisfies a preset condition.
  • the size of the cropped image is 224 ⁇ 224.
  • The image sequence formed by the cropped images is divided into multiple subsequences to obtain the training data set.
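  • A short torchvision-based sketch of this preprocessing is given below; picking a random short-side target within [256, 320] is one way to satisfy the stated condition, and applying the identical crop to every frame of a subsequence is left to the caller.

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as TF

def preprocess_frame(img):
    short_side = random.randint(256, 320)   # target size of the short side within [256, 320]
    img = TF.resize(img, short_side)        # proportional scaling: short side becomes the target size
    return transforms.RandomCrop(224)(img)  # random 224 x 224 crop
```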
  • An embodiment of the present invention provides an action recognition method, as shown in FIG. 7 , including:
  • Step S31 Acquire the image sequence of the target object, and divide the image sequence into multiple subsequences.
  • Step S32 Input the subsequence into the action recognition model to generate an action recognition result.
  • the action recognition model is trained by the action recognition model training method provided in the above-mentioned embodiment. For details about the action recognition model, refer to the description in the above-mentioned embodiment. This will not be repeated here.
  • the action recognition model training method provided by the above embodiment obtains the action recognition model by training the action recognition system.
  • The action recognition system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively.
  • The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition model trained by the training method provided in the above embodiment can recognize actions accurately; therefore, implementing the embodiments of the present invention achieves accurate recognition of actions.
  • the method further includes:
  • the images in the image sequence are proportionally scaled to obtain a scaled image, and the size of the short side of the scaled image is within a preset range.
  • the zoomed image is randomly cropped to obtain a cropped image, the size of the cropped image satisfies a preset condition.
  • the cropped image is used as an image in the subsequence, and the step of inputting the subsequence into the action recognition model is performed.
  • The image is proportionally scaled so that the size of its short side falls within the preset interval, and is then randomly cropped to the input size accepted by the network. This achieves data augmentation, and the scaled and cropped data allow the action recognition model to analyze the effective information in the image, thereby improving the analysis efficiency and the accuracy of the analysis results.
  • An embodiment of the present invention provides an action recognition model training device, as shown in FIG. 8 , including:
  • the image acquisition module 21 is configured to acquire a plurality of image sequences, in which the types of pedestrian actions are marked. For details, refer to the description of step S21 in the above embodiment, which will not be repeated here.
  • the training data acquisition module 22 is configured to divide each image sequence into multiple subsequences to obtain a training data set. For details, refer to the description of step S22 in the above-mentioned embodiment, and details will not be repeated here.
  • The model training module 23 is configured to input the training data set into the neural network system and train the neural network system to obtain an action recognition model, where the neural network system is the action recognition system provided in the above-mentioned embodiment. For details, refer to the description of step S23 in the above-mentioned embodiment, which will not be repeated here.
  • the action recognition model training device provided by the embodiment of the present invention, after obtaining the training data set, inputs the training data set into the action recognition system provided in the above embodiment, and trains the action recognition system to obtain the action recognition model.
  • The action recognition system provided in the above embodiment includes an information separation network, a static feature network, a dynamic feature network, and a classification network. The system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively.
  • The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition model training device provided by the embodiment of the invention can obtain an action recognition model with less training data, and the action recognition model so trained can recognize actions accurately.
  • An embodiment of the present invention provides an action recognition device, as shown in FIG. 9 , including:
  • the image acquisition module 31 is configured to acquire an image sequence of the target object, and divide the image sequence into multiple subsequences. For details, refer to the description of step S31 in the above-mentioned embodiment, which will not be repeated here.
  • the action recognition module 32 is configured to input the subsequence into the action recognition model to generate an action recognition result.
  • The action recognition model is trained by the action recognition model training method provided in the above embodiment. For details, refer to the description of step S32 in the above embodiment, which will not be repeated here.
  • The action recognition device provided by the embodiment of the present invention divides the image sequence of the target object into multiple subsequences after acquiring it, and inputs the subsequences into the action recognition model trained by the action recognition model training method provided in the above embodiment, which obtains the action recognition model by training the action recognition system.
  • The action recognition system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively. The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features.
  • The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition model trained by the training method provided in the above embodiments can recognize actions accurately; therefore, implementing the embodiments of the present invention achieves accurate recognition of actions.
  • An embodiment of the present invention provides a computer device.
  • the computer device mainly includes one or more processors 41 and a memory 42 , and one processor 41 is taken as an example in FIG. 10 .
  • the computer device may also include: an input device 43 and an output device 44 .
  • the processor 41 , the memory 42 , the input device 43 and the output device 44 may be connected through a bus or in other ways. In FIG. 10 , connection through a bus is taken as an example.
  • the processor 41 may be a central processing unit (Central Processing Unit, CPU).
  • The processor 41 can also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination of the above-mentioned types of chips.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • The memory 42 may include a program storage area and a data storage area, wherein the program storage area may store the operating system and the application program required by at least one function, and the data storage area may store data created according to the use of the motion recognition device, etc.
  • the memory 42 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • the memory 42 may optionally include a memory that is remotely located relative to the processor 41, and these remote memories may be connected to the motion recognition system, or the motion recognition model training device, or the motion recognition device through a network.
  • the input device 43 can receive calculation requests (or other digital or character information) input by the user, and generate key signal input related to the motion recognition system, or the motion recognition model training device, or the motion recognition device.
  • the output device 44 may include a display device such as a display screen for outputting calculation results.
  • An embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions can execute the action recognition model training method or the action recognition method in any of the above-mentioned method embodiments.
  • The storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD) or a solid-state drive (Solid-State Drive, SSD), etc.; the storage medium may also include a combination of the above-mentioned types of memory.
  • An embodiment of the present invention provides an action recognition system, including: a bandpass filter module configured to extract dynamic feature maps based on multiple frames of continuous images in a segment; a static feature extraction module configured to obtain static feature maps based on the multiple frames of continuous images in a segment; a static feature network configured to perform feature displacement operations on the static feature maps and calculate the difference features between the static feature maps corresponding to the segments to obtain static classification features; a dynamic feature network configured to perform feature displacement operations on the dynamic feature maps and calculate the difference features between the dynamic feature maps corresponding to the segments to obtain dynamic classification features; and a classification network configured to obtain the action recognition result based on the static classification features and the dynamic classification features.

Abstract

The present invention provides an action recognition system, method, and apparatus, a model training method and apparatus, a computer device, and a computer readable storage medium. The action recognition system comprises: a band-pass filtering module, configured to extract a dynamic feature map according to a plurality of frames of continuous images in one segment; a static feature extraction module, configured to obtain a static feature map according to the plurality of frames of continuous images in one segment; a static feature network, configured to perform feature displacement operation on the static feature maps, and calculate difference features between the static feature maps corresponding to the segments so as to obtain static classification features; a dynamic feature network, configured to perform feature displacement operation on the dynamic feature maps, and calculate difference features between the dynamic feature maps corresponding to the segments so as to obtain dynamic classification features; and a classification network, configured to obtain an action recognition result according to the static classification features and the dynamic classification features. By implementing the present invention, an action recognition model can be obtained by using less data for training, and accurate recognition of an action can be realized.

Description

一种动作识别系统、方法、装置及模型训练方法、装置、计算机设备及计算机可读存储介质An action recognition system, method, device, and model training method, device, computer equipment, and computer-readable storage medium
相关申请的交叉引用Cross References to Related Applications
本发明基于申请号为202210179444.0、申请日为2022年02月25日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本发明作为参考。The present invention is based on a Chinese patent application with application number 202210179444.0 and a filing date of February 25, 2022, and claims the priority of this Chinese patent application. The entire content of this Chinese patent application is hereby incorporated by reference.
技术领域technical field
本发明涉及图像处理技术领域,涉及一种动作识别系统、方法、装置及模型训练方法、装置、计算机设备及计算机可读存储介质。The present invention relates to the technical field of image processing, and relates to an action recognition system, method, device, model training method, device, computer equipment, and computer-readable storage medium.
背景技术Background technique
当前监控摄像头非常普及,无论是在公司、工厂、商场还是马路、火车站,随处都可以看到监控摄像头的存在。然而单纯依靠摄像头难以达到实时监测违规、异常行为的目的,当有异常行为发生后,逐帧翻看监控视频也非常耗时耗力,并且容易遗漏。如果可以通过动作识别技术,进行实时的特定异常行为检测,可以大大节省人力物力,并且提高效率。因此动作识别有着重要的实用价值。Surveillance cameras are very popular at present, whether in companies, factories, shopping malls, roads, or train stations, you can see the existence of surveillance cameras everywhere. However, relying solely on the camera is difficult to achieve the purpose of real-time monitoring of violations and abnormal behaviors. When abnormal behaviors occur, it is very time-consuming and labor-intensive to flip through the surveillance video frame by frame, and it is easy to miss. If motion recognition technology can be used to detect specific abnormal behaviors in real time, it can greatly save manpower and material resources and improve efficiency. Therefore, action recognition has important practical value.
视频的动作识别算法需要提取出视频帧之间的时间信息,需要网络模型具有时间建模的能力。基于深度学习的动作识别技术主要分为:基于双流网络的方法、基于(three-dimensional)3D卷积网络的方法。其中,基于双流网络的方法采用光流作为时间信息,需要提前计算光流并将其储存在本地硬盘,对于大数据集来说往往需要占用很大的内存。同时,由于需要提前计算光流,基于双流网络的方法实时性的效果也较差。而基于3D卷积网络的方法依靠3D卷积来达到时间建模的效果,需要更多的视频帧作为输入,参数量很大,训练网络需要花费大量的算力,也很难部署。近年来,随着Transformer在自然语言处理和计算机视觉领域的成功,许多研究者也将Transformer应用在了视频的动作识别领域中,并取得了很好的效果,然而Transformer参数量巨大,且往往需要大量的数据才能拟合,因此其实时性和实际应用也难以保证。The video action recognition algorithm needs to extract the time information between video frames, and the network model needs to have the ability of time modeling. The action recognition technology based on deep learning is mainly divided into: a method based on a two-stream network, and a method based on a (three-dimensional) 3D convolutional network. Among them, the method based on the dual-stream network uses optical flow as time information, and needs to calculate the optical flow in advance and store it in the local hard disk, which often requires a large amount of memory for large data sets. At the same time, due to the need to calculate the optical flow in advance, the real-time effect of the method based on the dual-stream network is also poor. However, the method based on 3D convolutional network relies on 3D convolution to achieve the effect of temporal modeling, which requires more video frames as input, and has a large number of parameters. Training the network requires a lot of computing power and is difficult to deploy. In recent years, with the success of Transformer in the fields of natural language processing and computer vision, many researchers have also applied Transformer in the field of video action recognition and achieved good results. However, Transformer has a huge number of parameters and often requires A large amount of data can be fitted, so its real-time and practical application are difficult to guarantee.
发明内容Contents of the invention
因此,本发明要解决的技术问题在于克服现有技术中需要大量的数据才能拟合形成用于识别动作的模型的缺陷,从而提供一种动作识别系统、方法、装置及模型训练方法、装置、计算机设备及计算机可读存储介质。Therefore, the technical problem to be solved by the present invention is to overcome the defect in the prior art that a large amount of data is required to fit and form a model for recognizing actions, thereby providing an action recognition system, method, device, and model training method, device, Computer equipment and computer-readable storage media.
本发明实施例第一方面提供了一种动作识别系统,包括:信息分离网络、静态特征网络、动态特征网络、分类网络,信息分离网络包括带通滤波模块、静态特征提取模块,带通滤波模块被配置为根据一个分段中的多帧连续图像提取动态特征图;静态特征提取模块被配置为对一个分段中的多帧连续图像进行时间平均池化,得到特征图,并将特征图与动态特征图作差,得到静态特征图;静态特征网络被配置为对多个分段对应的静态特征图进行特征位移操作,以及计算各分段对应的静态特征图之间的差异特征,得到静态分类特征;动态特征网络被配置为对多个分段对应的动态特征图进行特征位移操作,以及计算各分段对应的动态特征图之间的差异特征,得到动态分类特征;分类网络被配置为根据静态分类特征和动态分类特征得到动作识别结果。The first aspect of the embodiment of the present invention provides an action recognition system, including: an information separation network, a static feature network, a dynamic feature network, and a classification network. The information separation network includes a band-pass filter module, a static feature extraction module, and a band-pass filter module. It is configured to extract dynamic feature maps based on multiple frames of continuous images in a segment; the static feature extraction module is configured to perform temporal average pooling on multiple frames of continuous images in a segment to obtain feature maps, and combine the feature maps with The dynamic feature map is subtracted to obtain the static feature map; the static feature network is configured to perform feature displacement operations on the static feature maps corresponding to multiple segments, and calculate the difference between the static feature maps corresponding to each segment to obtain the static feature map. Classification features; the dynamic feature network is configured to perform feature displacement operations on the dynamic feature maps corresponding to multiple segments, and calculate the difference between the dynamic feature maps corresponding to each segment to obtain dynamic classification features; the classification network is configured as The action recognition result is obtained according to the static classification feature and the dynamic classification feature.
其中,在本发明实施例提供的动作识别系统中,带通滤波模块包括空间卷积层和时间卷积层。Wherein, in the action recognition system provided by the embodiment of the present invention, the bandpass filter module includes a spatial convolution layer and a temporal convolution layer.
其中,在本发明实施例提供的动作识别系统中,静态特征网络和动态特征网络中的至少之一包括:图像分割模块、初始特征提取模块、至少一个中间特征提取模块,图像分割模块被配置为按照第一预设大小对输入特征图进行分割,得到第一特征向量;初始特征提取模块包括线性嵌入子模块和至少一个特征差异与特征位移子模块,线性嵌入子模块被配置为按照预设通道数对第一特征向量进行转换,得到第二特征向量;特征差异与特征位移子模块被配置为对第二特征向量进行特征位移操作,以及计算各分段对应的第二特征向量之间的差异特征,得到初始分类特征;中间特征提取模块包括特征合并子模块和至少一个特征差异与特征位移子模块,特征合并子模块被配置为按照第二预设大小对初始分类特征进行合并,得到第三特征向量;特征差异与特征位移子模块被配置为对第三特征向量进行特征位移操作,以及计算各分段对应的第三特征向量之间的差异特征,得到分类特征。Wherein, in the action recognition system provided by the embodiment of the present invention, at least one of the static feature network and the dynamic feature network includes: an image segmentation module, an initial feature extraction module, and at least one intermediate feature extraction module, and the image segmentation module is configured as Segment the input feature map according to the first preset size to obtain the first feature vector; the initial feature extraction module includes a linear embedding sub-module and at least one feature difference and feature displacement sub-module, and the linear embedding sub-module is configured to follow the preset channel Convert the first eigenvector to obtain the second eigenvector; the feature difference and feature displacement sub-module is configured to perform a feature displacement operation on the second eigenvector, and calculate the difference between the second eigenvectors corresponding to each segment features to obtain the initial classification features; the intermediate feature extraction module includes a feature merging submodule and at least one feature difference and feature displacement submodule, and the feature merging submodule is configured to merge the initial classification features according to the second preset size to obtain the third The feature vector; feature difference and feature displacement sub-module is configured to perform a feature displacement operation on the third feature vector, and calculate a difference feature between the third feature vectors corresponding to each segment to obtain a classification feature.
其中,在本发明实施例提供的动作识别系统中,静态特征网络和动态特征网络中的至少之一包括三个中间特征提取模块:第一中间特征提取模块、第二中间特征提取模块、第三中间特征提取模块,图像分割模块、初始特征提取模块、第一中间特征提取模块、第二中间特征提取模块、第三中间特征提取模块依次连接;初始特征提取模块、第一中间特征提取模块、第三中间特征提取模块中的特征差异与特征位移子模块的数量相同;第二中间特征提取模块中的特征差异与特征位移子模块的数量大于初始特征提取模块、第一中间特征提取模块、第三中间特征提取模块中的特征差异与特征位移子模块的数量。Wherein, in the action recognition system provided by the embodiment of the present invention, at least one of the static feature network and the dynamic feature network includes three intermediate feature extraction modules: a first intermediate feature extraction module, a second intermediate feature extraction module, a third intermediate feature extraction module, and a third intermediate feature extraction module. The intermediate feature extraction module, the image segmentation module, the initial feature extraction module, the first intermediate feature extraction module, the second intermediate feature extraction module, and the third intermediate feature extraction module are sequentially connected; the initial feature extraction module, the first intermediate feature extraction module, the second The number of feature differences and feature displacement sub-modules in the three intermediate feature extraction modules is the same; the number of feature differences and feature displacement sub-modules in the second intermediate feature extraction module is greater than that of the initial feature extraction module, the first intermediate feature extraction module, and the third The number of feature difference and feature displacement sub-modules in the intermediate feature extraction module.
其中,在本发明实施例提供的动作识别系统中,特征差异与特征位移子模块包括,第一归一化层、特征位移单元、特征差异单元、第二归一化层、第一全连接层、第一GELU函数层、第二全连接层,其中,第一归一化层、特征位移单元、特征差异单元、第二归一化层、第一全连接层、第一GELU函数层、第二全连接层依次连接;第二归一化层的输入数据为第一残差计算结果;所述第一残差计算结果是通过所述第一归一化层的输入数据与所述特征差异单元的输出数据计算的;特征差异与特征位移子模块的输出数据为第二残差计算结果;所述第二残差计算结果是通过所述第二归一化层的输入数据与所述第二全连接层的输出数据计算的。Among them, in the action recognition system provided by the embodiment of the present invention, the feature difference and feature displacement sub-module includes a first normalization layer, a feature displacement unit, a feature difference unit, a second normalization layer, and a first fully connected layer , the first GELU function layer, the second fully connected layer, wherein, the first normalization layer, the feature displacement unit, the feature difference unit, the second normalization layer, the first fully connected layer, the first GELU function layer, the second The two fully connected layers are connected in turn; the input data of the second normalization layer is the first residual calculation result; the first residual calculation result is the difference between the input data of the first normalization layer and the feature The output data of the unit is calculated; the output data of the feature difference and feature displacement sub-module is the second residual calculation result; the second residual calculation result is obtained through the input data of the second normalization layer and the first The output data of the second fully connected layer is calculated.
其中,在本发明实施例提供的动作识别系统中,特征位移单元中包括第一信道全连接层、水平特征位移层、第二信道全连接层、竖直特征位移层、第三信道全连接层、第四信道全连接层,其中,第一信道全连接层被配置为对输入数据的信道进行全连接,得到全连接结果,并将全连接结果分别输入至水平特征位移层和竖直特征位移层中;水平特征位移层被配置为对全连接结果进行水平位移,得到水平位移结果,并将水平位移结果输入至第二信道全连接层中;竖直特征位移层被配置为对全连接结果进行竖直位移,得到竖直位移结果,并将竖直位移结果输入至第三信道全连接层中;第四信道全连接层被配置为对第二信道全连接层和第三信道全连接层的输出结果的和进行处理,得到特征位移单元的输出结果。Among them, in the action recognition system provided by the embodiment of the present invention, the feature displacement unit includes a first channel fully connected layer, a horizontal feature displacement layer, a second channel fully connected layer, a vertical feature displacement layer, and a third channel fully connected layer , the fourth channel fully connected layer, wherein the first channel fully connected layer is configured to fully connect the channels of the input data to obtain a fully connected result, and input the fully connected result to the horizontal feature displacement layer and the vertical feature displacement layer respectively layer; the horizontal feature displacement layer is configured to perform horizontal displacement on the fully connected result to obtain the horizontal displacement result, and input the horizontal displacement result to the second channel fully connected layer; the vertical feature displacement layer is configured to perform a horizontal displacement on the fully connected result Perform vertical displacement to obtain the vertical displacement result, and input the vertical displacement result into the third channel fully connected layer; the fourth channel fully connected layer is configured to perform the second channel fully connected layer and the third channel fully connected layer The sum of the output results is processed to obtain the output result of the characteristic displacement unit.
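As an illustration of the data flow just described, the following is a minimal sketch of the feature displacement unit. PyTorch, a batch×height×width×channel tensor layout, a shift distance of one position, and the use of torch.roll for the horizontal and vertical displacements are assumptions made for the sketch and are not specified in the passage above.

```python
import torch
import torch.nn as nn

class FeatureShiftUnit(nn.Module):
    """Sketch of the feature displacement unit: channel fully connected layer ->
    parallel horizontal / vertical shifts -> channel FCs -> sum -> channel FC."""

    def __init__(self, channels: int, shift: int = 1):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels)     # first channel fully connected layer
        self.fc_h = nn.Linear(channels, channels)    # second channel FC (after horizontal shift)
        self.fc_v = nn.Linear(channels, channels)    # third channel FC (after vertical shift)
        self.fc_out = nn.Linear(channels, channels)  # fourth channel FC (applied to the sum)
        self.shift = shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, channels); the layout is an assumption of this sketch
        y = self.fc1(x)
        y_h = torch.roll(y, shifts=self.shift, dims=2)  # horizontal displacement (width axis)
        y_v = torch.roll(y, shifts=self.shift, dims=1)  # vertical displacement (height axis)
        return self.fc_out(self.fc_h(y_h) + self.fc_v(y_v))
```

Shifting with torch.roll wraps features around the border; a real implementation might instead shift different channel groups in different directions or pad with zeros, which the passage above does not specify.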
其中,在本发明实施例提供的动作识别系统中,特征差异单元包括:输入层、最大池化层、第三全连接层、第二GELU函数层、第四全连接层、上采样层、第五全连接层、第三GELU函数层、第六全连接层、特征差异输出层,其中,输入层、最大池化层、第三全连接层、第二GELU函数层、第四全连接层、上采样层、特征差异输出层依次连接;输入层、第五全连接层、第三GELU函数层、第六全连接层、特征差异输出层依次连接;输入层被配置为将当前时刻分段对应的输入特征与上一时刻分段对应的输入特征作差,并将差值特征分别输入到最大池化层和第五全连接层中;特征差异输出层被配置为将上采样层和第六全连接层的输出结果进行求和,得到求和结果;将求和结果与上一时刻分段对应的输入特征进行逐点相乘,得到相乘结果;将相乘结果与上一时刻分段对应的输入特征进行相 加,得到特征差异单元的输出结果。Wherein, in the action recognition system provided by the embodiment of the present invention, the feature difference unit includes: an input layer, a maximum pooling layer, a third fully connected layer, a second GELU function layer, a fourth fully connected layer, an upsampling layer, a Five fully connected layers, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer, wherein, the input layer, the maximum pooling layer, the third fully connected layer, the second GELU function layer, the fourth fully connected layer, The upsampling layer and the feature difference output layer are connected in sequence; the input layer, the fifth fully connected layer, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer are connected in sequence; the input layer is configured to segment the current moment into corresponding The input features of the previous segment are compared with the input features corresponding to the previous segment, and the difference features are respectively input into the maximum pooling layer and the fifth fully connected layer; the feature difference output layer is configured to combine the upsampling layer and the sixth The output results of the fully connected layer are summed to obtain the summation result; the summation result is multiplied point by point with the input feature corresponding to the previous time segment to obtain the multiplication result; the multiplication result is divided into the previous time segment The corresponding input features are added to obtain the output result of the feature difference unit.
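The two branches and the output combination of the feature difference unit can be sketched as follows. PyTorch, the batch×height×width×channel layout, a pooling size of 2, and nearest-neighbour upsampling are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDifferenceUnit(nn.Module):
    """Sketch of the feature difference unit: the difference between the features of
    adjacent segments is processed by two branches and then modulates the previous
    segment's features."""

    def __init__(self, channels: int, pool_size: int = 2):
        super().__init__()
        self.pool = nn.MaxPool2d(pool_size)              # max pooling layer
        self.branch_pooled = nn.Sequential(              # third FC -> GELU -> fourth FC
            nn.Linear(channels, channels), nn.GELU(), nn.Linear(channels, channels))
        self.branch_full = nn.Sequential(                # fifth FC -> GELU -> sixth FC
            nn.Linear(channels, channels), nn.GELU(), nn.Linear(channels, channels))

    def forward(self, x_prev: torch.Tensor, x_cur: torch.Tensor) -> torch.Tensor:
        # x_prev, x_cur: (batch, height, width, channels) features of adjacent segments
        diff = x_cur - x_prev                            # input layer: difference of adjacent segments

        # pooled branch: spatial max pooling, channel FCs, upsampling back to full size
        d = self.pool(diff.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        d = self.branch_pooled(d).permute(0, 3, 1, 2)
        d = F.interpolate(d, size=tuple(diff.shape[1:3]), mode="nearest").permute(0, 2, 3, 1)

        # full-resolution branch: channel FCs on the raw difference
        f = self.branch_full(diff)

        # output layer: sum the branches, multiply point-wise with the previous
        # segment's features, then add the previous segment's features back
        return (d + f) * x_prev + x_prev
```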
其中,在本发明实施例提供的动作识别系统中,分类网络包括第一时间平均池化层、第二时间平均池化层、静态特征分类器、动态特征分类器、输出层,第一时间平均池化层被配置为对多个分段对应的静态分类特征进行时间平均池化,并将池化结果输入至静态特征分类器中;第二时间平均池化层被配置为对多个分段对应的动态分类特征进行时间平均池化,并将池化结果输入至动态特征分类器中;静态特征分类器被配置为根据静态分类特征得到第一分类结果;动态特征分类器被配置为根据动态分类特征得到第二分类结果;识别结果输出层被配置为将第一分类结果和第二分类结果的加权平均结果作为输出结果。Among them, in the action recognition system provided by the embodiment of the present invention, the classification network includes a first time average pooling layer, a second time average pooling layer, a static feature classifier, a dynamic feature classifier, an output layer, and a first time average pooling layer. The pooling layer is configured to perform temporal average pooling on the static classification features corresponding to multiple segments, and input the pooling result to the static feature classifier; the second temporal average pooling layer is configured to perform multiple segmental average pooling The corresponding dynamic classification features are time-average pooled, and the pooling results are input into the dynamic feature classifier; the static feature classifier is configured to obtain the first classification result according to the static classification feature; the dynamic feature classifier is configured to be based on the dynamic The classification feature obtains the second classification result; the recognition result output layer is configured to take the weighted average result of the first classification result and the second classification result as the output result.
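A minimal sketch of this classification network is given below. PyTorch, linear classifiers, and an equal weighting of the two classification results are assumptions; the passage only specifies temporal average pooling over the segments, two classifiers, and a weighted average of their outputs.

```python
import torch
import torch.nn as nn

class ClassificationNetwork(nn.Module):
    """Sketch of the classification network: temporal average pooling over the
    per-segment classification features, two classifiers, weighted-average output."""

    def __init__(self, feat_dim: int, num_classes: int, static_weight: float = 0.5):
        super().__init__()
        self.static_classifier = nn.Linear(feat_dim, num_classes)
        self.dynamic_classifier = nn.Linear(feat_dim, num_classes)
        self.static_weight = static_weight  # relative weight of the static result (assumed value)

    def forward(self, static_feats: torch.Tensor, dynamic_feats: torch.Tensor) -> torch.Tensor:
        # static_feats, dynamic_feats: (batch, num_segments, feat_dim)
        static_score = self.static_classifier(static_feats.mean(dim=1))     # first temporal average pooling
        dynamic_score = self.dynamic_classifier(dynamic_feats.mean(dim=1))  # second temporal average pooling
        w = self.static_weight
        return w * static_score + (1.0 - w) * dynamic_score                 # weighted-average recognition result
```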
本发明实施例第二方面提供了一种动作识别模型训练方法,包括:获取多个图像序列,图像序列中标注有行人动作类型;将各图像序列分为多段子序列,得到训练数据集;将训练数据集输入神经网络系统,对神经网络系统进行训练,直到损失函数的损失值满足损失条件,得到动作识别模型,神经网络系统为本发明实施例第一方面提供的动作识别系统。The second aspect of the embodiment of the present invention provides an action recognition model training method, including: acquiring multiple image sequences, in which the types of pedestrian actions are marked; dividing each image sequence into multiple sub-sequences to obtain a training data set; The training data set is input into the neural network system, and the neural network system is trained until the loss value of the loss function satisfies the loss condition, and an action recognition model is obtained. The neural network system is the action recognition system provided in the first aspect of the embodiment of the present invention.
其中,在本发明实施例提供的动作识别模型训练方法中,损失函数采用正交投影损失函数和交叉熵损失函数联合得到。Wherein, in the action recognition model training method provided in the embodiment of the present invention, the loss function is jointly obtained by using an orthogonal projection loss function and a cross-entropy loss function.
其中,在本发明实施例提供的动作识别模型训练方法中,损失函数通过所述正交投影损失函数与控制正交投影损失权重的超参数之积,与所述交叉熵损失函数相加得到。本发明实施例第三方面提供了一种动作识别方法,包括:获取目标对象的图像序列,将图像序列分为多段子序列;将子序列输入动作识别模型,生成动作识别结果,动作识别模型通过如本发明实施例第二方面提供的动作识别模型训练方法训练得到。Wherein, in the action recognition model training method provided in the embodiment of the present invention, the loss function is obtained by adding the product of the orthogonal projection loss function and a hyperparameter controlling the weight of the orthogonal projection loss to the cross-entropy loss function. The third aspect of the embodiment of the present invention provides an action recognition method, including: acquiring an image sequence of a target object, dividing the image sequence into multiple sub-sequences; inputting the sub-sequences into an action recognition model to generate an action recognition result, and the action recognition model passes It is obtained through training by the action recognition model training method provided in the second aspect of the embodiment of the present invention.
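Written out, the training loss described above takes the following form, where L_CE denotes the cross-entropy loss, L_OPL the orthogonal projection loss, and γ is an illustrative symbol (not from the original) for the hyperparameter controlling the weight of the orthogonal projection loss:

L_{total} = L_{CE} + \gamma \cdot L_{OPL}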
其中,在本发明实施例提供的动作识别方法中,在将图像序列分为多段子序列的步骤之后,将子序列输入动作识别模型的步骤之前,方法还包括:对图像序列中的图像进行等比例缩放,得到缩放图像,缩放图像的短边大小位于预设区间内;对缩放图像进行随机剪裁,得到裁剪图像,裁剪图像的大小满足预设条件;将裁剪图像作为子序列中的图像,执行将子序列输入动作识别模型的步骤。Wherein, in the action recognition method provided in the embodiment of the present invention, after the step of dividing the image sequence into multiple sub-sequences, and before the step of inputting the sub-sequences into the action recognition model, the method further includes: performing, etc. Proportionally zoom to get a zoomed image, the size of the short side of the zoomed image is within the preset range; randomly crop the zoomed image to get a cropped image, the size of the cropped image meets the preset conditions; use the cropped image as an image in the subsequence, execute The step of feeding a subsequence into an action recognition model.
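The scaling-and-cropping step described above can be sketched as follows. The use of PIL, the short-side interval (256, 320), and the 224×224 crop size are illustrative assumptions; the passage only states that the short side must fall within a preset interval and that the cropped image must satisfy a preset size condition.

```python
import random
from PIL import Image

def preprocess_frame(img: Image.Image,
                     short_side_range=(256, 320),
                     crop_size=(224, 224)) -> Image.Image:
    """Sketch: proportional scaling so the short side falls in a preset interval,
    followed by a random crop of a preset size (all numbers are assumptions)."""
    # proportional scaling: pick a target short side inside the preset interval
    target_short = random.randint(*short_side_range)
    w, h = img.size
    scale = target_short / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)

    # random crop of the preset size
    cw, ch = crop_size
    x0 = random.randint(0, img.width - cw)
    y0 = random.randint(0, img.height - ch)
    return img.crop((x0, y0, x0 + cw, y0 + ch))
```

In practice the same scale and crop window would normally be applied to every frame of a subsequence so that the frames stay spatially aligned; the sketch handles a single frame for brevity.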
本发明实施例第四方面提供了一种动作识别模型训练装置,包括:图像获取模块,被配置为获取多个图像序列,图像序列中标注有行人动作类型;训练数据获取模块,被配置为将各图像序列分为多段子序列,得到训练数据集;模型训练模块,被配置为将训练数据集输入神经网络系统,对神经网络系统进行训练,得到动作识别模型,神经网络系统为本发明实施例第一方面提供的动作识别系统。The fourth aspect of the embodiment of the present invention provides an action recognition model training device, including: an image acquisition module configured to acquire a plurality of image sequences in which pedestrian action types are marked; a training data acquisition module configured to Each image sequence is divided into multiple sub-sequences to obtain a training data set; the model training module is configured to input the training data set into a neural network system, train the neural network system, and obtain an action recognition model, and the neural network system is an embodiment of the present invention The action recognition system provided in the first aspect.
本发明实施例第五方面提供了一种动作识别装置,包括:图像采集模块,被配置为获取目标对象的图像序列,将图像序列分为多段子序列;动作识别模块,被配置为将子序列输入动作识别模型,生成动作识别结果,动作识别模型通过如本发明实施例第二方面提供的动作识别模型训练方法训练得到。The fifth aspect of the embodiment of the present invention provides an action recognition device, including: an image acquisition module configured to acquire an image sequence of a target object, and divide the image sequence into multiple sub-sequences; an action recognition module configured to divide the sub-sequence An action recognition model is input to generate an action recognition result, and the action recognition model is trained by the action recognition model training method provided in the second aspect of the embodiment of the present invention.
本发明实施例第六方面提供了一种计算机设备,包括:至少一个处理器;以及与至少一个处理器通信连接的存储器;其中,存储器存储有可被至少一个处理器执行的指令,指令被至少一个处理器执行,从而执行如本发明实施例第一方面提供的动作识别系统,或,执行如本发明实施例第二方面提供的动作识别模型训练方法,或,执行如本发明实施例第三方面提供的动作识别方法。The sixth aspect of the embodiment of the present invention provides a computer device, including: at least one processor; and a memory connected to the at least one processor in communication; wherein, the memory stores instructions that can be executed by the at least one processor, and the instructions are executed by at least one processor. A processor executes to execute the action recognition system provided in the first aspect of the embodiment of the present invention, or executes the action recognition model training method provided in the second aspect of the embodiment of the present invention, or executes the action recognition model training method provided in the third aspect of the embodiment of the present invention. The action recognition method provided by the aspect.
本发明实施例第七方面提供了一种计算机可读存储介质,计算机可读存储介质存储有计算机指令,计算机指令被配置为使计算机执行如本发明实施例第一方面提供的动作识别系统,或,执行如本发明实施例第二方面提供的动作识别模型训练方法,或,执行如本发明实施例第三方面提供的动作识别方法。The seventh aspect of the embodiment of the present invention provides a computer-readable storage medium, the computer-readable storage medium stores computer instructions, and the computer instructions are configured to cause the computer to execute the action recognition system provided by the first aspect of the embodiment of the present invention, or Execute the action recognition model training method provided in the second aspect of the embodiment of the present invention, or execute the action recognition method provided in the third aspect of the embodiment of the present invention.
本发明实施例技术方案,具有如下优点:The technical scheme of the embodiment of the present invention has the following advantages:
1.本发明实施例提供的动作识别系统,包括信息分离网络、静态特征网络、动态特征网络、分类网络,首先通过信息分离网络分离图像中的动态特征图和静态特征图,然后分 别将动态特征图和静态特征图输入至动态特征网络以及静态特征网络中,通过对动态特征图进行单独分析,能够捕捉到视频的短期时间信息,通过对静态特征图进行单独分析,能够识别视频中的静态场景,静态特征网络和动态特征网络对特征图进行位移操作,并计算各分段对应的特征图之间的差异特征,得到分类特征,通过对特征图进行位移操作能够以较少的运算量使网络获取空间局部信息,保证了网络的运行速度,通过计算特征图之间的差异特征,能够捕捉视频中的长期时间关系,使得网络具有时间建模的能力,从而保证网络动作识别的精度,由此可见,通过本发明实施例提供的动作识别系统能够采用更少的数据训练得到动作识别模型,并且,通过动作识别系统训练得到的动作识别模型能够实现对动作的精准识别。1. The action recognition system provided by the embodiment of the present invention includes an information separation network, a static feature network, a dynamic feature network, and a classification network. First, the dynamic feature map and the static feature map in the image are separated through the information separation network, and then the dynamic feature The graph and the static feature map are input into the dynamic feature network and the static feature network. By analyzing the dynamic feature map separately, the short-term time information of the video can be captured, and the static scene in the video can be identified by separately analyzing the static feature map. , the static feature network and the dynamic feature network perform displacement operations on the feature maps, and calculate the difference features between the feature maps corresponding to each segment, and obtain classification features. By performing displacement operations on the feature maps, the network can be made with less computation. Obtaining spatial local information ensures the speed of network operation. By calculating the difference between feature maps, the long-term time relationship in the video can be captured, so that the network has the ability of time modeling, thereby ensuring the accuracy of network action recognition. It can be seen that the action recognition system provided by the embodiment of the present invention can use less data for training to obtain an action recognition model, and the action recognition model trained by the action recognition system can realize accurate recognition of actions.
2.本发明实施例提供的动作识别模型训练方法及装置,在获取到训练数据集后,将训练数据集输入至本发明实施例第一方面提供的动作识别系统中,对动作识别系统进行训练得到动作识别模型,本发明实施例第一方面提供的动作识别系统包括信息分离网络、静态特征网络、动态特征网络、分类网络,动作识别系统首先通过信息分离网络分离图像中的动态特征图和静态特征图,然后分别将动态特征图和静态特征图输入至动态特征网络以及静态特征网络中,通过对动态特征图进行单独分析,能够捕捉到视频的短期时间信息,通过对静态特征图进行单独分析,能够识别视频中的静态场景,静态特征网络和动态特征网络对特征图进行位移操作,并计算各分段对应的特征图之间的差异特征,得到分类特征,通过对特征图进行位移操作能够以较少的运算量使网络获取空间局部信息,保证了网络的运行速度,通过计算特征图之间的差异特征,能够捕捉视频中的长期时间关系,使得网络具有时间建模的能力,从而保证网络动作识别的精度,由此可见,通过本发明实施例提供的动作识别模型训练方法能够采用更少的数据训练得到动作识别模型,并且,通过本发明实施例提供的动作识别训练方法及装置训练得到的动作识别模型能够实现对动作的精准识别。2. The motion recognition model training method and device provided in the embodiments of the present invention, after obtaining the training data set, input the training data set into the motion recognition system provided in the first aspect of the embodiment of the present invention, and train the motion recognition system The action recognition model is obtained. The action recognition system provided by the first aspect of the embodiment of the present invention includes an information separation network, a static feature network, a dynamic feature network, and a classification network. The action recognition system first separates the dynamic feature map and the static feature map in the image through the information separation network. The feature map, and then input the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively. By analyzing the dynamic feature map separately, the short-term time information of the video can be captured. By analyzing the static feature map separately , can identify the static scene in the video, the static feature network and the dynamic feature network perform displacement operations on the feature maps, and calculate the difference between the feature maps corresponding to each segment, and obtain classification features. By performing displacement operations on the feature maps, it can The network acquires spatial local information with a small amount of calculation, which ensures the running speed of the network. By calculating the difference between the feature maps, it can capture the long-term time relationship in the video, so that the network has the ability of time modeling, thus ensuring The accuracy of network action recognition, it can be seen that the action recognition model training method provided by the embodiment of the present invention can use less data training to obtain the action recognition model, and, through the action recognition training method and device training provided by the embodiment of the present invention The obtained action recognition model can realize accurate recognition of actions.
3.本发明实施例提供的动作识别方法及装置,在获取到目标对象的图像序列后将图像序列分为多段子序列,并将子序列输入通过本发明实施例第二方面提供的动作识别模型训练方法训练得到的动作识别模型中,本发明实施例第二方面提供的动作识别模型训练方法通过对动作识别系统进行训练得到动作识别模型,动作识别系统首先通过信息分离网络分离图像中的动态特征图和静态特征图,然后分别将动态特征图和静态特征图输入至动态特征网络以及静态特征网络中,通过对动态特征图进行单独分析,能够捕捉到视频的短期时间信息,通过对静态特征图进行单独分析,能够识别视频中的静态场景,静态特征网络和动态特征网络对特征图进行位移操作,并计算各分段对应的特征图之间的差异特征,得到分类特征,通过对特征图进行位移操作能够以较少的运算量使网络获取空间局部信息,保证了网络的运行速度,通过计算特征图之间的差异特征,能够捕捉视频中的长期时间关系,使得网络具有时间建模的能力,从而保证网络动作识别的精度,由此可见,通过本发明实施例第二方面提供的动作识别训练方法训练得到的动作识别模型能够实现对动作的精准识别,因此,通过实施本发明实施例能够实现对动作的精准识别。3. The action recognition method and device provided by the embodiments of the present invention divide the image sequence into multiple subsequences after acquiring the image sequence of the target object, and input the subsequences into the action recognition model provided by the second aspect of the embodiment of the present invention Among the action recognition models obtained through training by the training method, the action recognition model training method provided by the second aspect of the embodiment of the present invention obtains the action recognition model by training the action recognition system. The action recognition system first separates the dynamic features in the image through the information separation network and static feature maps, and then input the dynamic feature maps and static feature maps into the dynamic feature network and the static feature network respectively. By analyzing the dynamic feature maps separately, the short-term time information of the video can be captured. By analyzing the static feature maps A separate analysis can identify the static scene in the video. The static feature network and the dynamic feature network perform displacement operations on the feature map, and calculate the difference between the feature maps corresponding to each segment to obtain classification features. The displacement operation can enable the network to obtain spatial local information with a small amount of calculation, ensuring the running speed of the network. By calculating the difference between feature maps, it can capture the long-term time relationship in the video, so that the network has the ability of time modeling , so as to ensure the accuracy of network action recognition. It can be seen that the action recognition model trained by the action recognition training method provided by the second aspect of the embodiment of the present invention can realize accurate recognition of actions. Therefore, by implementing the embodiment of the present invention, it can Accurate recognition of actions is achieved.
附图说明Description of drawings
为了更清楚地说明本发明实施方式或现有技术中的技术方案,下面将对实施方式或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施方式,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are For some implementations of the present invention, those skilled in the art can also obtain other drawings based on these drawings without making creative efforts.
图1为本发明实施例中动作识别系统的一个示例的原理框图;Fig. 1 is a functional block diagram of an example of an action recognition system in an embodiment of the present invention;
图2为本发明实施例中静态特征网络,和/或,动态特征网络的一个示例的原理框图;FIG. 2 is a functional block diagram of an example of a static feature network and/or a dynamic feature network in an embodiment of the present invention;
图3为本发明实施例中特征差异与特征位移子模块的一个示例的原理框图;Fig. 3 is a functional block diagram of an example of the feature difference and feature displacement sub-module in the embodiment of the present invention;
图4为本发明实施例中特征位移单元的一个示例的原理框图;Fig. 4 is a functional block diagram of an example of a feature displacement unit in an embodiment of the present invention;
图5为本发明实施例中特征差异单元的一个示例的原理框图;FIG. 5 is a functional block diagram of an example of a feature difference unit in an embodiment of the present invention;
图6为本发明实施例中动作识别模型训练方法的一个示例的流程图;Fig. 6 is a flowchart of an example of an action recognition model training method in an embodiment of the present invention;
图7为本发明实施例中动作识别方法的一个示例的流程图;FIG. 7 is a flowchart of an example of an action recognition method in an embodiment of the present invention;
图8为本发明实施例中动作识别模型训练装置的一个示例的原理框图;Fig. 8 is a functional block diagram of an example of an action recognition model training device in an embodiment of the present invention;
图9为本发明实施例中动作识别装置的一个示例的原理框图;FIG. 9 is a functional block diagram of an example of an action recognition device in an embodiment of the present invention;
图10为本发明实施例中计算机设备的一个示例的原理框图。FIG. 10 is a functional block diagram of an example of computer equipment in an embodiment of the present invention.
具体实施方式Detailed ways
下面将结合附图对本发明的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。The technical solutions of the present invention will be clearly and completely described below in conjunction with the accompanying drawings. Apparently, the described embodiments are part of the embodiments of the present invention, but not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.
在本发明的描述中,需要说明的是,术语“第一”、“第二”、“第三”仅用于描述目的,而不能理解为指示或暗示相对重要性。In the description of the present invention, it should be noted that the terms "first", "second", and "third" are used for description purposes only, and should not be understood as indicating or implying relative importance.
此外,下面所描述的本发明不同实施方式中所涉及的技术特征只要彼此之间未构成冲突就可以相互结合。In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as there is no conflict with each other.
本发明实施例提供了一种动作识别系统,如图1所示,包括:信息分离网络11、静态特征网络12、动态特征网络13、分类网络14,信息分离网络11包括带通滤波模块111、静态特征提取模块112,An embodiment of the present invention provides an action recognition system, as shown in FIG. 1 , including: an information separation network 11, a static feature network 12, a dynamic feature network 13, and a classification network 14. The information separation network 11 includes a bandpass filter module 111, static feature extraction module 112,
带通滤波模块111被配置为根据获取的一个分段中的多帧连续图像提取动态特征图。The band-pass filtering module 111 is configured to extract a dynamic feature map according to the acquired multi-frame continuous images in a segment.
在一可选实施例中,在通过本发明实施例提供的动作识别系统进行动作识别前,需要先将采集的含有目标对象的视频数据转换为图像序列,然后并将图像序列平均分成N段{P 1,P 2,......,P N},最后将N段图像输入至动作识别系统中,在每一段中随机采样连续的K帧图像,共得到T=N×K帧图像,其中,一个分段中的图像的帧数可以根据实际需求进行设置,示例性地,N的值为8或16,K的值为3。 In an optional embodiment, before performing action recognition through the action recognition system provided by the embodiment of the present invention, it is necessary to first convert the collected video data containing the target object into an image sequence, and then divide the image sequence into N segments on average { P 1 , P 2 ,...,P N }, finally input N segments of images into the action recognition system, randomly sample consecutive K frames of images in each segment, and obtain a total of T=N×K frames of images , where the number of frames of images in a segment can be set according to actual needs, for example, the value of N is 8 or 16, and the value of K is 3.
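A minimal sketch of this sampling scheme is given below; the function name is illustrative and it is assumed that every segment contains at least K frames.

```python
import random

def sample_segments(frames: list, num_segments: int = 8, frames_per_segment: int = 3) -> list:
    """Sketch: split the frame list into N equal segments and randomly pick K
    consecutive frames from each, giving T = N * K sampled frames in total."""
    seg_len = len(frames) // num_segments
    sampled = []
    for i in range(num_segments):
        start, end = i * seg_len, (i + 1) * seg_len
        # random starting point of K consecutive frames inside this segment
        offset = random.randint(start, max(start, end - frames_per_segment))
        sampled.extend(frames[offset:offset + frames_per_segment])
    return sampled
```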
在一可选实施例中,在通过本发明实施例提供的动作识别系统进行动作识别时,带通滤波模块111依次基于各分段中的连续图像提取各分段对应的动态特征图。In an optional embodiment, when the action recognition system provided by the embodiment of the present invention performs action recognition, the bandpass filter module 111 sequentially extracts the dynamic feature map corresponding to each segment based on the continuous images in each segment.
静态特征提取模块112被配置为对一个分段中的多帧连续图像进行时间平均池化,得到特征图,并将特征图与动态特征图作差,得到静态特征图。The static feature extraction module 112 is configured to perform temporal average pooling on multiple frames of continuous images in a segment to obtain a feature map, and make a difference between the feature map and the dynamic feature map to obtain a static feature map.
在本发明实施例中,静态特征提取模块112先通过时间平均池化层对一个分段中的多帧连续图像进行时间平均池化,得到时间维度为1的特征图,将特征图与该分段对应的动态特征图作差,得到该分段的静态特征图。In the embodiment of the present invention, the static feature extraction module 112 first performs temporal average pooling on multiple frames of continuous images in a segment through the temporal average pooling layer to obtain a feature map with a time dimension of 1, and combines the feature map with the segment The dynamic feature map corresponding to the segment is subtracted to obtain the static feature map of the segment.
在一可选实施例中,在通过本发明实施例提供的动作识别系统进行动作识别时,静态特征提取模块112依次基于各分段中的连续图像提取各分段对应的静态特征图。In an optional embodiment, when the action recognition system provided by the embodiment of the present invention performs action recognition, the static feature extraction module 112 sequentially extracts the static feature maps corresponding to each segment based on the continuous images in each segment.
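The pooling-and-subtraction step performed by the static feature extraction module 112 can be sketched as follows, assuming PyTorch tensors with layout (K, C, H, W) for the K frames of one segment:

```python
import torch

def extract_static_features(segment_frames: torch.Tensor,
                            dynamic_map: torch.Tensor) -> torch.Tensor:
    """Sketch of the static feature extraction module: temporal average pooling over
    the K frames of one segment, then subtraction of the segment's dynamic feature map."""
    # segment_frames: (K, C, H, W) consecutive frames of one segment
    # dynamic_map:    (C, H, W)    dynamic feature map of the same segment
    pooled = segment_frames.mean(dim=0)  # temporal average pooling, time dimension reduced to 1
    return pooled - dynamic_map          # static feature map of the segment
```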
静态特征网络12被配置为对多个分段对应的静态特征图进行特征位移操作,以及计算各分段对应的静态特征图之间的差异特征,得到静态分类特征。The static feature network 12 is configured to perform feature displacement operations on the static feature maps corresponding to multiple segments, and calculate the difference features between the static feature maps corresponding to each segment to obtain static classification features.
在一可选实施例中,在通过本发明实施例提供的动作识别系统进行动作识别时,静态特征网络12分别对各分段对应的静态特征图进行特征位移操作,即,对任一分段对应的静态特征图进行特征位移操作时,不会受到其他分段对应的静态特征图的影响。但是静态特征网络12计算各分段对应的静态特征图之间的差异特征时,需要结合两个相邻分段对应的静态特征图计算差异特征。In an optional embodiment, when performing action recognition through the action recognition system provided by the embodiment of the present invention, the static feature network 12 performs feature displacement operations on the static feature maps corresponding to each segment, that is, for any segment When the corresponding static feature map performs feature displacement operation, it will not be affected by the static feature maps corresponding to other segments. However, when the static feature network 12 calculates the difference features between the static feature maps corresponding to each segment, it needs to combine the static feature maps corresponding to two adjacent segments to calculate the difference features.
动态特征网络13被配置为对多个分段对应的动态特征图进行特征位移操作,以及计算各分段对应的动态特征图值之间的差异特征,得到动态分类特征。The dynamic feature network 13 is configured to perform a feature displacement operation on the dynamic feature maps corresponding to multiple segments, and calculate the difference feature between values of the dynamic feature maps corresponding to each segment to obtain dynamic classification features.
在一可选实施例中,与静态特征网络12相同,动态特征网络13分别对各分段对应的动态特征图进行特征位移操作,即,对任一分段对应的动态特征图进行特征位移操作时, 不会受到其他分段对应的静态特征图的影响。但是动态特征网络13计算各分段对应的动态特征图之间的差异特征时,需要结合两个相邻分段对应的动态特征图计算差异特征。In an optional embodiment, the same as the static feature network 12, the dynamic feature network 13 performs a feature displacement operation on the dynamic feature map corresponding to each segment, that is, performs a feature displacement operation on the dynamic feature map corresponding to any segment , will not be affected by the static feature maps corresponding to other segments. However, when the dynamic feature network 13 calculates the difference features between the dynamic feature maps corresponding to each segment, it needs to combine the dynamic feature maps corresponding to two adjacent segments to calculate the difference features.
分类网络14被配置为根据静态分类特征和动态分类特征得到动作识别结果。The classification network 14 is configured to obtain action recognition results according to static classification features and dynamic classification features.
在一可选实施例中,分类网中包括分类器,通过分类器对静态分类特征和动态分类特征进行分析得到动作识别结果。In an optional embodiment, the classification network includes a classifier, and the static classification feature and the dynamic classification feature are analyzed by the classifier to obtain an action recognition result.
本发明实施例提供的动作识别系统,包括信息分离网络11、静态特征网络12、动态特征网络13、分类网络14,首先通过信息分离网络11分离图像中的动态特征图和静态特征图,然后分别将动态特征图和静态特征图输入至动态特征网络13以及静态特征网络12中,通过对动态特征图进行单独分析,能够捕捉到视频的短期时间信息,通过对静态特征图进行单独分析,能够识别视频中的静态场景,静态特征网络12和动态特征网络13对特征图进行位移操作,并计算各分段对应的特征图之间的差异特征,得到分类特征,通过对特征图进行位移操作能够以较少的运算量使网络获取空间局部信息,保证了网络的运行速度,通过计算特征图之间的差异特征,能够捕捉视频中的长期时间关系,使得网络具有时间建模的能力,从而保证网络动作识别的精度,由此可见,通过本发明实施例提供的动作识别系统能够采用更少的数据训练得到动作识别模型,并且,通过动作识别系统训练得到的动作识别模型能够实现对动作的精准识别。The action recognition system provided by the embodiment of the present invention includes an information separation network 11, a static feature network 12, a dynamic feature network 13, and a classification network 14. First, the dynamic feature map and the static feature map in the image are separated through the information separation network 11, and then respectively The dynamic feature map and the static feature map are input into the dynamic feature network 13 and the static feature network 12, and by analyzing the dynamic feature map separately, short-term time information of the video can be captured, and by separately analyzing the static feature map, it is possible to identify In the static scene in the video, the static feature network 12 and the dynamic feature network 13 perform displacement operations on the feature maps, and calculate the difference between the feature maps corresponding to each segment to obtain classification features. By performing displacement operations on the feature maps, the The less computational load enables the network to obtain spatial local information, which ensures the speed of the network. By calculating the difference between the feature maps, it can capture the long-term time relationship in the video, so that the network has the ability of time modeling, thus ensuring the network The accuracy of action recognition, it can be seen that the action recognition system provided by the embodiment of the present invention can use less data training to obtain the action recognition model, and the action recognition model trained by the action recognition system can realize accurate recognition of actions .
在一可选实施例中,带通滤波模块111包括空间卷积层和时间卷积层。In an optional embodiment, the bandpass filtering module 111 includes a spatial convolution layer and a temporal convolution layer.
对于一个分段中的每K帧连续图像,定义P(t,x,y)为像素值,其中x,y代表空间维度,而t则代表时间维度,P(t,x,y)就对应于第t帧的(x,y)处的像素值。那么该带通滤波器的输出F(t,x,y)为:For each K frame of continuous images in a segment, define P(t,x,y) as the pixel value, where x, y represent the spatial dimension, and t represents the time dimension, and P(t,x,y) corresponds to The pixel value at (x,y) at frame t. Then the output F(t,x,y) of the bandpass filter is:
F(t,x,y) = \frac{\partial^{2}}{\partial t^{2}}\left[\mathrm{LoG}_{\mu}(x,y) * P(t,x,y)\right] \quad (1)
其中，*代表卷积操作，LoG_μ(x,y)表示参数为μ的高斯拉普拉斯算子：Where * represents the convolution operation, and LoG_μ(x,y) represents the Laplacian-of-Gaussian operator with parameter μ:
\mathrm{LoG}_{\mu}(x,y) = -\frac{1}{\pi\mu^{4}}\left(1-\frac{x^{2}+y^{2}}{2\mu^{2}}\right)e^{-\frac{x^{2}+y^{2}}{2\mu^{2}}} \quad (2)
关于t的二阶导数采用有限差分数值h(i)近似,如下所示:The second derivative with respect to t is approximated with a finite difference value h(i) as follows:
\frac{\partial^{2} f(t)}{\partial t^{2}} \approx \sum_{i} h(i)\, f(t+i) \quad (3)
则公式(1)可表述为:Then formula (1) can be expressed as:
F(t,x,y) = \sum_{i=1}^{K} h(i)\cdot\left[\mathrm{LoG}_{\mu}(x,y) * P(i,x,y)\right] \quad (4)
其中,K表示一个分段中的图像帧数,“·”表示相乘。Among them, K represents the number of image frames in a segment, and “·” represents multiplication.
由公式(4)可以看出,该带通滤波器函数完全可微。为了提升鲁棒性,本发明实施例中使用两个连续的卷积层实现该带通滤波模块111,因此,本发明实施例中的带通滤波模块111是可训练的。采用卷积层实现后将公式(4)重新表述为如下形式:It can be seen from formula (4) that the bandpass filter function is completely differentiable. In order to improve robustness, the embodiment of the present invention uses two consecutive convolutional layers to implement the band-pass filter module 111 , therefore, the band-pass filter module 111 in the embodiment of the present invention is trainable. After implementing the convolutional layer, formula (4) is re-expressed as follows:
F(t,x,y) = \mathrm{Conv}_{t}^{s}\left(\mathrm{Conv}_{s}^{k\times k}(P)\right)(t,x,y) \quad (5)
其中，Conv_s^{k×k}是采用了卷积核大小为k×k的空间卷积层，并且采用了参数为μ的拉普拉斯算子进行了参数初始化，卷积核的参数值之和被规范化为1；Conv_t^{s}是采用了时间步长为s的时间卷积层，并且将卷积核值初始化为h(i)。Where Conv_s^{k×k} is a spatial convolution layer with a k×k convolution kernel, initialized with the Laplacian-of-Gaussian operator with parameter μ and normalized so that the kernel values sum to 1, and Conv_t^{s} is a temporal convolution layer with time step s whose kernel values are initialized with the finite-difference coefficients h(i).
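To make the two-convolution implementation concrete, the following is a minimal sketch. PyTorch, a 5×5 spatial kernel, a single input channel, and central-difference coefficients [1, -2, 1] for the temporal kernel are assumptions made for illustration and are not prescribed by the passage above.

```python
import torch
import torch.nn as nn

def log_kernel(size: int, mu: float) -> torch.Tensor:
    """Discrete Laplacian-of-Gaussian style kernel with parameter mu,
    normalized so that its values sum to 1 (an assumed discretization)."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    r2 = xx ** 2 + yy ** 2
    k = (1 - r2 / (2 * mu ** 2)) * torch.exp(-r2 / (2 * mu ** 2))
    return k / k.sum()

class BandpassFilterModule(nn.Module):
    """Sketch of the trainable band-pass filter: a spatial convolution initialized
    with a LoG kernel, followed by a temporal convolution initialized with
    finite-difference (second-derivative) coefficients."""

    def __init__(self, k: int = 5, mu: float = 1.0, frames_per_segment: int = 3):
        super().__init__()
        self.spatial = nn.Conv2d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.spatial.weight.data.copy_(log_kernel(k, mu).view(1, 1, k, k))
        # temporal kernel length equals the number of frames per segment (assumed = 3 here)
        self.temporal = nn.Conv1d(1, 1, kernel_size=frames_per_segment, bias=False)
        self.temporal.weight.data.copy_(torch.tensor([[[1.0, -2.0, 1.0]]]))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (K, H, W) grayscale frames of one segment (single channel for brevity)
        K, H, W = frames.shape
        s = self.spatial(frames.unsqueeze(1))                # (K, 1, H, W) spatially filtered
        s = s.squeeze(1).permute(1, 2, 0).reshape(-1, 1, K)  # (H*W, 1, K) per-pixel temporal sequences
        f = self.temporal(s)                                 # (H*W, 1, 1) second temporal difference
        return f.view(H, W)                                  # dynamic feature map of the segment
```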
在一可选实施例中,静态特征网络12与动态特征网络13可以具有相同的网络结构,也可以具有不同的网络结构,当静态特征网络12与动态特征网络13有相同的网络结构时,二者的网络参数值不同。In an optional embodiment, the static feature network 12 and the dynamic feature network 13 can have the same network structure, and can also have different network structures. When the static feature network 12 and the dynamic feature network 13 have the same network structure, the two The values of the network parameters are different.
在一可选实施例中,如图2所示,静态特征网络12和动态特征网络13中的至少之一包括:图像分割模块121、初始特征提取模块122、至少一个中间特征提取模块123。In an optional embodiment, as shown in FIG. 2 , at least one of the static feature network 12 and the dynamic feature network 13 includes: an image segmentation module 121 , an initial feature extraction module 122 , and at least one intermediate feature extraction module 123 .
图像分割模块121被配置为按照第一预设大小对输入特征图进行分割,得到第一特征向量。The image segmentation module 121 is configured to segment the input feature map according to a first preset size to obtain a first feature vector.
在一可选实施例中,静态特征网络12的输入数据为静态特征图,静态特征网络12中的图像分割模块121对静态特征图进行分割;动态特征网络13的输入数据为动态特征图,动态特征网络13中的图像分割模块121对动态特征图进行分割。In an optional embodiment, the input data of the static feature network 12 is a static feature map, and the image segmentation module 121 in the static feature network 12 segments the static feature map; the input data of the dynamic feature network 13 is a dynamic feature map, and the dynamic The image segmentation module 121 in the feature network 13 segments the dynamic feature map.
在一可选实施例中，对于一张大小为H×W×3的输入RGB图像，其中H、W表示图像的大小，3表示图像的通道数，图像分割模块121将图像按照4×4的块大小进行分割，并将分割得到的每个4×4的块合成一个向量，得到特征大小为(H/4)×(W/4)×48，其中(H/4)×(W/4)表示块的数量，48为通道数。In an optional embodiment, for an input RGB image of size H×W×3, where H and W represent the size of the image and 3 represents the number of channels, the image segmentation module 121 partitions the image into blocks of size 4×4 and combines each 4×4 block obtained by the partition into one vector, obtaining a feature of size (H/4)×(W/4)×48, where (H/4)×(W/4) indicates the number of blocks and 48 is the number of channels.
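A minimal sketch of this 4×4 block partition, assuming a PyTorch tensor with layout H×W×C, is given below:

```python
import torch

def partition_into_patches(image: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """Sketch of the image segmentation module: split an H x W x C image into
    non-overlapping patch x patch blocks and flatten each block into one vector."""
    H, W, C = image.shape                 # e.g. C = 3 for an RGB input
    x = image.view(H // patch, patch, W // patch, patch, C)
    x = x.permute(0, 2, 1, 3, 4).reshape(H // patch, W // patch, patch * patch * C)
    return x                              # (H/4, W/4, 48) when patch = 4 and C = 3
```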
初始特征提取模块122包括线性嵌入子模块1221和至少一个特征差异与特征位移子模块1222,线性嵌入子模块1221被配置为按照预设通道数对第一特征向量进行转换,得到第二特征向量;特征差异与特征位移子模块1222被配置为对第二特征向量进行特征位移操作,以及计算各分段对应的第二特征向量之间的差异特征,得到初始分类特征。The initial feature extraction module 122 includes a linear embedding submodule 1221 and at least one feature difference and feature displacement submodule 1222. The linear embedding submodule 1221 is configured to convert the first feature vector according to the preset number of channels to obtain the second feature vector; The feature difference and feature displacement sub-module 1222 is configured to perform a feature displacement operation on the second feature vector, and calculate the difference feature between the second feature vectors corresponding to each segment to obtain the initial classification feature.
在一可选实施例中，当图像分割模块121将输入特征图按照4×4的块大小进行分割，并得到大小为(H/4)×(W/4)×48的第一特征向量时，线性嵌入子模块1221将第一特征向量投影至(H/4)×(W/4)×C，其中，C代表通道数。In an optional embodiment, when the image segmentation module 121 partitions the input feature map into blocks of size 4×4 and obtains a first feature vector of size (H/4)×(W/4)×48, the linear embedding sub-module 1221 projects the first feature vector to (H/4)×(W/4)×C, where C represents the number of channels.
在一可选实施例中,初始特征提取模块122中包括两个连续的特征差异与特征位移子模块1222,在通过线性嵌入子模块1221对第一特征向量进行处理得到第二特征向量后,通过两个连续的特征差异与特征位移子模块1222对第二特征向量进行处理,得到初始分类特征。In an optional embodiment, the initial feature extraction module 122 includes two continuous feature difference and feature displacement sub-modules 1222, after processing the first feature vector through the linear embedding sub-module 1221 to obtain the second feature vector, through Two consecutive feature difference and feature displacement sub-modules 1222 process the second feature vector to obtain initial classification features.
中间特征提取模块123包括特征合并子模块1231和至少一个特征差异与特征位移子模块1222,特征合并子模块1231被配置为按照第二预设大小对初始分类特征进行合并,得到第三特征向量;特征差异与特征位移子模块1222被配置为对第三特征向量进行特征位移操作,以及计算各分段对应的第三特征向量之间的差异特征,得到分类特征。The intermediate feature extraction module 123 includes a feature merging submodule 1231 and at least one feature difference and feature displacement submodule 1222. The feature merging submodule 1231 is configured to merge the initial classification features according to the second preset size to obtain a third feature vector; The feature difference and feature displacement sub-module 1222 is configured to perform a feature displacement operation on the third feature vector, and calculate a difference feature between the third feature vectors corresponding to each segment to obtain a classification feature.
在一可选实施例中，中间特征提取模块123中的特征合并子模块1231将上一阶段得到的初始分类特征按照2×2的大小对块进行合并，合成1个向量，得到特征大小为(H/8)×(W/8)×4C，然后通过至少一个特征差异与特征位移子模块1222后输出。In an optional embodiment, the feature merging sub-module 1231 in the intermediate feature extraction module 123 merges the blocks of the initial classification features obtained in the previous stage in 2×2 groups, combining each group into one vector to obtain a feature of size (H/8)×(W/8)×4C, which is then passed through at least one feature difference and feature displacement sub-module 1222 and output.
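The 2×2 merging can be sketched in the same way as the block partition above; the H×W×C layout and the absence of any subsequent channel reduction are assumptions made for illustration.

```python
import torch

def merge_patches(x: torch.Tensor) -> torch.Tensor:
    """Sketch of the feature merging sub-module: group blocks in 2 x 2 neighbourhoods
    and concatenate each group into one vector, halving the spatial size and
    multiplying the channel dimension by four."""
    H, W, C = x.shape
    x = x.view(H // 2, 2, W // 2, 2, C)
    x = x.permute(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4 * C)
    return x
```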
在一可选实施例中,当静态特征网络12,和/或,动态特征网络13中包括有多个中间特征提取模块123时,不同中间特征提取模块123中特征差异与特征位移子模块1222的数量可以相同,也可以不同,示例性地,中间特征提取模块123中特征差异与特征位移子模块1222的数量可以为2、6等。In an optional embodiment, when the static feature network 12, and/or, the dynamic feature network 13 includes multiple intermediate feature extraction modules 123, the feature difference and feature displacement sub-modules 1222 in different intermediate feature extraction modules 123 The numbers may be the same or different. Exemplarily, the number of feature difference and feature displacement sub-modules 1222 in the intermediate feature extraction module 123 may be 2, 6, and so on.
In an optional embodiment, at least one of the static feature network 12 and the dynamic feature network 13 includes three intermediate feature extraction modules 123: a first intermediate feature extraction module 123, a second intermediate feature extraction module 123, and a third intermediate feature extraction module 123.
The image segmentation module 121, the initial feature extraction module 122, the first intermediate feature extraction module 123, the second intermediate feature extraction module 123, and the third intermediate feature extraction module 123 are connected in sequence. That is, in this embodiment of the present invention, the output data of the image segmentation module 121 is the input data of the initial feature extraction module 122, the output data of the initial feature extraction module 122 is the input data of the first intermediate feature extraction module 123, the output data of the first intermediate feature extraction module 123 is the input data of the second intermediate feature extraction module 123, and the output data of the second intermediate feature extraction module 123 is the input data of the third intermediate feature extraction module 123.
The initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123 contain the same number of feature difference and feature displacement sub-modules 1222.
The number of feature difference and feature displacement sub-modules 1222 in the second intermediate feature extraction module 123 is greater than the number in the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123.
In an optional embodiment, the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123 each contain 2 feature difference and feature displacement sub-modules 1222, and the second intermediate feature extraction module 123 contains 6.
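For orientation, the stage depths named in this example (2, 2, 6, 2 sub-modules) can be collected in a small, purely hypothetical configuration object; the key names are illustrative and do not appear in the patent.

STAGE_DEPTHS = {
    "initial_feature_extraction": 2,   # initial feature extraction module 122
    "intermediate_stage_1": 2,         # first intermediate feature extraction module 123
    "intermediate_stage_2": 6,         # second intermediate feature extraction module 123
    "intermediate_stage_3": 2,         # third intermediate feature extraction module 123
}

def total_blocks(depths=STAGE_DEPTHS):
    """Total number of feature difference and feature displacement sub-modules."""
    return sum(depths.values())        # 12 for the example configuration above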
In an optional embodiment, when the static feature network 12 and the dynamic feature network 13 each include three intermediate feature extraction modules 123, the final output feature of the static feature network 12 and the dynamic feature network 13 has size [Figure PCTCN2022114819-appb-000015].
In an optional embodiment, as shown in FIG. 3, the feature difference and feature displacement sub-module 1222 includes a first normalization layer, a feature displacement unit, a feature difference unit, a second normalization layer, a first fully connected layer, a first GELU function layer, and a second fully connected layer, connected in that order.
The input data of the second normalization layer is a first residual calculation result, which is computed from the input data of the first normalization layer and the output data of the feature difference unit.
The output data of the feature difference and feature displacement sub-module 1222 is a second residual calculation result, which is computed from the input data of the second normalization layer and the output data of the second fully connected layer.
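Read as pseudocode, the sub-module is a block with two residual connections. The sketch below is an assumed PyTorch-style rendering, not the patent's implementation: shift_unit and diff_unit stand for the feature displacement unit and feature difference unit sketched further below, and the hidden width of the two fully connected layers (mlp_ratio) is an assumption.

import torch.nn as nn

class FeatureDiffShiftBlock(nn.Module):
    """Sketch of one feature difference and feature displacement sub-module:
    norm -> shift unit -> difference unit (+ residual), then
    norm -> FC -> GELU -> FC (+ residual)."""
    def __init__(self, dim, shift_unit, diff_unit, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.shift = shift_unit           # feature displacement unit
        self.diff = diff_unit             # feature difference unit
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        # first residual: input of the first norm + output of the difference unit
        x = x + self.diff(self.shift(self.norm1(x)))
        # second residual: input of the second norm + output of the second FC layer
        return x + self.mlp(self.norm2(x))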
In an optional embodiment, as shown in FIG. 4, the feature displacement unit includes a first channel fully connected layer, a horizontal feature displacement layer, a second channel fully connected layer, a vertical feature displacement layer, a third channel fully connected layer, and a fourth channel fully connected layer. The horizontal feature displacement layer is connected to the second channel fully connected layer, and the vertical feature displacement layer is connected to the third channel fully connected layer; the structure formed by the horizontal feature displacement layer and the second channel fully connected layer and the structure formed by the vertical feature displacement layer and the third channel fully connected layer are arranged in parallel.
The first channel fully connected layer is configured to fully connect the channels of the input data to obtain a full connection result, and to input the full connection result into the horizontal feature displacement layer and the vertical feature displacement layer respectively.
The horizontal feature displacement layer is configured to horizontally displace the full connection result to obtain a horizontal displacement result, and to input the horizontal displacement result into the second channel fully connected layer.
In an optional embodiment, the full connection result has three dimensions: height, width, and channel. When the full connection result is horizontally displaced with 3 displacement groups and a displacement size of 1, the groups of channel feature maps are shifted along the horizontal direction according to [+1, 0, -1], and the vacated positions are filled with zeros. For example, the full connection result is divided into 3 groups of data along the channel dimension: the first group is shifted horizontally by one unit length in one direction and the vacated positions are filled with zeros, the second group remains unchanged, and the third group is shifted horizontally by one unit length in the opposite direction and the vacated positions are filled with zeros. If the number of displacement groups is 5 and the displacement size is 2, the groups of channel feature maps are shifted along the horizontal direction according to [+4, +2, 0, -2, -4], and the vacated positions are filled with zeros.
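The grouped, zero-padded shift can be sketched as follows. The (B, H, W, C) tensor layout and the way the channels are chunked into groups are assumptions made only for illustration.

import torch

def grouped_shift(x, offsets=(+1, 0, -1), dim=2):
    """Illustrative grouped shift: the channels of x (shape B, H, W, C) are split
    into len(offsets) groups; group g is shifted along `dim` (1 = height for a
    vertical shift, 2 = width for a horizontal shift) by offsets[g], and the
    vacated positions are filled with zeros."""
    groups = torch.chunk(x, len(offsets), dim=3)
    shifted = []
    for g, off in zip(groups, offsets):
        if off == 0:
            shifted.append(g)
            continue
        s = torch.roll(g, shifts=off, dims=dim)
        idx = [slice(None)] * 4
        idx[dim] = slice(0, off) if off > 0 else slice(off, None)
        s[tuple(idx)] = 0          # zero-fill the region exposed by the shift
        shifted.append(s)
    return torch.cat(shifted, dim=3)

# e.g. three groups shifted by [+1, 0, -1] along the width, as in the example above:
# y = grouped_shift(x, offsets=(+1, 0, -1), dim=2)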
The vertical feature displacement layer is configured to vertically displace the full connection result to obtain a vertical displacement result, and to input the vertical displacement result into the third channel fully connected layer.
Vertical displacement of the full connection result differs from horizontal displacement only in that the shift is applied along the vertical direction rather than the horizontal direction.
The fourth channel fully connected layer is configured to process the sum of the output results of the second channel fully connected layer and the third channel fully connected layer to obtain the output result of the feature displacement unit.
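Putting the pieces together, a hedged sketch of the feature displacement unit follows; it reuses the grouped_shift helper above, and all layer widths and the shift offsets are assumptions.

import torch.nn as nn

class ShiftUnit(nn.Module):
    """Sketch of the feature displacement unit: a channel FC feeds two parallel
    branches (horizontal shift -> channel FC, vertical shift -> channel FC);
    their sum passes through a final channel FC."""
    def __init__(self, dim, offsets=(+1, 0, -1)):
        super().__init__()
        self.offsets = offsets
        self.fc_in = nn.Linear(dim, dim)    # first channel fully connected layer
        self.fc_h = nn.Linear(dim, dim)     # second channel FC (after horizontal shift)
        self.fc_v = nn.Linear(dim, dim)     # third channel FC (after vertical shift)
        self.fc_out = nn.Linear(dim, dim)   # fourth channel FC (after the sum)

    def forward(self, x):                   # x: (B, H, W, C)
        x = self.fc_in(x)
        h = self.fc_h(grouped_shift(x, self.offsets, dim=2))  # horizontal branch
        v = self.fc_v(grouped_shift(x, self.offsets, dim=1))  # vertical branch
        return self.fc_out(h + v)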
In an optional embodiment, as shown in FIG. 5, the feature difference unit includes an input layer, a max pooling layer, a third fully connected layer, a second GELU function layer, a fourth fully connected layer, an upsampling layer, a fifth fully connected layer, a third GELU function layer, a sixth fully connected layer, and a feature difference output layer.
The input layer, the max pooling layer, the third fully connected layer, the second GELU function layer, the fourth fully connected layer, the upsampling layer, and the feature difference output layer are connected in sequence.
The input layer, the fifth fully connected layer, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer are connected in sequence.
The input layer is configured to compute the difference between the input feature corresponding to the segment at the current moment and the input feature corresponding to the segment at the previous moment, and to input the difference feature into the max pooling layer and the fifth fully connected layer respectively.
The feature difference output layer is configured to sum the output results of the upsampling layer and the sixth fully connected layer to obtain a summation result; to multiply the summation result point by point with the input feature corresponding to the segment at the previous moment to obtain a multiplication result; and to add the multiplication result to the input feature corresponding to the segment at the previous moment to obtain the output result of the feature difference unit.
Exemplarily, for input features [F_1, F_2, ..., F_N], where [Figure PCTCN2022114819-appb-000016] denotes the feature at time t, the difference between the features at time t and time t+1 is computed and then split into two paths: one path is downsampled by the max pooling layer, passes through two fully connected layers with a GELU function between them, and is then upsampled by the upsampling layer; the other path passes directly through two fully connected layers with a GELU function between them. Finally, the feature maps output by the two paths are summed, multiplied point by point with the original input feature, and then added to it to obtain the final output.
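A possible PyTorch-style reading of this feature difference unit is sketched below; the pooling size, the reduction ratio inside the fully connected layers, and nearest-neighbour upsampling are assumptions not specified in the text.

import torch.nn as nn
import torch.nn.functional as F

class FeatureDiffUnit(nn.Module):
    """Sketch of the feature difference unit: the difference between the features
    of two adjacent segments is processed by two parallel paths (one with max-pool
    downsampling and upsampling, one without); the summed result gates the original
    feature by point-wise multiplication before a final residual addition."""
    def __init__(self, dim, reduction=4, pool=2):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=pool)
        self.path_pooled = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.GELU(), nn.Linear(dim // reduction, dim))
        self.path_direct = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.GELU(), nn.Linear(dim // reduction, dim))

    def forward(self, feat_t, feat_next):       # (B, H, W, C) features of segments t, t+1
        diff = feat_next - feat_t
        B, H, W, C = diff.shape
        # path 1: max-pool downsample -> FC / GELU / FC -> upsample back to (H, W)
        d = self.pool(diff.permute(0, 3, 1, 2))                  # (B, C, H/2, W/2)
        d = self.path_pooled(d.permute(0, 2, 3, 1))              # (B, H/2, W/2, C)
        d = F.interpolate(d.permute(0, 3, 1, 2), size=(H, W),
                          mode="nearest").permute(0, 2, 3, 1)
        # path 2: FC / GELU / FC applied directly to the difference
        g = self.path_direct(diff)
        attn = d + g
        return feat_t + feat_t * attn           # gate the original feature, then add it back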
In an optional embodiment, as shown in FIG. 1, in the action recognition system provided by this embodiment of the present invention, the classification network 14 includes a first temporal average pooling layer 141, a second temporal average pooling layer 143, a static feature classifier 142, a dynamic feature classifier 144, and an output layer 145.
The first temporal average pooling layer 141 is configured to perform temporal average pooling on the static classification features corresponding to the multiple segments, and to input the pooling result into the static feature classifier 142.
The second temporal average pooling layer 143 is configured to perform temporal average pooling on the dynamic classification features corresponding to the multiple segments, and to input the pooling result into the dynamic feature classifier 144.
In an optional embodiment, when action recognition is performed by the action recognition system provided by this embodiment of the present invention, if the image sequence corresponding to a video is divided into N segments, the information separation network 11, the static feature network 12, and the dynamic feature network 13 process the N segments in turn to obtain N static classification features and N dynamic classification features. The first temporal average pooling layer 141 performs temporal average pooling on the N static classification features to obtain one static classification feature with a temporal attribute, and the second temporal average pooling layer 143 performs temporal average pooling on the N dynamic classification features to obtain one dynamic classification feature with a temporal attribute. The static and dynamic classification features with temporal attributes enable more accurate recognition of actions.
The static feature classifier 142 is configured to obtain a first classification result according to the static classification feature.
The dynamic feature classifier 144 is configured to obtain a second classification result according to the dynamic classification feature.
The recognition result output layer 145 is configured to take the weighted average of the first classification result and the second classification result as the output result.
In an optional embodiment, the static feature classifier 142 and the dynamic feature classifier 144 are Softmax classifiers.
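The classification network can be summarised in a short sketch. The equal weighting of the two streams and the feature dimension are assumptions, since the text only states that a weighted average of the two classification results is taken.

import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch of the classification network: temporal average pooling over the N
    segment features, one Softmax classifier per stream, and a weighted average
    of the two classification results as the output."""
    def __init__(self, feat_dim, num_classes, static_weight=0.5):
        super().__init__()
        self.static_fc = nn.Linear(feat_dim, num_classes)
        self.dynamic_fc = nn.Linear(feat_dim, num_classes)
        self.w = static_weight

    def forward(self, static_feats, dynamic_feats):   # each: (B, N, feat_dim)
        s = self.static_fc(static_feats.mean(dim=1)).softmax(dim=-1)    # temporal avg pool
        d = self.dynamic_fc(dynamic_feats.mean(dim=1)).softmax(dim=-1)
        return self.w * s + (1.0 - self.w) * d        # weighted average of the two results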
An embodiment of the present invention provides an action recognition model training method, as shown in FIG. 6, including:
Step S21: acquiring multiple image sequences, in which pedestrian action types are annotated.
Step S22: dividing each image sequence into multiple sub-sequences to obtain a training data set.
Step S23: inputting the training data set into a neural network system and training the neural network system until the loss value of the loss function satisfies a loss condition, to obtain the action recognition model, where the neural network system is the action recognition system provided in any of the above embodiments; for details of the action recognition system, refer to the above embodiments.
In an optional embodiment, the initialized action recognition system is trained to obtain the action recognition model. In the action recognition system, the band-pass filter module 111 is initialized with a Laplacian-of-Gaussian operator, and the feature difference and feature displacement networks are initialized with a pre-trained model obtained by pre-training on ImageNet or another large data set.
In an optional embodiment, the action recognition system may be fully trained on a large-scale data set, or fine-tuned based on a pre-trained model.
In the action recognition model training method provided by this embodiment of the present invention, after the training data set is obtained, it is input into the action recognition system provided in the above embodiments, and the action recognition system is trained to obtain the action recognition model. The action recognition system provided in the above embodiments includes an information separation network, a static feature network, a dynamic feature network, and a classification network. The action recognition system first separates the dynamic feature map and the static feature map of an image through the information separation network, and then inputs the dynamic feature map into the dynamic feature network and the static feature map into the static feature network. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video. The static feature network and the dynamic feature network perform displacement operations on the feature maps and compute the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operation lets the network acquire spatially local information with little computation, ensuring the running speed of the network, while computing the difference features between feature maps captures long-term temporal relationships in the video, giving the network temporal modeling capability and ensuring the accuracy of action recognition. It can thus be seen that the action recognition model training method provided by this embodiment of the present invention can train an action recognition model with less data, and the action recognition model obtained in this way can recognize actions accurately.
In an optional embodiment, the loss function is obtained by combining an orthogonal projection loss function and a cross-entropy loss function.
In an optional embodiment, the following are defined for a batch B:
[Figure PCTCN2022114819-appb-000017]
[Figure PCTCN2022114819-appb-000018]
where y_i and y_j denote the i-th and j-th ground-truth labels, with i, j ∈ B; F_i and F_j denote the features obtained by temporal average pooling of the network output features of the N segments of the i-th and j-th videos; ||·||_2 denotes l_2 regularization; and "·" denotes the vector dot product. s and d denote the results of the cosine similarity operation on the intermediate features when the ground-truth classes are the same and different, respectively. The final orthogonal projection loss function L_opl is:
L_opl = (1 - s) + α × |d|,  (4)
where α is a hyperparameter controlling the weight, and |·| denotes the absolute value operation.
In an optional embodiment, the loss function is obtained by adding the product of the orthogonal projection loss function and a hyperparameter controlling the orthogonal projection loss weight to the cross-entropy loss function. The orthogonal projection loss function is combined with the cross-entropy loss L_ce to obtain the final loss function L:
L = L_ce + β × L_opl,  (5)
where β is a hyperparameter controlling the weight of the orthogonal projection loss.
In this embodiment of the present invention, the orthogonal projection loss function is used together with the cross-entropy loss to obtain the final loss function. Introducing the orthogonal projection loss orthogonalizes the intermediate-layer features, achieving inter-class separation and intra-class clustering, so that the trained action recognition model can recognize actions more accurately.
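Since equations (2) and (3) are only available here as images, the sketch below reconstructs the combined loss from the surrounding description: l_2-normalised, temporally pooled features; cosine similarities s and d for same-class and different-class pairs; and L = L_ce + β·L_opl with L_opl = (1 - s) + α·|d|. The exact pair averaging is an assumption.

import torch
import torch.nn.functional as F

def opl_ce_loss(features, logits, labels, alpha=1.0, beta=1.0):
    """Hedged sketch of the combined loss L = L_ce + beta * L_opl."""
    f = F.normalize(features, p=2, dim=1)          # l2 normalisation of the pooled features
    sim = f @ f.t()                                # pairwise dot products (cosine similarity)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    s = sim[same & ~eye].mean() if (same & ~eye).any() else sim.new_tensor(1.0)
    d = sim[~same].mean() if (~same).any() else sim.new_tensor(0.0)
    l_opl = (1.0 - s) + alpha * d.abs()            # orthogonal projection loss, eq. (4)
    l_ce = F.cross_entropy(logits, labels)
    return l_ce + beta * l_opl                     # final loss, eq. (5)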
In an optional embodiment, the above step S21 includes:
First, the images in the image sequence are proportionally scaled to obtain scaled images whose short-side size lies within a preset interval. Exemplarily, the preset interval may be [256, 320].
Then, the scaled images are randomly cropped to obtain cropped images whose size satisfies a preset condition. In an optional embodiment, the size of the cropped images is 224×224.
Finally, the image sequence formed by the cropped images is divided into multiple sub-sequences to obtain the training data set.
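A hedged sketch of this preprocessing follows; the interval [256, 320] and the 224×224 crop are taken from the text, while the use of PIL and the helper name are illustrative assumptions.

import random
from PIL import Image

def preprocess_frame(img: Image.Image, short_side_range=(256, 320), crop=224):
    """Proportionally rescale so the short side falls in [256, 320],
    then take a random 224x224 crop."""
    target = random.randint(*short_side_range)
    w, h = img.size
    scale = target / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = random.randint(0, w - crop), random.randint(0, h - crop)
    return img.crop((left, top, left + crop, top + crop))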
An embodiment of the present invention provides an action recognition method, as shown in FIG. 7, including:
Step S31: acquiring an image sequence of a target object and dividing the image sequence into multiple sub-sequences.
Step S32: inputting the sub-sequences into an action recognition model to generate an action recognition result, where the action recognition model is trained by the action recognition model training method provided in the above embodiments; for details of the action recognition model, refer to the description in the above embodiments, which is not repeated here.
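As a usage illustration only, the recognition flow might look as follows; the number of segments, the number of frames sampled per segment, and the model's input layout are assumptions, not the patent's prescribed interface.

import torch

@torch.no_grad()
def recognize_action(model, frames, num_segments=8, frames_per_segment=5):
    """Minimal sketch: split the frame sequence into num_segments sub-sequences,
    take a few consecutive frames from each, and feed them to a trained model."""
    n = len(frames)
    seg_len = n // num_segments
    clips = []
    for s in range(num_segments):
        start = s * seg_len
        clip = frames[start:start + frames_per_segment]   # consecutive frames of one segment
        clips.append(torch.stack(clip))                   # (T, C, H, W)
    batch = torch.stack(clips).unsqueeze(0)               # (1, N, T, C, H, W)
    scores = model(batch)
    return scores.argmax(dim=-1)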
In the action recognition method provided by this embodiment of the present invention, after the image sequence of the target object is acquired, it is divided into multiple sub-sequences, and the sub-sequences are input into the action recognition model trained by the action recognition model training method provided in the above embodiments. That training method obtains the action recognition model by training the action recognition system, which first separates the dynamic feature map and the static feature map of an image through the information separation network and then inputs them into the dynamic feature network and the static feature network respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video. The static feature network and the dynamic feature network perform displacement operations on the feature maps and compute the difference features between the feature maps corresponding to the segments to obtain classification features; the displacement operation lets the network acquire spatially local information with little computation, ensuring the running speed of the network, while computing the difference features between feature maps captures long-term temporal relationships in the video, giving the network temporal modeling capability and ensuring the accuracy of action recognition. The action recognition model trained in this way can therefore recognize actions accurately, and implementing this embodiment of the present invention achieves accurate action recognition.
In an optional embodiment, in the action recognition method provided by this embodiment of the present invention, after the above step S31 and before step S32, the method further includes:
First, proportionally scaling the images in the image sequence to obtain scaled images whose short-side size lies within a preset interval; for details, refer to the description in the above embodiments, which is not repeated here.
Then, randomly cropping the scaled images to obtain cropped images whose size satisfies a preset condition; for details, refer to the description in the above embodiments, which is not repeated here.
Finally, using the cropped images as the images in the sub-sequences and performing the step of inputting the sub-sequences into the action recognition model; for details, refer to the description in the above embodiments, which is not repeated here.
In this embodiment of the present invention, the images are proportionally scaled so that their short-side size lies within the preset interval and are then randomly cropped to the input size accepted by the network, which achieves data augmentation. By analyzing the scaled and cropped data, the action recognition model can focus on the effective information in the images, improving analysis efficiency and the accuracy of the analysis results.
An embodiment of the present invention provides an action recognition model training apparatus, as shown in FIG. 8, including:
an image acquisition module 21, configured to acquire multiple image sequences in which pedestrian action types are annotated; for details, refer to the description of step S21 in the above embodiment, which is not repeated here;
a training data acquisition module 22, configured to divide each image sequence into multiple sub-sequences to obtain a training data set; for details, refer to the description of step S22 in the above embodiment, which is not repeated here;
a model training module 23, configured to input the training data set into a neural network system and train the neural network system to obtain the action recognition model, where the neural network system is the action recognition system provided in the above embodiments; for details, refer to the description of step S23 in the above embodiment, which is not repeated here.
In the action recognition model training apparatus provided by this embodiment of the present invention, after the training data set is obtained, it is input into the action recognition system provided in the above embodiments, and the action recognition system is trained to obtain the action recognition model. The action recognition system includes an information separation network, a static feature network, a dynamic feature network, and a classification network: the information separation network separates the dynamic feature map and the static feature map of an image, which are then input into the dynamic feature network and the static feature network respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video. The static feature network and the dynamic feature network perform displacement operations on the feature maps and compute the difference features between the feature maps corresponding to the segments to obtain classification features; the displacement operation lets the network acquire spatially local information with little computation, ensuring the running speed of the network, while computing the difference features captures long-term temporal relationships in the video, giving the network temporal modeling capability and ensuring the accuracy of action recognition. It can thus be seen that the action recognition model training apparatus provided by this embodiment of the present invention can train an action recognition model with less data, and the action recognition model obtained in this way can recognize actions accurately.
An embodiment of the present invention provides an action recognition apparatus, as shown in FIG. 9, including:
an image acquisition module 31, configured to acquire an image sequence of a target object and divide the image sequence into multiple sub-sequences; for details, refer to the description of step S31 in the above embodiment, which is not repeated here;
an action recognition module 32, configured to input the sub-sequences into an action recognition model to generate an action recognition result, where the action recognition model is trained by the action recognition model training method provided in the above embodiments; for details, refer to the description of step S32 in the above embodiment, which is not repeated here.
In the action recognition apparatus provided by this embodiment of the present invention, after the image sequence of the target object is acquired, it is divided into multiple sub-sequences, and the sub-sequences are input into the action recognition model trained by the action recognition model training method provided in the above embodiments. That training method obtains the action recognition model by training the action recognition system, which first separates the dynamic feature map and the static feature map of an image through the information separation network and then inputs them into the dynamic feature network and the static feature network respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video. The static feature network and the dynamic feature network perform displacement operations on the feature maps and compute the difference features between the feature maps corresponding to the segments to obtain classification features; the displacement operation lets the network acquire spatially local information with little computation, ensuring the running speed of the network, while computing the difference features captures long-term temporal relationships in the video, giving the network temporal modeling capability and ensuring the accuracy of action recognition. The action recognition model trained in this way can therefore recognize actions accurately, and implementing this embodiment of the present invention achieves accurate action recognition.
An embodiment of the present invention provides a computer device. As shown in FIG. 10, the computer device mainly includes one or more processors 41 and a memory 42; one processor 41 is taken as an example in FIG. 10.
The computer device may further include an input apparatus 43 and an output apparatus 44.
The processor 41, the memory 42, the input apparatus 43, and the output apparatus 44 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 10.
The processor 41 may be a central processing unit (CPU). The processor 41 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination of the above chips. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The memory 42 may include a program storage area and a data storage area, where the program storage area may store the operating system and the application programs required by at least one function, and the data storage area may store data created by the use of the action recognition system, the action recognition model training apparatus, or the action recognition apparatus. In addition, the memory 42 may include a high-speed random access memory, and may also include a non-transitory memory such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 42 may optionally include memories remotely located relative to the processor 41, and these remote memories may be connected through a network to the action recognition system, the action recognition model training apparatus, or the action recognition apparatus. The input apparatus 43 may receive calculation requests (or other digital or character information) input by a user and generate key signal inputs related to the action recognition system, the action recognition model training apparatus, or the action recognition apparatus. The output apparatus 44 may include a display device such as a display screen for outputting calculation results.
An embodiment of the present invention provides a computer-readable storage medium that stores computer instructions; the computer-executable instructions can execute the action recognition model training method or the action recognition method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the storage medium may also include a combination of the above kinds of memories. Apparently, the above embodiments are merely examples given for clear description and are not intended to limit the implementations. For those of ordinary skill in the art, other changes or variations in different forms can be made on the basis of the above description. It is neither necessary nor possible to exhaustively list all implementations here, and obvious changes or variations derived therefrom still fall within the protection scope of the present invention.
Industrial Applicability
An embodiment of the present invention provides an action recognition system including: a band-pass filter module configured to extract a dynamic feature map from multiple frames of consecutive images in a segment; a static feature extraction module configured to obtain a static feature map from the multiple frames of consecutive images in the segment; a static feature network configured to perform a feature displacement operation on the static feature maps and to compute the difference features between the static feature maps corresponding to the segments, obtaining static classification features; a dynamic feature network configured to perform a feature displacement operation on the dynamic feature maps and to compute the difference features between the dynamic feature maps corresponding to the segments, obtaining dynamic classification features; and a classification network configured to obtain an action recognition result according to the static classification features and the dynamic classification features. By implementing the present invention, an action recognition model can be trained with less data, and accurate recognition of actions can be achieved.

Claims (17)

  1. 一种动作识别系统,包括:信息分离网络、静态特征网络、动态特征网络、分类网络,所述信息分离网络包括带通滤波模块、静态特征提取模块,An action recognition system, comprising: an information separation network, a static feature network, a dynamic feature network, and a classification network, wherein the information separation network includes a bandpass filter module and a static feature extraction module,
    所述带通滤波模块被配置为根据获取的一个分段中的多帧连续图像提取动态特征图;The band-pass filtering module is configured to extract dynamic feature maps from multiple frames of continuous images obtained in a segment;
    所述静态特征提取模块被配置为对所述一个分段中的多帧连续图像进行时间平均池化,得到特征图,并将所述特征图与所述动态特征图作差,得到静态特征图;The static feature extraction module is configured to perform temporal average pooling on the multi-frame continuous images in the one segment to obtain a feature map, and make a difference between the feature map and the dynamic feature map to obtain a static feature map ;
    所述静态特征网络被配置为对多个分段对应的静态特征图进行特征位移操作,以及计算各分段对应的静态特征图之间的差异特征,得到静态分类特征;The static feature network is configured to perform a feature displacement operation on static feature maps corresponding to multiple segments, and calculate difference features between static feature maps corresponding to each segment to obtain static classification features;
    所述动态特征网络被配置为对多个分段对应的动态特征图进行特征位移操作,以及计算各分段对应的动态特征图之间的差异特征,得到动态分类特征;The dynamic feature network is configured to perform a feature displacement operation on dynamic feature maps corresponding to multiple segments, and calculate difference features between dynamic feature maps corresponding to each segment to obtain dynamic classification features;
    所述分类网络被配置为根据所述静态分类特征和所述动态分类特征得到动作识别结果。The classification network is configured to obtain an action recognition result according to the static classification feature and the dynamic classification feature.
  2. 根据权利要求1所述的动作识别系统,其中,The action recognition system according to claim 1, wherein,
    所述带通滤波模块包括空间卷积层和时间卷积层。The bandpass filtering module includes a spatial convolution layer and a temporal convolution layer.
  3. 根据权利要求1所述的动作识别系统,其中,所述静态特征网络和动态特征网络中的至少之一包括:图像分割模块、初始特征提取模块、至少一个中间特征提取模块,The action recognition system according to claim 1, wherein at least one of the static feature network and the dynamic feature network comprises: an image segmentation module, an initial feature extraction module, at least one intermediate feature extraction module,
    所述图像分割模块被配置为按照第一预设大小对输入特征图进行分割,得到第一特征向量;The image segmentation module is configured to segment the input feature map according to a first preset size to obtain a first feature vector;
    the initial feature extraction module includes a linear embedding sub-module and at least one feature difference and feature displacement sub-module, the linear embedding sub-module being configured to convert the first feature vector according to a preset number of channels to obtain a second feature vector; the feature difference and feature displacement sub-module being configured to perform a feature displacement operation on the second feature vector and to compute the difference features between the second feature vectors corresponding to the segments to obtain initial classification features;
    所述中间特征提取模块包括特征合并子模块和至少一个特征差异与特征位移子模块,所述特征合并子模块被配置为按照第二预设大小对所述初始分类特征进行合并,得到第三特征向量;所述特征差异与特征位移子模块被配置为对所述第三特征向量进行特征位移操作,以及计算各分段对应的第三特征向量之间的差异特征,得到分类特征。The intermediate feature extraction module includes a feature merging submodule and at least one feature difference and feature displacement submodule, and the feature merging submodule is configured to merge the initial classification features according to a second preset size to obtain a third feature vector; the feature difference and feature displacement sub-module is configured to perform a feature displacement operation on the third feature vector, and calculate a difference feature between the third feature vectors corresponding to each segment to obtain a classification feature.
  4. The action recognition system according to claim 3, wherein at least one of the static feature network and the dynamic feature network includes three intermediate feature extraction modules: a first intermediate feature extraction module, a second intermediate feature extraction module, and a third intermediate feature extraction module,
    所述图像分割模块、初始特征提取模块、第一中间特征提取模块、第二中间特征提取模块、第三中间特征提取模块依次连接;The image segmentation module, the initial feature extraction module, the first intermediate feature extraction module, the second intermediate feature extraction module, and the third intermediate feature extraction module are connected in sequence;
    the initial feature extraction module, the first intermediate feature extraction module, and the third intermediate feature extraction module contain the same number of feature difference and feature displacement sub-modules;
    the number of feature difference and feature displacement sub-modules in the second intermediate feature extraction module is greater than the number of feature difference and feature displacement sub-modules in the initial feature extraction module, the first intermediate feature extraction module, and the third intermediate feature extraction module.
  5. The action recognition system according to claim 3 or 4, wherein the feature difference and feature displacement sub-module comprises a first normalization layer, a feature displacement unit, a feature difference unit, a second normalization layer, a first fully connected layer, a first GELU function layer, and a second fully connected layer, wherein,
    第一归一化层、特征位移单元、特征差异单元、第二归一化层、第一全连接层、第一GELU函数层、第二全连接层依次连接;The first normalization layer, the feature displacement unit, the feature difference unit, the second normalization layer, the first fully connected layer, the first GELU function layer, and the second fully connected layer are sequentially connected;
    the input data of the second normalization layer is a first residual calculation result; the first residual calculation result is calculated from the input data of the first normalization layer and the output data of the feature difference unit;
    the output data of the feature difference and feature displacement sub-module is a second residual calculation result; the second residual calculation result is calculated from the input data of the second normalization layer and the output data of the second fully connected layer.
  6. The action recognition system according to claim 5, wherein the feature displacement unit includes a first channel fully connected layer, a horizontal feature displacement layer, a second channel fully connected layer, a vertical feature displacement layer, a third channel fully connected layer, and a fourth channel fully connected layer, wherein,
    所述第一信道全连接层被配置为对输入数据的信道进行全连接,得到全连接结果,并将所述全连接结果分别输入至所述水平特征位移层和所述竖直特征位移层中;The first channel full connection layer is configured to perform full connection on channels of input data to obtain a full connection result, and input the full connection result into the horizontal feature displacement layer and the vertical feature displacement layer respectively ;
    所述水平特征位移层被配置为对所述全连接结果进行水平位移,得到水平位移结果,并将所述水平位移结果输入至所述第二信道全连接层中;The horizontal feature displacement layer is configured to perform horizontal displacement on the fully connected result to obtain a horizontal displacement result, and input the horizontal displacement result into the second channel fully connected layer;
    所述竖直特征位移层被配置为对所述全连接结果进行竖直位移,得到竖直位移结果,并将所述竖直位移结果输入至所述第三信道全连接层中;The vertical characteristic displacement layer is configured to perform vertical displacement on the fully connected result to obtain a vertical displacement result, and input the vertical displacement result into the third channel fully connected layer;
    所述第四信道全连接层被配置为对所述第二信道全连接层和第三信道全连接层的输出结果的和进行处理,得到所述特征位移单元的输出结果。The fourth channel fully connected layer is configured to process the sum of the output results of the second channel fully connected layer and the third channel fully connected layer to obtain the output result of the feature displacement unit.
  7. The action recognition system according to claim 5, wherein the feature difference unit comprises: an input layer, a max pooling layer, a third fully connected layer, a second GELU function layer, a fourth fully connected layer, an upsampling layer, a fifth fully connected layer, a third GELU function layer, a sixth fully connected layer, and a feature difference output layer, wherein,
    输入层、最大池化层、第三全连接层、第二GELU函数层、第四全连接层、上采样层、特征差异输出层依次连接;The input layer, the maximum pooling layer, the third fully connected layer, the second GELU function layer, the fourth fully connected layer, the upsampling layer, and the feature difference output layer are connected in sequence;
    输入层、第五全连接层、第三GELU函数层、第六全连接层、特征差异输出层依次连接;The input layer, the fifth fully connected layer, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer are connected in sequence;
    所述输入层被配置为将当前时刻分段对应的输入特征与上一时刻分段对应的输入特征作差,并将差值特征分别输入到最大池化层和第五全连接层中;The input layer is configured to make a difference between the input feature corresponding to the segment at the current moment and the input feature corresponding to the segment at the previous moment, and input the difference feature into the maximum pooling layer and the fifth fully connected layer respectively;
    the feature difference output layer is configured to sum the output results of the upsampling layer and the sixth fully connected layer to obtain a summation result; multiply the summation result point by point with the input feature corresponding to the segment at the previous moment to obtain a multiplication result; and add the multiplication result to the input feature corresponding to the segment at the previous moment to obtain the output result of the feature difference unit.
  8. 根据权利要求1所述的动作识别系统,其中,所述分类网络包括第一时间平均池化层、第二时间平均池化层、静态特征分类器、动态特征分类器、输出层,The action recognition system according to claim 1, wherein the classification network comprises a first temporal average pooling layer, a second temporal average pooling layer, a static feature classifier, a dynamic feature classifier, an output layer,
    所述第一时间平均池化层被配置为对多个分段对应的静态分类特征进行时间平均池化,并将池化结果输入至所述静态特征分类器中;The first temporal average pooling layer is configured to perform temporal average pooling on static classification features corresponding to multiple segments, and input pooling results into the static feature classifier;
    所述第二时间平均池化层被配置为对多个分段对应的动态分类特征进行时间平均池化,并将池化结果输入至所述动态特征分类器中;The second temporal average pooling layer is configured to perform temporal average pooling on dynamic classification features corresponding to multiple segments, and input pooling results into the dynamic feature classifier;
    所述静态特征分类器被配置为根据所述静态分类特征得到第一分类结果;The static feature classifier is configured to obtain a first classification result according to the static classification feature;
    所述动态特征分类器被配置为根据所述动态分类特征得到第二分类结果;The dynamic feature classifier is configured to obtain a second classification result according to the dynamic classification feature;
    所述识别结果输出层被配置为将所述第一分类结果和所述第二分类结果的加权平均结果作为输出结果。The recognition result output layer is configured to use a weighted average result of the first classification result and the second classification result as an output result.
  9. 一种动作识别模型训练方法,包括:A method for training an action recognition model, comprising:
    获取多个图像序列,所述图像序列中标注有行人动作类型;Obtaining a plurality of image sequences, the image sequences are marked with pedestrian action types;
    将各所述图像序列分为多段子序列,得到训练数据集;Dividing each of the image sequences into multiple subsequences to obtain a training data set;
    inputting the training data set into a neural network system, and training the neural network system until the loss value of the loss function satisfies a loss condition, to obtain the action recognition model, wherein the neural network system is the action recognition system according to any one of claims 1-8.
  10. 根据权利要求9所述的动作识别模型训练方法,其中,The action recognition model training method according to claim 9, wherein,
    所述损失函数采用正交投影损失函数和交叉熵损失函数联合得到。The loss function is jointly obtained by using an orthogonal projection loss function and a cross-entropy loss function.
  11. 根据权利要求10所述的动作识别模型训练方法,其中,所述损失函数通过所述正交投影损失函数与控制正交投影损失权重的超参数之积,与所述交叉熵损失函数相加得到。The action recognition model training method according to claim 10, wherein the loss function is obtained by adding the product of the orthogonal projection loss function and a hyperparameter controlling the weight of the orthogonal projection loss to the cross-entropy loss function .
  12. 一种动作识别方法,包括:A method for action recognition, comprising:
    获取目标对象的图像序列,将所述图像序列分为多段子序列;acquiring an image sequence of the target object, and dividing the image sequence into multiple subsequences;
    将所述子序列输入动作识别模型,生成动作识别结果,所述动作识别模型通过如权利要求9-11中任一项所述的动作识别模型训练方法训练得到。The subsequence is input into an action recognition model to generate an action recognition result, and the action recognition model is trained by the action recognition model training method according to any one of claims 9-11.
  13. 根据权利要求12所述的动作识别方法,其中,在将所述图像序列分为多段子序列的步骤之后,将所述子序列输入动作识别模型的步骤之前,所述方法还包括:The action recognition method according to claim 12, wherein, after the step of dividing the image sequence into multiple subsequences, and before the step of inputting the subsequences into an action recognition model, the method further comprises:
    对所述图像序列中的图像进行等比例缩放,得到缩放图像,所述缩放图像的短边大小位于预设区间内;Performing proportional scaling on the images in the image sequence to obtain a scaled image, the size of the short side of the scaled image is within a preset interval;
    对所述缩放图像进行随机剪裁,得到裁剪图像,所述裁剪图像的大小满足预设条件;Randomly cropping the scaled image to obtain a cropped image, the size of the cropped image satisfies a preset condition;
    将所述裁剪图像作为子序列中的图像,执行将所述子序列输入动作识别模型的步骤。Using the cropped image as an image in a subsequence, the step of inputting the subsequence into an action recognition model is performed.
  14. 一种动作识别模型训练装置,包括:An action recognition model training device, comprising:
    图像获取模块,被配置为获取多个图像序列,所述图像序列中标注有行人动作类型;An image acquisition module configured to acquire a plurality of image sequences, wherein the image sequences are marked with pedestrian action types;
    训练数据获取模块,被配置为将各所述图像序列分为多段子序列,得到训练数据集;The training data acquisition module is configured to divide each of the image sequences into multiple subsequences to obtain a training data set;
    a model training module, configured to input the training data set into a neural network system and train the neural network system to obtain the action recognition model, wherein the neural network system is the action recognition system according to any one of claims 1-8.
  15. 一种动作识别装置,包括:An action recognition device, comprising:
    图像采集模块,被配置为获取目标对象的图像序列,将所述图像序列分为多段子序列;An image acquisition module configured to acquire an image sequence of a target object, and divide the image sequence into multiple subsequences;
    动作识别模块,被配置为将所述子序列输入动作识别模型,生成动作识别结果,所述动作识别模型通过如权利要求9-11中任一项所述的动作识别模型训练方法训练得到。The action recognition module is configured to input the subsequence into an action recognition model to generate an action recognition result, and the action recognition model is trained by the action recognition model training method according to any one of claims 9-11.
  16. 一种计算机设备,包括:A computer device comprising:
    at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so as to run the action recognition system according to any one of claims 1-8, or execute the action recognition model training method according to any one of claims 9-11, or execute the action recognition method according to claim 12 or 13.
  17. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机指令,所述计算机指令用于使所述计算机执行如权利要求1-8中任一项所述的动作识别系统,或,执行如权利要求9-11中任一项所述的动作识别模型训练方法,或,执行如权利要求12或13所述的 动作识别方法。A computer-readable storage medium, the computer-readable storage medium stores computer instructions, the computer instructions are used to make the computer execute the action recognition system according to any one of claims 1-8, or, Execute the action recognition model training method according to any one of claims 9-11, or execute the action recognition method according to claim 12 or 13.
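
The following Python sketch is provided for illustration only. It shows one possible way to realize the preprocessing and inference steps of claims 12 and 13: proportional rescaling so that the short side falls in a preset interval, random cropping to a fixed size, and feeding the segmented frame sequence to a trained model. The model object, segment count, interval bounds, and crop size used here are assumptions made for the example and are not specified by the patent.

import random
from typing import List

import torch
from PIL import Image
from torchvision import transforms


def rescale_short_side(img: Image.Image, short_min: int = 256, short_max: int = 320) -> Image.Image:
    # Proportional scaling: the short side is resized to a value inside the preset interval (assumed bounds).
    target_short = random.randint(short_min, short_max)
    w, h = img.size
    scale = target_short / min(w, h)
    return img.resize((round(w * scale), round(h * scale)))


def random_crop(img: Image.Image, crop_size: int = 224) -> Image.Image:
    # Random cropping to a fixed size that satisfies the preset condition (assumed crop size).
    w, h = img.size
    left = random.randint(0, w - crop_size)
    top = random.randint(0, h - crop_size)
    return img.crop((left, top, left + crop_size, top + crop_size))


def recognize_action(frames: List[Image.Image], model: torch.nn.Module, num_segments: int = 8) -> int:
    # Divide the frame sequence into segments, sample one preprocessed frame per segment,
    # and feed the resulting clip to a trained action recognition model.
    to_tensor = transforms.ToTensor()
    step = max(len(frames) // num_segments, 1)
    sampled = frames[::step][:num_segments]
    clip = torch.stack([to_tensor(random_crop(rescale_short_side(f))) for f in sampled])
    with torch.no_grad():
        scores = model(clip.unsqueeze(0))  # assumed to return class scores of shape (1, num_classes)
    return int(scores.argmax(dim=-1))

In practice the interval bounds, crop size, and number of segments would be chosen to match whatever input the trained action recognition model expects.
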
PCT/CN2022/114819 2022-02-25 2022-08-25 Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium WO2023159898A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210179444.0 2022-02-25
CN202210179444.0A CN114565973A (en) 2022-02-25 2022-02-25 Motion recognition system, method and device and model training method and device

Publications (1)

Publication Number Publication Date
WO2023159898A1 true WO2023159898A1 (en) 2023-08-31

Family

ID=81716472

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/114819 WO2023159898A1 (en) 2022-02-25 2022-08-25 Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN114565973A (en)
WO (1) WO2023159898A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115596A (en) * 2023-10-25 2023-11-24 腾讯科技(深圳)有限公司 Training method, device, equipment and medium of object action classification model

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565973A (en) * 2022-02-25 2022-05-31 全球能源互联网研究院有限公司 Motion recognition system, method and device and model training method and device
CN115115919B (en) * 2022-06-24 2023-05-05 国网智能电网研究院有限公司 Power grid equipment thermal defect identification method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681695A (en) * 2018-04-26 2018-10-19 北京市商汤科技开发有限公司 Video actions recognition methods and device, electronic equipment and storage medium
CN111931603A (en) * 2020-07-22 2020-11-13 北方工业大学 Human body action recognition system and method based on double-current convolution network of competitive combination network
CN113221694A (en) * 2021-04-29 2021-08-06 苏州大学 Action recognition method
CN114565973A (en) * 2022-02-25 2022-05-31 全球能源互联网研究院有限公司 Motion recognition system, method and device and model training method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115596A (en) * 2023-10-25 2023-11-24 腾讯科技(深圳)有限公司 Training method, device, equipment and medium of object action classification model
CN117115596B (en) * 2023-10-25 2024-02-02 腾讯科技(深圳)有限公司 Training method, device, equipment and medium of object action classification model

Also Published As

Publication number Publication date
CN114565973A (en) 2022-05-31

Similar Documents

Publication Publication Date Title
Zeng et al. Multi-scale convolutional neural networks for crowd counting
TWI750498B (en) Method and device for processing video stream
WO2023159898A1 (en) Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium
CN107529650B (en) Closed loop detection method and device and computer equipment
Ye et al. Dynamic texture based smoke detection using Surfacelet transform and HMT model
CN107330390B (en) People counting method based on image analysis and deep learning
CN105488812A (en) Motion-feature-fused space-time significance detection method
WO2022134655A1 (en) End-to-end video action detection and positioning system
CN111160295A (en) Video pedestrian re-identification method based on region guidance and space-time attention
WO2020233397A1 (en) Method and apparatus for detecting target in video, and computing device and storage medium
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN109859246B (en) Low-altitude slow unmanned aerial vehicle tracking method combining correlation filtering and visual saliency
Jiang et al. A self-attention network for smoke detection
US20230095533A1 (en) Enriched and discriminative convolutional neural network features for pedestrian re-identification and trajectory modeling
Patil et al. Multi-frame recurrent adversarial network for moving object segmentation
Angelo A novel approach on object detection and tracking using adaptive background subtraction method
CN108647605B (en) Human eye gaze point extraction method combining global color and local structural features
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
Hua et al. Dynamic scene deblurring with continuous cross-layer attention transmission
CN111079516B (en) Pedestrian gait segmentation method based on deep neural network
CN116645718A (en) Micro-expression recognition method and system based on multi-stream architecture
Toha et al. LC-Net: Localized Counting Network for extremely dense crowds
JP7253967B2 (en) Object matching device, object matching system, object matching method, and computer program
Kalboussi et al. A spatiotemporal model for video saliency detection
Chen et al. Early fire detection using HEP and space-time analysis

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22928164

Country of ref document: EP

Kind code of ref document: A1