WO2023159898A1 - Action recognition system, method and apparatus, model training method and apparatus, computer device and computer-readable storage medium - Google Patents


Info

Publication number
WO2023159898A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
layer
action recognition
displacement
static
Prior art date
Application number
PCT/CN2022/114819
Other languages
English (en)
Chinese (zh)
Inventor
张国梁
杜泽旭
张屹
吴鹏
郑晓崑
Original Assignee
国网智能电网研究院有限公司
国网山东省电力公司枣庄供电公司
国家电网有限公司
Priority date
Filing date
Publication date
Application filed by 国网智能电网研究院有限公司, 国网山东省电力公司枣庄供电公司, 国家电网有限公司
Publication of WO2023159898A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Definitions

  • the present invention is based on a Chinese patent application with application number 202210179444.0 and a filing date of February 25, 2022, and claims the priority of this Chinese patent application.
  • the entire content of this Chinese patent application is hereby incorporated by reference.
  • the present invention relates to the technical field of image processing, and in particular to an action recognition system, method and apparatus, a model training method and apparatus, a computer device, and a computer-readable storage medium.
  • Surveillance cameras are now ubiquitous: they can be found in companies, factories, shopping malls, on roads, and in train stations. However, cameras alone can hardly achieve real-time monitoring of violations and abnormal behaviors. When an abnormal behavior occurs, searching the surveillance video frame by frame is time-consuming and labor-intensive, and events are easily missed. If action recognition technology can detect specific abnormal behaviors in real time, it can greatly save manpower and material resources and improve efficiency. Action recognition therefore has important practical value.
  • a video action recognition algorithm needs to extract temporal information across video frames, so the network model must have temporal modeling capability.
  • deep-learning-based action recognition methods are mainly divided into two categories: methods based on a two-stream network and methods based on a three-dimensional (3D) convolutional network.
  • methods based on the two-stream network use optical flow as temporal information; the optical flow must be computed in advance and stored on the local hard disk, which often requires a large amount of storage for large data sets.
  • the real-time performance of methods based on the two-stream network is also poor.
  • the technical problem to be solved by the present invention is to overcome the defect in the prior art that a large amount of data is required to fit a model for recognizing actions, and thereby to provide an action recognition system, method and apparatus, a model training method and apparatus, a computer device, and a computer-readable storage medium.
  • the first aspect of the embodiment of the present invention provides an action recognition system, including: an information separation network, a static feature network, a dynamic feature network, and a classification network.
  • the information separation network includes a band-pass filter module and a static feature extraction module. The band-pass filter module is configured to extract a dynamic feature map from multiple frames of continuous images in a segment; the static feature extraction module is configured to perform temporal average pooling on the multiple frames of continuous images in the segment to obtain a feature map, and to subtract the dynamic feature map from that feature map to obtain a static feature map; the static feature network is configured to perform feature displacement operations on the static feature maps corresponding to multiple segments, and to calculate the difference features between the static feature maps corresponding to the segments to obtain static classification features;
  • the dynamic feature network is configured to perform feature displacement operations on the dynamic feature maps corresponding to multiple segments, and calculate the difference between the dynamic feature maps corresponding to each segment to obtain dynamic classification features;
  • the classification network is configured to obtain the action recognition result according to the static classification features and the dynamic classification features.
  • the bandpass filter module includes a spatial convolution layer and a temporal convolution layer.
  • At least one of the static feature network and the dynamic feature network includes: an image segmentation module, an initial feature extraction module, and at least one intermediate feature extraction module, and the image segmentation module is configured to segment the input feature map according to a first preset size to obtain a first feature vector;
  • the initial feature extraction module includes a linear embedding sub-module and at least one feature difference and feature displacement sub-module, and the linear embedding sub-module is configured to convert the first feature vector according to a preset number of channels to obtain a second feature vector;
  • the feature difference and feature displacement sub-module is configured to perform a feature displacement operation on the second feature vector, and to calculate the difference features between the second feature vectors corresponding to the segments to obtain initial classification features;
  • the intermediate feature extraction module includes a feature merging sub-module and at least one feature difference and feature displacement sub-module; the feature merging sub-module is configured to merge the initial classification features according to a second preset size to obtain a third feature vector, and the feature difference and feature displacement sub-module is configured to perform a feature displacement operation on the third feature vector and to calculate the difference features between the third feature vectors corresponding to the segments to obtain classification features.
  • At least one of the static feature network and the dynamic feature network includes three intermediate feature extraction modules: a first intermediate feature extraction module, a second intermediate feature extraction module, and a third intermediate feature extraction module.
  • the image segmentation module, the initial feature extraction module, the first intermediate feature extraction module, the second intermediate feature extraction module, and the third intermediate feature extraction module are connected in sequence; the number of feature difference and feature displacement sub-modules in the initial feature extraction module, the first intermediate feature extraction module, and the third intermediate feature extraction module is the same; and the number of feature difference and feature displacement sub-modules in the second intermediate feature extraction module is greater than the number of feature difference and feature displacement sub-modules in the initial feature extraction module, the first intermediate feature extraction module, and the third intermediate feature extraction module.
  • the feature difference and feature displacement sub-module includes a first normalization layer, a feature displacement unit, a feature difference unit, a second normalization layer, a first fully connected layer, a first GELU function layer, and a second fully connected layer, which are connected in sequence; the input data of the second normalization layer is a first residual calculation result, and the first residual calculation result is calculated from the input data of the first normalization layer and the output data of the feature difference unit; the output data of the feature difference and feature displacement sub-module is a second residual calculation result, and the second residual calculation result is calculated from the input data of the second normalization layer and the output data of the second fully connected layer.
  • the feature displacement unit includes a first channel fully connected layer, a horizontal feature displacement layer, a second channel fully connected layer, a vertical feature displacement layer, a third channel fully connected layer, and a fourth channel fully connected layer. The first channel fully connected layer is configured to fully connect the channels of the input data to obtain a fully connected result, and to input the fully connected result into the horizontal feature displacement layer and the vertical feature displacement layer respectively; the horizontal feature displacement layer is configured to perform horizontal displacement on the fully connected result to obtain a horizontal displacement result and input it into the second channel fully connected layer; the vertical feature displacement layer is configured to perform vertical displacement on the fully connected result to obtain a vertical displacement result and input it into the third channel fully connected layer; the fourth channel fully connected layer is configured to process the sum of the output results of the second channel fully connected layer and the third channel fully connected layer to obtain the output result of the feature displacement unit.
  • the feature difference unit includes an input layer, a maximum pooling layer, a third fully connected layer, a second GELU function layer, a fourth fully connected layer, an upsampling layer, a fifth fully connected layer, a third GELU function layer, a sixth fully connected layer, and a feature difference output layer. The input layer, the maximum pooling layer, the third fully connected layer, the second GELU function layer, the fourth fully connected layer, the upsampling layer, and the feature difference output layer are connected in sequence; the input layer, the fifth fully connected layer, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer are also connected in sequence. The input layer is configured to compute the difference between the input features corresponding to the segment at the current moment and the input features corresponding to the segment at the previous moment, and to input the difference features into the maximum pooling layer and the fifth fully connected layer respectively; the feature difference output layer is configured to sum the output results of the upsampling layer and the sixth fully connected layer, multiply the summation result point by point with the input features corresponding to the segment at the previous moment, and add the multiplication result to those input features to obtain the output result of the feature difference unit.
  • the classification network includes a first temporal average pooling layer, a second temporal average pooling layer, a static feature classifier, a dynamic feature classifier, and a recognition result output layer.
  • the first temporal average pooling layer is configured to perform temporal average pooling on the static classification features corresponding to multiple segments and to input the pooling result into the static feature classifier; the second temporal average pooling layer is configured to perform temporal average pooling on the dynamic classification features corresponding to the multiple segments and to input the pooling result into the dynamic feature classifier;
  • the static feature classifier is configured to obtain the first classification result according to the static classification feature;
  • the dynamic feature classifier is configured to obtain the second classification result according to the dynamic classification features;
  • the recognition result output layer is configured to take the weighted average result of the first classification result and the second classification result as the output result.
  • the second aspect of the embodiment of the present invention provides an action recognition model training method, including: acquiring multiple image sequences in which the types of pedestrian actions are marked; dividing each image sequence into multiple sub-sequences to obtain a training data set; and inputting the training data set into a neural network system and training the neural network system until the loss value of the loss function satisfies a loss condition, to obtain an action recognition model.
  • the neural network system is the action recognition system provided in the first aspect of the embodiment of the present invention.
  • the loss function is jointly obtained by using an orthogonal projection loss function and a cross-entropy loss function.
  • the loss function is obtained by adding the product of the orthogonal projection loss function and a hyperparameter controlling the weight of the orthogonal projection loss to the cross-entropy loss function.
  • the third aspect of the embodiment of the present invention provides an action recognition method, including: acquiring an image sequence of a target object and dividing the image sequence into multiple sub-sequences; and inputting the sub-sequences into an action recognition model to generate an action recognition result, the action recognition model being obtained through training by the action recognition model training method provided in the second aspect of the embodiment of the present invention.
  • the method further includes: proportionally scaling the images in the image sequence to obtain scaled images whose short side is within a preset range; randomly cropping the scaled images to obtain cropped images whose size meets a preset condition; and using the cropped images as the images in the sub-sequences before executing the step of inputting the sub-sequences into the action recognition model.
  • the fourth aspect of the embodiment of the present invention provides an action recognition model training device, including: an image acquisition module configured to acquire a plurality of image sequences in which pedestrian action types are marked; a training data acquisition module configured to divide each image sequence into multiple sub-sequences to obtain a training data set; and a model training module configured to input the training data set into a neural network system and train the neural network system to obtain an action recognition model, the neural network system being the action recognition system provided in the first aspect of the embodiment of the present invention.
  • the fifth aspect of the embodiment of the present invention provides an action recognition device, including: an image acquisition module configured to acquire an image sequence of a target object and divide the image sequence into multiple sub-sequences; and an action recognition module configured to input the sub-sequences into an action recognition model to generate an action recognition result, the action recognition model being trained by the action recognition model training method provided in the second aspect of the embodiment of the present invention.
  • the sixth aspect of the embodiment of the present invention provides a computer device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so as to run the action recognition system provided in the first aspect of the embodiment of the present invention, or to execute the action recognition model training method provided in the second aspect of the embodiment of the present invention, or to execute the action recognition method provided in the third aspect of the embodiment of the present invention.
  • the seventh aspect of the embodiment of the present invention provides a computer-readable storage medium storing computer instructions, and the computer instructions are configured to cause a computer to run the action recognition system provided in the first aspect of the embodiment of the present invention, or to execute the action recognition model training method provided in the second aspect of the embodiment of the present invention, or to execute the action recognition method provided in the third aspect of the embodiment of the present invention.
  • the action recognition system includes an information separation network, a static feature network, a dynamic feature network, and a classification network.
  • the information separation network separates the dynamic feature map and the static feature map in the images, and the dynamic feature map and the static feature map are then input into the dynamic feature network and the static feature network respectively.
  • the static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. By performing displacement operations on the feature maps, the network can acquire spatial local information with a small amount of computation, which ensures the running speed of the network.
  • By calculating the difference features between the feature maps, the long-term temporal relationships in the video can be captured, so that the network has temporal modeling capability, thereby ensuring the accuracy of action recognition.
  • the action recognition system provided by the embodiment of the present invention can use less data for training to obtain an action recognition model, and the action recognition model trained by the action recognition system can realize accurate recognition of actions.
  • in the action recognition model training method and device, after the training data set is obtained, the training data set is input into the action recognition system provided in the first aspect of the embodiment of the present invention, and the action recognition system is trained to obtain the action recognition model.
  • the action recognition system provided by the first aspect of the embodiment of the present invention includes an information separation network, a static feature network, a dynamic feature network, and a classification network.
  • the action recognition system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively. By analyzing the dynamic feature map separately, the short-term temporal information of the video can be captured; by analyzing the static feature map separately, the static scene in the video can be identified.
  • the static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operations allow the network to acquire spatial local information with a small amount of computation, which ensures the running speed of the network.
  • the action recognition model training method provided by the embodiment of the present invention can therefore obtain an action recognition model with less training data, and the action recognition model trained by the action recognition model training method and device provided by the embodiment of the present invention can accurately recognize actions.
  • the action recognition method and device provided by the embodiments of the present invention divide the image sequence of the target object into multiple sub-sequences after acquiring it, and input the sub-sequences into the action recognition model trained by the action recognition model training method provided in the second aspect of the embodiment of the present invention.
  • the action recognition model training method provided in the second aspect of the embodiment of the present invention obtains the action recognition model by training the action recognition system.
  • the action recognition system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video.
  • the static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features.
  • the displacement operations allow the network to acquire spatial local information with a small amount of computation, ensuring the running speed of the network.
  • calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition. It can be seen that the action recognition model trained by the action recognition model training method provided in the second aspect of the embodiment of the present invention can accurately recognize actions; therefore, implementing the embodiments of the present invention enables accurate recognition of actions.
  • Fig. 1 is a functional block diagram of an example of an action recognition system in an embodiment of the present invention
  • FIG. 2 is a functional block diagram of an example of a static feature network and/or a dynamic feature network in an embodiment of the present invention
  • Fig. 3 is a functional block diagram of an example of the feature difference and feature displacement sub-module in the embodiment of the present invention.
  • Fig. 4 is a functional block diagram of an example of a feature displacement unit in an embodiment of the present invention.
  • FIG. 5 is a functional block diagram of an example of a feature difference unit in an embodiment of the present invention.
  • Fig. 6 is a flowchart of an example of an action recognition model training method in an embodiment of the present invention.
  • FIG. 7 is a flowchart of an example of an action recognition method in an embodiment of the present invention.
  • Fig. 8 is a functional block diagram of an example of an action recognition model training device in an embodiment of the present invention.
  • FIG. 9 is a functional block diagram of an example of an action recognition device in an embodiment of the present invention.
  • FIG. 10 is a functional block diagram of an example of computer equipment in an embodiment of the present invention.
  • An embodiment of the present invention provides an action recognition system, as shown in FIG. 1 , including: an information separation network 11, a static feature network 12, a dynamic feature network 13, and a classification network 14.
  • the information separation network 11 includes a band-pass filter module 111 and a static feature extraction module 112.
  • the band-pass filtering module 111 is configured to extract a dynamic feature map according to the acquired multi-frame continuous images in a segment.
  • the bandpass filter module 111 sequentially extracts the dynamic feature map corresponding to each segment based on the continuous images in each segment.
  • the static feature extraction module 112 is configured to perform temporal average pooling on multiple frames of continuous images in a segment to obtain a feature map, and make a difference between the feature map and the dynamic feature map to obtain a static feature map.
  • the static feature extraction module 112 first performs temporal average pooling on the multiple frames of continuous images in a segment through a temporal average pooling layer to obtain a feature map with a time dimension of 1, and then subtracts the dynamic feature map corresponding to that segment from this feature map to obtain the static feature map of the segment.
  • the static feature extraction module 112 sequentially extracts the static feature maps corresponding to each segment based on the continuous images in each segment.
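  • as a rough illustration, the separation described above can be sketched as follows (a minimal sketch assuming PyTorch and a segment tensor of shape [batch, channels, frames, height, width]; the function and argument names are illustrative and not taken from the patent):

```python
import torch

def separate_information(segment: torch.Tensor, bandpass_filter):
    """Split one segment into a dynamic and a static feature map.

    segment: tensor of shape [B, C, K, H, W], the K consecutive frames of a segment.
    bandpass_filter: a module implementing the band-pass filtering module 111
    (spatial LoG convolution followed by a temporal convolution).
    """
    # Dynamic feature map: response of the band-pass filter to the segment.
    dynamic = bandpass_filter(segment)

    # Static feature map: temporal average pooling (time dimension collapsed to 1)
    # minus the dynamic feature map.
    pooled = segment.mean(dim=2, keepdim=True)
    static = pooled - dynamic
    return dynamic, static
```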
  • the static feature network 12 is configured to perform feature displacement operations on the static feature maps corresponding to multiple segments, and calculate the difference features between the static feature maps corresponding to each segment to obtain static classification features.
  • when action recognition is performed through the action recognition system provided by the embodiment of the present invention, the static feature network 12 performs the feature displacement operation on the static feature map corresponding to each segment independently; that is, the feature displacement operation on the static feature map of any segment is not affected by the static feature maps corresponding to other segments. However, when the static feature network 12 calculates the difference features between the static feature maps corresponding to the segments, it needs to combine the static feature maps corresponding to two adjacent segments to calculate the difference features.
  • the dynamic feature network 13 is configured to perform feature displacement operations on the dynamic feature maps corresponding to multiple segments, and to calculate the difference features between the dynamic feature maps corresponding to the segments to obtain dynamic classification features.
  • similarly, the dynamic feature network 13 performs the feature displacement operation on the dynamic feature map corresponding to each segment independently; that is, the feature displacement operation on the dynamic feature map of any segment is not affected by the dynamic feature maps corresponding to other segments.
  • when the dynamic feature network 13 calculates the difference features between the dynamic feature maps corresponding to the segments, it needs to combine the dynamic feature maps corresponding to two adjacent segments to calculate the difference features.
  • the classification network 14 is configured to obtain action recognition results according to static classification features and dynamic classification features.
  • the classification network includes a classifier, and the static classification feature and the dynamic classification feature are analyzed by the classifier to obtain an action recognition result.
  • the action recognition system includes an information separation network 11, a static feature network 12, a dynamic feature network 13, and a classification network 14.
  • the dynamic feature map and the static feature map in the images are separated through the information separation network 11 and are then input into the dynamic feature network 13 and the static feature network 12 respectively. Analyzing the dynamic feature map separately captures the short-term temporal information of the video, and analyzing the static feature map separately identifies the static scene in the video.
  • the static feature network 12 and the dynamic feature network 13 perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features.
  • the displacement operations allow the network to acquire spatial local information with a small amount of computation, ensuring the running speed of the network, and calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition.
  • it can be seen that the action recognition system provided by the embodiment of the present invention can obtain an action recognition model with less training data, and the action recognition model trained by the action recognition system can accurately recognize actions.
  • the bandpass filtering module 111 includes a spatial convolution layer and a temporal convolution layer.
  • LoG_σ(x, y) represents the Laplacian-of-Gaussian operator with parameter σ:

    LoG_σ(x, y) = -(1 / (π·σ⁴)) · (1 - (x² + y²) / (2σ²)) · exp(-(x² + y²) / (2σ²))

  • K represents the number of image frames in a segment, and · represents multiplication.
  • the spatial convolution layer uses a convolution kernel of size k×k, initializes its parameters with the Laplacian-of-Gaussian operator with parameter σ, and normalizes the sum of the kernel's parameter values to 1; the temporal convolution layer has a time step of s, and its convolution kernel is initialized to a preset fixed value.
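  • a hedged sketch of such a band-pass filter module in PyTorch (the kernel size k, the parameter σ, the temporal stride s, and the depthwise layout are illustrative assumptions; the exact normalization and the temporal-kernel initialization used by the patent are not reproduced here):

```python
import math
import torch
import torch.nn as nn

def log_kernel(k: int, sigma: float) -> torch.Tensor:
    """k x k Laplacian-of-Gaussian kernel with parameter sigma."""
    ax = torch.arange(k, dtype=torch.float32) - (k - 1) / 2
    yy, xx = torch.meshgrid(ax, ax, indexing="ij")
    r2 = xx ** 2 + yy ** 2
    return -(1.0 / (math.pi * sigma ** 4)) * (1 - r2 / (2 * sigma ** 2)) \
           * torch.exp(-r2 / (2 * sigma ** 2))

class BandPassFilter(nn.Module):
    """Spatial LoG convolution followed by a temporal convolution (sketch)."""
    def __init__(self, channels: int, k: int = 5, sigma: float = 1.0, s: int = 2):
        super().__init__()
        # Depthwise spatial convolution, one LoG kernel per channel.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2),
                                 groups=channels, bias=False)
        w = log_kernel(k, sigma)
        # The text above also normalizes the kernel's parameter sum to 1;
        # that normalization step is omitted in this sketch.
        self.spatial.weight.data.copy_(w.view(1, 1, 1, k, k).expand_as(self.spatial.weight))
        # Temporal convolution with time step s; the patent initializes this
        # kernel to a specific preset value, so the default init is kept here.
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(s, 1, 1),
                                  stride=(s, 1, 1), groups=channels, bias=False)

    def forward(self, x):  # x: [B, C, K, H, W]
        return self.temporal(self.spatial(x))
```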
  • the static feature network 12 and the dynamic feature network 13 can have the same network structure, and can also have different network structures.
  • when the static feature network 12 and the dynamic feature network 13 have the same network structure, the values of their network parameters are different.
  • At least one of the static feature network 12 and the dynamic feature network 13 includes: an image segmentation module 121 , an initial feature extraction module 122 , and at least one intermediate feature extraction module 123 .
  • the image segmentation module 121 is configured to segment the input feature map according to a first preset size to obtain a first feature vector.
  • the input data of the static feature network 12 is the static feature map, and the image segmentation module 121 in the static feature network 12 segments the static feature map; the input data of the dynamic feature network 13 is the dynamic feature map, and the image segmentation module 121 in the dynamic feature network 13 segments the dynamic feature map.
  • the input feature map has size H×W, where H and W represent the height and width of the image and 3 represents the number of image channels.
  • the image segmentation module 121 divides the image into non-overlapping blocks of size 4×4 and flattens each 4×4 block into a vector, producing a feature of size (H/4)×(W/4)×48, where (H/4)×(W/4) indicates the number of blocks and 48 (= 4×4×3) is the number of channels.
  • the initial feature extraction module 122 includes a linear embedding submodule 1221 and at least one feature difference and feature displacement submodule 1222.
  • the linear embedding submodule 1221 is configured to convert the first feature vector according to the preset number of channels to obtain the second feature vector;
  • the feature difference and feature displacement sub-module 1222 is configured to perform a feature displacement operation on the second feature vector, and calculate the difference feature between the second feature vectors corresponding to each segment to obtain the initial classification feature.
  • the linear embedding sub-module 1221 projects the first feature vector to a feature of size (H/4)×(W/4)×C, where C represents the number of channels.
  • the initial feature extraction module 122 includes two continuous feature difference and feature displacement sub-modules 1222, after processing the first feature vector through the linear embedding sub-module 1221 to obtain the second feature vector, through Two consecutive feature difference and feature displacement sub-modules 1222 process the second feature vector to obtain initial classification features.
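  • a minimal sketch (assuming PyTorch; the embedding dimension C = 96 is an assumed value) of the image segmentation module 121 followed by the linear embedding sub-module 1221 for a 4×4 block size:

```python
import torch
import torch.nn as nn

class PatchPartitionAndEmbed(nn.Module):
    """4x4 patch partition followed by linear embedding (sketch).

    An H x W x 3 feature map is split into non-overlapping 4x4 blocks; each
    block is flattened to a 48-dimensional vector (4*4*3) and then projected
    to C channels by the linear embedding sub-module.
    """
    def __init__(self, embed_dim: int = 96):  # C = 96 is an assumed value
        super().__init__()
        self.linear_embed = nn.Linear(4 * 4 * 3, embed_dim)

    def forward(self, x):          # x: [B, 3, H, W], H and W divisible by 4
        b, c, h, w = x.shape
        x = x.reshape(b, c, h // 4, 4, w // 4, 4)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(b, (h // 4) * (w // 4), 4 * 4 * c)
        return self.linear_embed(x)   # [B, (H/4)*(W/4), C]
```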
  • the intermediate feature extraction module 123 includes a feature merging submodule 1231 and at least one feature difference and feature displacement submodule 1222.
  • the feature merging submodule 1231 is configured to merge the initial classification features according to the second preset size to obtain a third feature vector;
  • the feature difference and feature displacement sub-module 1222 is configured to perform a feature displacement operation on the third feature vector, and calculate a difference feature between the third feature vectors corresponding to each segment to obtain a classification feature.
  • the feature merging sub-module 1231 in the intermediate feature extraction module 123 merges the features obtained in the previous stage into 2×2 blocks and concatenates each block into a single vector, halving the spatial resolution in each dimension; the merged features then pass through at least one feature difference and feature displacement sub-module 1222 and are output.
  • the feature difference and feature displacement sub-modules 1222 in different intermediate feature extraction modules 123 may be the same or different.
  • the number of feature difference and feature displacement sub-modules 1222 in the intermediate feature extraction module 123 may be 2, 6, and so on.
  • At least one of the static feature network 12 and the dynamic feature network 13 includes three intermediate feature extraction modules 123: a first intermediate feature extraction module 123, a second intermediate feature extraction module 123, and a third intermediate feature extraction module 123.
  • the image segmentation module 121, the initial feature extraction module 122, the first intermediate feature extraction module 123, the second intermediate feature extraction module 123, and the third intermediate feature extraction module 123 are connected in sequence. That is, in the embodiment of the present invention, the output data of the image segmentation module 121 is the input data of the initial feature extraction module 122, the output data of the initial feature extraction module 122 is the input data of the first intermediate feature extraction module 123, and the first intermediate feature The output data of the extraction module 123 is the input data of the second intermediate feature extraction module 123 , and the output data of the second intermediate feature extraction module 123 is the input data of the third intermediate feature extraction module 123 .
  • the number of feature difference and feature displacement sub-modules 1222 in the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123 is the same.
  • the number of feature difference and feature displacement submodules 1222 in the second intermediate feature extraction module 123 is greater than the feature difference and feature displacement submodules in the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123 The number of 1222.
  • the number of feature difference and feature displacement sub-modules 1222 in the initial feature extraction module 122, the first intermediate feature extraction module 123, and the third intermediate feature extraction module 123 is 2, and the number of feature difference and feature displacement sub-modules 1222 in the second intermediate feature extraction module 123 is 6.
  • after the 4×4 image segmentation and the three successive 2×2 feature merging operations, the final output features of the static feature network 12 and the dynamic feature network 13 have a spatial resolution of (H/32)×(W/32).
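  • the resulting stage layout can be sketched as follows (assuming PyTorch; make_fdfs_block and make_patch_merging are placeholders for the feature difference and feature displacement sub-module 1222 and the feature merging sub-module 1231 described below, and the channel growth that normally accompanies merging, as well as the segment-wise data flow, are omitted for brevity):

```python
import torch.nn as nn

def build_feature_network_stages(make_fdfs_block, make_patch_merging, dim: int) -> nn.ModuleList:
    """Stage layout of the static / dynamic feature network (structural sketch only).

    make_fdfs_block(dim) -> one feature difference and feature displacement sub-module;
    make_patch_merging(dim) -> one 2x2 feature merging sub-module.
    Depths follow the counts given above: 2 (initial), then 2, 6, 2 (intermediate).
    """
    depths = (2, 2, 6, 2)
    stages = nn.ModuleList()
    for i, depth in enumerate(depths):
        blocks = nn.ModuleList([make_fdfs_block(dim) for _ in range(depth)])
        if i == 0:
            # Initial feature extraction module: linear embedding (handled
            # before this function) followed by two sub-modules.
            stages.append(blocks)
        else:
            # Intermediate feature extraction module: feature merging + sub-modules.
            stages.append(nn.ModuleList([make_patch_merging(dim), blocks]))
    return stages
```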
  • the feature difference and feature displacement sub-module 1222 includes a first normalization layer, a feature displacement unit, a feature difference unit, a second normalization layer, a first fully connected layer, a first GELU function layer, and a second fully connected layer, which are connected in sequence.
  • the input data of the second normalization layer is the first residual calculation result, and the first residual calculation result is calculated through the input data of the first normalization layer and the output data of the feature difference unit.
  • the output data of the feature difference and feature displacement sub-module 1222 is the second residual calculation result, which is calculated through the input data of the second normalization layer and the output data of the second fully connected layer.
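  • a hedged sketch of this wiring (assuming PyTorch; the residual connections are assumed to be additive, the MLP expansion ratio of 4 is an assumed value, and the way the previous segment's features reach the feature difference unit is an interpretation of the description above):

```python
import torch
import torch.nn as nn

class FDFSBlock(nn.Module):
    """Feature difference and feature displacement sub-module (sketch).

    LayerNorm -> feature displacement unit -> feature difference unit, added
    back to the block input (first residual result); then LayerNorm -> Linear
    -> GELU -> Linear, added back to the first residual result (second residual).
    """
    def __init__(self, dim: int, shift_unit: nn.Module, diff_unit: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.shift_unit = shift_unit      # feature displacement unit
        self.diff_unit = diff_unit        # feature difference unit
        self.norm2 = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim * mlp_ratio)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim * mlp_ratio, dim)

    def forward(self, x, x_prev):
        # x, x_prev: features of the current and previous segment.
        y = self.diff_unit(self.shift_unit(self.norm1(x)),
                           self.shift_unit(self.norm1(x_prev)))
        r1 = x + y                                                # first residual result
        r2 = r1 + self.fc2(self.act(self.fc1(self.norm2(r1))))   # second residual result
        return r2
```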
  • the feature displacement unit includes a first channel fully connected layer, a horizontal feature displacement layer, a second channel fully connected layer, a vertical feature displacement layer, a third channel fully connected layer, and a fourth channel fully connected layer. The horizontal feature displacement layer is connected to the second channel fully connected layer, and the vertical feature displacement layer is connected to the third channel fully connected layer; the structure formed by the horizontal feature displacement layer and the second channel fully connected layer and the structure formed by the vertical feature displacement layer and the third channel fully connected layer are arranged in parallel.
  • the first channel full connection layer is configured to perform full connection on the channels of the input data to obtain a full connection result, and input the full connection result to the horizontal feature displacement layer and the vertical feature displacement layer respectively.
  • the horizontal feature displacement layer is configured to perform horizontal displacement on the fully connected result to obtain the horizontal displacement result, and input the horizontal displacement result into the second channel fully connected layer.
  • the full connection result has three dimensions: height, width, and channel.
  • if the number of displacement groups is 3 and the displacement size is 1, the channel dimension of the fully connected result is divided into 3 groups, and the feature maps of the groups are shifted horizontally according to [+1, 0, -1]: the first group is shifted by one unit in one horizontal direction, the second group is left unchanged, and the third group is shifted by one unit in the opposite horizontal direction; the positions vacated by each shift are filled with zeros.
  • if the number of displacement groups is 5 and the displacement size is 2, the channel feature maps of the groups are shifted horizontally according to [+4, +2, 0, -2, -4], and the vacated positions are zero-filled.
  • the vertical feature displacement layer is configured to perform vertical displacement on the fully connected result to obtain the vertical displacement result, and input the vertical displacement result into the third channel fully connected layer.
  • the only difference between performing vertical displacement on the fully connected result and horizontal displacement on the fully connected result is that the moving direction of the vertical displacement is the vertical direction, and the moving direction of the horizontal displacement is the horizontal direction.
  • the fourth channel fully connected layer is configured to process the sum of the output results of the second channel fully connected layer and the third channel fully connected layer to obtain the output result of the feature displacement unit.
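  • a hedged sketch of the feature displacement unit (assuming PyTorch and a [batch, height, width, channel] layout; the sign convention of the shifts and the use of torch.roll plus zero-filling are implementation assumptions):

```python
import torch
import torch.nn as nn

def grouped_shift(x, groups: int = 3, size: int = 1, dim: int = -2):
    """Shift channel groups along one spatial dimension with zero fill.

    x: [B, H, W, C]. With groups=3 and size=1 the per-group offsets are
    [+1, 0, -1]; with groups=5 and size=2 they are [+4, +2, 0, -2, -4].
    """
    chunks = torch.chunk(x, groups, dim=-1)
    offsets = [size * (groups // 2 - i) for i in range(groups)]
    out = []
    for chunk, off in zip(chunks, offsets):
        shifted = torch.roll(chunk, shifts=off, dims=dim)
        # Zero-fill the positions vacated by the shift (torch.roll wraps around).
        if off > 0:
            shifted.narrow(dim, 0, off).zero_()
        elif off < 0:
            shifted.narrow(dim, shifted.size(dim) + off, -off).zero_()
        out.append(shifted)
    return torch.cat(out, dim=-1)

class FeatureDisplacementUnit(nn.Module):
    """Channel FC -> parallel horizontal/vertical shifts -> channel FCs -> sum -> channel FC."""
    def __init__(self, dim: int, groups: int = 3, size: int = 1):
        super().__init__()
        self.fc_in = nn.Linear(dim, dim)    # first channel fully connected layer
        self.fc_h = nn.Linear(dim, dim)     # second channel fully connected layer
        self.fc_v = nn.Linear(dim, dim)     # third channel fully connected layer
        self.fc_out = nn.Linear(dim, dim)   # fourth channel fully connected layer
        self.groups, self.size = groups, size

    def forward(self, x):                   # x: [B, H, W, C]
        y = self.fc_in(x)
        h = self.fc_h(grouped_shift(y, self.groups, self.size, dim=-2))  # width axis
        v = self.fc_v(grouped_shift(y, self.groups, self.size, dim=-3))  # height axis
        return self.fc_out(h + v)
```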
  • the feature difference unit includes: an input layer, a maximum pooling layer, a third fully connected layer, a second GELU function layer, a fourth fully connected layer, an upsampling layer, a fifth fully connected layer, a third GELU function layer, a sixth fully connected layer, and a feature difference output layer.
  • the input layer, the maximum pooling layer, the third fully connected layer, the second GELU function layer, the fourth fully connected layer, the upsampling layer, and the feature difference output layer are connected in sequence.
  • the input layer, the fifth fully connected layer, the third GELU function layer, the sixth fully connected layer, and the feature difference output layer are connected in sequence.
  • the input layer is configured to make a difference between the input feature corresponding to the segment at the current moment and the input feature corresponding to the segment at the previous moment, and input the difference feature into the maximum pooling layer and the fifth fully connected layer respectively.
  • the feature difference output layer is configured to sum the output results of the upsampling layer and the sixth fully connected layer to obtain a summation result, multiply the summation result point by point with the input features corresponding to the segment at the previous moment to obtain a multiplication result, and add the multiplication result to the input features corresponding to the segment at the previous moment to obtain the output result of the feature difference unit.
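  • a hedged sketch of the feature difference unit (assuming PyTorch and a [batch, height, width, channel] layout; the pooling factor, the hidden width of the fully connected layers, and the nearest-neighbour upsampling are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDifferenceUnit(nn.Module):
    """Feature difference unit (sketch).

    The difference between the current- and previous-segment features is fed
    through a downsampled branch (max pooling -> FC -> GELU -> FC -> upsampling)
    and a direct branch (FC -> GELU -> FC); their sum modulates the
    previous-segment features, which are then added back.
    """
    def __init__(self, dim: int, pool: int = 2, hidden: int = 0):
        super().__init__()
        hidden = hidden or max(dim // 4, 1)
        self.pool = pool
        self.fc3 = nn.Linear(dim, hidden)
        self.fc4 = nn.Linear(hidden, dim)
        self.fc5 = nn.Linear(dim, hidden)
        self.fc6 = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, x_cur, x_prev):            # [B, H, W, C] each
        diff = x_cur - x_prev                    # input layer: difference feature
        # Branch A: max pooling over space, two FC layers, then upsampling.
        a = F.max_pool2d(diff.permute(0, 3, 1, 2), self.pool).permute(0, 2, 3, 1)
        a = self.fc4(self.act(self.fc3(a)))
        a = F.interpolate(a.permute(0, 3, 1, 2), size=x_cur.shape[1:3]).permute(0, 2, 3, 1)
        # Branch B: two FC layers on the full-resolution difference.
        b = self.fc6(self.act(self.fc5(diff)))
        s = a + b                                # feature difference output layer: sum
        # Modulate the previous-segment features and add them back, as described above.
        return x_prev + s * x_prev
```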
  • the classification network 14 includes a first temporal average pooling layer 141, a second temporal average pooling layer 143, a static feature classifier 142, a dynamic feature classifier 144, and a recognition result output layer 145.
  • the first temporal average pooling layer 141 is configured to perform temporal average pooling on static classification features corresponding to multiple segments, and input the pooling result to the static feature classifier 142 .
  • the second temporal average pooling layer 143 is configured to perform temporal average pooling on the dynamic classification features corresponding to multiple segments, and input the pooling result to the dynamic feature classifier 144 .
  • when the action recognition system provided by the embodiment of the present invention is used for action recognition, the information separation network 11, the static feature network 12, and the dynamic feature network 13 sequentially process the N segments to obtain N static classification features and N dynamic classification features.
  • the first temporal average pooling layer 141 performs temporal average pooling on the N static classification features to obtain a time-averaged static classification feature, and the second temporal average pooling layer 143 performs temporal average pooling on the N dynamic classification features to obtain a time-averaged dynamic classification feature; using the static and dynamic classification features carrying this temporal attribute, more accurate recognition of actions can be achieved.
  • the static feature classifier 142 is configured to obtain a first classification result according to the static classification feature.
  • the dynamic feature classifier 144 is configured to obtain a second classification result according to the dynamic classification feature.
  • the recognition result output layer 145 is configured to take the weighted average result of the first classification result and the second classification result as an output result.
  • the static feature classifier 142 and the dynamic feature classifier 144 are Softmax classifiers.
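  • a minimal sketch of the classification network 14 (assuming PyTorch, the Softmax classifiers mentioned above, and an equal 0.5/0.5 weighting, which is an assumption since only a weighted average is specified):

```python
import torch
import torch.nn as nn

class ClassificationNetwork(nn.Module):
    """Temporal average pooling + two Softmax classifiers + weighted average (sketch).

    static_feats / dynamic_feats: [B, N, D] classification features for the N segments.
    """
    def __init__(self, dim: int, num_classes: int, w_static: float = 0.5):
        super().__init__()
        self.static_classifier = nn.Linear(dim, num_classes)
        self.dynamic_classifier = nn.Linear(dim, num_classes)
        self.w_static = w_static

    def forward(self, static_feats, dynamic_feats):
        # Temporal average pooling over the N segments, then classification.
        s = self.static_classifier(static_feats.mean(dim=1)).softmax(dim=-1)
        d = self.dynamic_classifier(dynamic_feats.mean(dim=1)).softmax(dim=-1)
        # Recognition result output layer: weighted average of the two results.
        return self.w_static * s + (1.0 - self.w_static) * d
```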
  • An embodiment of the present invention provides a method for training an action recognition model, as shown in FIG. 6 , including:
  • Step S21 Obtain multiple image sequences, in which the types of pedestrian actions are marked.
  • Step S22 Divide each image sequence into multiple subsequences to obtain a training data set.
  • Step S23 Input the training data set into the neural network system and train the neural network system until the loss value of the loss function meets the loss condition, to obtain the action recognition model. The neural network system is the action recognition system provided in any of the above-mentioned embodiments; for details about the action recognition system, refer to the above-mentioned embodiments.
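  • a hedged sketch of step S23 (assuming PyTorch; the optimizer, learning rate, epoch budget, and the concrete loss condition are illustrative assumptions, and loss_fn stands for the loss function described below):

```python
import torch

def train_action_recognition(model, loader, loss_fn, epochs: int = 50,
                             lr: float = 1e-4, loss_threshold: float = 0.05):
    """Train the neural network system until the loss condition is met (sketch).

    loader yields (subsequences, labels) batches from the training data set.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(epochs):
        total, count = 0.0, 0
        for clips, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(clips), labels)
            loss.backward()
            optimizer.step()
            total, count = total + loss.item(), count + 1
        # Stop once the average loss satisfies the (assumed) loss condition.
        if total / max(count, 1) < loss_threshold:
            break
    return model
```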
  • the initialized action recognition system is trained to obtain an action recognition model.
  • the Laplacian-of-Gaussian operator is used to initialize the band-pass filter module 111, and a pre-trained model is used to initialize the feature difference and feature displacement networks.
  • the pre-trained model is obtained through pre-training on ImageNet or other large data sets.
  • the action recognition system can be fully trained with a large-scale data set, or fine-tuned based on a pre-trained model.
  • the training data set is input into the action recognition system provided in the above embodiment, and the action recognition system is trained to obtain the action recognition model.
  • the action recognition system provided in the above embodiment includes an information separation network, a static feature network, a dynamic feature network, and a classification network.
  • the action recognition system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively.
  • the static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features.
  • by performing displacement operations on the feature maps, the network can acquire spatial local information with a small amount of computation, which ensures the running speed of the network.
  • by calculating the difference features between the feature maps, the long-term temporal relationships in the video can be captured, so that the network has temporal modeling capability, thereby ensuring the accuracy of action recognition.
  • it can be seen that the action recognition model training method provided by the embodiment of the present invention can obtain an action recognition model with less training data, and the action recognition model trained by this method can accurately recognize actions.
  • the loss function is jointly obtained by using an orthogonal projection loss function and a cross-entropy loss function.
  • the loss function is obtained by adding the product of the orthogonal projection loss function and a hyperparameter controlling the weight of the orthogonal projection loss to the cross-entropy loss function. Combining the orthogonal projection loss L_opl with the cross-entropy loss L_ce gives the final loss function L:

    L = L_ce + λ · L_opl

  • λ is the hyperparameter controlling the weight of the orthogonal projection loss.
  • the orthogonal projection loss function is combined with the cross-entropy loss to obtain the final loss function, and the orthogonal projection loss function is introduced to orthogonalize the middle layer features to achieve the effect of inter-class separation and intra-class clustering , so that the trained action recognition model can recognize actions more accurately.
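  • a hedged sketch of the combined loss (assuming PyTorch; the text above does not reproduce the exact orthogonal projection loss formula, so the implementation below follows a commonly used form that pulls same-class features together and pushes different-class features towards orthogonality, and λ = 0.5 is an assumed value):

```python
import torch
import torch.nn.functional as F

def orthogonal_projection_loss(features, labels):
    """One common form of orthogonal projection loss (assumption, see lead-in)."""
    f = F.normalize(features, dim=1)             # [B, D] unit-norm features
    sim = f @ f.t()                               # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), device=labels.device).bool()
    pos_mask, neg_mask = same & ~eye, ~same
    # Intra-class clustering: same-class similarities pushed towards 1.
    pos = sim[pos_mask].mean() if pos_mask.any() else sim.new_tensor(1.0)
    # Inter-class separation: different-class similarities pushed towards 0.
    neg = sim[neg_mask].abs().mean() if neg_mask.any() else sim.new_tensor(0.0)
    return (1.0 - pos) + neg

def total_loss(logits, features, labels, lam: float = 0.5):
    """L = L_ce + lambda * L_opl, with lambda controlling the orthogonal projection loss weight."""
    return F.cross_entropy(logits, labels) + lam * orthogonal_projection_loss(features, labels)
```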
  • step S21 includes:
  • the images in the image sequence are proportionally scaled to obtain a scaled image, and the size of the short side of the scaled image is within a preset range.
  • the preset interval may be [256, 320].
  • the zoomed image is randomly cropped to obtain a cropped image, and the size of the cropped image satisfies a preset condition.
  • the size of the cropped image is 224 ⁇ 224.
  • the image sequence formed by cropping the image is divided into multiple subsequences to obtain the training data set.
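  • a minimal sketch of this preprocessing for a single frame (assuming Pillow; the bilinear interpolation and the per-frame random choice of the short-side length are assumptions):

```python
import random
from PIL import Image

def preprocess_frame(img: Image.Image, crop: int = 224) -> Image.Image:
    """Scale so the short side lies in [256, 320], then randomly crop 224x224."""
    short = random.randint(256, 320)              # target short-side length
    w, h = img.size
    scale = short / min(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    img = img.resize((new_w, new_h), Image.BILINEAR)
    # Random crop of size crop x crop (both sides are at least 256 > 224).
    left = random.randint(0, new_w - crop)
    top = random.randint(0, new_h - crop)
    return img.crop((left, top, left + crop, top + crop))
```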
  • An embodiment of the present invention provides an action recognition method, as shown in FIG. 7 , including:
  • Step S31 Acquire the image sequence of the target object, and divide the image sequence into multiple subsequences.
  • Step S32 Input the subsequence into the action recognition model to generate an action recognition result.
  • the action recognition model is trained by the action recognition model training method provided in the above-mentioned embodiment. For details about the action recognition model, refer to the description in the above-mentioned embodiment. This will not be repeated here.
  • the action recognition model training method provided by the above embodiment obtains the action recognition model by training the action recognition system.
  • the action recognition system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively.
  • the static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features.
  • the displacement operations allow the network to acquire spatial local information with a small amount of computation, ensuring the running speed of the network.
  • calculating the difference features between the feature maps captures the long-term temporal relationships in the video, giving the network temporal modeling capability and thereby ensuring the accuracy of action recognition.
  • it can be seen that the action recognition model trained by the action recognition model training method provided in the above embodiment can accurately recognize actions; therefore, accurate recognition of actions can be achieved by implementing the embodiments of the present invention.
  • the method further includes:
  • the images in the image sequence are proportionally scaled to obtain a scaled image, and the size of the short side of the scaled image is within a preset range.
  • the zoomed image is randomly cropped to obtain a cropped image, the size of the cropped image satisfies a preset condition.
  • the cropped image is used as an image in the subsequence, and the step of inputting the subsequence into the action recognition model is performed.
  • the image is proportionally scaled so that its short side falls within the preset interval and is then randomly cropped to the input size accepted by the network; this achieves data augmentation, and with the scaled and cropped data the action recognition model can analyze the effective information in the image, thereby improving analysis efficiency and the accuracy of the analysis results.
  • An embodiment of the present invention provides an action recognition model training device, as shown in FIG. 8 , including:
  • the image acquisition module 21 is configured to acquire a plurality of image sequences, in which the types of pedestrian actions are marked. For details, refer to the description of step S21 in the above embodiment, which will not be repeated here.
  • the training data acquisition module 22 is configured to divide each image sequence into multiple subsequences to obtain a training data set. For details, refer to the description of step S22 in the above-mentioned embodiment, and details will not be repeated here.
  • the model training module 23 is configured to input the training data set into the neural network system and train the neural network system to obtain an action recognition model.
  • the neural network system is the action recognition system provided in the above-mentioned embodiment; for details, refer to the description of step S23 in the above-mentioned embodiment, which will not be repeated here.
  • the action recognition model training device provided by the embodiment of the present invention, after obtaining the training data set, inputs the training data set into the action recognition system provided in the above embodiment, and trains the action recognition system to obtain the action recognition model.
  • the action recognition system provided in the above embodiment includes an information separation network, a static feature network, a dynamic feature network, and a classification network.
  • the action recognition system first separates the dynamic feature map and the static feature map in the images through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network respectively.
  • the static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features.
  • by performing displacement operations on the feature maps, the network can acquire spatial local information with a small amount of computation, which ensures the running speed of the network.
  • by calculating the difference features between the feature maps, the long-term temporal relationships in the video can be captured, so that the network has temporal modeling capability, thereby ensuring the accuracy of action recognition.
  • it can be seen that the action recognition model training device provided by the embodiment of the present invention can obtain an action recognition model with less training data, and the action recognition model trained by this device can accurately recognize actions.
  • An embodiment of the present invention provides an action recognition device, as shown in FIG. 9 , including:
  • the image acquisition module 31 is configured to acquire an image sequence of the target object, and divide the image sequence into multiple subsequences. For details, refer to the description of step S31 in the above-mentioned embodiment, which will not be repeated here.
  • the action recognition module 32 is configured to input the subsequence into the action recognition model to generate an action recognition result.
  • the action recognition model is trained by the action recognition model training method provided in the above embodiment; for details, refer to the description of step S32 in the above embodiment, which will not be repeated here.
  • The action recognition device provided by the embodiment of the present invention, after acquiring the image sequence of the target object, divides the image sequence into multiple subsequences and inputs the subsequences into the action recognition model trained by the action recognition model training method provided in the above embodiment. The action recognition model training method provided in the above embodiment obtains the action recognition model by training the action recognition system. The action recognition system first separates the dynamic feature map and the static feature map in the image through the information separation network, and then inputs the dynamic feature map and the static feature map into the dynamic feature network and the static feature network, respectively. The static feature network and the dynamic feature network perform displacement operations on the feature maps and calculate the difference features between the feature maps corresponding to the segments to obtain classification features. The displacement operation allows the network to obtain spatial local information with a small amount of computation, which ensures the running speed of the network; calculating the difference features between the feature maps captures the long-range temporal relationships in the video, which gives the network temporal modeling capability and thereby ensures the accuracy of action recognition. It can be seen that the action recognition model trained by the action recognition model training method provided in the above embodiments can achieve precise recognition of actions; therefore, precise recognition of actions can be achieved by implementing the embodiments of the present invention.
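As a usage-level illustration of the device just described (acquire an image sequence, split it into subsequences, feed them to the trained model), the following sketch shows one plausible inference flow. It is an assumption-laden example: the uniform-sampling strategy in split_into_segments, the default segment counts, and the model's input convention are illustrative and are not specified by this embodiment.

```python
import torch

def split_into_segments(frames, num_segments=8, frames_per_segment=5):
    # frames: (total_frames, 3, H, W) image sequence of the target object.
    # Take num_segments evenly spaced windows of consecutive frames as the
    # subsequences that will be fed to the action recognition model.
    total = frames.shape[0]
    stride = max(1, (total - frames_per_segment) // max(1, num_segments - 1))
    starts = [min(i * stride, total - frames_per_segment) for i in range(num_segments)]
    return torch.stack([frames[s:s + frames_per_segment] for s in starts])

def recognize_action(model, frames):
    # Wraps the two steps of the device: image-sequence acquisition in,
    # action recognition result out.
    segments = split_into_segments(frames)        # (num_segments, F, 3, H, W)
    with torch.no_grad():
        logits = model(segments.unsqueeze(0))     # add a batch dimension
    return logits.argmax(dim=-1).item()           # predicted action class index
```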
  • An embodiment of the present invention provides a computer device. The computer device mainly includes one or more processors 41 and a memory 42; one processor 41 is taken as an example in FIG. 10. The computer device may also include an input device 43 and an output device 44. The processor 41, the memory 42, the input device 43, and the output device 44 may be connected through a bus or in other ways; in FIG. 10, connection through a bus is taken as an example.
  • The processor 41 may be a central processing unit (Central Processing Unit, CPU). The processor 41 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination of the above-mentioned types of chips. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • The memory 42 may include a program storage area and a data storage area, where the program storage area may store the operating system and the application program required by at least one function, and the data storage area may store data created according to the use of the motion recognition device, and the like.
  • the memory 42 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • The memory 42 may optionally include memories that are remotely located relative to the processor 41, and these remote memories may be connected to the motion recognition system, or the motion recognition model training device, or the motion recognition device through a network.
  • the input device 43 can receive calculation requests (or other digital or character information) input by the user, and generate key signal input related to the motion recognition system, or the motion recognition model training device, or the motion recognition device.
  • the output device 44 may include a display device such as a display screen for outputting calculation results.
  • An embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions can execute the action recognition model training method or the action recognition method in any of the above-mentioned method embodiments.
  • The storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD), or a solid-state drive (Solid-State Drive, SSD), etc.; the storage medium may also include a combination of the above-mentioned types of memory.
  • An embodiment of the present invention provides an action recognition system, including: a band-pass filter module, configured to extract a dynamic feature map based on multiple frames of continuous images in a segment; a static feature extraction module, configured to acquire a static feature map based on the multiple frames of continuous images in the segment; a static feature network, configured to perform a feature displacement operation on the static feature maps and calculate the difference features between the static feature maps corresponding to the segments to obtain static classification features; a dynamic feature network, configured to perform a feature displacement operation on the dynamic feature maps and calculate the difference features between the dynamic feature maps corresponding to the segments to obtain dynamic classification features; and a classification network, configured to obtain an action recognition result based on the static classification features and the dynamic classification features.
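To make the relationships between the five modules easier to follow, the skeleton below arranges them in code. It is a rough sketch under stated assumptions, not the claimed architecture: the band-pass filter is approximated by a temporal 3D convolution, the static branch simply averages the frames of a segment, and the feature displacement step is omitted (it could reuse the temporal_shift sketch shown earlier); the layer sizes, class count, and tensor layout are placeholders.

```python
import torch
import torch.nn as nn

class ActionRecognitionSystem(nn.Module):
    # Illustrative skeleton of the five modules listed above; all layer
    # choices are placeholders, not the patented design.
    def __init__(self, channels=64, num_classes=10):
        super().__init__()
        # Band-pass filter module: a temporal 3D convolution over the
        # consecutive frames of a segment stands in for motion extraction.
        self.bandpass = nn.Conv3d(3, channels, kernel_size=(3, 3, 3), padding=(0, 1, 1))
        # Static feature extraction module: frame-averaged appearance map.
        self.static_conv = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        # Static / dynamic feature networks (displacement step omitted here).
        self.static_net = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.dynamic_net = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        # Classification network over concatenated static/dynamic features.
        self.classifier = nn.Linear(2 * channels, num_classes)

    def forward(self, clips):
        # clips: (batch, segments, frames, 3, H, W), frames >= 3, segments >= 2
        b, t, f, c, h, w = clips.shape
        per_seg = clips.view(b * t, f, c, h, w)
        dynamic = self.bandpass(per_seg.permute(0, 2, 1, 3, 4)).mean(dim=2)  # (B*T, C, H, W)
        static = self.static_conv(per_seg.mean(dim=1))                       # (B*T, C, H, W)
        dynamic = self.dynamic_net(dynamic).view(b, t, -1, h, w)
        static = self.static_net(static).view(b, t, -1, h, w)
        # Difference features between feature maps of adjacent segments,
        # globally pooled into dynamic / static classification features.
        dyn_cls = (dynamic[:, 1:] - dynamic[:, :-1]).mean(dim=(1, 3, 4))
        sta_cls = (static[:, 1:] - static[:, :-1]).mean(dim=(1, 3, 4))
        return self.classifier(torch.cat([dyn_cls, sta_cls], dim=1))

# Example: 2 clips, 8 segments of 5 frames at 112x112 resolution.
logits = ActionRecognitionSystem()(torch.randn(2, 8, 5, 3, 112, 112))  # (2, 10)
```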

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to an action recognition system, method and apparatus, a model training method and apparatus, a computer device, and a computer-readable storage medium. The action recognition system includes: a band-pass filter module, configured to extract a dynamic feature map from multiple frames of continuous images in a segment; a static feature extraction module, configured to obtain a static feature map from the multiple frames of continuous images in a segment; a static feature network, configured to perform a feature displacement operation on the static feature maps and calculate difference features between the static feature maps corresponding to the segments to obtain static classification features; a dynamic feature network, configured to perform a feature displacement operation on the dynamic feature maps and calculate difference features between the dynamic feature maps corresponding to the segments to obtain dynamic classification features; and a classification network, configured to obtain an action recognition result according to the static classification features and the dynamic classification features. By implementing the present invention, an action recognition model can be obtained from a reduced amount of training data, and accurate recognition of actions is possible.
PCT/CN2022/114819 2022-02-25 2022-08-25 Système, procédé et appareil de reconnaissance d'actions, procédé et appareil d'entraînement de modèles, dispositif informatique et support de stockage lisible par ordinateur WO2023159898A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210179444.0 2022-02-25
CN202210179444.0A CN114565973A (zh) 2022-02-25 2022-02-25 一种动作识别系统、方法、装置及模型训练方法、装置

Publications (1)

Publication Number Publication Date
WO2023159898A1 true WO2023159898A1 (fr) 2023-08-31

Family

ID=81716472

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/114819 WO2023159898A1 (fr) 2022-02-25 2022-08-25 Système, procédé et appareil de reconnaissance d'actions, procédé et appareil d'entraînement de modèles, dispositif informatique et support de stockage lisible par ordinateur

Country Status (2)

Country Link
CN (1) CN114565973A (fr)
WO (1) WO2023159898A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115596A (zh) * 2023-10-25 2023-11-24 腾讯科技(深圳)有限公司 对象动作分类模型的训练方法、装置、设备及介质

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565973A (zh) * 2022-02-25 2022-05-31 全球能源互联网研究院有限公司 一种动作识别系统、方法、装置及模型训练方法、装置
CN115115919B (zh) * 2022-06-24 2023-05-05 国网智能电网研究院有限公司 一种电网设备热缺陷识别方法及装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681695A (zh) * 2018-04-26 2018-10-19 北京市商汤科技开发有限公司 视频动作识别方法及装置、电子设备和存储介质
CN111931603A (zh) * 2020-07-22 2020-11-13 北方工业大学 基于竞合网络的双流卷积网络的人体动作识别系统及方法
CN113221694A (zh) * 2021-04-29 2021-08-06 苏州大学 一种动作识别方法
CN114565973A (zh) * 2022-02-25 2022-05-31 全球能源互联网研究院有限公司 一种动作识别系统、方法、装置及模型训练方法、装置

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681695A (zh) * 2018-04-26 2018-10-19 北京市商汤科技开发有限公司 视频动作识别方法及装置、电子设备和存储介质
CN111931603A (zh) * 2020-07-22 2020-11-13 北方工业大学 基于竞合网络的双流卷积网络的人体动作识别系统及方法
CN113221694A (zh) * 2021-04-29 2021-08-06 苏州大学 一种动作识别方法
CN114565973A (zh) * 2022-02-25 2022-05-31 全球能源互联网研究院有限公司 一种动作识别系统、方法、装置及模型训练方法、装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115596A (zh) * 2023-10-25 2023-11-24 腾讯科技(深圳)有限公司 对象动作分类模型的训练方法、装置、设备及介质
CN117115596B (zh) * 2023-10-25 2024-02-02 腾讯科技(深圳)有限公司 对象动作分类模型的训练方法、装置、设备及介质

Also Published As

Publication number Publication date
CN114565973A (zh) 2022-05-31

Similar Documents

Publication Publication Date Title
Zeng et al. Multi-scale convolutional neural networks for crowd counting
WO2023159898A1 (fr) Système, procédé et appareil de reconnaissance d'actions, procédé et appareil d'entraînement de modèles, dispositif informatique et support de stockage lisible par ordinateur
Xiong et al. Spatiotemporal modeling for crowd counting in videos
CN107529650B (zh) 闭环检测方法、装置及计算机设备
Ye et al. Dynamic texture based smoke detection using Surfacelet transform and HMT model
CN107330390B (zh) 一种基于图像分析和深度学习的人数统计方法
TW202101371A (zh) 視訊流的處理方法和裝置
WO2022134655A1 (fr) Système de détection et de positionnement d'action vidéo de bout en bout
CN105488812A (zh) 一种融合运动特征的时空显著性检测方法
WO2020233397A1 (fr) Procédé et appareil de détection de cible dans une vidéo, et dispositif informatique et support d'informations
CN111160295A (zh) 基于区域引导和时空注意力的视频行人重识别方法
CN109859246B (zh) 一种结合相关滤波与视觉显著性的低空慢速无人机跟踪方法
CN110532959B (zh) 基于双通道三维卷积神经网络的实时暴力行为检测系统
Jiang et al. A self-attention network for smoke detection
US20230095533A1 (en) Enriched and discriminative convolutional neural network features for pedestrian re-identification and trajectory modeling
Patil et al. Multi-frame recurrent adversarial network for moving object segmentation
Angelo A novel approach on object detection and tracking using adaptive background subtraction method
CN108647605B (zh) 一种结合全局颜色与局部结构特征的人眼凝视点提取方法
Hua et al. Dynamic scene deblurring with continuous cross-layer attention transmission
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
CN111079516B (zh) 基于深度神经网络的行人步态分割方法
US20230154139A1 (en) Systems and methods for contrastive pretraining with video tracking supervision
Toha et al. LC-Net: Localized Counting Network for extremely dense crowds
JP7253967B2 (ja) 物体対応付け装置、物体対応付けシステム、物体対応付け方法及びコンピュータプログラム
Kalboussi et al. A spatiotemporal model for video saliency detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22928164

Country of ref document: EP

Kind code of ref document: A1