WO2023125119A1 - Spatio-temporal action detection method and apparatus, electronic device and storage medium - Google Patents

Spatio-temporal action detection method and apparatus, electronic device and storage medium Download PDF

Info

Publication number
WO2023125119A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
action
video frame
target
position information
Prior art date
Application number
PCT/CN2022/140123
Other languages
French (fr)
Chinese (zh)
Inventor
葛成伟
童俊文
关涛
李健
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2023125119A1 publication Critical patent/WO2023125119A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/70: Determining position or orientation of objects or cameras
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • The present application relates to the field of computer vision and deep learning, and in particular to a spatio-temporal action detection method and apparatus, an electronic device, and a storage medium.
  • Spatio-temporal action detection refers to locating the different persons in a given untrimmed video, analyzing the actions of the located persons, and outputting the action type of each person. Compared with action recognition, spatio-temporal action detection needs to model the action of each person, whereas action recognition models the action of the entire video. Usually there are multiple persons in the analyzed video and their actions are inconsistent, so modeling the action of the entire video is clearly inappropriate.
  • Spatio-temporal action detection comprises two sub-tasks: person localization in the spatial domain and action analysis in the temporal domain.
  • Existing spatio-temporal action detection methods can be divided into two-stage and single-stage approaches. However, whether two-stage or single-stage, most current action recognition treats a temporal segment as a whole for action modeling and outputs one action category for that segment. This suffers from inappropriate sampling strategies, overly long sampling lengths, inaccurate localization of action frames, and poor temporal feature representation, so different persons and different actions in long videos cannot be accurately located and recognized.
  • The purpose of this application is to solve the above problems by providing a spatio-temporal action detection method and apparatus, an electronic device, and a storage medium, which address inappropriate sampling strategy selection, overly long sampling lengths, the inability to accurately locate action frames, and poor temporal feature representation, thereby achieving accurate localization and recognition of different persons and different actions in long videos.
  • An embodiment of the present application provides a spatio-temporal action detection method. The method includes: locating each person in consecutive video frames, obtaining the position information of each person in each video frame, and caching the position information of each person in each video frame; and identifying the person actions of each video frame according to the cached person position information in a preset-length sequence of video frames, obtaining the action of each person in each of the consecutive video frames.
  • An embodiment of the present application provides a spatio-temporal action detection apparatus. The apparatus includes: a position recognition module, configured to locate each person in consecutive video frames, obtain the position information of each person in each video frame, and cache the position information of each person in each video frame; and an action recognition module, configured to identify the person actions of each video frame according to the cached person position information in a preset-length sequence of video frames, and obtain the action of each person in each of the consecutive video frames.
  • An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above spatio-temporal action detection method.
  • An embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the above spatio-temporal action detection method is implemented.
  • By obtaining the action of each person in each of the consecutive video frames, the problems of sampling strategy and sampling length selection are solved; performing action discrimination on each video frame can distinguish the background information of the video frame sequence from the action foreground information, which enhances the temporal feature representation ability of the network model.
  • Fig. 1 is a flowchart of a spatio-temporal action detection method provided by an embodiment of the present application;
  • Fig. 2 is a flowchart of integrated network model inference provided by an embodiment of the present application;
  • Fig. 3 is a schematic structural diagram of a spatio-temporal action detection device provided by an embodiment of the present application;
  • Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • An embodiment of the present application relates to a spatio-temporal action detection method: first, the persons are located to obtain position information, and the acquired position information of each person is cached; then, according to the cached person position information in a preset-length sequence of video frames, the person actions of each video frame are identified, obtaining the action of each person in each of the consecutive video frames. This solves the problems of sampling strategy and sampling length selection, and performing action discrimination on each video frame can distinguish the background and action foreground information of the video frame sequence, enhancing the temporal feature representation ability of the network model. Accurate localization and recognition of different persons and different actions in long videos is thereby realized.
  • In one example, each person in consecutive video frames can be located by a pre-trained target tracking network model, where the target tracking network model is used to detect the position information of each person in each video frame.
  • The position information of each person output by the target tracking network model is stored in a buffer matrix S. Each element S_i^j of the buffer matrix represents the position information of the j-th person in the i-th frame, where j denotes the row of the element and i denotes its column.
  • Therefore, in one example, the spatio-temporal action detection method can consist of two stages, a network model training stage and a network model inference stage, detailed as follows:
  • The network model training stage includes the training of the target tracking network model and the training of the action recognition model. The basic steps of target tracking network model training are as follows:
  • Network model design: the target tracking network model locates the persons in the video and associates them over time; commonly used multi-object tracking networks such as DeepSORT, CenterTrack, and FairMOT can all be used;
  • Sample labeling: using a single-category target label, annotate the persons in the video with rectangular boxes, assigning different target IDs to different persons;
  • Model training: train the model with the labeled person samples; after a certain number of training iterations, the person target tracking model file is obtained.
  • For action recognition model training, the entire network model includes a temporal feature extraction backbone and a dense prediction action classification head; any temporal network model can serve as the backbone of this application, such as a 3D convolutional network or a two-stream convolutional network;
  • The dense prediction action classification head is used to determine the action category of each single video frame.
  • Assume that the number of action categories, including the background, is N, and that the feature dimension output by the backbone network is B×C×L×H×W, where B is the batch size, C the number of channels, L the video sequence length, H the feature height, and W the feature width. The features are processed as follows (sketched in code below this list):
  • step a: perform a global average pooling operation on the backbone output features over the H and W dimensions; the output dimension is B×C×L×1×1;
  • step b: perform a fully connected operation on the output of step a, i.e., flatten it and map it to the output dimension B×NL, so that the number of output channels is NL;
  • step c: perform a rearrangement operation on the output of step b, i.e., split the NL channels into two parts, one of size N and one of size L; the output dimension is B×N×L;
  • step d: compute the softmax cross-entropy loss on the output of step c along the second (class) dimension, i.e., compute the loss for every frame of the video sequence;
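As an illustration of steps a through d, the following minimal PyTorch-style sketch shows one possible form of the dense prediction head. The class name, the flattening of the pooled features before the fully connected layer, and the concrete dimensions are assumptions for demonstration, not the patent's reference implementation; the numbers match the ResNet18-3D example given later in this document.

```python
import torch
import torch.nn as nn

class DensePredictionHead(nn.Module):
    """Per-frame action classification head (a sketch of steps a-d above).

    Assumes backbone features of shape B x C x Lf x H x W, where Lf is the
    temporal length of the feature map (it may be shorter than the input
    clip length L when the backbone strides over time).
    """

    def __init__(self, channels: int, feat_len: int, clip_len: int, num_classes: int):
        super().__init__()
        self.clip_len = clip_len        # L: frames per input clip
        self.num_classes = num_classes  # N: action classes incl. background
        self.pool = nn.AdaptiveAvgPool3d((feat_len, 1, 1))               # step a
        self.fc = nn.Linear(channels * feat_len, num_classes * clip_len)  # step b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        x = self.pool(x)    # B x C x Lf x 1 x 1: global average pool over H, W
        x = x.flatten(1)    # B x (C * Lf)
        x = self.fc(x)      # B x (N * L): NL output channels
        return x.view(b, self.num_classes, self.clip_len)  # step c: B x N x L

# step d: per-frame softmax cross-entropy along the class (second) dimension.
head = DensePredictionHead(channels=512, feat_len=4, clip_len=32, num_classes=3)
feats = torch.randn(16, 512, 4, 7, 7)      # assumed backbone output
logits = head(feats)                        # 16 x 3 x 32
labels = torch.randint(0, 3, (16, 32))      # one action label per frame
loss = nn.CrossEntropyLoss()(logits, labels)  # accepts B x N x L vs B x L
```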
  • Sample labeling: person samples are extracted for each person in the video according to target ID and bounding box, each person ID forming one sample set, and each video frame in each sample set is annotated with its action category. Model training: for each person sample set, select a continuous frame sequence of fixed length L as network input, with the starting position of the frame sequence chosen at random (see the sketch below); after a certain number of training iterations, the video frame action recognition model file is obtained.
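The random-start clip selection described above might look like the following sketch; `track` and `labels` are hypothetical stand-ins for one person's per-frame positions and action categories.

```python
import random

def sample_training_clip(track, labels, clip_len=32):
    """Sample one fixed-length training clip from a person's sample set (a sketch)."""
    assert len(track) >= clip_len, "track shorter than the clip length L"
    start = random.randint(0, len(track) - clip_len)  # random starting position
    return track[start:start + clip_len], labels[start:start + clip_len]
```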
  • The network model inference module uses the model files obtained through training to perform inference, accurately locating and recognizing the different persons and different actions in the video.
  • The spatio-temporal action detection method provided in the embodiment of the present application includes five parts: system initialization, video frame input, target tracking inference, action recognition inference, and result output.
  • The specific functions of each part are as follows:
  • System initialization: load the offline-trained target tracking network model and action recognition model, initialize the buffer matrix S, and allocate the necessary variables and storage space.
  • Video frame input: load an offline video from disk and read its video frames as the input source, or read a network video stream via rtmp or rtsp as the input source.
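A minimal sketch of opening either kind of input source, assuming OpenCV is used (the document does not name a library):

```python
import cv2  # OpenCV: an assumed choice for reading files and network streams

def open_input(source: str) -> cv2.VideoCapture:
    """Open an offline video file or an rtmp/rtsp network stream as the input source."""
    cap = cv2.VideoCapture(source)  # accepts file paths and rtsp:// or rtmp:// URLs
    if not cap.isOpened():
        raise IOError(f"cannot open input source: {source}")
    return cap

# usage: open_input("video.mp4") or open_input("rtsp://camera-host/stream")
```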
  • Target tracking inference: output the persons and their IDs according to the trained target tracking network model, and update the buffer matrix S accordingly.
  • Action recognition inference: perform action recognition for the different persons according to the trained action recognition model and the buffer matrix S, obtaining the action types.
  • Result output: output the action trajectory lines and action types of the different persons.
  • In step 101, each person in the consecutive video frames is located, the position information of each person in each video frame is obtained, and the position information of each person in each video frame is cached.
  • Specifically, a video frame is input into the target tracking network model, which outputs the position information of each person in the current frame. The server stores the position information of each person output by the target tracking network model in the buffer matrix S, where each element S_i^j indicates the position information of the j-th person in the i-th frame, j indicating the element's row and i its column.
  • The server updates the buffer matrix as follows: when a person in the current video frame output by the target tracking network model does not exist in the buffer matrix, a row corresponding to that person is added to the buffer matrix and the person's position information in the current video frame is written into it; when a person in the current video frame output by the target tracking network model already exists in the buffer matrix, the person's position information in the current video frame is updated in the buffer matrix; and when a person corresponding to a row of the buffer matrix is not among the persons detected by the target tracking network model in the current video frame, the row data corresponding to that person is deleted.
  • In one example, for a given video frame, the server uses the target tracking network model to infer the persons and their ID information in the current frame. If a target ID does not exist in the buffer matrix S, a new row is added to S for that target ID and the target information, i.e., the target's coordinate position information, is updated according to the frame number; if the target ID already exists in S, the target information is appended according to the target ID and frame number; if a target ID in S is absent from the detection results of the current frame, the target has disappeared, and the target ID's information, i.e., the entire row of position information for that target ID, is deleted from S.
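The row update rules above could be expressed as in the following sketch, where each row of the buffer matrix S is represented as a Python list keyed by target ID; this data layout is an illustrative assumption.

```python
def update_buffer(buffer: dict, detections: dict, frame_idx: int) -> None:
    """Update the buffer matrix S after one frame of tracking (a sketch).

    `buffer` maps target id -> list of (frame index, box) entries (one row of S);
    `detections` maps target id -> box for the current frame.
    """
    for target_id, box in detections.items():
        if target_id not in buffer:      # new target: add a row for this id
            buffer[target_id] = []
        buffer[target_id].append((frame_idx, box))  # append position info
    for target_id in list(buffer):       # target absent from current detections:
        if target_id not in detections:  # it disappeared, delete its whole row
            del buffer[target_id]
```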
  • In one example, the FairMOT network model is used for person target tracking, which balances accuracy and inference speed.
  • In step 102, the person actions of each video frame are identified according to the cached person position information in the preset-length sequence of video frames, and the action of each person in each of the consecutive video frames is obtained.
  • In one example, before the position information of each person output by the target tracking network model is input into the pre-trained action recognition model, the length of each row in the buffer matrix is checked to determine the first target rows, i.e., the rows whose length is greater than or equal to the preset sequence length; in other words, a sequence length validity check is performed on the buffer matrix S. For example, if the sequence length of a target ID is greater than or equal to the preset sequence length L, the results of the first L frames of that target ID are input into the action recognition model, which outputs the action recognition result of each video frame; at the same time, the target information of the first T video frames is deleted and the following T entries of S are set to empty.
  • Inputting the position information of each person output by the target tracking network model into the pre-trained action recognition model includes: inputting the position information of a target person in L consecutive video frames into the pre-trained action recognition model to obtain the target person's action in each of the L video frames; and inputting the position information of the target person in P consecutive video frames into the pre-trained action recognition model to obtain the target person's action in each of the P video frames, where P is the preset sequence length, the first T video frames of the P video frames overlap the last T video frames of the L video frames, and T is smaller than the preset sequence length.
  • Overlap prediction is performed on some person sequences; the overlap length is T, and the value of T is L/2. For example, action recognition is performed once on frames 1-32 of a target ID to obtain the person's actions in frames 1-32, and then performed again on frames 16-48 of the target ID to obtain the person's actions in frames 16-48; frames 16-32 of the target ID therefore receive overlapping predictions. It should be noted that the embodiments of the present application do not specifically limit the values of T (T < L) and L.
  • The action of the target person in each video frame is thus obtained; for the overlapping video frames, the final action of the target person is determined according to the confidences of the multiple recognized actions, i.e., the action classification result with the higher confidence in the classification output is selected as the final person action recognition result.
  • Performing overlapping predictions on person sequences increases the prediction accuracy of the action recognition model.
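A minimal sketch of the confidence-based merge for the overlapping frames, assuming per-frame class-score vectors are available from each window (the exact score format is not specified by the document):

```python
def merge_overlap(prev_scores, curr_scores, overlap: int):
    """Merge per-frame class scores from two windows that overlap by T frames (a sketch).

    For each overlapping frame, the prediction whose top score (confidence) is
    higher wins; non-overlapping frames keep their own predictions.
    """
    merged = []
    for a, b in zip(prev_scores[-overlap:], curr_scores[:overlap]):
        merged.append(a if max(a) >= max(b) else b)  # keep the more confident result
    return merged

# usage with L = 32, T = 16: recognize frames 1-32, then 16-48; frames 16-32
# receive two predictions, and the more confident one is kept for each frame.
```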
  • In one example, an initial action recognition model is generated and trained on the sample set of each person to obtain the trained action recognition model. The action recognition model includes a backbone network for extracting temporal features and a dense prediction action classification head for predicting the person action in each frame. The feature dimension output by the backbone network is B×C×L×H×W, where B is the batch size, C the number of channels, L the preset-length sequence of video frames, H the height of the depth feature, and W its width. Generating the initial action recognition model includes: performing a global average pooling operation on the backbone output over the H and W dimensions, updating the output dimension to B×C×L×1×1; and performing fully connected and rearrangement operations on the pooled output to obtain output of dimension B×N×L, where N is the number of action categories.
  • ResNet18-3D convolution is used as the network backbone for temporal feature extraction;
  • In this example, the output feature dimension is [16, 512, 4, 7, 7], and the backbone output features are processed as follows:
  • step a: perform a global average pooling operation over the H and W dimensions; the output dimension is [16, 512, 4, 1, 1];
  • step b: perform a fully connected operation on the output of step a; the output dimension is [16, 96], i.e., the number of output channels is 96;
  • step c: perform a rearrangement operation on the output of step b; the output dimension is 16×3×32 (N = 3 action categories, L = 32 frames);
  • step d: compute the softmax cross-entropy loss on the output of step c along the second dimension, i.e., compute the loss for each frame of the video sequence.
  • The action recognition model performs action discrimination for each video frame, which solves the problems of sampling strategy and sampling length selection, can distinguish the background and action foreground information of the video frame sequence, and enhances the temporal feature representation ability of the network model.
  • In step 201, a video frame is input into the target tracking model, and the persons and their ID information in the current frame are acquired.
  • In step 202, the buffer matrix is updated: if a target ID does not exist in the buffer matrix S, a new row is added for that target ID and the target information, i.e., the target's coordinate position information, is updated according to the frame number; if the target ID already exists in S, the target information is appended according to the target ID and frame number; if a target ID in S is absent from the detection results of the current frame, the target has disappeared, and the entire row of position information for that target ID is deleted from S.
  • In step 203, it is judged whether the sequence length of the target ID is greater than or equal to the preset sequence length L; if so, step 204 is executed; otherwise, the last prediction result is directly used as the action recognition result of the current frame.
  • In step 204, the results of the first L frames of the target ID are input into the action recognition model.
  • In step 205, the spatio-temporal action result is acquired.
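Tying steps 201-205 together, a minimal driver loop might look like the following sketch; `tracker`, `recognizer`, and the choice to retain the last T frames for overlap are assumptions, since the document leaves the exact buffer consumption unspecified.

```python
def process_stream(cap, tracker, recognizer, L=32, T=16):
    """Driver loop following steps 201-205 (a sketch; all callables are assumed).

    `tracker(frame)` is assumed to return {target_id: box} for one frame, and
    `recognizer(row)` to return per-frame actions for an L-frame window.
    """
    buffer, last_pred, frame_idx = {}, {}, 0
    while True:
        ok, frame = cap.read()                       # step 201: next video frame
        if not ok:
            break
        update_buffer(buffer, tracker(frame), frame_idx)  # step 202 (sketch above)
        for target_id, row in buffer.items():
            if len(row) >= L:                        # step 203: length check
                last_pred[target_id] = recognizer(row[:L])  # step 204
                buffer[target_id] = row[T:]          # drop first T, keep overlap
            # otherwise the last prediction stands for the current frame
        frame_idx += 1
    return last_pred                                 # step 205: spatio-temporal results
```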
  • A temporal segment does not consist entirely of action frames; it also contains background frames, especially in scenes with relatively fast action rates, such as playing table tennis or badminton.
  • For action discrimination based on a long sequence as a whole, the sampling strategy is difficult to choose: if the sampling is too short, the action features cannot be fully extracted; if it is too long, too many background features are incorporated, which interferes with action discrimination.
  • Because the action frames cannot be accurately located, it is difficult to obtain a robust temporal feature representation when modeling actions over the entire temporal segment, which leads to the inability to accurately locate and recognize different persons and actions in long videos.
  • The spatio-temporal action detection method provided by the embodiment of the present application is a two-stage detection method consisting of target tracking and person action recognition: target tracking extracts spatial feature information and performs temporal association, while action recognition extracts temporal features. Training the two parts separately therefore increases the convergence speed of the network, reduces the training difficulty, and also reduces the interdependence of the two network structures, increasing the accuracy of spatio-temporal action recognition.
  • The dense-prediction action recognition approach solves the problems of sampling strategy and sampling length selection; action discrimination on each video frame can distinguish the background and action foreground information of the video frame sequence and enhances the temporal feature representation ability of the network model.
  • The method provided in the embodiments of the present application can be applied in real scenarios such as industrial production, agricultural production, and daily life; it can replace traditional manual inspection and statistics schemes, reduce manual intervention, and improve work efficiency, and it has broad market applications and can bring considerable research and economic value.
  • The step division of the above methods is only for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, these variations are within the protection scope of this application. Adding insignificant modifications to the algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm and process, is also within the protection scope of this application.
  • An embodiment of the present application also relates to a spatio-temporal action detection device, as shown in Fig. 3, including a position recognition module 301 and an action recognition module 302.
  • The position recognition module 301 is configured to locate each person in consecutive video frames, obtain the position information of each person in each video frame, and cache the position information of each person in each video frame.
  • The action recognition module 302 is configured to identify the person actions of each video frame according to the cached person position information in the preset-length sequence of video frames, and obtain the action of each person in each of the consecutive video frames.
  • The position recognition module 301 uses a pre-trained target tracking network model to locate each person in the consecutive video frames and stores the position information of each person output by the target tracking network model in the buffer matrix S, where each element S_i^j represents the position information of the j-th person in the i-th frame, j denoting the element's row and i its column; the target tracking network model is used to detect the position information of each person in each video frame.
  • The action recognition module 302 is configured to input the position information of each person stored in the buffer matrix into the pre-trained action recognition model and, according to the output of the action recognition model, obtain the action of each person in each of the consecutive video frames; the action recognition model is used to identify the person action in each video frame according to the person position information in a preset-length sequence of video frames.
  • For a given video frame, the position recognition module 301 uses the target tracking network model to infer the persons and their ID information in the current frame. If a target ID does not exist in the buffer matrix S, a new row is added for that target ID and the target information, i.e., the target's coordinate position information, is updated according to the frame number; if the target ID already exists in S, the target information is appended according to the target ID and frame number; if a target ID in S is absent from the detection results of the current frame, the target has disappeared, and the entire row of position information for that target ID is deleted from S.
  • The spatio-temporal action detection device of the embodiment of the present application further includes a length detection module (not shown in the figure). Before the position information of each person output by the target tracking network model is input into the pre-trained action recognition model, the length detection module detects the length of each row in the buffer matrix to determine the first target rows whose length is greater than or equal to the preset sequence length, i.e., it performs a sequence length validity check on the buffer matrix S. For example, if the sequence length of a target ID is greater than or equal to the preset sequence length L, the results of the first L frames of that target ID are input into the action recognition model, which outputs the action recognition result of each video frame; at the same time, the target information of the first T video frames is deleted and the following T entries of S are set to empty.
  • The length detection module also obtains the second target rows in the buffer matrix, i.e., rows whose length is less than the preset sequence length, and uses the last detected action of the person corresponding to a second target row as that person's action for the current video frame; for example, if the sequence length of a target ID is less than L, the last prediction result of that target ID is used as the action recognition result of the current frame.
  • The action recognition module 302 performs overlap prediction on some person sequences; the overlap length is T, and the value of T is L/2. For example, action recognition is performed once on frames 1-32 of a target ID to obtain the person's actions in frames 1-32, and then performed again on frames 16-48 to obtain the person's actions in frames 16-48, so frames 16-32 of the target ID receive overlapping predictions. It should be noted that the embodiments of the present application do not specifically limit the values of T (T < L) and L.
  • The action recognition module 302 obtains the target person's action in each of the L video frames and in each of the P video frames; for the overlapping video frames, the final action of the target person is determined according to the confidences of the multiple recognized actions of the target person in those frames.
  • The spatio-temporal action detection device adopts a two-stage detection method of target tracking and person action recognition: target tracking extracts spatial feature information and performs temporal association, while action recognition extracts temporal features. Training the two parts separately can therefore increase the convergence speed of the network, reduce the training difficulty, and also reduce the interdependence of the two network structures, increasing the accuracy of spatio-temporal action recognition.
  • The dense-prediction action recognition approach solves the problems of sampling strategy and sampling length selection; action discrimination is performed on each video frame, which can distinguish the background and action foreground information of the video frame sequence and enhances the temporal feature representation ability of the network model. Accurate localization and recognition of different persons and different actions in long videos is realized, with high robustness and high accuracy.
  • This embodiment is a device embodiment corresponding to the above embodiment of the spatio-temporal action detection method, and it can be implemented in cooperation with that method embodiment.
  • The relevant technical details mentioned in the above embodiments of the spatio-temporal action detection method remain valid in this embodiment and, to reduce repetition, are not repeated here.
  • The relevant technical details mentioned in this embodiment can also be applied to the above embodiments of the spatio-temporal action detection method.
  • The modules involved in the above embodiments of the present application are logical modules. A logical unit can be a physical unit, a part of a physical unit, or a combination of physical units.
  • Units that are not closely related to solving the technical problem proposed in the present application are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
  • An embodiment of the present application also provides an electronic device, as shown in Fig. 4, including: at least one processor 401; and a memory communicatively connected to the at least one processor 401, storing instructions executable by the at least one processor 401, the instructions being executed by the at least one processor 401 so that the at least one processor can execute the above spatio-temporal action detection method.
  • The memory and the processor are connected by a bus.
  • The bus may include any number of interconnected buses and bridges, and it connects one or more processors and the various circuits of the memory together.
  • The bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore are not further described herein.
  • The bus interface provides an interface between the bus and the transceiver.
  • A transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other devices over a transmission medium.
  • Data processed by the processor is transmitted over the wireless medium through the antenna; the antenna also receives data and passes it to the processor.
  • The processor is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory can be used to store data used by the processor when performing operations.
  • Embodiments of the present application also provide a computer-readable storage medium storing a computer program; the above method embodiments are implemented when the computer program is executed by a processor.
  • That is, those skilled in the art can understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing the relevant hardware; the program is stored in a storage medium and includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the present application relate to the field of computer vision and deep learning. Disclosed are a spatio-temporal action detection method and apparatus, an electronic device and a storage medium. The method comprises: performing positioning on each person in continuous video frames to obtain position information of each person in each video frame, and caching the position information of each person in each video frame; and identifying actions in each video frame according to the cached position information in the video frames having a preset length sequence, so as to obtain actions of each person in each of the continuous video frames.

Description

Spatio-temporal action detection method and apparatus, electronic device and storage medium
Related Application
This application claims the priority of the Chinese patent application with application number 202111657437.9, filed on December 30, 2021.
Technical Field
The present application relates to the field of computer vision and deep learning, and in particular to a spatio-temporal action detection method and apparatus, an electronic device, and a storage medium.
Background
Spatio-temporal action detection refers to locating the different persons in a given untrimmed video, analyzing the actions of the located persons, and outputting the action type of each person. Compared with action recognition, spatio-temporal action detection needs to model the action of each person, whereas action recognition models the action of the entire video. Usually there are multiple persons in the analyzed video and their actions are inconsistent, so modeling the action of the entire video is clearly inappropriate.
Spatio-temporal action detection comprises two sub-tasks: person localization in the spatial domain and temporal action analysis. Existing spatio-temporal action detection methods can be divided into two-stage and single-stage approaches. However, whether two-stage or single-stage, most current action recognition treats a temporal segment as a whole for action modeling and outputs one action category for that segment. This suffers from inappropriate sampling strategies, overly long sampling lengths, inaccurate localization of action frames, and poor temporal feature representation, so different persons and different actions in long videos cannot be accurately located and recognized.
Summary
The purpose of this application is to solve the above problems by providing a spatio-temporal action detection method and apparatus, an electronic device, and a storage medium, which address inappropriate sampling strategy selection, overly long sampling lengths, the inability to accurately locate action frames, and poor temporal feature representation, thereby achieving accurate localization and recognition of different persons and different actions in long videos.
To solve the above problems, an embodiment of the present application provides a spatio-temporal action detection method. The method includes: locating each person in consecutive video frames, obtaining the position information of each person in each video frame, and caching the position information of each person in each video frame; and identifying the person actions of each video frame according to the cached person position information in a preset-length sequence of video frames, obtaining the action of each person in each of the consecutive video frames.
To solve the above problems, an embodiment of the present application provides a spatio-temporal action detection apparatus. The apparatus includes: a position recognition module, configured to locate each person in consecutive video frames, obtain the position information of each person in each video frame, and cache the position information of each person in each video frame; and an action recognition module, configured to identify the person actions of each video frame according to the cached person position information in a preset-length sequence of video frames, and obtain the action of each person in each of the consecutive video frames.
To solve the above problems, an embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above spatio-temporal action detection method.
To solve the above problems, an embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the above spatio-temporal action detection method is implemented.
In the embodiments of the present application, the persons are first located to obtain position information, and the acquired position information of each person is cached; then, according to the cached person position information in a preset-length sequence of video frames, the person actions of each video frame are identified, obtaining the action of each person in each of the consecutive video frames. This solves the problems of sampling strategy and sampling length selection, and performing action discrimination on each video frame can distinguish the background and action foreground information of the video frame sequence, enhancing the temporal feature representation ability of the network model. Accurate localization and recognition of different persons and different actions in long videos is thereby realized.
Brief Description of the Drawings
One or more embodiments are exemplified by the figures in the corresponding drawings; these illustrations do not constitute a limitation of the embodiments. Elements with the same reference numerals in the drawings represent similar elements, and unless otherwise stated, the figures are not drawn to scale.
Fig. 1 is a flowchart of a spatio-temporal action detection method provided by an embodiment of the present application;
Fig. 2 is a flowchart of integrated network model inference provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a spatio-temporal action detection device provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the implementations of the present application are described in detail below with reference to the accompanying drawings. However, those of ordinary skill in the art can understand that many technical details are provided in each implementation so that readers may better understand the present application; the technical solutions claimed in this application can be realized even without these technical details and with various changes and modifications based on the following implementations.
An embodiment of the present application relates to a spatio-temporal action detection method: first, the persons are located to obtain position information, and the acquired position information of each person is cached; then, according to the cached person position information in a preset-length sequence of video frames, the person actions of each video frame are identified, obtaining the action of each person in each of the consecutive video frames. This solves the problems of sampling strategy and sampling length selection, and performing action discrimination on each video frame can distinguish the background and action foreground information of the video frame sequence, enhancing the temporal feature representation ability of the network model. Accurate localization and recognition of different persons and different actions in long videos is thereby realized.
In one example, each person in consecutive video frames can be located by a pre-trained target tracking network model, where the target tracking network model is used to detect the position information of each person in each video frame. The position information of each person output by the target tracking network model is stored in a buffer matrix S, where each element S_i^j represents the position information of the j-th person in the i-th frame, j denoting the element's row and i its column. The position information of each person stored in the buffer matrix is input into a pre-trained action recognition model, and the action of each person in each of the consecutive video frames is obtained according to the output of the action recognition model; the action recognition model is used to identify the person action in each video frame according to the person position information in a preset-length sequence of video frames.
Therefore, in one example, the spatio-temporal action detection method can consist of two stages: a network model training stage and a network model inference stage. The details are as follows:
The network model training stage includes the training of the target tracking network model and the training of the action recognition model. The basic steps of target tracking network model training are as follows:
(1) Network model design: the target tracking network model locates the persons in the video and associates them over time; commonly used multi-object tracking networks such as DeepSORT, CenterTrack, and FairMOT can all be used;
(2) Sample labeling: using a single-category target label, annotate the persons in the video with rectangular boxes, assigning different target IDs to different persons;
(3) Model training: train the model with the labeled person samples; after a certain number of training iterations, the person target tracking model file is obtained.
The basic steps of action recognition model training are as follows:
(1) The entire network model includes a temporal feature extraction backbone and a dense prediction action classification head; any temporal network model can serve as the backbone of this application, such as a 3D convolutional network or a two-stream convolutional network;
a. The dense prediction action classification head is used to determine the action category of each single video frame. Assuming that the number of action categories including the background is N and that the feature dimension output by the backbone network is B×C×L×H×W, where B is the batch size, C the number of channels, L the video sequence length, H the feature height, and W the feature width, the features are processed as follows:
b. Perform a global average pooling operation on the backbone output features over the H and W dimensions (the head processing process); the output dimension is B×C×L×1×1;
c. Perform a fully connected operation on the output of step b, i.e., flatten it and map it to the output dimension B×NL, so that the number of output channels is NL;
d. Perform a rearrangement operation on the output of step c, i.e., split the NL channels into two parts, one of size N and the other of size L; the output dimension is B×N×L;
e. Compute the softmax cross-entropy loss on the output of step d along the second (class) dimension, i.e., compute the loss for every frame of the video sequence;
(2) Sample labeling: first, person samples are extracted for each person in the video according to target ID and bounding box, each person ID forming one sample set; then, for each person sample set, each video frame is annotated with its corresponding action category;
(3) Model training: for each person sample set, select a continuous frame sequence of fixed length L as network input, with the starting position of the frame sequence chosen at random; after a certain number of training iterations, the video frame action recognition model file is obtained.
The network model inference module uses the model files obtained through training to perform inference, accurately locating and recognizing the different persons and different actions in the video.
The spatio-temporal action detection method provided in the embodiment of the present application includes five parts: system initialization, video frame input, target tracking inference, action recognition inference, and result output. The specific functions of each part are as follows:
System initialization: load the offline-trained target tracking network model and action recognition model, initialize the buffer matrix S, and allocate the necessary variables and storage space.
Video frame input: load an offline video from disk and read its video frames as the input source, or read a network video stream via rtmp or rtsp as the input source.
Target tracking inference: output the persons and their IDs according to the trained target tracking network model, and complete the update of the buffer matrix S.
Action recognition inference: perform action recognition for the different persons according to the trained action recognition model and the buffer matrix S, obtaining the action types.
Result output: output the action trajectory lines and action types of the different persons.
The implementation details of the spatio-temporal action detection method in this embodiment are described below. The following content is provided only to aid understanding and is not necessary for implementing this solution. The specific flow, shown in Fig. 1, may include the following steps:
In step 101, each person in the consecutive video frames is located, the position information of each person in each video frame is obtained, and the position information of each person in each video frame is cached.
Specifically, a video frame is input into the target tracking network model, which outputs the position information of each person in the current frame. The server stores the position information of each person output by the target tracking network model in the buffer matrix S, where each element S_i^j indicates the position information of the j-th person in the i-th frame, j indicating the element's row and i its column:

    S = | S_1^1  S_2^1  ...  S_i^1 |
        | S_1^2  S_2^2  ...  S_i^2 |
        | ...    ...    ...  ...   |
        | S_1^j  S_2^j  ...  S_i^j |
The server updates the buffer matrix as follows: when a person in the current video frame output by the target tracking network model does not exist in the buffer matrix, a row corresponding to that person is added to the buffer matrix and the person's position information in the current video frame is written into it; when a person in the current video frame already exists in the buffer matrix, the person's position information in the current video frame is updated in the buffer matrix; and when a person corresponding to a row of the buffer matrix is not among the persons detected by the target tracking network model in the current video frame, the row data corresponding to that person is deleted.
In one example, for a given video frame, the server runs the target tracking network model to obtain the persons in the current frame and their ID information. If a target ID does not exist in the buffer matrix B, a new row is created for it and the target information, i.e., the target's coordinate position, is recorded under the current frame number. If the target ID already exists in B, the target information is appended according to the target ID and frame number. If a target ID in B is absent from the detection result of the current frame, the target has disappeared, and the entire row of position information for that target ID is deleted from B.
In one example, the FairMOT network model is used for person target tracking, which balances accuracy and inference speed.
In step 102, based on the cached person position information in a preset-length sequence of video frames, the person actions of each video frame are recognized, obtaining the action of each person in every frame of the consecutive video frames.
In one example, before the position information output by the target tracking network model is fed into the pre-trained action recognition model, the method further includes: checking the length of each row of the buffer matrix and determining the first target rows whose length is greater than or equal to the preset sequence length, i.e., performing a sequence-length validity check on B. For example, if the sequence length of a target ID is greater than or equal to the preset length L, the results of the first L frames of that target ID are fed into the action recognition model, which outputs an action recognition result for each video frame; at the same time, the target information of the first T video frames is deleted and the trailing T entries of the row are set to empty.
In addition, second target rows of the buffer matrix whose length is less than the preset sequence length are obtained, and the most recently detected action of the person corresponding to such a row is taken as the person action of the current video frame. For example, if the sequence length of a target ID is less than L, the previous prediction result for that target ID is used as the action recognition result of the current frame.
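The following sketch illustrates this sequence-length validity check under the same dict-based buffer representation assumed above; `recognize` stands in for the trained action recognition model and `last_pred` caches each target's previous prediction, both being assumed names.

```python
def check_and_predict(buffer, last_pred, recognize, L, T):
    """Sequence-length validity check over the rows of the buffer matrix.

    Rows with at least L cached positions (first target rows) are sent to
    the action recognition model; shorter rows (second target rows) reuse
    the previous prediction. Returns target id -> per-frame action labels.
    """
    results = {}
    for tid, row in buffer.items():
        if len(row) >= L:
            results[tid] = recognize(row[:L])  # one label per video frame
            last_pred[tid] = results[tid]
            del row[:T]                        # drop the first T frames
        elif tid in last_pred:
            results[tid] = last_pred[tid]      # reuse the last prediction
    return results
```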
Feeding the position information of each person output by the target tracking network model into the pre-trained action recognition model includes: feeding the position information of a target person in L consecutive video frames into the pre-trained action recognition model to obtain the target person's action in each of the L video frames; and feeding the position information of the target person in P consecutive video frames into the model to obtain the target person's action in each of the P video frames, where P is the preset sequence length, the first T of the P video frames overlap with the last T of the L video frames, and T is smaller than the preset sequence length.
In one example, in the action recognition stage, overlapping prediction is performed on parts of a person sequence with overlap length T, where T is set to L/2. For example, action recognition is first run on frames 1-32 of a target ID to obtain the person's actions for frames 1-32, and then on frames 16-48 to obtain the actions for frames 16-48; frames 16-32 of the target ID are therefore predicted twice. Note that the embodiments of the present application do not specifically limit the values of T (T < L) and L.
In one embodiment, the target person's action in each video frame is obtained from the person's actions in each of the L video frames and in each of the P video frames; for the overlapping video frames, the final action of the target person is determined according to the confidences of the multiple recognized actions.
In one example, for video frames in the overlap region, the action classification result with the higher output confidence is selected as the final person action recognition result. Performing overlapping prediction on person sequences in the action recognition stage improves the prediction accuracy of the action recognition model.
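A sketch of this confidence-based merging of two overlapping windows is shown below; it assumes each window's prediction is a list of (label, confidence) pairs, one per frame, which is an assumed interface rather than one defined by the embodiment.

```python
def merge_overlap(win_a, win_b, T):
    """Merge two overlapping per-frame prediction windows of equal length L.

    win_a, win_b: lists of (label, confidence) pairs, one pair per frame;
                  the last T frames of win_a and the first T frames of
                  win_b cover the same video frames.
    Returns merged per-frame predictions for the combined frame span.
    """
    L = len(win_a)
    merged = list(win_a[:L - T])        # frames predicted only by window A
    for pa, pb in zip(win_a[L - T:], win_b[:T]):
        merged.append(pa if pa[1] >= pb[1] else pb)  # higher confidence wins
    merged.extend(win_b[T:])            # frames predicted only by window B
    return merged
```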
In one embodiment, before the persons in the consecutive video frames are localized by the pre-trained target tracking network model, an initial action recognition model is generated and trained on the sample sets of the persons to obtain the trained action recognition model. The action recognition model comprises a backbone network for extracting temporal features and a dense-prediction action classification head for predicting the person action of each frame. The feature dimensions output by the backbone are B×C×L×H×W, where B is the batch size, C the number of channels, L the preset-length sequence of video frames, H the height of the deep feature, and W its width. Generating the initial action recognition model includes: applying a global average pooling operation over the H and W dimensions of the backbone output, updating the output dimensions to B×C×L×1×1; applying fully connected and rearrangement operations to the pooled output to obtain output information of dimensions B×N×L, where N is the number of action classes; and computing the loss function on the B×N×L output according to the number of action classes.
In one example, since 3D convolutions effectively extract temporal action features, ResNet18-3D convolution is used as the network backbone for temporal feature extraction. The dense-prediction action classification head determines the action class of each individual video frame. Assume the number of action classes including background is N=3, the classes being jump, run, and other, where other denotes background; the backbone input dimensions are [16, 3, 32, 112, 112] with a downsampling factor of 16, giving output feature dimensions of [16, 512, 4, 7, 7]. The backbone output features are processed as follows (a code sketch of these steps follows the list):
a. Apply global average pooling to the output features over the H and W dimensions; output dimensions [16, 512, 4, 1, 1].
b. Apply a fully connected operation to the output of step a; output dimensions [16, 96], i.e., 96 output channels.
c. Rearrange the output of step b; output dimensions 16×3×32.
d. Compute the softmax cross-entropy loss on the output of step c along the second dimension, i.e., compute the loss function for every frame of the video sequence.
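A minimal PyTorch sketch of steps a-d, using the example dimensions above (batch 16, 512 channels, 4 temporal positions after the backbone, N=3 classes, L=32 frames), is given below; the class and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseActionHead(nn.Module):
    """Dense-prediction classification head: one action label per frame."""

    def __init__(self, channels=512, temporal=4, num_classes=3, seq_len=32):
        super().__init__()
        self.num_classes = num_classes
        self.seq_len = seq_len
        # Step b: fully connected layer, 512*4 = 2048 inputs -> 3*32 = 96 outputs.
        self.fc = nn.Linear(channels * temporal, num_classes * seq_len)

    def forward(self, feats):
        # feats: backbone output of shape [16, 512, 4, 7, 7].
        x = feats.mean(dim=(3, 4))      # step a: global average pool over H, W
        x = self.fc(x.flatten(1))       # step b: -> [16, 96]
        # Step c: rearrange to [batch, classes, frames] = [16, 3, 32].
        return x.view(-1, self.num_classes, self.seq_len)

head = DenseActionHead()
feats = torch.randn(16, 512, 4, 7, 7)
logits = head(feats)                               # [16, 3, 32]
labels = torch.randint(0, 3, (16, 32))             # one class per frame
loss = F.cross_entropy(logits, labels)             # step d: per-frame softmax CE
```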
The action recognition model performs action discrimination on every video frame, which avoids the problems of choosing a sampling strategy and sampling length, distinguishes the background of a video frame sequence from the action foreground, and strengthens the temporal feature representation capability of the network model.
To make the spatio-temporal action detection method of the embodiments of the present application clearer, the integrated network model inference process is described below with reference to Figure 2 (a schematic code sketch follows the steps):
In step 201, a video frame is input to the target tracking model, and the persons in the current frame and their ID information are obtained.
In step 202, the buffer matrix is updated: if a target ID does not exist in the buffer matrix B, a new row is created for it and the target information, i.e., the target's coordinate position, is recorded under the current frame number; if the target ID already exists in B, the target information is appended according to the target ID and frame number; if a target ID in B is absent from the detection result of the current frame, the target has disappeared, and the entire row of position information for that target ID is deleted from B.
In step 203, it is judged whether the sequence length of the target ID is greater than or equal to the preset length L. If it is, step 204 is executed; otherwise, the previous prediction result is directly taken as the action recognition result of the current frame.
In step 204, the results of the first L frames of the target ID are input into the action recognition model.
In step 205, the spatio-temporal action result is obtained.
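Putting steps 201-205 together, a schematic inference loop might look like the sketch below; `track`, `recognize`, and `render` are assumed helper names (the tracking model, the action recognition model, and a result-drawing routine), not interfaces defined by this application, and the earlier sketches supply `frame_source`, `update_buffer`, and `check_and_predict`.

```python
# Schematic main loop combining the earlier sketches (all helper names assumed).
buffer, last_pred = {}, {}
L, T = 32, 16                                      # example window and overlap

for frame in frame_source("rtsp://example-host:554/stream"):  # step 201 (placeholder URL)
    detections = track(frame)                      # persons and ids in this frame
    update_buffer(buffer, detections)              # step 202: maintain matrix B
    results = check_and_predict(buffer, last_pred, # steps 203-204: length check and
                                recognize, L, T)   # model inference on first L frames
    render(frame, detections, results)             # step 205: trajectories and actions
```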
In contrast, in current two-stage or single-stage spatio-temporal action detection methods, action recognition is mostly performed by modeling a temporal segment as a whole and outputting a single action class for the segment. In practice a temporal segment rarely consists solely of action frames; it also contains background frames, especially in fast-action scenes such as table tennis or badminton. Discriminating actions over a long sequence as a whole raises a sampling-strategy problem: if sampling is too short, action features cannot be fully extracted; if it is too long, too many background features are mixed in, degrading action discrimination. Moreover, because action frames cannot be localized precisely, modeling the entire temporal segment makes it difficult to obtain a robust temporal feature representation, so different persons and different actions in long videos cannot be accurately localized and recognized.
The spatio-temporal action detection method provided by the embodiments of the present application uses a two-stage approach of target tracking and person action recognition, where target tracking extracts and associates spatial feature information and action recognition extracts temporal features. Training the two parts separately speeds up network convergence, reduces training difficulty, lowers the interdependence of the two network structures, and improves the accuracy of spatio-temporal action recognition. In addition, the dense-prediction action recognition method resolves the problems of sampling strategy and sampling length: by discriminating the action in every video frame it distinguishes the background of the video frame sequence from the action foreground and strengthens the temporal feature representation of the network model. Different persons and different actions in long videos can thus be accurately localized and recognized, with high robustness and high accuracy. The method can therefore be applied in real scenarios such as industrial production, agricultural production, and daily life, replacing traditional manual inspection and statistics schemes, reducing manual intervention, and improving work efficiency; it has broad market applications and considerable research and economic value.
The division of the above methods into steps is only for clarity of description; in implementation the steps may be merged into one or split into several, and all such variants fall within the protection scope of this application as long as the same logical relationship is preserved. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without changing the core design of the algorithm and flow, also falls within the protection scope of this application.
An embodiment of the present application further relates to a spatio-temporal action detection apparatus, as shown in Figure 3, comprising a position recognition module 301 and an action recognition module 302.
Specifically, the position recognition module 301 is configured to localize each person in consecutive video frames, obtain the position information of each person in each video frame, and cache that position information. The action recognition module 302 is configured to recognize the person action of each video frame from the cached person position information in a preset-length sequence of video frames, obtaining the action of each person in every frame of the consecutive video frames.
In one example, the position recognition module 301 localizes each person in the consecutive video frames through the pre-trained target tracking network model and stores the position information output by the model in a buffer matrix whose element b_{j,i} denotes the position information of person j in frame i, j being the row of the element and i its column; the target tracking network model is used to detect the position information of each person in each video frame. The action recognition module 302 feeds the position information of each person stored in the buffer matrix into the pre-trained action recognition model and, from the model's output, obtains the action of each person in every frame of the consecutive video frames; the action recognition model recognizes the person action of each video frame from the person position information in a preset-length sequence of video frames.
In one example, for a given video frame, the position recognition module 301 runs the target tracking network model to obtain the persons in the current frame and their ID information. If a target ID does not exist in the buffer matrix B, a new row is created for it and the target information, i.e., the target's coordinate position, is recorded under the current frame number; if the target ID already exists in B, the target information is appended according to the target ID and frame number; if a target ID in B is absent from the detection result of the current frame, the target has disappeared, and the entire row of position information for that target ID is deleted from B.
In one example, the spatio-temporal action detection apparatus of this embodiment further comprises a length detection module (not shown in the figure). Before the position information output by the target tracking network model is fed into the pre-trained action recognition model, the length detection module checks the length of each row of the buffer matrix and determines the first target rows whose length is greater than or equal to the preset sequence length, i.e., it performs a sequence-length validity check on B. For example, if the sequence length of a target ID is greater than or equal to the preset length L, the results of the first L frames of that target ID are fed into the action recognition model, which outputs an action recognition result for each video frame; at the same time, the target information of the first T video frames is deleted and the trailing T entries of the row are set to empty.
In addition, the length detection module obtains second target rows of the buffer matrix whose length is less than the preset sequence length, and takes the most recently detected action of the person corresponding to such a row as the person action of the current video frame. For example, if the sequence length of a target ID is less than L, the previous prediction result for that target ID is used as the action recognition result of the current frame.
In one example, the action recognition module 302 performs overlapping prediction on parts of a person sequence with overlap length T, where T is set to L/2. For example, action recognition is first run on frames 1-32 of a target ID to obtain the person's actions for frames 1-32, and then on frames 16-48 to obtain the actions for frames 16-48; frames 16-32 of the target ID are therefore predicted twice. Note that the embodiments of the present application do not specifically limit the values of T (T < L) and L.
In one embodiment, the action recognition module 302 obtains the target person's action in each video frame from the person's actions in each of the L video frames and in each of the P video frames; for the overlapping video frames, the final action of the target person is determined according to the confidences of the multiple recognized actions.
The spatio-temporal action detection apparatus provided by the embodiments of the present application uses a two-stage approach of target tracking and person action recognition, where target tracking extracts and associates spatial feature information and action recognition extracts temporal features. Training the two parts separately speeds up network convergence, reduces training difficulty, lowers the interdependence of the two network structures, and improves the accuracy of spatio-temporal action recognition. In addition, the dense-prediction action recognition method resolves the problems of sampling strategy and sampling length: by discriminating the action in every video frame it distinguishes the background of the video frame sequence from the action foreground and strengthens the temporal feature representation of the network model. Different persons and different actions in long videos can thus be accurately localized and recognized, with high robustness and high accuracy.
It is readily seen that this embodiment is an apparatus embodiment corresponding to the above method embodiment and can be implemented in cooperation with it. The technical details mentioned in the above method embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the technical details mentioned in this embodiment also apply to the above method embodiment.
It is worth mentioning that the modules involved in the above embodiments of this application are logical modules. In practical applications, a logical unit may be a physical unit, part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of this application, units less closely related to solving the technical problem raised by this application are not introduced in this embodiment, which does not mean that no other units exist in this embodiment.
An embodiment of the present application further provides an electronic device, as shown in Figure 4, comprising at least one processor 401 and a memory 402 communicatively connected to the at least one processor 401, wherein the memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 to enable the at least one processor to perform the above spatio-temporal action detection method.
The memory and the processor are connected by a bus. The bus may comprise any number of interconnected buses and bridges linking the one or more processors and the various circuits of the memory together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor is transmitted over a wireless medium through an antenna; the antenna further receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory may be used to store data used by the processor when performing operations.
The above products can execute the method provided by the embodiments of the present application and have the corresponding functional modules and beneficial effects of executing the method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements the above method embodiments.
Those skilled in the art can understand that all or some of the steps of the methods in the above embodiments can be completed by a program instructing relevant hardware. The program is stored in a storage medium and includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are provided for those of ordinary skill in the art to implement and use the present application. Those of ordinary skill in the art may make various modifications or changes to the above embodiments without departing from the inventive concept of the present application. Therefore, the protection scope of the present application is not limited by the above embodiments but should conform to the maximum scope of the innovative features mentioned in the claims.

Claims (10)

  1. A spatio-temporal action detection method, comprising:
    locating each person in consecutive video frames to obtain position information of each person in each video frame, and caching the position information of each person in each video frame;
    recognizing a person action of each video frame according to cached person position information in video frames of a preset-length sequence, to obtain the person action of each person in each video frame of the consecutive video frames.
  2. The spatio-temporal action detection method according to claim 1, wherein locating each person in the consecutive video frames to obtain the position information of each person in each video frame comprises:
    locating each person in the consecutive video frames through a pre-trained target tracking network model, wherein the target tracking network model is configured to detect the position information of each person in each video frame;
    wherein caching the position information of each person in each video frame comprises: storing the position information of each person output by the target tracking network model in a buffer matrix, wherein each element b_{j,i} of the buffer matrix represents position information of a person j in an i-th frame, j denotes the row in which the element is located, and i denotes the column in which the element is located;
    wherein recognizing the person action of each video frame according to the cached person position information in the video frames of the preset-length sequence, to obtain the person action of each person in each video frame of the consecutive video frames, comprises:
    inputting the position information of each person stored in the buffer matrix into a pre-trained action recognition model, and obtaining, according to an output result of the action recognition model, the person action of each person in each video frame of the consecutive video frames;
    wherein the action recognition model is configured to recognize the person action of each video frame according to person position information in video frames of a preset-length sequence.
  3. The spatio-temporal action detection method according to claim 2, wherein inputting the position information of each person stored in the buffer matrix into the pre-trained action recognition model comprises:
    detecting a length of each row in the buffer matrix, and determining a first target row whose length is greater than or equal to the preset length sequence;
    inputting the first L entries of data of the first target row into the pre-trained action recognition model, wherein L is the preset length sequence.
  4. The spatio-temporal action detection method according to claim 3, further comprising, after detecting the length of each row in the buffer matrix:
    obtaining a second target row of the buffer matrix whose length is less than the preset length sequence;
    taking a last detected person action of a person corresponding to the second target row as the person action of a current video frame.
  5. The spatio-temporal action detection method according to claim 2, wherein storing the position information of each person output by the target tracking network model in the buffer matrix comprises:
    in a case where a person in a current video frame output by the target tracking network model does not exist in the buffer matrix, adding a row corresponding to the person in the buffer matrix, and updating the position information of the person in the current video frame in the buffer matrix;
    in a case where a person in the current video frame output by the target tracking network model exists in the buffer matrix, updating the position information of the person in the current video frame in the buffer matrix;
    in a case where a person corresponding to a row in the buffer matrix is not included among persons detected in the current video frame output by the target tracking network model, deleting the row data corresponding to the person not included.
  6. The spatio-temporal action detection method according to any one of claims 2 to 4, wherein inputting the position information of each person stored in the buffer matrix into the pre-trained action recognition model comprises:
    inputting position information of a target person in L consecutive video frames into the pre-trained action recognition model to obtain a person action of the target person in each of the L video frames, wherein L is the preset length sequence;
    inputting position information of the target person in P consecutive video frames into the pre-trained action recognition model to obtain a person action of the target person in each of the P video frames, wherein P is the preset length sequence, the first T video frames of the P video frames overlap with the last T video frames of the L video frames, and T is smaller than the preset length sequence;
    wherein obtaining, according to the output result of the action recognition model, the person action of each person in each video frame of the consecutive video frames comprises:
    obtaining the person action of the target person in each video frame according to the person action of the target person in each of the L video frames and the person action of the target person in each of the P video frames;
    wherein a final person action of the target person in the overlapping video frames is determined according to confidences of multiple recognized person actions of the target person in the overlapping video frames.
  7. The spatio-temporal action detection method according to any one of claims 2 to 5, further comprising, before locating each person in the consecutive video frames through the pre-trained target tracking network model:
    generating an initial action recognition model, and training the initial action recognition model according to sample sets of the persons to obtain the trained action recognition model;
    wherein the action recognition model comprises a backbone network for extracting temporal features and a dense-prediction action classification head for predicting the person action of each frame; feature dimensions output by the backbone network comprise B×C×L×H×W, where B denotes a batch size, C a number of channels, L the video frames of the preset-length sequence, H a height of a deep feature, and W a width of the deep feature; and generating the initial action recognition model comprises:
    performing a global average pooling operation on output information of the backbone network over the H and W dimensions, updating output dimensions to B×C×L×1×1;
    performing fully connected and rearrangement operations on the output information with the updated output dimensions to obtain output information of dimensions B×N×L, where N denotes a number of action classes;
    computing a loss function on the output information of dimensions B×N×L according to the number of action classes.
  8. A spatio-temporal action detection apparatus, comprising:
    a position recognition module, configured to locate each person in consecutive video frames, obtain position information of each person in each video frame, and cache the position information of each person in each video frame;
    an action recognition module, configured to recognize a person action of each video frame according to cached person position information in video frames of a preset-length sequence, to obtain the person action of each person in each video frame of the consecutive video frames.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the spatio-temporal action detection method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the spatio-temporal action detection method according to any one of claims 1 to 7.
PCT/CN2022/140123 2021-12-30 2022-12-19 Spatio-temporal action detection method and apparatus, electronic device and storage medium WO2023125119A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111657437.9 2021-12-30
CN202111657437.9A CN116434096A (en) 2021-12-30 2021-12-30 Spatiotemporal motion detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023125119A1 true WO2023125119A1 (en) 2023-07-06

Family

ID=86997723

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/140123 WO2023125119A1 (en) 2021-12-30 2022-12-19 Spatio-temporal action detection method and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN116434096A (en)
WO (1) WO2023125119A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649537A (en) * 2024-01-30 2024-03-05 浙江省公众信息产业有限公司 Monitoring video object identification tracking method, system, electronic equipment and storage medium
CN117953588A (en) * 2024-03-26 2024-04-30 南昌航空大学 Badminton player action intelligent recognition method integrating scene information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170286774A1 (en) * 2016-04-04 2017-10-05 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN110472531A (en) * 2019-07-29 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, device, electronic equipment and storage medium
CN113392676A (en) * 2020-03-12 2021-09-14 北京沃东天骏信息技术有限公司 Multi-target tracking behavior identification method and device
CN113688761A (en) * 2021-08-31 2021-11-23 安徽大学 Pedestrian behavior category detection method based on image sequence

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170286774A1 (en) * 2016-04-04 2017-10-05 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN110472531A (en) * 2019-07-29 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, device, electronic equipment and storage medium
CN113392676A (en) * 2020-03-12 2021-09-14 北京沃东天骏信息技术有限公司 Multi-target tracking behavior identification method and device
CN113688761A (en) * 2021-08-31 2021-11-23 安徽大学 Pedestrian behavior category detection method based on image sequence

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649537A (en) * 2024-01-30 2024-03-05 浙江省公众信息产业有限公司 Monitoring video object identification tracking method, system, electronic equipment and storage medium
CN117649537B (en) * 2024-01-30 2024-04-26 浙江省公众信息产业有限公司 Monitoring video object identification tracking method, system, electronic equipment and storage medium
CN117953588A (en) * 2024-03-26 2024-04-30 南昌航空大学 Badminton player action intelligent recognition method integrating scene information

Also Published As

Publication number Publication date
CN116434096A (en) 2023-07-14

Similar Documents

Publication Publication Date Title
WO2023125119A1 (en) Spatio-temporal action detection method and apparatus, electronic device and storage medium
US10726313B2 (en) Active learning method for temporal action localization in untrimmed videos
CN108447080B (en) Target tracking method, system and storage medium based on hierarchical data association and convolutional neural network
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN109919977B (en) Video motion person tracking and identity recognition method based on time characteristics
WO2016107482A1 (en) Method and device for determining identity identifier of human face in human face image, and terminal
CN110766724B (en) Target tracking network training and tracking method and device, electronic equipment and medium
EP1975879B1 (en) Computer implemented method for tracking object in sequence of frames of video
JP2019036008A (en) Control program, control method, and information processing device
JP2019036009A (en) Control program, control method, and information processing device
CN113326835B (en) Action detection method and device, terminal equipment and storage medium
US20140126830A1 (en) Information processing device, information processing method, and program
CN113327272B (en) Robustness long-time tracking method based on correlation filtering
WO2021169642A1 (en) Video-based eyeball turning determination method and system
JP2022542199A (en) KEYPOINT DETECTION METHOD, APPARATUS, ELECTRONICS AND STORAGE MEDIA
US11694342B2 (en) Apparatus and method for tracking multiple objects
CN110610123A (en) Multi-target vehicle detection method and device, electronic equipment and storage medium
CN112131944B (en) Video behavior recognition method and system
CN111291749B (en) Gesture recognition method and device and robot
CN102855635A (en) Method and device for determining human body action cycles and recognizing human body actions
CN110766725A (en) Template image updating method and device, target tracking method and device, electronic equipment and medium
CN114676756A (en) Image recognition method, image recognition device and computer storage medium
CN111814653B (en) Method, device, equipment and storage medium for detecting abnormal behavior in video
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
Wang et al. Multi-scale aggregation network for temporal action proposals

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22914373

Country of ref document: EP

Kind code of ref document: A1