WO2023125119A1 - Spatio-temporal action detection method and apparatus, electronic device and storage medium - Google Patents

Spatio-temporal action detection method and apparatus, electronic device and storage medium Download PDF

Info

Publication number
WO2023125119A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
action
video frame
target
position information
Prior art date
Application number
PCT/CN2022/140123
Other languages
French (fr)
Chinese (zh)
Inventor
葛成伟
童俊文
关涛
李健
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2023125119A1 publication Critical patent/WO2023125119A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/70: Determining position or orientation of objects or cameras
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • The present application relates to the field of computer vision and deep learning, and in particular to a spatio-temporal action detection method and apparatus, an electronic device, and a storage medium.
  • Spatio-temporal action detection refers to locating the different persons in a given untrimmed video, analyzing the actions of the located persons, and outputting the action type of each person. Compared with action recognition, spatio-temporal action detection needs to model the action of each person, whereas action recognition models the action of the entire video. Usually there are multiple persons in the analyzed video and their actions are inconsistent, so modeling the action of the entire video is clearly inappropriate.
  • Spatio-temporal action detection comprises two sub-tasks: person localization in the spatial domain and action analysis in the temporal domain.
  • Existing spatio-temporal action detection methods can be divided into two-stage and single-stage approaches. However, whether two-stage or single-stage, most current action recognition treats a temporal segment as a whole for action modeling and outputs one action category for that segment. This suffers from inappropriate sampling strategies, overly long sampling lengths, inaccurate localization of action frames, and poor temporal feature representation, so different persons and different actions in long videos cannot be accurately located and recognized.
  • The purpose of this application is to solve the above problems by providing a spatio-temporal action detection method and apparatus, an electronic device, and a storage medium, which address inappropriate sampling strategy selection, overly long sampling lengths, the inability to accurately locate action frames, and poor temporal feature representation, thereby achieving accurate localization and recognition of different persons and different actions in long videos.
  • An embodiment of the present application provides a spatio-temporal action detection method. The method includes: locating each person in consecutive video frames, obtaining the position information of each person in each video frame, and caching the position information of each person in each video frame; and identifying the person actions of each video frame according to the cached person position information in a preset-length sequence of video frames, obtaining the action of each person in each of the consecutive video frames.
  • An embodiment of the present application provides a spatio-temporal action detection apparatus. The apparatus includes: a position recognition module, configured to locate each person in consecutive video frames, obtain the position information of each person in each video frame, and cache the position information of each person in each video frame; and an action recognition module, configured to identify the person actions of each video frame according to the cached person position information in a preset-length sequence of video frames, and obtain the action of each person in each of the consecutive video frames.
  • An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above spatio-temporal action detection method.
  • An embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the above spatio-temporal action detection method is implemented.
  • By obtaining the action of each person in each of the consecutive video frames, the problems of sampling strategy and sampling length selection are solved; performing action discrimination on each video frame can distinguish the background information of the video frame sequence from the action foreground information, which enhances the temporal feature representation ability of the network model.
  • Fig. 1 is a flowchart of a spatio-temporal action detection method provided by an embodiment of the present application;
  • Fig. 2 is a flowchart of integrated network model inference provided by an embodiment of the present application;
  • Fig. 3 is a schematic structural diagram of a spatio-temporal action detection device provided by an embodiment of the present application;
  • Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • An embodiment of the present application relates to a spatio-temporal action detection method: first, the persons are located to obtain position information, and the acquired position information of each person is cached; then, according to the cached person position information in a preset-length sequence of video frames, the person actions of each video frame are identified, obtaining the action of each person in each of the consecutive video frames. This solves the problems of sampling strategy and sampling length selection, and performing action discrimination on each video frame can distinguish the background and action foreground information of the video frame sequence, enhancing the temporal feature representation ability of the network model. Accurate localization and recognition of different persons and different actions in long videos is thereby realized.
  • In one example, each person in consecutive video frames can be located by a pre-trained target tracking network model, where the target tracking network model is used to detect the position information of each person in each video frame.
  • The position information of each person output by the target tracking network model is stored in a buffer matrix S. Each element S_i^j of the buffer matrix represents the position information of the j-th person in the i-th frame, where j denotes the row of the element and i denotes its column.
  • Therefore, in one example, the spatio-temporal action detection method can consist of two stages, a network model training stage and a network model inference stage, detailed as follows:
  • The network model training stage includes the training of the target tracking network model and the training of the action recognition model. The basic steps of target tracking network model training are as follows:
  • Network model design: the target tracking network model locates the persons in the video and associates them over time; commonly used multi-object tracking networks such as DeepSORT, CenterTrack, and FairMOT can all be used;
  • Sample labeling: using a single-category target label, annotate the persons in the video with rectangular boxes, assigning different target IDs to different persons;
  • Model training: train the model with the labeled person samples; after a certain number of training iterations, the person target tracking model file is obtained.
  • For action recognition model training, the entire network model includes a temporal feature extraction backbone and a dense prediction action classification head; any temporal network model can serve as the backbone of this application, such as a 3D convolutional network or a two-stream convolutional network;
  • The dense prediction action classification head is used to determine the action category of each single video frame.
  • Assume that the number of action categories, including the background, is N, and that the feature dimension output by the backbone network is B×C×L×H×W, where B is the batch size, C the number of channels, L the video sequence length, H the feature height, and W the feature width. The features are processed as follows (sketched in code below this list):
  • step a: perform a global average pooling operation on the backbone output features over the H and W dimensions; the output dimension is B×C×L×1×1;
  • step b: perform a fully connected operation on the output of step a, i.e., flatten it and map it to the output dimension B×NL, so that the number of output channels is NL;
  • step c: perform a rearrangement operation on the output of step b, i.e., split the NL channels into two parts, one of size N and one of size L; the output dimension is B×N×L;
  • step d: compute the softmax cross-entropy loss on the output of step c along the second (class) dimension, i.e., compute the loss for every frame of the video sequence;
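As an illustration of steps a through d, the following minimal PyTorch-style sketch shows one possible form of the dense prediction head. The class name, the flattening of the pooled features before the fully connected layer, and the concrete dimensions are assumptions for demonstration, not the patent's reference implementation; the numbers match the ResNet18-3D example given later in this document.

```python
import torch
import torch.nn as nn

class DensePredictionHead(nn.Module):
    """Per-frame action classification head (a sketch of steps a-d above).

    Assumes backbone features of shape B x C x Lf x H x W, where Lf is the
    temporal length of the feature map (it may be shorter than the input
    clip length L when the backbone strides over time).
    """

    def __init__(self, channels: int, feat_len: int, clip_len: int, num_classes: int):
        super().__init__()
        self.clip_len = clip_len        # L: frames per input clip
        self.num_classes = num_classes  # N: action classes incl. background
        self.pool = nn.AdaptiveAvgPool3d((feat_len, 1, 1))               # step a
        self.fc = nn.Linear(channels * feat_len, num_classes * clip_len)  # step b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        x = self.pool(x)    # B x C x Lf x 1 x 1: global average pool over H, W
        x = x.flatten(1)    # B x (C * Lf)
        x = self.fc(x)      # B x (N * L): NL output channels
        return x.view(b, self.num_classes, self.clip_len)  # step c: B x N x L

# step d: per-frame softmax cross-entropy along the class (second) dimension.
head = DensePredictionHead(channels=512, feat_len=4, clip_len=32, num_classes=3)
feats = torch.randn(16, 512, 4, 7, 7)      # assumed backbone output
logits = head(feats)                        # 16 x 3 x 32
labels = torch.randint(0, 3, (16, 32))      # one action label per frame
loss = nn.CrossEntropyLoss()(logits, labels)  # accepts B x N x L vs B x L
```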
  • Sample labeling: person samples are extracted for each person in the video according to target ID and bounding box, each person ID forming one sample set, and each video frame in each sample set is annotated with its action category. Model training: for each person sample set, select a continuous frame sequence of fixed length L as network input, with the starting position of the frame sequence chosen at random (see the sketch below); after a certain number of training iterations, the video frame action recognition model file is obtained.
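The random-start clip selection described above might look like the following sketch; `track` and `labels` are hypothetical stand-ins for one person's per-frame positions and action categories.

```python
import random

def sample_training_clip(track, labels, clip_len=32):
    """Sample one fixed-length training clip from a person's sample set (a sketch)."""
    assert len(track) >= clip_len, "track shorter than the clip length L"
    start = random.randint(0, len(track) - clip_len)  # random starting position
    return track[start:start + clip_len], labels[start:start + clip_len]
```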
  • The network model inference module uses the model files obtained through training to perform inference, accurately locating and recognizing the different persons and different actions in the video.
  • The spatio-temporal action detection method provided in the embodiment of the present application includes five parts: system initialization, video frame input, target tracking inference, action recognition inference, and result output.
  • The specific functions of each part are as follows:
  • System initialization: load the offline-trained target tracking network model and action recognition model, initialize the buffer matrix S, and allocate the necessary variables and storage space.
  • Video frame input: load an offline video from disk and read its video frames as the input source, or read a network video stream via rtmp or rtsp as the input source.
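A minimal sketch of opening either kind of input source, assuming OpenCV is used (the document does not name a library):

```python
import cv2  # OpenCV: an assumed choice for reading files and network streams

def open_input(source: str) -> cv2.VideoCapture:
    """Open an offline video file or an rtmp/rtsp network stream as the input source."""
    cap = cv2.VideoCapture(source)  # accepts file paths and rtsp:// or rtmp:// URLs
    if not cap.isOpened():
        raise IOError(f"cannot open input source: {source}")
    return cap

# usage: open_input("video.mp4") or open_input("rtsp://camera-host/stream")
```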
  • Target tracking inference: output the persons and their IDs according to the trained target tracking network model, and update the buffer matrix S accordingly.
  • Action recognition inference: perform action recognition for the different persons according to the trained action recognition model and the buffer matrix S, obtaining the action types.
  • Result output: output the action trajectory lines and action types of the different persons.
  • In step 101, each person in the consecutive video frames is located, the position information of each person in each video frame is obtained, and the position information of each person in each video frame is cached.
  • Specifically, a video frame is input into the target tracking network model, which outputs the position information of each person in the current frame. The server stores the position information of each person output by the target tracking network model in the buffer matrix S, where each element S_i^j indicates the position information of the j-th person in the i-th frame, j indicating the element's row and i its column.
  • The server updates the buffer matrix as follows: when a person in the current video frame output by the target tracking network model does not exist in the buffer matrix, a row corresponding to that person is added to the buffer matrix and the person's position information in the current video frame is written into it; when a person in the current video frame output by the target tracking network model already exists in the buffer matrix, the person's position information in the current video frame is updated in the buffer matrix; and when a person corresponding to a row of the buffer matrix is not among the persons detected by the target tracking network model in the current video frame, the row data corresponding to that person is deleted.
  • In one example, for a given video frame, the server uses the target tracking network model to infer the persons and their ID information in the current frame. If a target ID does not exist in the buffer matrix S, a new row is added to S for that target ID and the target information, i.e., the target's coordinate position information, is updated according to the frame number; if the target ID already exists in S, the target information is appended according to the target ID and frame number; if a target ID in S is absent from the detection results of the current frame, the target has disappeared, and the target ID's information, i.e., the entire row of position information for that target ID, is deleted from S.
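The row update rules above could be expressed as in the following sketch, where each row of the buffer matrix S is represented as a Python list keyed by target ID; this data layout is an illustrative assumption.

```python
def update_buffer(buffer: dict, detections: dict, frame_idx: int) -> None:
    """Update the buffer matrix S after one frame of tracking (a sketch).

    `buffer` maps target id -> list of (frame index, box) entries (one row of S);
    `detections` maps target id -> box for the current frame.
    """
    for target_id, box in detections.items():
        if target_id not in buffer:      # new target: add a row for this id
            buffer[target_id] = []
        buffer[target_id].append((frame_idx, box))  # append position info
    for target_id in list(buffer):       # target absent from current detections:
        if target_id not in detections:  # it disappeared, delete its whole row
            del buffer[target_id]
```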
  • In one example, the FairMOT network model is used for person target tracking, which balances accuracy and inference speed.
  • In step 102, the person actions of each video frame are identified according to the cached person position information in the preset-length sequence of video frames, and the action of each person in each of the consecutive video frames is obtained.
  • In one example, before the position information of each person output by the target tracking network model is input into the pre-trained action recognition model, the length of each row in the buffer matrix is checked to determine the first target rows, i.e., the rows whose length is greater than or equal to the preset sequence length; in other words, a sequence length validity check is performed on the buffer matrix S. For example, if the sequence length of a target ID is greater than or equal to the preset sequence length L, the results of the first L frames of that target ID are input into the action recognition model, which outputs the action recognition result of each video frame; at the same time, the target information of the first T video frames is deleted and the following T entries of S are set to empty.
  • Inputting the position information of each person output by the target tracking network model into the pre-trained action recognition model includes: inputting the position information of a target person in L consecutive video frames into the pre-trained action recognition model to obtain the target person's action in each of the L video frames; and inputting the position information of the target person in P consecutive video frames into the pre-trained action recognition model to obtain the target person's action in each of the P video frames, where P is the preset sequence length, the first T video frames of the P video frames overlap the last T video frames of the L video frames, and T is smaller than the preset sequence length.
  • Overlap prediction is performed on some person sequences; the overlap length is T, and the value of T is L/2. For example, action recognition is performed once on frames 1-32 of a target ID to obtain the person's actions in frames 1-32, and then performed again on frames 16-48 of the target ID to obtain the person's actions in frames 16-48; frames 16-32 of the target ID therefore receive overlapping predictions. It should be noted that the embodiments of the present application do not specifically limit the values of T (T < L) and L.
  • The action of the target person in each video frame is thus obtained; for the overlapping video frames, the final action of the target person is determined according to the confidences of the multiple recognized actions, i.e., the action classification result with the higher confidence in the classification output is selected as the final person action recognition result.
  • Performing overlapping predictions on person sequences increases the prediction accuracy of the action recognition model.
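A minimal sketch of the confidence-based merge for the overlapping frames, assuming per-frame class-score vectors are available from each window (the exact score format is not specified by the document):

```python
def merge_overlap(prev_scores, curr_scores, overlap: int):
    """Merge per-frame class scores from two windows that overlap by T frames (a sketch).

    For each overlapping frame, the prediction whose top score (confidence) is
    higher wins; non-overlapping frames keep their own predictions.
    """
    merged = []
    for a, b in zip(prev_scores[-overlap:], curr_scores[:overlap]):
        merged.append(a if max(a) >= max(b) else b)  # keep the more confident result
    return merged

# usage with L = 32, T = 16: recognize frames 1-32, then 16-48; frames 16-32
# receive two predictions, and the more confident one is kept for each frame.
```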
  • In one example, an initial action recognition model is generated and trained on the sample set of each person to obtain the trained action recognition model. The action recognition model includes a backbone network for extracting temporal features and a dense prediction action classification head for predicting the person action in each frame. The feature dimension output by the backbone network is B×C×L×H×W, where B is the batch size, C the number of channels, L the preset-length sequence of video frames, H the height of the depth feature, and W its width. Generating the initial action recognition model includes: performing a global average pooling operation on the backbone output over the H and W dimensions, updating the output dimension to B×C×L×1×1; and performing fully connected and rearrangement operations on the pooled output to obtain output of dimension B×N×L, where N is the number of action categories.
  • ResNet18-3D convolution is used as the network backbone for temporal feature extraction;
  • In this example, the output feature dimension is [16, 512, 4, 7, 7], and the backbone output features are processed as follows:
  • step a: perform a global average pooling operation over the H and W dimensions; the output dimension is [16, 512, 4, 1, 1];
  • step b: perform a fully connected operation on the output of step a; the output dimension is [16, 96], i.e., the number of output channels is 96;
  • step c: perform a rearrangement operation on the output of step b; the output dimension is 16×3×32 (N = 3 action categories, L = 32 frames);
  • step d: compute the softmax cross-entropy loss on the output of step c along the second dimension, i.e., compute the loss for each frame of the video sequence.
  • The action recognition model performs action discrimination for each video frame, which solves the problems of sampling strategy and sampling length selection, can distinguish the background and action foreground information of the video frame sequence, and enhances the temporal feature representation ability of the network model.
  • In step 201, a video frame is input into the target tracking model, and the persons and their ID information in the current frame are acquired.
  • In step 202, the buffer matrix is updated: if a target ID does not exist in the buffer matrix S, a new row is added for that target ID and the target information, i.e., the target's coordinate position information, is updated according to the frame number; if the target ID already exists in S, the target information is appended according to the target ID and frame number; if a target ID in S is absent from the detection results of the current frame, the target has disappeared, and the entire row of position information for that target ID is deleted from S.
  • In step 203, it is judged whether the sequence length of the target ID is greater than or equal to the preset sequence length L; if so, step 204 is executed; otherwise, the last prediction result is directly used as the action recognition result of the current frame.
  • In step 204, the results of the first L frames of the target ID are input into the action recognition model.
  • In step 205, the spatio-temporal action result is acquired.
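Tying steps 201-205 together, a minimal driver loop might look like the following sketch; `tracker`, `recognizer`, and the choice to retain the last T frames for overlap are assumptions, since the document leaves the exact buffer consumption unspecified.

```python
def process_stream(cap, tracker, recognizer, L=32, T=16):
    """Driver loop following steps 201-205 (a sketch; all callables are assumed).

    `tracker(frame)` is assumed to return {target_id: box} for one frame, and
    `recognizer(row)` to return per-frame actions for an L-frame window.
    """
    buffer, last_pred, frame_idx = {}, {}, 0
    while True:
        ok, frame = cap.read()                       # step 201: next video frame
        if not ok:
            break
        update_buffer(buffer, tracker(frame), frame_idx)  # step 202 (sketch above)
        for target_id, row in buffer.items():
            if len(row) >= L:                        # step 203: length check
                last_pred[target_id] = recognizer(row[:L])  # step 204
                buffer[target_id] = row[T:]          # drop first T, keep overlap
            # otherwise the last prediction stands for the current frame
        frame_idx += 1
    return last_pred                                 # step 205: spatio-temporal results
```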
  • A temporal segment does not consist entirely of action frames; it also contains background frames, especially in scenes with relatively fast action rates, such as playing table tennis or badminton.
  • For action discrimination based on a long sequence as a whole, the sampling strategy is difficult to choose: if the sampling is too short, the action features cannot be fully extracted; if it is too long, too many background features are incorporated, which interferes with action discrimination.
  • Because the action frames cannot be accurately located, it is difficult to obtain a robust temporal feature representation when modeling actions over the entire temporal segment, which leads to the inability to accurately locate and recognize different persons and actions in long videos.
  • The spatio-temporal action detection method provided by the embodiment of the present application is a two-stage detection method consisting of target tracking and person action recognition: target tracking extracts spatial feature information and performs temporal association, while action recognition extracts temporal features. Training the two parts separately therefore increases the convergence speed of the network, reduces the training difficulty, and also reduces the interdependence of the two network structures, increasing the accuracy of spatio-temporal action recognition.
  • The dense-prediction action recognition approach solves the problems of sampling strategy and sampling length selection; action discrimination on each video frame can distinguish the background and action foreground information of the video frame sequence and enhances the temporal feature representation ability of the network model.
  • The method provided in the embodiments of the present application can be applied in real scenarios such as industrial production, agricultural production, and daily life; it can replace traditional manual inspection and statistics schemes, reduce manual intervention, and improve work efficiency, and it has broad market applications and can bring considerable research and economic value.
  • The step division of the above methods is only for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, these variations are within the protection scope of this application. Adding insignificant modifications to the algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm and process, is also within the protection scope of this application.
  • An embodiment of the present application also relates to a spatio-temporal action detection device, as shown in Fig. 3, including a position recognition module 301 and an action recognition module 302.
  • The position recognition module 301 is configured to locate each person in consecutive video frames, obtain the position information of each person in each video frame, and cache the position information of each person in each video frame.
  • The action recognition module 302 is configured to identify the person actions of each video frame according to the cached person position information in the preset-length sequence of video frames, and obtain the action of each person in each of the consecutive video frames.
  • The position recognition module 301 uses a pre-trained target tracking network model to locate each person in the consecutive video frames and stores the position information of each person output by the target tracking network model in the buffer matrix S, where each element S_i^j represents the position information of the j-th person in the i-th frame, j denoting the element's row and i its column; the target tracking network model is used to detect the position information of each person in each video frame.
  • The action recognition module 302 is configured to input the position information of each person stored in the buffer matrix into the pre-trained action recognition model and, according to the output of the action recognition model, obtain the action of each person in each of the consecutive video frames; the action recognition model is used to identify the person action in each video frame according to the person position information in a preset-length sequence of video frames.
  • For a given video frame, the position recognition module 301 uses the target tracking network model to infer the persons and their ID information in the current frame. If a target ID does not exist in the buffer matrix S, a new row is added for that target ID and the target information, i.e., the target's coordinate position information, is updated according to the frame number; if the target ID already exists in S, the target information is appended according to the target ID and frame number; if a target ID in S is absent from the detection results of the current frame, the target has disappeared, and the entire row of position information for that target ID is deleted from S.
  • The spatio-temporal action detection device of the embodiment of the present application further includes a length detection module (not shown in the figure). Before the position information of each person output by the target tracking network model is input into the pre-trained action recognition model, the length detection module detects the length of each row in the buffer matrix to determine the first target rows whose length is greater than or equal to the preset sequence length, i.e., it performs a sequence length validity check on the buffer matrix S. For example, if the sequence length of a target ID is greater than or equal to the preset sequence length L, the results of the first L frames of that target ID are input into the action recognition model, which outputs the action recognition result of each video frame; at the same time, the target information of the first T video frames is deleted and the following T entries of S are set to empty.
  • The length detection module also obtains the second target rows in the buffer matrix, i.e., rows whose length is less than the preset sequence length, and uses the last detected action of the person corresponding to a second target row as that person's action for the current video frame; for example, if the sequence length of a target ID is less than L, the last prediction result of that target ID is used as the action recognition result of the current frame.
  • The action recognition module 302 performs overlap prediction on some person sequences; the overlap length is T, and the value of T is L/2. For example, action recognition is performed once on frames 1-32 of a target ID to obtain the person's actions in frames 1-32, and then performed again on frames 16-48 to obtain the person's actions in frames 16-48, so frames 16-32 of the target ID receive overlapping predictions. It should be noted that the embodiments of the present application do not specifically limit the values of T (T < L) and L.
  • The action recognition module 302 obtains the target person's action in each of the L video frames and in each of the P video frames; for the overlapping video frames, the final action of the target person is determined according to the confidences of the multiple recognized actions of the target person in those frames.
  • The spatio-temporal action detection device adopts a two-stage detection method of target tracking and person action recognition: target tracking extracts spatial feature information and performs temporal association, while action recognition extracts temporal features. Training the two parts separately can therefore increase the convergence speed of the network, reduce the training difficulty, and also reduce the interdependence of the two network structures, increasing the accuracy of spatio-temporal action recognition.
  • The dense-prediction action recognition approach solves the problems of sampling strategy and sampling length selection; action discrimination is performed on each video frame, which can distinguish the background and action foreground information of the video frame sequence and enhances the temporal feature representation ability of the network model. Accurate localization and recognition of different persons and different actions in long videos is realized, with high robustness and high accuracy.
  • This embodiment is a device embodiment corresponding to the above embodiment of the spatio-temporal action detection method, and it can be implemented in cooperation with that method embodiment.
  • The relevant technical details mentioned in the above embodiments of the spatio-temporal action detection method remain valid in this embodiment and, to reduce repetition, are not repeated here.
  • The relevant technical details mentioned in this embodiment can also be applied to the above embodiments of the spatio-temporal action detection method.
  • The modules involved in the above embodiments of the present application are logical modules. A logical unit can be a physical unit, a part of a physical unit, or a combination of physical units.
  • Units that are not closely related to solving the technical problem proposed in the present application are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
  • An embodiment of the present application also provides an electronic device, as shown in Fig. 4, including: at least one processor 401; and a memory communicatively connected to the at least one processor 401, storing instructions executable by the at least one processor 401, the instructions being executed by the at least one processor 401 so that the at least one processor can execute the above spatio-temporal action detection method.
  • The memory and the processor are connected by a bus.
  • The bus may include any number of interconnected buses and bridges, and it connects one or more processors and the various circuits of the memory together.
  • The bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore are not further described herein.
  • The bus interface provides an interface between the bus and the transceiver.
  • A transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other devices over a transmission medium.
  • Data processed by the processor is transmitted over the wireless medium through the antenna; the antenna also receives data and passes it to the processor.
  • The processor is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory can be used to store data used by the processor when performing operations.
  • Embodiments of the present application also provide a computer-readable storage medium storing a computer program; the above method embodiments are implemented when the computer program is executed by a processor.
  • That is, those skilled in the art can understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing the relevant hardware; the program is stored in a storage medium and includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the present application relate to the field of computer vision and deep learning. Disclosed are a spatio-temporal action detection method and apparatus, an electronic device and a storage medium. The method comprises: performing positioning on each person in continuous video frames to obtain position information of each person in each video frame, and caching the position information of each person in each video frame; and identifying actions in each video frame according to the cached position information in the video frames having a preset length sequence, so as to obtain actions of each person in each of the continuous video frames.

Description

Spatio-temporal action detection method and apparatus, electronic device and storage medium
Related Application
This application claims the priority of the Chinese patent application with application number 202111657437.9, filed on December 30, 2021.
Technical Field
The present application relates to the field of computer vision and deep learning, and in particular to a spatio-temporal action detection method and apparatus, an electronic device, and a storage medium.
Background
Spatio-temporal action detection refers to locating the different persons in a given untrimmed video, analyzing the actions of the located persons, and outputting the action type of each person. Compared with action recognition, spatio-temporal action detection needs to model the action of each person, whereas action recognition models the action of the entire video. Usually there are multiple persons in the analyzed video and their actions are inconsistent, so modeling the action of the entire video is clearly inappropriate.
Spatio-temporal action detection comprises two sub-tasks: person localization in the spatial domain and temporal action analysis. Existing spatio-temporal action detection methods can be divided into two-stage and single-stage approaches. However, whether two-stage or single-stage, most current action recognition treats a temporal segment as a whole for action modeling and outputs one action category for that segment. This suffers from inappropriate sampling strategies, overly long sampling lengths, inaccurate localization of action frames, and poor temporal feature representation, so different persons and different actions in long videos cannot be accurately located and recognized.
Summary
The purpose of this application is to solve the above problems by providing a spatio-temporal action detection method and apparatus, an electronic device, and a storage medium, which address inappropriate sampling strategy selection, overly long sampling lengths, the inability to accurately locate action frames, and poor temporal feature representation, thereby achieving accurate localization and recognition of different persons and different actions in long videos.
To solve the above problems, an embodiment of the present application provides a spatio-temporal action detection method. The method includes: locating each person in consecutive video frames, obtaining the position information of each person in each video frame, and caching the position information of each person in each video frame; and identifying the person actions of each video frame according to the cached person position information in a preset-length sequence of video frames, obtaining the action of each person in each of the consecutive video frames.
To solve the above problems, an embodiment of the present application provides a spatio-temporal action detection apparatus. The apparatus includes: a position recognition module, configured to locate each person in consecutive video frames, obtain the position information of each person in each video frame, and cache the position information of each person in each video frame; and an action recognition module, configured to identify the person actions of each video frame according to the cached person position information in a preset-length sequence of video frames, and obtain the action of each person in each of the consecutive video frames.
To solve the above problems, an embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above spatio-temporal action detection method.
To solve the above problems, an embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the above spatio-temporal action detection method is implemented.
In the embodiments of the present application, the persons are first located to obtain position information, and the acquired position information of each person is cached; then, according to the cached person position information in a preset-length sequence of video frames, the person actions of each video frame are identified, obtaining the action of each person in each of the consecutive video frames. This solves the problems of sampling strategy and sampling length selection, and performing action discrimination on each video frame can distinguish the background and action foreground information of the video frame sequence, enhancing the temporal feature representation ability of the network model. Accurate localization and recognition of different persons and different actions in long videos is thereby realized.
Brief Description of the Drawings
One or more embodiments are exemplified by the figures in the corresponding drawings; these illustrations do not constitute a limitation of the embodiments. Elements with the same reference numerals in the drawings represent similar elements, and unless otherwise stated, the figures are not drawn to scale.
Fig. 1 is a flowchart of a spatio-temporal action detection method provided by an embodiment of the present application;
Fig. 2 is a flowchart of integrated network model inference provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a spatio-temporal action detection device provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the implementations of the present application are described in detail below with reference to the accompanying drawings. However, those of ordinary skill in the art can understand that many technical details are provided in each implementation so that readers may better understand the present application; the technical solutions claimed in this application can be realized even without these technical details and with various changes and modifications based on the following implementations.
An embodiment of the present application relates to a spatio-temporal action detection method: first, the persons are located to obtain position information, and the acquired position information of each person is cached; then, according to the cached person position information in a preset-length sequence of video frames, the person actions of each video frame are identified, obtaining the action of each person in each of the consecutive video frames. This solves the problems of sampling strategy and sampling length selection, and performing action discrimination on each video frame can distinguish the background and action foreground information of the video frame sequence, enhancing the temporal feature representation ability of the network model. Accurate localization and recognition of different persons and different actions in long videos is thereby realized.
In one example, each person in consecutive video frames can be located by a pre-trained target tracking network model, where the target tracking network model is used to detect the position information of each person in each video frame. The position information of each person output by the target tracking network model is stored in a buffer matrix S, where each element S_i^j represents the position information of the j-th person in the i-th frame, j denoting the element's row and i its column. The position information of each person stored in the buffer matrix is input into a pre-trained action recognition model, and the action of each person in each of the consecutive video frames is obtained according to the output of the action recognition model; the action recognition model is used to identify the person action in each video frame according to the person position information in a preset-length sequence of video frames.
Therefore, in one example, the spatio-temporal action detection method can consist of two stages: a network model training stage and a network model inference stage. The details are as follows:
The network model training stage includes the training of the target tracking network model and the training of the action recognition model. The basic steps of target tracking network model training are as follows:
(1) Network model design: the target tracking network model locates the persons in the video and associates them over time; commonly used multi-object tracking networks such as DeepSORT, CenterTrack, and FairMOT can all be used;
(2) Sample labeling: using a single-category target label, annotate the persons in the video with rectangular boxes, assigning different target IDs to different persons;
(3) Model training: train the model with the labeled person samples; after a certain number of training iterations, the person target tracking model file is obtained.
The basic steps of action recognition model training are as follows:
(1) The entire network model includes a temporal feature extraction backbone and a dense prediction action classification head; any temporal network model can serve as the backbone of this application, such as a 3D convolutional network or a two-stream convolutional network;
a. The dense prediction action classification head is used to determine the action category of each single video frame. Assuming that the number of action categories including the background is N and that the feature dimension output by the backbone network is B×C×L×H×W, where B is the batch size, C the number of channels, L the video sequence length, H the feature height, and W the feature width, the features are processed as follows:
b. Perform a global average pooling operation on the backbone output features over the H and W dimensions (the head processing process); the output dimension is B×C×L×1×1;
c. Perform a fully connected operation on the output of step b, i.e., flatten it and map it to the output dimension B×NL, so that the number of output channels is NL;
d. Perform a rearrangement operation on the output of step c, i.e., split the NL channels into two parts, one of size N and the other of size L; the output dimension is B×N×L;
e. Compute the softmax cross-entropy loss on the output of step d along the second (class) dimension, i.e., compute the loss for every frame of the video sequence;
(2) Sample labeling: first, person samples are extracted for each person in the video according to target ID and bounding box, each person ID forming one sample set; then, for each person sample set, each video frame is annotated with its corresponding action category;
(3) Model training: for each person sample set, select a continuous frame sequence of fixed length L as network input, with the starting position of the frame sequence chosen at random; after a certain number of training iterations, the video frame action recognition model file is obtained.
The network model inference module uses the model files obtained through training to perform inference, accurately locating and recognizing the different persons and different actions in the video.
The spatio-temporal action detection method provided in the embodiment of the present application includes five parts: system initialization, video frame input, target tracking inference, action recognition inference, and result output. The specific functions of each part are as follows:
System initialization: load the offline-trained target tracking network model and action recognition model, initialize the buffer matrix S, and allocate the necessary variables and storage space.
Video frame input: load an offline video from disk and read its video frames as the input source, or read a network video stream via rtmp or rtsp as the input source.
Target tracking inference: output the persons and their IDs according to the trained target tracking network model, and complete the update of the buffer matrix S.
Action recognition inference: perform action recognition for the different persons according to the trained action recognition model and the buffer matrix S, obtaining the action types.
Result output: output the action trajectory lines and action types of the different persons.
The implementation details of the spatio-temporal action detection method in this embodiment are described below. The following content is provided only to aid understanding and is not necessary for implementing this solution. The specific flow, shown in Fig. 1, may include the following steps:
In step 101, each person in the consecutive video frames is located, the position information of each person in each video frame is obtained, and the position information of each person in each video frame is cached.
Specifically, a video frame is input into the target tracking network model, which outputs the position information of each person in the current frame. The server stores the position information of each person output by the target tracking network model in the buffer matrix S, where each element S_i^j indicates the position information of the j-th person in the i-th frame, j indicating the element's row and i its column:

    S = | S_1^1  S_2^1  ...  S_i^1 |
        | S_1^2  S_2^2  ...  S_i^2 |
        | ...    ...    ...  ...   |
        | S_1^j  S_2^j  ...  S_i^j |
The server updates the buffer matrix as follows: when a person in the current video frame output by the target tracking network model does not exist in the buffer matrix, a row corresponding to that person is added to the buffer matrix and the person's position information in the current video frame is written into it; when a person in the current video frame already exists in the buffer matrix, the person's position information in the current video frame is updated in the buffer matrix; and when a person corresponding to a row of the buffer matrix is not among the persons detected by the target tracking network model in the current video frame, the row data corresponding to that person is deleted.
In one example, for a given video frame, the server runs the target tracking network model to obtain the persons in the current frame and their ID information. If a target ID does not exist in the buffer matrix B, a new row is created for it and the target information, i.e., the target's coordinate position, is recorded under the current frame number. If the target ID already exists in B, the target information is appended according to the target ID and frame number. If a target ID in B is absent from the detection result of the current frame, the target has disappeared, and the entire row of position information for that target ID is deleted from B.
In one example, the FairMOT network model is used for person target tracking, which balances accuracy and inference speed.
In step 102, based on the cached person position information in a preset-length sequence of video frames, the person actions of each video frame are recognized, obtaining the action of each person in every frame of the consecutive video frames.
In one example, before the position information output by the target tracking network model is fed into the pre-trained action recognition model, the method further includes: checking the length of each row of the buffer matrix and determining the first target rows whose length is greater than or equal to the preset sequence length, i.e., performing a sequence-length validity check on B. For example, if the sequence length of a target ID is greater than or equal to the preset length L, the results of the first L frames of that target ID are fed into the action recognition model, which outputs an action recognition result for each video frame; at the same time, the target information of the first T video frames is deleted and the trailing T entries of the row are set to empty.
In addition, second target rows of the buffer matrix whose length is less than the preset sequence length are obtained, and the most recently detected action of the person corresponding to such a row is taken as the person action of the current video frame. For example, if the sequence length of a target ID is less than L, the previous prediction result for that target ID is used as the action recognition result of the current frame.
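The following sketch illustrates this sequence-length validity check under the same dict-based buffer representation assumed above; `recognize` stands in for the trained action recognition model and `last_pred` caches each target's previous prediction, both being assumed names.

```python
def check_and_predict(buffer, last_pred, recognize, L, T):
    """Sequence-length validity check over the rows of the buffer matrix.

    Rows with at least L cached positions (first target rows) are sent to
    the action recognition model; shorter rows (second target rows) reuse
    the previous prediction. Returns target id -> per-frame action labels.
    """
    results = {}
    for tid, row in buffer.items():
        if len(row) >= L:
            results[tid] = recognize(row[:L])  # one label per video frame
            last_pred[tid] = results[tid]
            del row[:T]                        # drop the first T frames
        elif tid in last_pred:
            results[tid] = last_pred[tid]      # reuse the last prediction
    return results
```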
Feeding the position information of each person output by the target tracking network model into the pre-trained action recognition model includes: feeding the position information of a target person in L consecutive video frames into the pre-trained action recognition model to obtain the target person's action in each of the L video frames; and feeding the position information of the target person in P consecutive video frames into the model to obtain the target person's action in each of the P video frames, where P is the preset sequence length, the first T of the P video frames overlap with the last T of the L video frames, and T is smaller than the preset sequence length.
In one example, in the action recognition stage, overlapping prediction is performed on parts of a person sequence with overlap length T, where T is set to L/2. For example, action recognition is first run on frames 1-32 of a target ID to obtain the person's actions for frames 1-32, and then on frames 16-48 to obtain the actions for frames 16-48; frames 16-32 of the target ID are therefore predicted twice. Note that the embodiments of the present application do not specifically limit the values of T (T < L) and L.
In one embodiment, the target person's action in each video frame is obtained from the person's actions in each of the L video frames and in each of the P video frames; for the overlapping video frames, the final action of the target person is determined according to the confidences of the multiple recognized actions.
In one example, for video frames in the overlap region, the action classification result with the higher output confidence is selected as the final person action recognition result. Performing overlapping prediction on person sequences in the action recognition stage improves the prediction accuracy of the action recognition model.
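A sketch of this confidence-based merging of two overlapping windows is shown below; it assumes each window's prediction is a list of (label, confidence) pairs, one per frame, which is an assumed interface rather than one defined by the embodiment.

```python
def merge_overlap(win_a, win_b, T):
    """Merge two overlapping per-frame prediction windows of equal length L.

    win_a, win_b: lists of (label, confidence) pairs, one pair per frame;
                  the last T frames of win_a and the first T frames of
                  win_b cover the same video frames.
    Returns merged per-frame predictions for the combined frame span.
    """
    L = len(win_a)
    merged = list(win_a[:L - T])        # frames predicted only by window A
    for pa, pb in zip(win_a[L - T:], win_b[:T]):
        merged.append(pa if pa[1] >= pb[1] else pb)  # higher confidence wins
    merged.extend(win_b[T:])            # frames predicted only by window B
    return merged
```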
In one embodiment, before the persons in the consecutive video frames are localized by the pre-trained target tracking network model, an initial action recognition model is generated and trained on the sample sets of the persons to obtain the trained action recognition model. The action recognition model comprises a backbone network for extracting temporal features and a dense-prediction action classification head for predicting the person action of each frame. The feature dimensions output by the backbone are B×C×L×H×W, where B is the batch size, C the number of channels, L the preset-length sequence of video frames, H the height of the deep feature, and W its width. Generating the initial action recognition model includes: applying a global average pooling operation over the H and W dimensions of the backbone output, updating the output dimensions to B×C×L×1×1; applying fully connected and rearrangement operations to the pooled output to obtain output information of dimensions B×N×L, where N is the number of action classes; and computing the loss function on the B×N×L output according to the number of action classes.
In one example, since 3D convolutions effectively extract temporal action features, ResNet18-3D convolution is used as the network backbone for temporal feature extraction. The dense-prediction action classification head determines the action class of each individual video frame. Assume the number of action classes including background is N=3, the classes being jump, run, and other, where other denotes background; the backbone input dimensions are [16, 3, 32, 112, 112] with a downsampling factor of 16, giving output feature dimensions of [16, 512, 4, 7, 7]. The backbone output features are processed as follows (a code sketch of these steps follows the list):
a. Apply global average pooling to the output features over the H and W dimensions; output dimensions [16, 512, 4, 1, 1].
b. Apply a fully connected operation to the output of step a; output dimensions [16, 96], i.e., 96 output channels.
c. Rearrange the output of step b; output dimensions 16×3×32.
d. Compute the softmax cross-entropy loss on the output of step c along the second dimension, i.e., compute the loss function for every frame of the video sequence.
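A minimal PyTorch sketch of steps a-d, using the example dimensions above (batch 16, 512 channels, 4 temporal positions after the backbone, N=3 classes, L=32 frames), is given below; the class and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseActionHead(nn.Module):
    """Dense-prediction classification head: one action label per frame."""

    def __init__(self, channels=512, temporal=4, num_classes=3, seq_len=32):
        super().__init__()
        self.num_classes = num_classes
        self.seq_len = seq_len
        # Step b: fully connected layer, 512*4 = 2048 inputs -> 3*32 = 96 outputs.
        self.fc = nn.Linear(channels * temporal, num_classes * seq_len)

    def forward(self, feats):
        # feats: backbone output of shape [16, 512, 4, 7, 7].
        x = feats.mean(dim=(3, 4))      # step a: global average pool over H, W
        x = self.fc(x.flatten(1))       # step b: -> [16, 96]
        # Step c: rearrange to [batch, classes, frames] = [16, 3, 32].
        return x.view(-1, self.num_classes, self.seq_len)

head = DenseActionHead()
feats = torch.randn(16, 512, 4, 7, 7)
logits = head(feats)                               # [16, 3, 32]
labels = torch.randint(0, 3, (16, 32))             # one class per frame
loss = F.cross_entropy(logits, labels)             # step d: per-frame softmax CE
```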
The action recognition model performs action discrimination on every video frame, which avoids the problems of choosing a sampling strategy and sampling length, distinguishes the background of a video frame sequence from the action foreground, and strengthens the temporal feature representation capability of the network model.
To make the spatio-temporal action detection method of the embodiments of the present application clearer, the integrated network model inference process is described below with reference to Figure 2 (a schematic code sketch follows the steps):
In step 201, a video frame is input to the target tracking model, and the persons in the current frame and their ID information are obtained.
In step 202, the buffer matrix is updated: if a target ID does not exist in the buffer matrix B, a new row is created for it and the target information, i.e., the target's coordinate position, is recorded under the current frame number; if the target ID already exists in B, the target information is appended according to the target ID and frame number; if a target ID in B is absent from the detection result of the current frame, the target has disappeared, and the entire row of position information for that target ID is deleted from B.
In step 203, it is judged whether the sequence length of the target ID is greater than or equal to the preset length L. If it is, step 204 is executed; otherwise, the previous prediction result is directly taken as the action recognition result of the current frame.
In step 204, the results of the first L frames of the target ID are input into the action recognition model.
In step 205, the spatio-temporal action result is obtained.
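Putting steps 201-205 together, a schematic inference loop might look like the sketch below; `track`, `recognize`, and `render` are assumed helper names (the tracking model, the action recognition model, and a result-drawing routine), not interfaces defined by this application, and the earlier sketches supply `frame_source`, `update_buffer`, and `check_and_predict`.

```python
# Schematic main loop combining the earlier sketches (all helper names assumed).
buffer, last_pred = {}, {}
L, T = 32, 16                                      # example window and overlap

for frame in frame_source("rtsp://example-host:554/stream"):  # step 201 (placeholder URL)
    detections = track(frame)                      # persons and ids in this frame
    update_buffer(buffer, detections)              # step 202: maintain matrix B
    results = check_and_predict(buffer, last_pred, # steps 203-204: length check and
                                recognize, L, T)   # model inference on first L frames
    render(frame, detections, results)             # step 205: trajectories and actions
```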
In contrast, in current two-stage or single-stage spatio-temporal action detection methods, action recognition is mostly performed by modeling a temporal segment as a whole and outputting a single action class for the segment. In practice a temporal segment rarely consists solely of action frames; it also contains background frames, especially in fast-action scenes such as table tennis or badminton. Discriminating actions over a long sequence as a whole raises a sampling-strategy problem: if sampling is too short, action features cannot be fully extracted; if it is too long, too many background features are mixed in, degrading action discrimination. Moreover, because action frames cannot be localized precisely, modeling the entire temporal segment makes it difficult to obtain a robust temporal feature representation, so different persons and different actions in long videos cannot be accurately localized and recognized.
The spatio-temporal action detection method provided by the embodiments of the present application uses a two-stage approach of target tracking and person action recognition, where target tracking extracts and associates spatial feature information and action recognition extracts temporal features. Training the two parts separately speeds up network convergence, reduces training difficulty, lowers the interdependence of the two network structures, and improves the accuracy of spatio-temporal action recognition. In addition, the dense-prediction action recognition method resolves the problems of sampling strategy and sampling length: by discriminating the action in every video frame it distinguishes the background of the video frame sequence from the action foreground and strengthens the temporal feature representation of the network model. Different persons and different actions in long videos can thus be accurately localized and recognized, with high robustness and high accuracy. The method can therefore be applied in real scenarios such as industrial production, agricultural production, and daily life, replacing traditional manual inspection and statistics schemes, reducing manual intervention, and improving work efficiency; it has broad market applications and considerable research and economic value.
The division of the above methods into steps is only for clarity of description; in implementation the steps may be merged into one or split into several, and all such variants fall within the protection scope of this application as long as the same logical relationship is preserved. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without changing the core design of the algorithm and flow, also falls within the protection scope of this application.
An embodiment of the present application further relates to a spatio-temporal action detection apparatus, as shown in Figure 3, comprising a position recognition module 301 and an action recognition module 302.
Specifically, the position recognition module 301 is configured to localize each person in consecutive video frames, obtain the position information of each person in each video frame, and cache that position information. The action recognition module 302 is configured to recognize the person action of each video frame from the cached person position information in a preset-length sequence of video frames, obtaining the action of each person in every frame of the consecutive video frames.
In one example, the position recognition module 301 localizes each person in the consecutive video frames through the pre-trained target tracking network model and stores the position information output by the model in a buffer matrix whose element b_{j,i} denotes the position information of person j in frame i, j being the row of the element and i its column; the target tracking network model is used to detect the position information of each person in each video frame. The action recognition module 302 feeds the position information of each person stored in the buffer matrix into the pre-trained action recognition model and, from the model's output, obtains the action of each person in every frame of the consecutive video frames; the action recognition model recognizes the person action of each video frame from the person position information in a preset-length sequence of video frames.
In one example, for a given video frame, the position recognition module 301 runs the target tracking network model to obtain the persons in the current frame and their ID information. If a target ID does not exist in the buffer matrix B, a new row is created for it and the target information, i.e., the target's coordinate position, is recorded under the current frame number; if the target ID already exists in B, the target information is appended according to the target ID and frame number; if a target ID in B is absent from the detection result of the current frame, the target has disappeared, and the entire row of position information for that target ID is deleted from B.
In one example, the spatio-temporal action detection apparatus of this embodiment further comprises a length detection module (not shown in the figure). Before the position information output by the target tracking network model is fed into the pre-trained action recognition model, the length detection module checks the length of each row of the buffer matrix and determines the first target rows whose length is greater than or equal to the preset sequence length, i.e., it performs a sequence-length validity check on B. For example, if the sequence length of a target ID is greater than or equal to the preset length L, the results of the first L frames of that target ID are fed into the action recognition model, which outputs an action recognition result for each video frame; at the same time, the target information of the first T video frames is deleted and the trailing T entries of the row are set to empty.
In addition, the length detection module obtains second target rows of the buffer matrix whose length is less than the preset sequence length, and takes the most recently detected action of the person corresponding to such a row as the person action of the current video frame. For example, if the sequence length of a target ID is less than L, the previous prediction result for that target ID is used as the action recognition result of the current frame.
In one example, the action recognition module 302 performs overlapping prediction on parts of a person sequence with overlap length T, where T is set to L/2. For example, action recognition is first run on frames 1-32 of a target ID to obtain the person's actions for frames 1-32, and then on frames 16-48 to obtain the actions for frames 16-48; frames 16-32 of the target ID are therefore predicted twice. Note that the embodiments of the present application do not specifically limit the values of T (T < L) and L.
In one embodiment, the action recognition module 302 obtains the target person's action in each video frame from the person's actions in each of the L video frames and in each of the P video frames; for the overlapping video frames, the final action of the target person is determined according to the confidences of the multiple recognized actions.
The spatio-temporal action detection apparatus provided by the embodiments of the present application uses a two-stage approach of target tracking and person action recognition, where target tracking extracts and associates spatial feature information and action recognition extracts temporal features. Training the two parts separately speeds up network convergence, reduces training difficulty, lowers the interdependence of the two network structures, and improves the accuracy of spatio-temporal action recognition. In addition, the dense-prediction action recognition method resolves the problems of sampling strategy and sampling length: by discriminating the action in every video frame it distinguishes the background of the video frame sequence from the action foreground and strengthens the temporal feature representation of the network model. Different persons and different actions in long videos can thus be accurately localized and recognized, with high robustness and high accuracy.
It is readily seen that this embodiment is an apparatus embodiment corresponding to the above method embodiment and can be implemented in cooperation with it. The technical details mentioned in the above method embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the technical details mentioned in this embodiment also apply to the above method embodiment.
It is worth mentioning that the modules involved in the above embodiments of this application are logical modules. In practical applications, a logical unit may be a physical unit, part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of this application, units less closely related to solving the technical problem raised by this application are not introduced in this embodiment, which does not mean that no other units exist in this embodiment.
An embodiment of the present application further provides an electronic device, as shown in Figure 4, comprising at least one processor 401 and a memory 402 communicatively connected to the at least one processor 401, wherein the memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 to enable the at least one processor to perform the above spatio-temporal action detection method.
The memory and the processor are connected by a bus. The bus may comprise any number of interconnected buses and bridges linking the one or more processors and the various circuits of the memory together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor is transmitted over a wireless medium through an antenna; the antenna further receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory may be used to store data used by the processor when performing operations.
The above products can execute the method provided by the embodiments of the present application and have the corresponding functional modules and beneficial effects of executing the method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements the above method embodiments.
Those skilled in the art can understand that all or some of the steps of the methods in the above embodiments can be completed by a program instructing relevant hardware. The program is stored in a storage medium and includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are provided for those of ordinary skill in the art to implement and use the present application. Those of ordinary skill in the art may make various modifications or changes to the above embodiments without departing from the inventive concept of the present application. Therefore, the protection scope of the present application is not limited by the above embodiments but should conform to the maximum scope of the innovative features mentioned in the claims.

Claims (10)

  1. A spatio-temporal action detection method, comprising:
    locating each person in consecutive video frames to obtain position information of each person in each video frame, and caching the position information of each person in each video frame;
    recognizing a person action of each video frame according to cached person position information in video frames of a preset-length sequence, to obtain the person action of each person in each video frame of the consecutive video frames.
  2. The spatio-temporal action detection method according to claim 1, wherein locating each person in the consecutive video frames to obtain the position information of each person in each video frame comprises:
    locating each person in the consecutive video frames through a pre-trained target tracking network model, wherein the target tracking network model is configured to detect the position information of each person in each video frame;
    wherein caching the position information of each person in each video frame comprises: storing the position information of each person output by the target tracking network model in a buffer matrix, wherein each element b_{j,i} of the buffer matrix represents position information of a person j in an i-th frame, j denotes the row in which the element is located, and i denotes the column in which the element is located;
    wherein recognizing the person action of each video frame according to the cached person position information in the video frames of the preset-length sequence, to obtain the person action of each person in each video frame of the consecutive video frames, comprises:
    inputting the position information of each person stored in the buffer matrix into a pre-trained action recognition model, and obtaining, according to an output result of the action recognition model, the person action of each person in each video frame of the consecutive video frames;
    wherein the action recognition model is configured to recognize the person action of each video frame according to person position information in video frames of a preset-length sequence.
  3. The spatio-temporal action detection method according to claim 2, wherein inputting the position information of each person stored in the buffer matrix into the pre-trained action recognition model comprises:
    detecting a length of each row in the buffer matrix, and determining a first target row whose length is greater than or equal to the preset length sequence;
    inputting the first L entries of data of the first target row into the pre-trained action recognition model, wherein L is the preset length sequence.
  4. The spatio-temporal action detection method according to claim 3, further comprising, after detecting the length of each row in the buffer matrix:
    obtaining a second target row of the buffer matrix whose length is less than the preset length sequence;
    taking a last detected person action of a person corresponding to the second target row as the person action of a current video frame.
  5. The spatio-temporal action detection method according to claim 2, wherein storing the position information of each person output by the target tracking network model in the buffer matrix comprises:
    in a case where a person in a current video frame output by the target tracking network model does not exist in the buffer matrix, adding a row corresponding to the person in the buffer matrix, and updating the position information of the person in the current video frame in the buffer matrix;
    in a case where a person in the current video frame output by the target tracking network model exists in the buffer matrix, updating the position information of the person in the current video frame in the buffer matrix;
    in a case where a person corresponding to a row in the buffer matrix is not included among persons detected in the current video frame output by the target tracking network model, deleting the row data corresponding to the person not included.
  6. The spatio-temporal action detection method according to any one of claims 2 to 4, wherein inputting the position information of each person stored in the buffer matrix into the pre-trained action recognition model comprises:
    inputting position information of a target person in L consecutive video frames into the pre-trained action recognition model to obtain a person action of the target person in each of the L video frames, wherein L is the preset length sequence;
    inputting position information of the target person in P consecutive video frames into the pre-trained action recognition model to obtain a person action of the target person in each of the P video frames, wherein P is the preset length sequence, the first T video frames of the P video frames overlap with the last T video frames of the L video frames, and T is smaller than the preset length sequence;
    wherein obtaining, according to the output result of the action recognition model, the person action of each person in each video frame of the consecutive video frames comprises:
    obtaining the person action of the target person in each video frame according to the person action of the target person in each of the L video frames and the person action of the target person in each of the P video frames;
    wherein a final person action of the target person in the overlapping video frames is determined according to confidences of multiple recognized person actions of the target person in the overlapping video frames.
  7. The spatio-temporal action detection method according to any one of claims 2 to 5, further comprising, before locating each person in the consecutive video frames through the pre-trained target tracking network model:
    generating an initial action recognition model, and training the initial action recognition model according to sample sets of the persons to obtain the trained action recognition model;
    wherein the action recognition model comprises a backbone network for extracting temporal features and a dense-prediction action classification head for predicting the person action of each frame; feature dimensions output by the backbone network comprise B×C×L×H×W, where B denotes a batch size, C a number of channels, L the video frames of the preset-length sequence, H a height of a deep feature, and W a width of the deep feature; and generating the initial action recognition model comprises:
    performing a global average pooling operation on output information of the backbone network over the H and W dimensions, updating output dimensions to B×C×L×1×1;
    performing fully connected and rearrangement operations on the output information with the updated output dimensions to obtain output information of dimensions B×N×L, where N denotes a number of action classes;
    computing a loss function on the output information of dimensions B×N×L according to the number of action classes.
  8. A spatio-temporal action detection apparatus, comprising:
    a position recognition module, configured to locate each person in consecutive video frames, obtain position information of each person in each video frame, and cache the position information of each person in each video frame;
    an action recognition module, configured to recognize a person action of each video frame according to cached person position information in video frames of a preset-length sequence, to obtain the person action of each person in each video frame of the consecutive video frames.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the spatio-temporal action detection method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the spatio-temporal action detection method according to any one of claims 1 to 7.
PCT/CN2022/140123 2021-12-30 2022-12-19 Spatio-temporal action detection method and apparatus, electronic device and storage medium WO2023125119A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111657437.9 2021-12-30
CN202111657437.9A CN116434096A (en) 2021-12-30 2021-12-30 Spatiotemporal motion detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023125119A1 true WO2023125119A1 (en) 2023-07-06

Family

ID=86997723

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/140123 WO2023125119A1 (en) 2021-12-30 2022-12-19 Spatio-temporal action detection method and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN116434096A (en)
WO (1) WO2023125119A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649537A (en) * 2024-01-30 2024-03-05 浙江省公众信息产业有限公司 Monitoring video object identification tracking method, system, electronic equipment and storage medium
CN117953588A (en) * 2024-03-26 2024-04-30 南昌航空大学 Badminton player action intelligent recognition method integrating scene information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170286774A1 (en) * 2016-04-04 2017-10-05 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN110472531A (en) * 2019-07-29 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, device, electronic equipment and storage medium
CN113392676A (en) * 2020-03-12 2021-09-14 北京沃东天骏信息技术有限公司 Multi-target tracking behavior identification method and device
CN113688761A (en) * 2021-08-31 2021-11-23 安徽大学 Pedestrian behavior category detection method based on image sequence

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170286774A1 (en) * 2016-04-04 2017-10-05 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN110472531A (en) * 2019-07-29 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, device, electronic equipment and storage medium
CN113392676A (en) * 2020-03-12 2021-09-14 北京沃东天骏信息技术有限公司 Multi-target tracking behavior identification method and device
CN113688761A (en) * 2021-08-31 2021-11-23 安徽大学 Pedestrian behavior category detection method based on image sequence

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649537A (en) * 2024-01-30 2024-03-05 浙江省公众信息产业有限公司 Monitoring video object identification tracking method, system, electronic equipment and storage medium
CN117649537B (en) * 2024-01-30 2024-04-26 浙江省公众信息产业有限公司 Monitoring video object identification tracking method, system, electronic equipment and storage medium
CN117953588A (en) * 2024-03-26 2024-04-30 南昌航空大学 Badminton player action intelligent recognition method integrating scene information

Also Published As

Publication number Publication date
CN116434096A (en) 2023-07-14

Similar Documents

Publication Publication Date Title
WO2023125119A1 (en) Spatio-temporal action detection method and apparatus, electronic device and storage medium
US10726313B2 (en) Active learning method for temporal action localization in untrimmed videos
CN108447080B (en) Target tracking method, system and storage medium based on hierarchical data association and convolutional neural network
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN109919977B (en) Video motion person tracking and identity recognition method based on time characteristics
WO2016107482A1 (en) Method and device for determining identity identifier of human face in human face image, and terminal
CN110766724B (en) Target tracking network training and tracking method and device, electronic equipment and medium
EP1975879B1 (en) Computer implemented method for tracking object in sequence of frames of video
JP2019036008A (en) Control program, control method, and information processing device
JP2019036009A (en) Control program, control method, and information processing device
CN113326835B (en) Action detection method and device, terminal equipment and storage medium
US20140126830A1 (en) Information processing device, information processing method, and program
CN113327272B (en) Robustness long-time tracking method based on correlation filtering
WO2021169642A1 (en) Video-based eyeball turning determination method and system
JP2022542199A (en) KEYPOINT DETECTION METHOD, APPARATUS, ELECTRONICS AND STORAGE MEDIA
US11694342B2 (en) Apparatus and method for tracking multiple objects
CN110610123A (en) Multi-target vehicle detection method and device, electronic equipment and storage medium
CN112131944B (en) Video behavior recognition method and system
CN111291749B (en) Gesture recognition method and device and robot
CN102855635A (en) Method and device for determining human body action cycles and recognizing human body actions
CN110766725A (en) Template image updating method and device, target tracking method and device, electronic equipment and medium
CN114676756A (en) Image recognition method, image recognition device and computer storage medium
CN111814653B (en) Method, device, equipment and storage medium for detecting abnormal behavior in video
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
Wang et al. Multi-scale aggregation network for temporal action proposals

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22914373

Country of ref document: EP

Kind code of ref document: A1