WO2022227512A1

WO2022227512A1 - Single-stage dynamic pose recognition method and apparatus, and terminal device

Info

Publication number: WO2022227512A1
Application number: PCT/CN2021/131680
Authority: WO
Inventors: 邵池; 汤志超; 程骏; 林灿然; 郭渺辰; 庞建新
Original assignee: 深圳市优必选科技股份有限公司
Priority date: 2021-04-26
Filing date: 2021-11-19
Publication date: 2022-11-03
Also published as: CN113011395B; CN113011395A

Abstract

Disclosed in the embodiments of the present application are a single-stage dynamic pose recognition method and apparatus, and a terminal device. By means of the technical solution of the present application, a pose in each frame in video segments is determined on the basis of M video frame sets corresponding to M video segments, which are collected by M video collection apparatuses in the same time period and comprise the same group of dynamic poses, and feature sub-vectors corresponding to video frames in each video frame set are mutually enhanced and fused to realize dynamic pose recognition, thereby effectively enhancing the accuracy of dynamic pose recognition; in addition, M feature extraction models are pre-trained for the M video frame sets, and the feature sub-vectors of the video frames in the corresponding video frame sets are respectively extracted by using the M feature extraction models, thereby ensuring effective extraction of the feature sub-vectors of the video frames in the video frame sets; moreover, an attention level parameter is introduced to take the influence of surrounding feature sub-vectors on a feature vector into full consideration.

Description

A single-stage dynamic pose recognition method, device and terminal device

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the priority of the Chinese Patent Application No. 2021104549677, which was filed with the China Patent Office on April 26, 2021, and is entitled "A Single-Stage Dynamic Pose Recognition Method, Apparatus and Terminal Equipment", the entire contents of which are approved by Reference is incorporated in this application.

technical field

The present application relates to the field of artificial intelligence, and in particular, to a single-stage dynamic pose recognition method, device and terminal device.

Background technique

At present, most dynamic pose recognition methods are based on identifying isolated poses. The input video to the recognition model is manually segmented video clips, and each video clip contains a separate pose (gesture pose or body pose). However, in real-world scenarios, where poses are generally executed continuously, this isolated pose-based approach cannot be directly applied.

Application content

In view of the above problems, the present application proposes a single-stage dynamic pose recognition method, apparatus and terminal device.

The present application proposes a single-stage dynamic pose recognition method, which includes:

Obtaining M video frame sets, the M video frame sets are M video frame sets corresponding to M video segments including the same group of dynamic poses collected by M video capture devices in the same time period, and M≥2;

Utilize the predetermined M feature extraction models to extract the feature sub-vectors of each video frame in the corresponding video frame set respectively;

Determine the corresponding t-th feature vector according to the preset attention level parameter and the t-th feature sub-vector of the m-th video frame set;

The pose in each frame of the video segment is identified according to each feature vector in the M video frame sets.

In the single-stage dynamic pose recognition method described in this application, the t-th feature vector is determined by the following formula:

Represents the t-th feature vector of the m-th video frame set, 0<tp<t, t<t+q<T, |pq|≤1, T is the total number of video frames in the video segment, A represents The attention level parameter,

Represents the feature sub-vector corresponding to the attention level a corresponding to the t-th video frame in the m-th video frame set, a≤A,

represents the t-th feature sub-vector in the m-th video frame set.

The single-stage dynamic pose recognition method described in the present application, wherein the pose in each frame of the video segment is identified according to each feature vector in the M video frame sets, including:

Use the M t-th feature vectors to determine the t-th feature enhancement vector;

Performing global average pooling on the t-th feature enhancement vector to determine the t-th feature pooling vector;

Perform RELU (Rectified Linear Unit, linear rectification function) activation processing to the t-th feature pooling vector to determine the t-th feature activation vector;

A feature fusion vector composed of T feature activation vectors is used to identify the pose in each frame of the video segment.

The single-stage dynamic pose recognition method proposed in this application uses the following formula to determine the t-th feature pooling vector:

Zt represents the t-th feature pooling vector,

represents the t-th feature enhancement vector,

Represents the feature sub-vector corresponding to the attention level a of the t-th video frame in the m-th video frame set.

The single-stage dynamic pose recognition method described in the present application, wherein the feature fusion vector composed of T feature activation vectors is used to identify the pose in each frame of the video segment, including:

Perform hole convolution processing, RELU activation processing, dropout processing and softmax processing on the feature fusion vector composed of the T feature activation vectors in turn to determine the predicted category and corresponding to the pose in each frame in the video segment. predicted probability.

The single-stage dynamic pose recognition method described in this application further includes:

The classification loss for the video segment is calculated according to the following formula:

Ls represents the classification loss of the video segment, C represents the total number of predicted categories, Δt,c represents the classification loss when the pose in the t-th video frame belongs to the predicted category c, and yt, c represents the t-th video frame in the The corresponding prediction probability when the pose of yt belongs to the prediction category c, yt-1,c represents the corresponding prediction probability when the pose in the t-1th video frame belongs to the prediction category c, and ε represents the preset probability threshold.

In the single-stage dynamic pose recognition method described in this application, the poses include gesture poses and/or body poses.

The present application proposes a single-stage dynamic pose recognition device, which includes:

an acquisition module, configured to acquire M video frame sets, where the M video frame sets are M video frame sets corresponding to M video segments including the same group of dynamic poses collected by M video acquisition devices in the same time period, M≥2;

The determination module is used to extract the feature sub-vectors of each video frame in the corresponding video frame set by using the predetermined M feature extraction models; it is also used for the preset attention level parameter and the t th video frame set of the m th video frame set. The eigenvectors determine the corresponding t-th eigenvector;

An identification module, configured to identify the pose in each frame of the video segment according to each feature vector in the M video frame sets.

The present application proposes a terminal device including a memory and a processor, wherein the memory stores a computer program, and the computer program executes the single-stage dynamic pose recognition method described in the present application when running on the processor.

The present application provides a readable storage medium, which stores a computer program, and when the computer program runs on a processor, executes the single-stage dynamic pose recognition method described in the present application.

In the technical solution of the present application, on the one hand, when determining the pose in each frame of the video segment, it is based on M video segments corresponding to M video segments including the same group of dynamic poses collected by M video capture devices in the same time segment. A set of video frames is used, and the feature sub-vectors corresponding to the video frames in each video frame set are used to realize dynamic pose recognition, which effectively enhances the accuracy of dynamic pose recognition; on the other hand, M features are pre-trained for the M video frame sets. Extraction model, using M feature extraction models to extract the feature sub-vectors of each video frame in the corresponding video frame set, thereby ensuring the effective extraction of the feature sub-vectors of each video frame in each video frame set; on the other hand, by introducing attention The force level parameter fully considers that the eigenvector may be affected by the surrounding eigenvectors.

Description of drawings

In order to illustrate the technical solutions of the present application more clearly, the following briefly introduces the accompanying drawings used in the embodiments. It should be understood that the following drawings only show some embodiments of the present application, and therefore should not be It is regarded as a limitation on the protection scope of this application. In the various figures, similar components are numbered similarly.

1 shows a schematic flowchart of a single-stage dynamic pose recognition method proposed by an embodiment of the present application;

FIG. 2 shows a schematic diagram of the relationship between an attention level parameter and a feature sub-vector proposed by an embodiment of the present application;

FIG. 3 shows a schematic diagram of the relationship between another attention level parameter and a feature sub-vector proposed by an embodiment of the present application;

FIG. 4 shows a schematic flowchart of identifying poses in each frame in a video segment according to an embodiment of the present application;

FIG. 5 shows a schematic flowchart of another single-stage dynamic pose recognition method proposed by an embodiment of the present application;

FIG. 6 shows a schematic structural diagram of a single-stage dynamic pose recognition device proposed in an embodiment of the present application.

Description of main component symbols:

10-single-stage dynamic pose recognition device; 11-acquisition module; 12-determination module; 13-recognition module.

Detailed ways

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments.

The components of the embodiments of the present application generally described and illustrated in the drawings herein may be arranged and designed in a variety of different configurations. Thus, the following detailed description of the embodiments of the application provided in the accompanying drawings is not intended to limit the scope of the application as claimed, but is merely representative of selected embodiments of the application. Based on the embodiments of the present application, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present application.

Hereinafter, the terms "comprising", "having" and their cognates, which may be used in various embodiments of the present application, are only intended to denote particular features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as first excluding the presence of or adding one or more other features, numbers, steps, operations, elements, components or combinations of the foregoing or the possibility of a combination of the foregoing.

Furthermore, the terms "first", "second", "third", etc. are only used to differentiate the description and should not be construed as indicating or implying relative importance.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the various embodiments of this application belong. The terms (such as those defined in commonly used dictionaries) will be interpreted as having the same meaning as the contextual meaning in the relevant technical field and will not be interpreted as having an idealized or overly formal meaning, unless explicitly defined in the various embodiments of the present application.

Pose recognition includes gesture pose recognition and/or body pose recognition. Pose recognition is one of the extensive research directions in academia and industry. There are many practical applications, including human-computer interaction, robotics, sign language recognition, Gaming and virtual reality controls, etc. Pose recognition can be further divided into static pose recognition and dynamic pose recognition. The method proposed in this application is mainly used to recognize dynamic poses in videos.

It can be understood that, for dynamic pose recognition, two recognition methods are generally included, for example, a two-stage recognition method and a single-stage recognition method. The two-stage recognition method uses two models for recognition: one model is used to perform pose detection (also called the pose recognition stage, which is used to identify the presence or absence of a pose), and the other model is used to perform pose detection. Perform gesture classification. For example, the pose is first detected by a lightweight 3D-CNN model, and then a heavyweight 3D-CNN classification model is activated for pose classification when the pose is detected. For single-stage recognition methods, frames in the video that do not contain action are labeled as non-pose classes. Compared with the two-stage recognition method, the one-stage recognition method uses only one model for pose classification, and besides being simpler than the two-stage recognition method, the one-stage recognition method also avoids the potential problem of error propagation between stages. For example, in a two-stage recognition method, if the model that detects the pose makes an error in the pose detection stage, that error will propagate to the subsequent classification stages. The single-stage dynamic pose recognition method adopted in this application can detect and classify multiple poses in a single video through a single model. This method detects dynamic poses in videos without a pose preprocessing stage.

Example 1

An embodiment of the present application, as shown in FIG. 1 , a single-stage dynamic pose recognition method includes the following steps:

S100: Acquire M video frame sets, where the M video frame sets are M video frame sets corresponding to M video segments including the same group of dynamic poses collected by M video capture devices in the same time period, and M≥2 .

M video capture devices are generally installed in the same area, and it needs to be ensured that the M video capture devices can capture the same group of dynamic poses at the same time. The M video capture devices may be of different types, for example, RGB image capture devices and RGB-D image (depth image) capture devices may be used simultaneously.

It can be understood that the sets of M video frames corresponding to the M video segments including the same group of dynamic poses collected by the M video collection devices in the same time period may be pre-stored in a database or a storage device. When performing identification, M video frame sets may be obtained from a database or a storage device; or, M video frame sets corresponding to M video segments including the same group of dynamic poses collected by M video acquisition devices in the same time period may be. Real-time upload to the terminal device for recognizing the dynamic pose, so that the terminal device can recognize the dynamic pose in real time; The video capture device with the pose function can acquire the video frame sets corresponding to other video capture devices, so as to realize the recognition of the dynamic pose corresponding to the M video frame sets by using less hardware devices.

Further, there are at least 2 video segments including the same group of dynamic poses collected in the same time period, and optionally, M≥2. It can be understood that when M=2, the complexity of the dynamic pose recognition process is low, the amount of calculation is small, and the recognition speed is fast. With the increase of M, although the complexity of the dynamic pose recognition process increases, the amount of calculation increases, and the recognition speed slows down, but the accuracy of dynamic pose recognition will increase.

S200: Extract feature sub-vectors of each video frame in the corresponding video frame set by using the predetermined M feature extraction models.

Each video frame set includes a plurality of video frames, and the plurality of video frames are sequentially arranged in a time sequence to form a video frame sequence, that is, the video frame sampled first is in the front, and the video frame sampled later is in the back. Considering that each video frame set may be acquired by using different types of video capture devices, for example, an RGB image capture device and an RGB-D image (depth image) capture device may be used at the same time. Therefore, pre-determined M feature extraction models are required, that is, not only a feature extraction model (such as ResNet-based RGB feature extraction) needs to be pre-trained for RGB images, but also a feature extraction model (such as ResNet-based RGB feature extraction) needs to be pre-trained for depth images. deep feature extraction). Further, it is ensured that the feature sub-vectors of each video frame in each video frame set are effectively extracted.

S300: Determine the corresponding t-th feature vector according to the preset attention level parameter and the t-th feature sub-vector of the m-th video frame set.

Each video frame set includes a plurality of video frames, and the plurality of video frames are sequentially arranged in a time sequence to form a video frame sequence, that is, the video frame sampled first is in the front, and the video frame sampled later is in the back. Considering that the t-th feature sub-vector in each video frame set may be affected by surrounding feature sub-vectors, this embodiment introduces an attention level parameter when determining the t-th feature vector of the m-th video frame set. The force level parameter is used to reflect which feature sub-vectors around the t-th feature vector of the m-th video frame set are affected by.

Exemplarily, when A=3, as shown in FIG. 2 , the t-th feature vector of the m-th video frame set is composed of the t-1-th feature vector corresponding to the t-1-th video frame of the m-th video frame set. Eigenvectors

The t-th feature sub-vector corresponding to the t-th video frame

The t+1 th feature sub-vector corresponding to the t+1 th video frame

composition, if

represents the t-th feature vector of the m-th video frame set, then

Exemplarily, when A=8, as shown in FIG. 3 , the t-th feature vector of the m-th video frame set is composed of the t-3-th video frame corresponding to the t-3-th video frame of the m-th video frame set. Eigenvectors

The t-2th feature sub-vector corresponding to the t-2th video frame

The t-1th feature sub-vector corresponding to the t-1th video frame

The t-th feature sub-vector corresponding to the t-th video frame

The t+1th feature sub-vector corresponding to the t+1th video frame

The t+2th feature sub-vector corresponding to the t+2th video frame

The t+3th feature sub-vector corresponding to the t+3th video frame

The t+4th feature sub-vector corresponding to the t+4th video frame

composition, that is

Or, the t-th feature vector of the m-th video frame set is composed of the t-4th feature sub-vector corresponding to the t-4th video frame of the m-th video frame set

The t-3th feature sub-vector corresponding to the t-3th video frame

The t-2th feature sub-vector corresponding to the t-2th video frame

The t-1th feature sub-vector corresponding to the t-1th video frame

The t-th feature sub-vector corresponding to the t-th video frame

The t+1th feature sub-vector corresponding to the t+1th video frame

The t+2th feature sub-vector corresponding to the t+2th video frame

The t+3th feature sub-vector corresponding to the t+3th video frame

composition, that is

Further, the t-th eigenvector can be determined using the following formula:

Represents the t-th feature vector of the m-th video frame set, 0<tp<t, t<t+q<T, |pq|≤1, T is the total number of video frames in the video segment (that is, the th The total number of video frames in the set of m video frames), A represents the attention level parameter,

represents the t-th feature sub-vector in the m-th video frame set.

S400: Identify the pose in each frame of the video segment according to each feature vector in the M video frame sets.

Each video frame set includes multiple feature vectors, and feature enhancement processing, global average pooling processing and activation processing are performed on each feature vector in the M video frame sets respectively to obtain a feature fusion vector, and then use the feature fusion vector Identify the pose in each frame of the video segment.

In the technical solution of this embodiment, on the one hand, when determining the pose in each frame of the video segment, it is based on the corresponding M video segments including the same group of dynamic poses collected by M video capture devices in the same time segment. There are M video frame sets, and the feature sub-vectors corresponding to the video frames in each video frame set are used for mutual enhancement and fusion to realize dynamic pose recognition, which effectively enhances the accuracy of dynamic pose recognition. Train M feature extraction models, and use the M feature extraction models to extract the feature sub-vectors of each video frame in the corresponding video frame set, thereby ensuring that the feature sub-vectors of each video frame in each video frame set are effectively extracted; another aspect , the technical solution of this embodiment introduces the attention level parameter, fully considering that the feature vector may be affected by the surrounding feature sub-vectors.

Further, compared with the two-stage identification method in the technical solution of this embodiment, the single-stage identification method is not only simpler than the two-stage identification method, but also avoids the potential problem of error propagation between stages. For example, in a two-stage recognition method, if the model that detects the pose makes an error in the pose detection stage, that error will propagate to the subsequent classification stages.

Further, based on the identification of the pose in each video frame in the video segment based on the technical solution of this embodiment, the video segment can be segmented. For example, two video frames with adjacent video frames and different poses can be As a segmentation point, the continuous same pose can be used as a segment.

Example 2

In an embodiment of the present application, as shown in FIG. 4 , after obtaining each feature vector in the M video frame sets, the pose recognition in each frame in the video segment includes the following steps:

S410: Determine the t-th feature enhancement vector by using the M t-th feature vectors.

The t-th feature vector of the m-th video frame set can be expressed as

Using the M t-th feature vectors to determine the t-th feature enhancement vector can be

Exemplarily, when M=2, the first video frame set includes multiple feature vectors

t=1, 2, 3, ..., the second video frame set includes multiple feature vectors

t=1, 2, 3, . . . Further, when t=1, the first feature enhancement vector can be expressed as

S420: Perform global average pooling on the t-th feature enhancement vector to determine the t-th feature pooling vector.

Exemplarily, the following formula can be used to determine the t-th feature pooling vector:

Zt represents the t-th feature pooling vector,

represents the t-th feature enhancement vector,

S430: Perform RELU activation processing on the t-th feature pooling vector to determine the t-th feature activation vector.

Exemplarily, performing RELU activation processing on the t-th feature pooling vector to determine the t-th feature activation vector βt, which can be expressed as βt=RELU(Zt)=max(0, Zt).

It can be understood that nonlinear factors can be introduced through RELU activation, so that the technical solution of the present application can solve more complex pose classification and recognition problems. RELU activation is actually a function of taking the maximum value. The ReLU activation function is actually a piecewise linear function, which turns all negative values into 0, while the positive values remain unchanged. This operation can be understood as one-sided suppression. (That is: in the case where the input is negative, it will output 0, then the neuron will not be activated. This means that only some neurons will be activated at the same time, which makes the network very sparse, which in turn affects computation It is very efficient.) It is because of the unilateral inhibition that the neurons in the neural network also have sparse activation. Exemplarily, in a deep neural network model (such as CNN), after the model adds N layers, the activation rate of ReLU neurons will theoretically be reduced by a factor of 2 N times. The ReLU activation function does not have complex exponential operations, so the calculation is simple and the classification and recognition efficiency is high; in addition, the ReLU activation function has faster convergence speed than the Sigmoid/tanh activation function.

S440: Identify the pose in each frame of the video segment by using a feature fusion vector composed of T feature activation vectors.

Feature activation vector βt, t=1, 2, 3, ..., T, feature fusion vector β=[β1, β2, ..., βT] that can be composed of T feature activation vectors.

Further, perform hole convolution processing, RELU activation processing, dropout processing and softmax processing on the feature fusion vector β=[β1, β2,..., βT] in sequence to determine which pose the pose in each frame in the video segment belongs to. Predicted categories and corresponding predicted probabilities.

Exemplarily, atrous convolution processing, RELU activation processing, dropout processing and softmax processing can be represented by the following functional relationship fMEM:

y1,c,y2,c,...,yT,c=fMEM([β1,β2,...,βT]), yt,c indicates the corresponding prediction when the pose in the t-th video frame belongs to the prediction category c Probability, t = 1, 2, 3, ..., T.

Among them, the hole convolution processing is to inject holes on the basis of standard convolution to increase the receptive field, and the hole convolution processing can increase the receptive field while maintaining the size of the feature fusion vector. Dropout processing includes using one-dimensional convolutional layer, dropout layer and one-dimensional convolutional layer to perform dropout processing on the feature fusion vector. When the information of the neural network propagates forward, the dropout makes the activation value of a neuron with a certain probability p stops working, which can make the neural network model more generalizable and avoid over-reliance on some local features, which can effectively alleviate the occurrence of over-fitting, and achieve the effect of regularization to a certain extent. The softmax process uses the softmax function to map the input to a real number between 0 and 1, and the normalization guarantees that the sum is 1, thereby ensuring that the sum of the probabilities of multi-classification is exactly 1.

Example 3

An embodiment of the present application, referring to FIG. 5, shows another single-stage dynamic pose recognition method, which further includes after step S400:

S500: Calculate the classification loss corresponding to the video segment according to the prediction category to which the pose in each frame in the video segment belongs and the corresponding prediction probability.

Exemplarily, the classification loss corresponding to the video segment can be calculated according to the following formula:

On the one hand, the accuracy of the current pose recognition can be determined by the classification loss corresponding to the video segment, that is, the smaller the classification loss corresponding to the video segment, the higher the accuracy of the current pose recognition; on the other hand, it can be used for Evaluate the single-stage dynamic pose recognition model, that is, when training the single-stage dynamic pose recognition model, it can be determined whether the single-stage dynamic pose recognition model meets the standard according to the convergence of the classification loss function. For example, when the classification loss function converges and the classification loss When the loss is less than the preset loss threshold, the single-stage dynamic pose recognition model is trained and can be used to recognize dynamic poses in video segments.

Example 4

An embodiment of the present application, referring to FIG. 6 , shows that a single-stage dynamic pose recognition device 10 includes an acquisition module 11 , a determination module 12 and an identification module 13 .

The acquisition module 11 is configured to acquire M video frame sets, where the M video frame sets are M video frame sets corresponding to M video segments including the same group of dynamic poses collected by M video acquisition devices in the same time period , M≥2; the determination module 12 is used to extract the feature sub-vectors of each video frame in the corresponding video frame set by using the predetermined M feature extraction models; also used for according to the preset attention level parameter and the mth The t-th feature sub-vector of the video frame set determines the corresponding t-th feature vector; the identification module 13 is configured to identify the pose in each frame of the video segment according to each feature vector in the M video frame sets.

Further, the t-th feature vector is determined by the following formula:

represents the t-th feature sub-vector in the m-th video frame set.

Further, identifying the pose in each frame of the video segment according to each feature vector in the M video frame sets, including:

Use the M t-th feature vectors to determine the t-th feature enhancement vector; perform global average pooling on the t-th feature enhancement vector to determine the t-th feature pooling vector; The vector is subjected to RELU activation processing to determine the t-th feature activation vector; the feature fusion vector composed of the T feature activation vectors is used to identify the pose in each frame of the video segment.

Further, use the following formula to determine the t-th feature pooling vector:

Zt represents the t-th feature pooling vector,

represents the t-th feature enhancement vector,

Further, the described feature fusion vector composed of T feature activation vectors is used to identify the pose in each frame in the video segment, including:

Further, the identification module 13 is further configured to calculate the classification loss corresponding to the video segment according to the prediction category to which the pose in each frame in the video segment belongs and the corresponding prediction probability.

Further, the classification loss of the video segment is calculated according to the following formula:

Further, the poses include gesture poses and/or body poses.

The single-stage dynamic pose recognition device 10 disclosed in this embodiment is used in conjunction with the acquisition module 11 , the determination module 12 and the recognition module 13 to execute the single-stage dynamic pose recognition method described in the above embodiments. The related implementations and beneficial effects are also applicable in this embodiment, and will not be repeated here.

It can be understood that this application proposes a terminal device, including a memory and a processor, wherein the memory stores a computer program, and the computer program executes the single-stage dynamic pose recognition method described in this application when running on the processor .

It can be understood that this application provides a readable storage medium, which stores a computer program, and when the computer program runs on a processor, executes the single-stage dynamic pose recognition method described in this application.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are only schematic. For example, the flowcharts and structural diagrams in the accompanying drawings show the possible implementation architectures and functions of the apparatuses, methods and computer program products according to various embodiments of the present application. and operation. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more functions for implementing the specified logical function(s) executable instructions. It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams and/or flow diagrams, and combinations of blocks in the block diagrams and/or flow diagrams, can be implemented using dedicated hardware-based systems that perform the specified functions or actions. be implemented, or may be implemented in a combination of special purpose hardware and computer instructions.

In addition, each functional module or unit in each embodiment of the present application may be integrated together to form an independent part, or each module may exist independently, or two or more modules may be integrated to form an independent part.

If the functions are implemented in the form of software function modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application can be embodied in the form of a software product in essence, or the part that contributes to the prior art or the part of the technical solution. The computer software product is stored in a storage medium, including Several instructions are used to cause a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes: U disk, mobile hard disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program codes .

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited to this. should be covered within the scope of protection of this application.

Claims

A single-stage dynamic pose recognition method, characterized in that the method comprises:

Obtaining M video frame sets, the M video frame sets are M video frame sets corresponding to M video segments including the same group of dynamic poses collected by M video capture devices in the same time period, and M≥2;

Utilize the predetermined M feature extraction models to extract the feature sub-vectors of each video frame in the corresponding video frame set respectively;

Determine the corresponding t-th feature vector according to the preset attention level parameter and the t-th feature sub-vector of the m-th video frame set;

The pose in each frame of the video segment is identified according to each feature vector in the M video frame sets.
The single-stage dynamic pose recognition method according to claim 1, wherein the t-th feature vector is determined by the following formula:

Represents the t-th feature vector of the m-th video frame set, 0<tp<t, t<t+q<T, |pq|≤1, T is the total number of video frames in the video segment, A represents The attention level parameter,
Represents the feature sub-vector corresponding to the attention level a corresponding to the t-th video frame in the m-th video frame set, a≤A,
represents the t-th feature sub-vector in the m-th video frame set.
The single-stage dynamic pose identification method according to claim 2, wherein the identifying the pose in each frame of the video segment according to each feature vector in the M video frame sets, comprising:

Use the M t-th feature vectors to determine the t-th feature enhancement vector;

Performing global average pooling on the t-th feature enhancement vector to determine the t-th feature pooling vector;

Perform RELU activation processing on the t-th feature pooling vector to determine the t-th feature activation vector;

A feature fusion vector composed of T feature activation vectors is used to identify the pose in each frame of the video segment.
The single-stage dynamic pose recognition method according to claim 3, wherein the t-th feature pooling vector is determined by the following formula:

Zt represents the t-th feature pooling vector,
represents the t-th feature enhancement vector,
Represents the feature sub-vector corresponding to the attention level a of the t-th video frame in the m-th video frame set.
The single-stage dynamic pose recognition method according to claim 3, wherein the feature fusion vector composed of T feature activation vectors is used to identify the pose in each frame of the video segment, comprising:

Perform hole convolution processing, RELU activation processing, dropout processing and softmax processing on the feature fusion vector composed of the T feature activation vectors in turn to determine the predicted category and corresponding to the pose in each frame in the video segment. predicted probability.
The single-stage dynamic pose recognition method according to claim 5, further comprising:

The classification loss for the video segment is calculated according to the following formula:

Ls represents the classification loss of the video segment, C represents the total number of predicted categories, Δt,c represents the classification loss when the pose in the t-th video frame belongs to the predicted category c, and yt, c represents the t-th video frame in the The corresponding prediction probability when the pose of yt belongs to the prediction category c, yt-1,c represents the corresponding prediction probability when the pose in the t-1th video frame belongs to the prediction category c, and ε represents the preset probability threshold.
The single-stage dynamic pose recognition method according to any one of claims 1 to 6, wherein the poses include gesture poses and/or body poses.
A single-stage dynamic pose recognition device, characterized in that the device comprises:

an acquisition module, configured to acquire M video frame sets, where the M video frame sets are M video frame sets corresponding to M video segments including the same group of dynamic poses collected by M video acquisition devices in the same time period, M≥2;

The determination module is used to extract the feature sub-vectors of each video frame in the corresponding video frame set by using the predetermined M feature extraction models; it is also used for the preset attention level parameter and the t th video frame set of the m th video frame set. The eigenvectors determine the corresponding t-th eigenvector;

An identification module, configured to identify the pose in each frame of the video segment according to each feature vector in the M video frame sets.
A terminal device, characterized in that it includes a memory and a processor, wherein the memory stores a computer program, and the computer program executes the single-stage dynamic process described in any one of claims 1 to 7 when the computer program runs on the processor. Pose recognition method.
A readable storage medium, characterized in that it stores a computer program, and the computer program executes the single-stage dynamic pose recognition method according to any one of claims 1 to 7 when the computer program runs on a processor.