WO2022227512A1 - Single-stage dynamic pose recognition method and apparatus, and terminal device - Google Patents
- Publication number
- WO2022227512A1 (PCT/CN2021/131680)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- video frame
- video
- vector
- pose
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Definitions
- the present application relates to the field of artificial intelligence, and in particular, to a single-stage dynamic pose recognition method, device and terminal device.
- the present application proposes a single-stage dynamic pose recognition method, apparatus and terminal device.
- the present application proposes a single-stage dynamic pose recognition method, which includes:
- the M video frame sets are M video frame sets corresponding to M video segments that contain the same group of dynamic poses and are collected by M video capture devices in the same time period, and M ≥ 2;
- the pose in each frame of the video segment is identified according to each feature vector in the M video frame sets.
- the t-th feature vector is determined by a formula of the form Ft^m = [xt-p^m, ..., xt^m, ..., xt+q^m], where Ft^m represents the t-th feature vector of the m-th video frame set, 0 < t-p < t, t < t+q < T, |p-q| ≤ 1, and the window covers A consecutive frames
- T is the total number of video frames in the video segment
- A represents the attention level parameter
- the single-stage dynamic pose recognition method described in the present application wherein the pose in each frame of the video segment is identified according to each feature vector in the M video frame sets, including:
- a feature fusion vector composed of T feature activation vectors is used to identify the pose in each frame of the video segment.
- the single-stage dynamic pose recognition method proposed in this application determines the t-th feature pooling vector by applying global average pooling to the t-th feature enhancement vector, where Zt represents the t-th feature pooling vector, Et represents the t-th feature enhancement vector, and xt,a^m represents the feature sub-vector corresponding to attention level a of the t-th video frame in the m-th video frame set.
- the single-stage dynamic pose recognition method described in the present application wherein the feature fusion vector composed of T feature activation vectors is used to identify the pose in each frame of the video segment, including:
- the single-stage dynamic pose recognition method described in this application further includes:
- the classification loss for the video segment is calculated according to a formula in which:
- Ls represents the classification loss of the video segment
- C represents the total number of predicted categories
- Δt,c represents the classification loss when the pose in the t-th video frame belongs to predicted category c
- yt,c represents the prediction probability that the pose in the t-th video frame belongs to predicted category c
- yt-1,c represents the prediction probability that the pose in the (t-1)-th video frame belongs to predicted category c
- ε represents the preset probability threshold.
- the poses include gesture poses and/or body poses.
- the present application proposes a single-stage dynamic pose recognition device, which includes:
- an acquisition module configured to acquire M video frame sets, where the M video frame sets are M video frame sets corresponding to M video segments including the same group of dynamic poses collected by M video acquisition devices in the same time period, M ⁇ 2;
- a determination module, configured to extract the feature sub-vectors of each video frame in the corresponding video frame set by using the M predetermined feature extraction models, and further configured to determine the corresponding t-th feature vector according to the preset attention level parameter and the t-th feature sub-vector of the m-th video frame set;
- An identification module configured to identify the pose in each frame of the video segment according to each feature vector in the M video frame sets.
- the present application proposes a terminal device including a memory and a processor, wherein the memory stores a computer program, and the computer program executes the single-stage dynamic pose recognition method described in the present application when running on the processor.
- the present application provides a readable storage medium, which stores a computer program, and when the computer program runs on a processor, executes the single-stage dynamic pose recognition method described in the present application.
- In the technical solution of the present application, on the one hand, when determining the pose in each frame of the video segment, the determination is based on M video frame sets corresponding to M video segments that contain the same group of dynamic poses and are collected by M video capture devices in the same time period, and the feature sub-vectors corresponding to the video frames in each video frame set are used to realize dynamic pose recognition, which effectively improves the accuracy of dynamic pose recognition; on the other hand, M feature extraction models are pre-trained for the M video frame sets, and the M feature extraction models are used to extract the feature sub-vectors of each video frame in the corresponding video frame set, thereby ensuring effective extraction of the feature sub-vectors of each video frame in each video frame set; on a further aspect, by introducing the attention level parameter, the fact that a feature vector may be affected by the surrounding feature sub-vectors is fully taken into account.
- FIG. 1 shows a schematic flowchart of a single-stage dynamic pose recognition method proposed by an embodiment of the present application
- FIG. 2 shows a schematic diagram of the relationship between an attention level parameter and a feature sub-vector proposed by an embodiment of the present application
- FIG. 3 shows a schematic diagram of the relationship between another attention level parameter and a feature sub-vector proposed by an embodiment of the present application
- FIG. 4 shows a schematic flowchart of identifying poses in each frame in a video segment according to an embodiment of the present application
- FIG. 5 shows a schematic flowchart of another single-stage dynamic pose recognition method proposed by an embodiment of the present application
- FIG. 6 shows a schematic structural diagram of a single-stage dynamic pose recognition device proposed in an embodiment of the present application.
- Pose recognition includes gesture pose recognition and/or body pose recognition. Pose recognition is one of the directions widely studied in academia and industry and already has many practical applications, including human-computer interaction, robotics, sign language recognition, gaming and virtual reality control. Pose recognition can be further divided into static pose recognition and dynamic pose recognition; the method proposed in this application is mainly used to recognize dynamic poses in videos.
- For dynamic pose recognition, two types of recognition methods are generally used, namely two-stage recognition methods and single-stage recognition methods.
- The two-stage recognition method uses two models for recognition: one model performs pose detection (also called the pose recognition stage, used to identify whether a pose is present), and the other model performs gesture classification on the recognized pose. For example, a pose is first detected by a lightweight 3D-CNN model, and a heavyweight 3D-CNN classification model is then activated to classify the pose once it has been detected.
- For single-stage recognition methods, frames in the video that do not contain an action are labeled as a non-pose class.
- Compared with the two-stage recognition method, the single-stage recognition method uses only one model, for pose classification; besides being simpler, it also avoids the potential problem of errors propagating between stages. For example, in a two-stage recognition method, if the pose detection model makes an error in the pose detection stage, that error propagates to the subsequent classification stage.
- the single-stage dynamic pose recognition method adopted in this application can detect and classify multiple poses in a single video through a single model. This method detects dynamic poses in videos without a pose preprocessing stage.
- a single-stage dynamic pose recognition method includes the following steps:
- M video frame sets are M video frame sets corresponding to M video segments that contain the same group of dynamic poses and are collected by M video capture devices in the same time period, and M ≥ 2.
- M video capture devices are generally installed in the same area, and it needs to be ensured that the M video capture devices can capture the same group of dynamic poses at the same time.
- the M video capture devices may be of different types, for example, RGB image capture devices and RGB-D image (depth image) capture devices may be used simultaneously.
- the sets of M video frames corresponding to the M video segments including the same group of dynamic poses collected by the M video collection devices in the same time period may be pre-stored in a database or a storage device.
- the M video frame sets may be obtained from a database or a storage device; alternatively, the M video frame sets corresponding to the M video segments that contain the same group of dynamic poses and are collected by the M video acquisition devices in the same time period may be uploaded in real time to the terminal device used for recognizing the dynamic poses, so that the terminal device can recognize the dynamic poses in real time; or, at least one of the M video capture devices may itself have the dynamic pose recognition function, and this video capture device can acquire the video frame sets corresponding to the other video capture devices, so that the dynamic poses corresponding to the M video frame sets can be recognized with less hardware.
- Each video frame set includes a plurality of video frames, and the plurality of video frames are sequentially arranged in a time sequence to form a video frame sequence, that is, the video frame sampled first is in the front, and the video frame sampled later is in the back.
- each video frame set may be acquired by a different type of video capture device; for example, an RGB image capture device and an RGB-D (depth) image capture device may be used at the same time. Therefore, M predetermined feature extraction models are required: not only does a feature extraction model need to be pre-trained for RGB images (such as ResNet-based RGB feature extraction), but a feature extraction model also needs to be pre-trained for depth images (such as ResNet-based depth feature extraction). This ensures that the feature sub-vectors of each video frame in each video frame set are effectively extracted.
- S300 Determine the corresponding t-th feature vector according to the preset attention level parameter and the t-th feature sub-vector of the m-th video frame set.
- Each video frame set includes a plurality of video frames, and the plurality of video frames are sequentially arranged in a time sequence to form a video frame sequence, that is, the video frame sampled first is in the front, and the video frame sampled later is in the back.
- considering that the t-th feature sub-vector in each video frame set may be affected by the surrounding feature sub-vectors, this embodiment introduces an attention level parameter when determining the t-th feature vector of the m-th video frame set; the attention level parameter is used to reflect which surrounding feature sub-vectors affect the t-th feature vector of the m-th video frame set.
- Exemplarily, when A = 3, the t-th feature vector of the m-th video frame set is composed of the (t-1)-th feature sub-vector corresponding to the (t-1)-th video frame, the t-th feature sub-vector corresponding to the t-th video frame, and the (t+1)-th feature sub-vector corresponding to the (t+1)-th video frame of the m-th video frame set; that is, if Ft^m represents the t-th feature vector of the m-th video frame set, then Ft^m = [xt-1^m, xt^m, xt+1^m].
- Exemplarily, when A = 8, the t-th feature vector of the m-th video frame set is composed of the (t-3)-th through (t+4)-th feature sub-vectors corresponding to the (t-3)-th through (t+4)-th video frames, that is, Ft^m = [xt-3^m, ..., xt^m, ..., xt+4^m]; or of the (t-4)-th through (t+3)-th feature sub-vectors, that is, Ft^m = [xt-4^m, ..., xt^m, ..., xt+3^m].
- Further, the t-th feature vector can be determined by a formula of the form Ft^m = [xt-p^m, ..., xt^m, ..., xt+q^m], where 0 < t-p < t, t < t+q < T, |p-q| ≤ 1, and the window covers A consecutive frames
- T is the total number of video frames in the video segment (that is, the total number of video frames in the m-th video frame set)
- A represents the attention level parameter; xt,a^m represents the feature sub-vector corresponding to attention level a of the t-th video frame in the m-th video frame set, a ≤ A; and xt^m represents the t-th feature sub-vector in the m-th video frame set.
- Each video frame set includes multiple feature vectors; feature enhancement processing, global average pooling and activation processing are performed on the feature vectors in the M video frame sets to obtain a feature fusion vector, and the feature fusion vector is then used to identify the pose in each frame of the video segment.
- when determining the pose in each frame of the video segment, the determination is based on the M video frame sets corresponding to M video segments that contain the same group of dynamic poses and are collected by M video capture devices in the same time period, and the feature sub-vectors corresponding to the video frames in each video frame set are mutually enhanced and fused to realize dynamic pose recognition, which effectively improves the accuracy of dynamic pose recognition.
- the single-stage identification method is not only simpler than the two-stage identification method, but also avoids the potential problem of error propagation between stages. For example, in a two-stage recognition method, if the model that detects the pose makes an error in the pose detection stage, that error will propagate to the subsequent classification stages.
- based on the frame-by-frame pose recognition, the video segment can be partitioned: two adjacent video frames with different poses can be used as a split point, and a run of consecutive frames with the same pose can be treated as one segment.
- the pose recognition in each frame in the video segment includes the following steps:
- S410 Determine the t-th feature enhancement vector by using the M t-th feature vectors.
- the t-th feature vector of the m-th video frame set can be expressed as Ft^m, and the t-th feature enhancement vector Et is determined from the M t-th feature vectors Ft^1, ..., Ft^M
- exemplarily, when M = 2 and t = 1, the first feature enhancement vector E1 is determined from F1^1 and F1^2.
- S420 Perform global average pooling on the t-th feature enhancement vector to determine the t-th feature pooling vector.
- Exemplarily, the t-th feature pooling vector Zt is determined by applying global average pooling to the t-th feature enhancement vector Et, where xt,a^m represents the feature sub-vector corresponding to attention level a of the t-th video frame in the m-th video frame set.
- S430 Perform RELU activation processing on the t-th feature pooling vector to determine the t-th feature activation vector.
- RELU activation is essentially a max-taking function. The ReLU activation function is a piecewise linear function that sets all negative values to 0 while leaving positive values unchanged; this operation can be understood as one-sided suppression (that is, when the input is negative the output is 0 and the neuron is not activated, so only part of the neurons are active at any time, which makes the network sparse and the computation efficient). It is this one-sided suppression that gives the neurons in the neural network sparse activation.
- Exemplarily, in a deep neural network model (such as a CNN), after N layers are added, the activation rate of ReLU neurons is theoretically reduced by a factor of 2^N.
- The ReLU activation function involves no complex exponential operations, so the computation is simple and the classification efficiency is high; in addition, the ReLU activation function converges faster than the Sigmoid/tanh activation functions.
- S440 Identify the pose in each frame of the video segment by using a feature fusion vector composed of T feature activation vectors.
- the feature activation vectors are βt, t = 1, 2, 3, ..., T
- the feature fusion vector composed of the T feature activation vectors can be expressed as [β1, β2, ..., βT].
- the feature fusion vector composed of the T feature activation vectors is sequentially subjected to atrous (dilated) convolution processing, RELU activation processing, dropout processing and softmax processing to determine the prediction category to which the pose in each frame of the video segment belongs and the corresponding prediction probability.
- atrous (dilated) convolution processing injects holes into a standard convolution to enlarge the receptive field; it can increase the receptive field while keeping the size of the feature fusion vector unchanged.
- dropout processing includes using a one-dimensional convolutional layer, a dropout layer and another one-dimensional convolutional layer to process the feature fusion vector.
- dropout stops the activation of a neuron with a certain probability p, which makes the neural network model more generalizable, avoids over-reliance on particular local features, effectively alleviates over-fitting, and achieves a regularization effect to a certain extent.
- softmax processing uses the softmax function to map each input to a real number between 0 and 1 and normalizes the outputs so that they sum to 1, ensuring that the multi-class probabilities sum to exactly 1.
- FIG. 5 shows another single-stage dynamic pose recognition method, which further includes after step S400:
- S500 Calculate the classification loss corresponding to the video segment according to the prediction category to which the pose in each frame in the video segment belongs and the corresponding prediction probability.
- the classification loss corresponding to the video segment can be calculated according to a formula in which:
- Ls represents the classification loss of the video segment
- C represents the total number of predicted categories
- Δt,c represents the classification loss when the pose in the t-th video frame belongs to predicted category c
- yt,c represents the prediction probability that the pose in the t-th video frame belongs to predicted category c
- yt-1,c represents the prediction probability that the pose in the (t-1)-th video frame belongs to predicted category c
- ε represents the preset probability threshold.
- On the one hand, the classification loss of the video segment indicates the accuracy of the current pose recognition: the smaller the classification loss, the higher the accuracy. On the other hand, it can be used to evaluate the single-stage dynamic pose recognition model during training, that is, whether the model meets the standard can be judged from the convergence of the classification loss function. For example, when the classification loss function converges and the classification loss is smaller than a preset loss threshold, the single-stage dynamic pose recognition model is considered trained and can be used to recognize dynamic poses in video segments.
- a single-stage dynamic pose recognition device 10 includes an acquisition module 11 , a determination module 12 and an identification module 13 .
- The acquisition module 11 is configured to acquire M video frame sets, where the M video frame sets correspond to M video segments that contain the same group of dynamic poses and are collected by M video acquisition devices in the same time period, M ≥ 2; the determination module 12 is configured to extract the feature sub-vectors of each video frame in the corresponding video frame set by using the M predetermined feature extraction models, and is further configured to determine the corresponding t-th feature vector according to the preset attention level parameter and the t-th feature sub-vector of the m-th video frame set; the identification module 13 is configured to identify the pose in each frame of the video segment according to the feature vectors in the M video frame sets.
- the t-th feature vector is determined by a formula of the form Ft^m = [xt-p^m, ..., xt^m, ..., xt+q^m], where 0 < t-p < t, t < t+q < T, |p-q| ≤ 1, and the window covers A consecutive frames
- T is the total number of video frames in the video segment
- A represents the attention level parameter
- identifying the pose in each frame of the video segment according to the feature vectors in the M video frame sets includes:
- using the M t-th feature vectors to determine the t-th feature enhancement vector; performing global average pooling on the t-th feature enhancement vector to determine the t-th feature pooling vector; performing RELU activation processing on the t-th feature pooling vector to determine the t-th feature activation vector; and using the feature fusion vector composed of the T feature activation vectors to identify the pose in each frame of the video segment.
- Zt represents the t-th feature pooling vector, Et represents the t-th feature enhancement vector, and xt,a^m represents the feature sub-vector corresponding to attention level a of the t-th video frame in the m-th video frame set.
- the feature fusion vector composed of the T feature activation vectors is used to identify the pose in each frame of the video segment by sequentially performing atrous convolution processing, RELU activation processing, dropout processing and softmax processing on it to determine the prediction category to which the pose in each frame belongs and the corresponding prediction probability.
- the identification module 13 is further configured to calculate the classification loss corresponding to the video segment according to the prediction category to which the pose in each frame in the video segment belongs and the corresponding prediction probability.
- the classification loss of the video segment is calculated according to a formula in which:
- Ls represents the classification loss of the video segment
- C represents the total number of predicted categories
- Δt,c represents the classification loss when the pose in the t-th video frame belongs to predicted category c
- yt,c represents the prediction probability that the pose in the t-th video frame belongs to predicted category c
- yt-1,c represents the prediction probability that the pose in the (t-1)-th video frame belongs to predicted category c
- ε represents the preset probability threshold.
- the poses include gesture poses and/or body poses.
- the single-stage dynamic pose recognition device 10 disclosed in this embodiment executes the single-stage dynamic pose recognition method described in the above embodiments through the cooperation of the acquisition module 11, the determination module 12 and the identification module 13.
- the related implementations and beneficial effects are also applicable in this embodiment, and will not be repeated here.
- this application proposes a terminal device, including a memory and a processor, wherein the memory stores a computer program, and the computer program executes the single-stage dynamic pose recognition method described in this application when running on the processor.
- this application provides a readable storage medium, which stores a computer program, and when the computer program runs on a processor, executes the single-stage dynamic pose recognition method described in this application.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures.
- each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by dedicated hardware-based systems that perform the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
- each functional module or unit in each embodiment of the present application may be integrated together to form an independent part, or each module may exist independently, or two or more modules may be integrated to form an independent part.
- the functions are implemented in the form of software function modules and sold or used as independent products, they can be stored in a computer-readable storage medium.
- In essence, the technical solution of the present application, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
- The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a smart phone, a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.
Abstract
Disclosed in the embodiments of the present application are a single-stage dynamic pose recognition method and apparatus, and a terminal device. By means of the technical solution of the present application, a pose in each frame in video segments is determined on the basis of M video frame sets corresponding to M video segments, which are collected by M video collection apparatuses in the same time period and comprise the same group of dynamic poses, and feature sub-vectors corresponding to video frames in each video frame set are mutually enhanced and fused to realize dynamic pose recognition, thereby effectively enhancing the accuracy of dynamic pose recognition; in addition, M feature extraction models are pre-trained for the M video frame sets, and the feature sub-vectors of the video frames in the corresponding video frame sets are respectively extracted by using the M feature extraction models, thereby ensuring effective extraction of the feature sub-vectors of the video frames in the video frame sets; moreover, an attention level parameter is introduced to take the influence of surrounding feature sub-vectors on a feature vector into full consideration.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 2021104549677, filed with the China Patent Office on April 26, 2021 and entitled "A Single-Stage Dynamic Pose Recognition Method, Apparatus and Terminal Device", the entire contents of which are incorporated herein by reference.
The present application relates to the field of artificial intelligence, and in particular, to a single-stage dynamic pose recognition method, apparatus and terminal device.
At present, most dynamic pose recognition methods are based on recognizing isolated poses: the input to the recognition model consists of manually segmented video clips, each containing a single pose (a gesture pose or a body pose). However, in real-world scenarios poses are generally performed continuously, so such isolated-pose-based methods cannot be applied directly.
SUMMARY
In view of the above problems, the present application proposes a single-stage dynamic pose recognition method, apparatus and terminal device.
The present application proposes a single-stage dynamic pose recognition method, which includes:
acquiring M video frame sets, where the M video frame sets are M video frame sets corresponding to M video segments that contain the same group of dynamic poses and are collected by M video capture devices in the same time period, M ≥ 2;
extracting the feature sub-vectors of each video frame in the corresponding video frame set by using M predetermined feature extraction models;
determining the corresponding t-th feature vector according to a preset attention level parameter and the t-th feature sub-vector of the m-th video frame set; and
identifying the pose in each frame of the video segment according to the feature vectors in the M video frame sets.
In the single-stage dynamic pose recognition method described in the present application, the t-th feature vector is determined by a formula of the form

Ft^m = [xt-p^m, ..., xt^m, ..., xt+q^m]

where Ft^m represents the t-th feature vector of the m-th video frame set, 0 < t-p < t, t < t+q < T, |p-q| ≤ 1, and the window covers A consecutive frames; T is the total number of video frames in the video segment; A represents the attention level parameter; xt,a^m represents the feature sub-vector corresponding to attention level a of the t-th video frame in the m-th video frame set, a ≤ A; and xt^m represents the t-th feature sub-vector in the m-th video frame set.
In the single-stage dynamic pose recognition method described in the present application, identifying the pose in each frame of the video segment according to the feature vectors in the M video frame sets includes:
determining the t-th feature enhancement vector by using the M t-th feature vectors;
performing global average pooling on the t-th feature enhancement vector to determine the t-th feature pooling vector;
performing RELU (Rectified Linear Unit) activation processing on the t-th feature pooling vector to determine the t-th feature activation vector; and
identifying the pose in each frame of the video segment by using a feature fusion vector composed of the T feature activation vectors.
In the single-stage dynamic pose recognition method proposed in the present application, the t-th feature pooling vector is determined by applying global average pooling to the t-th feature enhancement vector, where Zt represents the t-th feature pooling vector, Et represents the t-th feature enhancement vector, and xt,a^m represents the feature sub-vector corresponding to attention level a of the t-th video frame in the m-th video frame set.
In the single-stage dynamic pose recognition method described in the present application, using the feature fusion vector composed of the T feature activation vectors to identify the pose in each frame of the video segment includes: sequentially performing atrous (dilated) convolution processing, RELU activation processing, dropout processing and softmax processing on the feature fusion vector composed of the T feature activation vectors to determine the prediction category to which the pose in each frame of the video segment belongs and the corresponding prediction probability.
The single-stage dynamic pose recognition method described in the present application further includes: calculating the classification loss of the video segment, where Ls represents the classification loss of the video segment, C represents the total number of predicted categories, Δt,c represents the classification loss when the pose in the t-th video frame belongs to predicted category c, yt,c represents the prediction probability that the pose in the t-th video frame belongs to predicted category c, yt-1,c represents the prediction probability that the pose in the (t-1)-th video frame belongs to predicted category c, and ε represents the preset probability threshold.
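The exact loss formula is not reproduced in the text above; the following is a hedged sketch of one plausible form consistent with the stated symbols (Ls, C, Δt,c, yt,c, yt-1,c, ε), in which the per-frame, per-class term penalizes changes in the prediction probability between consecutive frames that exceed the threshold ε. Every detail of this form is an assumption, not the application's definition.

```python
# Hedged sketch only: one plausible segment classification loss that uses the symbols
# defined above. The per-frame, per-class term Delta[t][c] is assumed to measure the
# change in the class-c probability between consecutive frames, ignoring changes
# below the preset threshold epsilon. Treat every detail as an assumption.
from typing import List

def segment_classification_loss(y: List[List[float]], epsilon: float) -> float:
    """y[t][c]: prediction probability that the pose in frame t belongs to class c."""
    T, C = len(y), len(y[0])
    loss = 0.0
    for t in range(1, T):
        for c in range(C):
            delta_tc = abs(y[t][c] - y[t - 1][c])   # change in the class-c probability
            if delta_tc > epsilon:                   # only penalize changes above the threshold
                loss += (delta_tc - epsilon) ** 2
    return loss / (T * C)

# Example with T = 3 frames and C = 2 predicted categories.
probs = [[0.9, 0.1], [0.2, 0.8], [0.25, 0.75]]
print(segment_classification_loss(probs, epsilon=0.05))
```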
In the single-stage dynamic pose recognition method described in the present application, the poses include gesture poses and/or body poses.
The present application proposes a single-stage dynamic pose recognition apparatus, which includes:
an acquisition module, configured to acquire M video frame sets, where the M video frame sets are M video frame sets corresponding to M video segments that contain the same group of dynamic poses and are collected by M video acquisition devices in the same time period, M ≥ 2;
a determination module, configured to extract the feature sub-vectors of each video frame in the corresponding video frame set by using M predetermined feature extraction models, and further configured to determine the corresponding t-th feature vector according to the preset attention level parameter and the t-th feature sub-vector of the m-th video frame set; and
an identification module, configured to identify the pose in each frame of the video segment according to the feature vectors in the M video frame sets.
The present application proposes a terminal device, including a memory and a processor, where the memory stores a computer program, and the computer program, when running on the processor, executes the single-stage dynamic pose recognition method described in the present application.
The present application provides a readable storage medium, which stores a computer program, and the computer program, when running on a processor, executes the single-stage dynamic pose recognition method described in the present application.
According to the technical solution of the present application, on the one hand, when determining the pose in each frame of the video segment, the determination is based on the M video frame sets corresponding to M video segments that contain the same group of dynamic poses and are collected by M video capture devices in the same time period, and the feature sub-vectors corresponding to the video frames in each video frame set are used to realize dynamic pose recognition, which effectively improves the accuracy of dynamic pose recognition; on the other hand, M feature extraction models are pre-trained for the M video frame sets, and the M feature extraction models are used to extract the feature sub-vectors of each video frame in the corresponding video frame set, thereby ensuring effective extraction of the feature sub-vectors of each video frame in each video frame set; on a further aspect, by introducing the attention level parameter, the fact that a feature vector may be affected by the surrounding feature sub-vectors is fully taken into account.
In order to illustrate the technical solutions of the present application more clearly, the accompanying drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings only show some embodiments of the present application and therefore should not be regarded as limiting the protection scope of the present application. In the various figures, similar components are given similar reference numerals.
FIG. 1 shows a schematic flowchart of a single-stage dynamic pose recognition method proposed by an embodiment of the present application;
FIG. 2 shows a schematic diagram of the relationship between an attention level parameter and feature sub-vectors proposed by an embodiment of the present application;
FIG. 3 shows a schematic diagram of the relationship between another attention level parameter and feature sub-vectors proposed by an embodiment of the present application;
FIG. 4 shows a schematic flowchart of identifying the pose in each frame of a video segment proposed by an embodiment of the present application;
FIG. 5 shows a schematic flowchart of another single-stage dynamic pose recognition method proposed by an embodiment of the present application;
FIG. 6 shows a schematic structural diagram of a single-stage dynamic pose recognition apparatus proposed by an embodiment of the present application.
Description of main reference numerals:
10 - single-stage dynamic pose recognition apparatus; 11 - acquisition module; 12 - determination module; 13 - identification module.
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them.
The components of the embodiments of the present application, as generally described and illustrated in the drawings herein, may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments of the present application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present application.
Hereinafter, the terms "comprising", "having" and their cognates, as used in various embodiments of the present application, are only intended to denote particular features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as excluding the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.
Furthermore, the terms "first", "second", "third", etc. are only used to distinguish the description and should not be construed as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the various embodiments of the present application belong. Terms such as those defined in commonly used dictionaries are to be interpreted as having the same meaning as their contextual meaning in the relevant technical field and are not to be interpreted as having an idealized or overly formal meaning, unless explicitly defined as such in the various embodiments of the present application.
Pose recognition includes gesture pose recognition and/or body pose recognition. Pose recognition is one of the directions widely studied in both academia and industry, and it already has many practical applications, including human-computer interaction, robotics, sign language recognition, gaming, and virtual reality control. Pose recognition can be further divided into static pose recognition and dynamic pose recognition; the method proposed in this application is mainly used to recognize dynamic poses in videos.
It can be understood that, for dynamic pose recognition, two types of recognition methods are generally used, namely two-stage recognition methods and single-stage recognition methods. A two-stage recognition method uses two models for recognition: one model performs pose detection (also called the pose recognition stage, used to identify whether a pose is present), and the other model performs gesture classification on the recognized pose. For example, a pose is first detected by a lightweight 3D-CNN model, and a heavyweight 3D-CNN classification model is then activated to classify the pose once it has been detected. In a single-stage recognition method, frames in the video that do not contain an action are labeled as a non-pose class. Compared with the two-stage recognition method, the single-stage recognition method uses only one model, for pose classification; besides being simpler than the two-stage recognition method, the single-stage recognition method also avoids the potential problem of errors propagating between stages. For example, in a two-stage recognition method, if the pose detection model makes an error in the pose detection stage, that error propagates to the subsequent classification stage. The single-stage dynamic pose recognition method adopted in this application can detect and classify multiple poses in a single video with a single model, and it detects dynamic poses in videos without a pose preprocessing stage.
Embodiment 1
In an embodiment of the present application, as shown in FIG. 1, the single-stage dynamic pose recognition method includes the following steps:
S100: Acquire M video frame sets, where the M video frame sets are M video frame sets corresponding to M video segments that contain the same group of dynamic poses and are collected by M video capture devices in the same time period, M ≥ 2.
The M video capture devices are generally installed in the same area, and it must be ensured that the M video capture devices can capture the same group of dynamic poses at the same time. The M video capture devices may be of different types; for example, an RGB image capture device and an RGB-D (depth) image capture device may be used simultaneously.
It can be understood that the M video frame sets corresponding to the M video segments that contain the same group of dynamic poses and are collected by the M video capture devices in the same time period may be pre-stored in a database or a storage device, and the M video frame sets may then be obtained from the database or storage device when the poses in the video segments are to be recognized. Alternatively, the M video frame sets may be uploaded in real time to the terminal device used for recognizing the dynamic poses, so that the terminal device can recognize the dynamic poses in real time. Alternatively, at least one of the M video capture devices may itself have the dynamic pose recognition function, and this video capture device can acquire the video frame sets corresponding to the other video capture devices, so that the dynamic poses corresponding to the M video frame sets can be recognized with less hardware.
Further, at least two video segments containing the same group of dynamic poses are collected in the same time period, that is, M ≥ 2. It can be understood that when M = 2, the dynamic pose recognition process has low complexity, a small amount of computation, and a fast recognition speed. As M increases, the complexity and computation of the dynamic pose recognition process increase and the recognition speed slows down, but the accuracy of dynamic pose recognition improves.
S200: Extract the feature sub-vectors of each video frame in the corresponding video frame set by using the M predetermined feature extraction models.
Each video frame set includes a plurality of video frames, and the plurality of video frames are arranged in time order to form a video frame sequence, that is, earlier-sampled video frames come first and later-sampled video frames come later. Considering that the video frame sets may be acquired by different types of video capture devices (for example, an RGB image capture device and an RGB-D (depth) image capture device may be used at the same time), M predetermined feature extraction models are required; that is, not only does a feature extraction model need to be pre-trained for RGB images (for example, ResNet-based RGB feature extraction), but a feature extraction model also needs to be pre-trained for depth images (for example, ResNet-based depth feature extraction). This ensures that the feature sub-vectors of each video frame in each video frame set are effectively extracted.
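As a minimal sketch of step S200 (the backbone choice, feature dimension and input preprocessing are assumptions not fixed by the application), one ResNet-style feature extraction model can be prepared per capture modality:

```python
# Minimal sketch of step S200 (assumptions: torchvision ResNet-18 backbones, 512-d
# sub-vectors, frames already resized to 224x224 and normalized; the application does
# not fix these details). One feature extraction model is used per modality/device.
import torch
import torch.nn as nn
from torchvision.models import resnet18   # torchvision >= 0.13 API

def make_backbone(in_channels: int) -> nn.Module:
    """Build a ResNet-18 feature extractor whose first conv accepts `in_channels`
    (3 for RGB frames, 1 for depth frames) and whose classifier head is removed."""
    net = resnet18(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Identity()   # keep the 512-d pooled feature as the per-frame sub-vector
    return net

# M = 2 video frame sets: one RGB set and one depth set.
extractors = {"rgb": make_backbone(3), "depth": make_backbone(1)}

# frames[m] has shape (T, C_m, 224, 224): T frames from the m-th capture device.
frames = {"rgb": torch.randn(16, 3, 224, 224), "depth": torch.randn(16, 1, 224, 224)}

with torch.no_grad():
    # sub_vectors[m][t] is the feature sub-vector of the t-th frame, shape (512,).
    sub_vectors = {m: extractors[m](frames[m]) for m in extractors}

print(sub_vectors["rgb"].shape)   # torch.Size([16, 512])
```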
S300: Determine the corresponding t-th feature vector according to the preset attention level parameter and the t-th feature sub-vector of the m-th video frame set.
Each video frame set includes a plurality of video frames arranged in time order to form a video frame sequence, that is, earlier-sampled video frames come first and later-sampled video frames come later. Considering that the t-th feature sub-vector in each video frame set may be affected by the surrounding feature sub-vectors, this embodiment introduces an attention level parameter when determining the t-th feature vector of the m-th video frame set; the attention level parameter is used to reflect which surrounding feature sub-vectors affect the t-th feature vector of the m-th video frame set.
Exemplarily, when A = 3, as shown in FIG. 2, the t-th feature vector of the m-th video frame set is composed of the (t-1)-th feature sub-vector corresponding to the (t-1)-th video frame, the t-th feature sub-vector corresponding to the t-th video frame, and the (t+1)-th feature sub-vector corresponding to the (t+1)-th video frame of the m-th video frame set; that is, if Ft^m represents the t-th feature vector of the m-th video frame set, then Ft^m = [xt-1^m, xt^m, xt+1^m].
Exemplarily, when A = 8, as shown in FIG. 3, the t-th feature vector of the m-th video frame set is composed of the (t-3)-th through (t+4)-th feature sub-vectors corresponding to the (t-3)-th through (t+4)-th video frames of the m-th video frame set, that is, Ft^m = [xt-3^m, xt-2^m, xt-1^m, xt^m, xt+1^m, xt+2^m, xt+3^m, xt+4^m]; or the t-th feature vector of the m-th video frame set is composed of the (t-4)-th through (t+3)-th feature sub-vectors corresponding to the (t-4)-th through (t+3)-th video frames, that is, Ft^m = [xt-4^m, xt-3^m, xt-2^m, xt-1^m, xt^m, xt+1^m, xt+2^m, xt+3^m].
Further, the t-th feature vector can be determined by a formula of the form

Ft^m = [xt-p^m, ..., xt^m, ..., xt+q^m]

where Ft^m represents the t-th feature vector of the m-th video frame set, 0 < t-p < t, t < t+q < T, |p-q| ≤ 1, and the window covers A consecutive frames; T is the total number of video frames in the video segment (that is, the total number of video frames in the m-th video frame set); A represents the attention level parameter; xt,a^m represents the feature sub-vector corresponding to attention level a of the t-th video frame in the m-th video frame set, a ≤ A; and xt^m represents the t-th feature sub-vector in the m-th video frame set.
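The following is a small sketch of how such a windowed feature vector could be assembled from the per-frame feature sub-vectors; the handling of frames near the segment borders is an assumption, since the text above instead restricts t so that the window stays inside the segment:

```python
# Minimal sketch of step S300 (assumption: the t-th feature vector is the ordered stack
# of the A feature sub-vectors in a window around frame t, with the extra frame placed
# after t when A is even, matching the |p - q| <= 1 constraint above). Frame indices
# here are 0-based; sub-vectors near the borders are clamped only for illustration.
import torch

def build_feature_vector(sub_vectors: torch.Tensor, t: int, A: int) -> torch.Tensor:
    """sub_vectors: (T, D) per-frame sub-vectors of one video frame set.
    Returns the t-th feature vector as an (A, D) stack of neighbouring sub-vectors."""
    T = sub_vectors.shape[0]
    p, q = (A - 1) // 2, A // 2                                       # p frames before t, q frames after t
    idx = [min(max(i, 0), T - 1) for i in range(t - p, t + q + 1)]    # clamp at the segment borders
    return sub_vectors[idx]                                            # shape (A, D)

# Example: T = 16 frames with 512-d sub-vectors, attention level A = 3.
x_m = torch.randn(16, 512)
F_t = build_feature_vector(x_m, t=7, A=3)   # stacks the sub-vectors of frames 6, 7, 8
print(F_t.shape)                            # torch.Size([3, 512])
```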
S400: Identify the pose in each frame of the video segment according to the feature vectors in the M video frame sets.
Each video frame set includes multiple feature vectors. Feature enhancement processing, global average pooling and activation processing are performed on the feature vectors in the M video frame sets to obtain a feature fusion vector, and the feature fusion vector is then used to identify the pose in each frame of the video segment.
According to the technical solution of this embodiment, on the one hand, when determining the pose in each frame of the video segment, the determination is based on the M video frame sets corresponding to M video segments that contain the same group of dynamic poses and are collected by M video capture devices in the same time period, and the feature sub-vectors corresponding to the video frames in each video frame set are mutually enhanced and fused to realize dynamic pose recognition, which effectively improves the accuracy of dynamic pose recognition; on the other hand, M feature extraction models are pre-trained for the M video frame sets, and the M feature extraction models are used to extract the feature sub-vectors of each video frame in the corresponding video frame set, thereby ensuring effective extraction of the feature sub-vectors of each video frame in each video frame set; on a further aspect, the technical solution of this embodiment introduces the attention level parameter and thus fully considers that a feature vector may be affected by the surrounding feature sub-vectors.
Further, compared with a two-stage recognition method, the single-stage recognition method of the technical solution of this embodiment is not only simpler, but also avoids the potential problem of errors propagating between stages. For example, in a two-stage recognition method, if the pose detection model makes an error in the pose detection stage, that error propagates to the subsequent classification stage.
Further, based on the recognition of the pose in each video frame of the video segment according to the technical solution of this embodiment, the video segment can be split: for example, two adjacent video frames with different poses can serve as a split point, so that consecutive frames with the same pose can be treated as one segment.
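As a non-limiting illustration, the following Python sketch splits a sequence of per-frame pose labels at every point where two adjacent frames have different poses. The label values and the output format are assumptions chosen for illustration only.

```python
def split_by_pose(frame_labels):
    """Group consecutive frames with the same pose label into segments.

    frame_labels: list of per-frame pose labels, e.g. ["wave", "wave", "fist", ...].
    Returns a list of (label, start_frame, end_frame) tuples, end inclusive.
    """
    segments = []
    start = 0
    for t in range(1, len(frame_labels)):
        if frame_labels[t] != frame_labels[t - 1]:   # adjacent frames differ -> split point
            segments.append((frame_labels[start], start, t - 1))
            start = t
    if frame_labels:
        segments.append((frame_labels[start], start, len(frame_labels) - 1))
    return segments

# Usage: -> [("wave", 0, 2), ("fist", 3, 4)]
print(split_by_pose(["wave", "wave", "wave", "fist", "fist"]))
```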
Embodiment 2
In an embodiment of the present application, as shown in FIG. 4, after the feature vectors in the M video frame sets are obtained, the recognition of the pose in each frame of the video segment includes the following steps:
S410: Determine the t-th feature enhancement vector by using the M t-th feature vectors.
The t-th feature vector of the m-th video frame set can be expressed as α_t^m. The t-th feature enhancement vector E_t determined by using the M t-th feature vectors is accordingly obtained by fusing α_t^1, α_t^2, …, α_t^M.
Exemplarily, when M=2, the first video frame set includes multiple feature vectors α_t^1, t=1, 2, 3, …, and the second video frame set includes multiple feature vectors α_t^2, t=1, 2, 3, …. Further, when t=1, the first feature enhancement vector E_1 can be expressed in terms of α_1^1 and α_1^2.
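As a non-limiting illustration, the following Python sketch forms the t-th feature enhancement vector from the M t-th feature vectors. The published text does not reproduce the fusion formula here, so the element-wise summation used below (and the alternative concatenation noted in the comment) is an assumption made for illustration only.

```python
import numpy as np

def feature_enhancement(feature_vectors_per_set, t):
    """Fuse the t-th feature vector from each of the M video frame sets.

    feature_vectors_per_set: list of M arrays, each of shape (T, A * D).
    Returns the t-th feature enhancement vector of shape (A * D,).
    """
    stacked = np.stack([fv[t] for fv in feature_vectors_per_set])  # (M, A * D)
    # Assumed fusion: element-wise sum, so that the M views reinforce each other.
    # An alternative reading is concatenation: np.concatenate(list(stacked)).
    return stacked.sum(axis=0)

# Usage: M = 2 camera views, T = 32 frames, feature vectors of dimension 512.
views = [np.random.rand(32, 512).astype(np.float32) for _ in range(2)]
e_1 = feature_enhancement(views, t=0)   # first feature enhancement vector
```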
S420: Perform global average pooling on the t-th feature enhancement vector to determine the t-th feature pooling vector.
Exemplarily, the t-th feature pooling vector can be determined using the following formula:

Z_t = GAP(E_t) = (1 / (M·A)) Σ_{m=1}^{M} Σ_{a=1}^{A} f_(t,a)^m

where Z_t denotes the t-th feature pooling vector, E_t denotes the t-th feature enhancement vector, and f_(t,a)^m denotes the feature sub-vector corresponding to attention level a of the t-th video frame in the m-th video frame set.
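As a non-limiting illustration, the following Python sketch computes the t-th feature pooling vector by globally averaging the M·A feature sub-vectors that make up the t-th feature enhancement vector. The tensor layout is an assumption made for illustration only.

```python
import numpy as np

def global_average_pooling(sub_vectors_t: np.ndarray) -> np.ndarray:
    """Global average pooling over the M * A sub-vectors of frame t.

    sub_vectors_t: array of shape (M, A, D) -- the feature sub-vectors of the
    t-th frame for every video frame set m and attention level a.
    Returns Z_t, an array of shape (D,).
    """
    M, A, _ = sub_vectors_t.shape
    return sub_vectors_t.reshape(M * A, -1).mean(axis=0)

# Usage: M = 2 views, A = 8 attention levels, 64-dimensional sub-vectors.
subs_t = np.random.rand(2, 8, 64).astype(np.float32)
z_t = global_average_pooling(subs_t)   # shape (64,)
```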
S430: Perform ReLU activation processing on the t-th feature pooling vector to determine the t-th feature activation vector.
Exemplarily, ReLU activation processing is performed on the t-th feature pooling vector to determine the t-th feature activation vector β_t, which can be expressed as β_t = ReLU(Z_t) = max(0, Z_t).
It can be understood that ReLU activation introduces a nonlinear factor, so that the technical solution of the present application can solve more complex pose classification and recognition problems. ReLU activation is essentially a maximum-taking function. The ReLU activation function is a piecewise linear function that sets all negative values to 0 while leaving positive values unchanged; this operation can be understood as one-sided suppression. (That is, when the input is negative, the output is 0 and the neuron is not activated. This means that only some of the neurons are activated at any given time, which makes the network sparse and therefore very efficient to compute.) It is precisely this one-sided suppression that gives the neurons in the neural network sparse activation. Exemplarily, in a deep neural network model (such as a CNN), after N layers are added to the model, the activation rate of ReLU neurons theoretically decreases by a factor of 2^N. The ReLU activation function involves no complex exponential operations, so the computation is simple and the classification and recognition efficiency is high; in addition, the ReLU activation function converges faster than the Sigmoid/tanh activation functions.
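As a non-limiting illustration of step S430, the following short NumPy sketch applies the ReLU activation to the pooling vector; the array values are assumptions for illustration only.

```python
import numpy as np

def relu(z_t: np.ndarray) -> np.ndarray:
    """ReLU activation: negative entries become 0, positive entries pass through."""
    return np.maximum(0.0, z_t)

z_t = np.array([-0.5, 0.0, 1.2, -2.0, 3.1], dtype=np.float32)
beta_t = relu(z_t)   # -> [0.0, 0.0, 1.2, 0.0, 3.1]
```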
S440: Identify the pose in each frame of the video segment by using a feature fusion vector composed of the T feature activation vectors.
From the feature activation vectors β_t, t=1, 2, 3, …, T, a feature fusion vector β = [β_1, β_2, …, β_T] can be composed.
Further, atrous (dilated) convolution processing, ReLU activation processing, dropout processing and softmax processing are performed in sequence on the feature fusion vector β = [β_1, β_2, …, β_T] to determine the prediction category to which the pose in each frame of the video segment belongs and the corresponding prediction probability.
Exemplarily, the atrous convolution processing, ReLU activation processing, dropout processing and softmax processing can be represented by the following functional relationship f_MEM:

y_(1,c), y_(2,c), …, y_(T,c) = f_MEM([β_1, β_2, …, β_T])

where y_(t,c) denotes the prediction probability that the pose in the t-th video frame belongs to prediction category c, t=1, 2, 3, …, T.
The atrous convolution processing injects holes (dilation) on the basis of standard convolution so as to enlarge the receptive field, and it can enlarge the receptive field while keeping the size of the feature fusion vector unchanged. The dropout processing includes performing dropout on the feature fusion vector by means of a one-dimensional convolutional layer, a dropout layer and a further one-dimensional convolutional layer; during forward propagation in the neural network, dropout stops the activation value of a neuron from working with a certain probability p, which makes the neural network model generalize better, avoids over-reliance on certain local features, effectively mitigates overfitting, and achieves a regularization effect to a certain extent. The softmax processing uses the softmax function to map the input to real numbers between 0 and 1 and normalizes them so that they sum to 1, thereby ensuring that the probabilities of the multiple categories also sum to exactly 1.
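As a non-limiting illustration, the following PyTorch sketch of a per-frame classification head applies a dilated (atrous) 1-D convolution, ReLU, a Conv1d–Dropout–Conv1d block and a softmax to the feature fusion vector. The layer sizes, dilation rate and dropout probability are assumptions chosen for illustration only.

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """Per-frame pose classifier over the feature fusion vector beta = [beta_1 .. beta_T]."""

    def __init__(self, feat_dim: int = 64, num_classes: int = 10,
                 dilation: int = 2, p_drop: float = 0.5):
        super().__init__()
        self.atrous = nn.Conv1d(feat_dim, feat_dim, kernel_size=3,
                                padding=dilation, dilation=dilation)  # keeps length T
        self.relu = nn.ReLU()
        self.conv1 = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)
        self.drop = nn.Dropout(p_drop)
        self.conv2 = nn.Conv1d(feat_dim, num_classes, kernel_size=1)

    def forward(self, beta: torch.Tensor) -> torch.Tensor:
        # beta: (batch, feat_dim, T); returns per-frame class probabilities (batch, num_classes, T)
        x = self.relu(self.atrous(beta))
        x = self.conv2(self.drop(self.conv1(x)))
        return torch.softmax(x, dim=1)

# Usage: T = 32 frames, 64-dimensional activation vectors; y[t, c] = probs[0, c, t].
probs = PoseHead()(torch.randn(1, 64, 32))
```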
Embodiment 3
An embodiment of the present application, referring to FIG. 5, shows another single-stage dynamic pose recognition method, which further includes, after step S400:
S500: Calculate the classification loss corresponding to the video segment according to the prediction category to which the pose in each frame of the video segment belongs and the corresponding prediction probability.
Exemplarily, the classification loss corresponding to the video segment can be calculated according to the following formula, where L_s denotes the classification loss of the video segment, C denotes the total number of prediction categories, Δ_(t,c) denotes the classification loss corresponding to the case where the pose in the t-th video frame belongs to prediction category c, y_(t,c) denotes the prediction probability that the pose in the t-th video frame belongs to prediction category c, y_(t-1,c) denotes the prediction probability that the pose in the (t-1)-th video frame belongs to prediction category c, and ε denotes a preset probability threshold.
On the one hand, the classification loss corresponding to the video segment can be used to determine the accuracy of the current pose recognition: the smaller the classification loss corresponding to the video segment, the higher the accuracy of the current pose recognition. On the other hand, it can be used to evaluate the single-stage dynamic pose recognition model: when training the single-stage dynamic pose recognition model, whether the model meets the requirements can be determined according to the convergence of the classification loss function. For example, when the classification loss function has converged and the classification loss is smaller than a preset loss threshold, training of the single-stage dynamic pose recognition model is complete, and the model can be used to recognize dynamic poses in video segments.
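As a non-limiting illustration, the following PyTorch sketch computes a per-segment classification loss from the per-frame prediction probabilities. The published text does not reproduce the exact formula, so the cross-entropy term and the threshold-truncated temporal term Δ_(t,c) used below (clipping the frame-to-frame probability change at ε) are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def segment_classification_loss(probs: torch.Tensor, labels: torch.Tensor,
                                eps: float = 0.15, lam: float = 0.15) -> torch.Tensor:
    """Assumed segment loss: per-frame cross-entropy plus a truncated smoothing penalty.

    probs:  (T, C) per-frame prediction probabilities y[t, c].
    labels: (T,) ground-truth pose category per frame.
    eps:    preset probability threshold used to truncate Delta[t, c].
    """
    ce = F.nll_loss(torch.log(probs.clamp_min(1e-8)), labels)   # classification term
    delta = (probs[1:] - probs[:-1]).abs()                      # |y[t,c] - y[t-1,c]|
    delta = torch.clamp(delta, max=eps)                         # truncate at epsilon
    return ce + lam * delta.mean()

# Usage: T = 32 frames, C = 10 categories.
p = torch.softmax(torch.randn(32, 10), dim=1)
y = torch.randint(0, 10, (32,))
loss = segment_classification_loss(p, y)
```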
Embodiment 4
An embodiment of the present application, referring to FIG. 6, shows a single-stage dynamic pose recognition device 10, which includes an acquisition module 11, a determination module 12 and a recognition module 13.
The acquisition module 11 is configured to acquire M video frame sets, where the M video frame sets are the M video frame sets corresponding to M video segments that include the same group of dynamic poses and are collected by M video capture devices in the same time period, M ≥ 2. The determination module 12 is configured to respectively extract the feature sub-vectors of the video frames in the corresponding video frame sets by using M predetermined feature extraction models, and is further configured to determine the corresponding t-th feature vector according to a preset attention level parameter and the t-th feature sub-vector of the m-th video frame set. The recognition module 13 is configured to identify the pose in each frame of the video segment according to the feature vectors in the M video frame sets.
Further, the t-th feature vector is determined using the following formula:

α_t^m = [ f_(t-p)^m, …, f_t^m, …, f_(t+q)^m ]

where α_t^m denotes the t-th feature vector of the m-th video frame set, 0 < t-p < t, t < t+q < T, |p-q| ≤ 1, T is the total number of video frames in the video segment, A denotes the attention level parameter, f_(t,a)^m denotes the feature sub-vector corresponding to attention level a (a ≤ A) of the t-th video frame in the m-th video frame set, and f_t^m denotes the t-th feature sub-vector of the m-th video frame set.
Further, identifying the pose in each frame of the video segment according to the feature vectors in the M video frame sets includes:
determining the t-th feature enhancement vector by using the M t-th feature vectors; performing global average pooling on the t-th feature enhancement vector to determine the t-th feature pooling vector; performing ReLU activation processing on the t-th feature pooling vector to determine the t-th feature activation vector; and identifying the pose in each frame of the video segment by using a feature fusion vector composed of the T feature activation vectors.
Further, the t-th feature pooling vector is determined using the following formula:

Z_t = GAP(E_t) = (1 / (M·A)) Σ_{m=1}^{M} Σ_{a=1}^{A} f_(t,a)^m

where Z_t denotes the t-th feature pooling vector, E_t denotes the t-th feature enhancement vector, and f_(t,a)^m denotes the feature sub-vector corresponding to attention level a of the t-th video frame in the m-th video frame set.
Further, identifying the pose in each frame of the video segment by using the feature fusion vector composed of the T feature activation vectors includes:
performing atrous convolution processing, ReLU activation processing, dropout processing and softmax processing in sequence on the feature fusion vector composed of the T feature activation vectors to determine the prediction category to which the pose in each frame of the video segment belongs and the corresponding prediction probability.
Further, the recognition module 13 is also configured to calculate the classification loss corresponding to the video segment according to the prediction category to which the pose in each frame of the video segment belongs and the corresponding prediction probability.
Further, the classification loss of the video segment is calculated according to the following formula, where L_s denotes the classification loss of the video segment, C denotes the total number of prediction categories, Δ_(t,c) denotes the classification loss corresponding to the case where the pose in the t-th video frame belongs to prediction category c, y_(t,c) denotes the prediction probability that the pose in the t-th video frame belongs to prediction category c, y_(t-1,c) denotes the prediction probability that the pose in the (t-1)-th video frame belongs to prediction category c, and ε denotes a preset probability threshold.
Further, the pose includes a gesture pose and/or a body pose.
The single-stage dynamic pose recognition device 10 disclosed in this embodiment uses the acquisition module 11, the determination module 12 and the recognition module 13 in cooperation to perform the single-stage dynamic pose recognition method described in the above embodiments. The implementations and beneficial effects involved in the above embodiments are likewise applicable in this embodiment and are not repeated here.
It can be understood that the present application proposes a terminal device, including a memory and a processor, where the memory stores a computer program, and the computer program, when running on the processor, performs the single-stage dynamic pose recognition method described in the present application.
It can be understood that the present application proposes a readable storage medium storing a computer program, and the computer program, when running on a processor, performs the single-stage dynamic pose recognition method described in the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions and operations of apparatuses, methods and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of code, and the module, program segment or portion of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules or units in the embodiments of the present application may be integrated together to form an independent part, each module may exist separately, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software function modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a smartphone, a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above descriptions are merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that can be readily conceived by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application.
Claims (10)
- A single-stage dynamic pose recognition method, characterized in that the method comprises: acquiring M video frame sets, the M video frame sets being M video frame sets corresponding to M video segments that include the same group of dynamic poses and are collected by M video capture devices in the same time period, M ≥ 2; respectively extracting feature sub-vectors of the video frames in the corresponding video frame sets by using M predetermined feature extraction models; determining a corresponding t-th feature vector according to a preset attention level parameter and the t-th feature sub-vector of the m-th video frame set; and identifying the pose in each frame of the video segment according to the feature vectors in the M video frame sets.
- The single-stage dynamic pose recognition method according to claim 1, characterized in that the t-th feature vector is determined using the following formula: α_t^m = [ f_(t-p)^m, …, f_t^m, …, f_(t+q)^m ], where α_t^m denotes the t-th feature vector of the m-th video frame set, 0 < t-p < t, t < t+q < T, |p-q| ≤ 1, T is the total number of video frames in the video segment, A denotes the attention level parameter, f_(t,a)^m denotes the feature sub-vector corresponding to attention level a (a ≤ A) of the t-th video frame in the m-th video frame set, and f_t^m denotes the t-th feature sub-vector of the m-th video frame set.
- The single-stage dynamic pose recognition method according to claim 2, characterized in that identifying the pose in each frame of the video segment according to the feature vectors in the M video frame sets comprises: determining a t-th feature enhancement vector by using the M t-th feature vectors; performing global average pooling on the t-th feature enhancement vector to determine a t-th feature pooling vector; performing ReLU activation processing on the t-th feature pooling vector to determine a t-th feature activation vector; and identifying the pose in each frame of the video segment by using a feature fusion vector composed of the T feature activation vectors.
- The single-stage dynamic pose recognition method according to claim 3, characterized in that the t-th feature pooling vector is determined using the following formula: Z_t = (1 / (M·A)) Σ_{m=1}^{M} Σ_{a=1}^{A} f_(t,a)^m.
- The single-stage dynamic pose recognition method according to claim 3, characterized in that identifying the pose in each frame of the video segment by using the feature fusion vector composed of the T feature activation vectors comprises: performing atrous convolution processing, ReLU activation processing, dropout processing and softmax processing in sequence on the feature fusion vector composed of the T feature activation vectors to determine the prediction category to which the pose in each frame of the video segment belongs and the corresponding prediction probability.
- The single-stage dynamic pose recognition method according to claim 5, characterized by further comprising: calculating the classification loss of the video segment according to the following formula, where L_s denotes the classification loss of the video segment, C denotes the total number of prediction categories, Δ_(t,c) denotes the classification loss corresponding to the case where the pose in the t-th video frame belongs to prediction category c, y_(t,c) denotes the prediction probability that the pose in the t-th video frame belongs to prediction category c, y_(t-1,c) denotes the prediction probability that the pose in the (t-1)-th video frame belongs to prediction category c, and ε denotes a preset probability threshold.
- The single-stage dynamic pose recognition method according to any one of claims 1 to 6, characterized in that the pose includes a gesture pose and/or a body pose.
- A single-stage dynamic pose recognition device, characterized in that the device comprises: an acquisition module, configured to acquire M video frame sets, the M video frame sets being M video frame sets corresponding to M video segments that include the same group of dynamic poses and are collected by M video capture devices in the same time period, M ≥ 2; a determination module, configured to respectively extract feature sub-vectors of the video frames in the corresponding video frame sets by using M predetermined feature extraction models, and further configured to determine a corresponding t-th feature vector according to a preset attention level parameter and the t-th feature sub-vector of the m-th video frame set; and a recognition module, configured to identify the pose in each frame of the video segment according to the feature vectors in the M video frame sets.
- A terminal device, characterized by comprising a memory and a processor, the memory storing a computer program, and the computer program, when running on the processor, performing the single-stage dynamic pose recognition method according to any one of claims 1 to 7.
- A readable storage medium, characterized in that it stores a computer program, and the computer program, when running on a processor, performs the single-stage dynamic pose recognition method according to any one of claims 1 to 7.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110454967.7A CN113011395B (en) | 2021-04-26 | 2021-04-26 | Single-stage dynamic pose recognition method and device and terminal equipment |
CN202110454967.7 | 2021-04-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022227512A1 true WO2022227512A1 (en) | 2022-11-03 |
Family
ID=76380409
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/131680 WO2022227512A1 (en) | 2021-04-26 | 2021-11-19 | Single-stage dynamic pose recognition method and apparatus, and terminal device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113011395B (en) |
WO (1) | WO2022227512A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113011395B (en) * | 2021-04-26 | 2023-09-01 | 深圳市优必选科技股份有限公司 | Single-stage dynamic pose recognition method and device and terminal equipment |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100544687C (en) * | 2007-04-19 | 2009-09-30 | 上海交通大学 | Vision alternative method based on cognitive and target identification |
CN108399381B (en) * | 2018-02-12 | 2020-10-30 | 北京市商汤科技开发有限公司 | Pedestrian re-identification method and device, electronic equipment and storage medium |
CN108670276A (en) * | 2018-05-29 | 2018-10-19 | 南京邮电大学 | Study attention evaluation system based on EEG signals |
CN109101896B (en) * | 2018-07-19 | 2022-03-25 | 电子科技大学 | Video behavior identification method based on space-time fusion characteristics and attention mechanism |
CN111079658B (en) * | 2019-12-19 | 2023-10-31 | 北京海国华创云科技有限公司 | Multi-target continuous behavior analysis method, system and device based on video |
CN111581958A (en) * | 2020-05-27 | 2020-08-25 | 腾讯科技(深圳)有限公司 | Conversation state determining method and device, computer equipment and storage medium |
CN112580557A (en) * | 2020-12-25 | 2021-03-30 | 深圳市优必选科技股份有限公司 | Behavior recognition method and device, terminal equipment and readable storage medium |
2021
- 2021-04-26 CN CN202110454967.7A patent/CN113011395B/en active Active
- 2021-11-19 WO PCT/CN2021/131680 patent/WO2022227512A1/en active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108932500A (en) * | 2018-07-09 | 2018-12-04 | 广州智能装备研究院有限公司 | A kind of dynamic gesture identification method and system based on deep neural network |
CN108960207A (en) * | 2018-08-08 | 2018-12-07 | 广东工业大学 | A kind of method of image recognition, system and associated component |
CN109961005A (en) * | 2019-01-28 | 2019-07-02 | 山东大学 | A kind of dynamic gesture identification method and system based on two-dimensional convolution network |
CN113011395A (en) * | 2021-04-26 | 2021-06-22 | 深圳市优必选科技股份有限公司 | Single-stage dynamic pose identification method and device and terminal equipment |
Non-Patent Citations (1)
Title |
---|
MALEKI BEHNAM; EBRAHIMNEZHAD HOSSEIN: "Intelligent visual mouse system based on hand pose trajectory recognition in video sequences", MULTIMEDIA SYSTEMS, vol. 21, no. 6, 25 September 2014 (2014-09-25), US , pages 581 - 601, XP035535854, ISSN: 0942-4962, DOI: 10.1007/s00530-014-0420-y * |
Also Published As
Publication number | Publication date |
---|---|
CN113011395A (en) | 2021-06-22 |
CN113011395B (en) | 2023-09-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21938972; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 21938972; Country of ref document: EP; Kind code of ref document: A1 |