CN115359571A - Online cross-channel interactive parallel distillation framework pose estimation method and device - Google Patents
Online cross-channel interactive parallel distillation framework pose estimation method and device
- Publication number: CN115359571A
- Application number: CN202211061531.2A
- Authority: CN (China)
- Prior art keywords: channel, feature map, feature, network, cross
- Prior art date: 2022-09-01
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/20 — Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
- G06N3/02, G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V10/462 — Extraction of image or video features; salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V2201/07 — Indexing scheme relating to image or video recognition or understanding; target detection
Abstract
The invention relates to the technical field of artificial intelligence and computer vision, and provides an online cross-channel interactive parallel distillation architecture pose estimation method. First, a video acquisition device captures an external video stream, which is segmented into frames and fed into a feature extraction network for feature extraction. The extracted features are passed to a YOLOv5 object detection model, which detects the position of the target human body in each frame and marks a detection box to obtain the target human body's feature data. The target human body feature data are then passed to the Faster-Pose pose detection model to obtain human key-point feature information, which is mapped into a feature map through a linear transformation, yielding a feature map annotated with the human key-point labels. Compared with the prior art, the method and device take the correlation between channel feature information and spatial feature information into account, improving the expressive power of the required feature information.
Description
Technical Field
The invention relates to the technical field of artificial intelligence and computer vision, and in particular provides an online cross-channel interactive parallel distillation architecture pose estimation method and device.
Background
Human pose estimation is an important topic in artificial-intelligence computer vision. Estimating the behaviour of people in a scene enables better human-computer interaction. At present, human pose estimation is commonly used for detecting workers' unsafe operations, in the security field, and in VR wearable devices. The accuracy with which the object detection model in a human pose estimation algorithm extracts the human detection box is crucial to the accuracy and stability of human key-point localization.
Existing attention models do not consider the correlation between channel feature information and spatial feature information, so their accuracy is limited.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a highly practical online cross-channel interactive parallel distillation framework pose estimation method.
The invention further provides a reasonably designed, safe and practical online cross-channel interactive parallel distillation framework pose estimation device.
The technical solution adopted by the invention to solve this technical problem is as follows:
An online cross-channel interactive parallel distillation framework pose estimation method: first, a video acquisition device captures an external video stream, which is segmented into frames and input into a feature extraction network for feature extraction;
the extracted features are passed to a YOLOv5 object detection model, which detects the position of the target human body in each frame and marks a detection box to obtain the target human body's feature data;
the target human body feature data are passed to the Faster-Pose pose detection model to obtain human key-point feature information; the key-point feature information is then mapped into a feature map through a linear transformation, yielding a feature map annotated with the human key-point labels.
Further, the feature extraction network is designed as a CSP structure and introduces a cross-channel interactive attention mechanism that combines channel attention and spatial attention; in the channel attention model, a covariance matrix is used to compute the pairwise similarity of the feature map's channels, and channels with high similarity are fused;
in the spatial attention, a second-order finite difference method is used to compute feature-map pixel-value differences and the pixel gradient direction.
Further, if the computed covariance of two channels is negative, the channels are negatively correlated; if it is 0, the two channels are independent and uncorrelated; and if it is positive, the two channels are positively correlated and are used for feature fusion;
first, the mean value of each channel is calculated as shown in formula (1):
The variance for each channel is calculated as shown in equation (2):
the covariance between channels C1, C2 is calculated as shown in equation (3):
By analogy, all pairwise channel-similarity covariance values are recorded as $\mathrm{Cov}_{i,k}$; covariance is computed only between different channels, and positively correlated channels are fused pixel by pixel according to the covariance value.
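A minimal sketch of this channel-similarity step is shown below, assuming a single feature map of shape (C, H, W); the normalized-weight fusion rule at the end is an illustrative choice, not the patent's exact rule:

```python
import torch

def fuse_correlated_channels(x: torch.Tensor) -> torch.Tensor:
    """x: feature map of shape (C, H, W). Blends each channel pixel-by-pixel
    with its positively correlated channels, weighted by covariance."""
    C, H, W = x.shape
    flat = x.reshape(C, -1)                     # (C, H*W)
    mean = flat.mean(dim=1, keepdim=True)       # formula (1): per-channel mean
    centered = flat - mean
    cov = centered @ centered.t() / (H * W)     # formula (3): pairwise channel covariance
    cov.fill_diagonal_(0)                       # covariance only between different channels
    pos = cov.clamp(min=0)                      # keep positively correlated pairs only
    weights = pos / (pos.sum(dim=1, keepdim=True) + 1e-6)
    fused = flat + weights @ flat               # pixel-by-pixel fusion with correlated channels
    return fused.reshape(C, H, W)

# Usage: fused = fuse_correlated_channels(torch.randn(64, 56, 56))
```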
Further, the extracted features are passed to the YOLOv5 object detection model, which detects the position of the target human body in each frame and marks detection boxes to obtain the target human body's feature data;
the YOLOv5 object detection model is trained on the public MSCOCO 2017 dataset; samples are randomly drawn from the MSCOCO 2017 dataset at a preset ratio, and data-enhancement preprocessing is applied to the sample data.
Preferably, the data-enhancement modes include rotating the image at multiple angles in 30-degree steps; randomly masking the image with probability P and setting the pixel values under the mask to 0; flipping the image vertically and horizontally; applying distortions of varying degrees to the image; and applying colour perturbation to the image.
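A hedged sketch of these augmentations using torchvision follows; the mask probability P, the jitter strengths and the distortion scale are free parameters, not values taken from the patent:

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF
from PIL import Image

def augment(img: Image.Image, p_mask: float = 0.5) -> Image.Image:
    # multi-angle rotation in 30-degree steps
    img = TF.rotate(img, angle=random.choice(range(0, 360, 30)))
    # vertical / horizontal flips
    if random.random() < 0.5:
        img = TF.hflip(img)
    if random.random() < 0.5:
        img = TF.vflip(img)
    # distortion of varying degree (perspective warp as a stand-in)
    img = T.RandomPerspective(distortion_scale=0.3, p=0.5)(img)
    # colour perturbation
    img = T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)(img)
    # random rectangular mask with pixel value 0, applied with probability P
    tensor = T.ToTensor()(img)
    if random.random() < p_mask:
        tensor = T.RandomErasing(p=1.0, value=0)(tensor)
    return T.ToPILImage()(tensor)
```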
Further, the channel attention model obtains a channel feature probability matrix using a SoftMax function, and the spatial attention model obtains a spatial feature probability matrix using the SoftMax function;
each probability matrix is fused with the original feature map by element-wise product, adding weight information to the feature map.
Further, a depth-wise convolution is used to extract features from each channel of the feature map, obtaining a per-channel feature-value matrix, and cross-channel feature fusion is performed.
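A minimal sketch of how these two probability matrices can re-weight the feature map, followed by a depth-wise convolution; the global-average pooling used to form the logits and the 3x3 kernel size are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_channel_weighting(x: torch.Tensor) -> torch.Tensor:
    """x: (N, C, H, W). Applies channel and spatial SoftMax probability
    matrices by product, then a depth-wise convolution per channel."""
    n, c, h, w = x.shape
    # channel feature probability matrix: one probability per channel
    chan_prob = F.softmax(x.mean(dim=(2, 3)), dim=1).view(n, c, 1, 1)
    # spatial feature probability matrix: one probability per position
    spat_prob = F.softmax(x.mean(dim=1).view(n, -1), dim=1).view(n, 1, h, w)
    # fuse each probability matrix with the original feature map by product
    weighted = x * chan_prob * spat_prob
    # depth-wise convolution: groups == channels, one filter per channel
    # (in a real module this layer would be created once in __init__)
    dw = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c)
    return dw(weighted)
```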
Furthermore, an online parallel knowledge distillation method is applied in the Faster-Pose pose detection model. The online parallel knowledge distillation method retains a Teacher-Student knowledge distillation framework in its network structure: the Teacher network consists of 8 Hourglass feature extraction modules, and the Student network consists of 4 Hourglass feature extraction modules;
the Teacher network is trained on the MSCOCO 2017 dataset and the Student network on a labelled subset of the data; during training, KL divergence is used to compute the loss between the Teacher and Student network feature maps, the Teacher and Student feature-map information is fused according to channel similarity, and the Teacher and Student networks are trained in parallel;
during inference, the Teacher network is removed and the Student network performs inference directly; a cross-channel interactive attention mechanism is introduced into the Faster-Pose pose detection model and assigns different weight information to the feature maps in the Teacher network; the computation of the Teacher and Student network feature maps is shown in formula (4):
where the two terms denote the feature map extracted by the second Hourglass module of the Teacher network and the feature map extracted by the first Hourglass module of the Student network, respectively;
the total feature-map loss is shown in formula (5):
the final loss function of the Faster-Pose pose estimation model is shown in formula (6):
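The published text references formulas (4)-(6) without reproducing them. One plausible concretization, consistent with the description above (the stage pairing, the attention operator $A(\cdot)$, the SoftMax normalization $\sigma$ and the weight $\lambda$ are assumptions), is:

```latex
% Formula (4), assumed form: attention-weighted Teacher vs. Student feature maps
F_i^{T} = A\!\left(H_{2i}^{T}\right), \qquad F_i^{S} = H_i^{S}, \qquad i = 1,\dots,4

% Formula (5), assumed form: total feature-map loss over the paired stages
L_{\mathrm{fm}} = \sum_{i=1}^{4} \mathrm{KL}\!\left(\sigma\!\left(F_i^{S}\right)\,\middle\|\,\sigma\!\left(F_i^{T}\right)\right)

% Formula (6), assumed form: final Faster-Pose loss
L = L_{\mathrm{pose}} + \lambda\, L_{\mathrm{fm}}
```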
Further, the human key-point heatmap data output by the Faster-Pose pose detection model are mapped back to the original feature map using linear interpolation, and the pixel offset introduced in the mapping process is corrected with trilinear interpolation.
An online cross-channel interactive parallel distillation framework pose estimation device comprises: at least one memory and at least one processor;
the at least one memory stores a machine-readable program;
the at least one processor is configured to invoke the machine-readable program to perform the online cross-channel interactive parallel distillation architecture pose estimation method.
Compared with the prior art, the online cross-channel interactive parallel distillation architecture pose estimation method and device provided by the invention have the following notable beneficial effects:
The invention provides a cross-channel interactive attention mechanism and a new pose detection model, Faster-Pose. In the feature extraction stage, channel attention detects which channels of the feature map contain the required feature expression, and spatial attention detects at which positions of the feature map the required feature information lies. The invention fuses the feature information extracted by spatial attention with that extracted by channel attention; by considering the correlation between channel feature information and spatial feature information, it improves the expressive power of the required feature information.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of the online cross-channel interactive parallel distillation architecture pose estimation method;
FIG. 2 is a schematic diagram of the pose estimation algorithm model framework in the online cross-channel interactive parallel distillation architecture pose estimation method;
FIG. 3 is a schematic diagram of the preprocessing flow for the MSCOCO 2017 dataset in the online cross-channel interactive parallel distillation architecture pose estimation method;
FIG. 4 is a schematic diagram of the YOLOv5 object detection algorithm framework in the online cross-channel interactive parallel distillation architecture pose estimation method;
FIG. 5 is a schematic diagram of the C-CIAM attention mechanism framework in the online cross-channel interactive parallel distillation architecture pose estimation method;
FIG. 6 is a schematic diagram of the Faster-Pose pose detection model in the online cross-channel interactive parallel distillation architecture pose estimation method.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments, for a better understanding of its technical solutions. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the scope of protection of the present invention.
A preferred embodiment is given below:
As shown in FIGS. 1 to 6, in the online cross-channel interactive parallel distillation architecture pose estimation method of this embodiment, a video acquisition device first captures an external video stream, segments it into frames, and inputs the frames into a feature extraction network for feature extraction;
the extracted features are passed to a YOLOv5 object detection model, which detects the position of the target human body in each frame and marks detection boxes to obtain the target human body's feature data;
the target human body feature data are passed to the Faster-Pose pose detection model to obtain human key-point feature information; the key-point feature information is then mapped into a feature map through a linear transformation, yielding a feature map annotated with the human key-point labels.
The video acquisition device in this embodiment is a 4-channel 2D camera; the video stream is segmented into frames at 60 frames per second and input into the feature extraction network for feature extraction.
The feature extraction network adopts a CSP structure modelled on the CSPNet network and introduces a cross-channel interactive attention mechanism. The cross-channel interactive attention mechanism combines channel attention and spatial attention: in the channel attention model, a covariance matrix is used to compute the pairwise similarity of the feature map's channels, and channels with high similarity are fused; in the spatial attention, a second-order finite difference method is used to compute feature-map pixel-value differences and the pixel gradient direction.
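A hedged sketch of this second-order finite-difference computation follows; a single-channel map is assumed, and the replication padding at the borders is an assumption:

```python
import torch
import torch.nn.functional as F

def second_order_diff(x: torch.Tensor):
    """x: (N, 1, H, W). Returns second-order pixel-value differences and the
    first-order pixel gradient direction."""
    xp = F.pad(x, (1, 1, 1, 1), mode="replicate")
    # central first-order differences along x and y
    gx = (xp[..., 1:-1, 2:] - xp[..., 1:-1, :-2]) / 2.0
    gy = (xp[..., 2:, 1:-1] - xp[..., :-2, 1:-1]) / 2.0
    # second-order central differences: f(i+1) - 2 f(i) + f(i-1)
    gxx = xp[..., 1:-1, 2:] - 2 * x + xp[..., 1:-1, :-2]
    gyy = xp[..., 2:, 1:-1] - 2 * x + xp[..., :-2, 1:-1]
    direction = torch.atan2(gy, gx)   # pixel gradient direction
    return gxx + gyy, direction       # Laplacian-style response, gradient direction
```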
If the computed covariance of two channels is negative, the channels are negatively correlated; if it is 0, the two channels are independent of each other; and if it is positive, the two channels are positively correlated and are used for feature fusion.
First, the mean of each channel is calculated, as shown in formula (1):

$$\mu_{C} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{C}(i,j) \tag{1}$$

The variance of each channel is calculated as shown in formula (2):

$$\sigma_{C}^{2} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl(x_{C}(i,j)-\mu_{C}\bigr)^{2} \tag{2}$$

The variances of all channels are recorded as $\sigma_{C_1}^{2},\dots,\sigma_{C_n}^{2}$; the covariance between channels $C_1$ and $C_2$ is calculated as shown in formula (3):

$$\operatorname{Cov}(C_1,C_2)=\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl(x_{C_1}(i,j)-\mu_{C_1}\bigr)\bigl(x_{C_2}(i,j)-\mu_{C_2}\bigr) \tag{3}$$
By analogy, all pairwise channel-similarity covariance values are recorded as $\mathrm{Cov}_{i,k}$; covariance is computed only between different channels, and positively correlated channels are fused pixel by pixel according to the covariance value.
The extracted features are passed to the YOLOv5 object detection model, which detects the position of the target human body in each frame and marks a detection box to obtain the target human body's feature data. The YOLOv5 object detection model is trained on the public MSCOCO 2017 dataset: samples are randomly drawn from it at a preset ratio, and data-enhancement preprocessing is applied to the sample data.
The channel attention model obtains a channel feature probability matrix using a SoftMax function, and the spatial attention model obtains a spatial feature probability matrix using the SoftMax function. Each probability matrix is fused with the original feature map by element-wise product, adding weight information to the feature map.
Features are extracted from each channel of the feature map with a depth-wise convolution to obtain a per-channel feature-value matrix; the probability matrices are then each fused with the original feature map by product, adding weight information to the feature map.
The data-enhancement modes include rotating the image at multiple angles in 30-degree steps; randomly masking the image with probability P and setting the pixel values under the mask to 0; flipping the image vertically and horizontally; applying distortions of varying degrees to the image; and applying colour perturbation to the image.
The Faster-Pose pose detection model improves on the existing FastPose pose detection model and introduces a new distillation method: online parallel knowledge distillation (Online Parallel Distillation). The online parallel knowledge distillation method retains a Teacher-Student knowledge distillation framework in its network structure: the Teacher network consists of 8 Hourglass feature extraction modules, and the Student network consists of 4 Hourglass feature extraction modules. The Teacher network is trained on the MSCOCO 2017 dataset and the Student network on a labelled subset of the data. During training, KL (Kullback-Leibler) divergence is used to compute the loss between the Teacher and Student network feature maps, the Teacher and Student feature-map information is fused according to channel similarity, and the Teacher and Student networks are trained in parallel. During inference, the Teacher network is removed and the Student network performs inference directly. A cross-channel interactive attention mechanism is introduced into the Faster-Pose pose detection model and assigns different weight information to the feature maps in the Teacher network. The computation of the Teacher and Student network feature maps is shown in formula (4):
where the two terms denote the feature map extracted by the second Hourglass module of the Teacher network and the feature map extracted by the first Hourglass module of the Student network, respectively.
The total feature-map loss is shown in formula (5):
The final loss function of the Faster-Pose pose estimation model is shown in formula (6):
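A minimal sketch of one such parallel training step follows; both networks are assumed to return a list of per-stage Hourglass feature maps plus output heatmaps, and the pairing of Teacher stages 2, 4, 6, 8 with Student stages 1-4, the MSE task loss and the weight lam are assumptions, not the patent's exact formulas:

```python
import torch
import torch.nn.functional as F

def online_distillation_step(teacher, student, images, target_heatmaps,
                             opt_t, opt_s, lam: float = 0.5) -> float:
    """One step in which Teacher and Student are optimized in parallel."""
    t_feats, t_out = teacher(images)   # 8 Hourglass stages
    s_feats, s_out = student(images)   # 4 Hourglass stages

    # task loss for both networks against ground-truth heatmaps
    loss_task = F.mse_loss(t_out, target_heatmaps) + F.mse_loss(s_out, target_heatmaps)

    # KL divergence between paired Teacher/Student feature maps
    loss_kl = images.new_zeros(())
    for tf, sf in zip(t_feats[1::2], s_feats):   # Teacher stages 2,4,6,8 vs Student 1..4
        log_p_s = F.log_softmax(sf.flatten(1), dim=1)
        p_t = F.softmax(tf.flatten(1), dim=1)
        loss_kl = loss_kl + F.kl_div(log_p_s, p_t, reduction="batchmean")

    loss = loss_task + lam * loss_kl
    opt_t.zero_grad(); opt_s.zero_grad()
    loss.backward()
    opt_t.step(); opt_s.step()
    return float(loss)
```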
The human key-point heatmap data output by the Faster-Pose pose detection model are mapped back to the original feature map using linear interpolation, and the pixel offset introduced in the mapping process is corrected with trilinear interpolation.
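A minimal sketch of the heatmap-to-image mapping is given below; bilinear upsampling and argmax decoding stand in for the linear/trilinear scheme described, as common conventions rather than the patent's exact correction:

```python
import torch
import torch.nn.functional as F

def decode_keypoints(heatmaps: torch.Tensor, orig_h: int, orig_w: int) -> torch.Tensor:
    """heatmaps: (K, h, w), one map per key point.
    Returns (K, 2) (x, y) coordinates in the original image."""
    k, h, w = heatmaps.shape
    # map heatmaps to the original resolution with (bi)linear interpolation
    up = F.interpolate(heatmaps.unsqueeze(0), size=(orig_h, orig_w),
                       mode="bilinear", align_corners=False).squeeze(0)
    idx = up.reshape(k, -1).argmax(dim=1)
    ys = torch.div(idx, orig_w, rounding_mode="floor").float()
    xs = (idx % orig_w).float()
    return torch.stack([xs, ys], dim=1)
```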
Based on the above method, the online cross-channel interactive parallel distillation architecture pose estimation device in this embodiment comprises: at least one memory and at least one processor;
the at least one memory stores a machine-readable program;
the at least one processor is configured to invoke the machine-readable program to perform the online cross-channel interactive parallel distillation architecture pose estimation method.
The memory in this embodiment is 512 GB, the processor is an 8-core CPU, and the device additionally requires an NVIDIA graphics card of RTX 2080 Ti class or better.
The method and device fully consider the relationship between channel attention and spatial attention and fuse the two kinds of features according to the designed channel fusion rule. The advantage is that the distribution of features can be determined by the channel attention model, and combining it with the positional information of the target features in the spatial dimension further strengthens the representation of target features in the spatial dimension.
The method takes full account of the fact that when the extracted feature map has few channels but large spatial extent, the channel features easily generalize poorly while the spatial features are sensitive and hard to learn; computing the feature map's pixel-value differences and pixel gradient direction in the spatial attention model with a second-order finite difference method therefore improves the localization of the target position in the spatial dimension.
The original FastPose pose estimation model is improved and a new pose estimation model, Faster-Pose, is proposed. A new mode of information interaction and fusion between the Teacher network and the Student network is designed, and a new loss function is proposed.
The above embodiments are only specific examples; the scope of protection of the invention includes but is not limited to them, and any suitable change or substitution made by one of ordinary skill in the art, consistent with the claims of this online cross-channel interactive parallel distillation architecture pose estimation method and device, shall fall within the scope of protection of the invention.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims (10)
1. An online cross-channel interactive parallel distillation architecture pose estimation method, characterized in that, first, a video acquisition device captures an external video stream, which is segmented into frames and input into a feature extraction network for feature extraction;
the extracted features are passed to a YOLOv5 object detection model, which detects the position of the target human body in each frame and marks a detection box to obtain the target human body's feature data;
the target human body feature data are passed to the Faster-Pose pose detection model to obtain human key-point feature information; the key-point feature information is then mapped into a feature map through a linear transformation, yielding a feature map annotated with the human key-point labels.
2. The online cross-channel interactive parallel distillation framework pose estimation method of claim 1, characterized in that the feature extraction network is designed as a CSP structure and introduces a cross-channel interactive attention mechanism that combines channel attention and spatial attention; in the channel attention model, a covariance matrix is used to compute the pairwise similarity of the feature map's channels, and channels with high similarity are fused;
in the spatial attention, a second-order finite difference method is used to compute feature-map pixel-value differences and the pixel gradient direction.
3. The online cross-channel interactive parallel distillation framework pose estimation method of claim 2, characterized in that a negative computed covariance of two channels indicates negative correlation, a value of 0 indicates that the two channels are independent and uncorrelated, and a positive value indicates that the two channels are positively correlated and are used for feature fusion;
first, the mean of each channel is calculated, as shown in formula (1):

$$\mu_{C} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{C}(i,j) \tag{1}$$

the variance of each channel is calculated as shown in formula (2):

$$\sigma_{C}^{2} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl(x_{C}(i,j)-\mu_{C}\bigr)^{2} \tag{2}$$

the covariance between channels $C_1$ and $C_2$ is calculated as shown in formula (3):

$$\operatorname{Cov}(C_1,C_2)=\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl(x_{C_1}(i,j)-\mu_{C_1}\bigr)\bigl(x_{C_2}(i,j)-\mu_{C_2}\bigr) \tag{3}$$
by analogy, all pairwise channel-similarity covariance values are recorded as $\mathrm{Cov}_{i,k}$; covariance is computed only between different channels, and positively correlated channels are fused pixel by pixel according to the covariance value.
4. The online cross-channel interactive parallel distillation architecture pose estimation method of claim 3, characterized in that the extracted features are passed to the YOLOv5 object detection model, which detects the position of the target human body in each frame and marks detection boxes to obtain the target human body's feature data;
the YOLOv5 object detection model is trained on the public MSCOCO 2017 dataset; samples are randomly drawn from the MSCOCO 2017 dataset at a preset ratio, and data-enhancement preprocessing is applied to the sample data.
5. The online cross-channel interactive parallel distillation framework pose estimation method of claim 4, characterized in that the data-enhancement modes include rotating the image at multiple angles in 30-degree steps; randomly masking the image with probability P and setting the pixel values under the mask to 0; flipping the image vertically and horizontally; applying distortions of varying degrees to the image; and applying colour perturbation to the image.
6. The online cross-channel interactive parallel distillation framework pose estimation method of claim 5, characterized in that the channel attention model obtains a channel feature probability matrix using a SoftMax function, and the spatial attention model obtains a spatial feature probability matrix using the SoftMax function;
each probability matrix is fused with the original feature map by element-wise product, adding weight information to the feature map.
7. The online cross-channel interactive parallel distillation architecture pose estimation method of claim 6, characterized in that a depth-wise convolution is used to extract features from each channel of the feature map, obtaining a per-channel feature-value matrix, and cross-channel feature fusion is performed.
8. The online cross-channel interactive parallel distillation architecture pose estimation method of claim 7, characterized in that an online parallel knowledge distillation method is applied in the Faster-Pose pose detection model; the online parallel knowledge distillation method retains a Teacher-Student knowledge distillation framework in its network structure, the Teacher network consisting of 8 Hourglass feature extraction modules and the Student network of 4 Hourglass feature extraction modules;
the Teacher network is trained on the MSCOCO 2017 dataset and the Student network on a labelled subset of the data; during training, KL divergence is used to compute the loss between the Teacher and Student network feature maps, the Teacher and Student feature-map information is fused according to channel similarity, and the Teacher and Student networks are trained in parallel;
during inference, the Teacher network is removed and the Student network performs inference directly; a cross-channel interactive attention mechanism is introduced into the Faster-Pose pose detection model and assigns different weight information to the feature maps in the Teacher network; the computation of the Teacher and Student network feature maps is shown in formula (4):
where the two terms denote the feature map extracted by the second Hourglass module of the Teacher network and the feature map extracted by the first Hourglass module of the Student network, respectively;
the total feature-map loss is shown in formula (5):
the final loss function of the Faster-Pose pose estimation model is shown in formula (6):
9. The online cross-channel interactive parallel distillation framework pose estimation method of claim 8, characterized in that the human key-point heatmap data output by the Faster-Pose pose detection model are mapped back to the original feature map using linear interpolation, and the pixel offset introduced in the mapping process is corrected with trilinear interpolation.
10. An online cross-channel interactive parallel distillation framework pose estimation device, characterized by comprising: at least one memory and at least one processor;
the at least one memory storing a machine-readable program;
the at least one processor being configured to invoke the machine-readable program to perform the method of any one of claims 1 to 9.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211061531.2A | 2022-09-01 | 2022-09-01 | Online cross-channel interactive parallel distillation framework pose estimation method and device |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN115359571A | 2022-11-18 |
Family
- ID: 84004623

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202211061531.2A | Online cross-channel interactive parallel distillation framework pose estimation method and device | 2022-09-01 | 2022-09-01 |

Country Status (1)

| Country | Link |
| --- | --- |
| CN | CN115359571A (en), pending |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN116091496A | 2023-04-07 | 2023-05-09 | 菲特(天津)检测技术有限公司 | Defect detection method and device based on improved Faster-RCNN |
| CN116091496B | 2023-04-07 | 2023-11-24 | 菲特(天津)检测技术有限公司 | Defect detection method and device based on improved Faster-RCNN |
Legal Events

| Date | Code | Title | Description |
| --- | --- | --- | --- |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |