CN116359846A

CN116359846A - Dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning

Info

Publication number: CN116359846A
Application number: CN202310236507.6A
Authority: CN
Inventors: 王帅; 梅洛瑜; 曹东江; 史瑞签
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2023-03-13
Filing date: 2023-03-13
Publication date: 2023-06-30

Abstract

The invention belongs to the technical field of Internet of things perception, and designs a dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning, in particular to human body analysis based on millimeter wave perception. Currently, millimeter wave perception centered by people is mostly focused on scenes such as motion recognition, gesture estimation and the like, but the scenes cannot acquire semantic information of millimeter wave point clouds, namely body part information corresponding to each radar point cannot be distinguished, so that a dynamic millimeter wave Lei Dadian cloud human body analysis scheme is needed. The method comprises the following steps: and for millimeter wave point cloud data, clustering is firstly carried out, then a multi-task learning model is used for jointly executing human body analysis and gesture estimation tasks to extract features, and then multi-task feature fusion is carried out through a non-local network, so that the final output result is point cloud with annotation semantic tags.

Description

Dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning

Technical Field

The invention relates to the technical field of perception of the Internet of things, in particular to a dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning.

Background

Perception and understanding of human activity plays an increasingly important role in human-centric intelligent applications. The traditional method adopts a camera or a body contact sensor, is easily influenced by severe environment and has privacy problem. In terms of human perception, research using millimeter wave radar has been on the rise in recent years, and its effectiveness in gesture and activity recognition, gesture estimation, identity recognition, and the like has been demonstrated. However, the tasks cannot clearly acquire semantic information of the millimeter wave point cloud, namely, body part information corresponding to each radar point is difficult to distinguish, so that human body analysis is realized.

In human body sensing applications, fine-grained body part information is continuously required, and the lack of semantic information greatly limits millimeter wave radars from becoming an enabling technology for human body computing in daily life. Meanwhile, the semantic information is used as an additional information input channel, so that the human perception task is more robust. Various computer vision tasks have demonstrated that inclusion of semantic information input can significantly improve the accuracy of pose estimation, activity recognition, and person recognition, which is more prominent with millimeter wave radars because millimeter wave point clouds are inherently lower in image quality than vision sensors. Therefore, a technical scheme is needed to realize the dynamic millimeter wave Lei Dadian cloud human body analysis task and obtain the point cloud with the body part semantic information tag.

The sparse nature of millimeter wave point clouds makes feature extraction challenging: due to the single chip and antenna size, millimeter wave point clouds are extremely sparse, which makes it difficult to perceive detailed structures of the human body from the point clouds even with the naked human eye. Extracting features containing human body structural information (such as gestures) using existing deep neural network models is challenging, which directly affects human body parsing tasks.

Specular reflection causes loss of body parts in millimeter wave point cloud data: due to the fact that the millimeter wave radar with low cost is limited to the small antenna aperture, most human body reflection signals are not returned to the sensor, specular reflection occurs, body parts in the point cloud are lost, and finally an error analysis result is caused.

Disclosure of Invention

In order to solve the problems, the invention discloses a dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning; according to the method, millimeter wave point cloud data are clustered firstly, then human body analysis and gesture estimation tasks are executed in a combined mode through a multi-task learning model to conduct feature extraction, and then multi-task feature fusion is conducted through a non-local network, so that the point cloud with annotation semantic tags is output finally.

The specific technical scheme is as follows:

step 1: the sparsity of the millimeter wave point cloud is solved through a multi-task feature extraction module; the method comprises the steps of adopting a multi-task learning model to jointly execute a main task of human body analysis and an auxiliary task of gesture estimation; the auxiliary task can effectively guide the human analytic network to extract high-level structural features representing the posture of the subject; because of strong correlation between the human body posture and the human body analysis, the human body posture correlation characteristics are beneficial to improving the accuracy and the robustness of the analysis network in predicting semantic tags; for human analysis and gesture estimation tasks, the multi-task learning model extracts corresponding features in parallel; step 2; the problem of missing body parts in the point cloud data caused by specular reflection is solved by a multitasking feature fusion module; obtaining inspiration from a non-local network (NLN), designing a multi-task feature fusion method, combining intra-task attention and inter-task attention mechanisms, and realizing the aggregation of space-time features of a main body from a global angle;

step 3: the invention adopts a Kinect system to obtain the true values of the human body analysis tag and the gesture estimation tag in the off-line training stage of the system.

Further, in step 1, the multi-task learning model may be specifically divided into a point module, a frame module and a feature aggregation module, where the input of the model is a frame sequence with a length s, each frame includes n points, and each point includes d feature dimensions;

step 1.1, extracting analysis characteristics of a human body;

in the point module, for the point set C of the frame corresponding to the t moment _t Any radar point p in (2) _i,t Obtaining a high-dimensional characteristic representation of a point, i.e. a point characteristic, using a multi-presence perceptron (MLP)

The formula is as follows:

wherein θ is _e The learnable parameters of the MLP are represented, and H represents the human body analysis task.

At the frame module, for each radar point p _i,t Point features of (2)

First encoded as a higher-dimensional feature representation +.>

The formula is as follows:

wherein θ is _h A learnable parameter representing MLP;

step 1.2: extracting human body posture characteristics;

for extraction of human body posture features, a slightly different network architecture is used than for human body analytic feature extraction; in particular, frame characteristics for any frame

Processing using a long short term memory network (LSTM); the formula is as follows:

wherein θ is _r Parameters representing LSTM.

Finally, will

Point features associated with the task of human posture->

Connecting to obtain characteristic vector of each point in the frame under the task of attitude estimation>

The formula is as follows:

further, in the step 1.1, the point characteristics of all the points in the frame are determined

Aggregation into frame features->

To extract global information of the frame; the formula is as follows:

wherein N is the number of points contained in the frame corresponding to the time t, A () represents the attention function and is theta _a A learnable parameter representing an attention function;

finally, frame characteristics

Connect to the dot feature->

Obtaining human body analysis tasksFeature vector +.>

The formula is as follows:

further, the step 2 performs the tasks of human body analysis and posture estimation by using two parallel NLNs respectively; the method comprises the following steps:

step 2.1: an intra-task attention mechanism, for human body analysis tasks, analyzing NLN with a series of analysis features as input, and executing task self-attention to aggregate the features of different frames; this will generate a global context for classifying the body parts in each frame to solve the problem of losing body parts due to specular reflection in the local frames;

step 2.2: an inter-task attention mechanism; in order to integrate the features in the human body analysis and gesture estimation tasks, the correlation between the analysis features and the gesture features needs to be found, and an inter-task attention mechanism is adopted to calculate the space-time correlation between the analysis features and the gesture features; for analysis tasks, the method inputs the gesture estimation features into an analysis NLN, and firstly analyzes a feature matrix Z ^H And gesture feature matrix Z ^P Performing linear transformation, then performing dot product and normalization on the result to finally obtain an inter-task attention matrix a of the human body analysis task ^H→P ；

Step 2.3 feature polymerization: the present invention fuses human body analytic features and pose estimation features in all frames to predict body parts at points in a particular frame using intra-task and inter-task attention matrices.

Step 2.4: outputting a model; human body analytic characteristic Y ^H And pose estimation feature Y ^P The three-dimensional human body part classification information and the human skeleton key point position information are output finally after being processed by a multi-layer perceptron and a fully-connected neural network respectively.

Further, the step 2.1: for parsing tasks, firstCharacterizing points of all points within all time series

Stacked as feature matrix Z ^H After that, Z ^H Obtaining an embedded vector by linear transformation>

And->

Further, in order to estimate the space-time correlation between the points in each group of frames, dot product and normalization processing are performed on the embedded vectors through a nonlinear function, so as to obtain an intra-task attention matrix a under the human body analysis task ^H . The formula is as follows:

where sigma represents a non-linear function,

and->

Representing parameters of the linear transformation.

Similarly, in the posture estimation task, the same processing procedure is performed to obtain the attention matrix a in the corresponding task ^P The method comprises the steps of carrying out a first treatment on the surface of the The formula is as follows:

where sigma represents a non-linear function,

and->

Representing linear transformationsParameters of Z ^P Point characteristics representing all points in all time sequences under the task of pose estimation +.>

And stacking the obtained feature matrix.

Further, the specific formula of the step 2.2 is as follows:

wherein the method comprises the steps of

And->

The weight parameters representing the linear transformation.

Similarly, the human body analytic features are input into the pose estimation NLN to obtain an inter-task attention matrix a of the pose estimation task ^P→H The method comprises the steps of carrying out a first treatment on the surface of the The formula is as follows:

wherein the method comprises the steps of

And->

The weight parameters representing the linear transformation.

Further, the step 2.3 is specifically that for human body analysis task, the feature matrix Z is first set ^H Respectively linearly transformed into

And->

Then respectively and intra-task attention matrix a ^H And inter-task attention matrix a ^H→P Multiplying to obtain the intra-task features and inter-task features respectively, and calculating the weighted sum of all the features according to the correlation between all the features and the current frame; finally, the intra-task features and inter-task features are connected, and the result is connected with the original feature Z ^H Adding elements by elements and generating final aggregated human body analytic features Y ^H The method comprises the steps of carrying out a first treatment on the surface of the The formula is as follows:

wherein the method comprises the steps of

And->

Respectively linear transformation parameters;

for the attitude estimation task, the aggregated attitude estimation feature Y is finally obtained in the same way as the method ^P The method comprises the steps of carrying out a first treatment on the surface of the The formula is as follows:

wherein the method comprises the steps of

And->

Respectively linear transformation parameters.

Further, the step 3 needs to be described as follows: the Kinect system is only used in the offline training phase, but is not required in the reasoning phase; for human body analytic tasks, cross entropy loss is adopted to minimize errors between predicted body part classification and true classification of each point; the formula is as follows:

where N represents the number of points, K is the class number of the semantic tags,

is a function of the output 0 or 1, when +.>

When 1 indicates that sample n belongs to category k, < >>

Is the predicted probability that sample b belongs to class k;

for the attitude estimation task, a mean square error is adopted to minimize the error between the predicted position and the actual position of the skeleton joint; the formula is as follows:

wherein I II the L2-paradigm is represented,

and p _m Respectively representing a predicted value and an actual value of the bone joint, wherein M represents the number of selected bone joint points; the network architecture is trained end to end, and the overall supervision function of the system is as follows:

L＝γL _H +βL _P

where γ and β are hyper-parameters.

The invention has the beneficial effects that: :

the invention designs a dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning, which can generate point cloud with annotation semantic tags, solves the problem that the semantic information of the millimeter wave point cloud cannot be acquired by millimeter wave perception centered by people at present, achieves the accuracy of about 92% and the accuracy of IoU of 84%, and can respectively improve the performance of two downstream tasks (gesture estimation and action recognition) by about 18% and 6% by the predicted semantic tags.

Drawings

Fig. 1: the overall structure of the system is schematically shown.

Fig. 2: the multi-task feature extraction module is structured schematically.

Fig. 3: the multi-task feature fusion module is structurally schematic.

Fig. 4: system accuracy in different scenarios.

Detailed Description

The present invention is further illustrated in the following drawings and detailed description, which are to be understood as being merely illustrative of the invention and not limiting the scope of the invention. It should be noted that the words "front", "rear", "left", "right", "upper" and "lower" used in the following description refer to directions in the drawings, and the words "inner" and "outer" refer to directions toward or away from, respectively, the geometric center of a particular component.

As shown in fig. 1, the method for analyzing a dynamic millimeter wave Lei Dadian cloud human body based on joint learning in this embodiment specifically includes the following steps:

as shown in fig. 2, step 1: solving sparsity of millimeter wave point cloud through' multitasking feature extraction module

Aiming at the characteristic problem that the existing deep neural network model is difficult to extract the information containing the human body structure, the invention adopts the multi-task learning model to jointly execute human body analysis (main task) and gesture estimation (auxiliary task). The auxiliary task may effectively direct the human analytic network to extract high-level structural features representative of the subject's posture. Because of the strong correlation between the human body posture and the human body analysis, the human body posture correlation characteristics are beneficial to improving the accuracy and the robustness of the analysis network in predicting the semantic tags. For human analysis and gesture estimation tasks, the multi-task learning model extracts corresponding features in parallel, specifically, the multi-task learning model can be divided into a point module, a frame module and a feature aggregation module, the input of the model is a frame sequence with the length of s, each frame comprises n points, and each point comprises d feature dimensions.

Step 1.1: extracting analysis characteristics of a human body;

The formula is as follows:

At the frame module, for each radar point p _i,t Point features of (2)

First encoded as a higher-dimensional feature representation +.>

The formula is as follows:

wherein θ is _h A learnable parameter representing MLP.

Further, the point characteristics of all points in the frame

Aggregation into frame features->

To extract global information of the frame. The formula is as follows:

wherein N is the time t pairThe number of points contained in the frame, A () represents the attention function, which is θ _a A learnable parameter representing an attention function.

Finally, frame characteristics

Connect to the dot feature->

Obtaining feature vector of each point in the frame under human body analysis task>

The formula is as follows:

step 1.2: extracting human body posture characteristics;

for the extraction of human body posture features, a slightly different network architecture is used than the human body analytic feature extraction. In particular, frame characteristics for any frame

Processing is performed using a long short term memory network (LSTM). The formula is as follows:

wherein θ is _r Parameters representing LSTM.

Finally, will

Point features associated with the task of human posture->

The formula is as follows:

as shown in fig. 3, step 2: solving the problem of missing body parts in point cloud data caused by specular reflection through a multi-task feature fusion module

In order to solve the problem of missing body parts in point cloud data caused by specular reflection, the invention obtains inspiration from a non-local network (NLN), and designs a multi-task feature fusion method which combines intra-task attention and inter-task attention mechanisms and realizes the aggregation of space-time features of a main body from a global angle. Specifically, the method uses two parallel NLNs to perform human body parsing and pose estimation tasks, respectively.

Step 2.1: an intra-task attention mechanism;

for human body analysis tasks, analyzing NLN takes a series of analysis features as input, and executing task self-attention to aggregate the features of different frames. This will generate a global context for classifying the body parts in each frame to solve the problem of losing body parts due to specular reflection in the local frames. More specifically, for the parsing task, point features of all points in all time series are first generated

And->

Further, in order to estimate the space-time correlation between the points in each group of frames, the dot product and normalization processing are performed on the embedded vector through a nonlinear function (such as a softmax function), so as to obtain any one of the human body analysis tasksIntra-business attention matrix a ^H . The formula is as follows:

where sigma represents a non-linear function,

and->

Representing parameters of the linear transformation.

Similarly, in the posture estimation task, the same processing procedure is performed to obtain the attention matrix a in the corresponding task ^P . The formula is as follows:

where sigma represents a non-linear function,

and->

Parameters representing linear transformation, Z ^P Point characteristics representing all points in all time sequences under the task of pose estimation +.>

And stacking the obtained feature matrix.

Step 2.2: an inter-task attention mechanism;

in order to integrate the features in the human body analysis and gesture estimation tasks, the correlation between the analysis features and the gesture features needs to be found out, and the inter-task attention mechanism is adopted to calculate the space-time correlation between the analysis features and the gesture features. For the analysis task, the method inputs the gesture estimation feature into the analysis NLN, and firstly analyzes the feature matrix Z ^H And gesture feature momentArray Z ^P Performing linear transformation, then performing dot product and normalization on the result to finally obtain an inter-task attention matrix a of the human body analysis task ^H→P . The formula is as follows:

wherein the method comprises the steps of

And->

The weight parameters representing the linear transformation.

Similarly, the human body analytic features are input into the pose estimation NLN to obtain an inter-task attention matrix a of the pose estimation task ^P→H . The formula is as follows:

wherein the method comprises the steps of

And->

The weight parameters representing the linear transformation.

Step 2.3: feature aggregation;

the present invention fuses human body analytic features and pose estimation features in all frames to predict body parts at points in a particular frame using intra-task and inter-task attention matrices.

Specifically, for human body analysis tasks, the feature matrix Z is first of all ^H Respectively linearly transformed into

And

then respectively and intra-task attention matrix a ^H And inter-task attention matrix a ^H→P Multiplying to obtain the intra-task feature and inter-task feature, and calculating the weighted sum of all features according to the correlation between all features and the current frame. Finally, the intra-task features and inter-task features are connected, and the result is connected with the original feature Z ^H Adding elements by elements and generating final aggregated human body analytic features Y ^H . The formula is as follows:

wherein the method comprises the steps of

And->

Respectively linear transformation parameters.

For the attitude estimation task, the aggregated attitude estimation feature Y is finally obtained in the same way as the method ^P . The formula is as follows:

wherein the method comprises the steps of

And->

Respectively linear transformation parameters.

Step 2.4: outputting a model;

human body analytic characteristic Y ^H And pose estimation feature Y ^P The three-dimensional model is processed by a multi-layer perceptron (MLP) and a fully connected neural network (FC) respectively, and the final output is human body part classification information and human skeleton key point position information.

Step 3: multitasking supervision

In the off-line training stage of the system, the Kinect system is adopted to obtain the true values of the human body analysis tag and the gesture estimation tag. It should be noted that the Kinect system is only used in the offline training phase, but is not required in the reasoning phase. For human body parsing tasks, the present invention employs cross entropy loss to minimize the error between the predicted body part classification and the true classification for each point. The formula is as follows:

is a function of the output 0 or 1, when +.>

When 1 indicates that sample n belongs to category k, < >>

Is the predicted probability that sample n belongs to class k.

For the pose estimation task, the present invention employs Mean Square Error (MSE) to minimize the error between the predicted and actual positions of the skeletal joints. The formula is as follows:

wherein I II the L2-paradigm is represented,

and p _m Respectively representing a predicted value and an actual value of the bone joint, and M represents the number of selected bone joint points.

The network architecture designed by the invention is trained end to end, and the overall supervision function of the system is as follows:

L＝γL _H +βL _P

where γ and β are hyper-parameters.

As shown in fig. 4, the system accuracy in different scenarios is shown.

Example 1: emergency rescue

In emergency rescue, rescue teams often need to be tightly matched, and tools are accurately handed to teammates. However, in a scene such as a fire rescue scene, a large amount of smoke exists, and the conventional imaging device based on the camera is difficult to work normally, and the imaging device based on the millimeter wave still has robustness in a severe environment. According to the invention, by adding the extra semantic tag with the body part as extra input, the accuracy of the hand of the radar identification personnel can be improved, so that the rescue personnel can be assisted to accurately finish tool handover in a smoke environment.

Example 2: motion recognition

In some scenarios, accurate perception of human motion is required. In the scene of nursing homes, millimeter wave equipment for health monitoring needs to accurately identify the falling action of the old. According to the method, the action recognition performance of the millimeter wave equipment can be improved by adding the semantic information of the body part of the point cloud.

Example 3: identification recognition

Since cameras may pose privacy violations, more and more millimeter wave devices have been deployed in recent years in private settings such as warehouses, offices, etc. to replace cameras for monitoring. Unlike camera imaging, which contains rich semantic information, millimeter wave imaging has inherent sparsity and is disadvantageous in identifying personnel. According to the invention, by adding the semantic information of the body part of the point cloud, the personnel identity recognition performance of the millimeter wave equipment can be improved.

Example 4: autopilot

In the field of automatic driving, the method has great significance in helping the vehicle understand and recognize the actions of pedestrians, and can be applied to vehicle-mounted millimeter wave radar equipment to improve the action recognition capability of the automatic driving vehicle to the pedestrians, and pre-judgment and response can be timely carried out when emergency is met.

The technical means disclosed by the scheme of the invention is not limited to the technical means disclosed by the embodiment, and also comprises the technical scheme formed by any combination of the technical features.

Claims

1. A dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning is characterized by comprising the following steps of: the method comprises the following steps:

step 1: the sparsity of the millimeter wave point cloud is solved through a multi-task feature extraction module; the method comprises the steps of adopting a multi-task learning model to jointly execute a main task of human body analysis and an auxiliary task of gesture estimation; the auxiliary task can effectively guide the human analytic network to extract high-level structural features representing the posture of the subject; because of strong correlation between the human body posture and the human body analysis, the human body posture correlation characteristics are beneficial to improving the accuracy and the robustness of the analysis network in predicting semantic tags; for human analysis and gesture estimation tasks, the multi-task learning model extracts corresponding features in parallel;

step 2; the problem of missing body parts in the point cloud data caused by specular reflection is solved by a multitasking feature fusion module; the inspiration is obtained from a non-local network, a multi-task feature fusion method is designed, and the intra-task attention and inter-task attention mechanisms are combined, so that the space-time features of the main body are aggregated from the global angle;

2. The dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning is characterized by comprising the following steps of: the step 1 specifically, the multi-task learning model may be divided into a point module, a frame module and a feature aggregation module, wherein the input of the model is a frame sequence with a length s, each frame comprises n points, and each point comprises d feature dimensions;

step 1.1, extracting analysis characteristics of a human body;

at the point module, for the t moment pairPoint set C of the application frame _t Any radar point p in (2) _i,t Obtaining a high-dimensional characteristic representation of the point by adopting a multi-presentation perceptron, namely the point characteristic

The formula is as follows:

wherein θ is _e The learning parameters of the MLP are represented, and H represents the human body analysis task;

at the frame module, for each radar point p _i,t Point features of (2)

First encoded as a higher-dimensional feature representation +.>

The formula is as follows:

wherein θ is _h A learnable parameter representing MLP;

step 1.2: extracting human body posture characteristics;

Processing by using a long-term and short-term memory network; the formula is as follows:

wherein θ is _r Parameters representing LSTM;

finally, will

Point features associated with the task of human posture->

The formula is as follows:

3. the dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning is characterized by comprising the following steps of: further in the step 1.1, the point characteristics of all points in the frame are obtained

Aggregation into frame features->

To extract global information of the frame; the formula is as follows:

finally, frame characteristics

Connecting to point features/>

Obtaining feature vector of each point in the frame under human body analysis task

The formula is as follows:

4. the dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning according to claim 1, wherein the method comprises the following steps of: step 2, respectively executing human body analysis and posture estimation tasks by using two parallel NLNs; the method comprises the following steps:

5. The dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning according to claim 4, wherein the method comprises the following steps of:

the step 2.1 is as follows: for the analysis task, the point characteristics of all points in all time sequences are firstly calculated

And->

where sigma represents a non-linear function,

and->

Parameters representing a linear transformation;

similarly, in the pose estimation task, the same processing procedure is performedObtaining the attention matrix a in the corresponding task ^P The method comprises the steps of carrying out a first treatment on the surface of the The formula is as follows:

where sigma represents a non-linear function,

and->

And stacking the obtained feature matrix.

6. The dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning according to claim 4, wherein the method comprises the following steps of: the specific formula of the step 2.2 is as follows:

wherein the method comprises the steps of

And->

Weight parameters representing linear transformations;

wherein the method comprises the steps of

And->

The weight parameters representing the linear transformation.

7. The dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning according to claim 4, wherein the method comprises the following steps of: said step 2.3 is specifically that for human body analysis tasks, the feature matrix Z is first of all applied ^H Respectively linearly transformed into

And->

wherein the method comprises the steps of

And->

Respectively linear transformation parameters;

wherein the method comprises the steps of

And->

Respectively linear transformation parameters.

8. The dynamic millimeter wave Lei Dadian cloud human body analysis method based on joint learning according to claim 1, wherein the method comprises the following steps of: the step 3 needs to be described as follows: the Kinect system is only used in the offline training phase, but is not required in the reasoning phase; for human body analytic tasks, cross entropy loss is adopted to minimize errors between predicted body part classification and true classification of each point; the formula is as follows:

is a function of the output 0 or 1, when +.>

When 1 indicates that sample n belongs to category k, < >>

Is the predicted probability that sample b belongs to class k;

wherein I II the L2-paradigm is represented,

L＝γL _H +βL _p

where γ and β are hyper-parameters.