CN114511629A - Single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion - Google Patents

Single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion Download PDF

Info

Publication number
CN114511629A
CN114511629A
Authority
CN
China
Prior art keywords
self
camera
feature
dimension
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111649445.9A
Other languages
Chinese (zh)
Inventor
刘青山
帅惠
吴乐乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202111649445.9A priority Critical patent/CN114511629A/en
Publication of CN114511629A publication Critical patent/CN114511629A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion. Taking video sequences from multiple cameras as input, a robust feature extractor is designed to extract 2D pose features. Taking the 2D pose features as input, an adaptive view self-attention transformer network is designed over the camera dimension; it fuses any number of two-dimensional poses from uncalibrated cameras through relative camera position encoding and a self-attention mechanism to obtain multi-view fused pose features. Taking the multi-view fused pose features as input, a temporal self-attention transformer network is designed over the time dimension; it adaptively fuses multi-frame features through a self-attention mechanism to obtain the final 3D pose. The method is reasonably designed, can be applied directly to scenes with any number of uncalibrated cameras without retraining, and has a small computational cost.

Description

Single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion.
Background
Single-person 3D human pose estimation is a hot topic in computer vision. It plays an important role in many applications such as action recognition, human body reconstruction and robotic manipulation. According to the number of cameras used, single-person 3D human pose estimation can be divided into monocular and multi-view 3D human pose estimation. A monocular estimation model uses only the visual information of the image from a single camera during feature encoding and cannot effectively resolve occlusion and depth ambiguity. A multi-view model can use multi-view geometric constraints to recover occluded joint information and the depth information lost after camera projection, and can therefore predict the three-dimensional human pose more accurately. However, existing multi-view 3D human pose estimation models mainly rely on camera parameters to provide the multi-view geometric constraints; they depend heavily on the multi-camera configuration and cannot be applied directly to single-view or few-camera scenes. In natural scenes the camera positions often change, and real-time camera calibration in a dynamic scene is impractical. In addition, existing multi-view models are computationally expensive and cannot use a temporal model to capture temporal information and obtain smooth three-dimensional poses.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a single-person three-dimensional pose estimation method that fuses adaptive multi-view and temporal features. It extracts robust pose features with a carefully designed feature extractor and fuses multi-view geometric information and sequence temporal information through a view self-attention transformer network and a temporal self-attention transformer network, so that the method can be better applied to fields such as action recognition, human body reconstruction and robotic manipulation.
The invention adopts the following technical scheme to solve the technical problem:
The invention provides a single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion, comprising the following steps:
First, taking video sequence frames from multiple cameras as input, a robust feature extractor extracts 2D pose features;
Second, taking the 2D pose features as input, an adaptive view self-attention transformer network is designed over the camera dimension; it fuses any number of two-dimensional poses from uncalibrated cameras through relative camera position encoding and a self-attention mechanism to obtain multi-view fused pose features;
Third, taking the multi-view fused pose features as input, a temporal self-attention transformer network is designed over the time dimension; it adaptively fuses multi-frame features through a self-attention mechanism to obtain the final 3D pose.
As a further refinement of the single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, the feature extractor in the first step is as follows:
First, video sequences from N cameras are given, each containing F frames, and the N×F frames share the same feature extractor. Each frame I has width W and height H and contains the three color channels R, G and B, i.e., I ∈ R^(W×H×3), where R^(W×H×3) is the three-dimensional matrix space of dimension W×H×3; each frame contains only one person.
The feature extractor comprises a 2D pose detector and a 3D pose feature extractor. For each frame I, the 2D pose detector is first used to predict the 2D pose information (P_2D, C_2D), where the total number of joints is J; P_2D and C_2D are the 2D coordinates and the confidences of the J joints, respectively, p_j is the 2D coordinate of the j-th joint, and c_j is the confidence of the j-th joint.
Second, P_2D and C_2D are divided into G groups of joints according to the motion correlation of the human joints, giving subsets P_2D^g and C_2D^g, where g ∈ {1, 2, ..., G}; P_2D^g and C_2D^g are the g-th subsets of P_2D and C_2D, P_2D^g ∈ R^(2J_g), the one-dimensional matrix space of dimension 2J_g, and C_2D^g ∈ R^(J_g), the one-dimensional matrix space of dimension J_g; S_g denotes the index set of all joints of group g, J_g is the number of joints in group g, and p_i and c_i denote the 2D coordinates and the confidence of the i-th joint, so that P_2D^g = {p_i | i ∈ S_g} and C_2D^g = {c_i | i ∈ S_g}.
third, the 3D attitude feature extractor firstly uses the first full connection layer
Figure BDA00034461434600000215
Coordinate the g-th group of 2D joints
Figure BDA00034461434600000216
Mapping as a feature
Figure BDA00034461434600000217
Figure BDA00034461434600000218
Is a one-dimensional matrix space with the dimension of C/2; c represents the characteristics of the joint of group G
Figure BDA00034461434600000219
The channel dimensions of the obtained global features are:
Figure BDA00034461434600000220
fourth step, second fully-connected network
Figure BDA00034461434600000221
Input the method
Figure BDA00034461434600000222
Outputting mapping matrix of the g group of joints
Figure BDA00034461434600000223
Figure BDA00034461434600000224
Is one dimension of (C/2) x 2JgA two-dimensional matrix space of (a);
Figure BDA00034461434600000225
will be provided with
Figure BDA00034461434600000226
Mapping to C/2 channel number feature
Figure BDA00034461434600000227
Figure BDA00034461434600000228
For modulation
Figure BDA00034461434600000229
Figure BDA00034461434600000230
Figure BDA0003446143460000031
Fifth, for each of the G groups, it will be
Figure BDA0003446143460000032
And
Figure BDA0003446143460000033
after addition, the residual error network of the g-th group
Figure BDA0003446143460000034
Further extracting spatial information to obtain the adjusted characteristics of the g group
Figure BDA0003446143460000035
Figure BDA0003446143460000036
Sixth step, group G features
Figure BDA0003446143460000037
Spliced together through a third fully-connected layer
Figure BDA0003446143460000038
Global features mapped to a person
Figure BDA0003446143460000039
Figure BDA00034461434600000310
Is a one-dimensional matrix space with dimension C; wherein
Figure BDA00034461434600000311
Concat(f1,f2,…,fG) Representing the splicing of G groups of joint features;
splicing the global features of the N multiplied by F frame pictures to obtain the features X of all the pictures, wherein
Figure BDA00034461434600000312
Figure BDA00034461434600000313
Is a three-dimensional matrix space with dimensions C × N × F.
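To make the grouped, confidence-modulated feature extraction described above concrete, the following is a minimal PyTorch-style sketch. It is only an illustration of the technique under assumed shapes and layer sizes; the class name GroupFeatureExtractor, the parameter group_idx, the default channel dimension C and the internal layer layout are hypothetical and not taken from the patent.

```python
import torch
import torch.nn as nn

class GroupFeatureExtractor(nn.Module):
    """Sketch: per-group coordinate features modulated by a confidence-driven mapping matrix."""
    def __init__(self, group_idx, C=256):
        super().__init__()
        self.group_idx = group_idx                          # G lists of joint indices
        self.fc_coord = nn.ModuleList(                      # first FC layer per group: 2*J_g -> C/2
            [nn.Linear(2 * len(idx), C // 2) for idx in group_idx])
        self.fc_conf = nn.ModuleList(                       # second FC network: J_g -> (C/2) * 2*J_g
            [nn.Linear(len(idx), (C // 2) * 2 * len(idx)) for idx in group_idx])
        self.res = nn.ModuleList(                           # per-group residual refinement
            [nn.Sequential(nn.Linear(C // 2, C // 2), nn.ReLU(), nn.Linear(C // 2, C // 2))
             for _ in group_idx])
        self.fc_out = nn.Linear(len(group_idx) * (C // 2), C)   # third FC layer -> global feature

    def forward(self, p2d, c2d):
        # p2d: (B, J, 2) detected 2D joints, c2d: (B, J) joint confidences
        feats = []
        for g, idx in enumerate(self.group_idx):
            pg = p2d[:, idx].flatten(1)                     # (B, 2*J_g) group coordinates
            cg = c2d[:, idx]                                # (B, J_g)   group confidences
            f_p = self.fc_coord[g](pg)                      # coordinate feature, C/2 channels
            W = self.fc_conf[g](cg).view(cg.size(0), -1, pg.size(1))    # mapping matrix W_g
            f_c = torch.bmm(W, pg.unsqueeze(-1)).squeeze(-1)            # confidence-modulated feature
            h = f_p + f_c
            feats.append(h + self.res[g](h))                # residual spatial refinement
        return self.fc_out(torch.cat(feats, dim=1))         # global per-frame feature (B, C)
```

Running such an extractor on every one of the N×F frames and stacking the outputs would yield a feature tensor of shape C×N×F, matching the features X described above.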
As a further refinement of the single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, the view self-attention transformer network in the second step consists of a relative camera position encoder and a view self-attention fusion module.
As a further refinement of the single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, the view self-attention transformer network in the second step is obtained by the following steps:
Step 201: in the camera dimension, X ∈ R^(C×N×F) is formed by splicing the N camera features X_v, where X_v ∈ R^(C×F), v ∈ {1, 2, ..., N}, and R^(C×F) is the two-dimensional matrix space of dimension C×F; X_v is obtained by splicing the global features of the F frames of the v-th camera. In the camera feature fusion process the temporal dimension F is omitted, i.e., the features are simplified to X_v ∈ R^C, v ∈ {1, 2, ..., N}.
Step 202: the view self-attention transformer network first learns the relative position relationship between the cameras adaptively through a neural network. It takes as input the query variable Q_a of the a-th camera and the key variable K_b of the b-th camera and outputs the mapping matrix M_ab of the relative position relationship between the a-th and b-th cameras and the feature fusion weighting coefficient A_ab. Here Q_a and K_b are computed from X_a and X_b, which denote the a-th and b-th camera features, respectively; M_ab ∈ R^(D×D), the two-dimensional matrix space of dimension D×D; and C = H×D, where H is the number of local feature points and D is their dimension:
M_ab = FC_4(Res_r(Q_a, K_b)), A_ab = FC_5(Res_r(Q_a, K_b)),
where the two output branches share the same residual network Res_r, which extracts the relation feature between Q_a and K_b, and the fourth and fifth fully-connected layers FC_4 and FC_5 then output M_ab and A_ab, respectively.
Step 203: the value feature V_b of the b-th camera is reshaped into H D-dimensional local feature points, giving V_b' ∈ R^(D×H), the two-dimensional matrix space of dimension D×H, which is the reshaped value feature of the b-th camera. Relative camera position encoding is then realized by linearly mapping V_b' with M_ab:
V_ab^code = M_ab V_b',
where V_ab^code denotes the feature of the a-th camera obtained after the b-th camera feature is subjected to relative camera position encoding. Applying this to the N×N camera feature combinations {(X_a, X_b) | a, b ∈ {1, 2, ..., N}} yields the encoded feature V_code, obtained by splicing the N×N features V_ab^code; reshaping V_code gives the feature V_map, where V_code ∈ R^(D×H×N×N) and V_map ∈ R^(C×N×N) are the four-dimensional matrix space of dimension D×H×N×N and the three-dimensional matrix space of dimension C×N×N, respectively.
Step 204: A = {A_ab | a ∈ {1, 2, ..., N}, b ∈ {1, 2, ..., N}} denotes the fusion coefficient matrix formed by the N×N fusion coefficients A_ab; a random masking strategy randomly sets part of the fusion coefficients in A to 0, so as to vary the number of cameras participating in the fusion.
Step 205: using the randomly masked A, V_map is fused by weighting to obtain the multi-view fused feature V_fuse ∈ R^(C×N), the two-dimensional matrix space of dimension C×N:
V_fuse = sum(softmax(A) ⊙ V_map),
where softmax(·) denotes the normalized exponential function, which normalizes the fusion coefficient matrix A along the third dimension, ⊙ denotes dot multiplication, and sum(·) denotes merging the features of the N cameras along the third dimension.
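The following PyTorch-style sketch illustrates how such a view self-attention fusion with learned relative camera position mappings and random masking of fusion coefficients could look. It is a minimal illustration under assumed shapes; the class name ViewAttentionFusion, the layer sizes, and the use of -inf masking (so that masked cameras receive zero weight after softmax) are assumptions, not the patent's exact design.

```python
import torch
import torch.nn as nn

class ViewAttentionFusion(nn.Module):
    """Sketch: fuse per-camera features with relative position mappings M_ab and coefficients A_ab."""
    def __init__(self, C=256, H=8):
        super().__init__()
        assert C % H == 0
        self.H, self.D = H, C // H
        self.to_q = nn.Linear(C, C)                                    # query variables Q_a
        self.to_k = nn.Linear(C, C)                                    # key variables K_b
        self.rel = nn.Sequential(nn.Linear(2 * C, C), nn.ReLU(),       # shared relation network
                                 nn.Linear(C, C))
        self.to_M = nn.Linear(C, self.D * self.D)                      # "fourth" FC layer -> M_ab
        self.to_A = nn.Linear(C, 1)                                    # "fifth" FC layer -> A_ab

    def forward(self, X, drop_prob=0.0):
        # X: (N, C) per-camera features (temporal dimension omitted, as in the text)
        N, C = X.shape
        q, k = self.to_q(X), self.to_k(X)
        pair = torch.cat([q.unsqueeze(1).expand(N, N, C),
                          k.unsqueeze(0).expand(N, N, C)], dim=-1)     # all camera pairs (a, b)
        rel = self.rel(pair)                                           # relation features
        M = self.to_M(rel).view(N, N, self.D, self.D)                  # relative position mappings
        A = self.to_A(rel).squeeze(-1)                                 # (N, N) fusion coefficients
        if self.training and drop_prob > 0:                            # random masking strategy
            A = A.masked_fill(torch.rand_like(A) < drop_prob, float('-inf'))
        V = X.view(N, self.H, self.D)                                  # H local feature points per camera
        V_map = torch.einsum('abij,bhj->abhi', M, V).reshape(N, N, C)  # position-encoded values
        W = torch.softmax(A, dim=1).unsqueeze(-1)                      # normalize over source cameras b
        return (W * V_map).sum(dim=1)                                  # V_fuse: (N, C)
```

Because every learned weight in this sketch acts on a camera pair rather than on a fixed camera index, the same module can in principle be applied to a different number of cameras at test time, which is the adaptability the network is designed to provide.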
As a further refinement of the single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, the temporal self-attention transformer network in the third step consists of an encoding module and two layers of feature fusion modules.
As a further refinement of the single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion, the temporal self-attention transformer network in the third step is implemented by the following steps:
Step 301: in the temporal feature fusion process the camera dimension N is omitted. The encoding module of the temporal self-attention transformer network first encodes V_fuse through a sixth fully-connected layer FC_6:
Z = FC_6(V_fuse),
where Z is the feature after feature encoding.
Step 302: the encoding module then constructs the sequence position encoding P_emb from cos and sin functions and applies position encoding to Z as follows:
Z_0 = Z + P_emb,
where Z_0 is the position-encoded feature.
Step 303: the temporal self-attention transformer network contains two layers of feature fusion modules based on the self-attention mechanism. Let m ∈ {1, 2} index the layer of the feature fusion module; the m-th layer fuses features as follows:
Q_m = FC_7^m(Z_{m-1}),
K_m = FC_8^m(Z_{m-1}),
V_m = FC_9^m(Z_{m-1}),
Z_m = FFN(softmax(Q_m K_m^T) V_m),
where Q_m, K_m and V_m are the query vector, key vector and value vector obtained by mapping the temporal features Z_{m-1} of the (m-1)-th layer through the seventh, eighth and ninth fully-connected layers FC_7^m, FC_8^m and FC_9^m, respectively; FFN is a multilayer perceptron with residual connection; Z_m is the temporal fusion feature of the m-th layer, and the input feature of the first fusion layer is Z_0.
Step 304: finally, a tenth fully-connected layer FC_10 regresses the 3D pose P_3D of the intermediate frame of each video sequence under the N cameras, where P_3D ∈ R^(3J×N), the two-dimensional matrix space of dimension 3J×N:
P_3D = FC_10(Z_2).
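A minimal PyTorch-style sketch of such a temporal fusion stage is given below, using a standard sinusoidal position encoding and the built-in transformer encoder layer as a stand-in for the two feature fusion layers. The class name TemporalFusion, the layer widths, and the choice of nn.TransformerEncoder are assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Sketch: two-layer temporal self-attention fusion with sinusoidal position encoding."""
    def __init__(self, C=256, J=17, n_layers=2, n_heads=8):
        super().__init__()
        self.embed = nn.Linear(C, C)                       # "sixth" FC layer: feature encoding
        layer = nn.TransformerEncoderLayer(d_model=C, nhead=n_heads,
                                           dim_feedforward=2 * C, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(C, 3 * J)                    # "tenth" FC layer: regress the 3D pose

    @staticmethod
    def positional_encoding(F, C):
        # classic sin/cos sequence position encoding P_emb
        pos = torch.arange(F).unsqueeze(1).float()
        i = torch.arange(0, C, 2).float()
        angle = pos / (10000 ** (i / C))
        pe = torch.zeros(F, C)
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        return pe

    def forward(self, v_fuse):
        # v_fuse: (B, F, C) multi-view fused features for F frames
        B, F, C = v_fuse.shape
        z = self.embed(v_fuse) + self.positional_encoding(F, C).to(v_fuse.device)
        z = self.encoder(z)                                # two self-attention fusion layers
        return self.head(z[:, F // 2])                     # 3D pose of the middle frame: (B, 3*J)
```

Selecting the middle frame for regression mirrors step 304, where the 3D pose of the intermediate frame of each sequence is predicted.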
Compared with the prior art, the technical scheme of the invention has the following technical effects:
The designed feature extractor can effectively use the confidences to suppress the noise of the 2D pose detector and thus extract robust features; the designed view self-attention transformer network can adaptively fuse any number of uncalibrated camera features, so it can be applied directly to any camera configuration without retraining; the designed temporal self-attention transformer network can adaptively fuse single-frame and multi-frame temporal features, so it can be applied directly to static scenes and dynamic video scenes without retraining. In particular:
(1) The feature extractor modulates the 2D pose features with the joint confidences, reducing the influence of 2D pose noise on 3D pose estimation and extracting robust pose features;
(2) The view self-attention transformer network adaptively fuses multi-view features and can be applied directly to any camera configuration (camera number and camera parameters) without camera calibration or retraining;
(3) The temporal self-attention transformer network adaptively fuses temporal features and can be applied directly to single-frame and multi-frame scenes without retraining.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a block diagram of a feature extractor of the present invention;
FIG. 3 is a block diagram of the attention mechanism in the view self-attention transformer network of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
A single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion comprises the following steps:
First, taking video sequence frames from multiple cameras as input, a robust feature extractor extracts 2D pose features;
Second, taking the 2D pose features as input, an adaptive view self-attention transformer network is designed over the camera dimension; it fuses any number of two-dimensional poses from uncalibrated cameras through relative camera position encoding and a self-attention mechanism to obtain multi-view fused pose features;
Third, taking the multi-view fused pose features as input, a temporal self-attention transformer network is designed over the time dimension; it adaptively fuses multi-frame features through a self-attention mechanism to obtain the final 3D pose.
Preferably, the feature extractor in the first step uses a pre-trained 2D detector to obtain the joint pose information as input and uses fully-connected networks to extract robust features.
Preferably, the feature extractor in the first step comprises the following steps:
First, as shown in FIG. 1, video sequences from N cameras are given, each containing F frames, and the N×F frames share the same feature extractor. Each frame I has width W and height H and contains the three color channels R, G and B, i.e., I ∈ R^(W×H×3), where R^(W×H×3) is the three-dimensional matrix space of dimension W×H×3; each frame contains only one person.
The feature extractor comprises a 2D pose detector and a 3D pose feature extractor. For each frame I, the 2D pose detector is first used to predict the 2D pose information (P_2D, C_2D), where the total number of joints is J; P_2D and C_2D are the 2D coordinates and the confidences of the J joints, respectively, p_j is the 2D coordinate of the j-th joint, and c_j is the confidence of the j-th joint.
Second, as shown in FIG. 2, P_2D and C_2D are divided into G groups of joints according to the motion correlation of the human joints, giving subsets P_2D^g and C_2D^g, where g ∈ {1, 2, ..., G}; P_2D^g and C_2D^g are the g-th subsets of P_2D and C_2D, P_2D^g ∈ R^(2J_g), the one-dimensional matrix space of dimension 2J_g, and C_2D^g ∈ R^(J_g), the one-dimensional matrix space of dimension J_g; S_g denotes the index set of all joints of group g, J_g is the number of joints in group g, and p_i and c_i denote the 2D coordinates and the confidence of the i-th joint. In the present invention the joints of the human body are divided into 5 groups according to their motion correlation, namely the head, the left/right hands and the left/right legs, i.e., G = 5; an illustrative grouping is sketched after the sixth step below.
Third, the 3D pose feature extractor first uses a first fully-connected layer FC_1^g to map the g-th group of 2D joint coordinates P_2D^g to a feature f_g^p ∈ R^(C/2), the one-dimensional matrix space of dimension C/2, where C denotes the channel dimension of the global feature obtained from the G groups of joint features:
f_g^p = FC_1^g(P_2D^g).
Fourth, a second fully-connected network FC_2^g takes C_2D^g as input and outputs the mapping matrix W_g of the g-th group of joints, W_g ∈ R^((C/2)×2J_g), the two-dimensional matrix space of dimension (C/2)×2J_g; W_g maps P_2D^g to a feature f_g^c with C/2 channels, which is used to modulate f_g^p:
W_g = FC_2^g(C_2D^g), f_g^c = W_g P_2D^g.
Fifth, for each of the G groups, f_g^p and f_g^c are added, and the residual network Res^g of the g-th group further extracts spatial information to obtain the adjusted feature f_g of the g-th group:
f_g = Res^g(f_g^p + f_g^c).
Sixth, the G group features f_1, ..., f_G are spliced together and mapped by a third fully-connected layer FC_3 to the global feature f ∈ R^C of the person, where R^C is the one-dimensional matrix space of dimension C:
f = FC_3(Concat(f_1, f_2, ..., f_G)),
where Concat(f_1, f_2, ..., f_G) denotes the splicing of the G groups of joint features.
Splicing the global features of the N×F frames gives the features X of all frames, where X ∈ R^(C×N×F), the three-dimensional matrix space of dimension C×N×F.
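As a concrete illustration of the G = 5 grouping mentioned above, the snippet below shows one possible joint-index assignment for a common 17-joint skeleton, reusing the hypothetical GroupFeatureExtractor sketch given earlier. The indices are illustrative only; the patent fixes the five groups (head, left/right hands, left/right legs) but not a joint numbering.

```python
# Hypothetical G = 5 grouping for a 17-joint skeleton (indices illustrative only).
group_idx = [
    [0, 7, 8, 9, 10],   # torso/head chain ("head" group)
    [11, 12, 13],       # left arm  ("left hand" group)
    [14, 15, 16],       # right arm ("right hand" group)
    [4, 5, 6],          # left leg
    [1, 2, 3],          # right leg
]
extractor = GroupFeatureExtractor(group_idx, C=256)  # reuses the earlier sketch
```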
Preferably, the view self-attention transformer network in the second step fuses the view features with a relative camera position encoding module and a self-attention fusion module.
Preferably, the view self-attention transformer network in the second step comprises the following steps:
First, in the camera dimension, X ∈ R^(C×N×F) is formed by splicing the N camera features X_v, where X_v ∈ R^(C×F), v ∈ {1, 2, ..., N}, and R^(C×F) is the two-dimensional matrix space of dimension C×F; X_v is obtained by splicing the global features of the F frames of the v-th camera. In the camera feature fusion process the temporal dimension F is omitted, i.e., the features are simplified to X_v ∈ R^C, v ∈ {1, 2, ..., N}.
Second, as shown in FIG. 3, the view self-attention transformer network first learns the relative position relationship between the cameras adaptively through a neural network. It takes as input the query variable Q_a of the a-th camera and the key variable K_b of the b-th camera and outputs the mapping matrix M_ab of the relative position relationship between the a-th and b-th cameras and the feature fusion weighting coefficient A_ab. Here Q_a and K_b are computed from X_a and X_b, which denote the a-th and b-th camera features, respectively; M_ab ∈ R^(D×D), the two-dimensional matrix space of dimension D×D; and C = H×D, where H is the number of local feature points and D is their dimension:
M_ab = FC_4(Res_r(Q_a, K_b)), A_ab = FC_5(Res_r(Q_a, K_b)),
where the two output branches share the same residual network Res_r, which extracts the relation feature between Q_a and K_b, and the fourth and fifth fully-connected layers FC_4 and FC_5 then output M_ab and A_ab, respectively.
Third, the value feature V_b of the b-th camera is reshaped into H D-dimensional local feature points, giving V_b' ∈ R^(D×H), the two-dimensional matrix space of dimension D×H, which is the reshaped value feature of the b-th camera. Relative camera position encoding is then realized by linearly mapping V_b' with M_ab:
V_ab^code = M_ab V_b',
where V_ab^code denotes the feature of the a-th camera obtained after the b-th camera feature is subjected to relative camera position encoding. Applying this to the N×N camera feature combinations {(X_a, X_b) | a, b ∈ {1, 2, ..., N}} yields the encoded feature V_code, obtained by splicing the N×N features V_ab^code; reshaping V_code gives the feature V_map, where V_code ∈ R^(D×H×N×N) and V_map ∈ R^(C×N×N) are the four-dimensional matrix space of dimension D×H×N×N and the three-dimensional matrix space of dimension C×N×N, respectively.
Fourth, A = {A_ab | a ∈ {1, 2, ..., N}, b ∈ {1, 2, ..., N}} denotes the fusion coefficient matrix formed by the N×N fusion coefficients A_ab; a random masking strategy randomly sets part of the fusion coefficients in A to 0, so as to vary the number of cameras participating in the fusion.
Fifth, using the randomly masked A, V_map is fused by weighting to obtain the multi-view fused feature V_fuse ∈ R^(C×N), the two-dimensional matrix space of dimension C×N:
V_fuse = sum(softmax(A) ⊙ V_map),
where softmax(·) denotes the normalized exponential function, which normalizes the fusion coefficient matrix A along the third dimension, ⊙ denotes dot multiplication, and sum(·) denotes merging the features of the N cameras along the third dimension.
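To illustrate the practical consequence of this design, the usage sketch below applies one fusion module (the hypothetical ViewAttentionFusion sketch shown earlier) to different camera counts without retraining; it assumes that sketch is in scope and is not code from the patent.

```python
import torch

# Usage sketch: the same (hypothetical) fusion module handles 1, 2 or 4 uncalibrated
# cameras, because its learned weights act on camera pairs rather than on a fixed N.
fusion = ViewAttentionFusion(C=256, H=8).eval()
for n_cams in (1, 2, 4):
    X = torch.randn(n_cams, 256)        # per-camera features from the extractor
    V_fuse = fusion(X)                  # (n_cams, 256) multi-view fused features
    print(n_cams, tuple(V_fuse.shape))
```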
Preferably, the temporal self-attention transformer network in the third step consists of two temporal encoder layers, each consisting of a multi-head self-attention module and a feed-forward network operating along the time dimension.
Preferably, the temporal self-attention transformer network in the third step comprises the following steps:
First, in the temporal feature fusion process the camera dimension N is omitted. The encoding module of the temporal self-attention transformer network first encodes V_fuse through a sixth fully-connected layer FC_6:
Z = FC_6(V_fuse),
where Z is the feature after feature encoding.
Second, the encoding module constructs the sequence position encoding P_emb from cos and sin functions and applies position encoding to Z as follows:
Z_0 = Z + P_emb,
where Z_0 is the position-encoded feature.
Third, the temporal self-attention transformer network contains two layers of feature fusion modules based on the self-attention mechanism. Let m ∈ {1, 2} index the layer of the feature fusion module; the m-th layer fuses features as follows:
Q_m = FC_7^m(Z_{m-1}),
K_m = FC_8^m(Z_{m-1}),
V_m = FC_9^m(Z_{m-1}),
Z_m = FFN(softmax(Q_m K_m^T) V_m),
where Q_m, K_m and V_m are the query vector, key vector and value vector obtained by mapping the temporal features Z_{m-1} of the (m-1)-th layer through the seventh, eighth and ninth fully-connected layers FC_7^m, FC_8^m and FC_9^m, respectively; FFN is a multilayer perceptron with residual connection; Z_m is the temporal fusion feature of the m-th layer, and the input feature of the first fusion layer is Z_0.
Fourth, finally, a tenth fully-connected layer FC_10 regresses the 3D pose P_3D of the intermediate frame of each video sequence under the N cameras, where P_3D ∈ R^(3J×N), the two-dimensional matrix space of dimension 3J×N:
P_3D = FC_10(Z_2).
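For orientation, the following end-to-end usage sketch chains the three hypothetical modules sketched earlier (GroupFeatureExtractor, ViewAttentionFusion, TemporalFusion) in the order described by the method; all shapes, names and the frame-by-frame fusion loop are assumptions, not the patent's implementation.

```python
import torch

N, F, J, C = 4, 9, 17, 256                        # cameras, frames, joints, channels (assumed)
p2d = torch.rand(N * F, J, 2)                     # 2D poses from an off-the-shelf detector
c2d = torch.rand(N * F, J)                        # joint confidences

extractor = GroupFeatureExtractor(group_idx, C)   # per-frame pose features (earlier sketch)
fusion = ViewAttentionFusion(C, H=8)              # adaptive multi-view fusion (earlier sketch)
temporal = TemporalFusion(C, J)                   # temporal fusion + 3D regression (earlier sketch)

X = extractor(p2d, c2d).view(N, F, C)             # features for all N x F frames
V_fuse = torch.stack([fusion(X[:, f]) for f in range(F)], dim=1)   # (N, F, C)
P_3D = temporal(V_fuse).view(N, J, 3)             # 3D pose of the middle frame per camera
```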
The feature extractor designed by the invention can effectively use the confidences to suppress the noise of the 2D pose detector and thus extract robust features; the designed view self-attention transformer network can adaptively fuse any number of uncalibrated camera features, so it can be applied directly to any camera configuration without retraining; the designed temporal self-attention transformer network can adaptively fuse single-frame and multi-frame temporal features, so it can be applied directly to static scenes and dynamic video scenes without retraining.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (6)

1. A single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, characterized by comprising the following steps:
first, taking video sequence frames from multiple cameras as input, extracting 2D pose features with a robust feature extractor;
second, taking the 2D pose features as input, designing an adaptive view self-attention transformer network over the camera dimension, fusing any number of two-dimensional poses from uncalibrated cameras through relative camera position encoding and a self-attention mechanism, and obtaining multi-view fused pose features;
third, taking the multi-view fused pose features as input, designing a temporal self-attention transformer network over the time dimension, and adaptively fusing multi-frame features through a self-attention mechanism to obtain the final 3D pose.
2. The single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion according to claim 1, characterized in that the feature extractor in the first step is as follows:
first, video sequences from N cameras are given, each containing F frames, and the N×F frames share the same feature extractor; each frame I has width W and height H and contains the three color channels R, G and B, i.e., I ∈ R^(W×H×3), the three-dimensional matrix space of dimension W×H×3; each frame contains only one person;
the feature extractor comprises a 2D pose detector and a 3D pose feature extractor; for each frame I, the 2D pose detector is first used to predict the 2D pose information (P_2D, C_2D), wherein the total number of joints is J, P_2D and C_2D are the 2D coordinates and the confidences of the J joints, respectively, p_j is the 2D coordinate of the j-th joint, and c_j is the confidence of the j-th joint;
second, P_2D and C_2D are divided into G groups of joints according to the motion correlation of the human joints, giving subsets P_2D^g and C_2D^g, wherein g ∈ {1, 2, ..., G}, P_2D^g and C_2D^g are the g-th subsets of P_2D and C_2D, P_2D^g ∈ R^(2J_g), the one-dimensional matrix space of dimension 2J_g, C_2D^g ∈ R^(J_g), the one-dimensional matrix space of dimension J_g, S_g denotes the index set of all joints of group g, J_g is the number of joints in group g, and p_i and c_i denote the 2D coordinates and the confidence of the i-th joint;
third, the 3D pose feature extractor first uses a first fully-connected layer FC_1^g to map the g-th group of 2D joint coordinates P_2D^g to a feature f_g^p ∈ R^(C/2), the one-dimensional matrix space of dimension C/2, wherein C denotes the channel dimension of the global feature obtained from the G groups of joint features: f_g^p = FC_1^g(P_2D^g);
fourth, a second fully-connected network FC_2^g takes C_2D^g as input and outputs the mapping matrix W_g of the g-th group of joints, W_g ∈ R^((C/2)×2J_g), the two-dimensional matrix space of dimension (C/2)×2J_g; W_g maps P_2D^g to a feature f_g^c with C/2 channels, which is used to modulate f_g^p: W_g = FC_2^g(C_2D^g), f_g^c = W_g P_2D^g;
fifth, for each of the G groups, f_g^p and f_g^c are added, and the residual network Res^g of the g-th group further extracts spatial information to obtain the adjusted feature f_g of the g-th group: f_g = Res^g(f_g^p + f_g^c);
sixth, the G group features f_1, ..., f_G are spliced together and mapped by a third fully-connected layer FC_3 to the global feature f ∈ R^C of the person, R^C being the one-dimensional matrix space of dimension C: f = FC_3(Concat(f_1, f_2, ..., f_G)), wherein Concat(f_1, f_2, ..., f_G) denotes the splicing of the G groups of joint features;
splicing the global features of the N×F frames gives the features X of all frames, wherein X ∈ R^(C×N×F), the three-dimensional matrix space of dimension C×N×F.
3. The single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion according to claim 1, characterized in that the view self-attention transformer network in the second step consists of a relative camera position encoder and a view self-attention fusion module.
4. The single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion according to claim 2, characterized in that the view self-attention transformer network in the second step is obtained by the following steps:
step 201: in the camera dimension, X ∈ R^(C×N×F) is formed by splicing the N camera features X_v, wherein X_v ∈ R^(C×F), v ∈ {1, 2, ..., N}, R^(C×F) is the two-dimensional matrix space of dimension C×F, and X_v is obtained by splicing the global features of the F frames of the v-th camera; in the camera feature fusion process the temporal dimension F is omitted, i.e., the features are simplified to X_v ∈ R^C, v ∈ {1, 2, ..., N};
step 202: the view self-attention transformer network first learns the relative position relationship between the cameras adaptively through a neural network; it takes as input the query variable Q_a of the a-th camera and the key variable K_b of the b-th camera and outputs the mapping matrix M_ab of the relative position relationship between the a-th and b-th cameras and the feature fusion weighting coefficient A_ab, wherein Q_a and K_b are computed from X_a and X_b, which denote the a-th and b-th camera features, respectively, M_ab ∈ R^(D×D), the two-dimensional matrix space of dimension D×D, and C = H×D, with H the number of local feature points and D their dimension:
M_ab = FC_4(Res_r(Q_a, K_b)), A_ab = FC_5(Res_r(Q_a, K_b)),
wherein the two output branches share the same residual network Res_r, which extracts the relation feature between Q_a and K_b, and the fourth and fifth fully-connected layers FC_4 and FC_5 then output M_ab and A_ab, respectively;
step 203: the value feature V_b of the b-th camera is reshaped into H D-dimensional local feature points, giving V_b' ∈ R^(D×H), the two-dimensional matrix space of dimension D×H, which is the reshaped value feature of the b-th camera; relative camera position encoding is then realized by linearly mapping V_b' with M_ab:
V_ab^code = M_ab V_b',
wherein V_ab^code denotes the feature of the a-th camera obtained after the b-th camera feature is subjected to relative camera position encoding; applying this to the N×N camera feature combinations {(X_a, X_b) | a, b ∈ {1, 2, ..., N}} yields the encoded feature V_code, obtained by splicing the N×N features V_ab^code, and reshaping V_code gives the feature V_map, wherein V_code ∈ R^(D×H×N×N) and V_map ∈ R^(C×N×N) are the four-dimensional matrix space of dimension D×H×N×N and the three-dimensional matrix space of dimension C×N×N, respectively;
step 204: A = {A_ab | a ∈ {1, 2, ..., N}, b ∈ {1, 2, ..., N}} denotes the fusion coefficient matrix formed by the N×N fusion coefficients A_ab; a random masking strategy randomly sets part of the fusion coefficients in A to 0, so as to vary the number of cameras participating in the fusion;
step 205: using the randomly masked A, V_map is fused by weighting to obtain the multi-view fused feature V_fuse ∈ R^(C×N), the two-dimensional matrix space of dimension C×N:
V_fuse = sum(softmax(A) ⊙ V_map),
wherein softmax(·) denotes the normalized exponential function, which normalizes the fusion coefficient matrix A along the third dimension, ⊙ denotes dot multiplication, and sum(·) denotes merging the features of the N cameras along the third dimension.
5. The single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion according to claim 1, characterized in that the temporal self-attention transformer network in the third step comprises an encoding module and two layers of feature fusion modules.
6. The single-person three-dimensional pose estimation method according to claim 4, characterized in that the temporal self-attention transformer network in the third step is implemented by the following steps:
step 301: in the temporal feature fusion process the camera dimension N is omitted; the encoding module of the temporal self-attention transformer network first encodes V_fuse through a sixth fully-connected layer FC_6: Z = FC_6(V_fuse), wherein Z is the feature after feature encoding;
step 302: the encoding module then constructs the sequence position encoding P_emb from cos and sin functions and applies position encoding to Z as follows: Z_0 = Z + P_emb, wherein Z_0 is the position-encoded feature;
step 303: the temporal self-attention transformer network contains two layers of feature fusion modules based on the self-attention mechanism; with m ∈ {1, 2} indexing the layer of the feature fusion module, the m-th layer fuses features as follows:
Q_m = FC_7^m(Z_{m-1}), K_m = FC_8^m(Z_{m-1}), V_m = FC_9^m(Z_{m-1}),
Z_m = FFN(softmax(Q_m K_m^T) V_m),
wherein Q_m, K_m and V_m are the query vector, key vector and value vector obtained by mapping the temporal features Z_{m-1} of the (m-1)-th layer through the seventh, eighth and ninth fully-connected layers FC_7^m, FC_8^m and FC_9^m, respectively, FFN is a multilayer perceptron with residual connection, Z_m is the temporal fusion feature of the m-th layer, and the input feature of the first fusion layer is Z_0;
step 304: finally, a tenth fully-connected layer FC_10 regresses the 3D pose P_3D of the intermediate frame of each video sequence under the N cameras, wherein P_3D ∈ R^(3J×N), the two-dimensional matrix space of dimension 3J×N: P_3D = FC_10(Z_2).
CN202111649445.9A 2021-12-30 2021-12-30 Single three-dimensional attitude estimation method based on self-adaptive multi-view and time sequence feature fusion Pending CN114511629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111649445.9A CN114511629A (en) 2021-12-30 2021-12-30 Single three-dimensional attitude estimation method based on self-adaptive multi-view and time sequence feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111649445.9A CN114511629A (en) 2021-12-30 2021-12-30 Single three-dimensional attitude estimation method based on self-adaptive multi-view and time sequence feature fusion

Publications (1)

Publication Number Publication Date
CN114511629A true CN114511629A (en) 2022-05-17

Family

ID=81547720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111649445.9A Pending CN114511629A (en) 2021-12-30 2021-12-30 Single three-dimensional attitude estimation method based on self-adaptive multi-view and time sequence feature fusion

Country Status (1)

Country Link
CN (1) CN114511629A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861930A (en) * 2022-12-13 2023-03-28 南京信息工程大学 Crowd counting network modeling method based on hierarchical difference feature aggregation
CN115861930B (en) * 2022-12-13 2024-02-06 南京信息工程大学 Crowd counting network modeling method based on hierarchical difference feature aggregation

Similar Documents

Publication Publication Date Title
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN113160375B (en) Three-dimensional reconstruction and camera pose estimation method based on multi-task learning algorithm
CN111626159B (en) Human body key point detection method based on attention residual error module and branch fusion
CN113283525B (en) Image matching method based on deep learning
CN111783582A (en) Unsupervised monocular depth estimation algorithm based on deep learning
CN114596520A (en) First visual angle video action identification method and device
Wang et al. Depth estimation of video sequences with perceptual losses
WO2024051184A1 (en) Optical flow mask-based unsupervised monocular depth estimation method
CN115484410B (en) Event camera video reconstruction method based on deep learning
CN113850900A (en) Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction
CN111210382A (en) Image processing method, image processing device, computer equipment and storage medium
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
CN115330950A (en) Three-dimensional human body reconstruction method based on time sequence context clues
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN114511629A (en) Single three-dimensional attitude estimation method based on self-adaptive multi-view and time sequence feature fusion
CN116468769A (en) Depth information estimation method based on image
CN116524121A (en) Monocular video three-dimensional human body reconstruction method, system, equipment and medium
Zhang et al. Self-Supervised Monocular Depth Estimation With Self-Perceptual Anomaly Handling
CN116740290A (en) Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
CN117011357A (en) Human body depth estimation method and system based on 3D motion flow and normal map constraint
CN112419387B (en) Unsupervised depth estimation method for solar greenhouse tomato plant image
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning
CN115115685A (en) Monocular image depth estimation algorithm based on self-attention neural network
CN110766732A (en) Robust single-camera depth map estimation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination