CN114511629A - Single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion - Google Patents

Single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion Download PDF

Info

Publication number
CN114511629A
CN114511629A
Authority
CN
China
Prior art keywords
self
camera
feature
dimension
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111649445.9A
Other languages
Chinese (zh)
Inventor
刘青山
帅惠
吴乐乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202111649445.9A priority Critical patent/CN114511629A/en
Publication of CN114511629A publication Critical patent/CN114511629A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion. Taking video sequences from multiple cameras as input, a robust feature extractor is designed to extract 2D pose features. Taking the 2D pose features as input, an adaptive view self-attention transformer network is designed over the camera dimension; it fuses any number of two-dimensional poses from uncalibrated cameras through relative camera position encoding and a self-attention mechanism to obtain multi-view fused pose features. Taking the multi-view fused pose features as input, a temporal self-attention transformer network is designed over the time dimension; it adaptively fuses multi-frame features through a self-attention mechanism to obtain the final 3D pose. The method is reasonably designed, can be applied directly to scenes with any number of uncalibrated cameras without retraining, and has a small computational cost.

Description

Single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion.
Background
Single-person 3D human pose estimation is a hot topic in computer vision. It plays an important role in many applications such as action recognition, human body reconstruction and robotic manipulation. According to the number of cameras used, single-person 3D human pose estimation can be divided into monocular and multi-view 3D human pose estimation. A monocular estimation model uses only the visual information of the image from a single camera during feature encoding and cannot effectively resolve occlusion and depth ambiguity. A multi-view model can use multi-view geometric constraints to recover occluded joint information and the depth information lost after camera projection, and can therefore predict the three-dimensional human pose more accurately. However, existing multi-view 3D human pose estimation models mainly rely on camera parameters to provide the multi-view geometric constraints; they depend heavily on the multi-camera configuration and cannot be applied directly to single-view or few-camera scenes. In natural scenes the camera positions often change, and real-time camera calibration in a dynamic scene is impractical. In addition, existing multi-view models are computationally expensive and cannot use a temporal model to capture temporal information and obtain smooth three-dimensional poses.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a single-person three-dimensional pose estimation method that fuses adaptive multi-view and temporal features. It extracts robust pose features with a carefully designed feature extractor and fuses multi-view geometric information and sequence temporal information through a view self-attention transformer network and a temporal self-attention transformer network, so that the method can be better applied to fields such as action recognition, human body reconstruction and robotic manipulation.
The invention adopts the following technical scheme to solve the technical problem:
The invention provides a single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion, comprising the following steps:
First, taking video sequence frames from multiple cameras as input, a robust feature extractor extracts 2D pose features;
Second, taking the 2D pose features as input, an adaptive view self-attention transformer network is designed over the camera dimension; it fuses any number of two-dimensional poses from uncalibrated cameras through relative camera position encoding and a self-attention mechanism to obtain multi-view fused pose features;
Third, taking the multi-view fused pose features as input, a temporal self-attention transformer network is designed over the time dimension; it adaptively fuses multi-frame features through a self-attention mechanism to obtain the final 3D pose.
As a further refinement of the single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, the feature extractor in the first step is as follows:
First, video sequences from N cameras are given, each containing F frames, and the N×F frames share the same feature extractor. Each frame I has width W and height H and contains the three color channels R, G and B, i.e., I ∈ R^(W×H×3), where R^(W×H×3) is the three-dimensional matrix space of dimension W×H×3; each frame contains only one person.
The feature extractor comprises a 2D pose detector and a 3D pose feature extractor. For each frame I, the 2D pose detector is first used to predict the 2D pose information (P_2D, C_2D), where the total number of joints is J; P_2D and C_2D are the 2D coordinates and the confidences of the J joints, respectively, p_j is the 2D coordinate of the j-th joint, and c_j is the confidence of the j-th joint.
Second, P_2D and C_2D are divided into G groups of joints according to the motion correlation of the human joints, giving subsets P_2D^g and C_2D^g, where g ∈ {1, 2, ..., G}; P_2D^g and C_2D^g are the g-th subsets of P_2D and C_2D, P_2D^g ∈ R^(2J_g), the one-dimensional matrix space of dimension 2J_g, and C_2D^g ∈ R^(J_g), the one-dimensional matrix space of dimension J_g; S_g denotes the index set of all joints of group g, J_g is the number of joints in group g, and p_i and c_i denote the 2D coordinates and the confidence of the i-th joint, so that P_2D^g = {p_i | i ∈ S_g} and C_2D^g = {c_i | i ∈ S_g}.
third, the 3D attitude feature extractor firstly uses the first full connection layer
Figure BDA00034461434600000215
Coordinate the g-th group of 2D joints
Figure BDA00034461434600000216
Mapping as a feature
Figure BDA00034461434600000217
Figure BDA00034461434600000218
Is a one-dimensional matrix space with the dimension of C/2; c represents the characteristics of the joint of group G
Figure BDA00034461434600000219
The channel dimensions of the obtained global features are:
Figure BDA00034461434600000220
fourth step, second fully-connected network
Figure BDA00034461434600000221
Input the method
Figure BDA00034461434600000222
Outputting mapping matrix of the g group of joints
Figure BDA00034461434600000223
Figure BDA00034461434600000224
Is one dimension of (C/2) x 2JgA two-dimensional matrix space of (a);
Figure BDA00034461434600000225
will be provided with
Figure BDA00034461434600000226
Mapping to C/2 channel number feature
Figure BDA00034461434600000227
Figure BDA00034461434600000228
For modulation
Figure BDA00034461434600000229
Figure BDA00034461434600000230
Figure BDA0003446143460000031
Fifth, for each of the G groups, it will be
Figure BDA0003446143460000032
And
Figure BDA0003446143460000033
after addition, the residual error network of the g-th group
Figure BDA0003446143460000034
Further extracting spatial information to obtain the adjusted characteristics of the g group
Figure BDA0003446143460000035
Figure BDA0003446143460000036
Sixth step, group G features
Figure BDA0003446143460000037
Spliced together through a third fully-connected layer
Figure BDA0003446143460000038
Global features mapped to a person
Figure BDA0003446143460000039
Figure BDA00034461434600000310
Is a one-dimensional matrix space with dimension C; wherein
Figure BDA00034461434600000311
Concat(f1,f2,…,fG) Representing the splicing of G groups of joint features;
splicing the global features of the N multiplied by F frame pictures to obtain the features X of all the pictures, wherein
Figure BDA00034461434600000312
Figure BDA00034461434600000313
Is a three-dimensional matrix space with dimensions C × N × F.
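To make the grouped, confidence-modulated feature extraction described above concrete, the following is a minimal PyTorch-style sketch. It is only an illustration of the technique under assumed shapes and layer sizes; the class name GroupFeatureExtractor, the parameter group_idx, the default channel dimension C and the internal layer layout are hypothetical and not taken from the patent.

```python
import torch
import torch.nn as nn

class GroupFeatureExtractor(nn.Module):
    """Sketch: per-group coordinate features modulated by a confidence-driven mapping matrix."""
    def __init__(self, group_idx, C=256):
        super().__init__()
        self.group_idx = group_idx                          # G lists of joint indices
        self.fc_coord = nn.ModuleList(                      # first FC layer per group: 2*J_g -> C/2
            [nn.Linear(2 * len(idx), C // 2) for idx in group_idx])
        self.fc_conf = nn.ModuleList(                       # second FC network: J_g -> (C/2) * 2*J_g
            [nn.Linear(len(idx), (C // 2) * 2 * len(idx)) for idx in group_idx])
        self.res = nn.ModuleList(                           # per-group residual refinement
            [nn.Sequential(nn.Linear(C // 2, C // 2), nn.ReLU(), nn.Linear(C // 2, C // 2))
             for _ in group_idx])
        self.fc_out = nn.Linear(len(group_idx) * (C // 2), C)   # third FC layer -> global feature

    def forward(self, p2d, c2d):
        # p2d: (B, J, 2) detected 2D joints, c2d: (B, J) joint confidences
        feats = []
        for g, idx in enumerate(self.group_idx):
            pg = p2d[:, idx].flatten(1)                     # (B, 2*J_g) group coordinates
            cg = c2d[:, idx]                                # (B, J_g)   group confidences
            f_p = self.fc_coord[g](pg)                      # coordinate feature, C/2 channels
            W = self.fc_conf[g](cg).view(cg.size(0), -1, pg.size(1))    # mapping matrix W_g
            f_c = torch.bmm(W, pg.unsqueeze(-1)).squeeze(-1)            # confidence-modulated feature
            h = f_p + f_c
            feats.append(h + self.res[g](h))                # residual spatial refinement
        return self.fc_out(torch.cat(feats, dim=1))         # global per-frame feature (B, C)
```

Running such an extractor on every one of the N×F frames and stacking the outputs would yield a feature tensor of shape C×N×F, matching the features X described above.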
As a further refinement of the single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, the view self-attention transformer network in the second step consists of a relative camera position encoder and a view self-attention fusion module.
As a further refinement of the single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, the view self-attention transformer network in the second step is obtained by the following steps:
Step 201: in the camera dimension, X ∈ R^(C×N×F) is formed by splicing the N camera features X_v, where X_v ∈ R^(C×F), v ∈ {1, 2, ..., N}, and R^(C×F) is the two-dimensional matrix space of dimension C×F; X_v is obtained by splicing the global features of the F frames of the v-th camera. In the camera feature fusion process the temporal dimension F is omitted, i.e., the features are simplified to X_v ∈ R^C, v ∈ {1, 2, ..., N}.
Step 202: the view self-attention transformer network first learns the relative position relationship between the cameras adaptively through a neural network. It takes as input the query variable Q_a of the a-th camera and the key variable K_b of the b-th camera and outputs the mapping matrix M_ab of the relative position relationship between the a-th and b-th cameras and the feature fusion weighting coefficient A_ab. Here Q_a and K_b are computed from X_a and X_b, which denote the a-th and b-th camera features, respectively; M_ab ∈ R^(D×D), the two-dimensional matrix space of dimension D×D; and C = H×D, where H is the number of local feature points and D is their dimension:
M_ab = FC_4(Res_r(Q_a, K_b)), A_ab = FC_5(Res_r(Q_a, K_b)),
where the two output branches share the same residual network Res_r, which extracts the relation feature between Q_a and K_b, and the fourth and fifth fully-connected layers FC_4 and FC_5 then output M_ab and A_ab, respectively.
Step 203: the value feature V_b of the b-th camera is reshaped into H D-dimensional local feature points, giving V_b' ∈ R^(D×H), the two-dimensional matrix space of dimension D×H, which is the reshaped value feature of the b-th camera. Relative camera position encoding is then realized by linearly mapping V_b' with M_ab:
V_ab^code = M_ab V_b',
where V_ab^code denotes the feature of the a-th camera obtained after the b-th camera feature is subjected to relative camera position encoding. Applying this to the N×N camera feature combinations {(X_a, X_b) | a, b ∈ {1, 2, ..., N}} yields the encoded feature V_code, obtained by splicing the N×N features V_ab^code; reshaping V_code gives the feature V_map, where V_code ∈ R^(D×H×N×N) and V_map ∈ R^(C×N×N) are the four-dimensional matrix space of dimension D×H×N×N and the three-dimensional matrix space of dimension C×N×N, respectively.
Step 204: A = {A_ab | a ∈ {1, 2, ..., N}, b ∈ {1, 2, ..., N}} denotes the fusion coefficient matrix formed by the N×N fusion coefficients A_ab; a random masking strategy randomly sets part of the fusion coefficients in A to 0, so as to vary the number of cameras participating in the fusion.
Step 205: using the randomly masked A, V_map is fused by weighting to obtain the multi-view fused feature V_fuse ∈ R^(C×N), the two-dimensional matrix space of dimension C×N:
V_fuse = sum(softmax(A) ⊙ V_map),
where softmax(·) denotes the normalized exponential function, which normalizes the fusion coefficient matrix A along the third dimension, ⊙ denotes dot multiplication, and sum(·) denotes merging the features of the N cameras along the third dimension.
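The following PyTorch-style sketch illustrates how such a view self-attention fusion with learned relative camera position mappings and random masking of fusion coefficients could look. It is a minimal illustration under assumed shapes; the class name ViewAttentionFusion, the layer sizes, and the use of -inf masking (so that masked cameras receive zero weight after softmax) are assumptions, not the patent's exact design.

```python
import torch
import torch.nn as nn

class ViewAttentionFusion(nn.Module):
    """Sketch: fuse per-camera features with relative position mappings M_ab and coefficients A_ab."""
    def __init__(self, C=256, H=8):
        super().__init__()
        assert C % H == 0
        self.H, self.D = H, C // H
        self.to_q = nn.Linear(C, C)                                    # query variables Q_a
        self.to_k = nn.Linear(C, C)                                    # key variables K_b
        self.rel = nn.Sequential(nn.Linear(2 * C, C), nn.ReLU(),       # shared relation network
                                 nn.Linear(C, C))
        self.to_M = nn.Linear(C, self.D * self.D)                      # "fourth" FC layer -> M_ab
        self.to_A = nn.Linear(C, 1)                                    # "fifth" FC layer -> A_ab

    def forward(self, X, drop_prob=0.0):
        # X: (N, C) per-camera features (temporal dimension omitted, as in the text)
        N, C = X.shape
        q, k = self.to_q(X), self.to_k(X)
        pair = torch.cat([q.unsqueeze(1).expand(N, N, C),
                          k.unsqueeze(0).expand(N, N, C)], dim=-1)     # all camera pairs (a, b)
        rel = self.rel(pair)                                           # relation features
        M = self.to_M(rel).view(N, N, self.D, self.D)                  # relative position mappings
        A = self.to_A(rel).squeeze(-1)                                 # (N, N) fusion coefficients
        if self.training and drop_prob > 0:                            # random masking strategy
            A = A.masked_fill(torch.rand_like(A) < drop_prob, float('-inf'))
        V = X.view(N, self.H, self.D)                                  # H local feature points per camera
        V_map = torch.einsum('abij,bhj->abhi', M, V).reshape(N, N, C)  # position-encoded values
        W = torch.softmax(A, dim=1).unsqueeze(-1)                      # normalize over source cameras b
        return (W * V_map).sum(dim=1)                                  # V_fuse: (N, C)
```

Because every learned weight in this sketch acts on a camera pair rather than on a fixed camera index, the same module can in principle be applied to a different number of cameras at test time, which is the adaptability the network is designed to provide.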
As a further refinement of the single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, the temporal self-attention transformer network in the third step consists of an encoding module and two layers of feature fusion modules.
As a further refinement of the single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion, the temporal self-attention transformer network in the third step is implemented by the following steps:
Step 301: in the temporal feature fusion process the camera dimension N is omitted. The encoding module of the temporal self-attention transformer network first encodes V_fuse through a sixth fully-connected layer FC_6:
Z = FC_6(V_fuse),
where Z is the feature after feature encoding.
Step 302: the encoding module then constructs the sequence position encoding P_emb from cos and sin functions and applies position encoding to Z as follows:
Z_0 = Z + P_emb,
where Z_0 is the position-encoded feature.
Step 303: the temporal self-attention transformer network contains two layers of feature fusion modules based on the self-attention mechanism. Let m ∈ {1, 2} index the layer of the feature fusion module; the m-th layer fuses features as follows:
Q_m = FC_7^m(Z_{m-1}),
K_m = FC_8^m(Z_{m-1}),
V_m = FC_9^m(Z_{m-1}),
Z_m = FFN(softmax(Q_m K_m^T) V_m),
where Q_m, K_m and V_m are the query vector, key vector and value vector obtained by mapping the temporal features Z_{m-1} of the (m-1)-th layer through the seventh, eighth and ninth fully-connected layers FC_7^m, FC_8^m and FC_9^m, respectively; FFN is a multilayer perceptron with residual connection; Z_m is the temporal fusion feature of the m-th layer, and the input feature of the first fusion layer is Z_0.
Step 304: finally, a tenth fully-connected layer FC_10 regresses the 3D pose P_3D of the intermediate frame of each video sequence under the N cameras, where P_3D ∈ R^(3J×N), the two-dimensional matrix space of dimension 3J×N:
P_3D = FC_10(Z_2).
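A minimal PyTorch-style sketch of such a temporal fusion stage is given below, using a standard sinusoidal position encoding and the built-in transformer encoder layer as a stand-in for the two feature fusion layers. The class name TemporalFusion, the layer widths, and the choice of nn.TransformerEncoder are assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Sketch: two-layer temporal self-attention fusion with sinusoidal position encoding."""
    def __init__(self, C=256, J=17, n_layers=2, n_heads=8):
        super().__init__()
        self.embed = nn.Linear(C, C)                       # "sixth" FC layer: feature encoding
        layer = nn.TransformerEncoderLayer(d_model=C, nhead=n_heads,
                                           dim_feedforward=2 * C, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(C, 3 * J)                    # "tenth" FC layer: regress the 3D pose

    @staticmethod
    def positional_encoding(F, C):
        # classic sin/cos sequence position encoding P_emb
        pos = torch.arange(F).unsqueeze(1).float()
        i = torch.arange(0, C, 2).float()
        angle = pos / (10000 ** (i / C))
        pe = torch.zeros(F, C)
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        return pe

    def forward(self, v_fuse):
        # v_fuse: (B, F, C) multi-view fused features for F frames
        B, F, C = v_fuse.shape
        z = self.embed(v_fuse) + self.positional_encoding(F, C).to(v_fuse.device)
        z = self.encoder(z)                                # two self-attention fusion layers
        return self.head(z[:, F // 2])                     # 3D pose of the middle frame: (B, 3*J)
```

Selecting the middle frame for regression mirrors step 304, where the 3D pose of the intermediate frame of each sequence is predicted.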
Compared with the prior art, the technical scheme of the invention has the following technical effects:
The designed feature extractor can effectively use the confidences to suppress the noise of the 2D pose detector and thus extract robust features; the designed view self-attention transformer network can adaptively fuse any number of uncalibrated camera features, so it can be applied directly to any camera configuration without retraining; the designed temporal self-attention transformer network can adaptively fuse single-frame and multi-frame temporal features, so it can be applied directly to static scenes and dynamic video scenes without retraining. In particular:
(1) The feature extractor modulates the 2D pose features with the joint confidences, reducing the influence of 2D pose noise on 3D pose estimation and extracting robust pose features;
(2) The view self-attention transformer network adaptively fuses multi-view features and can be applied directly to any camera configuration (camera number and camera parameters) without camera calibration or retraining;
(3) The temporal self-attention transformer network adaptively fuses temporal features and can be applied directly to single-frame and multi-frame scenes without retraining.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a block diagram of a feature extractor of the present invention;
FIG. 3 is a block diagram of the attention mechanism in the view self-attention transformer network of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
A single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion comprises the following steps:
First, taking video sequence frames from multiple cameras as input, a robust feature extractor extracts 2D pose features;
Second, taking the 2D pose features as input, an adaptive view self-attention transformer network is designed over the camera dimension; it fuses any number of two-dimensional poses from uncalibrated cameras through relative camera position encoding and a self-attention mechanism to obtain multi-view fused pose features;
Third, taking the multi-view fused pose features as input, a temporal self-attention transformer network is designed over the time dimension; it adaptively fuses multi-frame features through a self-attention mechanism to obtain the final 3D pose.
Preferably, the feature extractor in the first step uses a pre-trained 2D detector to obtain the joint pose information as input and uses fully-connected networks to extract robust features.
Preferably, the feature extractor in the first step comprises the following steps:
First, as shown in FIG. 1, video sequences from N cameras are given, each containing F frames, and the N×F frames share the same feature extractor. Each frame I has width W and height H and contains the three color channels R, G and B, i.e., I ∈ R^(W×H×3), where R^(W×H×3) is the three-dimensional matrix space of dimension W×H×3; each frame contains only one person.
The feature extractor comprises a 2D pose detector and a 3D pose feature extractor. For each frame I, the 2D pose detector is first used to predict the 2D pose information (P_2D, C_2D), where the total number of joints is J; P_2D and C_2D are the 2D coordinates and the confidences of the J joints, respectively, p_j is the 2D coordinate of the j-th joint, and c_j is the confidence of the j-th joint.
Second, as shown in FIG. 2, P_2D and C_2D are divided into G groups of joints according to the motion correlation of the human joints, giving subsets P_2D^g and C_2D^g, where g ∈ {1, 2, ..., G}; P_2D^g and C_2D^g are the g-th subsets of P_2D and C_2D, P_2D^g ∈ R^(2J_g), the one-dimensional matrix space of dimension 2J_g, and C_2D^g ∈ R^(J_g), the one-dimensional matrix space of dimension J_g; S_g denotes the index set of all joints of group g, J_g is the number of joints in group g, and p_i and c_i denote the 2D coordinates and the confidence of the i-th joint. In the present invention the joints of the human body are divided into 5 groups according to their motion correlation, namely the head, the left/right hands and the left/right legs, i.e., G = 5; an illustrative grouping is sketched after the sixth step below.
Third, the 3D pose feature extractor first uses a first fully-connected layer FC_1^g to map the g-th group of 2D joint coordinates P_2D^g to a feature f_g^p ∈ R^(C/2), the one-dimensional matrix space of dimension C/2, where C denotes the channel dimension of the global feature obtained from the G groups of joint features:
f_g^p = FC_1^g(P_2D^g).
Fourth, a second fully-connected network FC_2^g takes C_2D^g as input and outputs the mapping matrix W_g of the g-th group of joints, W_g ∈ R^((C/2)×2J_g), the two-dimensional matrix space of dimension (C/2)×2J_g; W_g maps P_2D^g to a feature f_g^c with C/2 channels, which is used to modulate f_g^p:
W_g = FC_2^g(C_2D^g), f_g^c = W_g P_2D^g.
Fifth, for each of the G groups, f_g^p and f_g^c are added, and the residual network Res^g of the g-th group further extracts spatial information to obtain the adjusted feature f_g of the g-th group:
f_g = Res^g(f_g^p + f_g^c).
Sixth, the G group features f_1, ..., f_G are spliced together and mapped by a third fully-connected layer FC_3 to the global feature f ∈ R^C of the person, where R^C is the one-dimensional matrix space of dimension C:
f = FC_3(Concat(f_1, f_2, ..., f_G)),
where Concat(f_1, f_2, ..., f_G) denotes the splicing of the G groups of joint features.
Splicing the global features of the N×F frames gives the features X of all frames, where X ∈ R^(C×N×F), the three-dimensional matrix space of dimension C×N×F.
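As a concrete illustration of the G = 5 grouping mentioned above, the snippet below shows one possible joint-index assignment for a common 17-joint skeleton, reusing the hypothetical GroupFeatureExtractor sketch given earlier. The indices are illustrative only; the patent fixes the five groups (head, left/right hands, left/right legs) but not a joint numbering.

```python
# Hypothetical G = 5 grouping for a 17-joint skeleton (indices illustrative only).
group_idx = [
    [0, 7, 8, 9, 10],   # torso/head chain ("head" group)
    [11, 12, 13],       # left arm  ("left hand" group)
    [14, 15, 16],       # right arm ("right hand" group)
    [4, 5, 6],          # left leg
    [1, 2, 3],          # right leg
]
extractor = GroupFeatureExtractor(group_idx, C=256)  # reuses the earlier sketch
```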
Preferably, the view self-attention transformer network in the second step fuses the view features with a relative camera position encoding module and a self-attention fusion module.
Preferably, the view self-attention transformer network in the second step comprises the following steps:
First, in the camera dimension, X ∈ R^(C×N×F) is formed by splicing the N camera features X_v, where X_v ∈ R^(C×F), v ∈ {1, 2, ..., N}, and R^(C×F) is the two-dimensional matrix space of dimension C×F; X_v is obtained by splicing the global features of the F frames of the v-th camera. In the camera feature fusion process the temporal dimension F is omitted, i.e., the features are simplified to X_v ∈ R^C, v ∈ {1, 2, ..., N}.
Second, as shown in FIG. 3, the view self-attention transformer network first learns the relative position relationship between the cameras adaptively through a neural network. It takes as input the query variable Q_a of the a-th camera and the key variable K_b of the b-th camera and outputs the mapping matrix M_ab of the relative position relationship between the a-th and b-th cameras and the feature fusion weighting coefficient A_ab. Here Q_a and K_b are computed from X_a and X_b, which denote the a-th and b-th camera features, respectively; M_ab ∈ R^(D×D), the two-dimensional matrix space of dimension D×D; and C = H×D, where H is the number of local feature points and D is their dimension:
M_ab = FC_4(Res_r(Q_a, K_b)), A_ab = FC_5(Res_r(Q_a, K_b)),
where the two output branches share the same residual network Res_r, which extracts the relation feature between Q_a and K_b, and the fourth and fifth fully-connected layers FC_4 and FC_5 then output M_ab and A_ab, respectively.
Third, the value feature V_b of the b-th camera is reshaped into H D-dimensional local feature points, giving V_b' ∈ R^(D×H), the two-dimensional matrix space of dimension D×H, which is the reshaped value feature of the b-th camera. Relative camera position encoding is then realized by linearly mapping V_b' with M_ab:
V_ab^code = M_ab V_b',
where V_ab^code denotes the feature of the a-th camera obtained after the b-th camera feature is subjected to relative camera position encoding. Applying this to the N×N camera feature combinations {(X_a, X_b) | a, b ∈ {1, 2, ..., N}} yields the encoded feature V_code, obtained by splicing the N×N features V_ab^code; reshaping V_code gives the feature V_map, where V_code ∈ R^(D×H×N×N) and V_map ∈ R^(C×N×N) are the four-dimensional matrix space of dimension D×H×N×N and the three-dimensional matrix space of dimension C×N×N, respectively.
Fourth, A = {A_ab | a ∈ {1, 2, ..., N}, b ∈ {1, 2, ..., N}} denotes the fusion coefficient matrix formed by the N×N fusion coefficients A_ab; a random masking strategy randomly sets part of the fusion coefficients in A to 0, so as to vary the number of cameras participating in the fusion.
Fifth, using the randomly masked A, V_map is fused by weighting to obtain the multi-view fused feature V_fuse ∈ R^(C×N), the two-dimensional matrix space of dimension C×N:
V_fuse = sum(softmax(A) ⊙ V_map),
where softmax(·) denotes the normalized exponential function, which normalizes the fusion coefficient matrix A along the third dimension, ⊙ denotes dot multiplication, and sum(·) denotes merging the features of the N cameras along the third dimension.
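To illustrate the practical consequence of this design, the usage sketch below applies one fusion module (the hypothetical ViewAttentionFusion sketch shown earlier) to different camera counts without retraining; it assumes that sketch is in scope and is not code from the patent.

```python
import torch

# Usage sketch: the same (hypothetical) fusion module handles 1, 2 or 4 uncalibrated
# cameras, because its learned weights act on camera pairs rather than on a fixed N.
fusion = ViewAttentionFusion(C=256, H=8).eval()
for n_cams in (1, 2, 4):
    X = torch.randn(n_cams, 256)        # per-camera features from the extractor
    V_fuse = fusion(X)                  # (n_cams, 256) multi-view fused features
    print(n_cams, tuple(V_fuse.shape))
```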
Preferably, the temporal self-attention transformer network in the third step consists of two temporal encoder layers, each consisting of a multi-head self-attention module and a feed-forward network operating along the time dimension.
Preferably, the temporal self-attention transformer network in the third step comprises the following steps:
First, in the temporal feature fusion process the camera dimension N is omitted. The encoding module of the temporal self-attention transformer network first encodes V_fuse through a sixth fully-connected layer FC_6:
Z = FC_6(V_fuse),
where Z is the feature after feature encoding.
Second, the encoding module constructs the sequence position encoding P_emb from cos and sin functions and applies position encoding to Z as follows:
Z_0 = Z + P_emb,
where Z_0 is the position-encoded feature.
Third, the temporal self-attention transformer network contains two layers of feature fusion modules based on the self-attention mechanism. Let m ∈ {1, 2} index the layer of the feature fusion module; the m-th layer fuses features as follows:
Q_m = FC_7^m(Z_{m-1}),
K_m = FC_8^m(Z_{m-1}),
V_m = FC_9^m(Z_{m-1}),
Z_m = FFN(softmax(Q_m K_m^T) V_m),
where Q_m, K_m and V_m are the query vector, key vector and value vector obtained by mapping the temporal features Z_{m-1} of the (m-1)-th layer through the seventh, eighth and ninth fully-connected layers FC_7^m, FC_8^m and FC_9^m, respectively; FFN is a multilayer perceptron with residual connection; Z_m is the temporal fusion feature of the m-th layer, and the input feature of the first fusion layer is Z_0.
Fourth, finally, a tenth fully-connected layer FC_10 regresses the 3D pose P_3D of the intermediate frame of each video sequence under the N cameras, where P_3D ∈ R^(3J×N), the two-dimensional matrix space of dimension 3J×N:
P_3D = FC_10(Z_2).
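For orientation, the following end-to-end usage sketch chains the three hypothetical modules sketched earlier (GroupFeatureExtractor, ViewAttentionFusion, TemporalFusion) in the order described by the method; all shapes, names and the frame-by-frame fusion loop are assumptions, not the patent's implementation.

```python
import torch

N, F, J, C = 4, 9, 17, 256                        # cameras, frames, joints, channels (assumed)
p2d = torch.rand(N * F, J, 2)                     # 2D poses from an off-the-shelf detector
c2d = torch.rand(N * F, J)                        # joint confidences

extractor = GroupFeatureExtractor(group_idx, C)   # per-frame pose features (earlier sketch)
fusion = ViewAttentionFusion(C, H=8)              # adaptive multi-view fusion (earlier sketch)
temporal = TemporalFusion(C, J)                   # temporal fusion + 3D regression (earlier sketch)

X = extractor(p2d, c2d).view(N, F, C)             # features for all N x F frames
V_fuse = torch.stack([fusion(X[:, f]) for f in range(F)], dim=1)   # (N, F, C)
P_3D = temporal(V_fuse).view(N, J, 3)             # 3D pose of the middle frame per camera
```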
The feature extractor designed by the invention can effectively use the confidences to suppress the noise of the 2D pose detector and thus extract robust features; the designed view self-attention transformer network can adaptively fuse any number of uncalibrated camera features, so it can be applied directly to any camera configuration without retraining; the designed temporal self-attention transformer network can adaptively fuse single-frame and multi-frame temporal features, so it can be applied directly to static scenes and dynamic video scenes without retraining.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (6)

1. A single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, characterized by comprising the following steps:
first, taking video sequence frames from multiple cameras as input, extracting 2D pose features with a robust feature extractor;
second, taking the 2D pose features as input, designing an adaptive view self-attention transformer network over the camera dimension, fusing any number of two-dimensional poses from uncalibrated cameras through relative camera position encoding and a self-attention mechanism, and obtaining multi-view fused pose features;
third, taking the multi-view fused pose features as input, designing a temporal self-attention transformer network over the time dimension, and adaptively fusing multi-frame features through a self-attention mechanism to obtain the final 3D pose.
2. The single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion according to claim 1, characterized in that the feature extractor in the first step is as follows:
first, video sequences from N cameras are given, each containing F frames, and the N×F frames share the same feature extractor; each frame I has width W and height H and contains the three color channels R, G and B, i.e., I ∈ R^(W×H×3), the three-dimensional matrix space of dimension W×H×3; each frame contains only one person;
the feature extractor comprises a 2D pose detector and a 3D pose feature extractor; for each frame I, the 2D pose detector is first used to predict the 2D pose information (P_2D, C_2D), wherein the total number of joints is J, P_2D and C_2D are the 2D coordinates and the confidences of the J joints, respectively, p_j is the 2D coordinate of the j-th joint, and c_j is the confidence of the j-th joint;
second, P_2D and C_2D are divided into G groups of joints according to the motion correlation of the human joints, giving subsets P_2D^g and C_2D^g, wherein g ∈ {1, 2, ..., G}, P_2D^g and C_2D^g are the g-th subsets of P_2D and C_2D, P_2D^g ∈ R^(2J_g), the one-dimensional matrix space of dimension 2J_g, C_2D^g ∈ R^(J_g), the one-dimensional matrix space of dimension J_g, S_g denotes the index set of all joints of group g, J_g is the number of joints in group g, and p_i and c_i denote the 2D coordinates and the confidence of the i-th joint;
third, the 3D pose feature extractor first uses a first fully-connected layer FC_1^g to map the g-th group of 2D joint coordinates P_2D^g to a feature f_g^p ∈ R^(C/2), the one-dimensional matrix space of dimension C/2, wherein C denotes the channel dimension of the global feature obtained from the G groups of joint features: f_g^p = FC_1^g(P_2D^g);
fourth, a second fully-connected network FC_2^g takes C_2D^g as input and outputs the mapping matrix W_g of the g-th group of joints, W_g ∈ R^((C/2)×2J_g), the two-dimensional matrix space of dimension (C/2)×2J_g; W_g maps P_2D^g to a feature f_g^c with C/2 channels, which is used to modulate f_g^p: W_g = FC_2^g(C_2D^g), f_g^c = W_g P_2D^g;
fifth, for each of the G groups, f_g^p and f_g^c are added, and the residual network Res^g of the g-th group further extracts spatial information to obtain the adjusted feature f_g of the g-th group: f_g = Res^g(f_g^p + f_g^c);
sixth, the G group features f_1, ..., f_G are spliced together and mapped by a third fully-connected layer FC_3 to the global feature f ∈ R^C of the person, R^C being the one-dimensional matrix space of dimension C: f = FC_3(Concat(f_1, f_2, ..., f_G)), wherein Concat(f_1, f_2, ..., f_G) denotes the splicing of the G groups of joint features;
splicing the global features of the N×F frames gives the features X of all frames, wherein X ∈ R^(C×N×F), the three-dimensional matrix space of dimension C×N×F.
3. The single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion according to claim 1, characterized in that the view self-attention transformer network in the second step consists of a relative camera position encoder and a view self-attention fusion module.
4. The single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion according to claim 2, characterized in that the view self-attention transformer network in the second step is obtained by the following steps:
step 201: in the camera dimension, X ∈ R^(C×N×F) is formed by splicing the N camera features X_v, wherein X_v ∈ R^(C×F), v ∈ {1, 2, ..., N}, R^(C×F) is the two-dimensional matrix space of dimension C×F, and X_v is obtained by splicing the global features of the F frames of the v-th camera; in the camera feature fusion process the temporal dimension F is omitted, i.e., the features are simplified to X_v ∈ R^C, v ∈ {1, 2, ..., N};
step 202: the view self-attention transformer network first learns the relative position relationship between the cameras adaptively through a neural network; it takes as input the query variable Q_a of the a-th camera and the key variable K_b of the b-th camera and outputs the mapping matrix M_ab of the relative position relationship between the a-th and b-th cameras and the feature fusion weighting coefficient A_ab, wherein Q_a and K_b are computed from X_a and X_b, which denote the a-th and b-th camera features, respectively, M_ab ∈ R^(D×D), the two-dimensional matrix space of dimension D×D, and C = H×D, with H the number of local feature points and D their dimension:
M_ab = FC_4(Res_r(Q_a, K_b)), A_ab = FC_5(Res_r(Q_a, K_b)),
wherein the two output branches share the same residual network Res_r, which extracts the relation feature between Q_a and K_b, and the fourth and fifth fully-connected layers FC_4 and FC_5 then output M_ab and A_ab, respectively;
step 203: the value feature V_b of the b-th camera is reshaped into H D-dimensional local feature points, giving V_b' ∈ R^(D×H), the two-dimensional matrix space of dimension D×H, which is the reshaped value feature of the b-th camera; relative camera position encoding is then realized by linearly mapping V_b' with M_ab:
V_ab^code = M_ab V_b',
wherein V_ab^code denotes the feature of the a-th camera obtained after the b-th camera feature is subjected to relative camera position encoding; applying this to the N×N camera feature combinations {(X_a, X_b) | a, b ∈ {1, 2, ..., N}} yields the encoded feature V_code, obtained by splicing the N×N features V_ab^code, and reshaping V_code gives the feature V_map, wherein V_code ∈ R^(D×H×N×N) and V_map ∈ R^(C×N×N) are the four-dimensional matrix space of dimension D×H×N×N and the three-dimensional matrix space of dimension C×N×N, respectively;
step 204: A = {A_ab | a ∈ {1, 2, ..., N}, b ∈ {1, 2, ..., N}} denotes the fusion coefficient matrix formed by the N×N fusion coefficients A_ab; a random masking strategy randomly sets part of the fusion coefficients in A to 0, so as to vary the number of cameras participating in the fusion;
step 205: using the randomly masked A, V_map is fused by weighting to obtain the multi-view fused feature V_fuse ∈ R^(C×N), the two-dimensional matrix space of dimension C×N:
V_fuse = sum(softmax(A) ⊙ V_map),
wherein softmax(·) denotes the normalized exponential function, which normalizes the fusion coefficient matrix A along the third dimension, ⊙ denotes dot multiplication, and sum(·) denotes merging the features of the N cameras along the third dimension.
5. The single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion according to claim 1, characterized in that the temporal self-attention transformer network in the third step comprises an encoding module and two layers of feature fusion modules.
6. The single-person three-dimensional pose estimation method according to claim 4, characterized in that the temporal self-attention transformer network in the third step is implemented by the following steps:
step 301: in the temporal feature fusion process the camera dimension N is omitted; the encoding module of the temporal self-attention transformer network first encodes V_fuse through a sixth fully-connected layer FC_6: Z = FC_6(V_fuse), wherein Z is the feature after feature encoding;
step 302: the encoding module then constructs the sequence position encoding P_emb from cos and sin functions and applies position encoding to Z as follows: Z_0 = Z + P_emb, wherein Z_0 is the position-encoded feature;
step 303: the temporal self-attention transformer network contains two layers of feature fusion modules based on the self-attention mechanism; with m ∈ {1, 2} indexing the layer of the feature fusion module, the m-th layer fuses features as follows:
Q_m = FC_7^m(Z_{m-1}), K_m = FC_8^m(Z_{m-1}), V_m = FC_9^m(Z_{m-1}),
Z_m = FFN(softmax(Q_m K_m^T) V_m),
wherein Q_m, K_m and V_m are the query vector, key vector and value vector obtained by mapping the temporal features Z_{m-1} of the (m-1)-th layer through the seventh, eighth and ninth fully-connected layers FC_7^m, FC_8^m and FC_9^m, respectively, FFN is a multilayer perceptron with residual connection, Z_m is the temporal fusion feature of the m-th layer, and the input feature of the first fusion layer is Z_0;
step 304: finally, a tenth fully-connected layer FC_10 regresses the 3D pose P_3D of the intermediate frame of each video sequence under the N cameras, wherein P_3D ∈ R^(3J×N), the two-dimensional matrix space of dimension 3J×N: P_3D = FC_10(Z_2).
CN202111649445.9A 2021-12-30 2021-12-30 Single three-dimensional attitude estimation method based on self-adaptive multi-view and time sequence feature fusion Pending CN114511629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111649445.9A CN114511629A (en) 2021-12-30 2021-12-30 Single three-dimensional attitude estimation method based on self-adaptive multi-view and time sequence feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111649445.9A CN114511629A (en) 2021-12-30 2021-12-30 Single three-dimensional attitude estimation method based on self-adaptive multi-view and time sequence feature fusion

Publications (1)

Publication Number Publication Date
CN114511629A true CN114511629A (en) 2022-05-17

Family

ID=81547720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111649445.9A Pending CN114511629A (en) 2021-12-30 2021-12-30 Single three-dimensional attitude estimation method based on self-adaptive multi-view and time sequence feature fusion

Country Status (1)

Country Link
CN (1) CN114511629A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861930A (en) * 2022-12-13 2023-03-28 南京信息工程大学 Crowd counting network modeling method based on hierarchical difference feature aggregation
CN115861930B (en) * 2022-12-13 2024-02-06 南京信息工程大学 Crowd counting network modeling method based on hierarchical difference feature aggregation

Similar Documents

Publication Publication Date Title
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN113160375B (en) Three-dimensional reconstruction and camera pose estimation method based on multi-task learning algorithm
CN111626159B (en) Human body key point detection method based on attention residual error module and branch fusion
CN113283525B (en) Image matching method based on deep learning
CN111783582A (en) Unsupervised monocular depth estimation algorithm based on deep learning
CN114596520A (en) First visual angle video action identification method and device
Wang et al. Depth estimation of video sequences with perceptual losses
WO2024051184A1 (en) Optical flow mask-based unsupervised monocular depth estimation method
CN115484410B (en) Event camera video reconstruction method based on deep learning
CN113850900A (en) Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction
CN111210382A (en) Image processing method, image processing device, computer equipment and storage medium
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
CN115330950A (en) Three-dimensional human body reconstruction method based on time sequence context clues
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN114511629A (en) Single three-dimensional attitude estimation method based on self-adaptive multi-view and time sequence feature fusion
CN116468769A (en) Depth information estimation method based on image
CN116524121A (en) Monocular video three-dimensional human body reconstruction method, system, equipment and medium
Zhang et al. Self-Supervised Monocular Depth Estimation With Self-Perceptual Anomaly Handling
CN116740290A (en) Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
CN117011357A (en) Human body depth estimation method and system based on 3D motion flow and normal map constraint
CN112419387B (en) Unsupervised depth estimation method for solar greenhouse tomato plant image
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning
CN115115685A (en) Monocular image depth estimation algorithm based on self-attention neural network
CN110766732A (en) Robust single-camera depth map estimation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination