CN114511629A - Single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion - Google Patents
Single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion
- Publication number
- CN114511629A (application CN202111649445.9A)
- Authority
- CN
- China
- Prior art keywords
- self
- camera
- feature
- dimension
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion. A robust feature extractor is designed to extract 2D pose features from video sequences captured by multiple cameras. Taking the 2D pose features as input, an adaptive view self-attention transformation network is designed along the camera dimension; through relative camera position encoding and a self-attention mechanism it fuses any number of two-dimensional poses from uncalibrated cameras, yielding multi-view fused pose features. Taking the multi-view fused pose features as input, a temporal self-attention transformation network is designed along the time dimension, which adaptively fuses multi-frame features through a self-attention mechanism to obtain the final 3D pose. The method is reasonably designed, can be directly applied to scenes with any number of uncalibrated cameras without retraining, and requires little network computation.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion.
Background
Single-person 3D human pose estimation is a hot topic in computer vision. It plays an important role in many applications such as action recognition, human body reconstruction and robot manipulation. According to the number of cameras used, single-person 3D human pose estimation can be divided into monocular 3D human pose estimation and multi-view 3D human pose estimation. A monocular estimation model uses only the visual information of the image from a single camera during feature encoding and cannot effectively resolve occlusion and depth ambiguity. A multi-view model can use multi-view geometric constraints to recover occluded joint information and the depth information lost through camera projection, and can therefore predict the three-dimensional human pose more accurately. For multi-view 3D human pose estimation, existing models mainly rely on camera parameters to provide multi-view geometric constraints and depend heavily on the multi-camera configuration, so they cannot be directly applied to single-view or few-camera scenes. In natural scenes the camera positions often change, and real-time camera calibration in a dynamic scene is impractical. In addition, existing multi-view models are computationally heavy and cannot capture temporal information with a temporal model to obtain a smooth three-dimensional pose.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a single-person three-dimensional pose estimation method that fuses adaptive multi-view and temporal features: robust pose features are extracted by a carefully designed feature extractor, and view geometric information and sequence temporal information are fused by a view self-attention transformation network and a temporal self-attention transformation network, so that the method can be better applied to fields such as action recognition, human body reconstruction and robot manipulation.
To solve the above technical problems, the invention adopts the following technical scheme:
The invention provides a single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion, which comprises the following steps:
Firstly, taking video sequences from multiple cameras as input, a robust feature extractor extracts 2D pose features;
Secondly, taking the 2D pose features as input, an adaptive view self-attention transformation network is designed along the camera dimension, which fuses any number of two-dimensional poses from uncalibrated cameras through relative camera position encoding and a self-attention mechanism to obtain multi-view fused pose features;
Thirdly, taking the multi-view fused pose features as input, a temporal self-attention transformation network is designed along the time dimension, which adaptively fuses multi-frame features through a self-attention mechanism to obtain the final 3D pose.
As a further refinement of the single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, the feature extractor in step one is as follows:
In the first step, video sequences from N cameras are given, each containing F frames; the N×F frames share the same feature extractor. Each frame I has width W and height H and contains the three color channels R, G and B, i.e. it is a three-dimensional array of dimension W×H×3; each frame contains only one person;
The feature extractor consists of a 2D pose detector and a 3D pose feature extractor. For each frame I, the 2D pose detector is first used to predict the 2D pose information P_2D and C_2D,
where the total number of joints is J, P_2D and C_2D are respectively the 2D coordinates and confidences of the J joints, p_j is the 2D coordinate of the j-th joint, and c_j is the confidence of the j-th joint;
In the second step, the J joints of P_2D and C_2D are divided into G groups according to the motion correlation of human joints:
where g ∈ {1, 2, ..., G}; the g-th subsets of P_2D and C_2D are one-dimensional arrays of dimensions 2J_g and J_g, respectively, collecting the joints whose indices belong to group g; J_g is the number of joints in group g, and p_i and c_i denote the 2D coordinates and confidence of the i-th joint;
In the third step, the 3D pose feature extractor first uses a first fully connected layer to map the 2D joint coordinates of the g-th group to a feature, a one-dimensional array of dimension C/2, where C denotes the channel dimension of the global feature obtained from the G groups of joint features;
In the fourth step, a second fully connected network takes the confidences of the g-th group as input and outputs a mapping matrix for the g-th group of joints, a two-dimensional array of dimension (C/2)×2J_g; this matrix maps the 2D coordinates of the g-th group to a feature with C/2 channels, which is used to modulate the feature obtained in the third step;
In the fifth step, for each of the G groups, the two features are added and then passed through the residual network of the g-th group, which further extracts spatial information to obtain the adjusted feature of the g-th group;
In the sixth step, the G group features are concatenated and mapped by a third fully connected layer to the global feature of the person, a one-dimensional array of dimension C, where Concat(f_1, f_2, ..., f_G) denotes the concatenation of the G groups of joint features;
The global features of the N×F frames are concatenated to obtain the features X of all frames, a three-dimensional array of dimension C×N×F.
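To make the data flow concrete, the following is a minimal PyTorch sketch of such a confidence-modulated, group-wise feature extractor. It assumes C = 256 channels, a hypothetical 17-joint skeleton split into five groups, and illustrative class names (GroupBranch, PoseFeatureExtractor); the exact layer widths, grouping and residual design of the patent are not given in the text, so this is only one plausible realization.

```python
import torch
import torch.nn as nn

class GroupBranch(nn.Module):
    """Per-group branch: confidence-modulated encoding of one joint group (a sketch)."""
    def __init__(self, num_joints: int, channels: int):
        super().__init__()
        half = channels // 2
        self.coord_fc = nn.Linear(2 * num_joints, half)                  # first FC layer (step 3)
        self.conf_fc = nn.Linear(num_joints, half * 2 * num_joints)      # second FC network (step 4)
        self.res = nn.Sequential(nn.Linear(half, half), nn.ReLU(),
                                 nn.Linear(half, half))                  # per-group residual block (step 5)
        self.half, self.num_joints = half, num_joints

    def forward(self, coords: torch.Tensor, conf: torch.Tensor) -> torch.Tensor:
        # coords: (B, 2*J_g) flattened 2D joints of this group; conf: (B, J_g) confidences
        feat = self.coord_fc(coords)                                      # (B, C/2)
        m = self.conf_fc(conf).view(-1, self.half, 2 * self.num_joints)  # mapping matrix (B, C/2, 2*J_g)
        mod = torch.bmm(m, coords.unsqueeze(-1)).squeeze(-1)             # confidence-driven modulation
        fused = feat + mod
        return fused + self.res(fused)                                   # residual refinement

class PoseFeatureExtractor(nn.Module):
    """Groups the J joints, encodes each group, and fuses them into a C-dim global feature."""
    def __init__(self, groups, channels: int = 256):
        super().__init__()
        self.groups = groups                                              # list of joint-index lists (G groups)
        self.branches = nn.ModuleList(GroupBranch(len(g), channels) for g in groups)
        self.out_fc = nn.Linear(len(groups) * channels // 2, channels)   # third FC layer (step 6)

    def forward(self, pose2d: torch.Tensor, conf: torch.Tensor) -> torch.Tensor:
        # pose2d: (B, J, 2) detector output; conf: (B, J) joint confidences
        feats = [branch(pose2d[:, idx].flatten(1), conf[:, idx])
                 for idx, branch in zip(self.groups, self.branches)]
        return self.out_fc(torch.cat(feats, dim=1))                      # per-frame global feature (B, C)

# Hypothetical 17-joint layout split into 5 groups (head/torso, arms, legs)
groups = [[0, 1, 2, 3, 4], [5, 6, 7], [8, 9, 10], [11, 12, 13], [14, 15, 16]]
extractor = PoseFeatureExtractor(groups, channels=256)
x = extractor(torch.randn(4, 17, 2), torch.rand(4, 17))                  # -> (4, 256)
```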
As a further refinement of the single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, the view self-attention transformation network in step two consists of a relative camera position encoder and a view self-attention fusion module.
As a further refinement of the single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, the view self-attention transformation network in step two is obtained through the following steps:
In step 201, along the camera dimension, X is obtained by concatenating N camera features, v ∈ {1, 2, ..., N}, each of which is a two-dimensional array of dimension C×F obtained by concatenating the global features of the F frames of the v-th camera; during camera feature fusion the temporal dimension F is omitted, i.e. each camera feature is simplified to a C-dimensional vector, v ∈ {1, 2, ..., N};
In step 202, the view self-attention transformation network first adaptively learns the relative position relationship between cameras through a neural network: it takes the query variable of the a-th camera and the key variable of the b-th camera as input, and outputs a mapping matrix M_ab describing the relative position relationship between the a-th and b-th cameras together with a feature fusion weighting coefficient A_ab; the query and key variables are computed from the a-th and b-th camera features, respectively, M_ab is a two-dimensional array of dimension D×D, and C = H×D;
where the query and key mappings are two neural network layers that share the same residual network, which is used to capture the relationship features between the a-th and b-th camera features; a fourth and a fifth fully connected layer then output M_ab and A_ab, respectively;
In step 203, the value feature of the b-th camera is reshaped into H local feature points of dimension D, giving a two-dimensional array of dimension D×H; M_ab is then applied to the reshaped value feature as a linear mapping, which realizes the relative camera position encoding,
and the result represents the feature of the a-th camera after the b-th camera feature has been subjected to relative camera position encoding.
Applying this to all N×N camera feature pairs yields the encoded features; the N×N encoded features are concatenated and reshaped to obtain the feature V_map, where the concatenated tensor is a four-dimensional array of dimension D×H×N×N and V_map is a three-dimensional array of dimension C×N×N;
In step 204, A = {A_ab | a ∈ {1, 2, ..., N}, b ∈ {1, 2, ..., N}} denotes the fusion coefficient matrix formed by the N×N fusion coefficients A_ab; a random masking strategy randomly sets part of the fusion coefficients in A to 0 so as to vary the number of cameras participating in the fusion;
In step 205, the randomly masked A is used to perform a weighted fusion of V_map, yielding the multi-view fused feature V_fuse, a two-dimensional array of dimension C×N:
V_fuse = sum(softmax(A) ⊙ V_map),
where softmax(·) denotes the normalized exponential function applied over the third dimension of the fusion coefficient matrix A, ⊙ denotes element-wise multiplication, and sum(·) denotes merging the features of the N cameras along the third dimension.
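A minimal PyTorch sketch of this pairwise relative-camera encoding and masked fusion is given below. The class name ViewFusion, the single scalar fusion coefficient per camera pair, and masking with -inf before the softmax (so that masked views receive near-zero weight, rather than literally zeroing coefficients as the text above states) are illustrative assumptions rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewFusion(nn.Module):
    """Adaptive multi-view fusion with learned relative-camera encoding (a sketch)."""
    def __init__(self, channels: int = 256, heads: int = 8):
        super().__init__()
        assert channels % heads == 0
        self.h, self.d = heads, channels // heads                 # C = H x D
        self.q = nn.Linear(channels, channels)                    # query / key / value projections
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        self.rel = nn.Sequential(nn.Linear(2 * channels, channels), nn.ReLU())  # shared relation net
        self.to_map = nn.Linear(channels, self.d * self.d)        # outputs M_ab (D x D)
        self.to_coef = nn.Linear(channels, 1)                     # outputs fusion coefficient A_ab

    def forward(self, x: torch.Tensor, drop_prob: float = 0.0) -> torch.Tensor:
        # x: (N, C) per-camera features (temporal dimension omitted, as in the patent text)
        n = x.size(0)
        q, k, val = self.q(x), self.k(x), self.v(x)
        rel = self.rel(torch.cat([q.unsqueeze(1).expand(n, n, -1),
                                  k.unsqueeze(0).expand(n, n, -1)], dim=-1))  # pairwise relation feats
        m = self.to_map(rel).view(n, n, self.d, self.d)           # relative-position mapping M_ab
        a = self.to_coef(rel).squeeze(-1)                         # fusion coefficients A, shape (N, N)
        v_pts = val.view(n, self.h, self.d)                       # split value feature into H local points
        # apply M_ab to the value points of camera b, for every pair (a, b) -> V_map: (N, N, C)
        v_map = torch.einsum('abij,bhj->abhi', m, v_pts).reshape(n, n, -1)
        if drop_prob > 0:                                         # random view masking (training only);
            mask = torch.rand(n, n) < drop_prob                   # a modest drop_prob is assumed so that
            a = a.masked_fill(mask, float('-inf'))                # at least one view per row survives
        w = F.softmax(a, dim=1)                                   # normalise over the source-camera axis
        return (w.unsqueeze(-1) * v_map).sum(dim=1)               # V_fuse: (N, C)

fusion = ViewFusion(channels=256, heads=8)
v_fuse = fusion(torch.randn(4, 256), drop_prob=0.25)              # 4 uncalibrated cameras -> (4, 256)
```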
As a further refinement of the single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, the temporal self-attention transformation network in step three consists of an encoding module and two layers of feature fusion modules.
As a further refinement of the single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion, the temporal self-attention transformation network in step three is implemented through the following steps:
In step 301, during temporal feature fusion the camera dimension N is omitted; the encoding module of the temporal self-attention transformation network first encodes V_fuse through a sixth fully connected layer,
where Z denotes the feature after feature encoding;
In step 302, the encoding module then constructs sequence position codes P_emb using cos and sin functions and position-encodes Z as follows:
Z_0 = Z + P_emb,
In step 303, the temporal self-attention transformation network contains two layers of feature fusion modules based on the self-attention mechanism; with m ∈ {1, 2} indexing the fusion layers, the m-th layer feature fusion proceeds as follows:
where the query vector, key vector and value vector are obtained by mapping the temporal feature of layer m-1 through a seventh, an eighth and a ninth fully connected layer, respectively; FFN is a multilayer perceptron with residual connections; Z_m is the temporal fusion feature of the m-th layer, and the input feature of the first-layer fusion module is Z_0;
In step 304, a tenth fully connected layer finally regresses the 3D pose P_3D of the intermediate frame of each video sequence under the N cameras, where P_3D is a two-dimensional array of dimension 3J×N.
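The temporal stage can be sketched as follows, again in PyTorch. Here the two bespoke self-attention fusion layers of the patent are approximated with a standard nn.TransformerEncoder, and the channel size, joint count and frame count are assumed values for illustration only.

```python
import math
import torch
import torch.nn as nn

class TemporalPoseTransformer(nn.Module):
    """Temporal self-attention fusion and middle-frame 3D pose regression (a sketch)."""
    def __init__(self, channels: int = 256, num_joints: int = 17, heads: int = 8, layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(channels, channels)                 # feature encoding of V_fuse -> Z
        enc_layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                               dim_feedforward=2 * channels,
                                               batch_first=True)   # self-attention + residual FFN
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)  # two fusion layers
        self.head = nn.Linear(channels, 3 * num_joints)            # regress 3J coords of the middle frame

    @staticmethod
    def positional_encoding(frames: int, channels: int) -> torch.Tensor:
        # classic sin/cos sequence position codes P_emb, shape (frames, channels)
        pos = torch.arange(frames, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, channels, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / channels))
        pe = torch.zeros(frames, channels)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, v_fuse: torch.Tensor) -> torch.Tensor:
        # v_fuse: (N, F, C) fused per-frame features; camera dimension kept as the batch axis
        n, f, c = v_fuse.shape
        z0 = self.embed(v_fuse) + self.positional_encoding(f, c)   # Z_0 = Z + P_emb
        z = self.encoder(z0)                                        # adaptive temporal fusion
        return self.head(z[:, f // 2])                              # 3D pose of the middle frame, (N, 3J)

model = TemporalPoseTransformer(channels=256, num_joints=17)
p3d = model(torch.randn(4, 9, 256))                                 # 4 cameras x 9 frames -> (4, 51)
```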
Compared with the prior art, the invention adopting the above technical scheme has the following technical effects:
The feature extractor designed by the invention can effectively use the joint confidences to reduce the noise of the 2D pose detector and thereby extract robust features; the designed view self-attention transformation network can adaptively fuse features from any number of uncalibrated cameras, so it can be directly applied to any camera configuration without retraining; the designed temporal self-attention transformation network can adaptively fuse single-frame and multi-frame temporal features, so it can be directly applied to static scenes and dynamic video scenes without retraining;
(1) the feature extractor modulates the 2D pose features with joint confidences, reducing the influence of 2D pose noise on 3D pose estimation and extracting robust pose features;
(2) the multi-view features are adaptively fused by a view self-attention transformation network, which can be directly applied to any camera configuration (number of cameras and camera parameters) without camera calibration or retraining;
(3) the temporal features are adaptively fused by a temporal self-attention transformation network, which can be directly applied to single-frame and multi-frame scenes without retraining.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a block diagram of a feature extractor of the present invention;
FIG. 3 is a block diagram of the attention mechanism in the view self-attention transformation network of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
A single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion comprises the following steps:
Firstly, taking video sequences from multiple cameras as input, a robust feature extractor extracts 2D pose features;
Secondly, taking the 2D pose features as input, an adaptive view self-attention transformation network is designed along the camera dimension, which fuses any number of two-dimensional poses from uncalibrated cameras through relative camera position encoding and a self-attention mechanism to obtain multi-view fused pose features;
Thirdly, taking the multi-view fused pose features as input, a temporal self-attention transformation network is designed along the time dimension, which adaptively fuses multi-frame features through a self-attention mechanism to obtain the final 3D pose.
Preferably, the feature extractor in step one uses a pre-trained 2D detector to obtain joint pose information as input, and uses fully connected networks to extract robust features.
Preferably, the feature extractor in step one comprises the following steps:
As shown in FIG. 1, video sequences from N cameras are given, each containing F frames; the N×F frames share the same feature extractor. Each frame I has width W and height H and contains the three color channels R, G and B, i.e. it is a three-dimensional array of dimension W×H×3; each frame contains only one person;
The feature extractor consists of a 2D pose detector and a 3D pose feature extractor. For each frame I, the 2D pose detector is first used to predict the 2D pose information P_2D and C_2D,
where the total number of joints is J, P_2D and C_2D are respectively the 2D coordinates and confidences of the J joints, p_j is the 2D coordinate of the j-th joint, and c_j is the confidence of the j-th joint;
In the second step, as shown in FIG. 2, the J joints of P_2D and C_2D are divided into G groups according to the motion correlation of human joints:
where g ∈ {1, 2, ..., G}; the g-th subsets of P_2D and C_2D are one-dimensional arrays of dimensions 2J_g and J_g, respectively, collecting the joints whose indices belong to group g; J_g is the number of joints in group g, and p_i and c_i denote the 2D coordinates and confidence of the i-th joint; in the present invention, the joints of the human body are divided according to their motion correlation into 5 groups, namely the head, the left/right hands and the left/right legs, i.e. G = 5;
In the third step, the 3D pose feature extractor first uses a first fully connected layer to map the 2D joint coordinates of the g-th group to a feature, a one-dimensional array of dimension C/2, where C denotes the channel dimension of the global feature obtained from the G groups of joint features;
In the fourth step, a second fully connected network takes the confidences of the g-th group as input and outputs a mapping matrix for the g-th group of joints, a two-dimensional array of dimension (C/2)×2J_g; this matrix maps the 2D coordinates of the g-th group to a feature with C/2 channels, which is used to modulate the feature obtained in the third step;
In the fifth step, for each of the G groups, the two features are added and then passed through the residual network of the g-th group, which further extracts spatial information to obtain the adjusted feature of the g-th group;
In the sixth step, the G group features are concatenated and mapped by a third fully connected layer to the global feature of the person, a one-dimensional array of dimension C, where Concat(f_1, f_2, ..., f_G) denotes the concatenation of the G groups of joint features;
The global features of the N×F frames are concatenated to obtain the features X of all frames, a three-dimensional array of dimension C×N×F.
Preferably, the view self-attention transformation network in step two fuses the view features using a relative camera position encoding module and a self-attention fusion module.
Preferably, the view self-attention transformation network in step two comprises the following steps:
In the first step, along the camera dimension, X is obtained by concatenating N camera features, v ∈ {1, 2, ..., N}, each of which is a two-dimensional array of dimension C×F obtained by concatenating the global features of the F frames of the v-th camera; during camera feature fusion the temporal dimension F is omitted, i.e. each camera feature is simplified to a C-dimensional vector, v ∈ {1, 2, ..., N};
In the second step, as shown in FIG. 3, the view self-attention transformation network first adaptively learns the relative position relationship between cameras through a neural network: it takes the query variable of the a-th camera and the key variable of the b-th camera as input, and outputs a mapping matrix M_ab describing the relative position relationship between the a-th and b-th cameras together with a feature fusion weighting coefficient A_ab; the query and key variables are computed from the a-th and b-th camera features, respectively, M_ab is a two-dimensional array of dimension D×D, and C = H×D;
where the query and key mappings are two neural network layers that share the same residual network, which is used to capture the relationship features between the a-th and b-th camera features; a fourth and a fifth fully connected layer then output M_ab and A_ab, respectively;
In the third step, the value feature of the b-th camera is reshaped into H local feature points of dimension D, giving a two-dimensional array of dimension D×H; M_ab is then applied to the reshaped value feature as a linear mapping, which realizes the relative camera position encoding,
and the result represents the feature of the a-th camera after the b-th camera feature has been subjected to relative camera position encoding.
Applying this to all N×N camera feature pairs yields the encoded features; the N×N encoded features are concatenated and reshaped to obtain the feature V_map, where the concatenated tensor is a four-dimensional array of dimension D×H×N×N and V_map is a three-dimensional array of dimension C×N×N;
In the fourth step, A = {A_ab | a ∈ {1, 2, ..., N}, b ∈ {1, 2, ..., N}} denotes the fusion coefficient matrix formed by the N×N fusion coefficients A_ab; a random masking strategy randomly sets part of the fusion coefficients in A to 0 so as to vary the number of cameras participating in the fusion;
In the fifth step, the randomly masked A is used to perform a weighted fusion of V_map, yielding the multi-view fused feature V_fuse, a two-dimensional array of dimension C×N:
V_fuse = sum(softmax(A) ⊙ V_map),
where softmax(·) denotes the normalized exponential function applied over the third dimension of the fusion coefficient matrix A, ⊙ denotes element-wise multiplication, and sum(·) denotes merging the features of the N cameras along the third dimension.
Preferably, the temporal self-attention transformation network in step three consists of two temporal encoders, each of which consists of a multi-head self-attention module and a feed-forward network over time.
Preferably, the temporal self-attention transformation network in step three comprises the following steps:
In the first step, during temporal feature fusion the camera dimension N is omitted; the encoding module of the temporal self-attention transformation network first encodes V_fuse through a sixth fully connected layer,
where Z denotes the feature after feature encoding;
In the second step, the encoding module constructs sequence position codes P_emb using cos and sin functions and position-encodes Z as follows:
Z_0 = Z + P_emb,
In the third step, the temporal self-attention transformation network contains two layers of feature fusion modules based on the self-attention mechanism; with m ∈ {1, 2} indexing the fusion layers, the m-th layer feature fusion proceeds as follows:
where the query vector, key vector and value vector are obtained by mapping the temporal feature of layer m-1 through a seventh, an eighth and a ninth fully connected layer, respectively; FFN is a multilayer perceptron with residual connections; Z_m is the temporal fusion feature of the m-th layer, and the input feature of the first-layer fusion module is Z_0;
In the fourth step, a tenth fully connected layer finally regresses the 3D pose P_3D of the intermediate frame of each video sequence under the N cameras, where P_3D is a two-dimensional array of dimension 3J×N.
the feature extractor designed by the invention can effectively utilize the confidence coefficient to reduce the noise of the 2D attitude detector, thereby extracting robust noise; the designed view self-attention transformation network can adaptively fuse any number of camera features which are not calibrated, so that the view self-attention transformation network can be directly applied to any camera configuration scene without retraining; the designed time sequence self-attention transformation network can adaptively fuse single-frame and multi-frame time sequence characteristics, so that the time sequence self-attention transformation network is directly applied to static scenes and dynamic video scenes without retraining.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.
Claims (6)
1. A single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion, characterized by comprising the following steps:
firstly, taking video sequences from multiple cameras as input, a robust feature extractor extracts 2D pose features;
secondly, taking the 2D pose features as input, an adaptive view self-attention transformation network is designed along the camera dimension, which fuses any number of two-dimensional poses from uncalibrated cameras through relative camera position encoding and a self-attention mechanism to obtain multi-view fused pose features;
thirdly, taking the multi-view fused pose features as input, a temporal self-attention transformation network is designed along the time dimension, which adaptively fuses multi-frame features through a self-attention mechanism to obtain the final 3D pose.
2. The single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion according to claim 1, wherein the feature extractor in step one is as follows:
in the first step, video sequences from N cameras are given, each containing F frames; the N×F frames share the same feature extractor; each frame I has width W and height H and contains the three color channels R, G and B, i.e. it is a three-dimensional array of dimension W×H×3; each frame contains only one person;
the feature extractor consists of a 2D pose detector and a 3D pose feature extractor; for each frame I, the 2D pose detector is first used to predict the 2D pose information P_2D and C_2D,
where the total number of joints is J, P_2D and C_2D are respectively the 2D coordinates and confidences of the J joints, p_j is the 2D coordinate of the j-th joint, and c_j is the confidence of the j-th joint;
in the second step, the J joints of P_2D and C_2D are divided into G groups according to the motion correlation of human joints:
where g ∈ {1, 2, ..., G}; the g-th subsets of P_2D and C_2D are one-dimensional arrays of dimensions 2J_g and J_g, respectively, collecting the joints whose indices belong to group g; J_g is the number of joints in group g, and p_i and c_i denote the 2D coordinates and confidence of the i-th joint;
in the third step, the 3D pose feature extractor first uses a first fully connected layer to map the 2D joint coordinates of the g-th group to a feature, a one-dimensional array of dimension C/2, where C denotes the channel dimension of the global feature obtained from the G groups of joint features;
in the fourth step, a second fully connected network takes the confidences of the g-th group as input and outputs a mapping matrix for the g-th group of joints, a two-dimensional array of dimension (C/2)×2J_g; this matrix maps the 2D coordinates of the g-th group to a feature with C/2 channels, which is used to modulate the feature obtained in the third step;
in the fifth step, for each of the G groups, the two features are added and then passed through the residual network of the g-th group, which further extracts spatial information to obtain the adjusted feature of the g-th group;
in the sixth step, the G group features are concatenated and mapped by a third fully connected layer to the global feature of the person, a one-dimensional array of dimension C, where Concat(f_1, f_2, ..., f_G) denotes the concatenation of the G groups of joint features.
3. The single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion according to claim 1, wherein the view self-attention transformation network in step two consists of a relative camera position encoder and a view self-attention fusion module.
4. The single-person three-dimensional pose estimation method with adaptive multi-view and temporal feature fusion according to claim 2, wherein the view self-attention transformation network in step two is obtained through the following steps:
in step 201, along the camera dimension, X is obtained by concatenating N camera features, v ∈ {1, 2, ..., N}, each of which is a two-dimensional array of dimension C×F obtained by concatenating the global features of the F frames of the v-th camera; during camera feature fusion the temporal dimension F is omitted, i.e. each camera feature is simplified to a C-dimensional vector, v ∈ {1, 2, ..., N};
in step 202, the view self-attention transformation network first adaptively learns the relative position relationship between cameras through a neural network: it takes the query variable of the a-th camera and the key variable of the b-th camera as input, and outputs a mapping matrix M_ab describing the relative position relationship between the a-th and b-th cameras together with a feature fusion weighting coefficient A_ab; the query and key variables are computed from the a-th and b-th camera features, respectively, M_ab is a two-dimensional array of dimension D×D, and C = H×D;
where the query and key mappings are two neural network layers that share the same residual network, which is used to capture the relationship features between the a-th and b-th camera features; a fourth and a fifth fully connected layer then output M_ab and A_ab, respectively;
in step 203, the value feature of the b-th camera is reshaped into H local feature points of dimension D, giving a two-dimensional array of dimension D×H; M_ab is then applied to the reshaped value feature as a linear mapping, which realizes the relative camera position encoding,
and the result represents the feature of the a-th camera after the b-th camera feature has been subjected to relative camera position encoding;
applying this to all N×N camera feature pairs yields the encoded features; the N×N encoded features are concatenated and reshaped to obtain the feature V_map, where the concatenated tensor is a four-dimensional array of dimension D×H×N×N and V_map is a three-dimensional array of dimension C×N×N;
in step 204, A = {A_ab | a ∈ {1, 2, ..., N}, b ∈ {1, 2, ..., N}} denotes the fusion coefficient matrix formed by the N×N fusion coefficients A_ab; a random masking strategy randomly sets part of the fusion coefficients in A to 0 so as to vary the number of cameras participating in the fusion;
in step 205, the randomly masked A is used to perform a weighted fusion of V_map, yielding the multi-view fused feature V_fuse, a two-dimensional array of dimension C×N:
V_fuse = sum(softmax(A) ⊙ V_map),
where softmax(·) denotes the normalized exponential function applied over the third dimension of the fusion coefficient matrix A, ⊙ denotes element-wise multiplication, and sum(·) denotes merging the features of the N cameras along the third dimension.
5. The single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion according to claim 1, wherein the temporal self-attention transformation network in step three comprises an encoding module and two layers of feature fusion modules.
6. The single-person three-dimensional pose estimation method according to claim 4, wherein the temporal self-attention transformation network in step three is implemented through the following steps:
in step 301, during temporal feature fusion the camera dimension N is omitted; the encoding module of the temporal self-attention transformation network first encodes V_fuse through a sixth fully connected layer, producing the feature Z;
in step 302, the encoding module then constructs sequence position codes P_emb using cos and sin functions and position-encodes Z as follows:
Z_0 = Z + P_emb,
where Z_0 denotes the position-encoded feature;
in step 303, the temporal self-attention transformation network contains two layers of feature fusion modules based on the self-attention mechanism; with m ∈ {1, 2} indexing the fusion layers, the m-th layer feature fusion proceeds as follows:
where the query vector, key vector and value vector are obtained by mapping the temporal feature of layer m-1 through a seventh, an eighth and a ninth fully connected layer, respectively; FFN is a multilayer perceptron with residual connections; Z_m is the temporal fusion feature of the m-th layer, and the input feature of the first-layer fusion module is Z_0;
in step 304, a tenth fully connected layer finally regresses the 3D pose P_3D of the intermediate frame of each video sequence under the N cameras, where P_3D is a two-dimensional array of dimension 3J×N.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111649445.9A CN114511629A (en) | 2021-12-30 | 2021-12-30 | Single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111649445.9A CN114511629A (en) | 2021-12-30 | 2021-12-30 | Single-person three-dimensional pose estimation method based on adaptive multi-view and temporal feature fusion
Publications (1)
Publication Number | Publication Date |
---|---|
CN114511629A true CN114511629A (en) | 2022-05-17 |
Family
ID=81547720
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111649445.9A Pending CN114511629A (en) | 2021-12-30 | 2021-12-30 | Single three-dimensional attitude estimation method based on self-adaptive multi-view and time sequence feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114511629A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115861930A (en) * | 2022-12-13 | 2023-03-28 | 南京信息工程大学 | Crowd counting network modeling method based on hierarchical difference feature aggregation |
CN115861930B (en) * | 2022-12-13 | 2024-02-06 | 南京信息工程大学 | Crowd counting network modeling method based on hierarchical difference feature aggregation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109377530B (en) | Binocular depth estimation method based on depth neural network | |
CN111047548B (en) | Attitude transformation data processing method and device, computer equipment and storage medium | |
CN113160375B (en) | Three-dimensional reconstruction and camera pose estimation method based on multi-task learning algorithm | |
CN111626159B (en) | Human body key point detection method based on attention residual error module and branch fusion | |
CN113283525B (en) | Image matching method based on deep learning | |
CN111783582A (en) | Unsupervised monocular depth estimation algorithm based on deep learning | |
CN114596520A (en) | First visual angle video action identification method and device | |
Wang et al. | Depth estimation of video sequences with perceptual losses | |
WO2024051184A1 (en) | Optical flow mask-based unsupervised monocular depth estimation method | |
CN115484410B (en) | Event camera video reconstruction method based on deep learning | |
CN113850900A (en) | Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction | |
CN111210382A (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN116258757A (en) | Monocular image depth estimation method based on multi-scale cross attention | |
CN115330950A (en) | Three-dimensional human body reconstruction method based on time sequence context clues | |
CN115035171A (en) | Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion | |
CN114511629A (en) | Single three-dimensional attitude estimation method based on self-adaptive multi-view and time sequence feature fusion | |
CN116468769A (en) | Depth information estimation method based on image | |
CN116524121A (en) | Monocular video three-dimensional human body reconstruction method, system, equipment and medium | |
Zhang et al. | Self-Supervised Monocular Depth Estimation With Self-Perceptual Anomaly Handling | |
CN116740290A (en) | Three-dimensional interaction double-hand reconstruction method and system based on deformable attention | |
CN117011357A (en) | Human body depth estimation method and system based on 3D motion flow and normal map constraint | |
CN112419387B (en) | Unsupervised depth estimation method for solar greenhouse tomato plant image | |
CN115496859A (en) | Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning | |
CN115115685A (en) | Monocular image depth estimation algorithm based on self-attention neural network | |
CN110766732A (en) | Robust single-camera depth map estimation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |