CN111291687B - 3D human body action standard identification method - Google Patents


Info

Publication number
CN111291687B
CN111291687B (application CN202010085665.2A)
Authority
CN
China
Prior art keywords
equal
human body
joint point
camera
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010085665.2A
Other languages
Chinese (zh)
Other versions
CN111291687A (en)
Inventor
纪刚
周萌萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Lianhe Chuangzhi Technology Co ltd
Original Assignee
Qingdao Lianhe Chuangzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Lianhe Chuangzhi Technology Co ltd filed Critical Qingdao Lianhe Chuangzhi Technology Co ltd
Priority to CN202010085665.2A priority Critical patent/CN111291687B/en
Publication of CN111291687A publication Critical patent/CN111291687A/en
Application granted granted Critical
Publication of CN111291687B publication Critical patent/CN111291687B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/64: Three-dimensional objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The invention discloses a 3D human body action standard identification method, which comprises the following steps: (1) acquire images of human body actions with several multi-view cameras, traverse the image set, and find the 2D human body joint point positions of the images under the same timestamp; (2) from the 2D human body joint point position sets obtained under all the multi-view cameras, fuse the 2D joint point information across cameras to obtain fused 2D joint point positions in every camera pixel coordinate system; (3) from the fused 2D joint point positions in all camera pixel coordinate systems, calculate the 3D human body joint point positions and the edge information formed by the joint points, i.e. the 3D human body posture; (4) perform action standard identification based on the finally calculated 3D human body posture information. The disclosed method accurately judges and recognizes complex human body postures of target bodies at different angles, and runs in real time.

Description

3D human body action standard identification method
Technical Field
The invention relates to a 3D human body action standard identification method.
Background
At present, there are many techniques that extract human body joint points with a pose estimation algorithm and then judge whether a human action is standard. They include:
(1) 2D pose estimation based on a monocular camera, divided into bottom-up and top-down methods. The bottom-up approach first detects joint points and then decides which joint points belong to the same target body; it is real-time and responds quickly to a single image, but sacrifices detection precision. The top-down approach first detects the target body and then extracts the joint points within it; it has higher detection precision but insufficient real-time performance.
Both approaches share a common defect: when joint points lie on the side of the target body, misjudgments and wrong joint connections occur, causing errors in the action standard score; the extraction accuracy is limited by the angle of the human body, so these methods cannot be put into practical use.
(2) 3D pose estimation based on a monocular camera, which first virtualizes a relative 3D grid space and then infers 3D joint points for the input 2D image in the grid coordinates. Although this alleviates the shortcomings of 2D human pose estimation, it still cannot meet practical requirements: it depends on the precision of the 2D pose, and the resulting human action can only be judged in a relative coordinate system.
(3) 3D pose estimation based on multi-view cameras, which also depends on the 2D pose estimation precision; for complex and occluded poses its improvement over the 2D pose precision is still limited.
Therefore, the existing methods that judge whether a human action is standard from human pose estimation all have certain defects.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a 3D human body action standard identification method that accurately identifies complex human body postures of target bodies at different angles and runs in real time.
In order to achieve the purpose, the technical scheme of the invention is as follows:
A 3D human body action standard identification method comprises the following steps:
(1) Acquiring images of human body actions by adopting a plurality of multi-view cameras, traversing an image set, and finding 2D human body joint point positions of the images under the same timestamp;
(2) According to the obtained 2D human body joint point position set under all the multi-view cameras, carrying out information fusion on the 2D joint points of the multi-view cameras to obtain fused 2D joint point positions under all the camera pixel coordinate systems;
(3) According to the obtained fused 2D joint point positions under all the camera pixel coordinate systems, calculating side information formed by the 3D human body joint point positions and the joint points, namely 3D human body postures;
(4) Performing action standard type identification according to the finally calculated 3D human body posture information;
the step (1) is specifically as follows:
Given a camera set Camera = {C_1, C_2, ..., C_i, ..., C_a}, 1 ≤ i ≤ a, where a is the number of multi-view cameras and a ≥ 2; the image data collected by the multi-view cameras is Ig = {I_1, I_2, ..., I_i, ..., I_a}, and I_i(x, y, c) is the image sample collected by camera C_i at the same timestamp, with 0 ≤ x ≤ W-1 (W is the image width), 0 ≤ y ≤ H-1 (H is the image height) and c the channel index of the input image, 0 ≤ c ≤ 2. Traverse the image set Ig and find the 2D human body joint point positions of the images I_i(x, y, c) under the same timestamp; the specific steps are as follows:
(i) Feed image I_i(x, y, c) through the high-low resolution fusion network and solve the feature response matrices of its high-resolution and low-resolution branches.
Define the high-resolution sub-network feature response matrices as F_hr = {F_hr^(1), ..., F_hr^(i'), ..., F_hr^(N)}, 1 ≤ i' ≤ N, where N is the number of feature layers of the high-resolution sub-network and F_hr^(i') is the feature response submatrix of its i'-th layer, with 0 ≤ x_i' ≤ W'-1 (W' = W, the high-resolution feature matrix width), 0 ≤ y_i' ≤ H'-1 (H' = H, the high-resolution feature matrix height) and c_hr^(i') the channel count of the feature matrix.
Define the low-resolution sub-network feature response matrices as F_l1 = {F_l1^(i'')} and F_l2 = {F_l2^(i''')}, where l1 and l2 denote the two low-resolution sub-network structures, 3 ≤ i'' ≤ N, 7 ≤ i''' ≤ N, and F_l1^(i''), F_l2^(i''') are the feature response submatrices of the two low-resolution sub-networks:
0 ≤ x_i'' ≤ W''-1, W'' = W/2, the first low-resolution sub-network feature matrix width;
0 ≤ y_i'' ≤ H''-1, H'' = H/2, the first low-resolution sub-network feature matrix height;
0 ≤ x_i''' ≤ W'''-1, W''' = W/4, the second low-resolution sub-network feature matrix width;
0 ≤ y_i''' ≤ H'''-1, H''' = H/4, the second low-resolution sub-network feature matrix height;
c_l1^(i'') and c_l2^(i''') are the channel counts of the two low-resolution sub-networks.
When i'' and i''' are even, the two low-resolution submatrices F_l1^(i'') and F_l2^(i''') are fused with the high-resolution sub-network through a deconvolution operation; in the fusion formulas (1) and (2), T_l1 is the transformation matrix for the deconvolution from channel c_l1^(i'') to channel c_hr^(i'), and T_l2 is the transformation matrix for the deconvolution from channel c_l2^(i''') to channel c_hr^(i').
The recursion formula (3) then updates the feature response submatrix of the high-resolution sub-network by adding the deconvolved low-resolution responses to it.
(ii) From the feature response submatrix F_hr^(N) of the high-resolution sub-network computed by formula (3), solve the output response matrix set HeatMap_i = {H_i,1, H_i,2, ..., H_i,k, ..., H_i,K}, 1 ≤ k ≤ K, with K = 17 the number of human joint points to be solved. Each pixel coordinate of image I_i(x, y, c) is then evaluated to decide whether it is the location of the k-th joint point; H_i,k(x, y) is the confidence matrix of the k-th joint point under the i-th camera, 1 ≤ k ≤ K. In formulas (4) and (5), W_k is the weight parameter for solving the k-th joint position confidence matrix, b_k is the offset, and F_hr^(N) is the feature response submatrix of the N-th layer of the fusion network for the corresponding channel.
(iii) From the H_i,k(x, y), 1 ≤ k ≤ K, computed in (ii), solve the mean square error distance MSE_loss over the obtained output response matrix set HeatMap_i (formulas (6) and (7)), where (μ_x,k, μ_y,k) is the true pixel coordinate position of joint point k and σ_x,k, σ_y,k, the variances of the target output, are both 1.5.
(iv) Use H_i,k(x, y) to solve the gradients of MSE_loss with respect to the weight parameter W_k and offset b_k of the N-th-layer high-resolution sub-network feature response and update them; the parameter update formulas take a step of size τ in the negative gradient direction, where τ is a small number, 0.1 or 0.01.
Similarly, update the weight parameters of the other feature layers of the high-resolution and low-resolution sub-networks.
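As an illustration of the kind of update described above, the following sketch performs one plain gradient step on a per-joint heatmap head. The linear form of the head, the array shapes and the name sgd_step are assumptions for the sketch, not the patent's exact formulas (which appear only as drawings).

    import numpy as np

    def sgd_step(W_k, b_k, feat, H_k, G_k, tau=0.01):
        """One gradient-descent step for joint k's heatmap head.

        Assumes a per-pixel linear head H_k = feat @ W_k + b_k, with
        feat of shape (H*W, C), W_k of shape (C,), and H_k, G_k the
        predicted and Gaussian ground-truth responses flattened to (H*W,).
        tau is the small step size (0.1 or 0.01 in the text).
        """
        err = H_k - G_k                          # per-pixel residual
        grad_W = 2.0 * feat.T @ err / err.size   # d(mean squared error)/dW_k
        grad_b = 2.0 * err.mean()                # d(mean squared error)/db_k
        return W_k - tau * grad_W, b_k - tau * grad_b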
(v) Repeat steps (i)-(iv) until MSE_loss converges or the maximum iteration number iter is reached, obtaining the final output response matrix set HeatMap_i = {H_i,1, H_i,2, ..., H_i,k, ..., H_i,K}, where H_i,k is the position confidence matrix of the k-th human joint point;
(vi) From HeatMap_i, obtain the 2D joint point positions of image I_i(x, y, c) under camera C_i, expressed as J_i = {(x_k*, y_k*), 1 ≤ k ≤ K}, where (x_k*, y_k*) is the pixel coordinate corresponding to the maximum value of the human joint point position confidence matrix H_i,k;
the set of human body 2D joint point positions under all the multi-view cameras is then expressed as J = {J_1, J_2, ..., J_i, ..., J_a};
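A small sketch of how the per-joint argmax in (vi) can be read off the confidence maps; the array shapes are assumptions.

    import numpy as np

    def joints_from_heatmaps(heatmaps):
        """Per-joint 2D positions as the argmax of each confidence map.

        `heatmaps` is assumed to have shape (K, H, W) with K = 17 joints;
        returns an integer array of (x, y) pixel coordinates, shape (K, 2).
        """
        K, H, W = heatmaps.shape
        joints = np.zeros((K, 2), dtype=np.int64)
        for k in range(K):
            y, x = np.unravel_index(np.argmax(heatmaps[k]), (H, W))
            joints[k] = (x, y)
        return joints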
The step (2) is specifically as follows:
(i) Perform an alignment operation on the multiple multi-view cameras: convert the world coordinate system to the camera coordinate system and then the camera coordinate system to the pixel coordinate system, obtaining the conversion relations between the cameras of the multi-view set Camera = {C_1, C_2, ..., C_i, ..., C_a} and thereby realizing the alignment of the multi-view cameras;
(ii) Calculate the fused 2D joint point positions of the multi-view cameras. For the aligned camera C_i, its detected joint point positions J_i are fused with the positions J_j obtained in step (1) from the other cameras through a fusion weight matrix θ, 1 ≤ j ≤ a, j ≠ i, where a is the number of multi-view cameras and θ satisfies a Gaussian distribution along the epipolar line, θ ~ N(0, 1). In the fused joint point expression (formula (10)), J_i,k denotes the position of the k-th joint point under multi-view camera C_i, θ denotes the fusion weight matrix for the k-th joint point position detected by multi-view camera C_j, and each J_j,k is first converted into pixel coordinates in camera C_i's pixel coordinate system;
the fused 2D joint point positions under all camera pixel coordinate systems are then updated accordingly (formula (11)).
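The fusion formula itself is given only as a drawing in the original, so the sketch below shows one plausible reading: the detection in camera C_i is combined with the other cameras' detections after they are mapped into C_i's pixel frame, weighted by a scalar standing in for the epipolar weight matrix θ. The normalization by the total weight and the helper reproject_to_i are illustrative assumptions.

    import numpy as np

    def fuse_joint(j_i, joints_other, reproject_to_i, theta=0.5):
        """Hedged sketch of the cross-view 2D fusion of step (2)(ii).

        j_i            : (2,) detected pixel position of one joint in camera C_i
        joints_other   : iterable of (2,) detections of the same joint in cameras C_j, j != i
        reproject_to_i : callable mapping a C_j detection into C_i's pixel frame
                         (stands in for the camera-alignment step; hypothetical helper)
        theta          : scalar weight standing in for the epipolar weight matrix
        """
        num = np.asarray(j_i, dtype=float)
        den = 1.0
        for j_other in joints_other:
            num = num + theta * np.asarray(reproject_to_i(j_other), dtype=float)
            den += theta
        return num / den   # weighted average keeps the result a pixel coordinate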
in the above scheme, the step (3) is specifically as follows:
(i) Express the 3D human body posture information as J_3D = {P_1, ..., P_k, ..., P_K} and l_3D = [l_1, l_2, ..., l_n', ..., l_K-1], where P_k denotes the world-coordinate-system position of the k-th human joint point, 1 ≤ k ≤ K, K = 17 (the 17 human joint points), l_n' denotes an edge vector formed by joint points, 1 ≤ n' ≤ K-1, and K-1 is the number of edges formed by the joint points. According to the principle of triangulation, obtain a 3D space V centred on the root node, take the side length s of V with an initial value of 2000 mm, and discretize the volume of V into an N_g × N_g × N_g coarse grid; each grid cell G(t_0, t_1, t_2) lies among the N_g^3 cells, 1 ≤ t_0, t_1, t_2 ≤ N_g, where t_0 indexes the grid depth, t_1 the grid width and t_2 the grid height, and N_g takes an initial value of 16;
(ii) Compute, according to the graph model algorithm PSM, the 3D pose estimate in the initial grid: J_3D(1), l_3D(1) = [l_1, l_2, ..., l_n', ..., l_K-1], s(1) = 2000 mm, where J_3D(1) and l_3D(1) are the results of 3D pose estimation in the initial grid (first iteration), namely the joint point position coordinates in the world coordinate system and the joint point edge vectors;
(iii) Use an iterative algorithm: for each joint point P_k, 1 ≤ k ≤ K, discretize the grid cell surrounding the joint point's current position into a 2 × 2 × 2 local grid, i.e. set N_g = 2, and repeat the PSM method of the previous step to obtain the updated 3D pose estimate J_3D(2), l_3D(2) = [l_1, l_2, ..., l_n', ..., l_K-1]. As the iterations proceed the 3D human posture is refined, the side length s of V is updated to a smaller value and the precision improves; the iteration number iter_3D is determined by the sample complexity.
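The coarse-to-fine volume used by the PSM step can be pictured with the sketch below: a cube of side s around the current estimate is discretized into N_g bins per axis, and each later pass uses a small local grid around each joint. The halving schedule and the helper name discretize_volume are assumptions; the patent's exact update of s appears only as a drawing.

    import numpy as np

    def discretize_volume(center, s, N_g):
        """Bin centres of an N_g x N_g x N_g grid over the cube of side s
        centred on `center` (a length-3 world coordinate); shape (N_g, N_g, N_g, 3)."""
        ticks = [np.linspace(c - s / 2.0, c + s / 2.0, N_g) for c in center]
        return np.stack(np.meshgrid(*ticks, indexing="ij"), axis=-1)

    # Coarse-to-fine schedule as described: start with s = 2000 mm and N_g = 16,
    # then refine each joint with a 2 x 2 x 2 local grid whose cell size shrinks
    # every iteration (halving here is an illustrative choice).
    s, N_g = 2000.0, 16
    grids = [discretize_volume((0.0, 0.0, 0.0), s, N_g)]
    for _ in range(3):                 # iter_3D is chosen from the sample complexity
        s, N_g = s / 2.0, 2
        grids.append(discretize_volume((0.0, 0.0, 0.0), s, N_g))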
In the above scheme, the step (4) is specifically as follows:
according to the finally estimated 3D human body posture information:
Figure GDA0003851265090000061
l 3D (iter_3D)=[l 1 ,l 2 ,...,l n′ ...,l K-1 ]the action standard judgment is carried out, and the specific method comprises the following steps:
(i) The standard action angle set of the human body posture is as follows:
Figure GDA0003851265090000062
Figure GDA0003851265090000063
representing any two edges l n′ ,l n″ Wherein the angle of the first and second guide rails,
Figure GDA0003851265090000064
n is more than or equal to 1 and K-1 is more than or equal to n', according to the formula:
Figure GDA0003851265090000065
solving each corner degree
Figure GDA0003851265090000066
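The angle between two joint edges follows from the usual dot-product relation; a minimal sketch:

    import numpy as np

    def edge_angle(l_a, l_b):
        """Angle in degrees between two joint-edge vectors l_a and l_b."""
        cos_ab = np.dot(l_a, l_b) / (np.linalg.norm(l_a) * np.linalg.norm(l_b))
        return np.degrees(np.arccos(np.clip(cos_ab, -1.0, 1.0)))

    # e.g. the angle formed by an upper-arm edge and a forearm edge
    # edge_angle(np.array([0.0, 1.0, 0.0]), np.array([1.0, 1.0, 0.0]))  # ~45 degrees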
(ii) Establish a Gaussian mixture model G over the standard action angles: let g_n',n'' denote the Gaussian distribution with mean α_n',n'', 1 ≤ n', n'' ≤ K-1; the standardized angle then obeys a standard normal distribution, and with 95% confidence the quantiles Y_0.025 and Y_0.975 of the standard normal distribution table are used to obtain the corresponding interval value for each angle.
(iii) Using the Gaussian mixture model G as the reference, judge the posture action of the human body 3D posture to be evaluated and score its measured angles α'_n',n'', obtaining the final action standard score and total score set.
Specifically, judge whether each measured angle α'_n',n'' satisfies the corresponding distribution g_n',n''; if it does, the action is qualified, and the decomposed action standard score under that distribution is then calculated according to the standard normal distribution table, 1 ≤ n', n'' ≤ K-1, with the weights ω usually taken as 1;
Score is the current action standard score, accumulated from the decomposed action standard scores.
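The scoring formula is given only as a drawing in the original, so the sketch below is an illustrative reading: an observed angle passes if it lies inside the 95% interval of the Gaussian fitted to the standard action, and the decomposed score measures how central it is. The mapping to [0, 1] and the function names are assumptions.

    import numpy as np
    from math import erf, sqrt

    def angle_score(angle, mu, sigma, z=1.96):
        """Check one decomposed action angle against its reference Gaussian
        N(mu, sigma^2); returns (qualified, score) where `qualified` is the
        95%-interval test and `score` falls from 1 at the mean towards 0."""
        d = abs(angle - mu) / sigma
        qualified = d <= z
        score = 1.0 - erf(d / sqrt(2.0))   # two-sided tail mass removed from 1
        return qualified, score

    def total_score(angles, mus, sigmas, weights=None):
        """Weighted sum of the decomposed scores; weights default to 1 as in the text."""
        weights = np.ones(len(angles)) if weights is None else np.asarray(weights)
        scores = np.array([angle_score(a, m, s)[1] for a, m, s in zip(angles, mus, sigmas)])
        return float(weights @ scores), scores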
Through the technical scheme, the 3D human body action standard identification method provided by the invention has the following beneficial effects:
(1) Accurately judging and recognizing complex human body postures of target bodies at different angles;
(2) The multi-view cameras are calibrated with respect to a common world coordinate system; accurately fusing the features from different viewing angles in the aligned coordinate systems improves the 2D pose precision.
(3) The method runs in real time and can run on ordinary hardware resources.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below.
FIG. 1 is a diagram of a joint point of a human body according to an embodiment of the present invention;
FIG. 2 is a diagram of the interconversion between the world coordinate system and the camera coordinate system;
FIG. 3 is a diagram illustrating a transformation relationship between a camera coordinate system and a pixel coordinate system.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
The invention provides a 3D human body action standard identification method, which comprises the following steps:
(1) Acquiring images of human body actions by adopting a plurality of multi-view cameras, traversing an image set, and finding 2D human body joint point positions of the images under the same timestamp;
Given a camera set Camera = {C_1, C_2, ..., C_i, ..., C_a}, 1 ≤ i ≤ a, where a is the number of multi-view cameras and a ≥ 2; the image data collected by the multi-view cameras is Ig = {I_1, I_2, ..., I_i, ..., I_a}, and I_i(x, y, c) is the image sample collected by camera C_i at the same timestamp, with 0 ≤ x ≤ W-1 (W is the image width), 0 ≤ y ≤ H-1 (H is the image height) and c the channel index of the input image, 0 ≤ c ≤ 2. Traverse the image set Ig and find the 2D human body joint point positions of the images I_i(x, y, c) under the same timestamp; the specific steps are as follows:
(i) Feed image I_i(x, y, c) through the high-low resolution fusion network and solve the feature response matrices of its high-resolution and low-resolution branches.
Define the high-resolution sub-network feature response matrices as F_hr = {F_hr^(1), ..., F_hr^(i'), ..., F_hr^(N)}, 1 ≤ i' ≤ N, where N is the number of feature layers of the high-resolution sub-network and F_hr^(i') is the feature response submatrix of its i'-th layer, with 0 ≤ x_i' ≤ W'-1 (W' = W, the high-resolution feature matrix width), 0 ≤ y_i' ≤ H'-1 (H' = H, the high-resolution feature matrix height) and c_hr^(i') the channel count of the feature matrix.
Define the low-resolution sub-network feature response matrices as F_l1 = {F_l1^(i'')} and F_l2 = {F_l2^(i''')}, where l1 and l2 denote the two low-resolution sub-network structures, 3 ≤ i'' ≤ N, 7 ≤ i''' ≤ N, and F_l1^(i''), F_l2^(i''') are the feature response submatrices of the two low-resolution sub-networks:
0 ≤ x_i'' ≤ W''-1, W'' = W/2, the first low-resolution sub-network feature matrix width;
0 ≤ y_i'' ≤ H''-1, H'' = H/2, the first low-resolution sub-network feature matrix height;
0 ≤ x_i''' ≤ W'''-1, W''' = W/4, the second low-resolution sub-network feature matrix width;
0 ≤ y_i''' ≤ H'''-1, H''' = H/4, the second low-resolution sub-network feature matrix height;
c_l1^(i'') and c_l2^(i''') are the channel counts of the two low-resolution sub-networks.
When i'' and i''' are even, the two low-resolution submatrices F_l1^(i'') and F_l2^(i''') are fused with the high-resolution sub-network through a deconvolution operation; in the fusion formulas (1) and (2), T_l1 is the transformation matrix for the deconvolution from channel c_l1^(i'') to channel c_hr^(i'), and T_l2 is the transformation matrix for the deconvolution from channel c_l2^(i''') to channel c_hr^(i').
The recursion formula (3) then updates the feature response submatrix of the high-resolution sub-network by adding the deconvolved low-resolution responses to it.
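As an illustration of the multi-resolution fusion described above, the sketch below upsamples the 1/2- and 1/4-resolution feature maps with transposed convolutions and adds them to the high-resolution branch; the channel counts and module names are assumptions, not the patent's.

    import torch
    import torch.nn as nn

    class FuseToHighRes(nn.Module):
        """Deconvolution fusion of two low-resolution branches into the
        high-resolution branch (illustrative channel counts)."""
        def __init__(self, c_hr=32, c_half=64, c_quarter=128):
            super().__init__()
            # one 2x transposed convolution for the 1/2-resolution branch
            self.up_half = nn.ConvTranspose2d(c_half, c_hr, kernel_size=4, stride=2, padding=1)
            # two chained 2x transposed convolutions (4x in total) for the 1/4-resolution branch
            self.up_quarter = nn.Sequential(
                nn.ConvTranspose2d(c_quarter, c_hr, kernel_size=4, stride=2, padding=1),
                nn.ConvTranspose2d(c_hr, c_hr, kernel_size=4, stride=2, padding=1),
            )

        def forward(self, f_hr, f_half, f_quarter):
            # add the upsampled low-resolution responses to the high-resolution response
            return f_hr + self.up_half(f_half) + self.up_quarter(f_quarter)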
(ii) From the feature response submatrix F_hr^(N) of the high-resolution sub-network computed by formula (3), solve the output response matrix set HeatMap_i = {H_i,1, H_i,2, ..., H_i,k, ..., H_i,K}, 1 ≤ k ≤ K, with K = 17 the number of human joint points to be solved, as shown in FIG. 1. Each pixel coordinate of image I_i(x, y, c) is then evaluated to decide whether it is the location of the k-th joint point; H_i,k(x, y) is the confidence matrix of the k-th joint point under the i-th camera, 1 ≤ k ≤ K. In formulas (4) and (5), W_k is the weight parameter for solving the k-th joint position confidence matrix, b_k is the offset, and F_hr^(N) is the feature response submatrix of the N-th layer of the fusion network for the corresponding channel.
(iii) From the H_i,k(x, y), 1 ≤ k ≤ K, computed in (ii), solve the mean square error distance MSE_loss over the obtained output response matrix set HeatMap_i (formulas (6) and (7)), where (μ_x,k, μ_y,k) is the true pixel coordinate position of joint point k and σ_x,k, σ_y,k, the variances of the target output, are both 1.5.
(iv) Use H_i,k(x, y) to solve the gradients of MSE_loss with respect to the weight parameter W_k and offset b_k of the N-th-layer high-resolution sub-network feature response and update them; the parameter update formulas take a step of size τ in the negative gradient direction, where τ is a small number, 0.1 or 0.01.
Similarly, update the weight parameters of the other feature layers of the high-resolution and low-resolution sub-networks.
(v) Repeat steps (i)-(iv) until MSE_loss converges or the maximum iteration number iter is reached, obtaining the final output response matrix set HeatMap_i = {H_i,1, H_i,2, ..., H_i,k, ..., H_i,K}, where H_i,k is the position confidence matrix of the k-th human joint point;
(vi) From HeatMap_i, obtain the 2D joint point positions of image I_i(x, y, c) under camera C_i, expressed as J_i = {(x_k*, y_k*), 1 ≤ k ≤ K}, where (x_k*, y_k*) is the pixel coordinate corresponding to the maximum value of the human joint point position confidence matrix H_i,k;
the set of human body 2D joint point positions under all the multi-view cameras is then expressed as J = {J_1, J_2, ..., J_i, ..., J_a}.
(2) According to the obtained 2D human body joint point position set under all the multi-view cameras, carrying out information fusion on the 2D joint points of the multi-view cameras to obtain fused 2D joint point positions under all the camera pixel coordinate systems;
(i) Performing an alignment operation on a plurality of multi-view cameras:
the alignment operation of the multi-view Camera is performed according to the paper a Flexible New Technique for Camera Calibration, as shown in fig. 2, R is an orthogonal identity matrix of 3 × 3, t is a translation vector, and R, t is an external parameter of the Camera, which is used to represent a distance between a world coordinate system and a Camera coordinate system.
As shown in FIG. 3, the plane π is referred to as the image plane of the camera, point O c Called the center of the camera (optical center), f is the focal length of the camera, and O c Making a ray perpendicular to the image plane for the end point, intersecting the image plane with point p, then ray O c p is called the optical axis (principal axis), the point p is the principal point of the camera, and there are,
Figure GDA0003851265090000101
(x c ,y c ,z c ) Representing the camera coordinate system coordinates.
Obtaining world coordinate system coordinate P according to FIG. 2 and FIG. 3 w (x w ,y w ,z w ) Conversion to pixel coordinate system coordinates P 1 The transformation process of (u, v) is as follows:
Figure GDA0003851265090000102
Figure GDA0003851265090000104
accordingly, the multi-view Camera Camera = { C 1 ,C 2 ,...,C i ,...,C a And (5) converting the relation between the cameras, thereby realizing the alignment operation of the multi-view camera.
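The world-to-pixel chain described above is the standard pinhole projection from Zhang's calibration method; a minimal sketch (how the focal length and principal point enter is written out explicitly and is an assumption about the parameterization):

    import numpy as np

    def world_to_pixel(P_w, R, t, fx, fy, cx, cy):
        """Project a world point into pixel coordinates: camera coords
        P_c = R @ P_w + t, then perspective division and the intrinsics."""
        P_c = R @ np.asarray(P_w, dtype=float) + np.asarray(t, dtype=float)
        x, y = P_c[0] / P_c[2], P_c[1] / P_c[2]      # normalised image coordinates
        return np.array([fx * x + cx, fy * y + cy])  # pixel coordinates (u, v)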
(ii) Calculate the fused 2D joint point positions of the multi-view cameras. For the aligned camera C_i, its detected joint point positions J_i are fused with the positions J_j obtained in step (1) from the other cameras through a fusion weight matrix θ, 1 ≤ j ≤ a, j ≠ i, where a is the number of multi-view cameras and θ satisfies a Gaussian distribution along the epipolar line, θ ~ N(0, 1). In the fused joint point expression (formula (10)), J_i,k denotes the position of the k-th joint point under multi-view camera C_i, θ denotes the fusion weight matrix for the k-th joint point position detected by multi-view camera C_j, and each J_j,k is first converted into pixel coordinates in camera C_i's pixel coordinate system;
the fused 2D joint point positions under all camera pixel coordinate systems are then updated accordingly (formula (11)).
(3) According to the obtained fused 2D joint point positions under all the camera pixel coordinate systems, calculating side information formed by the 3D human body joint point positions and the joint points, namely 3D human body postures;
(i) Express the 3D human body posture information as J_3D = {P_1, ..., P_k, ..., P_K} and l_3D = [l_1, l_2, ..., l_n', ..., l_K-1], where P_k denotes the world-coordinate-system position of the k-th human joint point, 1 ≤ k ≤ K, K = 17 (the 17 human joint points), l_n' denotes an edge vector formed by joint points, 1 ≤ n' ≤ K-1, and K-1 is the number of edges formed by the joint points. According to the principle of triangulation, obtain a 3D space V centred on the root node, take the side length s of V with an initial value of 2000 mm, and discretize the volume of V into an N_g × N_g × N_g coarse grid; each grid cell G(t_0, t_1, t_2) lies among the N_g^3 cells, 1 ≤ t_0, t_1, t_2 ≤ N_g, where t_0 indexes the grid depth, t_1 the grid width and t_2 the grid height, and N_g takes an initial value of 16;
(ii) Compute, according to the graph model algorithm PSM, the 3D pose estimate in the initial grid: J_3D(1), l_3D(1) = [l_1, l_2, ..., l_n', ..., l_K-1], s(1) = 2000 mm, where J_3D(1) and l_3D(1) are the results of 3D pose estimation in the initial grid (first iteration), namely the joint point position coordinates in the world coordinate system and the joint point edge vectors;
(iii) Use an iterative algorithm: for each joint point P_k, 1 ≤ k ≤ K, discretize the grid cell surrounding the joint point's current position into a 2 × 2 × 2 local grid, i.e. set N_g = 2, and repeat the PSM method of the previous step to obtain the updated 3D pose estimate J_3D(2), l_3D(2) = [l_1, l_2, ..., l_n', ..., l_K-1]. As the iterations proceed the 3D human posture is refined, the side length s of V is updated to a smaller value and the precision improves; the iteration number iter_3D is determined by the sample complexity.
(4) And performing action standard type identification according to the finally calculated 3D human body posture information.
According to the finally estimated 3D human body posture information J_3D(iter_3D) and l_3D(iter_3D) = [l_1, l_2, ..., l_n', ..., l_K-1], the action standard judgment is carried out; the specific method is:
(i) The standard action angle set of the human body posture is A = {α_n',n''}, where α_n',n'' denotes the angle between any two edges l_n' and l_n'', 1 ≤ n', n'' ≤ K-1, n' ≠ n''; each angle α_n',n'' is solved according to the angle formula for the two edge vectors (the cosine of the angle is the normalized dot product of l_n' and l_n'').
(ii) Establish a Gaussian mixture model G over the standard action angles: let g_n',n'' denote the Gaussian distribution with mean α_n',n'', 1 ≤ n', n'' ≤ K-1; the standardized angle then obeys a standard normal distribution, and with 95% confidence the quantiles Y_0.025 and Y_0.975 of the standard normal distribution table are used to obtain the corresponding interval value for each angle.
(iii) Using the Gaussian mixture model G as the reference, judge the posture action of the human body 3D posture to be evaluated and score its measured angles α'_n',n'', obtaining the final action standard score and total score set.
Specifically, judge whether each measured angle α'_n',n'' satisfies the corresponding distribution g_n',n''; if it does, the action is qualified, and the decomposed action standard score under that distribution is then calculated according to the standard normal distribution table, 1 ≤ n', n'' ≤ K-1, with the weights ω usually taken as 1;
Score is the current action standard score, accumulated from the decomposed action standard scores.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (3)

1. A 3D human body action standard identification method, characterized by comprising the following steps:
(1) Acquiring images of human body actions by adopting a plurality of multi-view cameras, traversing an image set, and finding 2D human body joint point positions of the images under the same timestamp;
(2) According to the obtained 2D human body joint point position set under all the multi-view cameras, carrying out information fusion on the 2D joint points of the multi-view cameras to obtain fused 2D joint point positions under all the camera pixel coordinate systems;
(3) According to the obtained fused 2D joint point positions under all camera pixel coordinate systems, calculating 3D human body joint point positions and side information formed by joint points, namely 3D human body postures;
(4) Performing action standard type identification according to the finally calculated 3D human body posture information;
the step (1) is specifically as follows:
camera = { C) given set of cameras 1 ,C 2 ,...,C i ,...,C a I is more than or equal to 1 and less than or equal to a, a represents the number of the multi-view cameras, and a is more than or equal to 2; the image data collected by the multi-view camera is Ig = { I = { (I) } 1 ,I 2 ,...,I i ,...,I a };I i (x, y, C) are at the same time stamp, C i The method comprises the steps that image samples collected by a camera are obtained, wherein x is more than or equal to 0 and less than or equal to W-1, W represents the image width, y is more than or equal to 0 and less than or equal to H-1, H represents the image height, c is channel information of an input image, and c is more than or equal to 0 and less than or equal to 2; traversing the image set Ig, and finding the image I under the same time stamp i The 2D human body joint point positions of (x, y, c) comprise the following specific steps:
(i) Feed image I_i(x, y, c) through the high-low resolution fusion network and solve the feature response matrices of its high-resolution and low-resolution branches.
Define the high-resolution sub-network feature response matrices as F_hr = {F_hr^(1), ..., F_hr^(i'), ..., F_hr^(N)}, 1 ≤ i' ≤ N, where N is the number of feature layers of the high-resolution sub-network and F_hr^(i') is the feature response submatrix of its i'-th layer, with 0 ≤ x_i' ≤ W'-1 (W' = W, the high-resolution feature matrix width), 0 ≤ y_i' ≤ H'-1 (H' = H, the high-resolution feature matrix height) and c_hr^(i') the channel count of the feature matrix.
Define the low-resolution sub-network feature response matrices as F_l1 = {F_l1^(i'')} and F_l2 = {F_l2^(i''')}, where l1 and l2 denote the two low-resolution sub-network structures, 3 ≤ i'' ≤ N, 7 ≤ i''' ≤ N, and F_l1^(i''), F_l2^(i''') are the feature response submatrices of the two low-resolution sub-networks:
0 ≤ x_i'' ≤ W''-1, W'' = W/2, the first low-resolution sub-network feature matrix width;
0 ≤ y_i'' ≤ H''-1, H'' = H/2, the first low-resolution sub-network feature matrix height;
0 ≤ x_i''' ≤ W'''-1, W''' = W/4, the second low-resolution sub-network feature matrix width;
0 ≤ y_i''' ≤ H'''-1, H''' = H/4, the second low-resolution sub-network feature matrix height;
c_l1^(i'') and c_l2^(i''') are the channel counts of the two low-resolution sub-networks.
When i'' and i''' are even, the two low-resolution submatrices F_l1^(i'') and F_l2^(i''') are fused with the high-resolution sub-network through a deconvolution operation; in the fusion formulas (1) and (2), T_l1 is the transformation matrix for the deconvolution from channel c_l1^(i'') to channel c_hr^(i'), and T_l2 is the transformation matrix for the deconvolution from channel c_l2^(i''') to channel c_hr^(i').
The recursion formula (3) then updates the feature response submatrix of the high-resolution sub-network by adding the deconvolved low-resolution responses to it.
(ii) From the feature response submatrix F_hr^(N) of the high-resolution sub-network computed by formula (3), solve the output response matrix set HeatMap_i = {H_i,1, H_i,2, ..., H_i,k, ..., H_i,K}, 1 ≤ k ≤ K, with K = 17 the number of human joint points to be solved. Each pixel coordinate of image I_i(x, y, c) is then evaluated to decide whether it is the location of the k-th joint point; H_i,k(x, y) is the confidence matrix of the k-th joint point under the i-th camera, 1 ≤ k ≤ K. In formulas (4) and (5), W_k is the weight parameter for solving the k-th joint position confidence matrix, b_k is the offset, and F_hr^(N) is the feature response submatrix of the N-th layer of the fusion network for the corresponding channel.
(iii) From the H_i,k(x, y), 1 ≤ k ≤ K, computed in (ii), solve the mean square error distance MSE_loss over the obtained output response matrix set HeatMap_i (formulas (6) and (7)), where (μ_x,k, μ_y,k) is the true pixel coordinate position of joint point k and σ_x,k, σ_y,k, the variances of the target output, are both 1.5.
(iv) Use H_i,k(x, y) to solve the gradients of MSE_loss with respect to the weight parameter W_k and offset b_k of the N-th-layer high-resolution sub-network feature response and update them; the parameter update formulas take a step of size τ in the negative gradient direction, where τ is a small number, 0.1 or 0.01.
Similarly, update the weight parameters of the other feature layers of the high-resolution and low-resolution sub-networks.
(v) Repeat steps (i)-(iv) until MSE_loss converges or the maximum iteration number iter is reached, obtaining the final output response matrix set HeatMap_i = {H_i,1, H_i,2, ..., H_i,k, ..., H_i,K}, where H_i,k is the position confidence matrix of the k-th human joint point;
(vi) From HeatMap_i, obtain the 2D joint point positions of image I_i(x, y, c) under camera C_i, expressed as J_i = {(x_k*, y_k*), 1 ≤ k ≤ K}, where (x_k*, y_k*) is the pixel coordinate corresponding to the maximum value of the human joint point position confidence matrix H_i,k;
the set of human body 2D joint point positions under all the multi-view cameras is then expressed as J = {J_1, J_2, ..., J_i, ..., J_a};
The step (2) is specifically as follows:
(i) Perform an alignment operation on the multiple multi-view cameras: convert the world coordinate system to the camera coordinate system and then the camera coordinate system to the pixel coordinate system, obtaining the conversion relations between the cameras of the multi-view set Camera = {C_1, C_2, ..., C_i, ..., C_a} and thereby realizing the alignment of the multi-view cameras;
(ii) Calculate the fused 2D joint point positions of the multi-view cameras. For the aligned camera C_i, its detected joint point positions J_i are fused with the positions J_j obtained in step (1) from the other cameras through a fusion weight matrix θ, 1 ≤ j ≤ a, j ≠ i, where a is the number of multi-view cameras and θ satisfies a Gaussian distribution along the epipolar line, θ ~ N(0, 1). In the fused joint point expression (formula (10)), J_i,k denotes the position of the k-th joint point under multi-view camera C_i, θ denotes the fusion weight matrix for the k-th joint point position detected by multi-view camera C_j, and each J_j,k is first converted into pixel coordinates in camera C_i's pixel coordinate system;
the fused 2D joint point positions under all camera pixel coordinate systems are then updated accordingly (formula (11)).
2. The 3D human body action standard identification method according to claim 1, wherein the step (3) is specifically as follows:
(i) Express the 3D human body posture information as J_3D = {P_1, ..., P_k, ..., P_K} and l_3D = [l_1, l_2, ..., l_n', ..., l_K-1], where P_k denotes the world-coordinate-system position of the k-th human joint point, 1 ≤ k ≤ K, K = 17 (the 17 human joint points), l_n' denotes an edge vector formed by joint points, 1 ≤ n' ≤ K-1, and K-1 is the number of edges formed by the joint points. According to the principle of triangulation, obtain a 3D space V centred on the root node, take the side length s of V with an initial value of 2000 mm, and discretize the volume of V into an N_g × N_g × N_g coarse grid; each grid cell G(t_0, t_1, t_2) lies among the N_g^3 cells, 1 ≤ t_0, t_1, t_2 ≤ N_g, where t_0 indexes the grid depth, t_1 the grid width and t_2 the grid height, and N_g takes an initial value of 16;
(ii) Compute, according to the graph model algorithm PSM, the 3D pose estimate in the initial grid: J_3D(1), l_3D(1) = [l_1, l_2, ..., l_n', ..., l_K-1], s(1) = 2000 mm, where J_3D(1) and l_3D(1) are the results of 3D pose estimation in the initial grid (first iteration), namely the joint point position coordinates in the world coordinate system and the joint point edge vectors;
(iii) Use an iterative algorithm: for each joint point P_k, 1 ≤ k ≤ K, discretize the grid cell surrounding the joint point's current position into a 2 × 2 × 2 local grid, i.e. set N_g = 2, and repeat the PSM method of the previous step to obtain the updated 3D pose estimate J_3D(2), l_3D(2) = [l_1, l_2, ..., l_n', ..., l_K-1]. As the iterations proceed the 3D human posture is refined, the side length s of V is updated to a smaller value and the precision improves; the iteration number iter_3D is determined by the sample complexity.
3. The 3D human body action standard identification method according to claim 2, wherein the step (4) is specifically as follows:
according to the finally estimated 3D human body posture information J_3D(iter_3D) and l_3D(iter_3D) = [l_1, l_2, ..., l_n', ..., l_K-1], the action standard judgment is carried out; the specific method comprises:
(i) The standard action angle set of the human body posture is A = {α_n',n''}, where α_n',n'' denotes the angle between any two edges l_n' and l_n'', 1 ≤ n', n'' ≤ K-1, n' ≠ n''; each angle α_n',n'' is solved according to the angle formula for the two edge vectors (the cosine of the angle is the normalized dot product of l_n' and l_n'').
(ii) Establish a Gaussian mixture model G over the standard action angles: let g_n',n'' denote the Gaussian distribution with mean α_n',n'', 1 ≤ n', n'' ≤ K-1; the standardized angle then obeys a standard normal distribution, and with 95% confidence the quantiles Y_0.025 and Y_0.975 of the standard normal distribution table are used to obtain the corresponding interval value for each angle.
(iii) Using the Gaussian mixture model G as the reference, judge the posture action of the human body 3D posture to be evaluated and score its measured angles α'_n',n'', obtaining the final action standard score and total score set.
Specifically, judge whether each measured angle α'_n',n'' satisfies the corresponding distribution g_n',n''; if it does, the action is qualified, and the decomposed action standard score under that distribution is then calculated according to the standard normal distribution table, 1 ≤ n', n'' ≤ K-1, with the weights ω usually taken as 1;
Score is the current action standard score, accumulated from the decomposed action standard scores.
CN202010085665.2A 2020-02-11 2020-02-11 3D human body action standard identification method Active CN111291687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010085665.2A CN111291687B (en) 2020-02-11 2020-02-11 3D human body action standard identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010085665.2A CN111291687B (en) 2020-02-11 2020-02-11 3D human body action standard identification method

Publications (2)

Publication Number Publication Date
CN111291687A CN111291687A (en) 2020-06-16
CN111291687B true CN111291687B (en) 2022-11-11

Family

ID=71025534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010085665.2A Active CN111291687B (en) 2020-02-11 2020-02-11 3D human body action standard identification method

Country Status (1)

Country Link
CN (1) CN111291687B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183506A (en) * 2020-11-30 2021-01-05 成都市谛视科技有限公司 Human body posture generation method and system
CN112435731B (en) * 2020-12-16 2024-03-19 成都翡铭科技有限公司 Method for judging whether real-time gesture meets preset rules

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101561881A (en) * 2009-05-19 2009-10-21 华中科技大学 Emotion identification method for human non-programmed motion
CN108427282A (en) * 2018-03-30 2018-08-21 华中科技大学 A kind of solution of Inverse Kinematics method based on learning from instruction
CN108549856A (en) * 2018-04-02 2018-09-18 上海理工大学 A kind of human action and road conditions recognition methods
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110464349A (en) * 2019-08-30 2019-11-19 南京邮电大学 A kind of upper extremity exercise function score method based on hidden Semi-Markov Process
CN110633005A (en) * 2019-04-02 2019-12-31 北京理工大学 Optical unmarked three-dimensional human body motion capture method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
H. Jiang. 3D Human Pose Reconstruction Using Millions of Exemplars. 2010 20th International Conference on Pattern Recognition, 2010, pp. 1674-1677. *
H. Qiu et al. Cross View Fusion for 3D Human Pose Estimation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 4341-4350. *
K. Sun et al. Deep High-Resolution Representation Learning for Human Pose Estimation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5686-5696. *

Also Published As

Publication number Publication date
CN111291687A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
US11763485B1 (en) Deep learning based robot target recognition and motion detection method, storage medium and apparatus
CN112102458B (en) Single-lens three-dimensional image reconstruction method based on laser radar point cloud data assistance
CN110108258B (en) Monocular vision odometer positioning method
CN108960211B (en) Multi-target human body posture detection method and system
CN112435325A (en) VI-SLAM and depth estimation network-based unmanned aerial vehicle scene density reconstruction method
CN106960449B (en) Heterogeneous registration method based on multi-feature constraint
CN107358629B (en) Indoor mapping and positioning method based on target identification
CN107274483A (en) A kind of object dimensional model building method
CN106709950A (en) Binocular-vision-based cross-obstacle lead positioning method of line patrol robot
CN112509044A (en) Binocular vision SLAM method based on dotted line feature fusion
CN110310305B (en) Target tracking method and device based on BSSD detection and Kalman filtering
CN113393524B (en) Target pose estimation method combining deep learning and contour point cloud reconstruction
CN111998862B (en) BNN-based dense binocular SLAM method
CN111291687B (en) 3D human body action standard identification method
US11741615B2 (en) Map segmentation method and device, motion estimation method, and device terminal
CN107862735A (en) A kind of RGBD method for reconstructing three-dimensional scene based on structural information
CN105513094A (en) Stereo vision tracking method and stereo vision tracking system based on 3D Delaunay triangulation
CN110070610A (en) The characteristic point matching method and device of characteristic point matching method, three-dimensionalreconstruction process
CN114004900A (en) Indoor binocular vision odometer method based on point-line-surface characteristics
CN115841602A (en) Construction method and device of three-dimensional attitude estimation data set based on multiple visual angles
CN111664845B (en) Traffic sign positioning and visual map making method and device and positioning system
CN114494644A (en) Binocular stereo matching-based spatial non-cooperative target pose estimation and three-dimensional reconstruction method and system
CN110060290B (en) Binocular parallax calculation method based on 3D convolutional neural network
CN116630423A (en) ORB (object oriented analysis) feature-based multi-target binocular positioning method and system for micro robot
CN110570473A (en) weight self-adaptive posture estimation method based on point-line fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant