CN115482481A

CN115482481A - Single-view three-dimensional human skeleton key point detection method, device, equipment and medium

Info

Publication number: CN115482481A
Application number: CN202210673877.1A
Authority: CN
Inventors: 张丽君; 徐卉; 周祥东; 石宇; 罗代建; 程俊
Original assignee: Chongqing Institute of Green and Intelligent Technology of CAS
Current assignee: Chongqing Institute of Green and Intelligent Technology of CAS
Priority date: 2022-06-14
Filing date: 2022-06-14
Publication date: 2022-12-16

Abstract

The invention provides a single-view three-dimensional human skeleton key point detection method, device, equipment and medium. The method comprises the following steps: acquiring a single-view human body image sequence, wherein the single-view human body image sequence comprises a target image and a related image; respectively acquiring a first skeleton key point of the target image and a second skeleton key point of the related image, performing spatial feature extraction on the first skeleton key point to obtain spatial semantic features, and performing time sequence feature extraction on the first skeleton key point and the second skeleton key point to obtain time sequence features; and fusing supervision features with the spatial semantic features and the time sequence features to obtain three-dimensional human skeleton key point feature information. In the invention, the spatial semantic features and the time sequence features are extracted to obtain plane information and depth information of the skeleton key points; fusing the supervision features with the spatial semantic features and the time sequence features effectively reduces the depth ambiguity and ill-posedness of mapping two-dimensional human skeleton key points in a single-view image to three-dimensional human skeleton key points.

Description

Single-view three-dimensional human skeleton key point detection method, device, equipment and medium

Technical Field

The invention relates to the field of computer vision, in particular to a single-view three-dimensional human skeleton key point detection method, a single-view three-dimensional human skeleton key point detection device, single-view three-dimensional human skeleton key point detection equipment and a single-view three-dimensional human skeleton key point detection medium.

Background

Three-dimensional human skeleton key point detection is one of the basic problems and research hotspots in the field of computer vision. Its task is to acquire the positions and connection information, in three-dimensional space, of the human skeleton key points in a target video frame. It is a basic technology for numerous visual tasks such as scene understanding, behavior recognition and person re-identification, and is widely applied in fields such as video surveillance, behavior recognition, online teaching, motion capture, virtual reality and medical assistance.

In three-dimensional human skeleton key point detection from a single-view image, because the image has only one view angle, depth information is lost when the scene is projected from three-dimensional space onto the two-dimensional plane, and several different three-dimensional human skeleton key points may be projected to the same two-dimensional key point. Therefore, in a single-view image, the mapping of human skeleton information from two-dimensional space to three-dimensional space is depth-ambiguous and ill-posed, and a small positioning error of a two-dimensional key point may cause a large posture distortion in three-dimensional space.

Existing single-view three-dimensional human skeleton key point detection methods mainly comprise direct estimation methods and two-dimensional to three-dimensional lifting methods.

The direct estimation method generally designs an end-to-end network to infer three-dimensional human skeleton key points directly from an input two-dimensional image, without an intermediate estimate of a two-dimensional human body representation. Although this method can acquire rich information from the image, it lacks an intermediate supervision process from two-dimensional human skeleton information to three-dimensional human skeleton information, and a large amount of three-dimensional annotated data is needed to train a model with excellent performance.

The two-dimensional to three-dimensional lifting method first uses a two-dimensional human skeleton key point detection model to estimate two-dimensional skeleton key point information, and then obtains three-dimensional human skeleton key points by lifting techniques such as fusing two-dimensional skeleton information with three-dimensional image features, or three-dimensional space reprojection. This method generally outperforms the direct estimation method and reduces the model's learning burden for two-dimensional skeleton key points. However, because it estimates three-dimensional information directly from the available two-dimensional information, it is strongly affected by the performance of the two-dimensional human skeleton key point detector, and it lacks supervision from the original image features.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a single-view three-dimensional human skeleton key point detection method, device, equipment and medium, and mainly solves the problems that, in the prior art, single-view three-dimensional human skeleton key point detection lacks intermediate supervision and lacks supervision from original image features.

In order to achieve the above and other objects, the present invention adopts the following technical solutions.

Optionally, a single-view three-dimensional human bone key point detection method is provided, including:

acquiring a single-view human body image sequence, wherein the single-view human body image sequence comprises multiple frames of images of the same target object in different postures in a preset time period, one frame of image is used as a target image, and the rest of images are used as related images of the target image;

respectively acquiring a first skeleton key point of the target image and a second skeleton key point of the related image, extracting spatial features of the first skeleton key point to obtain spatial semantic features, and extracting time sequence features of the first skeleton key point and the second skeleton key point to obtain time sequence features, wherein the spatial semantic features comprise global semantic features and local semantic features;

and fusing a supervision feature with the spatial semantic features and the time sequence features to obtain three-dimensional human skeleton key point feature information, wherein the supervision feature is obtained by extracting information from the target image through a feature extraction network, and comprises deep semantic information, texture information and edge information of the target image.

Optionally, performing spatial feature extraction on the first bone key point to obtain a spatial semantic feature, including:

constructing a global space map and a global adjacency matrix according to the first bone key points;

mining a weight matrix of the global space map through a multi-head attention mechanism, and obtaining multi-dimensional features of a plurality of receptive fields in the global space map by using a dilated convolution network;

updating the global adjacency matrix by using a graph convolution network to obtain a first matrix;

combining the weight matrix, the multi-dimensional features and the first matrix to construct a multi-head attention global space map;

performing feature representation on the multi-head attention global space map to obtain the global semantic features;

constructing a plurality of local adjacency matrixes according to the local connection relation of the first skeleton key points, and updating the plurality of local adjacency matrixes through a graph convolution network to obtain a plurality of second matrixes;

and constructing a plurality of local space graphs according to the plurality of second matrixes, and performing feature representation on the plurality of local space graphs to obtain the local semantic features.

Optionally, performing time series feature extraction on the first bone key point and the second bone key point to obtain a time series feature, including:

constructing a time sequence adjacency matrix of the first skeleton key point and the second skeleton key point, and updating the time sequence adjacency matrix through a graph convolution network to obtain a third matrix;

and constructing a time sequence diagram through the third matrix, and performing characteristic representation on the time sequence diagram to obtain the time sequence characteristics.

Optionally, the method for obtaining three-dimensional human skeleton key point feature information by fusing the supervision feature with the spatial semantic feature and the time sequence feature includes:

taking the global semantic features and the local semantic features as attention factors of the supervision features, connecting the global semantic features and the local semantic features to form first spatial structure features of a target image, and adjusting the first spatial structure features by using a spatial feature adjuster to obtain second spatial structure features;

fusing the time sequence characteristics with the supervision characteristics to obtain fused first time sequence structural characteristics, and adjusting the first time sequence structural characteristics by using a time sequence characteristic adjuster to obtain second time sequence characteristics;

and taking the second time sequence characteristic as an attention factor of the second space structure characteristic to obtain the three-dimensional human skeleton key point characteristic information.

Optionally, before acquiring the single-view human body image sequence, the method includes the steps of:

constructing an initial network model, and obtaining a single-view human body image training sample;

inputting the single-view human body image training sample into the initial network model, optimizing parameters of the initial network model according to a preset objective function to obtain a trained network model, and inputting the single-view human body image sequence into the trained network model.

Optionally, respectively acquiring a first bone key point of the target image and a second bone key point of the related image, including:

and respectively detecting the target object in the target image and the target object in the related image by using a two-dimensional bone key point detector to obtain a first bone key point of the target image and a second bone key point of the related image.

Optionally, there is provided a single-view three-dimensional human bone key point detection device, including:

an image acquisition module, used for acquiring a single-view human body image sequence, wherein the single-view human body image sequence comprises multiple frames of images of the same target object in different postures within a preset time period, one frame of image is used as a target image, and the rest of the images are used as related images of the target image;

the feature acquisition module is used for respectively acquiring a first skeleton key point of the target image and a second skeleton key point of the related image, extracting spatial features of the first skeleton key point to obtain spatial semantic features, and extracting time sequence features of the first skeleton key point and the second skeleton key point to obtain time sequence features, wherein the spatial semantic features comprise global semantic features and local semantic features;

and the feature fusion module is used for fusing a supervision feature with the space semantic feature and the time sequence feature to obtain feature information of three-dimensional human skeleton key points, wherein the supervision feature is obtained by extracting information in the target image through a feature extraction network, and comprises deep semantic information, texture information and edge information of the target image.

Optionally, computer equipment is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the single-view three-dimensional human bone key point detection method described above when executing the computer program.

Optionally, a computer readable storage medium is provided, which stores a computer program, which when executed by a processor implements the following steps of the single-view three-dimensional human bone keypoint detection method:

acquiring a single-view human body image sequence, wherein the single-view human body image sequence comprises multiple frames of images of the same target object in different postures within a preset time period, one frame of image is used as a target image, and the rest of images are used as related images of the target image;

and fusing supervision features with the spatial semantic features and the time sequence features to obtain three-dimensional human skeleton key point feature information, wherein the supervision features are obtained by extracting information from the target image through a feature extraction network, and comprise deep semantic information, texture information and edge information of the target image.

In the single-view three-dimensional human skeleton key point detection method, a single-view human body image sequence is divided into a target image and related images, and the skeleton key points in the target image and the related images are obtained; global semantic features and local semantic features of the target image are extracted, time sequence features of the single-view human body image sequence are extracted, and the supervision features, the spatial semantic features and the time sequence features are fused to obtain three-dimensional human skeleton key point feature information. The global semantic features and local semantic features of the target image map the plane information of the three-dimensional human skeleton key points at the spatial level, and the time sequence features of a plurality of consecutive images from the same view angle map their depth information. In addition, fusing the supervision features with the spatial semantic features and the time sequence features during feature fusion effectively reduces the depth ambiguity and ill-posedness of mapping two-dimensional human skeleton key points in a single-view image to three-dimensional human skeleton key points, and improves the detection accuracy.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of a single-view three-dimensional human skeleton key point detection method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of one embodiment of step S1 in FIG. 1;

FIG. 3 is a schematic flow chart of one embodiment of obtaining the global semantic features in step S2 in FIG. 1;

FIG. 4 is a schematic flow chart of one embodiment of obtaining the local semantic features in step S2 in FIG. 1;

FIG. 5 is a schematic flow chart of one embodiment of obtaining the time sequence features in step S2 in FIG. 1;

FIG. 6 is a schematic structural diagram of a single-view three-dimensional human skeleton key point detection device according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

For a human body image sequence acquired from a single view, the single-view three-dimensional human skeleton key point detection method provided by the invention uses a graph convolution network and an attention mechanism to effectively acquire and fuse the spatial semantic features and time sequence features of the human skeleton key points, and mines a feature representation with spatio-temporal consistency and strong representational power. It fuses the supervision features with the spatial semantic features and the time sequence features to reduce the depth ambiguity and ill-posedness of mapping two-dimensional human skeleton key points in a single-view image to three-dimensional human skeleton key points, and to improve the accuracy of skeleton key point detection.

Referring to fig. 1, fig. 1 is a schematic flow chart of a single-view three-dimensional human skeleton key point detection method provided by an embodiment of the present invention, including the following steps:

S1, obtaining a single-view human body image sequence, and detecting a first skeleton key point of the target image and a second skeleton key point of the related image

When detecting the single-view human skeleton key points, firstly, a single-view human body image sequence is obtained, wherein the single-view human body image sequence comprises multiple frames of images of the same target object in different postures in a preset time period, one frame of image is used as a target image, and the rest of images are used as related images of the target image. And then acquiring two-dimensional bone key points of the target image and the related image, wherein the two-dimensional bone key points of the target image are first bone key points, and the two-dimensional bone key points of the related image are second bone key points in the embodiment.

The single-view human body image sequence in the embodiment can be obtained by cutting consecutive video frames in a video, or can be obtained by acquiring images for multiple times at a single view within a preset time by a camera.

In this embodiment, let the single-view human body image sequence be $I_1, I_2, \dots, I_{t-2}, I_{t-1}, I_t, I_{t+1}, I_{t+2}, \dots, I_{N-1}, I_N$, where $I_t$ is the target image and $N$ is the total number of image samples. Multiple frames of the same target object in different postures within a preset time period are selected as related images, denoted $I_1, I_2, \dots, I_{t-2}, I_{t-1}, I_{t+1}, I_{t+2}, \dots, I_{n-1}, I_n$, where $n$ is the number of selected related image samples and $n$ is less than or equal to the total number of image samples $N$; the specific number of related image samples can be determined according to the actual situation. For example, when $n = 2$, the related images are $I_{t-1}$ and $I_{t+1}$.
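By way of non-limiting illustration, the division of the sequence into a target image and related images might be sketched as follows. The function name `split_sequence`, the 0-based index `t` and the rule of keeping the `n` temporally nearest frames are assumptions made for this sketch; the embodiment only states that the number of related images is chosen according to the actual situation.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def split_sequence(frames: Sequence[Frame], t: int, n: int) -> Tuple[Frame, List[Frame]]:
    """Split a single-view sequence into (target image, related images).

    t is the 0-based index of the target frame; n is the number of related
    frames to keep. Picking the n temporally nearest frames is an assumption
    made for this sketch only.
    """
    total = len(frames)
    if not 0 <= t < total:
        raise IndexError("target index t is outside the sequence")
    if not 0 <= n <= total - 1:
        raise ValueError("n must be between 0 and len(frames) - 1")
    # Rank every non-target frame by temporal distance to the target.
    candidates = sorted((i for i in range(total) if i != t), key=lambda i: (abs(i - t), i))
    # Keep the n nearest frames, then restore chronological order.
    chosen = sorted(candidates[:n])
    return frames[t], [frames[i] for i in chosen]


frames = ["I1", "I2", "I3", "I4", "I5"]
target, related = split_sequence(frames, t=2, n=2)
print(target, related)  # I3 ['I2', 'I4']
```

With `n = 2` the sketch returns the frames immediately before and after the target, which matches the example given above.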

As shown in fig. 2, in step S1, obtaining two-dimensional bone key points of a target image and a related image in a single-view human body image includes the following steps:

S11, acquiring a single-view human body image sequence, and dividing the single-view human body image sequence into a target image and a related image;

S12, respectively detecting the target objects in the target image and the related image by using a two-dimensional skeleton key point detector to obtain a first skeleton key point and a second skeleton key point.

In this embodiment, after the target image and the related image are acquired, they are input into the two-dimensional bone key point detector for detection, so as to obtain a first bone key point of the target image and a second bone key point of the related image. The two-dimensional bone key point detector may be, but is not limited to, CPN or HRNet.

S2, extracting spatial features of the first skeleton key points to obtain spatial semantic features, and extracting time sequence features of the first skeleton key points and the second skeleton key points to obtain time sequence features

A corresponding space map extractor and a corresponding time sequence diagram extractor are established by using a graph convolution network and an attention mechanism; the global semantic features and local semantic features of the target image are extracted through the space map extractor, and the time sequence features of the single-view human body image sequence are extracted through the time sequence diagram extractor.

As shown in fig. 3, the spatial feature extraction of the first skeleton key points in step S2 to obtain global semantic features includes the following steps:

S211, constructing a global space map and a global adjacency matrix according to the first skeleton key points;

S212, mining a weight matrix of the global space map through a multi-head attention mechanism, and obtaining multi-dimensional features of a plurality of receptive fields in the global space map by using a dilated convolution network;

S213, updating the global adjacency matrix by using a graph convolution network to obtain a first matrix;

S214, combining the weight matrix, the multi-dimensional features and the first matrix to construct a multi-head attention global space map;

S215, performing feature representation on the multi-head attention global space map to obtain the global semantic features.

A global space map is constructed from the first skeleton key points of the target image $I_t$. The global space map of the target image $I_t$ is expressed as $\mathcal{G} = (V, \varepsilon)$, where $V = \{v_{ti} \mid i = 1, \dots, M\}$ is the vertex set of the graph and $M$ is the number of human skeleton key points; $\varepsilon = \{e_{ij} \mid i = 1, 2, \dots, M;\ j = 1, 2, \dots, M\}$ is the edge set of the graph. The feature vectors of the global space map are $X = \{x_1, x_2, \dots, x_M \mid x_i \in \mathbb{R}^{1 \times C}\}$, where $C$ is the number of feature channels.

According to SemGCN, after the input feature vectors pass through one graph convolution layer, the features are transformed according to a formula that is not reproduced in this text. One term of that formula is a normalized diagonal matrix (the symbols for this term and for the matrix it normalizes are likewise not reproduced). $A \in \mathbb{R}^{M \times M}$ is the joint connection matrix, that is, the adjacency matrix, which represents the connections between adjacent skeleton key points: the first-order matrix represents the first-order connections between joints, and the second-order and third-order matrices represent the second-order and third-order connections between joints. $W = \{\omega_{ij}\}$ is a learnable weight matrix that represents the interaction between all skeleton key points, and $\sigma$ is a nonlinear activation function.
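By way of illustration only, one graph convolution layer of the kind referred to above might be sketched as follows. Because the formula itself is not reproduced in this text, the sketch assumes the commonly used form $X' = \sigma(\hat{A} X W)$, in which $\hat{A}$ is the adjacency matrix with self-loops after symmetric degree normalization and $\sigma$ is taken to be ReLU. The function names `normalize_adjacency` and `gcn_layer` are hypothetical, and the embodiment's actual layer may differ.

```python
import numpy as np


def normalize_adjacency(A: np.ndarray) -> np.ndarray:
    """Symmetrically normalize an (M, M) adjacency matrix.

    Self-loops are added first (an assumption of this sketch), so every
    node has a non-zero degree.
    """
    A_hat = A + np.eye(A.shape[0])
    deg = A_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Equivalent to D^{-1/2} (A + I) D^{-1/2}.
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(X: np.ndarray, A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One graph convolution step.

    X: (M, C_in) key point features, A: (M, M) adjacency,
    W: (C_in, C_out) learnable weights. ReLU stands in for sigma.
    """
    return np.maximum(normalize_adjacency(A) @ X @ W, 0.0)


# Toy chain of three joints: 0 - 1 - 2.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.eye(3)          # one-hot feature per joint, C_in = 3
W = np.ones((3, 2))    # C_out = 2
print(gcn_layer(X, A, W).shape)  # (3, 2)
```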

In order to capture the overall global influence relationships among the human skeleton key points, a different trainable weight is assigned to the connection between any two joints, and an adjacency matrix of the connection relationships between all human skeleton key points is constructed. A multi-head attention mechanism is used to mine, randomly and multiple times, a weight matrix of the global space map from the global space map structure, so as to mine the mutual influence relationships among the human skeleton key points and obtain the global spatial features among them. A dilated convolution network is used to extract multi-dimensional features of different receptive fields. The elements of the global adjacency matrix are updated through a graph convolution network, and the updated global adjacency matrix is defined as the first matrix. The weight matrix, the multi-dimensional features and the first matrix are combined to construct the multi-head attention global space map, and feature representation is performed on it to obtain the global semantic features.

In one embodiment, the global semantic features based on the multi-head attention global space map are expressed by a formula that is not reproduced in this text, in which:

k is the number of attention heads; multi-head attention attends to the global spatial semantic features of all skeleton key points and reduces the influence of local interference factors;

one term is an adjustment matrix;

another term is the adaptively learnable global adjacency matrix, which characterizes the connection relationships and connection strengths among all skeleton key points. Its element $b_{ij}$ does not use a simple 1 or 0 to indicate whether two skeleton key points are connected; instead, it describes the attention coefficient of the mutual influence between the features of the skeleton key points.

The expression for $b_{ij}$ is likewise not reproduced. In it, $\theta$ and a second operator are convolution layers with a kernel size of 1, used to adjust the feature dimension; | denotes feature concatenation; $\Gamma$ is a mapping function that maps high-dimensional features to low-dimensional features or real numbers; and $\rho$ is the activation function, for which LeakyReLU can be used.
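As a rough sketch only, the attention coefficients $b_{ij}$ could be computed in a graph-attention style from the ingredients named above. Since the expression itself is not reproduced in this text, the form $b_{ij} = \rho(\Gamma(\theta(x_i) \,|\, \phi(x_j)))$ is an assumption, as are the symbol $\phi$ for the second kernel-1 convolution, the choice of a linear map to a real number for $\Gamma$, and the hypothetical names `attention_adjacency`, `W_theta`, `W_phi` and `gamma`. A convolution with kernel size 1 applied per key point is written here as a matrix product; only a single attention head is shown.

```python
import numpy as np


def leaky_relu(z: np.ndarray, slope: float = 0.2) -> np.ndarray:
    """LeakyReLU, used here as the activation rho."""
    return np.where(z > 0, z, slope * z)


def attention_adjacency(X: np.ndarray, W_theta: np.ndarray,
                        W_phi: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """Return the (M, M) matrix of assumed coefficients b_ij.

    X: (M, C) key point features; W_theta, W_phi: (C, D) kernel-1
    convolutions written as linear maps; gamma: (2 * D,) weights of a
    linear mapping Gamma from the concatenated features to a real number.
    """
    q = X @ W_theta  # theta(x_i), shape (M, D)
    k = X @ W_phi    # phi(x_j), shape (M, D)
    D = q.shape[1]
    # gamma . [q_i | k_j] splits into a term in q_i plus a term in k_j,
    # so all pairs are scored at once by broadcasting.
    scores = (q @ gamma[:D])[:, None] + (k @ gamma[D:])[None, :]
    return leaky_relu(scores)


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # M = 4 key points, C = 3 channels
B = attention_adjacency(X, rng.normal(size=(3, 2)), rng.normal(size=(3, 2)),
                        rng.normal(size=4))
print(B.shape)  # (4, 4)
```

Unlike a 0/1 adjacency matrix, every pair of key points receives a real-valued coefficient, which is the property the description above emphasizes.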

Referring to fig. 4, the step S2 of extracting the spatial features of the first bone key points to obtain local semantic features includes the following steps:

S221, constructing a plurality of local adjacency matrices according to the local connection relation of the first skeleton key points, and updating the plurality of local adjacency matrices through a graph convolution network to obtain a plurality of second matrices;

S222, constructing a plurality of local space maps according to the plurality of second matrices, and performing feature representation on the plurality of local space maps to obtain the local semantic features.

Besides the overall connection relationships among the human skeleton key points, the specific local relationships between skeleton key points also influence the final detection accuracy. The symmetry, first-order connections, second-order connections and the like of the skeleton key points can be selected as local features. Symmetry helps distinguish limb joints from trunk joints, limits the main region in which a target key point can lie, and allows the rough position of a symmetric point to be inferred from a known point. First-order connection points are the points directly connected to the target key point, and a change in their position generally has the most direct influence on the position of the target key point. Second-order connection points may, together with the first-order connection points and the target key point, form a body part, and serve as auxiliary information for determining the position of the target key point. For each type of local feature, such as symmetry, first-order connection and second-order connection, a plurality of local adjacency matrices are first constructed from the local relationships of the first skeleton key points; the elements of these local adjacency matrices are then updated through learning by a graph convolution network, and the updated local adjacency matrices are defined as second matrices. A plurality of local space maps are constructed from the second matrices, and feature representation is performed on them to obtain the local semantic features, thereby describing the local influence relationships among the human skeleton joints.

In one embodiment, the local semantic features based on the plurality of local space maps are expressed by a formula that is not reproduced in this text, in which $S$ is the category of local feature; one term is the local adjacency matrix generated for the $S$-th local feature; $M_S$ is a learnable mask matrix, which can be used to shield the influence of non-$S$ local features and to reduce model parameters; and the formula uses a dot-product operation.

The elements $d_{ij}$ of the local adjacency matrices are expressed as follows. For symmetry, when nodes $i$ and $j$ are symmetric, $d_{ij} = d_{ji} = 1$, and the remaining elements are 0. For the first-order connection and second-order connection adjacency matrices, when there is a first-order or second-order connection between nodes $i$ and $j$, $d_{ij}$ takes the value of $b_{ij}$ from the global semantic feature representation and describes the local connection strength between joints $i$ and $j$.
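The element rules for $d_{ij}$ described above can be illustrated with a minimal sketch. The function name `local_adjacency`, the toy five-joint skeleton and the reading of a second-order connection as two joints exactly two bones apart are assumptions made for illustration; the matrix `B` stands for the coefficients $b_{ij}$ from the global representation and is filled with a constant here. The learnable mask and the graph convolution update are not shown.

```python
import numpy as np


def local_adjacency(M, bones, symmetric_pairs, B):
    """Build symmetry, first-order and second-order local adjacency matrices.

    M: number of key points; bones: list of (i, j) directly connected joints;
    symmetric_pairs: list of (i, j) left/right counterparts;
    B: (M, M) matrix of coefficients b_ij from the global representation.
    """
    first = np.zeros((M, M), dtype=bool)
    for i, j in bones:
        first[i, j] = first[j, i] = True
    # Joints reachable in exactly two bone steps (assumed meaning of
    # "second-order"), excluding direct neighbours and the joint itself.
    two_step = (first.astype(int) @ first.astype(int)) > 0
    second = two_step & ~first & ~np.eye(M, dtype=bool)
    sym = np.zeros((M, M))
    for i, j in symmetric_pairs:
        sym[i, j] = sym[j, i] = 1.0  # d_ij = d_ji = 1 for symmetric joints
    return {
        "symmetry": sym,
        "first_order": np.where(first, B, 0.0),    # d_ij = b_ij where connected
        "second_order": np.where(second, B, 0.0),
    }


# Toy skeleton: 0 neck, 1 left shoulder, 2 right shoulder, 3 left hand, 4 right hand.
bones = [(0, 1), (0, 2), (1, 3), (2, 4)]
pairs = [(1, 2), (3, 4)]
mats = local_adjacency(5, bones, pairs, np.full((5, 5), 0.5))
print(mats["second_order"][0, 3])  # 0.5 (neck to left hand is two bones apart)
```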

Through the operation, the acquisition of the global semantic features and the local semantic features of the skeletal key points of the target image in a single-view human body image can be realized.

As shown in fig. 5, the step S2 of extracting the time sequence feature of the first skeleton key point and the second skeleton key point to obtain the time sequence feature includes the following steps:

S231, constructing a time sequence adjacency matrix of the first skeleton key point and the second skeleton key point, and updating the time sequence adjacency matrix through a graph convolution network to obtain a third matrix;

S232, constructing a time sequence diagram through the third matrix, and performing feature representation on the time sequence diagram to obtain the time sequence features.

A time sequence adjacency matrix is constructed for each key point from the first skeleton key point and the second skeleton key point; its elements are then updated through learning by a graph convolution network, and the updated time sequence adjacency matrix is defined as the third matrix. A time sequence diagram is constructed from the third matrix, and feature representation is performed on it to obtain the time sequence features, which describe the position change of each joint and its temporal context correlation over the time-domain receptive field.

In one embodiment, the time sequence features based on the timing diagram are expressed as:

wherein the adjacency matrix is formed for key point i in the time domain T, and its elements are represented by the cosine similarity distance between the related image and the target image. That is, each element of the time sequence adjacency matrix is the similarity between a skeleton key point in the target frame and the corresponding skeleton key point in an adjacent frame. This embodiment uses the normalized cosine similarity, expressed as:

$$\lambda_t = \sigma\left(\psi_t \left[ L(P_1, P_t), \ldots, L(P_{T-1}, P_t) \right]^{T}\right)$$

wherein P_t denotes the two-dimensional joint point coordinates, t is the index of the target frame image, and L is the cosine distance calculation function.
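The similarity-based weighting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the key points are assumed to be stored as an array of shape (T, J, 2), the learnable term ψ_t is omitted, and a softmax over the related frames stands in for the normalization σ. The function names `cosine_similarity` and `temporal_weights` are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two coordinate vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def temporal_weights(keypoints, target_idx):
    """Per-joint similarity weights between the target frame and each related frame.

    keypoints: array of shape (T, J, 2) holding 2D joint coordinates per frame.
    Returns an array of shape (J, T - 1) whose rows sum to 1.
    """
    num_frames, num_joints, _ = keypoints.shape
    related = [t for t in range(num_frames) if t != target_idx]
    sims = np.zeros((num_joints, len(related)))
    for j in range(num_joints):
        for k, t in enumerate(related):
            sims[j, k] = cosine_similarity(keypoints[t, j], keypoints[target_idx, j])
    # Softmax across related frames (an assumption standing in for sigma).
    e = np.exp(sims - sims.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical input: 5 frames, 17 joints, strictly positive coordinates.
kp = np.random.default_rng(0).random((5, 17, 2)) + 0.1
weights = temporal_weights(kp, target_idx=2)
```

Each row of `weights` could serve as the initial entries of one key point's time sequence adjacency matrix before the graph convolution network updates them.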

S3, fusing the supervision features with the space semantic features and the time sequence features to obtain three-dimensional human skeleton key point feature information

To form human skeleton key point features with space-time consistency and strong representational power, a feature fusion device is constructed to fuse the global semantic features, the local semantic features and the time sequence features, and the supervision features of the target image are used during fusion as a supplement of human body and background information and as intermediate supervision. The supervision features are obtained by extracting information from the target image through a feature extraction network, and include deep information and shallow information of the target image. Deep information refers to deep semantic information, that is, abstract features obtained by converting image features into a form that humans can reason about and understand; these are high-level features of the kind humans possess. For example, the shape of an arm can be imagined from the skeleton key points of the arm, and an eye can be imagined from the key points of the eye socket. Because a neural network is constructed by simulating the learning and reasoning processes of the human brain, the features it learns are intended to be as close as possible to features that humans understand. Shallow features are typically surface information such as directly visible texture, shape and color, whereas deep features are abstract information at the level of human learning and understanding. For example, the shallow information "blue" corresponds to the deep information "melancholy", and the shallow information "green light" corresponds to the deep information "passage permitted". The deep information mainly comprises deep semantic information, and the shallow information mainly comprises texture information and edge information.

The supervision features refer to original target image information obtained by multi-scale convolution layers, including shallow texture information, edge information and deep semantic information of the human body target and the background in the target image. Because the spatial semantic features and the time sequence features are obtained from the graph convolution network, and the graph features of that network are derived mainly from the position information of the key skeletal joint points of the human body, they lack the original image information. The supervision features of the target image can therefore supplement the spatial features and the time sequence features.

The global semantic features and the local semantic features are used as attention factors of the supervision features, and the global semantic features and the local semantic features are connected to form first spatial structure features of the target image; the first spatial structure features are adjusted by a spatial feature adjuster to obtain second spatial structure features. The time sequence features are fused with the supervision features to obtain first time sequence structure features, which are adjusted by a time sequence feature adjuster to obtain second time sequence structure features. The second time sequence structure features are then used as attention factors of the second spatial structure features to obtain the three-dimensional human skeleton key point feature information. The spatial feature adjuster and the time sequence feature adjuster are convolutional networks or fully connected networks; they can increase or reduce the dimension of the spatial semantic features and the time sequence features so that the two have consistent dimensions.

In one embodiment, the supervision feature of the target image in the original single-view human body image is denoted F_S. The global spatial semantic feature F_G and the local spatial semantic feature F_L serve as attention factors of the supervision feature F_S, representing the degree of attention paid to the global spatial information and the local spatial information of the skeleton key points; the global semantic features and the local semantic features are then connected to form the spatial structure feature of the image. The time sequence feature F_T is fused with the supervision feature F_S, and its feature dimension is adjusted by the time sequence feature adjuster; the adjusted time sequence structure feature is used as the attention factor of the adjusted spatial structure feature, focusing on how the time sequence of the related images supplements the depth information of the target image. This yields the fusion feature F of the three-dimensional human skeleton key points, expressed as:

$$F = \alpha(F_T \,\|\, F_S) \odot \beta\bigl(F_S \,\|\, F_G \odot (F_S \,\|\, F_L \odot F_S)\bigr)$$

wherein α and β are adjustment functions; ∥ denotes feature connection; ⊙ denotes the dot product; and F_S = G(I_t), where the feature extraction network G may be a conventional convolutional neural network such as ResNet.
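The fusion steps described in prose above can be sketched in NumPy as follows. This is one possible reading, given only as an illustration: it assumes per-joint feature arrays of matching width, applies the global and local features to the supervision feature by element-wise multiplication before connecting them, and uses fully connected layers with random weights as stand-ins for the learned adjusters α and β. The names `linear_adjuster` and `fuse` and all dimensions are hypothetical, and the exact grouping of terms in the source formula may differ from this reading.

```python
import numpy as np

def linear_adjuster(in_dim, out_dim, rng):
    """Fully connected layer; random weights stand in for learned parameters."""
    w = rng.standard_normal((in_dim, out_dim)) * 0.1
    bias = np.zeros(out_dim)
    return lambda x: x @ w + bias

def fuse(f_s, f_g, f_l, f_t, alpha, beta):
    """Fuse supervision, spatial and time sequence features into one feature map."""
    # Global and local features act as attention factors on the supervision
    # feature; the two attended results are connected (concatenated).
    spatial = np.concatenate([f_g * f_s, f_l * f_s], axis=-1)
    # The time sequence feature is connected with the supervision feature.
    temporal = np.concatenate([f_t, f_s], axis=-1)
    # Both branches are adjusted to a common width, then the temporal branch
    # weights the spatial branch element-wise.
    return alpha(temporal) * beta(spatial)

rng = np.random.default_rng(0)
J, C, C_T, D = 17, 32, 16, 64          # joints, feature widths, fused width
F_S, F_G, F_L = (rng.standard_normal((J, C)) for _ in range(3))
F_T = rng.standard_normal((J, C_T))
alpha = linear_adjuster(C_T + C, D, rng)   # time sequence feature adjuster
beta = linear_adjuster(2 * C, D, rng)      # spatial feature adjuster
F = fuse(F_S, F_G, F_L, F_T, alpha, beta)
```

The adjusters exist only to bring the two branches to the same width D so that the element-wise product is defined; a convolutional adjuster would serve the same purpose.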

S4, refining the characteristics of the three-dimensional human skeleton key point characteristic information to obtain three-dimensional skeleton key point coordinate information

A feature refiner performs feature optimization and adjustment on the fused three-dimensional human skeleton key point feature information to obtain the coordinate information of the three-dimensional human skeleton key points.

Before acquiring a single-view human body image sequence, the method also comprises the following steps:

constructing an initial network model, and acquiring a single-view human body image training sample;

inputting single-view human body image training samples into the initial network model, optimizing parameters of the initial network model according to a preset objective function to obtain a trained network model, and inputting single-view human body image sequences into the trained network model.

Wherein the predetermined objective function is

wherein the two coordinate terms, one of which is φ_{t,i}, are respectively the three-dimensional coordinates of the estimated position and the three-dimensional coordinates of the reference position of the i-th skeleton key point in the t-th frame target image. The estimated three-dimensional coordinates are the three-dimensional skeleton key point coordinate information of the invention, and the reference coordinates are the key point coordinate information in the training data. When the model is trained on the preset objective function, a smaller objective value indicates higher key point detection accuracy.
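As an illustration of an objective that compares estimated and reference three-dimensional coordinates per key point and per frame, the sketch below uses the mean Euclidean distance over all key points and frames. This particular form is an assumption chosen because it is a common choice for this kind of comparison; the objective function actually used in the invention may differ. The function name `keypoint_loss` and the (T, J, 3) array layout are hypothetical.

```python
import numpy as np

def keypoint_loss(estimated, reference):
    """Mean Euclidean distance between estimated and reference 3D key points.

    estimated, reference: arrays of shape (T, J, 3), i.e. frames x joints x xyz.
    """
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.linalg.norm(estimated - reference, axis=-1).mean())

# Hypothetical data: every estimated joint is offset by (3, 4, 0) from its reference.
reference = np.zeros((2, 3, 3))
estimated = reference + np.array([3.0, 4.0, 0.0])
loss = keypoint_loss(estimated, reference)
```

With this stand-in, a value of zero means every estimated key point coincides with its reference, which matches the statement that a smaller objective value corresponds to higher detection accuracy.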

In this scheme, therefore, the graph convolution network is used to obtain the spatial semantic features and the time sequence features, which mainly contain the connection and position information of the human skeleton joint points; fusing the two gives the features space-time correlation. The supervision features contain the original image information, supplement the spatial semantic features and the time sequence features, and can further improve the representational power of the features finally obtained by the network. The global semantic features and the local semantic structure features of the two-dimensional skeleton key points of the target image in a single-view human body image map to the plane information of the three-dimensional human skeleton key points, while the time sequence features of several consecutive images from the same view jointly map to the depth information of the three-dimensional human skeleton key points. Using the supervision features as intermediate supervision during the fusion of the spatial semantic features and the time sequence features effectively reduces the depth ambiguity and ill-posedness of mapping two-dimensional human skeleton key points to three-dimensional human skeleton key points, and improves the detection accuracy of the three-dimensional human skeleton key points.

In one embodiment, a single-view three-dimensional human skeleton key point detection device is provided, which corresponds one-to-one to the detection method in the above embodiments. Specifically, the detection device includes: an image acquisition module, configured to acquire a single-view human body image sequence, wherein the single-view human body image sequence comprises multiple frames of images of the same target object in different postures within a preset time period, one frame being used as the target image and the remaining frames being used as related images of the target image; a feature acquisition module, configured to respectively acquire a first skeleton key point of the target image and a second skeleton key point of the related image, perform spatial feature extraction on the first skeleton key point to obtain spatial semantic features, and perform time sequence feature extraction on the first skeleton key point and the second skeleton key point to obtain time sequence features, wherein the spatial semantic features comprise global semantic features and local semantic features; and a feature fusion module, configured to fuse the supervision features with the spatial semantic features and the time sequence features to obtain three-dimensional human skeleton key point feature information, wherein the supervision features comprise deep semantic information, shallow texture information and edge information of the target image.

The first skeleton key point of the target image and the second skeleton key point of the related image may be acquired by a two-dimensional skeleton key point detector; the feature acquisition module may comprise a timing diagram extractor and a space map extractor; and the feature fusion module may be a feature fusion device. Fig. 6 shows a single-view three-dimensional human skeleton key point detection device and the three-dimensional human skeleton key point detection steps based on the device.

The two-dimensional skeleton key point detector: after the target image and the related image are acquired, they are input into the two-dimensional skeleton key point detector for detection, to obtain the first skeleton key point of the target image and the second skeleton key point of the related image. The detector includes, but is not limited to, CPN and HRNet.

The space map extractor is constructed using a graph convolution network and an attention mechanism, and obtains the global semantic features and the local semantic features of the first skeleton key points of the target image. To obtain the global semantic features, a global space map and a global adjacency matrix are constructed from the first skeleton key points; a weight matrix of the global space map is mined through a multi-head attention mechanism, and multi-dimensional features of multiple receptive fields in the global space map are obtained using a dilated (cavity) convolution network; the global adjacency matrix is updated by the graph convolution network to obtain a first matrix; the weight matrix, the multi-dimensional features and the first matrix are combined to construct a multi-head attention global space map; and feature representation is performed on the multi-head attention global space map to obtain the global semantic features. To obtain the local semantic features, a plurality of local adjacency matrices are constructed according to the local connection relations of the first skeleton key points and updated through the graph convolution network to obtain a plurality of second matrices; a plurality of local space maps are constructed from the second matrices, and feature representation is performed on the local space maps to obtain the local semantic features.

The timing diagram extractor is constructed using a graph convolution network and obtains the time sequence features of the single-view human body image sequence. A time sequence adjacency matrix of the first skeleton key point and the second skeleton key point is constructed and updated through the graph convolution network to obtain a third matrix; a timing diagram is constructed from the third matrix, and feature representation is performed on the timing diagram to obtain the time sequence features.

The feature fusion device fuses the supervision features with the spatial semantic features and the time sequence features to obtain the three-dimensional human skeleton key point feature information, yielding a feature representation with space-time consistency and strong representational power. The global semantic features and the local semantic features are used as attention factors of the supervision features, and the global semantic features and the local semantic features are connected to form first spatial structure features of the target image; the first spatial structure features are adjusted by a spatial feature adjuster to obtain second spatial structure features. The time sequence features are fused with the supervision features to obtain first time sequence structure features, which are adjusted by a time sequence feature adjuster to obtain second time sequence structure features. The second time sequence structure features are used as attention factors of the second spatial structure features to obtain the three-dimensional human skeleton key point feature information. In this embodiment, the spatial feature adjuster and the time sequence feature adjuster are convolutional networks or fully connected networks; they reduce or increase the dimension of the spatial semantic features and the time sequence features, and are mainly responsible for adjusting the two to consistent dimensions.

In addition, the device further comprises a feature refiner, which uses network layers to optimize and adjust the fusion features of the three-dimensional human skeleton key points to obtain the coordinate information of the three-dimensional human skeleton key points.

The invention provides a single-view three-dimensional human skeleton key point detection device in which the two-dimensional skeleton key point detector obtains the first skeleton key point and the second skeleton key point, the space map extractor extracts the global semantic features and the local semantic features of the target image, the timing diagram extractor obtains the time sequence features of the single-view human body image sequence, and the feature fusion device fuses the supervision features, the spatial semantic features and the time sequence features. The device thereby extracts and fuses the spatial semantic features and the time sequence features of the single-view three-dimensional human skeleton key points effectively, mines a feature representation with space-time consistency and strong representational power, reduces the depth ambiguity and ill-posedness of mapping from a single-view image to the three-dimensional human skeleton key points, and improves the detection accuracy of the three-dimensional human skeleton key points.

For the specific limitations of the single-view three-dimensional human skeleton key point detection device, reference may be made to the limitations of the single-view three-dimensional human skeleton key point detection method above, which are not repeated here. All or some of the modules of the device may be implemented in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a single-view human image processing side, comprising a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes non-volatile and/or volatile storage media, internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer equipment is used for connecting and communicating with an external single-view human body image acquisition end through a network. The computer program is executed by a processor to implement functions or steps of a service side of a single-view three-dimensional human bone key point detection method. The processor, when executing the computer program, implements the steps of:

respectively acquiring a first skeleton key point of a target image and a second skeleton key point of a related image, extracting spatial features of the first skeleton key point to obtain spatial semantic features, and extracting time sequence features of the first skeleton key point and the second skeleton key point to obtain time sequence features, wherein the spatial semantic features comprise global semantic features and local semantic features;

the supervision characteristics are fused with the space semantic characteristics and the time sequence characteristics to obtain three-dimensional human skeleton key point characteristic information, wherein the supervision characteristics comprise deep semantic information, texture information and edge information of a target image;

and refining the feature information of the key points of the three-dimensional human skeleton by using a feature refiner to obtain coordinate information of the key points of the three-dimensional skeleton.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

respectively obtaining a first skeleton key point of a target image and a second skeleton key point of a related image, extracting spatial features of the first skeleton key point to obtain spatial semantic features, and extracting time sequence features of the first skeleton key point and the second skeleton key point to obtain time sequence features, wherein the spatial semantic features comprise global semantic features and local semantic features;

fusing the supervision features with the spatial semantic features and the time sequence features to obtain three-dimensional human skeleton key point feature information, wherein the supervision features comprise deep semantic information, texture information and edge information of the target image.

It should be noted that, the functions or steps that can be implemented by the computer-readable storage medium or the computer device can be referred to the related descriptions of the server side and the client side in the foregoing method embodiments, and are not described here one by one to avoid repetition.

In conclusion, the single-view three-dimensional human skeleton key point detection method of the invention obtains features with strong representational power and space-time consistency. It uses the global semantic features and the local semantic features of the target image to map the plane information of the three-dimensional human skeleton key points, uses the time sequence features of several consecutive images from the same view to map the depth information of the three-dimensional human skeleton key points, and uses the supervision features as intermediate supervision, fusing them with the spatial semantic features and the time sequence features. This effectively reduces the depth ambiguity and ill-posedness of mapping the two-dimensional human skeleton key points of a single-view image to three-dimensional human skeleton key points, and improves the detection accuracy. The invention therefore effectively overcomes various defects in the prior art and has high industrial utilization value.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. A single-view three-dimensional human skeleton key point detection method is characterized by comprising the following steps:

2. The method for detecting single-view three-dimensional human skeleton key points according to claim 1, wherein the step of extracting spatial features of the first skeleton key points to obtain spatial semantic features comprises the steps of:

mining a weight matrix of the global space diagram through a multi-head attention mechanism, and obtaining multi-dimensional features of a plurality of receptive fields in the global space diagram by using a dilated (cavity) convolution network;

combining the weight matrix, the multidimensional characteristics and the first matrix to construct a multi-head attention global space diagram;

3. The method for detecting single-view three-dimensional human skeleton key points according to claim 1, wherein performing spatial feature extraction on the first skeleton key points to obtain spatial semantic features comprises:

4. The method for detecting single-view three-dimensional human skeleton key points according to claim 1, wherein the extracting of the time sequence features of the first skeleton key points and the second skeleton key points to obtain the time sequence features comprises:

5. The single-view three-dimensional human skeleton key point detection method according to claim 1, wherein the step of fusing a supervision feature with the spatial semantic feature and the time sequence feature to obtain three-dimensional human skeleton key point feature information comprises the steps of:

6. The single-view three-dimensional human bone key point detection method according to claim 1, wherein before acquiring the single-view human image sequence, the method comprises the steps of:

7. The method for detecting single-view three-dimensional human bone key points according to claim 1, wherein the step of respectively obtaining a first bone key point of the target image and a second bone key point of the related image comprises:

and respectively detecting the target objects in the target image and the related image by using a two-dimensional skeleton key point detector to obtain a first skeleton key point of the target image and a second skeleton key point of the related image.

8. A single-view three-dimensional human bone key point detection device is characterized by comprising:

an image acquisition module, configured to acquire a single-view human body image sequence, wherein the single-view human body image sequence comprises a plurality of frames of images of the same target object in different postures within a preset time period, one frame of image being used as a target image and the remaining images being used as related images of the target image;

9. A computer apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: the processor, when executing the computer program, implements the steps of the single-view three-dimensional human bone keypoint detection method according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, characterized in that: the computer program when being executed by a processor realizes the steps of the single-view three-dimensional human bone key point detection method according to any one of claims 1 to 7.