CN113095254A

CN113095254A - Method and system for positioning key points of human body part

Info

Publication number: CN113095254A
Application number: CN202110422052.8A
Authority: CN
Inventors: 王好谦; 蔡元昊
Original assignee: Shenzhen International Graduate School of Tsinghua University
Current assignee: Shenzhen International Graduate School of Tsinghua University
Priority date: 2021-04-20
Filing date: 2021-04-20
Publication date: 2021-07-09
Anticipated expiration: 2041-04-20
Also published as: CN113095254B

Abstract

The invention provides a method and a system for positioning keypoints of human body parts. The method comprises the following steps: S1, preprocessing an image containing a human body part; S2, inputting the image preprocessed in step S1 into a convolutional neural network branch to obtain a keypoint heatmap, and decoding the keypoint heatmap to obtain initial coordinates of the keypoints; convolving the feature map of each stage of the convolutional neural network branch through a connection layer to obtain the relay heatmap of that stage, and encoding and decoding the relay heatmap of each stage to generate the node features and relay keypoint coordinates of that stage; inputting the node features and relay keypoint coordinates of each stage into the corresponding stage of a graph convolutional neural network branch to obtain coordinate compensation for the keypoints; and S3, calculating the final coordinates of the keypoints from their initial coordinates and coordinate compensation. The method and system improve the detection accuracy of keypoints of human body parts.

Description

Method and system for positioning key points of human body part

Technical Field

The invention relates to the technical field of computer data processing, in particular to a method and a system for positioning key points of human body parts.

Background

The main goal of human pose estimation is to locate all human skeletal keypoints in a single RGB image and group them into individual human instances. Human pose estimation is an important and fundamental task in computer vision. Traditional algorithms treat pose estimation as a tree-structured or graph-structured model and solve it using hand-crafted features; such features have limited representational power and do not achieve good results. With continued breakthroughs in deep learning, human pose estimation has progressed rapidly.

The current mainstream human pose estimation algorithms fall into two categories: top-down and bottom-up. A top-down algorithm first uses a human detector to output rectangular bounding boxes that locate each pedestrian. A bounding box is generally a quadruple (x, y, w, h), where x and y are the horizontal and vertical coordinates of its upper-left corner and w and h are its width and height; the quadruple fully specifies the box's position and size. Each box region containing a pedestrian is then cropped out, and single-person pose estimation is performed on each human instance. In single-person pose estimation, an image containing one person is fed into a convolutional neural network. Assuming a person has K skeletal keypoints, the network outputs a K-channel heatmap, where each channel gives the probability that each position in the image is a keypoint of that type. Each channel is then decoded (typically by taking the peak and shifting it slightly toward the second-highest response) to obtain the two-dimensional coordinates of each skeletal keypoint.
A bottom-up algorithm first detects all skeletal keypoints in the whole image without instance labels. Specifically, the whole image containing multiple people is fed into a convolutional neural network, which outputs K-channel heatmaps of all skeletal keypoints; each channel is decoded to obtain the two-dimensional coordinates of every keypoint of that type, and keypoints belonging to the same person are then connected to form individual human instances.
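The per-channel decoding step described above can be sketched as follows. This minimal NumPy sketch assumes the widely used convention of taking each channel's argmax and shifting it a quarter pixel toward the larger neighboring response; the function name `decode_heatmaps` and the 0.25 shift are illustrative assumptions, not details given in this document.

```python
import numpy as np


def decode_heatmaps(heatmaps: np.ndarray) -> np.ndarray:
    """Decode K heatmaps of shape (K, H, W) into K (x, y) coordinates.

    Takes the peak of each channel, then shifts it 0.25 px toward the
    neighboring pixel with the larger response (an assumed convention).
    """
    num_joints, height, width = heatmaps.shape
    coords = np.zeros((num_joints, 2), dtype=np.float64)
    for k in range(num_joints):
        hm = heatmaps[k]
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        px, py = float(x), float(y)
        # Horizontal shift toward the larger neighbor (only if both neighbors exist).
        if 0 < x < width - 1:
            px += 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
        # Vertical shift toward the larger neighbor.
        if 0 < y < height - 1:
            py += 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
        coords[k] = (px, py)
    return coords
```

For a single channel whose peak is at (x=5, y=3) with a stronger right neighbor than left neighbor, the decoded position moves from x = 5 to x = 5.25, while y stays at 3 because the vertical neighbors are equal.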

However, existing methods and algorithms are not sufficiently accurate for human pose estimation. Through careful study, the inventors found that existing methods focus solely on learning better and more sophisticated image representations in order to generate higher-quality keypoint heatmaps. In a heatmap, however, the information at each pixel is compressed into a single probability that the pixel is the corresponding keypoint, which erases the other information the pixel carries. For example, if a location on a heatmap has a large response, we can only conclude that it belongs to the corresponding keypoint; we cannot tell the direction in which the keypoint rotates into or out of the image plane, or the direction in which the limb hinged to it extends. In addition, the variety of clothing and severe occlusion make appearance representations difficult to learn. To address this problem, the applicant recognized that the performance of pose estimation learned from appearance features alone can be improved by implicitly modeling the spatial relationships between mutually articulated skeletal keypoints.

The graph convolutional network (GCN) is a type of neural network proposed by Thomas N. Kipf and Max Welling in "Semi-Supervised Classification with Graph Convolutional Networks". Such networks operate specifically on graph-structured data. Graph convolutional networks generally fall into two categories: spectral and spatial. Spectral methods implement convolution through the graph Fourier transform, whereas spatial methods extend the spatial definition of ordinary convolution so that convolution is performed over each node and its neighboring nodes in the graph. Spectral graph convolution is generally suited to graph data with a fixed topology, while spatial graph convolution handles graph data whose topology varies.

A simple graph convolution layer can be defined as follows:

$$X_{out} = \sigma(\hat{A} X W)$$

where $X$ represents the input node features, $\hat{A}$ is a normalized version of the adjacency matrix $A$, $W$ is a learnable parameter matrix, and $\sigma(\cdot)$ denotes an activation function, commonly ReLU. However, the applicant found that simple graph convolutional networks are not suitable for modeling the spatial relationships between skeletal keypoints, for three reasons: (1) the learnable matrix $W$ is shared by all edges in the graph, so the internal structure of the graph data is not well exploited; (2) the adjacency matrix $\hat{A}$ limits a simple graph convolution layer to capturing features from the first-order neighborhood of each node; (3) simple graph convolutional networks use only spatial information, such as two-dimensional skeletal keypoint coordinates, and ignore limb-based semantic features.
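A minimal NumPy sketch of the simple graph convolution layer follows. It assumes the renormalized adjacency $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ used by Kipf and Welling, with ReLU as $\sigma$; the function names are hypothetical.

```python
import numpy as np


def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Return D^-1/2 (A + I) D^-1/2, the renormalized adjacency of Kipf & Welling."""
    a_tilde = adj + np.eye(adj.shape[0])  # add self-loops
    deg = a_tilde.sum(axis=1)             # node degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def simple_graph_conv(x: np.ndarray, adj_norm: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One simple graph convolution layer: ReLU(A_hat @ X @ W).

    The same W is applied to every node and edge (limitation (1)), and
    A_hat mixes only first-order neighbors (limitation (2)).
    """
    return np.maximum(adj_norm @ x @ weight, 0.0)
```

For a three-node chain such as shoulder, elbow, and wrist, changing the wrist features leaves the shoulder's output unchanged, because the two nodes are two hops apart; this illustrates limitation (2).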

Disclosure of Invention

In order to solve the above problems, the present invention provides a method and a system for positioning key points of a human body part, which can improve the detection accuracy of the key points.

The invention provides a method for positioning keypoints of human body parts, comprising the following steps: S1, preprocessing an image containing a human body part; S2, inputting the image preprocessed in step S1 into a convolutional neural network branch of a spatial structure representation network module to obtain a keypoint heatmap, and decoding the keypoint heatmap to obtain initial coordinates of the keypoints; convolving the feature map of each stage of the convolutional neural network branch through a connection layer of the spatial structure representation network module to obtain the relay heatmap of each stage, and encoding and decoding the relay heatmap of each stage to generate the node features and relay keypoint coordinates of that stage; inputting the node features and relay keypoint coordinates of each stage into the corresponding stage of a graph convolutional neural network branch of the spatial structure representation network module to obtain coordinate compensation for the keypoints; and S3, calculating the final coordinates of the keypoints from their initial coordinates and coordinate compensation.

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for locating key points of a human body part as described above.

The present application further provides a positioning system for keypoints of human body parts, comprising: a preprocessing module, configured to preprocess an image containing a human body part; a spatial structure characterization network module, comprising: a convolutional neural network branch, configured to take the preprocessed image as input to obtain a keypoint heatmap and to decode the keypoint heatmap to obtain initial coordinates of the keypoints; a connection layer, configured to connect each stage of the convolutional neural network branch to the corresponding stage of the graph convolutional neural network branch, to convolve the feature map of each stage of the convolutional neural network branch to obtain the relay heatmap of that stage, and to encode and decode the relay heatmap of each stage to generate the node features and relay keypoint coordinates of that stage; and a graph convolutional neural network branch, configured to take the node features and relay keypoint coordinates of each stage as input to obtain the coordinate compensation of the keypoints; and a final coordinate calculation module, configured to calculate the final coordinates of the keypoints from their initial coordinates and coordinate compensation.

The invention has the beneficial effects that:

1) Traditional algorithms use hand-crafted spatial structure representations, whose features generalize poorly. The invention runs a convolutional neural network and a graph convolutional neural network as two parallel, interconnected branches that are iterated jointly. On the one hand, the convolutional neural network branch can directly use current state-of-the-art keypoint estimation networks for human body parts; on the other hand, the graph convolutional neural network branch implicitly models the spatial structural relationships between the keypoints. This overcomes limitations of both traditional and current mainstream keypoint estimation methods and substantially improves keypoint detection accuracy.

2) The method is flexible and extensible. By substituting suitable convolutional neural network and graph convolutional neural network branches, it can perform keypoint estimation for different human body parts, for example human pose estimation, human keypoint detection, and hand gesture estimation. The spatial structure characterization network module is applicable to both single-stage and multi-stage convolutional networks, making it highly extensible.

3) Conventional convolution operates on two-dimensional images by sliding filters across them, which requires a large amount of memory and has high computational complexity. The graph convolution in the invention operates on node features using matrix multiplication, so both its memory and computational complexity are low. The design therefore adds almost no extra computation time and can meet the requirements of deployment on mobile devices, such as small model size, high speed, and low latency.

4) Ordinary simple graph convolution has weak representational power for human skeleton data. The graph convolution layer in the proposed graph convolutional neural network is specifically designed to capture stable dependencies between keypoints of human body parts, and it uses informative context to automatically generate keypoint coordinate compensation offsets that guide the refinement of the pose. In addition, the graph convolutional neural network branch uses local keypoint representations and global background representations to suppress the propagation of meaningless information and reduce noise. The graph convolution layer can also incorporate non-local blocks to further improve performance.

5) The design is optimized end to end and does not require a complicated multi-step training process; the best results are obtained by direct end-to-end training. Because the memory and computational cost of the graph convolutional network branch are small, it consumes little GPU memory, which keeps training overhead under control and training fast.

Drawings

Fig. 1 is a framework diagram of a positioning system for human skeletal keypoints in embodiment 1 of the present invention.

Fig. 2 is a framework diagram of the spatial structure characterization network module in embodiment 1 of the present invention.

FIG. 3 is a block diagram of convolutional neural network branches in embodiment 1 of the present invention.

FIG. 4 is a block diagram of a graph convolution layer in the graph convolutional neural network branch in embodiment 1 of the present invention.

Fig. 5 is a block diagram of a graph convolution layer carrying non-local blocks in the graph convolutional neural network branch in embodiment 1 of the present invention.

Fig. 6 is a flowchart of a method for locating key points of a human body part in embodiment 1 of the present invention.

Fig. 7 shows the results of the human skeletal keypoint localization experiment in example 1 of the present invention.

Detailed Description

The present invention is described in further detail below with reference to specific embodiments and the attached drawings. It should be emphasized that the following description is only exemplary and is not intended to limit the scope or application of the present invention.

The invention relates to a method and a system for positioning keypoints of human body parts, which covers human skeletal keypoint localization, facial keypoint detection, hand gesture estimation, and the like. The following examples use human skeletal keypoints for illustration.

Example 1

This embodiment provides a high-precision human pose estimation method, specifically a method and a system for positioning human skeletal keypoints. As shown in Fig. 1, the system comprises three modules: a preprocessing module 10, a spatial structure representation network module 20, and a final coordinate calculation module 30. The spatial structure characterization network module 20 further includes a convolutional neural network branch (CNN) 21, a connection layer 22, and a graph convolutional neural network branch (GCN) 23.

Based on this system, the positioning method comprises the following steps. S1, preprocessing an image containing a human body part. S2, inputting the image preprocessed in step S1 into the convolutional neural network branch of the spatial structure representation network module to obtain a keypoint heatmap, and decoding the keypoint heatmap to obtain initial coordinates of the keypoints; convolving the feature map of each stage of the convolutional neural network branch through the connection layer of the spatial structure representation network module to obtain the relay heatmap of each stage, and encoding and decoding the relay heatmap of each stage to generate the node features and relay keypoint coordinates of that stage; and inputting the node features and relay keypoint coordinates of each stage into the corresponding stage of the graph convolutional neural network branch of the spatial structure representation network module to obtain the coordinate compensation of the keypoints. S3, calculating the final coordinates of the keypoints from their initial coordinates and coordinate compensation.

The modules and the corresponding method steps are further explained below.

1. Pre-processing module

The preprocessing module 10 is used for preprocessing an image including a human body part, and includes two steps: step 101 and step 102.

101: Pre-train the convolutional neural network branch on the ImageNet dataset and save the weights for later use.

102: First, a pedestrian detector outputs a series of rectangular bounding boxes for the people in the image, and the boxed regions are cropped out to serve as training data. The training data are then resized to a uniform size (256 × 256, 256 × 192, or 384 × 288) and augmented with random cropping, rotation, flipping, occlusion, truncation, and so on.
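The cropping and resizing in step 102 can be sketched as below. This is a minimal NumPy sketch under stated assumptions: boxes use the (x, y, w, h) convention from the background section, "256 × 192" is read as height × width, the box is expanded (never shrunk) about its center to the target aspect ratio so the body is not distorted, and nearest-neighbor sampling stands in for a real image-resize routine. The function names are hypothetical and augmentation is omitted.

```python
import numpy as np


def fit_box_to_aspect(box, out_w=192, out_h=256):
    """Expand an (x, y, w, h) box about its center so that w / h == out_w / out_h."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    target = out_w / out_h
    if w / h > target:
        h = w / target  # box too wide: grow the height
    else:
        w = h * target  # box too tall (or exact): grow the width
    return (cx - w / 2.0, cy - h / 2.0, w, h)


def crop_and_resize(image, box, out_w=192, out_h=256):
    """Nearest-neighbor crop of `box` from an (H, W, ...) image to (out_h, out_w, ...).

    Output pixels that fall outside the source image are left as zeros.
    """
    x, y, w, h = fit_box_to_aspect(box, out_w, out_h)
    # Source-image position of each output pixel center.
    xs = x + (np.arange(out_w) + 0.5) * w / out_w
    ys = y + (np.arange(out_h) + 0.5) * h / out_h
    xi = np.floor(xs).astype(int)
    yi = np.floor(ys).astype(int)
    valid_x = (xi >= 0) & (xi < image.shape[1])
    valid_y = (yi >= 0) & (yi < image.shape[0])
    out = np.zeros((out_h, out_w) + image.shape[2:], dtype=image.dtype)
    out[np.ix_(valid_y, valid_x)] = image[np.ix_(yi[valid_y], xi[valid_x])]
    return out
```

For a 30 × 80 detection at (10, 20), the box is widened to 60 × 80 about the same center, so the crop keeps the detector's full height while matching the 3:4 input shape.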

2. Spatial structure representation network module

As shown in Fig. 2, the spatial structure characterization network module includes two branches and a connection layer. CNN₁ to CNN_M represent the convolutional neural network branch 21, which has M stages; GCN₁ to GCN_M represent the graph convolutional neural network branch 23, which also has M stages; and the arrows connecting corresponding stages of the CNN and GCN represent the connection layer 22. The convolutional neural network branch 21 takes the preprocessed image as input to obtain a keypoint heatmap, which is decoded to obtain the initial keypoint coordinates. The connection layer 22 connects each stage of the convolutional neural network branch to the corresponding stage of the graph convolutional neural network branch: it convolves the feature map of each stage of the convolutional neural network branch to obtain that stage's relay heatmap, and encodes and decodes each relay heatmap to generate the stage's node features and relay keypoint coordinates. The graph convolutional neural network branch 23 takes the node features and relay keypoint coordinates of each stage as input to obtain the keypoint coordinate compensation.

Let the input image be $I_{in}$. $I_{in}$ passes through several head convolutional layers, denoted $\mathrm{Conv}_{down}(\cdot)$, to obtain the initial feature $F_0$. Let the backbone of the convolutional neural network branch have $M$ stages, with the $k$-th stage denoted $\mathrm{CNN}_k(\cdot)$. This flow can be expressed as:

$$F_0 = \mathrm{Conv}_{down}(I_{in}), \quad F_k = \mathrm{CNN}_k(F_{k-1}) \tag{1}$$

where $1 \le k \le M$. The feature $F_k$ of the $k$-th stage passes through the connection layer, a $1\times1$ convolution, to generate the $k$-th stage relay heatmap

$$\mathcal{H}_k = \mathrm{Conv}_{1\times1}(F_k), \quad \mathcal{H}_k \in \mathbb{R}^{K \times H \times W}$$

where $\mathbb{R}$ denotes the real numbers, $K$ denotes the number of skeletal keypoint classes, and $H$ and $W$ are the height and width of the feature map, respectively. An encoding function $\mathrm{En}(\cdot)$ and a decoding function $\mathrm{De}(\cdot)$ then generate, from the $k$-th relay heatmap, the node features $N_k \in \mathbb{R}^{K+1}$ and the two-dimensional relay keypoint coordinates $C_k \in \mathbb{R}^{K \times 2}$, respectively:

$$N_k = \mathrm{En}(\mathcal{H}_k), \quad C_k = \mathrm{De}(\mathcal{H}_k) \tag{2}$$

where the encoding function $\mathrm{En}(\cdot)$ is a $1\times1$ convolutional layer, and the decoding function $\mathrm{De}(\cdot)$ is a weighted sum of the two-dimensional coordinates of the Top-k maximum responses of the heatmap. Because these values differ from the ground-truth labels, $\hat{N}_k$ and $\hat{C}_k$ denote the network predictions that are input into the graph convolutional neural network branch.

The graph convolutional neural network branch also has $M$ stages. The graph convolutional network of the $k$-th stage is defined as $\mathrm{GCN}_k(\cdot)$ and its output as $G_k$. $G_k$ consists of two parts: the node features $G_k^{(0)}$ and the keypoint coordinate compensation $G_k^{(1)}$. The main task of $\mathrm{GCN}_k(\cdot)$ is to extract the spatial structural relationships among the skeletal keypoints in order to generate a two-dimensional coordinate compensation corresponding to the input $\hat{C}_k$. $G_k$ is computed from $\hat{N}_k$, $\hat{C}_k$, and the node features and keypoint coordinates output by the preceding stage, where $\lambda$ is a hyperparameter that balances the inputs of $\mathrm{GCN}_k$. The keypoint prediction of the $k$-th stage then follows as:

$$\tilde{C}_k = \hat{C}_k + G_k^{(1)} \tag{4}$$

As described above, $G_k^{(1)}$ is the compensation for the relay keypoint coordinates $\hat{C}_k$, so $\tilde{C}_k$ conveniently replaces $\hat{C}_k$ as the optimized, refined coordinates. The final keypoint coordinates output by the final coordinate calculation module are the initial coordinates $C_{init}$ decoded from the keypoint heatmap plus the coordinate compensation $\Delta C$ produced by the graph convolutional neural network branch:

$$C_{final} = C_{init} + \Delta C \tag{5}$$
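The connection layer's $1\times1$ convolution and the decoding function $\mathrm{De}(\cdot)$ can be sketched in NumPy as follows. The sketch assumes a single image without a batch dimension, weights proportional to the selected responses and normalized to sum to 1, and a hypothetical default `top_n=4`; the text specifies neither the number of responses used nor how the weights are normalized, and all names are hypothetical.

```python
import numpy as np


def conv1x1(features: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """1x1 convolution: (C, H, W) features -> (K, H, W) relay heatmaps.

    weight has shape (K, C) and bias shape (K,); a 1x1 convolution is a
    per-pixel linear map over channels.
    """
    return np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]


def decode_top_n(heatmaps: np.ndarray, top_n: int = 4) -> np.ndarray:
    """De(.): weighted sum of the (x, y) positions of the top_n strongest
    responses in each channel, with weights proportional to the responses."""
    num_joints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_joints, -1)
    coords = np.zeros((num_joints, 2))
    for k in range(num_joints):
        idx = np.argpartition(flat[k], -top_n)[-top_n:]  # indices of the top_n values
        weights = np.clip(flat[k, idx], 0.0, None)       # ignore negative responses
        total = weights.sum()
        if total > 0:
            weights = weights / total
        else:
            weights = np.full(top_n, 1.0 / top_n)        # degenerate map: plain average
        ys, xs = np.unravel_index(idx, (height, width))
        coords[k] = (weights @ xs, weights @ ys)
    return coords
```

Averaging several strong responses instead of taking only the single peak gives sub-pixel coordinates; for example, two equal peaks at x = 1 and x = 3 on the same row decode to x = 2.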

2.1 Convolutional neural network branch

In this embodiment, the convolutional neural network branch is the high-resolution network HRNet, currently a leading architecture; its structure is shown in Fig. 3. The network is divided into 4 stages and, according to the spatial resolution of the feature maps, into four layers. Denoting the convolutional network at the i-th stage and j-th layer as $N_{ij}$, the spatial resolutions of the four layers from top to bottom are 1/4, 1/8, 1/16, and 1/32 of the original size. After the preprocessed training data are fed into the convolutional neural network branch, a relay heatmap is generated at each stage of HRNet. The relay heatmap is used for intermediate (relay) supervision, and it is also encoded to generate node features and decoded to generate relay keypoint coordinates, which together form the multiple inputs of the graph convolutional neural network branch. The keypoint heatmap finally output by the convolutional neural network branch is used to generate the initial coordinates of the human skeletal keypoints.

2.2 Graph convolutional neural network branch

To overcome the shortcomings of simple graph convolution layers in processing human skeleton data, this embodiment specifically designs the graph convolution layer in the graph convolutional neural network branch; its block diagram is shown in Fig. 4.

In Fig. 4, $J_{in}$ represents the spatial coordinate information of the keypoints, $H_{in}$ represents the node features of the local skeletal keypoints, and $G_{in}$ represents the global-context node feature of the relay heatmap.

2.2.1 Basic graph convolution layer. To ease training, this embodiment introduces a skip connection through a fully connected (fc) layer. In a basic graph convolution layer, a simple graph convolution is computed with the adjacency matrix $A$, the learnable parameter matrix $W$, and the input keypoint spatial coordinate information $J_{in}$, and the result is propagated forward. This process can be expressed as:

$$J_{out} = \sigma\big(fc_1(\sigma(A J_{in} W)) + fc_2(J_{in})\big) \tag{6}$$

where batch normalization is omitted, $fc_1$ and $fc_2$ are fully connected layers that do not share parameters, and $J_{out}$ represents the keypoint spatial coordinate information after processing by the graph convolution layer.
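Equation (6) can be sketched in NumPy as below. The sketch assumes that $\sigma$ is ReLU (the common choice named in the background section) and that each fully connected layer is a plain affine map; as in the text, batch normalization is omitted. All function and parameter names are hypothetical.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


def basic_graph_conv(j_in, adj, weight, fc1_w, fc1_b, fc2_w, fc2_b):
    """Equation (6): J_out = relu(fc1(relu(A @ J_in @ W)) + fc2(J_in)).

    j_in   : (K, D_in) keypoint features, e.g. D_in = 2 for raw coordinates
    adj    : (K, K) adjacency matrix A
    weight : (D_in, D_hid) learnable matrix W
    fc1_w, fc1_b : graph branch, D_hid -> D_out
    fc2_w, fc2_b : skip-connection branch, D_in -> D_out (not shared with fc1)
    """
    hidden = relu(adj @ j_in @ weight)                 # simple graph convolution
    graph_branch = hidden @ fc1_w + fc1_b              # fc1
    skip_branch = j_in @ fc2_w + fc2_b                 # fc2 on the raw input
    return relu(graph_branch + skip_branch)
```

The skip branch lets the layer pass its input through unchanged when the graph branch contributes nothing, which is what makes deeper stacks of these layers easier to train.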

2.2.2 Limb-based and background-based features. The features of local keypoints are encoded into the node features $H_{in}$, which therefore carry a limb-based representation. These local features contain rich spatial texture context and semantic description information: the former helps localize keypoints and the latter helps classify them. On the other hand, the features of some background regions carry little meaning and may even be noise; such information reduces both computational efficiency and detection performance. The invention is therefore designed to suppress the propagation of useless information so that the network concentrates on the regions around body keypoints. To this end, $H_{in}$ and $G_{in}$ are introduced as inputs to the skeleton-structure graph convolution layer, and the output limb-based and background-based features $H_{out}$ and $G_{out}$ are computed by Equation (7), in which $\mathrm{abs}(\cdot)$ denotes the absolute value, $[\cdot, \cdot]$ denotes concatenation, and $\mathrm{split}$ denotes channel-wise splitting of features.

2.2.3 Keypoint-aware guidance matrix. $H_{in}$ and $G_{in}$ are used to generate a keypoint-aware guidance matrix $A^{kw}$, as follows:

$$A^{kw} = fc(\mathrm{repeat}(H_{out})) \tag{8}$$

where the $\mathrm{repeat}(\cdot)$ function copies the features and concatenates the copies to produce a matrix, which is passed through a fully connected layer and batch normalization to produce $A^{kw}$. $A^{kw}$ retains rich limb-based features while suppressing interference from meaningless information. In addition, because of the copy operation and the fully connected layer, $A^{kw}$ can also model the spatial relationships between skeletal points. A simple graph convolution is then performed with $A^{kw} \odot A$, $W$, and $J_{in}$, where $\odot$ denotes element-wise matrix multiplication. The final $J_{out}$ can be expressed as:

$$J_{out} = \sigma\big(fc_1(\sigma((A^{kw} \odot A) J_{in} W)) + fc_2(J_{in})\big) \tag{9}$$
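A NumPy sketch of Equations (8) and (9) follows. The text does not give the exact shape produced by $\mathrm{repeat}(\cdot)$, so this sketch assumes that each node pair (i, j) is represented by concatenating copies of $H_{out}[i]$ and $H_{out}[j]$, and that one shared fully connected layer maps each pair to a single entry of $A^{kw}$. Batch normalization is omitted, $\sigma$ is taken as ReLU, and all names are hypothetical.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


def keypoint_aware_matrix(h_out: np.ndarray, fc_w: np.ndarray, fc_b: float) -> np.ndarray:
    """Equation (8) sketch: A_kw = fc(repeat(H_out)).

    h_out: (K, C) node features; fc_w: (2C,) weights; fc_b: scalar bias.
    Assumed repeat(): copy H_out along two axes and concatenate -> (K, K, 2C).
    """
    k = h_out.shape[0]
    rows = np.repeat(h_out[:, None, :], k, axis=1)  # rows[i, j] = H_out[i]
    cols = np.repeat(h_out[None, :, :], k, axis=0)  # cols[i, j] = H_out[j]
    pairs = np.concatenate([rows, cols], axis=-1)   # (K, K, 2C)
    return pairs @ fc_w + fc_b                      # (K, K)


def guided_graph_conv(j_in, adj, a_kw, weight, fc1_w, fc1_b, fc2_w, fc2_b):
    """Equation (9): J_out = relu(fc1(relu((A_kw * A) @ J_in @ W)) + fc2(J_in))."""
    guided_adj = a_kw * adj  # element-wise product keeps the skeleton's sparsity
    hidden = relu(guided_adj @ j_in @ weight)
    return relu(hidden @ fc1_w + fc1_b + j_in @ fc2_w + fc2_b)
```

Because $A^{kw}$ is multiplied element-wise with $A$, it can only reweight existing skeleton edges; pairs of keypoints that are not connected in $A$ still exchange no information in a single layer.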

2.2.4 Non-local blocks. The graph convolution layer designed above can be combined directly with non-local blocks. As shown in Fig. 5, adding non-local blocks (Non-Local) strengthens long-range dependencies between human skeletal keypoints.
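The text does not say which non-local variant is used. As an assumption, the sketch below applies the embedded-Gaussian non-local operation from Wang et al., "Non-local Neural Networks", to the K node features, with a residual connection; the projection matrices and function names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """Embedded-Gaussian non-local block over node features x: (K, C).

    Every node attends to every other node, so dependencies between distant
    keypoints (e.g. wrist and ankle) are captured in one step, unlike the
    first-order neighborhood of a graph convolution. With w_z = 0 the
    residual connection makes the block an identity map.
    """
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g  # each (K, C')
    attn = softmax(theta @ phi.T, axis=-1)           # (K, K), each row sums to 1
    return x + (attn @ g) @ w_z                      # project back to (K, C)
```

Initializing `w_z` at zero is a common way to insert such a block into an already-trained network without changing its initial output.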

2.3 Training loss function. The final training loss is the mean squared error of the heatmaps on the convolutional neural network branch plus the regression losses of the two-dimensional keypoint coordinates and of the limb lengths on the graph convolutional neural network branch:

$$L = \sum_{k=1}^{M}\left(L_k^{CNN} + \gamma\, L_k^{GCN}\right) \tag{10}$$

where $L$ represents the total loss, $\gamma$ is a hyperparameter, $M$ is the number of stages of the convolutional neural network branch (equal to that of the graph convolutional neural network branch), $L_k^{CNN}$ represents the loss of the convolutional neural network at stage $k$, and $L_k^{GCN}$ represents the loss of the graph convolutional neural network at stage $k$.
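A NumPy sketch of the training objective follows. It assumes the total loss has the form $L = \sum_{k=1}^{M}(L_k^{CNN} + \gamma L_k^{GCN})$, that $L_k^{GCN}$ adds an L1 coordinate term and an L1 limb-length term (the text only says "regression loss"), and that limbs are given as hypothetical (i, j) index pairs of connected keypoints. Function names are hypothetical.

```python
import numpy as np


def heatmap_mse(pred_hm: np.ndarray, gt_hm: np.ndarray) -> float:
    """CNN-branch loss: mean squared error between (K, H, W) heatmaps."""
    return float(np.mean((pred_hm - gt_hm) ** 2))


def limb_lengths(coords: np.ndarray, limbs) -> np.ndarray:
    """Length of each limb (i, j) given (K, 2) keypoint coordinates."""
    i, j = np.asarray(limbs).T
    return np.linalg.norm(coords[i] - coords[j], axis=-1)


def gcn_loss(pred_xy: np.ndarray, gt_xy: np.ndarray, limbs) -> float:
    """GCN-branch loss: coordinate regression plus limb-length regression (L1 assumed)."""
    coord_term = np.mean(np.abs(pred_xy - gt_xy))
    limb_term = np.mean(np.abs(limb_lengths(pred_xy, limbs) - limb_lengths(gt_xy, limbs)))
    return float(coord_term + limb_term)


def total_loss(stage_outputs, gt_hm, gt_xy, limbs, gamma: float = 1.0) -> float:
    """L = sum over the M stages of (L_k^CNN + gamma * L_k^GCN).

    stage_outputs: list of (pred_heatmaps, pred_coords) pairs, one per stage.
    """
    return sum(heatmap_mse(hm, gt_hm) + gamma * gcn_loss(xy, gt_xy, limbs)
               for hm, xy in stage_outputs)
```

The limb-length term penalizes predictions whose bones are too long or too short even when each keypoint's individual error is small, which is one way the spatial structure of the skeleton enters the loss.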

The overall process is shown in Fig. 6. The preprocessed image (a) is input into the CNN branch of the spatial structure characterization network module. The CNN is divided into 4 stages; the convolutional network at the i-th stage and j-th layer is denoted $N_{ij}$, and the spatial resolutions of the four layers from top to bottom are 1/4, 1/8, 1/16, and 1/32 of the original size. The CNN branch finally outputs the keypoint heatmap (upper right of the figure), which is decoded to obtain the initial keypoint coordinates.

Feature map (N) with maximum spatial resolution at each stage in convolutional neural network CNN₁₁、N₂₁、N₃₁、N₄₁) And respectively obtaining the relay thermodynamic diagrams (b) of each stage through a 1 × 1 convolutional layer, and encoding and decoding the relay thermodynamic diagrams to generate the corresponding node characteristics (c) and the relay key point coordinates (d) of each stage.

The node features (c) and relay keypoint coordinates (d) of each stage are then input into the corresponding stage (GCN₁, GCN₂, GCN₃, GCN₄) of the graph convolutional neural network branch of the spatial structure representation network module, which outputs the keypoint coordinate compensation.

Finally, the final keypoint coordinates (e) are calculated from the initial keypoint coordinates output by the convolutional neural network branch and the keypoint coordinate compensation output by the graph convolutional neural network branch.

The method was applied to human skeletal keypoint localization; the resulting experimental outputs, shown in Fig. 7, show accurate keypoint localization.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method of the embodiments of the present invention.

Example 2

This example differs from embodiment 1 in that the convolutional neural network branch is replaced with SimpleBaseline-152 from the paper "Simple Baselines for Human Pose Estimation and Tracking", and the graph convolutional neural network branch is replaced with SemGCN from the paper "Semantic Graph Convolutional Networks for 3D Human Pose Regression".

The foregoing is a more detailed description of the invention in connection with specific/preferred embodiments and is not intended to limit the practice of the invention to those descriptions. It will be apparent to those skilled in the art that various substitutions and modifications can be made to the described embodiments without departing from the spirit of the invention, and these substitutions and modifications should be considered to fall within the scope of the invention.

Claims

1. A method for locating keypoints of human body parts, characterized by comprising:

S1, preprocessing an image containing a human body part;

S2, inputting the image preprocessed in step S1 into a convolutional neural network branch of a spatial structure representation network module to obtain a key point heatmap, and decoding the key point heatmap to obtain initial coordinates of the key points;

convolving the feature map of each stage of the convolutional neural network branch through a connecting layer of the spatial structure representation network module to obtain a relay heatmap for each stage, and encoding and decoding the relay heatmap of each stage to generate the node features and relay key point coordinates of that stage;

inputting the node features and relay key point coordinates of each stage into the corresponding stage of the graph convolutional neural network branch of the spatial structure representation network module to obtain the coordinate compensation of the key points;

and S3, calculating the final coordinates of the key points from the initial coordinates and the coordinate compensation of the key points.

2. The method for locating key points of human body parts according to claim 1, wherein the preprocessing of step S1 comprises:

pre-training the convolutional neural network branch on the ImageNet dataset;

and detecting the human body parts in the image one by one with a detector, and performing data augmentation.

3. The method according to claim 1, wherein the convolutional neural network branch comprises HRNet or SimpleBaseline-152.

4. The method according to claim 3, wherein the HRNet comprises 4 stages and is divided into four layers by spatial resolution, the spatial resolutions of the four layers from top to bottom being 1/4, 1/8, 1/16 and 1/32.

5. The method of claim 1, wherein the graph convolutional neural network branch comprises a SemGCN.

6. The method of claim 1, wherein the output of a graph convolution layer in the graph convolutional neural network branch is:

$$J_{out} = \sigma\left(fc_1\left(\sigma\left((A^{kw} \odot A)\, J_{in} W\right)\right) + fc_2(J_{in})\right)$$

wherein $fc_1$ and $fc_2$ are fully connected layers that do not share weights; $A$ is the adjacency matrix; $W$ is a learnable parameter matrix; $J_{in}$ is the input spatial coordinate information of the key points; $A^{kw}$ is a keypoint-aware guidance matrix, $A^{kw} = fc(repeat(H_{out}))$, where the $repeat(\cdot)$ function repeats the feature map and concatenates the copies into a matrix, which is passed through a fully connected layer and batch normalization to generate $A^{kw}$;

$H_{out}$ satisfies:

wherein $H_{in}$ and $G_{in}$ are the inputs of the skeleton graph convolution layer, $abs(\cdot)$ denotes the absolute value, $[\cdot, \cdot]$ denotes concatenation, and $split$ denotes splitting the features by channel; $H_{out}$ and $G_{out}$ are the output limb-based and background-based features, respectively.

7. The method according to claim 1, wherein the training loss function of the spatial structure representation network module is:

wherein $\gamma$ is a hyperparameter, $M$ is the number of stages of the convolutional neural network branch or the graph convolutional neural network branch,

represents the loss function of the convolutional neural network branch at stage $k$,

8. The method of claim 1, wherein the human body part comprises: the human skeleton, the human face, and the human palm.

9. A computer-readable storage medium storing a computer program which, when executed by a processor, carries out the method for locating key points of human body parts according to any one of claims 1 to 8.

10. A system for locating key points of human body parts, comprising:

a preprocessing module, configured to preprocess an image containing a human body part;

a spatial structure representation network module, comprising:

a convolutional neural network branch, configured to receive the preprocessed image to obtain a key point heatmap, and to decode the key point heatmap to obtain initial coordinates of the key points;

connecting layers, configured to connect the stages of the convolutional neural network branch with the corresponding stages of the graph convolutional neural network branch, to convolve the feature map of each stage of the convolutional neural network branch to obtain a relay heatmap for that stage, and to encode and decode the relay heatmap of each stage to generate the node features and relay key point coordinates of that stage;

a graph convolutional neural network branch, configured to receive the node features and relay key point coordinates of each stage to obtain the coordinate compensation of the key points;

a final coordinate calculation module, configured to calculate the final coordinates of the key points from the initial coordinates and the coordinate compensation of the key points.
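The graph convolution layer defined in claim 6 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. It takes σ to be ReLU and models fc_1 and fc_2 as separate bias-free linear maps. It also passes A^kw in directly, because the equation defining H_out is not reproduced in the claims. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def hadamard(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise product a ⊙ b."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum a + b."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def relu(a: Matrix) -> Matrix:
    """Assumed choice for the activation sigma."""
    return [[max(0.0, x) for x in row] for row in a]


def graph_conv_layer(j_in: Matrix, adj: Matrix, a_kw: Matrix,
                     w: Matrix, fc1_w: Matrix, fc2_w: Matrix) -> Matrix:
    """J_out = sigma(fc1(sigma((A^kw ⊙ A) J_in W)) + fc2(J_in)).

    j_in:  (N, C_in) key point coordinates/features
    adj:   (N, N) skeleton adjacency matrix A
    a_kw:  (N, N) keypoint-aware guidance matrix A^kw, taken as given here
    w:     (C_in, C_hid) learnable matrix W
    fc1_w: (C_hid, C_out) and fc2_w: (C_in, C_out), separate bias-free linear maps
    """
    guided_adj = hadamard(a_kw, adj)                    # A^kw ⊙ A
    hidden = relu(matmul(matmul(guided_adj, j_in), w))  # sigma((A^kw ⊙ A) J_in W)
    return relu(add(matmul(hidden, fc1_w), matmul(j_in, fc2_w)))


if __name__ == "__main__":
    adj = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]   # chain 0-1-2 with self-loops
    a_kw = [[1.0, 0.5, 1.0], [0.5, 1.0, 1.0], [1.0, 1.0, 1.0]]  # halves the 0-1 edge weight
    eye = [[1.0, 0.0], [0.0, 1.0]]
    j_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                 # toy 2D key point coordinates
    print(graph_conv_layer(j_in, adj, a_kw, eye, eye, eye))
    # [[2.0, 0.5], [1.5, 3.0], [2.0, 3.0]]
```

Because A^kw enters through an element-wise product with A, it can only re-weight edges that already exist in the skeleton graph. In the example, a_kw[0][2] is 1.0 but key points 0 and 2 remain unconnected. The fc_2(J_in) term acts as a skip path around the graph aggregation.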