CN112801138A - Multi-person pose estimation method based on human body topological structure alignment - Google Patents

Multi-person pose estimation method based on human body topological structure alignment

Info

Publication number
CN112801138A
CN112801138A (application CN202110009492.0A)
Authority
CN
China
Prior art keywords
network
xzznet
human body
image
key point
Prior art date
Legal status
Granted
Application number
CN202110009492.0A
Other languages
Chinese (zh)
Other versions
CN112801138B (en)
Inventor
李浥东
郎丛妍
孙鑫雨
冯紫钰
赵治坤
汪敏
Current Assignee
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202110009492.0A priority Critical patent/CN112801138B/en
Publication of CN112801138A publication Critical patent/CN112801138A/en
Application granted granted Critical
Publication of CN112801138B publication Critical patent/CN112801138B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition

Abstract

The invention provides a multi-person pose estimation method based on human body topological structure alignment. The method comprises the following steps: training the HRNet network with the MS-COCO and MPII data sets as input to obtain an XZZNet network; learning the image samples of these data sets with the HRNet network to obtain human body keypoint maps; inputting the SZF data set into the XZZNet network to generate candidate keypoints of sleeveless human poses; performing graph matching between the human body keypoint map generated by the HRNet network and the candidate human body keypoint map generated by the XZZNet network, and fine-tuning the XZZNet network with a cross-entropy loss function to obtain an optimized XZZNet network; and inputting the SZF data set into the optimized XZZNet network, generating keypoint detection images corresponding to the images in the SZF data set, and obtaining from them the pose information of each human body keypoint contained in the image. The invention can significantly improve performance on the target domain for images with no annotations or only sparse annotations, and can accurately distinguish each keypoint of the human body in the image under an unsupervised network learning framework.

Description

Multi-person pose estimation method based on human body topological structure alignment
Technical Field
The invention relates to the technical field of human body behavior analysis, in particular to a multi-person posture estimation method based on human body topological structure alignment.
Background
In recent years, the development of information technology and the popularization of intelligent technology have further promoted global technological change, and technologies such as cloud computing, the Internet of Things, big data and artificial intelligence have developed rapidly; among them, human body posture recognition technology is widely applied in related fields of computer vision.
Human body posture recognition in fixed scenes is a hot spot of current artificial intelligence technology and has very important research significance; it also plays a certain role in promoting modern construction in China, so strengthening its technical analysis and study is very important. As early as the 1970s, China began research on human behavior analysis, which strongly promoted the development of artificial intelligence in China, and analyzing relatively simple gestures and actions in specific situations or more standard scenes has become possible.
With the continuous improvement of the standard of living in China, people's requirements for the quality of social life keep increasing, so video surveillance has become an indispensable safety measure in daily life, and the technical requirements for video analysis are becoming higher and higher. For example, human body posture recognition is widely applied in industries such as smart home, the medical field and motion analysis, and human body posture recognition in fixed scenes plays an obvious role in various fields. In particular, in recent years, the strengthening of security work in our country places strong demands on handling the dense and mobile population of large cities, identifying criminals, and the like.
At present, there is no effective multi-person pose estimation method based on human body topological structure alignment in the prior art.
Disclosure of Invention
The embodiment of the invention provides a multi-person posture estimation method based on human body topological structure alignment, so that each key point of a human body in an image can be accurately distinguished under an unsupervised network learning framework.
In order to achieve the purpose, the invention adopts the following technical scheme.
(corresponding claims)
According to the technical scheme provided by the embodiment of the invention, the performance on the target domain can be obviously improved for images with no annotations or only sparse annotations, and each keypoint of the human body can be accurately distinguished under an unsupervised network learning framework.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a multi-person pose estimation method applied to an image based on human body topological structure alignment according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
The flow chart of the multi-person posture estimation method applied to the image based on human body topological structure alignment provided by the embodiment of the invention is shown in figure 1, and the method comprises the following steps:
Step 1: crawl sleeveless human body images with a web crawler to construct the SZF (Sleeve-Zero Figures) data set.
Step 2: learn the image samples of the MS-COCO and MPII data sets with an HRNet network to obtain human body keypoint images, and train the HRNet network with the MS-COCO and MPII data sets as input to obtain a robust XZZNet network.
The MS-COCO data set (Common Objects in Context) is a data set provided by the Microsoft team for image recognition, covering tasks such as detection, segmentation and keypoints. The images in MS-COCO include natural images and images of common objects in daily life; the backgrounds are complex, the number of targets is large and the targets are small, which makes tasks on MS-COCO difficult. Furthermore, the MS-COCO data set contains 91 classes of images. Its training set has 82,783 images, its validation set has 40,504 images and its test set has 40,775 images. Each annotated person provides 17 body keypoints.
The MPII human pose dataset consists of images taken from the real world with full body pose annotations. There were approximately 25K images, 40K subjects, of which approximately 7K were used for testing and the remaining 18K for training and validation. Each image provides 16 body keypoints.
Step 3: input the SZF data set into the XZZNet network, which generates candidate sleeveless human pose keypoints.
Step 4: build graph models for the human body keypoint map generated by the HRNet network and the candidate human body keypoint map generated by the XZZNet network and perform graph matching; continuously correct through the supervised loss function so that the graph models become more sensitive to keypoints in human model recognition; learn a generalized high-order structure-invariant representation over the two domains by minimizing the loss function, which helps determine keypoint locations; and adjust the XZZNet network with the cross-entropy loss function to obtain the optimized XZZNet network.
Step 5: with the optimized XZZNet network and the SZF data set as input, generate keypoint detection images corresponding to the images in the SZF data set, and obtain from them the pose information of each human body keypoint contained in the image.
The processing procedure of step 1 specifically comprises acquiring a large number of sleeveless human body images from the web with a python crawler. Python has powerful crawling capabilities and a mature, efficient distributed crawler framework (Scrapy-Redis), which makes it convenient to download web pages efficiently; its multithreading and multiprocess models are mature and stable, and using multiple threads or processes can optimize program efficiency and improve the downloading and parsing capability of the whole system. Meanwhile, python has excellent third-party packages that can simulate user-agent behaviour to construct proper requests and prevent the website from blocking the crawler.
The web crawler is implemented with python's requests package: the requests library is a module commonly used for HTTP requests, and importing it makes it more convenient to crawl web pages in a program. Knowing the encoding of the web page, regular expressions are used in the program to match the defined data format, and the successfully matched URLs are converted into strings and stored in a dictionary; the URL where the externally specified image is located is then requested, and the file at the specified location is opened with the open() function.
The main python routine covers three aspects: first, obtaining the target webpage address (for example, the image data set can be acquired through Baidu image search); second, calling the image-capture function; and third, looping over the image URLs stored in the dictionary.
Unqualified images are deleted from the SZF image data set, the originally garbled file names are renamed to easily recognizable names with python code, and the SZF data set is annotated. After the annotation is finished, the images and annotation files are read in and stored one by one in order to form the SZF image data set.
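For illustration only, a minimal Python sketch of this crawling and renaming step; the page URLs, the image-link regular expression and the file-naming scheme below are assumptions and not part of the invention:

```python
import os
import re
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # mimic a browser to avoid simple anti-crawler checks

def crawl_images(page_urls, out_dir="SZF_raw"):
    """Download candidate sleeveless-figure images from a list of page URLs (hypothetical)."""
    os.makedirs(out_dir, exist_ok=True)
    img_pattern = re.compile(r'https?://[^"\']+?\.(?:jpg|jpeg|png)')  # assumed data format
    seen = {}
    for page_url in page_urls:
        resp = requests.get(page_url, headers=HEADERS, timeout=10)
        resp.encoding = resp.apparent_encoding            # determine the page encoding first
        for img_url in img_pattern.findall(resp.text):
            if img_url in seen:                           # URLs are stored in a dictionary
                continue
            seen[img_url] = True
            img = requests.get(img_url, headers=HEADERS, timeout=10).content
            name = f"szf_{len(seen):06d}.jpg"             # rename garbled names to readable ones
            with open(os.path.join(out_dir, name), "wb") as f:
                f.write(img)

# usage: crawl_images(["https://example.com/gallery?page=1"])
```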
The processing procedure of the step 2 specifically includes:
turning (flip), cutting (crop) and remolding (reshape) an original image in the MS-COCO and MPLL data sets, and training the HRNet network by taking the changed image as input to obtain a robust XZZNet network; learning image samples of the MS-COCO and MPLL data sets by using an HRNet network to obtain a human body key point diagram, averaging the heat diagrams generated by the changed images, and predicting the position of each human body key point by adjusting the position of the highest heat value and shifting the position by one fourth in the direction from the highest response to the second highest response.
The HRNet neural network model is trained using the existing MS-COCO and MPII data sets. In the whole process, exchange units across the parallel subnetworks are introduced into the HRNet network, and information is repeatedly exchanged across the parallel multi-resolution subnetworks to perform repeated multi-scale fusion, so that each subnetwork repeatedly receives information from the other parallel subnetworks. The keypoints are estimated on the high-resolution representation output by the network.
The HRNet network used by the invention contains four stages; the main body consists of four parallel subnetworks whose resolution is gradually halved while the corresponding width (number of channels) is doubled. The first stage contains 4 residual modules, each of which, like ResNet-50, is composed of a bottleneck block with 64 channels; this is followed by one 3×3 convolution that reduces the number of channels to C. The second, third and fourth stages contain 1, 4 and 3 exchange modules, respectively. One exchange module contains 4 residual modules (each containing two 3×3 convolutional layers) and one exchange unit. In total there are 8 exchange modules, i.e. 8 rounds of multi-scale fusion. During training, the first stage generates the top-level (highest-resolution) features, and the feature resolution is gradually reduced in the following stages; thus, the parallel subnetworks of a later stage consist of the subnetworks of the previous stage plus an additional lower-resolution one. The heat map is simply regressed from the high-resolution feature representation output by the last exchange module. The loss function is defined as the mean square error and is used to compare the predicted heat maps with the ground-truth heat maps. The ground-truth heat maps are generated with a two-dimensional Gaussian distribution centered on the ground-truth position of each keypoint with a standard deviation of 1 pixel.
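For reference, a minimal sketch of the Gaussian ground-truth heat-map generation and mean-square-error loss described above; the sizes used are illustrative:

```python
import torch

def gaussian_heatmap(size, center, sigma=1.0):
    """2D Gaussian heat map centered on the ground-truth keypoint position (std = 1 pixel)."""
    h, w = size
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    cx, cy = center
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# predicted and ground-truth heat maps for K keypoints
pred = torch.rand(17, 64, 48)
gt = torch.stack([gaussian_heatmap((64, 48), (24, 32)) for _ in range(17)])
loss = torch.nn.functional.mse_loss(pred, gt)   # mean square error between the two heat maps
```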
Training on the MS-COCO dataset:
The aspect ratio of the human detection bounding box (object detection bounding box) is expanded to a fixed aspect ratio, height:width = 4:3, and the box is then cropped from the image and resized to a fixed size, 256×192 or 384×288. Data augmentation includes random rotation ([-45°, 45°]), random scaling ([0.65, 1.35]) and random flipping. We use the Adam optimizer; the initial learning rate is set to 1e-3, decayed at the 170th and 200th epochs, and training ends at 210 epochs.
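A small sketch of the fixed-aspect-ratio expansion of the detection box; the exact expansion rule beyond the 4:3 ratio is an assumption here:

```python
def expand_box_to_ratio(x, y, w, h, target_ratio=4 / 3):
    """Expand a detection box (top-left x, y, width, height) to height:width = 4:3."""
    cx, cy = x + w / 2, y + h / 2          # keep the box center fixed
    if h / w > target_ratio:
        w = h / target_ratio                # too tall: widen the box
    else:
        h = w * target_ratio                # too wide: heighten the box
    return cx - w / 2, cy - h / 2, w, h

print(expand_box_to_ratio(10, 20, 50, 120))  # the box is then cropped and resized to 256x192 or 384x288
```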
Training on the MPII dataset:
The testing procedure is almost the same as on COCO, except that the standard test strategy is used. The standard metric, the PCKh (head-normalized probability of correct keypoints) score, is used: a joint is correct if it falls within α·l pixels of the ground-truth position, where α is a constant and l is the head segment length, defined as 60% of the diagonal length of the ground-truth head bounding box. The PCKh@0.5 (α = 0.5) score is reported.
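A rough sketch of the PCKh metric as described above, assuming the usual MPII head-size convention:

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """pred, gt: (N, K, 2) keypoint coordinates; head_sizes: (N,) head segment lengths l.
    A joint counts as correct if its distance to the ground truth is at most alpha * l."""
    dist = np.linalg.norm(pred - gt, axis=-1)                 # (N, K) joint-wise distances
    thresh = alpha * head_sizes[:, None]                      # per-image threshold
    return float(np.mean(dist <= thresh))

# usage with dummy data
pred = np.random.rand(4, 16, 2) * 100
gt = pred + np.random.randn(4, 16, 2)
print(pckh(pred, gt, head_sizes=np.full(4, 30.0)))            # PCKh@0.5 score
```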
The HRNet network is trained on the MS-COCO data set and the MPII data set to obtain the XZZNet network.
The XZZNet network and the HRNet network have the same network structure, but their weight parameters are different; the XZZNet network at this point is equivalent to the HRNet model pre-trained on the MS-COCO and MPII data sets in the previous step.
The processing procedure of step 3 specifically includes: the XZZNet network is obtained by training the HRNet network, so its structure is the same as that of the HRNet network (the convolutional layers are the same; the network parameters are different and are continuously updated). After the construction of the SZF data set and the generation of the network XZZNet through supervised training are completed, the previously constructed SZF data set is input into the XZZNet network, and the pre-trained network together with the CAFA (cross-attention alignment) module in the XZZNet network uses the L_fd function to learn domain-invariant and fine-grained cross-domain human representations:
l_fd = || (1/M) Σ_{i=1..M} φ(F_{s,i}') - (1/N) Σ_{j=1..N} φ(F_{t,j}') ||²_H
Fine-grained features are effective for accurate pose estimation. The goal of CAFA is to adapt more domain-invariant, fine-grained human features across domains. Unlike previous feature adaptation methods, we capture cross-domain correlated fine-grained features through BSAM, which explores local spatial feature dependencies across domains rather than simply considering the domain features separately. By exploring feature interactions in a bidirectional manner, fine-grained human features can be well encoded for each domain. Specifically, we design a source-to-target adaptation (STA) mechanism to enhance the source human features by adaptively aggregating target features based on their similarities. Similarly, we also use a target-to-source adaptation (TSA) mechanism to update the target domain features by aggregating the relevant source domain features. The details of CAFA are described below.
Given a sample pair x_s, x_t (one from the source domain and one from the target domain), a feature extractor (feature encoder) generates the corresponding features F_s and F_t, and two convolutional layers are applied to generate A and B, respectively. In addition, F_s and F_t are also fed into another convolutional layer to obtain S_c and T_c.
To determine the fine-grained feature dependency between each pair of corresponding positions in F_s and F_t, a correlation map Φ = AᵀB is computed, where Φ^(i,j) measures the correlation between the i-th position of F_s and the j-th position of F_t. To let F_s and F_t enhance each other, a bidirectional enhancement mechanism is adopted: 1) the source-to-target adaptation mechanism (STA):
in STA, we define the spatial association graph of source domain to target domain as:
The spatial association graph ψ_{s→t} is derived from the correlation map Φ (Eq. 1); its element ψ_{s→t}^(i,j) represents the influence of the i-th position of F_s on the j-th position of F_t. To exploit the fine-grained features with similar spatial responses in the target domain, F_s is updated as
F_s' = F_s + λ_s T_c ψ_{s→t} (Eq. 2)
where λ_s weighs the importance of the target-domain-related spatial information against the source domain features. In this way, the responses of similar target features are encoded into each position of F_s'.
2) Target Domain to Source Domain Adaptation mechanism (TSA)
Similarly, the correlation map from the target domain to the source domain, ψ_{t→s}, can be obtained according to Eq. 1; its element ψ_{t→s}^(j,i) indicates the influence of the j-th position of F_t on the i-th position of F_s. Combining the similar fine-grained source-domain responses with the original target features as in Eq. 2, F_t is updated to F_t'. With F_s' and F_t', more fine-grained features can be encoded for each domain.
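Under the reconstruction of Eq. 1 and Eq. 2 given above (itself an assumption where the original formula images did not survive extraction), the bidirectional STA/TSA update could be sketched as follows; the softmax normalization and the use of F_s, F_t in place of the projection branches A, B are assumptions:

```python
import torch

def cafa_bidirectional(Fs, Ft, Sc, Tc, lam_s=0.5, lam_t=0.5):
    """Fs, Ft, Sc, Tc: (C, N) feature maps flattened over spatial positions."""
    A, B = Fs, Ft                                   # stand-ins for the two projection branches
    phi = A.t() @ B                                 # correlation map, shape (N_s, N_t)
    psi_s2t = torch.softmax(phi, dim=1)             # source-to-target association (assumed softmax)
    psi_t2s = torch.softmax(phi.t(), dim=1)         # target-to-source association
    Fs_new = Fs + lam_s * (Tc @ psi_s2t.t())        # aggregate target features into the source
    Ft_new = Ft + lam_t * (Sc @ psi_t2s.t())        # aggregate source features into the target
    return Fs_new, Ft_new

Fs, Ft = torch.rand(256, 48 * 64), torch.rand(256, 48 * 64)
Sc, Tc = torch.rand(256, 48 * 64), torch.rand(256, 48 * 64)
Fs2, Ft2 = cafa_bidirectional(Fs, Ft, Sc, Tc)
```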
Finally, F_s' and F_t' are aligned by applying the maximum mean discrepancy loss l_fd (MMD for short):
l_fd = || (1/M) Σ_{i=1..M} φ(F_{s,i}') - (1/N) Σ_{j=1..N} φ(F_{t,j}') ||²_H
where M and N represent the numbers of sample images of the source domain and the target domain, F_{s,i}' and F_{t,j}' denote F_s' at the i-th position and F_t' at the j-th position, and φ is a mapping operation that projects domain features into a kernel Hilbert space H. An arbitrary feature distribution can be represented by the kernel embedding technique, which allows us to minimize l_fd to learn domain-invariant fine-grained human features.
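A minimal sketch of an MMD-style alignment loss of this kind, assuming a Gaussian kernel for the mapping φ (the actual kernel choice is not given in the source):

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """x: (M, D), y: (N, D) -> (M, N) kernel matrix."""
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(Fs, Ft, sigma=1.0):
    """Squared MMD between source features Fs (M, D) and target features Ft (N, D)."""
    k_ss = gaussian_kernel(Fs, Fs, sigma).mean()
    k_tt = gaussian_kernel(Ft, Ft, sigma).mean()
    k_st = gaussian_kernel(Fs, Ft, sigma).mean()
    return k_ss + k_tt - 2 * k_st

loss_fd = mmd_loss(torch.rand(32, 256), torch.rand(32, 256))
```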
Keypoints of the image samples in the SZF data set are marked and annotated. A modified Simple Baseline is adopted as the baseline, and pose prediction is performed on it with an encoder-decoder framework. The feature extractor generates the corresponding image triplet features, from which the adaptive features F_s, F_t are obtained through the CAFA module; the features F_s, F_t are then input into the pose estimator to predict the respective keypoint heat maps and generate candidate human body keypoint images.
The processing procedure of the step 4 specifically includes:
human body key point images (heat maps) marked by an HRNet network and candidate marked human body key point images (heat maps) generated by an XZZNet network are subjected to intra-domain structure self-adaptation by human body topology alignment, and then graph construction and graph matching operation are carried out to achieve the effect of inter-domain structure alignment. And then, reversely fine-tuning the XZZNet network according to the loss error given by the cross entropy function, and correcting the candidate human body key point images.
When graph matching is performed between the human body keypoint images (heat maps) labeled by the HRNet network and the candidate labeled human body keypoint images (heat maps) generated by the XZZNet network, aligning only the first-order keypoints within a domain, i.e. "intra-domain structure adaptation", cannot cope well with large pose differences and severe geometric deformation, especially under severe cross-domain occlusion. Therefore, the invention considers domain-invariant human topology knowledge and adopts IHTA (Inter-domain Human-Topology Alignment; the module performs inter-domain human body topology alignment based on a GCN model, and SemGCN is also a GCN mechanism) to solve this problem. IHTA is designed with a Graph Convolutional neural Network (GCN for short); this mechanism provides an explicit way to model the high-order human skeleton structure, helps to obtain the spatial topology information of the joints, and makes inter-domain human topology adaptation effective and reliable.
The XZZNet network is fine-tuned by combining the entropy loss of the heat maps with the loss function of high-order topology matching, and the model parameters are continuously updated by back propagation so as to train the whole model. We use the parameters pre-trained in the second step to initialize the unsupervised network, then fine-tune the XZZNet network by minimizing the loss function that aligns the two topologies generated by the HRNet network and the XZZNet network, and continuously update the model parameters through the back-propagation (BP) algorithm to optimize the model.
1. Local keypoint feature extraction
First, based on the feature map F and the keypoint heat maps Y_i^kp, the semantic local features of the keypoint set V_KP of the two domains can be obtained through an outer product followed by global average pooling:
v_i = GAP(F ⊗ Y_i^kp) (Eq. 1)
where ⊗ denotes the outer product (element-wise weighting of the feature map by the heat map) and GAP denotes global average pooling.
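A sketch of the heat-map-weighted pooling reconstructed above; the element-wise weighting followed by global average pooling is an assumption consistent with the description:

```python
import torch

def keypoint_local_features(F, heatmaps):
    """F: (C, H, W) feature map; heatmaps: (K, H, W) keypoint heat maps.
    Returns (K, C) semantic local features, one per keypoint."""
    weighted = F.unsqueeze(0) * heatmaps.unsqueeze(1)   # (K, C, H, W): outer-product-style weighting
    return weighted.mean(dim=(2, 3))                    # global average pooling

feats = keypoint_local_features(torch.rand(256, 64, 48), torch.rand(17, 64, 48))
print(feats.shape)  # torch.Size([17, 256])
```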
2. Graph representation
A visual topological graph representation G = (V, E) is constructed, where V is the set of nodes in the graph, V = {v_i | i = 1, 2, ..., H}, and E is the set of edges, which can be written as
E = {(v_i, v_j) | i, j = 1, 2, ..., H, v_i and v_j are connected in the graph} (Eq. 2)
Here the concept of the adjacency matrix A is introduced: its element a_ij equals 1 if and only if v_i and v_j are adjacent in the topological graph, and 0 otherwise.
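A small sketch of the adjacency-matrix construction for a human skeleton graph; the edge list below is an illustrative subset, not the patent's exact joint connectivity:

```python
import numpy as np

H = 17  # number of keypoints / graph nodes
edges = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6)]  # hypothetical limb connections

A = np.zeros((H, H), dtype=np.float32)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0   # a_ij = 1 iff v_i and v_j are adjacent in the topology graph
```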
3. Graph convolution network
Obviously, the human body structure can naturally be regarded as a graph, with latent spatial constraints between the joints. The human joints can be regarded as the keypoints v_i and the limbs as the edges e_ij between keypoints. Based on this view, the topological representation of the person is modeled with SemGCN (Semantic Graph Convolutional Networks). For a graph convolution model, propagating features through adjacent nodes helps to learn robust local structures and the relationship information between nodes. At the same time, non-local layers are adopted to help capture local and global long-range dependencies between nodes, so as to obtain more human context information. This yields robust human topology information, which is a prerequisite for learning structure-invariant information across domains. Applying graph convolution propagation to node i involves two steps: first, the node representations are transformed by a learnable parameter matrix; second, the transformed node representations are collected from the neighbor nodes j to node i, after which a ReLU activation is applied. The node features are collected into a matrix, and the semantic graph convolution network then applies a different weight matrix to each channel of the node features:
v_{l+1} = ||_{d=1..D} ReLU((M_d ⊙ A) v_l w_d) (Eq. 3)
where v_l and v_{l+1} are the node representations before and after the l-th convolution, M_d is a set of learnable H×H parameter matrices whose weight vectors represent the local semantic knowledge of the neighboring joints implied in the graph, A is the adjacency matrix, ⊙ is element-wise multiplication, || denotes channel-wise concatenation, and w_d is the d-th row of the transformation matrix, which learns channel weights as prior edges in the graph (e.g., how one joint affects other body parts in pose estimation) to enhance the graph representation.
Then, following the non-local idea, the feature update operation is defined as:
v_i' = v_i + (W_v / H) Σ_{j=1..H} f(v_i, v_j) g(v_j) (Eq. 4)
where W_v is initialized to zero, f computes the correlation between node i and every other node j, and g computes the representation of node j; the correspondence between nodes is computed from the node features so as to capture the local and global relationships between the nodes.
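And a corresponding sketch of the non-local feature update of Eq. 4; taking f as an embedded dot-product affinity and g as the identity is an assumption:

```python
import torch

def non_local_update(V, Wv_scale=0.0):
    """V: (H, C) node features. Wv_scale plays the role of W_v and starts at zero."""
    affinity = torch.softmax(V @ V.t(), dim=1)   # f(v_i, v_j): pairwise affinities between nodes
    g = V                                         # g(v_j): representation of node j (identity here)
    return V + Wv_scale * (affinity @ g)          # captures local and global node relationships

V = torch.rand(17, 64)
V_new = non_local_update(V, Wv_scale=0.1)
```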
4. Connection prediction
After the two topological graphs are obtained, a hard alignment is not required. A link prediction method is therefore proposed: the nodes of the two topological graphs are connected pairwise, and a score f(s, r, o) is assigned to each possible edge (s, r, o) to determine how likely it is to belong to E. To solve this problem, a graph auto-encoder model is introduced, consisting of an entity encoder and a scoring function (decoder). The encoder maps each entity v_i ∈ V to a real-valued vector e_i; the decoder reconstructs the edges of the graph from the vertex representations, in other words, it evaluates the score of a (subject, relation, object) triplet through a function S. The key feature of our work compared with others lies in the encoder: most previous methods directly use a single real-valued vector e_i for each v_i ∈ V during training, whereas we compute the node representation through an R-GCN encoder, e_i = h_i^(L). We use DistMult factorization as the scoring function, which performs well in standard link prediction; in this approach each relation r is associated with a diagonal matrix R_r, and the score of a triplet (s, r, o) is:
f(s, r, o) = e_s^T R_r e_o (Eq. 5)
We optimize the cross-entropy loss, obtaining the following loss function:
L = -(1 / (2|T|)) Σ_{(s,r,o,y)∈T} [ y log l(f(s, r, o)) + (1 - y) log(1 - l(f(s, r, o))) ] (Eq. 6)
where T is the set of all triplets, l is the logistic sigmoid function, and y is an indicator that equals 1 for positive triplets and 0 for negative triplets.
5. Cross-graph topological alignment
Based on the above steps, cross-graph alignment is performed: the joint relationship information learned by the local semantic network is aligned between the two persons. For samples x_s and x_t from the two domains, the updated joint representations are first obtained through Eq. 3 and Eq. 4, and a 1×1 convolution is then applied to obtain the final topological representation of the person. Finally, the top-scoring edge sets are selected through step 4, and their vertex sets are aligned through Eq. 7 by minimizing a structure-matching loss L_sm between G_s^(i) and G_t^(i), which denote the topological representation of the i-th node in the two domains. Minimizing the loss function L_sm helps determine the keypoint locations and thereby learns a generalized high-order structure-invariant representation over the two domains.
The fine-tuning of the network comes after link prediction is completed. Fine-tuning combines the entropy loss of the heat maps with the loss function of high-order topology matching, further adjusting the convolutional layers and continuing to train the model so as to improve over the original unsupervised learning. The pre-trained model should be the most effective starting point; the best way to use it is to retain its architecture and initial weights, and then retrain the model starting from the pre-trained weights. In this way target detection becomes more accurate, and the problem that the original model could not recognize sleeveless (short-sleeve) figures can be solved.
The method comprises the following specific steps:
under the SSDA setting, the tagged target and untagged target data actually have a fairly potential relationship. In one aspect, the dimensions, posture or appearance of a person are different between them. On the other hand, they are subject to uniform distribution and have similar specific keypoint information. Key points and baselines are found, and corresponding first-order heat maps are drawn by calculating heat map vectors of the key points and the baselines. The corresponding us can get its Entropy Loss (control Loss).
Entropy loss: entropy minimization (ENT) is a semi-supervised method that assumes the model is confident in its predictions on unlabeled data. We adopt it as a regularizer and ensure that it helps the generated heat maps as much as possible to achieve better performance in the target domain. This term is added to the optimization of equation 4, where Ent(p) = -Σ_k p_k log p_k calculates the entropy of the distribution p.
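A small sketch of the entropy regularizer applied to a predicted heat map, treating each heat map as a distribution over locations (an assumption about how the distribution p is formed):

```python
import torch

def heatmap_entropy(heatmaps, eps=1e-8):
    """heatmaps: (K, H, W) predicted heat maps. Returns the mean entropy over keypoints."""
    p = heatmaps.flatten(1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)          # normalize each heat map to a distribution
    ent = -(p * torch.log(p + eps)).sum(dim=1)          # Ent(p) = -sum p log p
    return ent.mean()

ent_loss = heatmap_entropy(torch.rand(17, 64, 48))       # added to the optimization as a regularizer
```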
In this process, the cross-entropy loss function is used to supervise the error correction. The cross-entropy loss function is:
L_CE = -Σ_x y(x) log h(x)
First, entropy is the expected amount of Shannon information, I(x) = -log p(x). Here y represents the distribution of the true labels and h(x) is the distribution of the labels predicted by the trained model; the cross-entropy loss function measures the similarity between y and h(x): the smaller the cross entropy of the two distributions, the more similar they are.
On the basis of the training set, test set and development set, a processing program corresponding to the test set in SZF is built to extract the data of the training, test and development sets and assign them to the text and label fields. On the basis of the pre-trained model, the relevant parameters are set; after running, the fine-tuning of this process is completed through training, and the optimized XZZNet network and the corrected human body keypoint images are obtained.
The processing procedure of the step 5 specifically includes: and generating a key point detection image corresponding to the image in the SZF data set by using the optimized XZZNet network and taking the SZF data set as input, and obtaining the posture information of each key point containing the human body in the image according to the key point detection image.
After this series of improvements, the SZF data set is input into the optimized XZZNet network to generate image detection results; when recognizing and estimating the human pose, the keypoints are found to correspond correctly, whereas previously some keypoints were missing. XZZNet is improved through link prediction, so that the whole model achieves a relatively ideal effect.
The human pose keypoints generated in step 3 come from the supervised network, while step 5 uses the human pose keypoints generated by the unsupervised network. The method learns an unsupervised network by graph-matching the human pose keypoints generated by the supervised HRNet against those generated by the unsupervised XZZNet network, and the unsupervised network parameters are continually fine-tuned until the unsupervised network XZZNet converges.
Through the above four steps, a fine-tuned XZZNet network model is obtained. This step is equivalent to a testing step: the SZF data set collected by the Python crawler is used as input to generate keypoint detection results for the sleeveless human pose images, and the pose information of each human body keypoint contained in the image is obtained from the keypoint detection images.
In summary, a large number of experiments prove that the performance on the target domain can be remarkably improved for images with no annotations or only sparse annotations, and that the invention can accurately distinguish each keypoint of the human body under the unsupervised network learning framework.
The invention can accurately distinguish each key point of the human body, is beneficial to positioning and identifying the posture of the human body in a complex environment, for example, the distress posture of the human body in a fire scene can be accurately detected, and is beneficial to developing rescue actions in time.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for apparatus or system embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for relevant parts reference may be made to the partial description of the method embodiments. The above-described embodiments of the apparatus and system are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A multi-person posture estimation method based on human body topological structure alignment is characterized by comprising the following steps:
crawling human body images without sleeves by a web crawler to construct an SZF data set;
training the HRNet network by using the MS-COCO and MPII data sets as input to obtain an XZZNet network; learning image samples of the MS-COCO and MPII data sets by using an HRNet network to obtain a human body keypoint image;
inputting the SZF data set into an XZZNet network, and generating candidate human body posture key points without sleeves by the XZZNet network;
carrying out graph matching on a human body key point diagram generated by the HRNet network and a candidate human body key point diagram generated by the XZZNet network, and finely adjusting the XZZNet network by using a cross entropy loss function according to a graph matching result to obtain an optimized XZZNet network;
and inputting the SZF data set into the optimized XZZNet network, generating a key point detection image corresponding to the image in the SZF data set, and obtaining the posture information of each key point containing the human body in the image according to the key point detection image.
2. The method as claimed in claim 1, wherein crawling sleeveless human body images by a web crawler to construct the SZF data set comprises:
implementing the web crawler with the requests package of python, using the requests package to call an image-capture function to obtain the target webpage address, looping over the image URLs stored in the dictionary, renaming the captured images with python code and labeling them; after the labeling is completed, reading in and storing the images and annotation files one by one in order to form the SZF image data set.
3. The method of claim 1, wherein training the HRNet network using the MS-COCO and MPII data sets as input to obtain an XZZNet network and learning the image samples of the MS-COCO and MPII data sets with the HRNet network to obtain a human body keypoint image comprises:
flipping, cropping and reshaping the original images in the MS-COCO and MPII data sets, and training the HRNet network with the transformed images as input to obtain a robust XZZNet network; learning the image samples of the MS-COCO and MPII data sets with the HRNet network to obtain a human body keypoint map, averaging the heat maps generated by the transformed images, and predicting the position of each human body keypoint by taking the position of the highest heat value and offsetting it by a quarter in the direction from the highest response to the second-highest response.
4. The method of claim 3, wherein, when training the HRNet network using the MS-COCO and MPII data sets as input, the HRNet network comprises four stages whose main body consists of four parallel subnetworks, the resolution being gradually halved while the corresponding width is doubled; exchange units across the parallel subnetworks are introduced into the HRNet network, and information is repeatedly exchanged on the parallel multi-resolution subnetworks to perform repeated multi-scale fusion, so that each subnetwork repeatedly receives information from the other parallel subnetworks; and the keypoints are estimated from the high-resolution representation output by the network.
5. The method of claim 1, wherein the inputting of the SZF dataset into an XZZNet network that generates candidate sleeve-free body pose keypoints, comprises:
inputting the previously constructed SZF data set into the XZZNet network, and learning domain-invariant and fine-grained cross-domain human representations with the L_fd function through the CAFA module and the pre-trained network in the XZZNet network:
l_fd = || (1/M) Σ_{i=1..M} φ(F_{s,i}') - (1/N) Σ_{j=1..N} φ(F_{t,j}') ||²_H
wherein M and N represent the numbers of sample images of the source domain and the target domain, F_{s,i}' and F_{t,j}' denote F_s' at the i-th position and F_t' at the j-th position, and φ is a mapping operation that projects domain features into a kernel Hilbert space H;
marking and annotating keypoints of the image samples in the SZF data set, performing pose prediction with an encoder-decoder framework, generating the corresponding image triplet features with a feature extractor, obtaining the adaptive features F_s, F_t from the image triplet features through the CAFA module, inputting the features F_s, F_t into the pose estimator to predict the respective keypoint heat maps, and generating candidate human body keypoint images.
6. The method according to claim 1, wherein the map matching is performed on the human body key point map generated by the HRNet network and the candidate human body key point map generated by the XZZNet network, and the XZZNet network is adjusted by using a cross entropy loss function according to a map matching result to obtain the optimized XZZNet network, and the method comprises the following steps:
the method comprises the steps of utilizing human body topology alignment to conduct image construction and image matching operation on a human body key point image labeled by an HRNet network and a candidate labeled human body key point image generated by an XZZNet network, aligning first-order key points in a domain, giving a loss function for aligning the human body key point image labeled by the HRNet network and the candidate labeled human body key point image generated by the XZZNet network according to a cross entropy function, reversely fine-tuning the XZZNet network by minimizing the loss function, and continuously updating parameters of the XZZNet network model by utilizing a back propagation BP algorithm to train the XZZNet network model, so that the XZZNet network is converged to obtain an optimized XZZNet network and a corrected candidate human body key point image.
CN202110009492.0A 2021-01-05 2021-01-05 Multi-person pose estimation method based on human body topological structure alignment Active CN112801138B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110009492.0A CN112801138B (en) 2021-01-05 2021-01-05 Multi-person pose estimation method based on human body topological structure alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110009492.0A CN112801138B (en) 2021-01-05 2021-01-05 Multi-person pose estimation method based on human body topological structure alignment

Publications (2)

Publication Number Publication Date
CN112801138A true CN112801138A (en) 2021-05-14
CN112801138B CN112801138B (en) 2024-04-09

Family

ID=75808399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110009492.0A Active CN112801138B (en) 2021-01-05 2021-01-05 Multi-person pose estimation method based on human body topological structure alignment

Country Status (1)

Country Link
CN (1) CN112801138B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361334A (en) * 2021-05-18 2021-09-07 山东师范大学 Convolutional pedestrian re-identification method and system based on key point optimization and multi-hop attention intention
CN113610102A (en) * 2021-06-23 2021-11-05 浙江大华技术股份有限公司 Training and target segmentation method for segmentation network and related equipment
CN114742890A (en) * 2022-03-16 2022-07-12 西北大学 6D attitude estimation data set migration method based on image content and style decoupling

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203395A (en) * 2016-07-26 2016-12-07 厦门大学 Face character recognition methods based on the study of the multitask degree of depth
CN108205655A (en) * 2017-11-07 2018-06-26 北京市商汤科技开发有限公司 A kind of key point Forecasting Methodology, device, electronic equipment and storage medium
CN109190467A (en) * 2018-07-26 2019-01-11 北京纵目安驰智能科技有限公司 A kind of more object detecting methods, system, terminal and storage medium returned based on key point
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method
WO2020207281A1 (en) * 2019-04-12 2020-10-15 腾讯科技(深圳)有限公司 Method for training posture recognition model, and image recognition method and apparatus
CN111860101A (en) * 2020-04-24 2020-10-30 北京嘀嘀无限科技发展有限公司 Training method and device for face key point detection model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203395A (en) * 2016-07-26 2016-12-07 厦门大学 Face character recognition methods based on the study of the multitask degree of depth
CN108205655A (en) * 2017-11-07 2018-06-26 北京市商汤科技开发有限公司 A kind of key point Forecasting Methodology, device, electronic equipment and storage medium
CN109190467A (en) * 2018-07-26 2019-01-11 北京纵目安驰智能科技有限公司 A kind of more object detecting methods, system, terminal and storage medium returned based on key point
WO2020207281A1 (en) * 2019-04-12 2020-10-15 腾讯科技(深圳)有限公司 Method for training posture recognition model, and image recognition method and apparatus
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method
CN111860101A (en) * 2020-04-24 2020-10-30 北京嘀嘀无限科技发展有限公司 Training method and device for face key point detection model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GDTOP818: "HRNet详解" (Detailed Explanation of HRNet), retrieved from the Internet: <URL: https://blog.csdn.net/weixin_37993251/article/details/88043650?ops_request_misc=&request_id=&biz_id=102&utm_term=HRNet&utm_medium=distribute.pc_search_result.none-task-blog-2> *
Feng Xiaoyue et al.: "二维人体姿态估计研究进展" (Research Progress on Two-Dimensional Human Pose Estimation), in Computer Science (《计算机科学》) *
Zhang Wei et al.: "引入全局约束的精简人脸关键点检测网络" (A Compact Face Keypoint Detection Network with Global Constraints), in Signal Processing (《信号处理》) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361334A (en) * 2021-05-18 2021-09-07 山东师范大学 Convolutional pedestrian re-identification method and system based on key point optimization and multi-hop attention intention
CN113610102A (en) * 2021-06-23 2021-11-05 浙江大华技术股份有限公司 Training and target segmentation method for segmentation network and related equipment
CN114742890A (en) * 2022-03-16 2022-07-12 西北大学 6D attitude estimation data set migration method based on image content and style decoupling

Also Published As

Publication number Publication date
CN112801138B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
CN111902825B (en) Polygonal object labeling system and method for training object labeling system
CN109033107B (en) Image retrieval method and apparatus, computer device, and storage medium
CN107871014A (en) A kind of big data cross-module state search method and system based on depth integration Hash
CN112801138B (en) Multi-person gesture estimation method based on human body topological structure alignment
CN112183577A (en) Training method of semi-supervised learning model, image processing method and equipment
CN110347932B (en) Cross-network user alignment method based on deep learning
CN111709409A (en) Face living body detection method, device, equipment and medium
CN111079532A (en) Video content description method based on text self-encoder
Wang et al. Storm: Structure-based overlap matching for partial point cloud registration
CN112200266B (en) Network training method and device based on graph structure data and node classification method
CN113011568B (en) Model training method, data processing method and equipment
Reddy et al. AdaCrowd: Unlabeled scene adaptation for crowd counting
CN113642602B (en) Multi-label image classification method based on global and local label relation
CN113065409A (en) Unsupervised pedestrian re-identification method based on camera distribution difference alignment constraint
CN116310318B (en) Interactive image segmentation method, device, computer equipment and storage medium
Liao et al. FERGCN: facial expression recognition based on graph convolution network
Cai et al. Multiperspective light field reconstruction method via transfer reinforcement learning
Dai et al. Enhancing two-view correspondence learning by local-global self-attention
CN114140524B (en) Closed loop detection system and method for multi-scale feature fusion
CN116342978A (en) Target detection network training and target detection method and device, and electronic equipment
Wang et al. An Improved Convolutional Neural Network‐Based Scene Image Recognition Method
CN114841887A (en) Image restoration quality evaluation method based on multi-level difference learning
Zhang [Retracted] An Intelligent and Fast Dance Action Recognition Model Using Two‐Dimensional Convolution Network Method
Lin et al. CapsNet meets ORB: A deformation‐tolerant baseline for recognizing distorted targets
Yang et al. Robust feature mining transformer for occluded person re-identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant