CN112801138A - Multi-person pose estimation method based on human body topological structure alignment - Google Patents

Multi-person pose estimation method based on human body topological structure alignment

Info

Publication number
CN112801138A
CN112801138A (application CN202110009492.0A)
Authority
CN
China
Prior art keywords
network
xzznet
human body
image
key point
Prior art date
Legal status
Granted
Application number
CN202110009492.0A
Other languages
Chinese (zh)
Other versions
CN112801138B (en)
Inventor
李浥东
郎丛妍
孙鑫雨
冯紫钰
赵治坤
汪敏
Current Assignee
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202110009492.0A priority Critical patent/CN112801138B/en
Publication of CN112801138A publication Critical patent/CN112801138A/en
Application granted granted Critical
Publication of CN112801138B publication Critical patent/CN112801138B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition

Abstract

The invention provides a multi-person pose estimation method based on human body topological structure alignment. The method comprises the following steps: training the HRNet network with the MS-COCO and MPII data sets as input to obtain an XZZNet network; learning the image samples of these data sets with the HRNet network to obtain human body keypoint maps; inputting the SZF data set into the XZZNet network to generate candidate keypoints of sleeveless human poses; performing graph matching between the human body keypoint map generated by the HRNet network and the candidate human body keypoint map generated by the XZZNet network, and fine-tuning the XZZNet network with a cross-entropy loss function to obtain an optimized XZZNet network; and inputting the SZF data set into the optimized XZZNet network, generating keypoint detection images corresponding to the images in the SZF data set, and obtaining from them the pose information of each human body keypoint contained in the image. The invention can significantly improve performance on the target domain for images with no annotations or only sparse annotations, and can accurately distinguish each keypoint of the human body in the image under an unsupervised network learning framework.

Description

Multi-person pose estimation method based on human body topological structure alignment
Technical Field
The invention relates to the technical field of human body behavior analysis, in particular to a multi-person posture estimation method based on human body topological structure alignment.
Background
In recent years, the development of information technology and the popularization of intelligent technology have further promoted global technological change, and technologies such as cloud computing, the Internet of Things, big data and artificial intelligence have developed rapidly; among them, human body posture recognition technology is widely applied in related fields of computer vision.
Human body posture recognition in fixed scenes is a hot spot of current artificial intelligence technology and has very important research significance; it also plays a certain role in promoting modern construction in China, so strengthening its technical analysis and study is very important. As early as the 1970s, China began research on human behavior analysis, which strongly promoted the development of artificial intelligence in China, and analyzing relatively simple gestures and actions in specific situations or more standard scenes has become possible.
With the continuous improvement of the standard of living in China, people's requirements for the quality of social life keep increasing, so video surveillance has become an indispensable safety measure in daily life, and the technical requirements for video analysis are becoming higher and higher. For example, human body posture recognition is widely applied in industries such as smart home, the medical field and motion analysis, and human body posture recognition in fixed scenes plays an obvious role in various fields. In particular, in recent years, the strengthening of security work in our country places strong demands on handling the dense and mobile population of large cities, identifying criminals, and the like.
At present, there is no effective multi-person pose estimation method based on human body topological structure alignment in the prior art.
Disclosure of Invention
The embodiment of the invention provides a multi-person posture estimation method based on human body topological structure alignment, so that each key point of a human body in an image can be accurately distinguished under an unsupervised network learning framework.
In order to achieve the purpose, the invention adopts the following technical scheme.
(corresponding claims)
According to the technical scheme provided by the embodiment of the invention, the performance on the target domain can be obviously improved for images with no annotations or only sparse annotations, and each keypoint of the human body can be accurately distinguished under an unsupervised network learning framework.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a multi-person pose estimation method applied to an image based on human body topological structure alignment according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
The flow chart of the multi-person posture estimation method applied to the image based on human body topological structure alignment provided by the embodiment of the invention is shown in figure 1, and the method comprises the following steps:
Step 1: crawl sleeveless human body images with a web crawler to construct the SZF (Sleeve-Zero Figures) data set.
Step 2: learn the image samples of the MS-COCO and MPII data sets with an HRNet network to obtain human body keypoint images, and train the HRNet network with the MS-COCO and MPII data sets as input to obtain a robust XZZNet network.
The MS-COCO data set (Common Objects in Context) is a data set provided by the Microsoft team for image recognition, covering tasks such as detection, segmentation and keypoints. The images in MS-COCO include natural images and images of common objects in daily life; the backgrounds are complex, the number of targets is large and the targets are small, which makes tasks on MS-COCO difficult. Furthermore, the MS-COCO data set contains 91 classes of images. Its training set has 82,783 images, its validation set has 40,504 images and its test set has 40,775 images. Each annotated person provides 17 body keypoints.
The MPII human pose dataset consists of images taken from the real world with full body pose annotations. There were approximately 25K images, 40K subjects, of which approximately 7K were used for testing and the remaining 18K for training and validation. Each image provides 16 body keypoints.
Step 3: input the SZF data set into the XZZNet network, which generates candidate sleeveless human pose keypoints.
Step 4: build graph models for the human body keypoint map generated by the HRNet network and the candidate human body keypoint map generated by the XZZNet network and perform graph matching; continuously correct through the supervised loss function so that the graph models become more sensitive to keypoints in human model recognition; learn a generalized high-order structure-invariant representation over the two domains by minimizing the loss function, which helps determine keypoint locations; and adjust the XZZNet network with the cross-entropy loss function to obtain the optimized XZZNet network.
Step 5: with the optimized XZZNet network and the SZF data set as input, generate keypoint detection images corresponding to the images in the SZF data set, and obtain from them the pose information of each human body keypoint contained in the image.
The processing procedure of step 1 specifically comprises acquiring a large number of sleeveless human body images from the web with a python crawler. Python has powerful crawling capabilities and a mature, efficient distributed crawler framework (Scrapy-Redis), which makes it convenient to download web pages efficiently; its multithreading and multiprocess models are mature and stable, and using multiple threads or processes can optimize program efficiency and improve the downloading and parsing capability of the whole system. Meanwhile, python has excellent third-party packages that can simulate user-agent behaviour to construct proper requests and prevent the website from blocking the crawler.
The web crawler is implemented with python's requests package: the requests library is a module commonly used for HTTP requests, and importing it makes it more convenient to crawl web pages in a program. Knowing the encoding of the web page, regular expressions are used in the program to match the defined data format, and the successfully matched URLs are converted into strings and stored in a dictionary; the URL where the externally specified image is located is then requested, and the file at the specified location is opened with the open() function.
The main python routine covers three aspects: first, obtaining the target webpage address (for example, the image data set can be acquired through Baidu image search); second, calling the image-capture function; and third, looping over the image URLs stored in the dictionary.
Unqualified images are deleted from the SZF image data set, the originally garbled file names are renamed to easily recognizable names with python code, and the SZF data set is annotated. After the annotation is finished, the images and annotation files are read in and stored one by one in order to form the SZF image data set.
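For illustration only, a minimal Python sketch of this crawling and renaming step; the page URLs, the image-link regular expression and the file-naming scheme below are assumptions and not part of the invention:

```python
import os
import re
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # mimic a browser to avoid simple anti-crawler checks

def crawl_images(page_urls, out_dir="SZF_raw"):
    """Download candidate sleeveless-figure images from a list of page URLs (hypothetical)."""
    os.makedirs(out_dir, exist_ok=True)
    img_pattern = re.compile(r'https?://[^"\']+?\.(?:jpg|jpeg|png)')  # assumed data format
    seen = {}
    for page_url in page_urls:
        resp = requests.get(page_url, headers=HEADERS, timeout=10)
        resp.encoding = resp.apparent_encoding            # determine the page encoding first
        for img_url in img_pattern.findall(resp.text):
            if img_url in seen:                           # URLs are stored in a dictionary
                continue
            seen[img_url] = True
            img = requests.get(img_url, headers=HEADERS, timeout=10).content
            name = f"szf_{len(seen):06d}.jpg"             # rename garbled names to readable ones
            with open(os.path.join(out_dir, name), "wb") as f:
                f.write(img)

# usage: crawl_images(["https://example.com/gallery?page=1"])
```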
The processing procedure of the step 2 specifically includes:
turning (flip), cutting (crop) and remolding (reshape) an original image in the MS-COCO and MPLL data sets, and training the HRNet network by taking the changed image as input to obtain a robust XZZNet network; learning image samples of the MS-COCO and MPLL data sets by using an HRNet network to obtain a human body key point diagram, averaging the heat diagrams generated by the changed images, and predicting the position of each human body key point by adjusting the position of the highest heat value and shifting the position by one fourth in the direction from the highest response to the second highest response.
The HRNet neural network model is trained using the existing MS-COCO and MPII data sets. In the whole process, exchange units across the parallel subnetworks are introduced into the HRNet network, and information is repeatedly exchanged across the parallel multi-resolution subnetworks to perform repeated multi-scale fusion, so that each subnetwork repeatedly receives information from the other parallel subnetworks. The keypoints are estimated on the high-resolution representation output by the network.
The HRNet network used by the invention contains four stages; the main body consists of four parallel subnetworks whose resolution is gradually halved while the corresponding width (number of channels) is doubled. The first stage contains 4 residual modules, each of which, like ResNet-50, is composed of a bottleneck block with 64 channels; this is followed by one 3×3 convolution that reduces the number of channels to C. The second, third and fourth stages contain 1, 4 and 3 exchange modules, respectively. One exchange module contains 4 residual modules (each containing two 3×3 convolutional layers) and one exchange unit. In total there are 8 exchange modules, i.e. 8 rounds of multi-scale fusion. During training, the first stage generates the top-level (highest-resolution) features, and the feature resolution is gradually reduced in the following stages; thus, the parallel subnetworks of a later stage consist of the subnetworks of the previous stage plus an additional lower-resolution one. The heat map is simply regressed from the high-resolution feature representation output by the last exchange module. The loss function is defined as the mean square error and is used to compare the predicted heat maps with the ground-truth heat maps. The ground-truth heat maps are generated with a two-dimensional Gaussian distribution centered on the ground-truth position of each keypoint with a standard deviation of 1 pixel.
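For reference, a minimal sketch of the Gaussian ground-truth heat-map generation and mean-square-error loss described above; the sizes used are illustrative:

```python
import torch

def gaussian_heatmap(size, center, sigma=1.0):
    """2D Gaussian heat map centered on the ground-truth keypoint position (std = 1 pixel)."""
    h, w = size
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    cx, cy = center
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# predicted and ground-truth heat maps for K keypoints
pred = torch.rand(17, 64, 48)
gt = torch.stack([gaussian_heatmap((64, 48), (24, 32)) for _ in range(17)])
loss = torch.nn.functional.mse_loss(pred, gt)   # mean square error between the two heat maps
```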
Training on the MS-COCO dataset:
The aspect ratio of the human detection bounding box (object detection bounding box) is expanded to a fixed aspect ratio, height:width = 4:3, and the box is then cropped from the image and resized to a fixed size, 256×192 or 384×288. Data augmentation includes random rotation ([-45°, 45°]), random scaling ([0.65, 1.35]) and random flipping. We use the Adam optimizer; the initial learning rate is set to 1e-3, decayed at the 170th and 200th epochs, and training ends at 210 epochs.
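A small sketch of the fixed-aspect-ratio expansion of the detection box; the exact expansion rule beyond the 4:3 ratio is an assumption here:

```python
def expand_box_to_ratio(x, y, w, h, target_ratio=4 / 3):
    """Expand a detection box (top-left x, y, width, height) to height:width = 4:3."""
    cx, cy = x + w / 2, y + h / 2          # keep the box center fixed
    if h / w > target_ratio:
        w = h / target_ratio                # too tall: widen the box
    else:
        h = w * target_ratio                # too wide: heighten the box
    return cx - w / 2, cy - h / 2, w, h

print(expand_box_to_ratio(10, 20, 50, 120))  # the box is then cropped and resized to 256x192 or 384x288
```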
Training on the MPII dataset:
The testing procedure is almost the same as on COCO, except that the standard test strategy is used. The standard metric, the PCKh (head-normalized probability of correct keypoints) score, is used: a joint is correct if it falls within α·l pixels of the ground-truth position, where α is a constant and l is the head segment length, defined as 60% of the diagonal length of the ground-truth head bounding box. The PCKh@0.5 (α = 0.5) score is reported.
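A rough sketch of the PCKh metric as described above, assuming the usual MPII head-size convention:

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """pred, gt: (N, K, 2) keypoint coordinates; head_sizes: (N,) head segment lengths l.
    A joint counts as correct if its distance to the ground truth is at most alpha * l."""
    dist = np.linalg.norm(pred - gt, axis=-1)                 # (N, K) joint-wise distances
    thresh = alpha * head_sizes[:, None]                      # per-image threshold
    return float(np.mean(dist <= thresh))

# usage with dummy data
pred = np.random.rand(4, 16, 2) * 100
gt = pred + np.random.randn(4, 16, 2)
print(pckh(pred, gt, head_sizes=np.full(4, 30.0)))            # PCKh@0.5 score
```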
The HRNet network is trained on the MS-COCO data set and the MPII data set to obtain the XZZNet network.
The XZZNet network and the HRNet network have the same network structure, but their weight parameters are different; the XZZNet network at this point is equivalent to the HRNet model pre-trained on the MS-COCO and MPII data sets in the previous step.
The processing procedure of step 3 specifically includes: the XZZNet network is obtained by training the HRNet network, so its structure is the same as that of the HRNet network (the convolutional layers are the same; the network parameters are different and are continuously updated). After the construction of the SZF data set and the generation of the network XZZNet through supervised training are completed, the previously constructed SZF data set is input into the XZZNet network, and the pre-trained network together with the CAFA (cross-attention alignment) module in the XZZNet network uses the L_fd function to learn domain-invariant and fine-grained cross-domain human representations:
l_fd = || (1/M) Σ_{i=1..M} φ(F_{s,i}') - (1/N) Σ_{j=1..N} φ(F_{t,j}') ||²_H
Fine-grained features are effective for accurate pose estimation. The goal of CAFA is to adapt more domain-invariant, fine-grained human features across domains. Unlike previous feature adaptation methods, we capture cross-domain correlated fine-grained features through BSAM, which explores local spatial feature dependencies across domains rather than simply considering the domain features separately. By exploring feature interactions in a bidirectional manner, fine-grained human features can be well encoded for each domain. Specifically, we design a source-to-target adaptation (STA) mechanism to enhance the source human features by adaptively aggregating target features based on their similarities. Similarly, we also use a target-to-source adaptation (TSA) mechanism to update the target domain features by aggregating the relevant source domain features. The details of CAFA are described below.
Given a sample pair x_s, x_t (one from the source domain and one from the target domain), a feature extractor (feature encoder) generates the corresponding features F_s and F_t, and two convolutional layers are applied to generate A and B, respectively. In addition, F_s and F_t are also fed into another convolutional layer to obtain S_c and T_c.
To determine the fine-grained feature dependency between each pair of corresponding positions in F_s and F_t, a correlation map Φ = AᵀB is computed, where Φ^(i,j) measures the correlation between the i-th position of F_s and the j-th position of F_t. To let F_s and F_t enhance each other, a bidirectional enhancement mechanism is adopted: 1) the source-to-target adaptation mechanism (STA):
in STA, we define the spatial association graph of source domain to target domain as:
The spatial association graph ψ_{s→t} is derived from the correlation map Φ (Eq. 1); its element ψ_{s→t}^(i,j) represents the influence of the i-th position of F_s on the j-th position of F_t. To exploit the fine-grained features with similar spatial responses in the target domain, F_s is updated as
F_s' = F_s + λ_s T_c ψ_{s→t} (Eq. 2)
where λ_s weighs the importance of the target-domain-related spatial information against the source domain features. In this way, the responses of similar target features are encoded into each position of F_s'.
2) Target Domain to Source Domain Adaptation mechanism (TSA)
Similarly, the correlation map from the target domain to the source domain, ψ_{t→s}, can be obtained according to Eq. 1; its element ψ_{t→s}^(j,i) indicates the influence of the j-th position of F_t on the i-th position of F_s. Combining the similar fine-grained source-domain responses with the original target features as in Eq. 2, F_t is updated to F_t'. With F_s' and F_t', more fine-grained features can be encoded for each domain.
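Under the reconstruction of Eq. 1 and Eq. 2 given above (itself an assumption where the original formula images did not survive extraction), the bidirectional STA/TSA update could be sketched as follows; the softmax normalization and the use of F_s, F_t in place of the projection branches A, B are assumptions:

```python
import torch

def cafa_bidirectional(Fs, Ft, Sc, Tc, lam_s=0.5, lam_t=0.5):
    """Fs, Ft, Sc, Tc: (C, N) feature maps flattened over spatial positions."""
    A, B = Fs, Ft                                   # stand-ins for the two projection branches
    phi = A.t() @ B                                 # correlation map, shape (N_s, N_t)
    psi_s2t = torch.softmax(phi, dim=1)             # source-to-target association (assumed softmax)
    psi_t2s = torch.softmax(phi.t(), dim=1)         # target-to-source association
    Fs_new = Fs + lam_s * (Tc @ psi_s2t.t())        # aggregate target features into the source
    Ft_new = Ft + lam_t * (Sc @ psi_t2s.t())        # aggregate source features into the target
    return Fs_new, Ft_new

Fs, Ft = torch.rand(256, 48 * 64), torch.rand(256, 48 * 64)
Sc, Tc = torch.rand(256, 48 * 64), torch.rand(256, 48 * 64)
Fs2, Ft2 = cafa_bidirectional(Fs, Ft, Sc, Tc)
```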
Finally, F_s' and F_t' are aligned by applying the maximum mean discrepancy loss l_fd (MMD for short):
l_fd = || (1/M) Σ_{i=1..M} φ(F_{s,i}') - (1/N) Σ_{j=1..N} φ(F_{t,j}') ||²_H
where M and N represent the numbers of sample images of the source domain and the target domain, F_{s,i}' and F_{t,j}' denote F_s' at the i-th position and F_t' at the j-th position, and φ is a mapping operation that projects domain features into a kernel Hilbert space H. An arbitrary feature distribution can be represented by the kernel embedding technique, which allows us to minimize l_fd to learn domain-invariant fine-grained human features.
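A minimal sketch of an MMD-style alignment loss of this kind, assuming a Gaussian kernel for the mapping φ (the actual kernel choice is not given in the source):

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """x: (M, D), y: (N, D) -> (M, N) kernel matrix."""
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(Fs, Ft, sigma=1.0):
    """Squared MMD between source features Fs (M, D) and target features Ft (N, D)."""
    k_ss = gaussian_kernel(Fs, Fs, sigma).mean()
    k_tt = gaussian_kernel(Ft, Ft, sigma).mean()
    k_st = gaussian_kernel(Fs, Ft, sigma).mean()
    return k_ss + k_tt - 2 * k_st

loss_fd = mmd_loss(torch.rand(32, 256), torch.rand(32, 256))
```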
Keypoints of the image samples in the SZF data set are marked and annotated. A modified Simple Baseline is adopted as the baseline, and pose prediction is performed on it with an encoder-decoder framework. The feature extractor generates the corresponding image triplet features, from which the adaptive features F_s, F_t are obtained through the CAFA module; the features F_s, F_t are then input into the pose estimator to predict the respective keypoint heat maps and generate candidate human body keypoint images.
The processing procedure of the step 4 specifically includes:
human body key point images (heat maps) marked by an HRNet network and candidate marked human body key point images (heat maps) generated by an XZZNet network are subjected to intra-domain structure self-adaptation by human body topology alignment, and then graph construction and graph matching operation are carried out to achieve the effect of inter-domain structure alignment. And then, reversely fine-tuning the XZZNet network according to the loss error given by the cross entropy function, and correcting the candidate human body key point images.
When graph matching is performed between the human body keypoint images (heat maps) labeled by the HRNet network and the candidate labeled human body keypoint images (heat maps) generated by the XZZNet network, aligning only the first-order keypoints within a domain, i.e. "intra-domain structure adaptation", cannot cope well with large pose differences and severe geometric deformation, especially under severe cross-domain occlusion. Therefore, the invention considers domain-invariant human topology knowledge and adopts IHTA (Inter-domain Human-Topology Alignment; the module performs inter-domain human body topology alignment based on a GCN model, and SemGCN is also a GCN mechanism) to solve this problem. IHTA is designed with a Graph Convolutional neural Network (GCN for short); this mechanism provides an explicit way to model the high-order human skeleton structure, helps to obtain the spatial topology information of the joints, and makes inter-domain human topology adaptation effective and reliable.
The XZZNet network is fine-tuned by combining the entropy loss of the heat maps with the loss function of high-order topology matching, and the model parameters are continuously updated by back propagation so as to train the whole model. We use the parameters pre-trained in the second step to initialize the unsupervised network, then fine-tune the XZZNet network by minimizing the loss function that aligns the two topologies generated by the HRNet network and the XZZNet network, and continuously update the model parameters through the back-propagation (BP) algorithm to optimize the model.
1. Local keypoint feature extraction
First, based on the feature map F and the keypoint heat maps Y_i^kp, the semantic local features of the keypoint set V_KP of the two domains can be obtained through an outer product followed by global average pooling:
v_i = GAP(F ⊗ Y_i^kp) (Eq. 1)
where ⊗ denotes the outer product (element-wise weighting of the feature map by the heat map) and GAP denotes global average pooling.
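A sketch of the heat-map-weighted pooling reconstructed above; the element-wise weighting followed by global average pooling is an assumption consistent with the description:

```python
import torch

def keypoint_local_features(F, heatmaps):
    """F: (C, H, W) feature map; heatmaps: (K, H, W) keypoint heat maps.
    Returns (K, C) semantic local features, one per keypoint."""
    weighted = F.unsqueeze(0) * heatmaps.unsqueeze(1)   # (K, C, H, W): outer-product-style weighting
    return weighted.mean(dim=(2, 3))                    # global average pooling

feats = keypoint_local_features(torch.rand(256, 64, 48), torch.rand(17, 64, 48))
print(feats.shape)  # torch.Size([17, 256])
```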
2. Graph representation
A visual topological graph representation G = (V, E) is constructed, where V is the set of nodes in the graph, V = {v_i | i = 1, 2, ..., H}, and E is the set of edges, which can be written as
E = {(v_i, v_j) | i, j = 1, 2, ..., H, v_i and v_j are connected in the graph} (Eq. 2)
Here the concept of the adjacency matrix A is introduced: its element a_ij equals 1 if and only if v_i and v_j are adjacent in the topological graph, and 0 otherwise.
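A small sketch of the adjacency-matrix construction for a human skeleton graph; the edge list below is an illustrative subset, not the patent's exact joint connectivity:

```python
import numpy as np

H = 17  # number of keypoints / graph nodes
edges = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6)]  # hypothetical limb connections

A = np.zeros((H, H), dtype=np.float32)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0   # a_ij = 1 iff v_i and v_j are adjacent in the topology graph
```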
3. Graph convolution network
Obviously, the human body structure can naturally be regarded as a graph, with latent spatial constraints between the joints. The human joints can be regarded as the keypoints v_i and the limbs as the edges e_ij between keypoints. Based on this view, the topological representation of the person is modeled with SemGCN (Semantic Graph Convolutional Networks). For a graph convolution model, propagating features through adjacent nodes helps to learn robust local structures and the relationship information between nodes. At the same time, non-local layers are adopted to help capture local and global long-range dependencies between nodes, so as to obtain more human context information. This yields robust human topology information, which is a prerequisite for learning structure-invariant information across domains. Applying graph convolution propagation to node i involves two steps: first, the node representations are transformed by a learnable parameter matrix; second, the transformed node representations are collected from the neighbor nodes j to node i, after which a ReLU activation is applied. The node features are collected into a matrix, and the semantic graph convolution network then applies a different weight matrix to each channel of the node features:
v_{l+1} = ||_{d=1..D} ReLU((M_d ⊙ A) v_l w_d) (Eq. 3)
where v_l and v_{l+1} are the node representations before and after the l-th convolution, M_d is a set of learnable H×H parameter matrices whose weight vectors represent the local semantic knowledge of the neighboring joints implied in the graph, A is the adjacency matrix, ⊙ is element-wise multiplication, || denotes channel-wise concatenation, and w_d is the d-th row of the transformation matrix, which learns channel weights as prior edges in the graph (e.g., how one joint affects other body parts in pose estimation) to enhance the graph representation.
Then, following the non-local idea, the feature update operation is defined as:
v_i' = v_i + (W_v / H) Σ_{j=1..H} f(v_i, v_j) g(v_j) (Eq. 4)
where W_v is initialized to zero, f computes the correlation between node i and every other node j, and g computes the representation of node j; the correspondence between nodes is computed from the node features so as to capture the local and global relationships between the nodes.
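And a corresponding sketch of the non-local feature update of Eq. 4; taking f as an embedded dot-product affinity and g as the identity is an assumption:

```python
import torch

def non_local_update(V, Wv_scale=0.0):
    """V: (H, C) node features. Wv_scale plays the role of W_v and starts at zero."""
    affinity = torch.softmax(V @ V.t(), dim=1)   # f(v_i, v_j): pairwise affinities between nodes
    g = V                                         # g(v_j): representation of node j (identity here)
    return V + Wv_scale * (affinity @ g)          # captures local and global node relationships

V = torch.rand(17, 64)
V_new = non_local_update(V, Wv_scale=0.1)
```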
4. Connection prediction
After the two topological graphs are obtained, a hard alignment is not required. A link prediction method is therefore proposed: the nodes of the two topological graphs are connected pairwise, and a score f(s, r, o) is assigned to each possible edge (s, r, o) to determine how likely it is to belong to E. To solve this problem, a graph auto-encoder model is introduced, consisting of an entity encoder and a scoring function (decoder). The encoder maps each entity v_i ∈ V to a real-valued vector e_i; the decoder reconstructs the edges of the graph from the vertex representations, in other words, it evaluates the score of a (subject, relation, object) triplet through a function S. The key feature of our work compared with others lies in the encoder: most previous methods directly use a single real-valued vector e_i for each v_i ∈ V during training, whereas we compute the node representation through an R-GCN encoder, e_i = h_i^(L). We use DistMult factorization as the scoring function, which performs well in standard link prediction; in this approach each relation r is associated with a diagonal matrix R_r, and the score of a triplet (s, r, o) is:
f(s, r, o) = e_s^T R_r e_o (Eq. 5)
We optimize the cross-entropy loss, obtaining the following loss function:
L = -(1 / (2|T|)) Σ_{(s,r,o,y)∈T} [ y log l(f(s, r, o)) + (1 - y) log(1 - l(f(s, r, o))) ] (Eq. 6)
where T is the set of all triplets, l is the logistic sigmoid function, and y is an indicator that equals 1 for positive triplets and 0 for negative triplets.
5. Cross-graph topological alignment
Based on the above steps, cross-graph alignment is performed: the joint relationship information learned by the local semantic network is aligned between the two persons. For samples x_s and x_t from the two domains, the updated joint representations are first obtained through Eq. 3 and Eq. 4, and a 1×1 convolution is then applied to obtain the final topological representation of the person. Finally, the top-scoring edge sets are selected through step 4, and their vertex sets are aligned through Eq. 7 by minimizing a structure-matching loss L_sm between G_s^(i) and G_t^(i), which denote the topological representation of the i-th node in the two domains. Minimizing the loss function L_sm helps determine the keypoint locations and thereby learns a generalized high-order structure-invariant representation over the two domains.
The fine-tuning of the network comes after link prediction is completed. Fine-tuning combines the entropy loss of the heat maps with the loss function of high-order topology matching, further adjusting the convolutional layers and continuing to train the model so as to improve over the original unsupervised learning. The pre-trained model should be the most effective starting point; the best way to use it is to retain its architecture and initial weights, and then retrain the model starting from the pre-trained weights. In this way target detection becomes more accurate, and the problem that the original model could not recognize sleeveless (short-sleeve) figures can be solved.
The method comprises the following specific steps:
under the SSDA setting, the tagged target and untagged target data actually have a fairly potential relationship. In one aspect, the dimensions, posture or appearance of a person are different between them. On the other hand, they are subject to uniform distribution and have similar specific keypoint information. Key points and baselines are found, and corresponding first-order heat maps are drawn by calculating heat map vectors of the key points and the baselines. The corresponding us can get its Entropy Loss (control Loss).
Entropy loss: entropy minimization (ENT) is a semi-supervised method that assumes the model is confident in its predictions on unlabeled data. We adopt it as a regularizer and ensure that it helps the generated heat maps as much as possible to achieve better performance in the target domain. This term is added to the optimization of equation 4, where Ent(p) = -Σ_k p_k log p_k calculates the entropy of the distribution p.
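A small sketch of the entropy regularizer applied to a predicted heat map, treating each heat map as a distribution over locations (an assumption about how the distribution p is formed):

```python
import torch

def heatmap_entropy(heatmaps, eps=1e-8):
    """heatmaps: (K, H, W) predicted heat maps. Returns the mean entropy over keypoints."""
    p = heatmaps.flatten(1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)          # normalize each heat map to a distribution
    ent = -(p * torch.log(p + eps)).sum(dim=1)          # Ent(p) = -sum p log p
    return ent.mean()

ent_loss = heatmap_entropy(torch.rand(17, 64, 48))       # added to the optimization as a regularizer
```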
In this process, the cross-entropy loss function is used to supervise the error correction. The cross-entropy loss function is:
L_CE = -Σ_x y(x) log h(x)
First, entropy is the expected amount of Shannon information, I(x) = -log p(x). Here y represents the distribution of the true labels and h(x) is the distribution of the labels predicted by the trained model; the cross-entropy loss function measures the similarity between y and h(x): the smaller the cross entropy of the two distributions, the more similar they are.
On the basis of the training set, test set and development set, a processing program corresponding to the test set in SZF is built to extract the data of the training, test and development sets and assign them to the text and label fields. On the basis of the pre-trained model, the relevant parameters are set; after running, the fine-tuning of this process is completed through training, and the optimized XZZNet network and the corrected human body keypoint images are obtained.
The processing procedure of the step 5 specifically includes: and generating a key point detection image corresponding to the image in the SZF data set by using the optimized XZZNet network and taking the SZF data set as input, and obtaining the posture information of each key point containing the human body in the image according to the key point detection image.
After this series of improvements, the SZF data set is input into the optimized XZZNet network to generate image detection results; when recognizing and estimating the human pose, the keypoints are found to correspond correctly, whereas previously some keypoints were missing. XZZNet is improved through link prediction, so that the whole model achieves a relatively ideal effect.
The human pose keypoints generated in step 3 come from the supervised network, while step 5 uses the human pose keypoints generated by the unsupervised network. The method learns an unsupervised network by graph-matching the human pose keypoints generated by the supervised HRNet against those generated by the unsupervised XZZNet network, and the unsupervised network parameters are continually fine-tuned until the unsupervised network XZZNet converges.
Through the above four steps, a fine-tuned XZZNet network model is obtained. This step is equivalent to a testing step: the SZF data set collected by the Python crawler is used as input to generate keypoint detection results for the sleeveless human pose images, and the pose information of each human body keypoint contained in the image is obtained from the keypoint detection images.
In summary, a large number of experiments prove that the performance on the target domain can be remarkably improved for images with no annotations or only sparse annotations, and that the invention can accurately distinguish each keypoint of the human body under the unsupervised network learning framework.
The invention can accurately distinguish each key point of the human body, is beneficial to positioning and identifying the posture of the human body in a complex environment, for example, the distress posture of the human body in a fire scene can be accurately detected, and is beneficial to developing rescue actions in time.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for apparatus or system embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for relevant parts reference may be made to the partial description of the method embodiments. The above-described embodiments of the apparatus and system are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A multi-person posture estimation method based on human body topological structure alignment is characterized by comprising the following steps:
crawling human body images without sleeves by a web crawler to construct an SZF data set;
training the HRNet network by using the MS-COCO and MPII data sets as input to obtain an XZZNet network; learning image samples of the MS-COCO and MPII data sets by using an HRNet network to obtain a human body keypoint image;
inputting the SZF data set into an XZZNet network, and generating candidate human body posture key points without sleeves by the XZZNet network;
carrying out graph matching on a human body key point diagram generated by the HRNet network and a candidate human body key point diagram generated by the XZZNet network, and finely adjusting the XZZNet network by using a cross entropy loss function according to a graph matching result to obtain an optimized XZZNet network;
and inputting the SZF data set into the optimized XZZNet network, generating a key point detection image corresponding to the image in the SZF data set, and obtaining the posture information of each key point containing the human body in the image according to the key point detection image.
2. The method as claimed in claim 1, wherein crawling sleeveless human body images by a web crawler to construct the SZF data set comprises:
implementing the web crawler with the requests package of python, using the requests package to call an image-capture function to obtain the target webpage address, looping over the image URLs stored in the dictionary, renaming the captured images with python code and labeling them; after the labeling is completed, reading in and storing the images and annotation files one by one in order to form the SZF image data set.
3. The method of claim 1, wherein training the HRNet network using the MS-COCO and MPII data sets as input to obtain an XZZNet network and learning the image samples of the MS-COCO and MPII data sets with the HRNet network to obtain a human body keypoint image comprises:
flipping, cropping and reshaping the original images in the MS-COCO and MPII data sets, and training the HRNet network with the transformed images as input to obtain a robust XZZNet network; learning the image samples of the MS-COCO and MPII data sets with the HRNet network to obtain a human body keypoint map, averaging the heat maps generated by the transformed images, and predicting the position of each human body keypoint by taking the position of the highest heat value and offsetting it by a quarter in the direction from the highest response to the second-highest response.
4. The method of claim 3, wherein, when training the HRNet network using the MS-COCO and MPII data sets as input, the HRNet network comprises four stages whose main body consists of four parallel subnetworks, the resolution being gradually halved while the corresponding width is doubled; exchange units across the parallel subnetworks are introduced into the HRNet network, and information is repeatedly exchanged on the parallel multi-resolution subnetworks to perform repeated multi-scale fusion, so that each subnetwork repeatedly receives information from the other parallel subnetworks; and the keypoints are estimated from the high-resolution representation output by the network.
5. The method of claim 1, wherein the inputting of the SZF dataset into an XZZNet network that generates candidate sleeve-free body pose keypoints, comprises:
inputting the previously constructed SZF data set into the XZZNet network, and learning domain-invariant and fine-grained cross-domain human representations with the L_fd function through the CAFA module and the pre-trained network in the XZZNet network:
l_fd = || (1/M) Σ_{i=1..M} φ(F_{s,i}') - (1/N) Σ_{j=1..N} φ(F_{t,j}') ||²_H
wherein M and N represent the numbers of sample images of the source domain and the target domain, F_{s,i}' and F_{t,j}' denote F_s' at the i-th position and F_t' at the j-th position, and φ is a mapping operation that projects domain features into a kernel Hilbert space H;
marking and annotating keypoints of the image samples in the SZF data set, performing pose prediction with an encoder-decoder framework, generating the corresponding image triplet features with a feature extractor, obtaining the adaptive features F_s, F_t from the image triplet features through the CAFA module, inputting the features F_s, F_t into the pose estimator to predict the respective keypoint heat maps, and generating candidate human body keypoint images.
6. The method according to claim 1, wherein the map matching is performed on the human body key point map generated by the HRNet network and the candidate human body key point map generated by the XZZNet network, and the XZZNet network is adjusted by using a cross entropy loss function according to a map matching result to obtain the optimized XZZNet network, and the method comprises the following steps:
the method comprises the steps of utilizing human body topology alignment to conduct image construction and image matching operation on a human body key point image labeled by an HRNet network and a candidate labeled human body key point image generated by an XZZNet network, aligning first-order key points in a domain, giving a loss function for aligning the human body key point image labeled by the HRNet network and the candidate labeled human body key point image generated by the XZZNet network according to a cross entropy function, reversely fine-tuning the XZZNet network by minimizing the loss function, and continuously updating parameters of the XZZNet network model by utilizing a back propagation BP algorithm to train the XZZNet network model, so that the XZZNet network is converged to obtain an optimized XZZNet network and a corrected candidate human body key point image.
CN202110009492.0A 2021-01-05 2021-01-05 Multi-person pose estimation method based on human body topological structure alignment Active CN112801138B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110009492.0A CN112801138B (en) 2021-01-05 2021-01-05 Multi-person pose estimation method based on human body topological structure alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110009492.0A CN112801138B (en) 2021-01-05 2021-01-05 Multi-person pose estimation method based on human body topological structure alignment

Publications (2)

Publication Number Publication Date
CN112801138A true CN112801138A (en) 2021-05-14
CN112801138B CN112801138B (en) 2024-04-09

Family

ID=75808399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110009492.0A Active CN112801138B (en) 2021-01-05 2021-01-05 Multi-person pose estimation method based on human body topological structure alignment

Country Status (1)

Country Link
CN (1) CN112801138B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361334A (en) * 2021-05-18 2021-09-07 山东师范大学 Convolutional pedestrian re-identification method and system based on key point optimization and multi-hop attention intention
CN113610102A (en) * 2021-06-23 2021-11-05 浙江大华技术股份有限公司 Training and target segmentation method for segmentation network and related equipment
CN114742890A (en) * 2022-03-16 2022-07-12 西北大学 6D attitude estimation data set migration method based on image content and style decoupling

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203395A (en) * 2016-07-26 2016-12-07 厦门大学 Face character recognition methods based on the study of the multitask degree of depth
CN108205655A (en) * 2017-11-07 2018-06-26 北京市商汤科技开发有限公司 A kind of key point Forecasting Methodology, device, electronic equipment and storage medium
CN109190467A (en) * 2018-07-26 2019-01-11 北京纵目安驰智能科技有限公司 A kind of more object detecting methods, system, terminal and storage medium returned based on key point
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method
WO2020207281A1 (en) * 2019-04-12 2020-10-15 腾讯科技(深圳)有限公司 Method for training posture recognition model, and image recognition method and apparatus
CN111860101A (en) * 2020-04-24 2020-10-30 北京嘀嘀无限科技发展有限公司 Training method and device for face key point detection model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203395A (en) * 2016-07-26 2016-12-07 厦门大学 Face character recognition methods based on the study of the multitask degree of depth
CN108205655A (en) * 2017-11-07 2018-06-26 北京市商汤科技开发有限公司 A kind of key point Forecasting Methodology, device, electronic equipment and storage medium
CN109190467A (en) * 2018-07-26 2019-01-11 北京纵目安驰智能科技有限公司 A kind of more object detecting methods, system, terminal and storage medium returned based on key point
WO2020207281A1 (en) * 2019-04-12 2020-10-15 腾讯科技(深圳)有限公司 Method for training posture recognition model, and image recognition method and apparatus
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method
CN111860101A (en) * 2020-04-24 2020-10-30 北京嘀嘀无限科技发展有限公司 Training method and device for face key point detection model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GDTOP818: "HRNet详解" (Detailed Explanation of HRNet), retrieved from the Internet: <URL: https://blog.csdn.net/weixin_37993251/article/details/88043650?ops_request_misc=&request_id=&biz_id=102&utm_term=HRNet&utm_medium=distribute.pc_search_result.none-task-blog-2> *
Feng Xiaoyue et al.: "二维人体姿态估计研究进展" (Research Progress on Two-Dimensional Human Pose Estimation), in Computer Science (《计算机科学》) *
Zhang Wei et al.: "引入全局约束的精简人脸关键点检测网络" (A Compact Face Keypoint Detection Network with Global Constraints), in Signal Processing (《信号处理》) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361334A (en) * 2021-05-18 2021-09-07 山东师范大学 Convolutional pedestrian re-identification method and system based on key point optimization and multi-hop attention intention
CN113610102A (en) * 2021-06-23 2021-11-05 浙江大华技术股份有限公司 Training and target segmentation method for segmentation network and related equipment
CN114742890A (en) * 2022-03-16 2022-07-12 西北大学 6D attitude estimation data set migration method based on image content and style decoupling

Also Published As

Publication number Publication date
CN112801138B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
CN111902825B (en) Polygonal object labeling system and method for training object labeling system
CN109033107B (en) Image retrieval method and apparatus, computer device, and storage medium
CN107871014A (en) A kind of big data cross-module state search method and system based on depth integration Hash
CN112801138B (en) Multi-person gesture estimation method based on human body topological structure alignment
CN112183577A (en) Training method of semi-supervised learning model, image processing method and equipment
CN110347932B (en) Cross-network user alignment method based on deep learning
CN111709409A (en) Face living body detection method, device, equipment and medium
CN111079532A (en) Video content description method based on text self-encoder
Wang et al. Storm: Structure-based overlap matching for partial point cloud registration
CN112200266B (en) Network training method and device based on graph structure data and node classification method
CN113011568B (en) Model training method, data processing method and equipment
Reddy et al. AdaCrowd: Unlabeled scene adaptation for crowd counting
CN113642602B (en) Multi-label image classification method based on global and local label relation
CN113065409A (en) Unsupervised pedestrian re-identification method based on camera distribution difference alignment constraint
CN116310318B (en) Interactive image segmentation method, device, computer equipment and storage medium
Liao et al. FERGCN: facial expression recognition based on graph convolution network
Cai et al. Multiperspective light field reconstruction method via transfer reinforcement learning
Dai et al. Enhancing two-view correspondence learning by local-global self-attention
CN114140524B (en) Closed loop detection system and method for multi-scale feature fusion
CN116342978A (en) Target detection network training and target detection method and device, and electronic equipment
Wang et al. An Improved Convolutional Neural Network‐Based Scene Image Recognition Method
CN114841887A (en) Image restoration quality evaluation method based on multi-level difference learning
Zhang [Retracted] An Intelligent and Fast Dance Action Recognition Model Using Two‐Dimensional Convolution Network Method
Lin et al. CapsNet meets ORB: A deformation‐tolerant baseline for recognizing distorted targets
Yang et al. Robust feature mining transformer for occluded person re-identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant