CN116612527A - Human body posture estimation method, system and equipment based on double decoders - Google Patents

Human body posture estimation method, system and equipment based on double decoders

Info

Publication number
CN116612527A
Authority
CN
China
Prior art keywords
human body
layer
decoder
global
body posture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310563464.2A
Other languages
Chinese (zh)
Inventor
陈蔚岳
杨刚
戴丽珍
杨辉
邓高强
盛婕
陆荣秀
徐芳萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University
Priority to CN202310563464.2A
Publication of CN116612527A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a human body posture estimation method, system and equipment based on dual decoders, and relates to the field of human body posture estimation. The invention performs feature preprocessing on an RGB image through a CNN network, and encodes and extracts features from the preprocessed feature vector sequence through a Transformer encoder. The output vector sequence with global dependency features is fed simultaneously to a target decoder and a key point decoder. This parallel dual-decoder architecture extracts the global dependency features of individuals and the local dependency features between key points at the same time, which solves the problem of missed key point detection caused by insufficient key point density and shortens training time. The global dependency features and the local dependency features are fused by a fusion device based on a multi-layer perceptron, which solves the problem that a human body cannot be detected, and its posture therefore cannot be estimated, when the body is occluded, and greatly improves the precision of human body posture estimation.

Description

Human body posture estimation method, system and equipment based on double decoders
Technical Field
The invention relates to the technical field of autonomous driving, and in particular to a human body posture estimation method, system and equipment based on dual decoders.
Background
Autonomous driving technology has developed rapidly in recent years and has become one of the most active technologies in the field of intelligent transportation. In practical applications, an autonomous vehicle needs to accurately identify traffic participants on the road, such as pedestrians, cyclists and motorcyclists, and to predict their movement trajectories and intentions, so that it can plan a better path and avoid collisions. Human body posture estimation is a human action recognition technology based on computer vision; the posture state of a human body can be inferred by analyzing information such as the position, angle and speed of the body in space. In the field of autonomous driving, human body posture estimation can be used to identify the posture states of traffic participants such as pedestrians, cyclists and motorcyclists, providing the autonomous vehicle with more accurate information for pedestrian detection and motion trajectory prediction and thereby improving driving safety.
Human body posture estimation is an important research direction in computer vision. It aims to locate human body key points in an image, including the eyes, nose, arms, legs and so on. Existing popular human body posture estimation methods can be roughly classified into top-down and bottom-up methods. A top-down method first detects all human bodies in the image and then estimates the posture of each one. It can exploit the interaction information among the key points of a single human body, but its computational cost is high and its processing speed is slow, and it depends to a great extent on the performance of human body detection: if a human body is occluded, it may not be detected, and its posture then cannot be estimated. A bottom-up method first detects all key points at the same time and then groups them into human body postures. It greatly improves prediction speed, but its performance is affected by image resolution: if the key point density is insufficient, key points may be missed and the posture estimation precision is lower, and the method cannot guarantee posture consistency among different key points.
Disclosure of Invention
In view of the problems described in the background, the invention provides a human body posture estimation method, system and equipment based on dual decoders, so as to shorten human body posture estimation time and improve human body posture estimation precision.
In order to achieve the above object, the present invention provides the following solutions:
In one aspect, the present invention provides a human body posture estimation method based on dual decoders, including:
performing feature preprocessing on the acquired RGB image through a CNN network to obtain a preprocessed feature vector sequence;
encoding and extracting features from the preprocessed feature vector sequence through a Transformer encoder, and outputting a vector sequence with global dependency features;
taking the vector sequence with global dependency features as the input of a target decoder and a key point decoder at the same time, wherein the target decoder extracts the global dependency features of individuals and the key point decoder extracts the local dependency features among key points;
fusing the global dependency features and the local dependency features through a fusion device based on a multi-layer perceptron to obtain the key point coordinates required for human body posture estimation;
and estimating the human body posture according to the key point coordinates.
Optionally, performing feature preprocessing on the acquired RGB image through a CNN network to obtain a preprocessed feature vector sequence specifically includes:
converting the RGB image into a feature vector sequence by using a CNN backbone network.
Optionally, the fusion device based on the multi-layer perceptron comprises, connected in sequence, a first-stage S²-MLP multi-layer perceptron module, a first global average pooling layer, a second-stage S²-MLP multi-layer perceptron module, a second global average pooling layer and a fully connected layer.
Optionally, the S²-MLP multi-layer perceptron module comprises a first fully connected layer, a feature shift module, a first batch normalization layer, a feature compression layer, a second fully connected layer and a second batch normalization layer, joined by residual connections.
In another aspect, the invention also provides a human body posture estimation system based on dual decoders, comprising:
a feature preprocessing unit, used for performing feature preprocessing on the acquired RGB image through a CNN network to obtain a preprocessed feature vector sequence;
an encoding unit, used for encoding and extracting features from the preprocessed feature vector sequence through a Transformer encoder and outputting a vector sequence with global dependency features;
a dual decoding unit, used for taking the vector sequence with global dependency features as the input of the target decoder and the key point decoder at the same time, wherein the target decoder extracts the global dependency features of individuals and the key point decoder extracts the local dependency features among key points;
a feature fusion unit, used for fusing the global dependency features and the local dependency features through a fusion device based on a multi-layer perceptron to obtain the key point coordinates required for human body posture estimation;
and a posture estimation unit, used for estimating the human body posture according to the key point coordinates.
Optionally, the feature preprocessing unit specifically includes:
a feature preprocessing subunit, used for converting the RGB image into a feature vector sequence by using a CNN backbone network.
In another aspect, the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the dual-decoder-based human body posture estimation method when executing the computer program.
Optionally, the memory is a non-transitory computer readable storage medium.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention provides a human body posture estimation method, system and equipment based on dual decoders. The method performs feature preprocessing on an acquired RGB image through a CNN network, encodes and extracts features from the preprocessed feature vector sequence through a Transformer encoder, and feeds the output vector sequence with global dependency features simultaneously to a target decoder and a key point decoder. With this parallel dual-decoder architecture, the global dependency features of individuals and the local dependency features between key points can be extracted at the same time, which solves the problem of missed key point detection caused by insufficient key point density and shortens model training time. Furthermore, the global dependency features and the local dependency features are fused through a fusion device based on a multi-layer perceptron, which solves the problem that a human body cannot be detected, and its posture therefore cannot be estimated, when the body is occluded, and greatly improves the accuracy of human body posture estimation.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of the human body posture estimation method based on dual decoders provided by the invention;
Fig. 2 is a schematic diagram of the human body posture estimation method based on dual decoders provided by the invention;
Fig. 3 is a schematic structural diagram of the fusion device based on a multi-layer perceptron provided by the invention;
Fig. 4 is a schematic diagram of the feature shift module provided by the invention;
Fig. 5 is a schematic diagram of the feature compression layer provided by the invention;
Fig. 6 is a schematic diagram of a test picture for the human body posture estimation method based on dual decoders according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.
The invention aims to provide a human body posture estimation method, system and equipment based on dual decoders, so as to shorten human body posture estimation time and improve human body posture estimation precision.
In order to make the above objects, features and advantages of the present invention more readily apparent, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of the human body posture estimation method based on dual decoders provided by the invention, and Fig. 2 is a schematic diagram of the same method. Referring to Fig. 1 and Fig. 2, the human body posture estimation method based on dual decoders includes:
Step 1: Perform feature preprocessing on the acquired RGB image through a CNN network to obtain a preprocessed feature vector sequence.
The original three-channel RGB image is preprocessed by a CNN network, whose main function here is to convert the image into a vector sequence for subsequent processing and classification. Specifically, by using a CNN backbone network, the invention can compress the RGB image into a smaller feature vector sequence without losing much information; this sequence can then be input to other modules for further feature extraction.
To convert the RGB image into a vector sequence, the HRNet model is used for feature preprocessing, which converts the three-channel RGB matrix of the image into a feature vector sequence of higher dimension.
Step 2: Encode and extract features from the preprocessed feature vector sequence through a Transformer encoder, and output a vector sequence with global dependency features.
A 2D position embedding matrix is added to the preprocessed feature vector sequence, and image features are then extracted by the Transformer encoder. The Transformer encoder is composed of a plurality of self-attention layers and feedforward neural networks, and is used for encoding the feature vector sequence and extracting features. The Transformer encoder captures the global dependencies in the feature vector sequence, so that the image features are understood more accurately.
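The text does not state which form the 2D position embedding matrix takes. As a minimal sketch, assuming a fixed sine/cosine embedding of the kind often used with Transformer encoders on image features, the addition could look as follows. The function names, the `temperature` constant and the row-major token order are all illustrative assumptions, not details taken from the invention.

```python
import math

def sincos_1d(pos, dim, temperature=10000.0):
    """Sine/cosine code of one coordinate; dim must be even (assumed form)."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / temperature ** (2 * i / dim)
        out += [math.sin(pos * freq), math.cos(pos * freq)]
    return out

def position_embedding_2d(height, width, dim):
    """One dim-length vector per grid cell: half encodes y, half encodes x."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return [sincos_1d(y, half) + sincos_1d(x, half)
            for y in range(height) for x in range(width)]

def add_position(tokens, height, width):
    """Add the embedding to a row-major sequence of height*width feature vectors."""
    pos = position_embedding_2d(height, width, len(tokens[0]))
    return [[t + p for t, p in zip(tok, pe)] for tok, pe in zip(tokens, pos)]

# Hypothetical 2 x 3 feature map with 8-dimensional tokens.
tokens = [[0.0] * 8 for _ in range(6)]
encoded = add_position(tokens, height=2, width=3)
```

A learned embedding table would serve the same purpose; the fixed form is used here only because it needs no training.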
Step 3: Take the vector sequence with global dependency features as the input of a target decoder and a key point decoder at the same time, wherein the target decoder extracts the global dependency features of individuals and the key point decoder extracts the local dependency features among key points.
The target decoder used in the invention is a Transformer decoder, and the key point decoder has a similar structure. The input of the target decoder is N target Query matrices representing targets; in the key point decoder, the target Query matrices are replaced by key point Query matrices representing M key points. N is set according to the number of targets in the image and is generally 100; M is set according to the number of human body key points and is usually 16 or 17.
The target decoder obtains the human body bounding boxes in the image by extracting the global dependency feature information of individuals; the key point decoder obtains feature vectors containing a plurality of human body key points by extracting the local dependency features among key points.
Step 4: the global dependence characteristic and the local dependence characteristic are fused by a fusion device based on a multi-layer perceptron, so that key point coordinates required by human body posture estimation are obtained.
Fig. 3 is a schematic structural diagram of the fusion device based on a multi-layer perceptron. Referring to Fig. 3, the main body of the fusion device comprises two S²-MLP multi-layer perceptron modules, two global average pooling layers and one fully connected layer, and is expressed as:
$$y = \mathrm{fc}\big(\mathrm{gap}\big(\mathrm{S^2MLP}\big(\mathrm{gap}\big(\mathrm{S^2MLP}(x)\big)\big)\big)\big)$$
where fc denotes the fully connected layer and gap denotes the global average pooling layer; x denotes the input feature vector of the fusion device and y denotes the output feature vector; S²MLP denotes the S²-MLP multi-layer perceptron module.
In the S²-MLP multi-layer perceptron module, the feature shift module and the feature compression layer play the main role, alongside the fully connected layers, the normalization layers and the residual connections.
Specifically, the fusion device based on the multi-layer perceptron comprises, connected in sequence, a first-stage S²-MLP multi-layer perceptron module, a first global average pooling layer, a second-stage S²-MLP multi-layer perceptron module, a second global average pooling layer and a fully connected layer. The S²-MLP multi-layer perceptron module comprises a first fully connected layer, a feature shift module, a first batch normalization layer, a feature compression layer, a second fully connected layer and a second batch normalization layer, joined by residual connections.
Fig. 4 is a schematic diagram of the feature shift module provided by the present invention. Referring to Fig. 4, the feature shift module is responsible for translating features along the spatial direction. Assume the input feature matrix has size C×W with C = 5 and W = 5. The input feature matrix is divided into 5 parts, which are shifted by {-2, -1, 0, 1, 2} units in the horizontal direction respectively. The features shifted out of the solid-line frame are then used to fill the positions left empty inside the frame, so that the feature size is the same before and after shifting. Because the parts are shifted by different amounts, information from different spatial locations is brought together and can interact and flow fully.
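The C = 5, W = 5 example above can be sketched in a few lines. This is only an illustration under stated assumptions: the five parts are taken to be groups of channels (rows of the C×W matrix), a positive shift is taken to mean movement to the right, and the function name `shift_features` is hypothetical.

```python
def shift_features(x, shifts=(-2, -1, 0, 1, 2)):
    """Circularly shift each channel group of a C x W matrix along the width axis."""
    channels = len(x)
    if channels % len(shifts) != 0:
        raise ValueError("channel count must be divisible by the number of shifts")
    per_group = channels // len(shifts)
    out = []
    for ch, row in enumerate(x):
        s = shifts[ch // per_group]
        w = len(row)
        # Values pushed past one border re-enter at the opposite border,
        # so the feature size is unchanged by the shift.
        out.append([row[(j - s) % w] for j in range(w)])
    return out

# 5 x 5 input whose entry in channel ch, column j is 10 * ch + j.
x = [[10 * ch + j for j in range(5)] for ch in range(5)]
shifted = shift_features(x)
for row in shifted:
    print(row)
```

After the shift, each column of the result holds values that originally sat in five different columns, which is what allows a following per-position fully connected layer to mix information across spatial locations.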
Fig. 5 is a schematic diagram of the feature compression layer provided by the present invention. Referring to Fig. 5, the feature compression layer is responsible for token compression along the spatial direction, while allowing the image dimension to increase. Given an input feature $X_i$, the different tokens fully exchange information through compression and rearrangement. Specifically, all tokens in $X_i$ are compressed along the channel dimension to obtain a new feature $X'_i$; the new feature $X'_i$ is then input into a fully connected layer for information mixing. In this way, the different feature matrices in each region are fully mixed to produce the output features. The compression operation may additionally be followed by a nonlinear activation function (e.g., ReLU or GELU) and a normalization layer (e.g., BN or LN) to improve efficiency and training stability.
The invention uses the fusion device structure shown in Fig. 3 to fuse the global dependency feature information of individuals with the local dependency features between key points. During fusion, the feature vectors are shifted and compressed in the spatial direction so that information from different dimensions is merged, and the coordinates of all human body key points in the image are finally obtained.
The loss function is based on the Hungarian algorithm, and each prediction is scored according to the target detection category, the target detection bounding box, the key point category and the key point coordinates.
For each human target i in the ground-truth labels, all key point coordinates are written $A_i = [(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)]$. To make the loss calculation more reasonable, a binary matrix $V_i = [v_1, v_1, v_2, v_2, \ldots, v_k, v_k]$ indicates whether each key point is visible (0 indicates invisible, 1 indicates visible), where k is the number of key point categories. The ground-truth matrix is thus denoted $y_i = (A_i, V_i)$, and the matrix of model predictions is expressed in the corresponding form [expression not reproduced in this text]. The matching function between the key point prediction and the ground truth is therefore defined as:
[formula not reproduced in this text]
where the L1 loss is used to compute the error between the predicted and true key point coordinates, the L2 loss is used to compute the error of the visibility matrix, $\lambda_{loc}$ and $\lambda_{vis}$ denote the coordinate error coefficient and the visibility error coefficient respectively, and $\sigma(i)$ denotes the index of a prediction.
After the matching function is determined, the bipartite matching with the lowest matching cost is computed with the Hungarian algorithm:
[formula not reproduced in this text]
where the minimum is taken over permutations of N elements, that is, over the elements of the symmetric group on N elements.
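Because the matching formulas themselves are not reproduced in this text, the following is only a sketch of how such a matching could be computed, not the invention's exact definition. It assumes that the coordinate term is an L1 error masked by the ground-truth visibility (suggested by the doubled layout of $V_i$), that the visibility term is a squared L2 error, and that the two are combined with the weights $\lambda_{loc}$ and $\lambda_{vis}$. The function names are hypothetical, and the exhaustive search over permutations stands in for the Hungarian algorithm, which reaches the same optimum in polynomial time.

```python
from itertools import permutations

def match_cost(a_true, v_true, a_pred, v_pred, lam_loc=1.0, lam_vis=1.0):
    """Cost of pairing one ground-truth pose with one prediction.

    a_*: flattened coordinates [x1, y1, ..., xk, yk]
    v_*: visibility flags repeated per coordinate [v1, v1, ..., vk, vk]
    """
    # L1 coordinate error, counted only for visible key points (assumed masking).
    loc = sum(v * abs(p - t) for t, v, p in zip(a_true, v_true, a_pred))
    # Squared L2 error on the visibility flags (assumed form).
    vis = sum((p - t) ** 2 for t, p in zip(v_true, v_pred))
    return lam_loc * loc + lam_vis * vis

def best_assignment(targets, preds, **weights):
    """Return (perm, cost): perm[i] is the prediction index matched to target i."""
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(*targets[i], *preds[j], **weights)
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Two ground-truth poses with two key points each, and three predictions.
targets = [([0.2, 0.2, 0.4, 0.4], [1, 1, 1, 1]),
           ([0.8, 0.8, 0.6, 0.6], [1, 1, 0, 0])]
preds = [([0.8, 0.8, 0.1, 0.1], [1, 1, 0, 0]),
         ([0.2, 0.2, 0.4, 0.5], [1, 1, 1, 1]),
         ([0.5, 0.5, 0.5, 0.5], [0, 0, 0, 0])]
perm, cost = best_assignment(targets, preds)
```

In this toy case the first target is matched to the second prediction and the second target to the first prediction; the wrong coordinates of the invisible key point in the first prediction do not add to the cost because of the mask.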
The final loss function of the model sums the Hungarian loss over all optimally assigned pairs, that is, it forms a linear combination of the negative log-likelihood of the target detection class, the target bounding box loss and the key point loss defined above:
[formula not reproduced in this text]
where the terms of the formula denote, respectively, a Query matrix that considers only detected targets, the conventional Hungarian matching on the target bounding box coordinates, and the logarithm of the target class; $b_i$, $c_i$ and a third symbol [not reproduced in this text] denote the i-th target coordinates, category and category confidence respectively; K denotes the number of key points output by the key point decoder; N denotes the number of targets output by the target decoder.
Step 5: and estimating the human body posture according to the coordinates of the key points.
The invention adopts a parallel dual-decoder architecture consisting of the target decoder and the key point decoder, which can simultaneously extract the global dependency feature information of individuals and the local dependency feature information between key points. This solves the problem of missed key point detection caused by insufficient key point density, shortens model training time and shortens human body posture estimation time. Furthermore, the invention adopts a fusion device structure based on a multi-layer perceptron, equipped with a feature shift module and a feature compression layer, to fuse the global dependency features and the local dependency features. This solves the problem that a human body cannot be detected, and its posture therefore cannot be estimated, when the body is occluded, and greatly improves the accuracy of human body posture estimation.
The technical effects of the method of the present invention are verified by the following specific example.
Data set selection: In this embodiment, the training and testing framework is PyTorch, and the public academic benchmark data sets COCO2017 and MPII are selected. The COCO2017 data set contains 200,000 images and 250,000 human instances, each labeled with 17 key points; its public training and validation sets contain 150,000 people and 1.7 million labeled key points. The MPII data set contains about 25k images extracted from YouTube videos, including 40k human instances each labeled with 16 key points, as shown in Fig. 6. Of these, 28k are typically used for training and 11k for testing. The data consists mainly of multi-person scenes and provides validation and test sets for single-frame single-person postures, single-frame multi-person postures and video multi-person postures; most methods mainly use the single-frame multi-person test set. Body part occlusion and 3D torso and head orientation labels are also recorded in the test set.
Analysis of experimental results:
(1) COCO dataset experimental results: The model uses the Adam optimizer with parameters $a = 0.0001$, $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The initial learning rate is $10^{-3}$, training runs for 300 epochs, and the learning rate is decayed at a ratio of 10% at the 200th and the 250th epoch. The CNN backbone modules used in all models are pretrained on the ImageNet data set. To make better use of the training data, data augmentation methods including random rotation, random scaling and random cropping are used. In the training stage, the model input image size is fixed to 512×512 and the patch size of the feature map is 8×8. In the COCO experiments, the preset number of detection target queries is $L_{ob} = 103$ and the preset number of key point queries is $L_{kp} = 17$. The experimental results are shown in Table 1, where the method is compared on the COCO test-dev set with state-of-the-art methods including DPIT, GRMI, CPN, RMPE, SimpleBaseline and HRNet; DPIT is a two-channel method, SimpleBaseline, GRMI and RMPE are top-down methods, and CPN and HRNet are bottom-up methods. The methods are compared quantitatively across different backbones and input resolutions. In the TokenPose model, to keep the parameter count of the comparison model comparable to that of the model of the present invention, the first three stages of HRNet are used as the backbone network instead of the entire HRNet network. In this case, the network has only 25% of the parameters of the original version, as shown by HRNet-W48 in Table 1. The method of the present invention is implemented in two versions: DPET-B, which comprises a 6-layer decoder (3-layer target decoder + 3-layer key point decoder), and DPET-L, which comprises an 8-layer decoder (4-layer target decoder + 4-layer key point decoder).
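The learning rate schedule described above can be written as a small step function. This is a sketch under two assumptions that the text leaves open: decaying "at a ratio of 10%" is read as multiplying the current rate by 0.1, and epochs are counted so that the change takes effect from epoch 200 and epoch 250 onward. The function name and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(200, 250), gamma=0.1):
    """Step schedule: multiply the rate by gamma at every milestone already reached."""
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** reached

# Rate at a few points of the 300-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 199, 200, 249, 250, 299)}
```

Under these assumptions the rate is 1e-3 up to epoch 199, about 1e-4 from epoch 200 to 249, and about 1e-5 from epoch 250 to the end of training.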
Compared with SimpleBaseline and HRNet, the method of the invention (DPET) achieves better performance while being more lightweight. Quantitatively, DPET-L improves on HRNet-W48 by 0.7 AP and 0.4 AR, which demonstrates the superiority of the method of the invention. Compared with the other methods, DPET shows the best performance on AP and AR and obtains comparable results on the other metrics.
Table 1. Experimental results on the COCO test set
(2) MPII dataset experimental results: The PCKh@0.5 results on the MPII validation set are shown in Table 2, with the input image size unified to 256×256. In the MPII experiments, the preset number of detection target queries is $L_{ob} = 104$ and the preset number of key point queries is $L_{kp} = 16$. Specifically, compared with SimpleBaseline and HRNet, DPET-L achieves the best performance on the Elb, Wri, Ank and Mean indexes, and reaches the second-best level on most of the other indexes.
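For reference, PCKh@0.5 counts a predicted key point as correct when it lies within 0.5 times the head segment length of its ground-truth position. The following sketch computes that fraction; the function name, the data layout and the optional visibility mask are illustrative assumptions, and the exact head-size convention of the official MPII evaluation is not reproduced here.

```python
import math

def pckh(pred, gt, head_sizes, visible=None, alpha=0.5):
    """Fraction of key points within alpha * head size of the ground truth.

    pred, gt: one list of (x, y) key points per person
    head_sizes: one head segment length per person (hypothetical input)
    visible: optional per-person list of flags; unlabeled joints are skipped
    """
    correct = total = 0
    for p, (pred_pts, gt_pts, head) in enumerate(zip(pred, gt, head_sizes)):
        for k, ((px, py), (gx, gy)) in enumerate(zip(pred_pts, gt_pts)):
            if visible is not None and not visible[p][k]:
                continue
            total += 1
            if math.hypot(px - gx, py - gy) <= alpha * head:
                correct += 1
    return correct / total if total else 0.0

# One person, two key points, head size 10: the threshold distance is 5 pixels.
score = pckh(pred=[[(10, 10), (20, 20)]],
             gt=[[(11, 10), (30, 20)]],
             head_sizes=[10])
```

Per-joint scores such as Elb, Wri and Ank are obtained by restricting the count to the corresponding key point indices, and Mean aggregates over all joints.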
Table 2. Experimental results on the MPII test set
It can be seen that, in the human body posture estimation method based on dual decoders provided by the invention, first, the conventional top-down and bottom-up estimation methods are abandoned, and the target decoder and the key point decoder carry out target detection and key point estimation simultaneously, which effectively solves the problem of missed key point detection caused by target occlusion in conventional methods. Second, a fusion device based on a multi-layer perceptron, equipped with a feature shift module and a feature compression layer, fuses the individual global feature information extracted by the target decoder with the local dependency features between key points extracted by the key point decoder, so that the human body posture estimation accuracy is higher than that of the other methods and posture continuity between different key points can be ensured. Finally, the soundness and effectiveness of the method of the invention are demonstrated on the COCO2017 and MPII benchmark data sets. The method provided by the invention balances accuracy and speed well, offers good real-time performance, can estimate pedestrian posture accurately and in time, and is suitable for application and popularization.
Based on the method provided by the invention, the invention also provides a human body posture estimation system based on double decoders, which comprises:
the feature preprocessing unit is used for performing feature preprocessing on the acquired RGB image through a CNN network to obtain a preprocessed feature vector sequence;
the encoding unit is used for encoding the preprocessed feature vector sequence and extracting features from it through a Transformer encoder, and outputting a vector sequence with global dependency features;
the double decoding unit is used for taking the vector sequence with global dependency features as the input of the target decoder and the key point decoder at the same time, wherein the target decoder extracts the global dependency features of the individual, and the key point decoder extracts the local dependency features between the key points;
the feature fusion unit is used for fusing the global dependency features and the local dependency features through a fusion device based on a multi-layer perceptron, so as to obtain the key point coordinates required for human body posture estimation;
and the posture estimation unit is used for estimating the human body posture according to the key point coordinates.
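The data flow between the units listed above can be sketched as a simple composition of stages. This is an illustrative outline only: the class name, field names and type aliases are assumptions introduced here, and each stage is an arbitrary callable standing in for the corresponding network, not the patented implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Features = List[List[float]]            # a sequence of feature vectors
Keypoints = List[Tuple[float, float]]   # one (x, y) pair per key point


@dataclass
class DualDecoderPipeline:
    """Hypothetical wiring of the five units; every stage is a placeholder."""
    backbone: Callable[[Any], Features]              # CNN feature preprocessing
    encoder: Callable[[Features], Features]          # Transformer encoder
    target_decoder: Callable[[Features], Features]   # per-person global features
    keypoint_decoder: Callable[[Features], Features] # local features between key points
    fuser: Callable[[Features, Features], Keypoints] # MLP-based fusion device

    def __call__(self, image: Any) -> Keypoints:
        tokens = self.backbone(image)
        memory = self.encoder(tokens)
        # Both decoders receive the same encoder output.
        global_feats = self.target_decoder(memory)
        local_feats = self.keypoint_decoder(memory)
        return self.fuser(global_feats, local_feats)
```

The point of the sketch is the branching: the encoder output is shared, the two decoders run on it independently, and only the fusion stage sees both results.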
The feature preprocessing unit specifically comprises:
a feature preprocessing subunit for converting the RGB image into a feature vector sequence by using a CNN backbone network.
The fusion device based on the multi-layer perceptron comprises, connected in sequence, a stage-1 S²-MLP multi-layer perceptron module, a first global average pooling layer, a stage-2 S²-MLP multi-layer perceptron module, a second global average pooling layer and a fully connected layer.
The S²-MLP multi-layer perceptron module comprises a first fully connected layer, a feature shift module, a first batch normalization layer, a feature compression layer, a second fully connected layer and a second batch normalization layer, connected in a residual manner.
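This section does not spell out how the feature shift module operates. As a rough illustration, the sketch below assumes a shift in the style of S²-MLP applied along the token axis: the channels are split into two groups, one group takes its values from the previous token and the other from the next token, and boundary tokens keep their own values. The function name and the two-group split are assumptions made here for illustration.

```python
from typing import List


def feature_shift(x: List[List[float]]) -> List[List[float]]:
    """Shift channel groups along the token axis (assumed, illustrative variant).

    x is a list of tokens, each a list of channel values.
    """
    n_tokens = len(x)
    if n_tokens == 0:
        return []
    half = len(x[0]) // 2
    out = [row[:] for row in x]  # copy so the input is left untouched
    # First channel group: each token takes the values of the previous token.
    for t in range(1, n_tokens):
        out[t][:half] = x[t - 1][:half]
    # Second channel group: each token takes the values of the next token.
    for t in range(n_tokens - 1):
        out[t][half:] = x[t + 1][half:]
    return out
```

A shift of this kind has no parameters; it lets the fully connected layers that follow it mix information from neighbouring tokens.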
Further, the present invention also provides an electronic device, which may include a processor, a communication interface, a memory and a communication bus. The processor, the communication interface and the memory communicate with each other through the communication bus. The processor may invoke a computer program in the memory to perform the human body posture estimation method based on double decoders.
Furthermore, the computer program in the above memory may be stored in a non-transitory computer readable storage medium when it is implemented in the form of a software functional unit and sold or used as a separate product. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk.
The invention adopts a shared encoder and two decoders for feature extraction, wherein the two decoders respectively extract the global dependency information of individuals and the local dependency information between key points, and adopts a feature fusion network based on a multi-layer perceptron to fuse the information extracted by the two decoders, thereby establishing a human body posture estimation method with global information that greatly shortens the estimation time and improves the estimation accuracy, and that has broad application prospects.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.
The principles and embodiments of the present invention have been described herein with reference to specific examples, which are intended only to assist in understanding the method of the present invention and its core ideas; meanwhile, those of ordinary skill in the art may, in light of the ideas of the present invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this description should not be construed as limiting the invention.

Claims (10)

1. A human body posture estimation method based on double decoders, comprising:
performing feature preprocessing on the acquired RGB image through a CNN network to obtain a preprocessed feature vector sequence;
encoding the preprocessed feature vector sequence and extracting features from it through a Transformer encoder, and outputting a vector sequence with global dependency features;
taking the vector sequence with global dependency features as the input of a target decoder and a key point decoder at the same time, wherein the target decoder extracts the global dependency features of an individual, and the key point decoder extracts the local dependency features between key points;
fusing the global dependency features and the local dependency features through a fusion device based on a multi-layer perceptron, so as to obtain the key point coordinates required for human body posture estimation;
and estimating the human body posture according to the key point coordinates.
2. The human body posture estimation method based on double decoders according to claim 1, wherein performing feature preprocessing on the acquired RGB image through a CNN network to obtain a preprocessed feature vector sequence specifically comprises:
converting the RGB image into a feature vector sequence by using a CNN backbone network.
3. The human body posture estimation method based on double decoders according to claim 1, wherein the fusion device based on the multi-layer perceptron comprises, connected in sequence, a stage-1 S²-MLP multi-layer perceptron module, a first global average pooling layer, a stage-2 S²-MLP multi-layer perceptron module, a second global average pooling layer and a fully connected layer.
4. The human body posture estimation method based on double decoders according to claim 3, wherein the S²-MLP multi-layer perceptron module comprises a first fully connected layer, a feature shift module, a first batch normalization layer, a feature compression layer, a second fully connected layer and a second batch normalization layer, connected in a residual manner.
5. A dual decoder-based human body pose estimation system, comprising:
the feature preprocessing unit is used for performing feature preprocessing on the acquired RGB image through a CNN network to obtain a preprocessed feature vector sequence;
the encoding unit is used for encoding the preprocessed feature vector sequence and extracting features from it through a Transformer encoder, and outputting a vector sequence with global dependency features;
the double decoding unit is used for taking the vector sequence with global dependency features as the input of the target decoder and the key point decoder at the same time, wherein the target decoder extracts the global dependency features of the individual, and the key point decoder extracts the local dependency features between the key points;
the feature fusion unit is used for fusing the global dependency features and the local dependency features through a fusion device based on a multi-layer perceptron, so as to obtain the key point coordinates required for human body posture estimation;
and the posture estimation unit is used for estimating the human body posture according to the key point coordinates.
6. The human body posture estimation system based on double decoders according to claim 5, wherein the feature preprocessing unit specifically comprises:
a feature preprocessing subunit for converting the RGB image into a feature vector sequence by using a CNN backbone network.
7. The dual decoder-based human body posture estimation system of claim 5, wherein the fusion device based on the multi-layer perceptron comprises, connected in sequence, a stage-1 S²-MLP multi-layer perceptron module, a first global average pooling layer, a stage-2 S²-MLP multi-layer perceptron module, a second global average pooling layer and a fully connected layer.
8. The dual decoder-based human body posture estimation system of claim 7, wherein the S²-MLP multi-layer perceptron module comprises a first fully connected layer, a feature shift module, a first batch normalization layer, a feature compression layer, a second fully connected layer and a second batch normalization layer, connected in a residual manner.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the dual decoder based human pose estimation method according to any of claims 1 to 4 when executing the computer program.
10. The electronic device of claim 9, wherein the memory is a non-transitory computer readable storage medium.
CN202310563464.2A 2023-05-18 2023-05-18 Human body posture estimation method, system and equipment based on double decoders Pending CN116612527A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310563464.2A CN116612527A (en) 2023-05-18 2023-05-18 Human body posture estimation method, system and equipment based on double decoders

Publications (1)

Publication Number Publication Date
CN116612527A true CN116612527A (en) 2023-08-18

Family

ID=87683002

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination