CN114677707A - Human body posture estimation method based on multi-resolution feature fusion network - Google Patents

Human body posture estimation method based on multi-resolution feature fusion network

Info

Publication number
CN114677707A
CN114677707A
Authority
CN
China
Prior art keywords
layer
human body
convolution
resolution
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210262826.XA
Other languages
Chinese (zh)
Inventor
段维柏
吴小杰
王啸
周奂斌
王超
伍腾飞
黄海徽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Sanjiang Aerospace Wanfeng Technology Development Co Ltd
Original Assignee
Hubei Sanjiang Aerospace Wanfeng Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Sanjiang Aerospace Wanfeng Technology Development Co Ltd filed Critical Hubei Sanjiang Aerospace Wanfeng Technology Development Co Ltd
Priority to CN202210262826.XA priority Critical patent/CN114677707A/en
Publication of CN114677707A publication Critical patent/CN114677707A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a human body posture estimation method based on a multi-resolution feature fusion network. The method mainly comprises: constructing a multi-resolution feature fusion network model having a plurality of feature extraction network branches in a parallel network structure, acquiring image features with different resolutions, and obtaining a corresponding single-person posture estimation result. A multi-receptive-field fusion module is embedded in the multi-resolution feature fusion network model; the module comprises a plurality of convolution kernels with different sizes arranged in parallel. The image features are subjected to a convolution operation by the first layer of convolution kernels to extract the first layer of receptive field features; each subsequent layer of convolution kernels respectively obtains the image features and the receptive field features output by the previous layer of convolution kernels, and the receptive field features output by each layer of convolution kernels are fused. The invention adopts a multi-receptive-field fusion module, makes full use of context information, enhances the discriminability of each resolution feature, realizes the detection of "difficult key points" and improves the detection accuracy.

Description

Human body posture estimation method based on multi-resolution feature fusion network
Technical Field
The application relates to the technical field of deep learning and computer vision, in particular to a human body posture estimation method based on a multi-resolution feature fusion network.
Background
Human Pose Estimation (HPE), as one of the most challenging problems in the field of computer vision, has long been a focus of researchers. How to let a computer understand human behavior is a very important research problem, and human posture estimation can be regarded as a basic task of this problem.
The goal of human body posture estimation is to accurately detect key points of human body parts such as the wrist, elbow and shoulder from a given image, and to connect the detected key points according to the structure of the human skeleton to form the posture of the human body. Deep-learning-based human posture estimation follows two ideas. One is the Top-down idea: all human instance boxes in the input image are first detected, and the posture of the human body in each box is then estimated independently. The other is the Bottom-up idea: all key points of all human instances contained in the input image are detected at once, and the detected key points are then assigned to their corresponding human instances by some grouping algorithm.
For human posture estimation following the Top-down idea, the main task is to accurately locate the positions of the key points. Therefore, from the feature perspective, the two key problems in achieving high-accuracy human posture estimation are: (1) key-point localization places high demands on spatial sensitivity; (2) human postures are complex and varied, so key points are difficult to distinguish.
Regarding key problem (1), the high demand of key-point localization on spatial sensitivity is one of the challenges of the human body posture estimation task. With the rise of deep learning, various deep convolutional neural networks have been used for image feature extraction; most of these networks are based on the structure of a traditional classification network and extract deep semantic information from the image for the corresponding task. Much spatial information is lost in this process, and the resulting low-resolution feature map is not suitable for the human body posture estimation task.
Regarding key problem (2), for images of difficult scenes such as complex backgrounds, complex human motions or occlusion (which contain "difficult key points"), human body posture estimation is a hard task. Most existing methods improve the detection accuracy of "difficult key points" through multi-scale feature fusion or by training with corresponding additional data sets. Multi-scale feature fusion, while making the learned features richer, may lack context information as guidance. Adding additional data sets is expensive and consumes large amounts of resources (annotating data sets is time-consuming and labor-intensive).
Human body posture estimation is widely applied in many fields, including human-computer interaction, behavior recognition and video surveillance, all of which are closely related to people's daily life. With the development of science and technology, people's demands on quality of life keep increasing, so research on human posture estimation has good prospects. For a long period of development, the performance of human posture estimation algorithms could not reach the level required for practical application, but the emergence of deep learning technology changed this situation. Current methods mainly improve existing network models, for example by migrating networks from classification tasks to the human body posture estimation task. However, classification networks generally ignore the importance of spatial information, while human posture estimation places very high requirements on the accurate localization of each joint point.
Disclosure of Invention
In view of at least one defect or improvement requirement of the prior art, the present invention provides a human body posture estimation method based on a multi-resolution feature fusion network, which aims to improve the detection accuracy of human body key points (including "difficult key points") in an image of a complex scene.
In order to achieve the above object, according to an aspect of the present invention, there is provided a human body posture estimation method based on a multi-resolution feature fusion network, including the following steps:
Acquiring an image to be detected and inputting the image to be detected into a human body detection network model to acquire a suggestion frame of each human body example in the image to be detected;
inputting each suggestion box into a trained multi-resolution feature fusion network model respectively to obtain a corresponding single posture estimation result;
the multi-resolution feature fusion network model is provided with a plurality of feature extraction network branches forming a parallel network structure, the feature extraction network branches being respectively used for acquiring image features with different resolutions in a suggestion frame, and one or more multi-receptive-field fusion modules being embedded in each feature extraction network branch; posture estimation is carried out based on the feature obtained by fusing the image features with different resolutions, so as to obtain a single-person posture estimation result;
each multi-receptive-field fusion module comprises a plurality of convolution kernels with different sizes, the convolution kernels of the first layer carry out convolution operation on image features, and the first layer of receptive field features are extracted; the next layer of convolution kernel respectively acquires the image characteristics and the receptive field characteristics output by the previous layer of convolution kernel and carries out convolution operation; and fusing the receptive field characteristics output by each layer of convolution kernel.
Further, in the human body posture estimation method based on the multi-resolution feature fusion network, the multi-receptive field fusion module includes four layers of convolution kernels, which are convolution kernels with sizes of 1 × 1, 3 × 3, 5 × 5 and 7 × 7 respectively, and are used for extracting the receptive field features with corresponding sizes respectively.
Further, in the human body posture estimation method based on the multi-resolution feature fusion network, the multi-receptive-field fusion module further includes a preceding 1 × 1 convolution kernel for obtaining the input image features, adjusting the number of their channels, and outputting the image features with the adjusted number of channels to each layer of convolution kernels.
Further, in the human body posture estimation method based on the multi-resolution feature fusion network, the multi-receptive field fusion module further includes a post-positioned 1 × 1 convolution kernel for obtaining the receptive field features output by each layer of convolution kernels and restoring to the original channel number.
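The channel bookkeeping implied by the preceding and succeeding 1 × 1 kernels can be sanity-checked with a short sketch. The four-way split into quarter-sized channel groups follows the embodiment described later in the specification; the function itself is illustrative only, not part of the patent:

```python
def mrf_channel_flow(c_in):
    """Channels through the module: compress, four kernel layers, concat, restore."""
    c_branch = c_in // 4                # preceding 1x1 conv: C -> C/4
    branch_channels = [c_branch] * 4    # 1x1, 3x3, 5x5, 7x7 kernel layers
    c_concat = sum(branch_channels)     # channel-wise concat of Y1..Y4 -> C
    c_out = c_in                        # succeeding 1x1 conv restores C
    return branch_channels, c_concat, c_out
```

The concat of the four quarter-width branches lands exactly back on the original channel count, which is what lets the module fuse its output with the input feature element-wise.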
Furthermore, in the human body posture estimation method based on the multi-resolution feature fusion network, in a plurality of feature extraction network branches of the multi-resolution feature fusion network model,
a first layer of feature extraction network branches to obtain a suggestion frame of each human body example, and image features with preset resolution are extracted and maintained from the suggestion frame;
and each subsequent layer of feature extraction network branch downsamples the image features maintained by the previous layer of feature extraction network branch; after multi-receptive-field fusion, the generated image features are restored to the image features with the preset resolution through upsampling and are fused with the image features maintained by the first layer of feature extraction network branch.
Further, in the above method for estimating a human body pose based on a multi-resolution feature fusion network, the first layer of feature extraction network branches includes a convolution module, and the other layers of feature extraction network branches include a convolution module and a down-sampling module;
the convolution module consists of a convolution layer, a batch normalization (BN) layer and an activation function; the down-sampling module consists of a strided convolution layer, a BN layer and an activation function.
Further, in the human body posture estimation method based on the multi-resolution feature fusion network, the training process of the multi-resolution feature fusion network model is as follows:
acquiring a training image sample;
inputting training image samples into a human body detection network model to obtain a suggestion box of each human body example in an image;
cutting and zooming the suggestion box of each human body example, and inputting the cut and zoomed suggestion box into a multi-resolution feature fusion network model for single posture estimation;
and calculating a target loss function according to the prediction result of the single attitude estimation and the label truth value of the training image sample, and performing iterative training on the multi-resolution feature fusion network model based on the target loss function to obtain the trained multi-resolution feature fusion network model.
Further, in the human body posture estimation method based on the multi-resolution feature fusion network, the target loss function is the mean squared error (MSE) function.
Further, in the human body posture estimation method based on the multi-resolution characteristic fusion network, the human body detection network model adopts a Faster R-CNN network model.
Further, in the human body posture estimation method based on the multi-resolution feature fusion network, the training and testing of the multi-resolution feature fusion network model are completed under the deep learning PyTorch framework.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
(1) The invention provides a human body posture estimation method based on a multi-resolution feature fusion network. Through a plurality of convolution kernels with different sizes arranged in parallel, the first layer of convolution kernels performs a convolution operation on the image features and extracts the first layer of receptive field features; each subsequent layer of convolution kernels respectively acquires the image features and the receptive field features output by the previous layer and performs a convolution operation, and the receptive field features of different ranges are fused to enrich the information content of the learned features. The process of multi-receptive-field fusion makes full use of context information, enhances the discriminability of each resolution feature, realizes the detection of "difficult key points" in difficult scenes such as complex backgrounds, complex human body motions or occlusion, and improves the detection accuracy;
(2) The invention provides a human body posture estimation method based on a multi-resolution feature fusion network in which a plurality of feature extraction network branches form a parallel network structure. The first layer of feature extraction network branch acquires the suggestion frame of each human body instance and extracts and maintains image features with a preset resolution from it, preserving shallow spatial information; each subsequent layer of feature extraction network branch downsamples the image features maintained by the previous layer, and the generated image features, after multi-receptive-field fusion, are restored to the image features with the preset resolution through upsampling and fused with the image features maintained by the first layer of feature extraction network branch, thereby enhancing the expressive capability of semantic information.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a schematic flow chart of human body pose estimation for the top-down approach of the present embodiment;
FIG. 2 is a flowchart of a human body posture estimation method based on a multi-resolution feature fusion network according to the present embodiment;
FIG. 3 is a block diagram of the multi-receptive-field fusion module according to the present embodiment;
FIG. 4 is a diagram illustrating a multi-resolution feature fusion network according to the present embodiment;
fig. 5 is a flowchart of a multi-resolution feature fusion network model training process in this embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
In other instances, well-known or widely used techniques, elements, structures and processes may not be described or shown in detail in order to avoid obscuring the understanding of the invention by the skilled artisan. Although the drawings represent exemplary embodiments of the present invention, the drawings are not necessarily to scale and certain features may be exaggerated or omitted in order to better illustrate and explain the present invention.
The application provides a human body posture estimation method based on the top-down idea. Fig. 1 is a schematic flow chart of top-down human body posture estimation according to the embodiment of the present invention. Referring to fig. 1, top-down human body posture estimation first uses a human body detector to detect all human body instances in the input image, then crops each detected human body instance into a single-person image, and sends it into a single-person posture estimation network to obtain the posture estimation results of all human body instances. In particular, the human body detector may use an existing advanced object detection network.
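The top-down pipeline above can be sketched in a few lines. Here `detect` and `estimate` stand in for the human body detector (e.g. Faster R-CNN) and the single-person posture network, the nearest-neighbour resize is a simplification of the crop-and-zoom step, and the 256 × 192 input size is a commonly used assumption rather than a value specified by the patent:

```python
import numpy as np

def crop_and_resize(image, box, out_hw):
    """Crop an instance box (x0, y0, x1, y1) and resize it to out_hw
    with nearest-neighbour sampling (a real pipeline would use bilinear)."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    oh, ow = out_hw
    rows = np.arange(oh) * h // oh
    cols = np.arange(ow) * w // ow
    return crop[np.ix_(rows, cols)]

def top_down_pose(image, detect, estimate, in_hw=(256, 192)):
    """Detect every human instance, crop it, and run single-person estimation."""
    return [estimate(crop_and_resize(image, box, in_hw)) for box in detect(image)]
```

Any detector that returns instance boxes and any single-person network can be plugged into the two callables, which is exactly the modularity the top-down idea relies on.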
Fig. 2 is a diagram illustrating a human body posture estimation method based on a multi-resolution feature fusion network according to an embodiment of the present invention, referring to fig. 2, the method includes the following steps;
(1) acquiring an image to be detected and inputting the image to be detected into a human body detection network model to acquire a suggestion frame of each human body example in the image to be detected;
In one specific example, the human detection network model employs the Faster R-CNN network model.
(2) Inputting each suggestion box into a trained multi-resolution feature fusion network model respectively to obtain a corresponding single posture estimation result;
the multi-resolution feature fusion network model is provided with a plurality of feature extraction network branches forming a parallel network structure, the feature extraction network branches being respectively used for acquiring image features with different resolutions in a suggestion frame, and one or more multi-receptive-field fusion modules being embedded in each feature extraction network branch; posture estimation is carried out based on the feature obtained by fusing the image features with different resolutions, so as to obtain a single-person posture estimation result;
each multi-receptive-field fusion module comprises a plurality of convolution kernels with different sizes, the convolution kernels of the first layer carry out convolution operation on image features, and the first layer of receptive field features are extracted; the next layer of convolution kernel respectively acquires the image characteristics and the receptive field characteristics output by the previous layer of convolution kernel and carries out convolution operation; and fusing the receptive field characteristics output by each layer of convolution kernel.
Fig. 3 is a structural diagram of the multi-receptive-field fusion module in this embodiment. Referring to fig. 3, the multi-receptive-field fusion module includes four layers of convolution kernels, with sizes of 1 × 1, 3 × 3, 5 × 5 and 7 × 7 respectively, each extracting the receptive field features of the corresponding size. The module is described mathematically by the following group of expressions:
X1 = X2 = X3 = X4 = f(X) (1)

Yi = Fi(Xi), i = 1; Yi = Fi(Xi + Yi-1), i = 2, 3, 4 (2)

Y = concat[Y1; Y2; Y3; Y4] (3)

Z = F(X + Y) (4)

where f(X) and F denote the convolution layer functions of the preceding and succeeding 1 × 1 convolution kernels; X1, X2, X3, X4 are the image feature branches output after the preceding 1 × 1 convolution operation; Fi are the convolution layer composite functions corresponding to the convolution kernels of the four sizes 1 × 1, 3 × 3, 5 × 5 and 7 × 7; Yi is the receptive field feature output by the i-th layer; concat denotes channel-wise merging of the features; and Z is the fused output.
Further, the fusion process of the multiple receptive fields is as follows. The resolution feature images in the feature extraction network branches of the parallel network structure are input to the multi-receptive-field fusion module. To reduce the amount of computation, a 1 × 1 convolution kernel operation is first performed on the original input feature X to compress the number of channels to one quarter of the original, and the resulting feature is divided into four branches X1, X2, X3, X4 that are fed to the respective layers of convolution kernels. The first layer of convolution kernels performs a 1 × 1 convolution operation on the image feature X1 and outputs the first-layer receptive field feature Y1; the second layer takes the image feature X2 and the receptive field feature Y1 output by the previous layer, performs a 3 × 3 convolution operation, and outputs the second-layer receptive field feature Y2; the third layer takes the image feature X3 and the receptive field feature Y2 output by the previous layer, performs a 5 × 5 convolution operation, and outputs the third-layer receptive field feature Y3; the fourth layer takes the image feature X4 and the receptive field feature Y3 output by the previous layer, performs a 7 × 7 convolution operation, and outputs the fourth-layer receptive field feature Y4.
The receptive field features Y1, Y2, Y3, Y4 output by each layer are merged channel-wise through a concat operation to obtain the receptive field feature Y, which is fused with the original input feature X to enhance feature discriminability. A 1 × 1 convolution kernel operation is then performed to increase the learnable parameters of the network and restore the number of channels to that of the original input feature X. Finally, the number of output feature channels is reduced to the number of human body key points to be detected. In this embodiment, an element-wise add operation is used to fuse the original input feature X and the receptive field feature Y; other fusion methods may also be used, and this embodiment is not specifically limited in this respect.
Fig. 4 is a structural diagram of a multiresolution feature fusion network in this embodiment, please refer to fig. 4, the multiresolution feature fusion network has a plurality of feature extraction network branches forming a parallel network structure, the first layer of feature extraction network branches obtains a suggestion box of each human body instance, and extracts and maintains image features with a preset resolution therefrom; and the next layer of feature extraction network branch performs down-sampling on the image features maintained by the previous layer of feature extraction network branch, and the generated image features are restored to the image features with the first resolution through up-sampling after being fused with multiple sensing fields and are fused with the image features maintained by the first layer of feature extraction network branch.
In a specific embodiment, the first layer of feature extraction network branch extracts and maintains image features at 1/4 of the original image resolution; each subsequent feature extraction network branch downsamples the image features extracted by the previous branch, yielding image features at 1/8, 1/16 and 1/32 of the original image resolution in turn. Because spatial information is lost as the feature resolution decreases, each lower branch downsamples the 1/4-resolution image features maintained by the branch above it, and the generated image features, after multi-receptive-field fusion, are restored to the first resolution through upsampling. The multi-receptive-field fusion modules are embedded in the feature extraction network branches; after the resolution features pass through these modules, context information is fully utilized and the discriminability of each resolution feature is enhanced. Finally, the multi-resolution feature image, after passing through a multi-receptive-field fusion module, predicts the key-point position information in the form of heat maps, yielding the human body posture predicted by the network. In this embodiment, the upper branch includes a convolution module, and the lower branches include a convolution module and a down-sampling module. The convolution module consists of a convolution layer (Conv), a batch normalization (BN) layer and an activation function; the down-sampling module consists of a strided convolution layer, a BN layer and an activation function.
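Under the resolution schedule just described, and assuming a network input whose sides are divisible by 32, the feature sizes held by the four branches can be computed directly (the 256 × 192 example input is an assumed, commonly used crop size, not one fixed by the patent):

```python
def branch_resolutions(h, w, n_branches=4):
    """Branch i keeps features at 1/4, 1/8, 1/16, 1/32 of the input resolution."""
    return [(h // (4 * 2 ** i), w // (4 * 2 ** i)) for i in range(n_branches)]

# e.g. for a 256 x 192 network input:
pyramid = branch_resolutions(256, 192)
```

Each step halves both sides, so every lower branch trades spatial detail for a larger effective receptive field before its features are upsampled back and fused.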
In this embodiment, the upsampling uses a linear interpolation method; methods such as transposed convolution and dilated convolution may also be used, without specific limitation.
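A minimal sketch of linear-interpolation upsampling for a single-channel feature map follows. The index mapping is written in an align-corners style; the exact interpolation variant used by the embodiment is not specified, so this is one plausible reading:

```python
import numpy as np

def bilinear_upsample(x, scale=2):
    """Upsample a 2D array by `scale` using bilinear interpolation."""
    H, W = x.shape
    rows = np.linspace(0, H - 1, H * scale)
    cols = np.linspace(0, W - 1, W * scale)
    r0 = np.floor(rows).astype(int); r1 = np.minimum(r0 + 1, H - 1)
    c0 = np.floor(cols).astype(int); c1 = np.minimum(c0 + 1, W - 1)
    fr = (rows - r0)[:, None]   # fractional row offsets
    fc = (cols - c0)[None, :]   # fractional column offsets
    top = x[np.ix_(r0, c0)] * (1 - fc) + x[np.ix_(r0, c1)] * fc
    bot = x[np.ix_(r1, c0)] * (1 - fc) + x[np.ix_(r1, c1)] * fc
    return top * (1 - fr) + bot * fr
```

Because the interpolation is a convex combination of the four neighbouring values, a constant feature map stays constant after upsampling, which is a quick correctness check.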
Fig. 5 is a flowchart of a training process of the multi-resolution feature fusion network model in this embodiment, please refer to fig. 5, where the training process includes the following steps:
(1) acquiring a training image sample;
further, the training image sample is a sample image carrying the labeling information;
the COCO dataset contains over 200000 pictures, which collectively contain 250000 human instances. For the complete human body example, the label contains information for 17 key points. The model was trained on the COCO training set. The training set included 57000 pictures that collectively contained 150000 personal instances. After model training, the model was evaluated on a COCO validation set containing 5000 pictures and a test set containing 20000 pictures.
The MPII human pose estimation dataset collects pictures from various activities in the real world, approximately 25000 pictures, which contain 40000 human instances in total. For the complete human body example, the full body posture label is contained, and the label contains 16 key point information of the full body. The model was trained on the MPII training set, which contained about 28000 human instances. After model training, the models were evaluated on an MPII test set containing 12000 human instances.
In this embodiment, the training image samples and the test image samples are selected from the COCO and MPII data sets. Of course, other data sets may be used, and the embodiment is not limited in particular.
(2) Inputting the training image samples into a human body detection network model to obtain a suggestion box of each human body example in the image;
in one particular example the human detection network model employs the Faster R-CNN network model.
(3) Cutting and zooming the suggestion box of each human body example, and inputting the cut and zoomed suggestion box into a multi-resolution feature fusion network model for single posture estimation;
(4) and calculating a target loss function according to the prediction result of the single attitude estimation and the label truth value of the training image sample, and performing iterative training on the multi-resolution feature fusion network model based on the target loss function to obtain the trained multi-resolution feature fusion network model.
In this embodiment, the set target loss function is the mean squared error (MSE) function; other functions may also be used, and this embodiment is not limited in this respect. The training and testing of the multi-resolution feature fusion network model are completed under the deep learning PyTorch framework.
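The heat-map supervision described above can be sketched as follows: a Gaussian target heatmap per labelled key point (a common convention in heat-map-based pose estimation, assumed here rather than detailed by the patent), the MSE objective, and argmax decoding of predicted heatmaps back to coordinates:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Target heatmap: a 2D Gaussian centred on the labelled key point (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def mse_loss(pred, target):
    """Mean squared error between predicted and target heatmaps."""
    return np.mean((pred - target) ** 2)

def decode_heatmaps(hm):
    """Argmax decoding: (K, H, W) heatmaps -> (K, 2) array of (x, y) positions."""
    K, H, W = hm.shape
    flat = hm.reshape(K, -1).argmax(axis=1)
    return np.stack([flat % W, flat // W], axis=1)
```

During training the loss is averaged over all K key-point heatmaps; a perfect prediction gives zero loss, and decoding it recovers the annotated coordinates exactly.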
According to the human body posture estimation method based on the multi-resolution feature fusion network, through the mode that a plurality of convolution kernels with different sizes are arranged in parallel, the first layer of convolution kernels carries out convolution operation on image features, and the first layer of receptive field features are extracted; the next layer of convolution kernel respectively acquires image characteristics and the receptive field characteristics output by the previous layer of convolution kernel to carry out convolution operation, and the receptive field characteristics in different ranges are fused to enhance the information content of the learning characteristics; the multi-resolution feature fusion enables the network to have rich space and semantic information, the process of multi-receptive field fusion fully utilizes context information, the discriminativity of each resolution feature is enhanced, the detection of 'difficult key points' is realized under the difficult scenes of complex background, complex human body action or shielding and the like, and the detection precision is improved.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A human body posture estimation method based on a multi-resolution feature fusion network, characterized by comprising the following steps:
acquiring an image to be detected and inputting it into a human body detection network model to obtain a proposal box for each human body instance in the image to be detected;
inputting each proposal box into a trained multi-resolution feature fusion network model to obtain a corresponding single-person pose estimation result;
wherein the multi-resolution feature fusion network model comprises a plurality of feature extraction network branches forming a parallel network structure, the feature extraction network branches respectively extract image features of different resolutions from the proposal box, and one or more multi-receptive-field fusion modules are embedded in each feature extraction network branch; pose estimation is performed on the feature obtained by fusing the image features of different resolutions to obtain the single-person pose estimation result;
each multi-receptive-field fusion module comprises a plurality of convolution kernels of different sizes arranged in parallel; the first layer of convolution kernels performs a convolution operation on the image features to extract first-layer receptive-field features; each subsequent layer of convolution kernels acquires both the image features and the receptive-field features output by the previous layer of convolution kernels and performs a convolution operation; and the receptive-field features output by each layer of convolution kernels are fused.
2. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1, wherein the multi-receptive-field fusion module comprises four layers of convolution kernels of sizes 1×1, 3×3, 5×5 and 7×7, respectively, for extracting receptive-field features of the corresponding scales.
3. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1 or 2, wherein the multi-receptive-field fusion module further comprises a preceding 1×1 convolution kernel for acquiring the input image features, adjusting their channel number, and outputting the channel-adjusted image features to each layer of convolution kernels.
4. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1 or 2, wherein the multi-receptive-field fusion module further comprises a succeeding 1×1 convolution kernel for acquiring the receptive-field features output by each layer of convolution kernels and restoring them to the original channel number.
5. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1, wherein, among the plurality of feature extraction network branches of the multi-resolution feature fusion network model,
the first-layer feature extraction network branch acquires the proposal box of each human body instance and extracts and maintains image features of a preset resolution from the proposal box; and
each subsequent feature extraction network branch down-samples the image features maintained by the previous branch; after multi-receptive-field fusion, the generated image features are restored to the preset resolution by up-sampling and fused with the image features maintained by the first-layer feature extraction network branch.
6. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 5, wherein the first-layer feature extraction network branch comprises a convolution module, and each other feature extraction network branch comprises a convolution module and a down-sampling module;
the convolution module consists of a convolution layer, a BN layer, and an activation function; the down-sampling module consists of a strided convolution layer, a BN layer, and an activation function.
7. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1, wherein the training process of the multi-resolution feature fusion network model is as follows:
acquiring training image samples;
inputting the training image samples into a human body detection network model to obtain a proposal box for each human body instance in the image;
cropping and scaling the proposal box of each human body instance, and inputting the cropped and scaled proposal box into the multi-resolution feature fusion network model for single-person pose estimation; and
calculating a target loss function from the single-person pose estimation predictions and the ground-truth labels of the training image samples, and iteratively training the multi-resolution feature fusion network model based on the target loss function to obtain the trained multi-resolution feature fusion network model.
8. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 7, wherein the target loss function is a mean squared error (MSE) function.
9. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1, wherein the human body detection network model is a Faster R-CNN network model.
10. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1, wherein training and testing of the multi-resolution feature fusion network model are performed under the deep-learning PyTorch framework.
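As an illustration of the convolution and down-sampling modules named in claims 5 and 6, a hedged PyTorch sketch follows; the 3×3 kernel size and ReLU activation are assumptions, while the conv + BN + activation composition and the strided convolution for down-sampling follow the claims:

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # convolution module: convolution layer + BN layer + activation function
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def down_block(cin, cout):
    # down-sampling module: strided convolution layer halves the resolution,
    # followed by a BN layer and an activation function
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

x = torch.rand(1, 32, 64, 48)
kept = conv_block(32, 32)(x)      # resolution maintained (claim 5, first branch)
down = down_block(32, 64)(x)      # lower-resolution branch input
assert tuple(kept.shape) == (1, 32, 64, 48)
assert tuple(down.shape) == (1, 64, 32, 24)
```

Restoring the down-sampled features to the preset resolution for fusion, as in claim 5, could then use a standard up-sampling operation such as `torch.nn.functional.interpolate`.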
CN202210262826.XA 2022-03-17 2022-03-17 Human body posture estimation method based on multi-resolution feature fusion network Pending CN114677707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210262826.XA CN114677707A (en) 2022-03-17 2022-03-17 Human body posture estimation method based on multi-resolution feature fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210262826.XA CN114677707A (en) 2022-03-17 2022-03-17 Human body posture estimation method based on multi-resolution feature fusion network

Publications (1)

Publication Number Publication Date
CN114677707A true CN114677707A (en) 2022-06-28

Family

ID=82073424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210262826.XA Pending CN114677707A (en) 2022-03-17 2022-03-17 Human body posture estimation method based on multi-resolution feature fusion network

Country Status (1)

Country Link
CN (1) CN114677707A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524546A (en) * 2023-07-04 2023-08-01 南京邮电大学 Low-resolution human body posture estimation method based on heterogeneous image cooperative enhancement
CN116524546B (en) * 2023-07-04 2023-09-01 南京邮电大学 Low-resolution human body posture estimation method based on heterogeneous image cooperative enhancement
CN117853432A (en) * 2023-12-26 2024-04-09 北京长木谷医疗科技股份有限公司 Hybrid model-based osteoarthropathy identification method and device
CN117764988A (en) * 2024-02-22 2024-03-26 山东省计算中心(国家超级计算济南中心) Road crack detection method and system based on heteronuclear convolution multi-receptive field network
CN117764988B (en) * 2024-02-22 2024-04-30 山东省计算中心(国家超级计算济南中心) Road crack detection method and system based on heteronuclear convolution multi-receptive field network

Similar Documents

Publication Publication Date Title
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN111047551B (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
CN110287960A (en) The detection recognition method of curve text in natural scene image
CN114677707A (en) Human body posture estimation method based on multi-resolution feature fusion network
CN111160350B (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN109919223B (en) Target detection method and device based on deep neural network
CN113313082B (en) Target detection method and system based on multitask loss function
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN114117614A (en) Method and system for automatically generating building facade texture
CN110781980A (en) Training method of target detection model, target detection method and device
CN113239753A (en) Improved traffic sign detection and identification method based on YOLOv4
CN112700476A (en) Infrared ship video tracking method based on convolutional neural network
CN113516126A (en) Adaptive threshold scene text detection method based on attention feature fusion
CN112784756A (en) Human body identification tracking method
CN114187454A (en) Novel significance target detection method based on lightweight network
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
Kavitha et al. Convolutional Neural Networks Based Video Reconstruction and Computation in Digital Twins.
CN111582057B (en) Face verification method based on local receptive field
CN116805360B (en) Obvious target detection method based on double-flow gating progressive optimization network
CN114049541A (en) Visual scene recognition method based on structural information characteristic decoupling and knowledge migration
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning
CN111767919A (en) Target detection method for multi-layer bidirectional feature extraction and fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination