CN112861699A

CN112861699A - Method for estimating height of human body in any posture based on single depth image and multi-stage neural network

Info

Publication number: CN112861699A
Application number: CN202110150551.6A
Authority: CN
Inventors: 尹富坤; 周世哲
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2021-02-03
Filing date: 2021-02-03
Publication date: 2021-05-28

Abstract

The invention discloses a method for estimating the height of a human body in any posture based on a single depth image and a multi-stage neural network. Accurate height estimation for a person in any posture and at any position is achieved through a 4-stage neural network framework and a developing neural network architecture. The method mainly comprises the following steps. Human body segmentation: a neural network segments the human torso image from the single depth image, and high-frequency detail information is extracted and supplied as an additional input so that the segmentation edges are finer. Construction of an intermediate representation: the torso image is further divided into four parts with little bending, namely the head, the upper body, the thigh and the calf, and the lengths of the four parts are predicted separately, which makes full use of the strong local perception of convolutional neural networks. Model design: a network architecture and training method called the developing neural network is designed and combined with a mixed pooling strategy to estimate the height of the human body, which further improves network performance and reduces training time. The average height prediction accuracy of the invention is above 99.1%.

Description

Method for estimating height of human body in any posture based on single depth image and multi-stage neural network

Technical Field

The invention relates to the field of computer vision and machine learning, and provides a method for estimating the height of a human body in any posture based on a single depth image and a multi-stage neural network. The specific techniques involved are depth image segmentation, image feature extraction, image semantic information extraction, neural network architecture design and neural network training. With these techniques, a 4-stage neural network framework is constructed and a training method called the developing neural network is proposed, so that the height of a human body can be accurately estimated from a single depth image.

Background

In fields such as three-dimensional human body reconstruction, virtual reality, medical care and clothing design, height is an essential piece of data. In most cases, the conventional method requires the person being measured to stand upright while the height is read off with a tape measure or a height gauge. This not only consumes considerable time and labor, but also requires the active cooperation of the person being measured and restricts the usage scenarios. In particular, in practical applications, conventional height measurement cannot be carried out if the measurer has no tape measure or height gauge, or if the person being measured cannot stand upright because of injury or illness.

In recent years, some methods have attempted to measure human height without contact by extracting information from images or videos. These methods solve the problems of the conventional approach to some extent, but they still have limitations. Most can only handle simple postures such as standing and walking, or require the person being measured to stand at a specified position, which greatly limits their usage scenarios. Some require manual calibration of the head and feet, so they cannot be fully automated and need a large amount of manual labeling. Others require multiple photographs or multiple devices, which adds time and cost.

To address these problems, the method outputs a reliable result within milliseconds from a single depth image, estimates the height of the human body from the image fully automatically, and saves a large amount of manpower and time. The person being measured may adopt any posture, such as standing, walking, bending or sitting, is not required to be at a fixed position in the image, and may be anywhere within the acquisition range of the depth camera; the method therefore has good adaptability and robustness.

Disclosure of Invention

(I) Technical problem to be solved

The invention aims to provide a method for estimating the height of a human body in any posture based on a single depth image and a multi-stage neural network, which accurately estimates the height of a human body at any position and in any posture by fully automatic, non-contact measurement.

(II) Technical scheme

1. The invention provides a method for estimating the height of a human body in any posture based on a single depth image and a multi-stage neural network, which comprises the following steps:

S1, data acquisition: a human height data set of 2136 depth images is collected with a depth camera. The person being measured may be at any position within the acquisition range of the depth camera and may adopt any posture, including non-upright postures such as sitting, bending and walking. The height of each volunteer and the lengths of four approximately rigid parts, namely the head, the upper body, the thigh and the calf, are measured and recorded. For each depth image, the corresponding ground-truth torso image and ground-truth body part image are annotated.

S2, edge image extraction: the high-frequency edge information in the original depth image is extracted with an edge detection operator.

S3, human torso segmentation: a convolutional neural network f¹(x) is designed to extract the human torso image from the original depth image and the edge image.

S4, body part recognition: a convolutional neural network f²(x) is designed to further segment the torso image into four approximately rigid body parts (head, upper body, thigh and calf), yielding the body part image.

S5, body part length prediction: a convolutional neural network f³(x) is designed to predict the lengths of the four approximately rigid body parts (head, upper body, thigh and calf) from the body part image and the original depth image.

S6, human height prediction: a convolutional neural network f⁴(x) is designed to predict the human height from the original depth image, the body part image and the body part lengths. Different pooling strategies are adopted according to the characteristics of the different input data, and a skip connection structure feeds the raw input data into each convolutional layer.

S7, developing neural network design: based on the characteristics of the height prediction task and the architecture of the convolutional neural network f⁴(x), a training method in which the network structure changes with the number of iterations is designed, and the fitted state is repeatedly destroyed until the neural network finds a global optimal solution.

2. The method for estimating the height of a human body in any posture based on a single depth image and a multi-stage neural network of claim 1, wherein only a single depth image is used to predict the height of the human body in a non-contact manner. The input data used in all steps of the invention are the original depth image or intermediate representations derived from it. Only a low-cost commercial-grade depth camera is used to collect the depth data that serve as the original depth image, which reduces equipment cost and makes the method easy to use, practical and easy to popularize. The depth image is acquired with infrared technology, so performance is unaffected even without an external light source, and the method can be applied in fields such as night-time security monitoring. The height of the human body is predicted from the image in a non-contact, fully automatic manner: only one image needs to be taken and the measurer does not need to touch the person being measured, so the method is suitable for measuring height during periods of routine epidemic prevention and control, and in situations such as physical examinations, boarding transport and ticket sales. Because the depth image contains neither facial features nor clothing texture, using only a single depth image prevents the neural network from learning the identity features of the person being measured, which would interfere with height estimation.

3. The method for estimating the height of a human body in any posture based on a single depth image and a multi-stage neural network of claim 1, wherein features relevant to height prediction are extracted with a multi-stage neural network, and the height estimation problem is converted into several small local problems. The torso image is obtained through the convolutional neural network f¹(x), the body part image is obtained through the convolutional neural network f²(x), and the body part lengths are obtained through the convolutional neural network f³(x); height estimation is thus decomposed into separate predictions for four approximately rigid parts (head, upper body, thigh and calf), and the four results are integrated into the height of the person being measured. The benefits are as follows: decomposing height prediction into predictions for four rigid parts yields an easier problem; the lengths of the four body parts and the topological relationship between them hint at the posture of the human body and thus provide useful cues for height estimation; and dividing the height prediction problem into four small local problems makes full use of the strong local perception of convolutional neural networks.

4. The arbitrary pose human height estimation method based on a single depth image and a multi-stage neural network of claim 1, further comprising: the convolutional neural network f¹(x), which is used to segment the human torso image from the single depth image. In height measurement, height is defined as the distance between the top of the head and the bottom of the foot when the body is upright, so locating the head vertex and the sole point is very important. Existing human body segmentation methods are often inaccurate at the body edges, which affects the selection of the head vertex and the sole point. The present method uses the Canny operator to extract edge information from the depth image and thereby sharpens the edges of the human body segmentation image.

E＝Canny(X)

X represents the original depth image acquired by the camera and E represents the corresponding edge image extracted.
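As a rough illustration of what the edge image E contains, the sketch below computes a thresholded Sobel gradient magnitude on a tiny depth map. It is a simplified stand-in and not the Canny operator itself (it omits Gaussian smoothing, non-maximum suppression and hysteresis thresholding). The function name `sobel_edges`, the threshold value and the depth units are assumptions made purely for illustration.

```python
def sobel_edges(depth, threshold):
    """Return a binary edge map from a 2-D depth map (list of lists).

    Simplified stand-in for Canny: Sobel gradient magnitude plus one threshold.
    Border pixels stay 0 because the 3x3 kernel does not fit there.
    """
    h, w = len(depth), len(depth[0])
    edges = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            # Horizontal Sobel response (right column minus left column).
            gx = (depth[r - 1][c + 1] + 2 * depth[r][c + 1] + depth[r + 1][c + 1]
                  - depth[r - 1][c - 1] - 2 * depth[r][c - 1] - depth[r + 1][c - 1])
            # Vertical Sobel response (bottom row minus top row).
            gy = (depth[r + 1][c - 1] + 2 * depth[r + 1][c] + depth[r + 1][c + 1]
                  - depth[r - 1][c - 1] - 2 * depth[r - 1][c] - depth[r - 1][c + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[r][c] = 1
    return edges


# Toy depth map (units assumed to be millimetres): far background on the left,
# a closer surface on the right, so the only depth discontinuity is vertical.
X = [[3000, 3000, 3000, 1500, 1500, 1500] for _ in range(5)]
E = sobel_edges(X, threshold=1000)
```

In this toy case the interior rows of `E` mark the two columns that straddle the depth jump, which is the kind of boundary information the torso segmentation network receives alongside X.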

f¹(x) comprises five down-sampling modules and five up-sampling modules. Among the down-sampling modules, module one and module two each comprise 2 convolutional layers with activation functions, and module three, module four and module five each comprise 3 convolutional layers with activation functions. The up-sampling modules mirror the down-sampling modules: the first three each comprise 3 convolutional layers with activation functions, and the last two each comprise 2 convolutional layers with activation functions. The method takes the original depth image X and the edge image E as input and passes them through the convolutional neural network f¹(x) to obtain the predicted torso segmentation image T′.

T′＝f¹(X,E)

The loss value Loss1 of the convolutional neural network f¹(x) is the mean of the squared pixel-by-pixel differences between the predicted torso image and the real torso image, and the optimizer is Adam.

Loss1＝(1/N)·Σ_i (T′_i − T_i)²

N is the total number of pixels in the image, i is the loop variable indexing a pixel in the image, and T is the real torso image. Feeding the high-frequency information of the human body edges into the neural network markedly improves the edge accuracy of the torso segmentation image.
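The pixel-wise loss described above can be sketched in a few lines of plain Python. This is only an illustration of the mean of squared pixel differences; the function name `mean_squared_pixel_loss` and the tiny 2×4 images are hypothetical, and a real implementation would operate on tensors inside a deep learning framework.

```python
def mean_squared_pixel_loss(pred, target):
    """Mean over all N pixels of the squared difference (pred_i - target_i)^2."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("images must have the same number of pixels")
    n = len(flat_pred)
    return sum((p - t) ** 2 for p, t in zip(flat_pred, flat_target)) / n


# Hypothetical binary torso mask (T) and a soft prediction (T') that is
# wrong by 0.5 at two of the eight pixels.
T_true = [[0, 1, 1, 0],
          [0, 1, 1, 0]]
T_pred = [[0, 1, 0.5, 0],
          [0, 1, 1, 0.5]]
loss1 = mean_squared_pixel_loss(T_pred, T_true)  # (0.25 + 0.25) / 8
```

The same function applies unchanged to the body part segmentation loss, since it is defined in the same way on a different pair of images.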

5. The arbitrary pose human height estimation method based on a single depth image and a multi-stage neural network of claim 1, further comprising: the convolutional neural network f²(x), which is used to further divide the human torso image into four approximately rigid parts, namely the head, the upper body, the thigh and the calf, so as to obtain the body part image. The convolutional neural networks f²(x) and f¹(x) have the same network structure: each comprises 5 down-sampling modules and 5 up-sampling modules, and each module comprises several convolutional layers and activation functions. The predicted torso segmentation image T′ is input to f²(x) to obtain the predicted body part segmentation image L′. Because the torso segmentation image T′ obtained by f¹(x) contains errors, the original depth image X is input to f²(x) at the same time to avoid the accumulation of errors.

L′＝f²(X,T′)

The loss value Loss2 of the convolutional neural network f²(x) is the mean of the squared pixel-by-pixel differences between the predicted body part image and the real body part image.

Loss2＝(1/N)·Σ_i (L′_i − L_i)²

N is the total number of pixels in the image, i is the loop variable indexing a pixel in the image, and L is the real body part image. The optimizer is Adam, which minimizes the error between the predicted and real images until the network converges robustly.

6. The arbitrary pose human height estimation method based on a single depth image and a multi-stage neural network of claim 1, further comprising: the convolutional neural network f³(x), which is used to predict the lengths of four approximately rigid parts, namely the head, the upper body, the thigh and the calf. f³(x) comprises 13 convolutional layers, 6 fully connected layers and the corresponding activation functions. The 13 convolutional layers form 5 down-sampling modules: module one and module two each comprise 2 convolutional layers with activation functions, and module three, module four and module five each comprise 3 convolutional layers with activation functions. A 1×4 vector representing the lengths of the 4 body parts is output through the 6 fully connected layers, whose numbers of nodes are 4096, 4096, 1000, 256, 64 and 4, respectively. The body part image is input to f³(x); at the same time, to reduce accumulated errors, the original depth image X is also input to the network, and the estimated lengths of the 4 body parts (head, upper body, thigh and calf) are obtained.

[H^head, H^upperbody, H^thigh, H^calf]_(1×4)＝f³(X,L)

H^head is the head length, H^upperbody is the upper body length, H^thigh is the thigh length, and H^calf is the calf length.

The Loss value Loss3 uses the sum of the squares of the difference of the predicted four-part length and the true value.

Loss3＝|H^head-TH^head|²+|H^upperbody-TH^upperbody|²+|H^thigh-TH^thigh|²+|H^calf-TH^calf|²

TH^head, TH^upperbody, TH^thigh and TH^calf represent the true lengths of the head, upper body, thigh and calf, respectively. The optimizer is Adam, and the lengths of the 4 approximately rigid parts are predicted by minimizing the loss value.
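A minimal numeric sketch of Loss3 follows. The dictionary keys mirror the four part names used in the formula; the function name `loss3` and the lengths (taken to be in centimetres) are hypothetical values chosen only to make the arithmetic easy to follow, not measurements from the data set.

```python
PARTS = ("head", "upperbody", "thigh", "calf")


def loss3(pred, true):
    """Sum of squared differences over the four body-part lengths."""
    return sum((pred[p] - true[p]) ** 2 for p in PARTS)


# Hypothetical lengths in centimetres (illustrative values only).
true_lengths = {"head": 24, "upperbody": 60, "thigh": 45, "calf": 43}
pred_lengths = {"head": 25, "upperbody": 58, "thigh": 45, "calf": 44}
value = loss3(pred_lengths, true_lengths)  # 1 + 4 + 0 + 1
```

Because the four terms are summed without weights, an error of a given size counts equally whichever body part it occurs in.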

7. The arbitrary pose human height estimation method based on a single depth image and a multi-stage neural network of claim 1, further comprising: the convolutional neural network f⁴(x), which estimates the height of the human body from the body part image, the body part lengths and the original depth image. f⁴(x) comprises 13 convolutional layers, 7 fully connected layers and the corresponding activation functions. The 13 convolutional layers form 5 convolution modules: module one and module two each comprise 2 convolutional layers with activation functions, and module three, module four and module five each comprise 3 convolutional layers with activation functions. The output of the last convolutional layer is flattened into a one-dimensional vector, and the estimated height is output through the 7 fully connected layers, whose numbers of nodes are 4096, 1000, 256, 64, 16 and 1. To reduce accumulated errors and improve prediction accuracy, the original depth image X and the predicted body part segmentation image L′ are input to f⁴(x) together. The invention also adopts a skip connection structure with different pooling strategies according to the characteristics of the different input data: average pooling for the depth image and max pooling for the body part segmentation image, so that the input data at different scales are fed directly into each convolutional layer.

[H^human]＝f⁴(X,L′,H_4-part)

H^human is the estimated height of the human body, and H_4-part denotes the predicted lengths of the 4 approximately rigid body parts from step S5.

The loss value Loss4 of the convolutional neural network f⁴(x) is the square of the difference between the estimated height and the true height.

Loss4＝|H^human-TH^human|²

TH^human is the true height of the human body. The optimizer is Adam, and the height is estimated by minimizing the loss value.

8. The arbitrary pose human height estimation method based on a single depth image and a multi-stage neural network of claim 7, further comprising: a skip connection structure with a mixed pooling strategy. The convolutional neural network f⁴(x) has 13 convolutional layers. Adding skip connections to f⁴(x) alleviates the vanishing gradient problem that arises when the network is deep, facilitates the back-propagation of gradients and accelerates training. In steps S3, S4 and S5, three neural networks are used to obtain the length estimates of the head, upper body, thigh and calf, which are fed to f⁴(x) as input data. Because the estimate produced by each neural network deviates from the true value, and this error is passed on to the neural network of the next stage, the error grows. The skip connection structure is therefore used to input the original depth image X and the body part image L directly into f⁴(x) to minimize the accumulated error.

The original depth image X and the body part image L have different characteristics. The original depth image X contains many noise points, which appear in the depth image as extreme values; if max pooling were simply used, these noise points would be retained and depth information lost, which would interfere with network learning and cause larger errors. The body part image L is smoother; if average pooling were used, the gradient at the body part boundaries would be reduced, introducing errors. Therefore, average pooling is applied to the original depth image X, which smooths the depth noise and supplies the network with an accurate depth image at each scale, and max pooling is applied to the body part image L, which keeps the segmentation boundaries accurate and preserves gradient information as the image size is reduced.

L_next＝Maxpool(L_now)

X_next＝Avgpool(X_now)

L_now is the body part image at the current scale, and L_next is the body part image at the next scale. X_now is the depth image at the current scale, and X_next is the depth image at the next scale.

The skip connection structure with the mixed pooling strategy feeds the original information into each convolution module, so that the original image is kept as undistorted as possible at every scale and the accuracy of network prediction is improved.
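The effect of the two pooling rules can be seen on a toy example. The sketch below applies non-overlapping 2×2 pooling to nested lists; the helper names, the 4×4 images, the integer label encoding of the body part image and the 2×2 window size are all assumptions made for illustration and are not specified in the text.

```python
def pool2x2(image, reduce_fn):
    """Non-overlapping 2x2 pooling on a 2-D list with even height and width."""
    out = []
    for r in range(0, len(image), 2):
        row = []
        for c in range(0, len(image[0]), 2):
            window = [image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1]]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def avgpool(image):
    return pool2x2(image, lambda w: sum(w) / len(w))


def maxpool(image):
    return pool2x2(image, max)


# Depth image with one noisy spike (9000) on an otherwise flat surface (2000).
X_now = [[2000, 2000, 2000, 2000],
         [2000, 9000, 2000, 2000],
         [2000, 2000, 2000, 2000],
         [2000, 2000, 2000, 2000]]
# Body part image with an assumed encoding: 0 background, 1 head, 2 upper body.
L_now = [[0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 2, 2, 0],
         [2, 2, 2, 2]]

X_next = avgpool(X_now)        # X_next = Avgpool(X_now): the spike is damped
L_next = maxpool(L_now)        # L_next = Maxpool(L_now): part labels survive
spike_kept = maxpool(X_now)    # what max pooling would do to the depth noise
```

Average pooling spreads the spike over its window, whereas max pooling would carry it unchanged to the next scale; for the label image, max pooling keeps thin body part regions from being averaged into the background.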

9. The method for arbitrary pose body height estimation based on a single depth image and a multi-stage neural network of claim 7, further comprising: the architecture and training method of the developing neural network. The invention proposes the developing neural network and applies it to the convolutional neural network f⁴(x) to improve network accuracy, reduce training time and prevent over-fitting.

The main idea of the developing neural network is as follows: during training, when the network tends to converge, the architecture of its convolutional part is adjusted so that it escapes the local minimum and searches for a global optimal solution.

The specific method is as follows. During the training of the convolutional neural network f⁴(x), when the number of iterations is less than 4×10⁴, the network is pre-trained and all 13 convolutional layers in the 5 modules of f⁴(x) are active. When the number of iterations equals 4×10⁴, the pre-trained model is saved. When the number of iterations is greater than 4×10⁴ and less than 6×10⁴, each convolution module of f⁴(x) keeps only its first convolutional layer, so that 5 convolutional layers in total are trained. When the number of iterations is greater than 6×10⁴ and less than 8×10⁴, module one and module two each recover one convolutional layer from the pre-trained model, so that 7 convolutional layers take part in training. When the number of iterations is greater than 8×10⁴ and less than 1×10⁵, module three, module four and module five each recover one convolutional layer from the pre-trained model, so that 10 convolutional layers take part in training. When the number of iterations is greater than 1×10⁵, module three, module four and module five each recover one more convolutional layer from the pre-trained model, which restores the original 13 convolutional layers.

The developing neural network repeatedly fits and then destroys the fitted state until the network finds a global optimal solution. Because only some of the convolutional layers are trained when the number of iterations is greater than 4×10⁴ and less than 1×10⁵, the network training time is reduced. When the network tends to converge, adding and removing convolutional layers prevents over-fitting and lets the network escape the local optimum and search for the global optimum, which effectively improves the accuracy of height estimation.
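The iteration schedule can be written down as a small lookup function. This sketch only tracks how many convolutional layers are active in each of the five modules; it does not train anything. The function name `active_conv_layers` is hypothetical, and the handling of the exact boundary iterations (6×10⁴, 8×10⁴ and 1×10⁵), which the text leaves open, is an assumption.

```python
FULL_LAYOUT = [2, 2, 3, 3, 3]  # conv layers per module in the pre-trained f4


def active_conv_layers(iteration):
    """Return the number of active conv layers per module at an iteration.

    Assumption: each stage owns its lower boundary, except that iteration
    40000 itself still belongs to pre-training (the model is saved there).
    """
    if iteration <= 40_000:      # pre-training: all 13 layers active
        return list(FULL_LAYOUT)
    if iteration < 60_000:       # only the first layer of each module
        return [1, 1, 1, 1, 1]
    if iteration < 80_000:       # modules one and two recover one layer each
        return [2, 2, 1, 1, 1]
    if iteration < 100_000:      # modules three to five recover one layer each
        return [2, 2, 2, 2, 2]
    return list(FULL_LAYOUT)     # modules three to five recover their last layer


for it in (10_000, 50_000, 70_000, 90_000, 120_000):
    layout = active_conv_layers(it)
    print(it, layout, sum(layout))
```

The totals printed for the five sample iterations follow the 13, 5, 7, 10, 13 sequence given in the text, which is a quick consistency check on the per-module counts.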

Drawings

FIG. 1 is a schematic diagram of the framework of the arbitrary posture human height estimation method based on a single depth image and a multi-stage neural network according to an embodiment of the present invention. An edge image is first extracted from the acquired depth image, and the depth image and the edge image are input to the neural network f¹(x) to obtain the torso image; the torso image and the depth image are then input to the neural network f²(x) to obtain the body part image; the body part image and the depth image are input to the neural network f³(x) to obtain the predicted body part lengths; finally, the body part lengths, the body part image and the depth image are input to the neural network f⁴(x), which outputs the estimated human height.

FIG. 2 is the network structure diagram of the convolutional neural network f¹(x) according to an embodiment of the present invention, which takes the depth image and the edge image as input and outputs the torso image. It comprises 10 convolution modules with 26 convolutional layers in total. The first 5 convolution modules extract features, and the last 5 convolution modules generate the torso image.

FIG. 3 is the network structure diagram of the convolutional neural network f²(x) according to an embodiment of the present invention, which takes the depth image and the torso image as input and outputs the body part image. It comprises 10 convolution modules with 26 convolutional layers in total. The first 5 convolution modules extract features, and the last 5 convolution modules generate the body part image.

FIG. 4 is the network structure diagram of the convolutional neural network f³(x) according to an embodiment of the present invention, which takes the depth image and the body part image as input and outputs the estimated body part lengths. It comprises 5 convolution modules, 13 convolutional layers and 6 fully connected layers. The 5 convolution modules extract features, and the estimated lengths of the 4 body parts (head, upper body, thigh and calf) are obtained through the 6 fully connected layers.

FIG. 5 is the network structure diagram of the convolutional neural network f⁴(x) according to an embodiment of the present invention, which takes the depth image, the body part lengths and the body part image as input and outputs the height of the human body. It comprises 5 convolution modules, 13 convolutional layers and 7 fully connected layers. The 5 convolution modules extract features, and the estimated human height is then obtained through the 7 fully connected layers.

FIG. 6 shows how the network structure of the convolutional neural network f⁴(x) changes with the number of iterations according to an embodiment of the present invention.

FIG. 7 is an example of some experimental results according to an embodiment of the present invention.

Detailed Description

The invention provides a method for estimating the height of a human body in any posture based on a single depth image and a multi-stage neural network, which achieves accurate height estimation for a person at any position and in any posture through a 4-stage neural network framework and the developing neural network architecture. First, a large number of depth images of many people are collected, the height of each person being measured and the length of each body part are recorded, and the torso images and body part images are annotated as ground truth to construct the data set. Next, the network and model are designed: height estimation is converted into length prediction for four approximately rigid body parts, and a 4-stage convolutional neural network is designed to carry out this process. Finally, the network architecture and training method are improved by proposing an architecture called the developing neural network, which reduces training time, prevents over-fitting and further improves the accuracy of network prediction.

In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.

The invention provides a method for estimating the height of a human body in any posture based on a single depth image and a multi-stage neural network; FIG. 1 is a schematic diagram of the framework of this height estimation method. As shown in FIG. 1, the invention achieves accurate estimation of human height mainly through a 4-stage neural network framework comprising the neural networks f¹(x), f²(x), f³(x) and f⁴(x). The neural network f¹(x) estimates the torso image of the person being measured from the depth image and the edge image. The neural network f²(x) estimates, from the depth image and the torso image output by f¹(x), the segmentation image of four approximately rigid body parts of the person being measured: the head, the upper body, the thigh and the calf. The neural network f³(x) estimates the lengths of the 4 body parts from the depth image and the body part image output by f²(x). The neural network f⁴(x) estimates the height of the human body from the depth image, the body part image and the body part lengths output by f³(x). The invention also designs a mixed pooling strategy and the developing neural network and applies them to f⁴(x), further improving the accuracy of height estimation.
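The data flow between the four stages can be summarized as a short function. This is a structural sketch only: the stages are passed in as callables, the stand-in stubs below merely record which inputs each stage receives, and the names `estimate_height` and `stub` as well as the placeholder return values are hypothetical.

```python
def estimate_height(X, canny, f1, f2, f3, f4):
    """Chain the four stages; X is fed to every stage to limit error build-up."""
    E = canny(X)                  # edge image
    T_pred = f1(X, E)             # torso segmentation image
    L_pred = f2(X, T_pred)        # body part segmentation image
    part_lengths = f3(X, L_pred)  # [head, upper body, thigh, calf]
    return f4(X, L_pred, part_lengths)


# Stand-in callables that only record the data flow; in the method itself
# each stage is a trained convolutional neural network.
calls = []


def stub(name, result):
    def run(*inputs):
        calls.append((name, len(inputs)))
        return result
    return run


height = estimate_height(
    X="depth",
    canny=stub("canny", "edges"),
    f1=stub("f1", "torso"),
    f2=stub("f2", "parts"),
    f3=stub("f3", [1.0, 2.0, 3.0, 4.0]),
    f4=stub("f4", 170.0),         # placeholder output, not a real result
)
```

Passing the original depth image to every stage, rather than only the previous stage's output, is the mechanism the text relies on to keep segmentation errors from compounding.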

With reference to fig. 2 to 6, specific technical steps of the above-mentioned framework are described:

S1: segmenting the human torso image from the depth image and the edge image.

In this step, an edge image is first obtained from the depth image with the Canny operator. Specifically, a Gaussian filter is used to smooth the image, and non-maximum suppression is then applied to obtain the edge information of the depth image. The depth image and the edge image are then input to the convolutional neural network f¹(x). Five down-sampling modules extract the features of the input images and represent them as a low-dimensional feature vector; module one and module two each comprise 2 convolutional layers, and module three, module four and module five each comprise 3 convolutional layers. The feature vector is then input to 5 up-sampling modules; module six, module seven and module eight each comprise 3 convolutional layers, and module nine and module ten each comprise 2 convolutional layers. The 5 up-sampling modules output the corresponding human torso image from the input feature vector.

The network architecture of the convolutional neural network f¹(x) is shown in FIG. 2.
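As a quick consistency check on the module layout described above, the per-module layer counts can be tallied in a few lines. The list names and the helper `total_conv_layers` are hypothetical; the counts themselves are the ones given for modules one to ten.

```python
# Convolutional layers per module, as described for f1 (f2 shares the layout).
DOWN_MODULES = [2, 2, 3, 3, 3]   # modules one to five (feature extraction)
UP_MODULES = [3, 3, 3, 2, 2]     # modules six to ten (image generation)


def total_conv_layers(down, up):
    """Count convolutional layers in an encoder-decoder layout."""
    return sum(down) + sum(up)


is_mirrored = UP_MODULES == DOWN_MODULES[::-1]  # decoder mirrors the encoder
n_layers = total_conv_layers(DOWN_MODULES, UP_MODULES)
```

The tally comes to 26 layers across 10 modules, which agrees with the figure description for this network.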

In this step, the loss value Loss1 quantifies the error between the image estimated by the neural network f¹(x) and the real image; specifically, it is the mean of the squared pixel-by-pixel differences between the predicted torso image and the real torso image.

Loss1＝(1/N)·Σ_i (f¹(X,E)_i − T_i)²

N is the total number of pixels in the image, i is the loop variable indexing a pixel in the image, X is the depth image, E is the edge image, and T is the real torso image. The initial learning rate is set to 0.0001 and is decayed by a factor of 0.8 every 5 epochs. The Adam optimizer computes the update step size, and the estimated human torso image is obtained by minimizing Loss1.
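One plausible reading of the learning rate rule is a step decay, sketched below. It assumes that "rounds" means training epochs and that the rate is multiplied by 0.8 at each step; both readings are assumptions, as is the function name `learning_rate`.

```python
def learning_rate(epoch, initial=1e-4, factor=0.8, step=5):
    """Step decay: multiply the rate by `factor` once every `step` epochs."""
    return initial * factor ** (epoch // step)


# Rate at the start, just before the first decay, and after one and two decays.
schedule = [learning_rate(e) for e in (0, 4, 5, 10)]
```

Under this reading the rate stays at its initial value for epochs 0 to 4, drops to 0.8 of that value for epochs 5 to 9, and so on.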

S2: obtaining the body part image from the depth image and the torso image.

In this step, the torso image is further divided by body part to obtain the body part image, and the depth image is input at the same time to eliminate accumulated error, so that human height estimation is converted into estimating the lengths of four approximately rigid parts, namely the head, the upper body, the thigh and the calf. Specifically, the depth image and the torso image are input to the convolutional neural network f²(x). Five down-sampling modules extract the features of the input images and represent them as a low-dimensional feature vector; module one and module two each comprise 2 convolutional layers, and module three, module four and module five each comprise 3 convolutional layers. The feature vector is then input to 5 up-sampling modules; module six, module seven and module eight each comprise 3 convolutional layers, and module nine and module ten each comprise 2 convolutional layers. The 5 up-sampling modules output the corresponding body part image from the input feature vector.

The network architecture of the convolutional neural network f²(x) is shown in Fig. 3.

In this step, the loss value Loss2 quantifies the error between the image estimated by the neural network f²(x) and the real image; specifically, it is the average of the pixel-by-pixel squared differences between the predicted body part image and the real body part image.
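The formula for Loss2 does not appear in this text. From the verbal description, it can presumably be written as below; the subscript notation f²(X, T)_i for the predicted body part image at pixel i is an assumption introduced here for illustration.

```latex
\mathrm{Loss2} = \frac{1}{N} \sum_{i=1}^{N} \left( f^{2}(X, T)_{i} - L_{i} \right)^{2}
```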

N is the total number of pixels in the image, i is a loop variable indexing a pixel in the image, X is the depth image, T is the human torso image, and L is the real body part image. The initial learning rate is set to 0.0001, with a decay of 0.8 every 5 rounds. The Adam optimizer is used to compute the update step size, and the estimated body part image is obtained by minimizing Loss2.

S3, obtaining the body part lengths from the depth image and the body part image.

This step estimates the lengths of four approximately rigid body parts (the head, the upper body, the thigh and the calf) from the depth image and the body part image. The specific method is as follows: the depth image and the body part image are input into the convolutional neural network f³(x). First, 5 downsampling modules extract image features; module one and module two each contain 2 convolutional layers, and module three, module four and module five each contain 3 convolutional layers. Then the output features of the last layer are flattened into a 25088-dimensional vector and passed through 6 fully-connected layers in sequence, which output a four-dimensional vector representing the lengths of the 4 body parts: head, upper body, thigh and calf. The process can be expressed as:

[H^head H^upperbody H^thigh H^calf]^1*4＝f³(X，L)

H^head is the head length, H^upperbody is the upper body length, H^thigh is the thigh length, H^calf is the calf length, X is the depth image, and L is the body part image.
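The 25088-dimensional flattened vector is consistent with a VGG-16-style feature extractor, whose 2-2-3-3-3 convolution layout matches the five modules described here. The sketch below checks the arithmetic under assumptions the text does not state: a 224 × 224 input, 512 channels in the last module, and one 2 × 2 pooling per module. The function name `flattened_size` is hypothetical.

```python
def flattened_size(height: int, width: int, channels: int, num_pools: int, pool: int = 2) -> int:
    """Length of the vector obtained by flattening a feature map after pooling.

    Each pooling step is assumed to divide both spatial dimensions by `pool`.
    """
    for _ in range(num_pools):
        height //= pool
        width //= pool
    return height * width * channels


# Assumed values (not given in the text): 224 x 224 input, 512 final channels.
vector_length = flattened_size(224, 224, channels=512, num_pools=5)
```

Under these assumptions the spatial size shrinks from 224 to 7 over five poolings, and 7 × 7 × 512 gives the 25088 dimensions quoted in the text.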

The network structure of the convolutional neural network f³(x) is shown in Fig. 4, where the numbers of input and output nodes of each fully-connected layer are marked below the layer.

In this step, the loss value Loss3 quantifies the error between the lengths estimated by the neural network f³(x) and the real lengths; specifically, it is the sum of the squared differences between the estimated and real lengths of the four parts.
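The formula for Loss3 does not appear in this text. From the verbal description, and by analogy with the form of Loss4, it can presumably be written as below; the index p over the four parts is notation introduced here for illustration.

```latex
\mathrm{Loss3} = \sum_{p \in \{\mathrm{head},\, \mathrm{upperbody},\, \mathrm{thigh},\, \mathrm{calf}\}} \left| H^{p} - TH^{p} \right|^{2}
```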

TH^head, TH^upperbody, TH^thigh and TH^calf represent the actual lengths of the head, upper body, thigh and calf, respectively. The initial learning rate is set to 0.0001, with a decay of 0.5 every 50 rounds. The Adam optimizer is used to compute the update step size, and the estimated lengths of the 4 body parts are obtained by minimizing Loss3.

S4, obtaining the height of the human body from the depth image, the body part lengths and the body part image.

This step estimates the height of the human body by combining the body part lengths with the body part image, and also inputs the original depth image to eliminate accumulated error. The specific method is as follows: the depth image, the body part image and the body part lengths are input into the convolutional neural network f⁴(x). First, 5 downsampling modules extract the features relevant to height estimation; module one and module two each contain 2 convolutional layers, and module three, module four and module five each contain 3 convolutional layers. A skip connection is added before each convolutional layer, through which the depth image and the body part segmentation image at the corresponding scale are input; this alleviates the vanishing-gradient problem in a deep network, helps back-propagation of the gradient, and accelerates training. Different pooling strategies are chosen according to the characteristics of each input image, so as to avoid introducing noise or losing the gradient of the original image: average pooling is applied to the depth image and max pooling to the body part image. Finally, the output features of the last layer are flattened into a 25088-dimensional vector and passed through 7 fully-connected layers in sequence, which output a one-dimensional vector representing the height of the human body. The process can be expressed as:

[H^human]＝f⁴(X,L,H_4-part)

H^human is the estimate of body height, and H_4-part denotes the predicted lengths of the 4 approximately rigid body parts from step S3.

The network structure of the convolutional neural network f⁴(x) is shown in Fig. 5, where the numbers of input and output nodes of each fully-connected layer are marked below the layer.

In this step, the loss value Loss4 quantifies the error between the height estimated by the neural network f⁴(x) and the real height; specifically, it is the square of the difference between the estimated height and the real height.

Loss4＝|H^human-TH^human|²

The initial learning rate was set to 0.0001 with a reduction of 0.5 every 50 rounds. An Adam optimizer is used to calculate the update step size, and an estimate of height is obtained by minimizing Loss 4.

S5, optimizing the convolutional neural network f⁴(x) using the developing neural network framework.

Directly using the convolutional neural network f⁴(x) of step S4 to estimate human height is prone to overfitting, which affects prediction accuracy. This step applies the developing neural network to the convolutional neural network f⁴(x) in order to improve network accuracy, reduce training time and prevent overfitting. During training of f⁴(x): when the number of iterations is less than 4 × 10⁴, the network is pre-trained, with all 13 convolutional layers in the 5 modules active; when the number of iterations equals 4 × 10⁴, the pre-trained model is saved; when the number of iterations is greater than 4 × 10⁴ and less than 6 × 10⁴, each module keeps only its first convolutional layer, so 5 convolutional layers in total are trained; when the number of iterations is greater than 6 × 10⁴ and less than 8 × 10⁴, module one and module two each restore one convolutional layer from the pre-trained model, giving 7 convolutional layers in total; when the number of iterations is greater than 8 × 10⁴ and less than 1 × 10⁵, module three, module four and module five each restore one convolutional layer from the pre-trained model, giving 10 convolutional layers; when the number of iterations is greater than 1 × 10⁵, module three, module four and module five each restore one more convolutional layer from the pre-trained model, so that the original 13 convolutional layers are restored. The network thus iteratively fits and then breaks the fitting conditions until it finds a globally optimal solution.
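The layer schedule of the developing neural network can be summarized as a lookup from iteration count to the number of active convolutional layers per module. This is a minimal sketch: the function name `active_conv_layers` is hypothetical, and because the text leaves the exact boundary iterations (6 × 10⁴, 8 × 10⁴, 1 × 10⁵) unassigned, the code assumes each boundary belongs to the later phase.

```python
FULL_LAYOUT = (2, 2, 3, 3, 3)  # conv layers per module in f4, 13 in total


def active_conv_layers(iteration: int) -> tuple:
    """Number of conv layers kept in each of the 5 modules at an iteration."""
    if iteration <= 40_000:
        # Pre-training with the full network; the model is saved at 40,000.
        return FULL_LAYOUT
    if iteration < 60_000:
        # Every module keeps only its first conv layer (5 layers).
        return (1, 1, 1, 1, 1)
    if iteration < 80_000:
        # Modules one and two each restore one layer (7 layers).
        return (2, 2, 1, 1, 1)
    if iteration < 100_000:
        # Modules three to five each restore one layer (10 layers).
        return (2, 2, 2, 2, 2)
    # Modules three to five each restore one more layer (13 layers again).
    return FULL_LAYOUT


layer_totals = [sum(active_conv_layers(it)) for it in (10_000, 50_000, 70_000, 90_000, 120_000)]
```

In a real training loop, the restored layers would be reloaded from the checkpoint saved at 4 × 10⁴ iterations; that bookkeeping is omitted here.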

FIG. 5 is a schematic diagram of how the convolutional layers of the convolutional neural network f⁴(x) change with the number of iterations after the developing neural network framework is applied.

FIG. 6 is an example of the experimental results of the present invention.

Experiments prove that the technology can accurately estimate the height of the human body from a single depth image.

The techniques of the present invention may be implemented in computer software, for example written in Python; the development environment may be, for example, Windows 10 with PyCharm version 2018.3.

The hardware support required is:

CPU: Intel Core i7-7700K processor

GPU: NVIDIA GeForce RTX 2080 Ti Founders Edition

The required deep learning environments are:

PyTorch 1.1.0

NVIDIA CUDA 10.1.120 driver

cuDNN-10.0-windows10-x64 v7.3.1.20

Experiments show that the invention can accurately predict the height of a human body from a single depth image alone, and that the person being measured may be at any position in the image and in any posture.

The above-mentioned embodiments are intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only examples of the present invention, and are not intended to limit the present invention, and any modifications, equivalent substitutions, improvements and the like within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for estimating the height of a human body in any posture based on a single depth image and a multi-stage neural network, which can accurately measure the height of a human body at any position and in any posture, with an average prediction accuracy of 99.1%. The technique can be summarized as the following steps.

S6, predicting the height of the human body: a convolutional neural network f⁴(x) is designed to predict human height from the original depth image, the body part image and the body part lengths. Different pooling strategies are adopted according to the characteristics of the different input data, and a skip connection structure is used to input the original input data into each convolutional layer.

3. The method for arbitrary pose human height estimation based on a single depth image and a multi-stage neural network of claim 1, wherein: a multi-stage neural network is used to extract features relevant to height prediction, and the height estimation problem is converted into several small local problems. The human body image is obtained through the convolutional neural network f¹(x), the body part image through the convolutional neural network f²(x), and the body part lengths through the convolutional neural network f³(x), so that human height estimation is decomposed into separate predictions for four approximately rigid parts (the head, the upper body, the thigh and the calf), and the four results are integrated into the measured human height. The benefits are: predicting the four rigid parts is an easier problem than predicting height directly; the lengths of the four body parts and the topological relationship between them hint at the posture of the human body, providing a useful clue for height estimation; and dividing height prediction into four small local problems makes full use of the strong local-perception performance of convolutional neural networks.
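The data flow between the four stages can be sketched as a simple composition. This is only an illustration of which outputs feed which inputs: the function name `estimate_height`, the stub stages and the numeric values are hypothetical placeholders, not the trained networks or measured results.

```python
def estimate_height(depth, canny, f1, f2, f3, f4):
    """Chain the stages; the raw depth image X is re-fed to every network."""
    edges = canny(depth)               # E  = Canny(X)
    torso = f1(depth, edges)           # T' = f1(X, E)
    parts = f2(depth, torso)           # L' = f2(X, T')
    lengths = f3(depth, parts)         # four part lengths = f3(X, L')
    return f4(depth, parts, lengths)   # height = f4(X, L', H_4-part)


trace = []


def stub(name, result):
    """Build a placeholder stage that records its inputs and returns `result`."""
    def stage(*inputs):
        trace.append((name, inputs))
        return result
    return stage


height = estimate_height(
    depth="X",
    canny=stub("canny", "E"),
    f1=stub("f1", "T'"),
    f2=stub("f2", "L'"),
    f3=stub("f3", [0.24, 0.62, 0.45, 0.41]),  # made-up part lengths
    f4=stub("f4", 1.72),                      # made-up height
)
```

Passing the depth image to every stage mirrors the stated goal of limiting error accumulation across stages.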

4. The arbitrary pose human height estimation method based on a single depth image and a multi-stage neural network of claim 1, further comprising: a convolutional neural network f¹(x) used for segmenting the human body image from the single depth image. Since, in the field of height measurement, height is defined as the distance between the top of the head and the sole of the foot when the body is upright, locating the head vertex and the sole point is of great importance. In existing human body segmentation methods, segmentation at the body edge is often inaccurate, which affects the selection of the head vertex and the sole point. The method uses the Canny operator to extract edge information from the depth image and uses it to enhance the edges of the human body segmentation image.

E＝Canny(X)

T′＝f¹(X,E)

The loss value Loss1 of the convolutional neural network f¹(x) is the average of the pixel-by-pixel squared differences between the predicted torso image and the true torso image, and the optimizer is Adam.
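To illustrate why an edge map is a useful second input for f¹(x), the sketch below marks depth discontinuities on a tiny grid. It is deliberately not the Canny operator: it is a simplified neighbour-difference threshold standing in for it, and the function name `depth_edges`, the threshold and the toy depth values are all hypothetical. How X and E are combined at the network input is not stated in the text; pairing them as two channels is an assumption.

```python
def depth_edges(depth, threshold):
    """Mark pixels whose depth differs sharply from the right or lower neighbour.

    Simplified stand-in for an edge detector; not the Canny operator.
    """
    rows, cols = len(depth), len(depth[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            right = abs(depth[r][c] - depth[r][c + 1]) if c + 1 < cols else 0
            down = abs(depth[r][c] - depth[r + 1][c]) if r + 1 < rows else 0
            if max(right, down) > threshold:
                edges[r][c] = 1
    return edges


# Toy depth map: a nearer object (1.5) in front of a background (3.0).
depth = [
    [3.0, 3.0, 3.0, 3.0],
    [3.0, 1.5, 1.5, 3.0],
    [3.0, 1.5, 1.5, 3.0],
    [3.0, 3.0, 3.0, 3.0],
]
edges = depth_edges(depth, threshold=0.5)

# Assumed input layout: depth and edge map stacked as two channels.
network_input = [depth, edges]
```

The edge pixels trace the silhouette of the nearer object, which is the region where the text says segmentation tends to be inaccurate.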

L′＝f²(X,T′)

6. The arbitrary pose human height estimation method based on a single depth image and a multi-stage neural network of claim 1, further comprising: a convolutional neural network f³(x) used for predicting the lengths of four approximately rigid parts, namely the head, the upper body, the thigh and the calf. The convolutional neural network f³(x) comprises 13 convolutional layers and 6 fully-connected layers with their corresponding activation functions. The 13 convolutional layers form 5 downsampling modules: module one and module two each contain 2 convolutional layers with activation functions, and module three, module four and module five each contain 3 convolutional layers with activation functions. A 1 × 4 vector representing the lengths of the 4 body parts is output through the 6 fully-connected layers, whose numbers of nodes are, respectively, 4096, 4096, 1000, 256, 64 and 4. While the body part image is input to the convolutional neural network f³(x), the original depth image X is also input to the network in order to reduce accumulated errors, and the length estimates of the 4 body parts (head, upper body, thigh and calf) are obtained.

[H^head H^upperbody H^thigh H^calf]^1*4＝f³(X，L)

TH^head, TH^upperbody, TH^thigh and TH^calf represent the actual lengths of the head, upper body, thigh and calf, respectively. The optimizer is Adam, and the lengths of the 4 approximately rigid parts are each predicted by minimizing the loss value.
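The fully-connected head of f³(x) can be laid out from the node counts listed in this claim together with the 25088-dimensional flattened input given in the description. The sketch below pairs up consecutive sizes and counts parameters; it assumes each layer has a bias term, which the text does not state, and the names `F3_HEAD` and `dense_parameter_count` are hypothetical.

```python
def dense_parameter_count(sizes):
    """Weights plus biases for a chain of fully-connected layers.

    Assumption: every layer has one bias per output node.
    """
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Flattened conv output followed by the six node counts from the claim.
F3_HEAD = [25088, 4096, 4096, 1000, 256, 64, 4]

layer_shapes = list(zip(F3_HEAD, F3_HEAD[1:]))  # (inputs, outputs) per layer
total_parameters = dense_parameter_count(F3_HEAD)
first_layer_share = dense_parameter_count(F3_HEAD[:2]) / total_parameters
```

Under these assumptions the first layer, which maps 25088 inputs to 4096 nodes, accounts for most of the head's parameters.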

7. The arbitrary pose human height estimation method based on a single depth image and a multi-stage neural network of claim 1, further comprising: a convolutional neural network f⁴(x) that estimates the height of the human body from the body part image, the body part lengths and the original depth image. The convolutional neural network f⁴(x) comprises 13 convolutional layers and 7 fully-connected layers with their corresponding activation functions. The 13 convolutional layers form 5 downsampling modules: module one and module two each contain 2 convolutional layers with activation functions, and module three, module four and module five each contain 3 convolutional layers with activation functions. The result of the last convolutional layer is flattened into a one-dimensional vector, and the estimated height value is output through the 7 fully-connected layers, wherein the number of nodes of the 7 fully-connected layers is 4096, 1000, 256, 64, 16 and 1. In order to reduce accumulated errors and improve prediction accuracy, the original depth image X and the predicted body part segmentation image L′ are input together into the convolutional neural network f⁴(x). The invention also adopts a skip structure with different pooling strategies according to the characteristics of the different input data: average pooling is applied to the depth image and max pooling to the body part segmentation image, so that input data at different scales are input directly into each convolutional layer.

[H^human]＝f⁴(X，L′，H_4-part)

Loss4＝|H^human-TH^human|²

8. The arbitrary pose human height estimation method based on a single depth image and a multi-stage neural network of claim 7, further comprising: a skip connection structure with a hybrid pooling strategy. The convolutional neural network f⁴(x) has 13 convolutional layers; adding skip connections to f⁴(x) alleviates the vanishing-gradient problem in a deep network, helps back-propagation of the gradient, and accelerates training. In steps S3, S4 and S5, three neural networks are used to obtain estimates of the lengths of the head, upper body, thigh and calf, which are then fed as input data to the convolutional neural network f⁴(x). Because the estimate predicted by each neural network deviates from the true value, and that error is passed into the neural network of the next stage, the error grows. The skip connection structure is used to input the original depth image X and the body part image L directly into the convolutional neural network f⁴(x), so as to minimize the accumulated error.

L_next＝Maxpool(L_now)

X_next＝Avgpool(X_now)
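One step of the hybrid pooling rule above can be sketched on small grids. This is a minimal illustration assuming non-overlapping 2 × 2 windows and integer part labels, neither of which is specified in the text; the helper names `pool2x2`, `mean` and `next_scale` and the toy values are hypothetical.

```python
def pool2x2(image, reduce):
    """Apply `reduce` to each non-overlapping 2x2 window of a 2-D list."""
    return [
        [
            reduce([image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1]])
            for c in range(0, len(image[0]) - 1, 2)
        ]
        for r in range(0, len(image) - 1, 2)
    ]


def mean(values):
    return sum(values) / len(values)


def next_scale(depth, parts):
    """X_next = Avgpool(X_now), L_next = Maxpool(L_now)."""
    return pool2x2(depth, mean), pool2x2(parts, max)


depth = [[2.0, 2.0, 4.0, 4.0],
         [2.0, 2.0, 4.0, 4.0],
         [1.0, 3.0, 0.0, 0.0],
         [1.0, 3.0, 0.0, 0.0]]
parts = [[0, 1, 0, 0],   # assumed encoding: 0 = background, 1..3 = part labels
         [0, 1, 0, 0],
         [2, 2, 0, 3],
         [2, 2, 0, 3]]

depth_half, parts_half = next_scale(depth, parts)
```

With this label encoding, max pooling always returns a value that is itself a valid label, whereas averaging a window such as 0, 1, 0, 1 would produce 0.5, which corresponds to no part; this is presumably the kind of noise the hybrid strategy is meant to avoid.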

9. The method for arbitrary pose human height estimation based on a single depth image and a multi-stage neural network of claim 7, further comprising: a developing neural network architecture and training method. The invention proposes the developing neural network and applies it to the convolutional neural network f⁴(x) so as to improve network accuracy, reduce training time and prevent overfitting.

The developing neural network works by iteratively fitting and then breaking the fitting conditions until the network finds a globally optimal solution. Because only some of the convolutional layers are trained when the number of iterations is greater than 4 × 10⁴ and less than 1 × 10⁵, the network training time is reduced. When the network tends to converge, overfitting is prevented by adding or removing convolutional layers, so that the network jumps out of a local optimum to search for the global optimum, which effectively improves the accuracy of height estimation.