CN109033945B - Human body contour extraction method based on deep learning - Google Patents


Info

Publication number
CN109033945B
CN109033945B (application CN201810582283.3A)
Authority
CN
China
Prior art keywords
layer
human body
point
image
original image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810582283.3A
Other languages
Chinese (zh)
Other versions
CN109033945A (en
Inventor
王林
董楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN201810582283.3A priority Critical patent/CN109033945B/en
Publication of CN109033945A publication Critical patent/CN109033945A/en
Application granted granted Critical
Publication of CN109033945B publication Critical patent/CN109033945B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449: Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters


Abstract

The invention discloses a human body contour extraction method based on deep learning, which is implemented according to the following steps: step 1, extracting Gabor texture features of an original image; step 2, extracting Canny edge characteristics of the original image; step 3, building a convolutional neural network framework suitable for human body contour extraction; step 4, transmitting the original image, the Gabor texture feature map extracted in the step 1 and the Canny edge feature map extracted in the step 2 into the convolutional neural network constructed in the step 3 together for training to generate a CNN character model; step 5, testing the structure of the trained CNN character model to obtain a human body contour image; and 6, recording the overlapping rate and the time consumption of the human body contour image through the testing process of the step 5, and evaluating the human body contour image. The method of the invention achieves higher accuracy, improves the detection rate and shortens the test time.

Description

Human body contour extraction method based on deep learning
Technical Field
The invention belongs to the technical field of machine vision, and particularly relates to a human body contour extraction method based on deep learning.
Background
Human body contour extraction plays an important role in the field of computer vision, and is a core technology of human body detection and human body behavior identification. The human body contour extraction technology is widely applied to the fields of intelligent monitoring, medical treatment and the like at present. The virtual reconstruction of the human body model is a key technology in a modern medical visualization system, and the accurate human body contour information acquisition can ensure that the reasonable medical analysis can be carried out on the diseases of the patient. On the other hand, along with the enhancement of the modern society on personal and public property safety requirements, the utilization rate of the intelligent monitoring system is gradually increased. The primary objective of the intelligent video monitoring technology is to acquire monitoring data by using various monitoring devices, so as to automatically understand and describe events occurring in a detected scene and predict events which may occur in the future. The human body contour extraction is used as a key supporting technology of the intelligent monitoring system, can provide the position and contour information of a human body in an image, is convenient for automatically tracking the human body and identifying behaviors, and therefore the purpose of intelligent monitoring is achieved.
At home and abroad, scholars propose various methods for realizing accurate human body recognition by extracting different features of images and combining classifier training aiming at the difficulty in human body detection of static images. However, although these conventional feature extraction methods can determine the position of the human body, they cannot accurately extract the contour of the human body. Aiming at the problem of target contour extraction, a plurality of effective schemes such as an active contour model, visual saliency and the like are provided. Although these methods can extract the target contour, they have some limitations in terms of computational complexity, real-time performance, and the like.
In recent years, a deep learning method gradually replaces a traditional feature extraction method, and breakthrough progress is made in the fields of target detection, image segmentation and the like. The purpose of deep learning is to automatically perform characteristic learning by simulating the operation of the human brain neural structure during data processing, and further complete the data processing result. Convolutional Neural Networks (CNN) are a type of model in deep learning methods, and their unique weight sharing structure and sparse connection mode make the Network itself dominant in image analysis.
Disclosure of Invention
The invention aims to provide a human body contour extraction method based on deep learning, and solves the problems that in the prior art, the human body contour extraction effect in a static image is poor and the model training speed is slow.
The invention adopts the technical scheme that a human body contour extraction method based on deep learning is implemented according to the following steps:
step 1, extracting Gabor texture features of an original image;
step 2, extracting Canny edge characteristics of the original image;
step 3, building a convolutional neural network framework suitable for human body contour extraction;
step 4, transmitting the original image, the Gabor texture feature map extracted in the step 1 and the Canny edge feature map extracted in the step 2 into the convolutional neural network constructed in the step 3 together for training to generate a CNN character model;
step 5, testing the structure of the trained CNN character model to obtain a human body contour image;
and 6, recording the overlapping rate and the time consumption of the human body contour image through the testing process of the step 5, and evaluating the human body contour image.
The present invention is also characterized in that,
the step 1 is implemented according to the following steps:
step 1.1, obtaining a two-dimensional Gabor filter according to a formula (1):
Ψ_{u,v}(z) = (‖k_{u,v}‖² / σ²) · exp(−‖k_{u,v}‖² ‖z‖² / (2σ²)) · [exp(i k_{u,v}·z) − exp(−σ²/2)]   (1)
in the formula (1), Ψ_{u,v} is the two-dimensional Gabor filter, u and v are respectively the orientation and scale of the Gabor kernel, where u is 0, π/4, π/2, 3π/4, π, 5π/4, 6π/4 or 7π/4, k_{u,v} is used for controlling the width of the Gaussian window, z = (x, y) is the spatial position coordinate, σ = 2π is the ratio of the Gaussian window width to the wavelength, and i is the imaginary unit;
wherein k_{u,v}, which determines the direction and wavelength of the oscillating part, is:
k_{u,v} = k_v · e^{i·φ_u}   (2)
in the formula (2), k_v = k_max/F^v is the sampling frequency of the filter, k_max = π/2 is the maximum sampling frequency, F = √2 is the spacing factor used to limit the sampling frequency of the filter, and φ_u = u is the directional selectivity of the filter;
step 1.2, performing a convolution operation between the original image I(x, y) and the two-dimensional Gabor filter obtained in step 1.1 to extract the Gabor feature G_{u,v}(x, y) of the original image at position (x, y), obtaining the Gabor texture features in 8 directions:
G_{u,v}(x, y) = I(x, y) * Ψ_{u,v}   (3).
the step 2 is implemented according to the following steps:
step 2.1, setting weight parameters for RGB three channels to complete gray processing of the original image, wherein the RGB three channel parameter setting expression is as follows:
Gray=R*0.299+G*0.587+B*0.114
step 2.2, processing the gray level image processed in the step 2.1 by using a first derivative of a two-dimensional Gaussian function, wherein the expression of the two-dimensional Gaussian function is as follows:
G(x, y) = (1 / (2πδ²)) · exp(−(x² + y²) / (2δ²))   (4)
in the formula (4), δ is a smoothing parameter, and the larger δ is, the more remarkable the smoothing effect is;
and smoothing the 3 × 3 area of the original image by using a 3 × 3 Gaussian convolution kernel, wherein the Gaussian convolution kernel is as follows:
K = (1/16) · [1 2 1; 2 4 2; 1 2 1]
step 2.3, searching for the positions of strongest gray-intensity change in the gray image processed in step 2.2, and calculating the first derivatives Z_x and Z_y of the gray image in the horizontal direction (abscissa) x and the vertical direction (ordinate) y with the Sobel operator, obtaining the boundary gradient magnitude |Z| and direction β:
|Z| = sqrt(Z_x² + Z_y²)
β = arctan(Z_y / Z_x)
wherein the Sobel operators in the abscissa x and ordinate y directions are:
S_x = [−1 0 1; −2 0 2; −1 0 1],  S_y = [−1 −2 −1; 0 0 0; 1 2 1]
step 2.4, dividing the boundary gradient magnitude |Z| obtained in step 2.3 equally into four gradient regions, each corresponding to one quadrant of the coordinate axes; then traversing all the points of each region one by one along the gradient direction β of each point, and comparing the gradient magnitude |Z| of each point with that of its two neighbours: if the point is larger than both the preceding and the following point it is kept, and if it is smaller than them it is set to zero; this performs non-maximum suppression on the gray image processed in step 2.3, thinning the edges and eliminating non-edge noise points;
step 2.5, setting the high threshold at 70% of the overall gray-level distribution of the gray image processed in step 2.4, and the low threshold at 1/2 of the high threshold; if the gray level of a point processed in step 2.4 is greater than the high threshold, its pixel value is set to 255; if it is less than the low threshold, its pixel value is set to 0; if it lies between the high threshold and the low threshold, the 8 neighbouring pixel values are examined: if no point with the value 255 exists among the 8 neighbours, the pixel value of the point is set to 0, and if a point with the value 255 exists in the neighbouring gradient region, the pixel value of the point is set to 255; the edge feature extraction is completed when all points have been processed.
Step 3 is specifically implemented according to the following steps:
step 3.1, modifying the VGG16 network structure on the basis of the VGG16 network model, reducing the 5 convolutional layers of the VGG16 network model to 4, and connecting a pooling layer behind each convolutional layer, using max pooling with a pooling window size of 2 × 2 and a stride of 2;
let P be an unknown pixel point and Q_11, Q_12, Q_21, Q_22 be four surrounding points whose pixel values are known; given the values of the known function f at Q_11 = (x_1, y_1), Q_12 = (x_1, y_2), Q_21 = (x_2, y_1) and Q_22 = (x_2, y_2), linear interpolation in the x direction gives the pixel values f(R_1) and f(R_2) of two intermediate points R_1, R_2, with which the pooled layers are filled back to the size of the original image:
f(R_1) ≈ ((x_2 − x)/(x_2 − x_1))·f(Q_11) + ((x − x_1)/(x_2 − x_1))·f(Q_21)
f(R_2) ≈ ((x_2 − x)/(x_2 − x_1))·f(Q_12) + ((x − x_1)/(x_2 − x_1))·f(Q_22)
wherein R_1 = (x, y_1), R_2 = (x, y_2);
then linear interpolation of R_1, R_2 in the y direction gives the pixel value f(P) of the unknown point P:
f(P) ≈ ((y_2 − y)/(y_2 − y_1))·f(R_1) + ((y − y_1)/(y_2 − y_1))·f(R_2)
the resulting pixel value f(x, y) of the unknown point P is:
f(x, y) ≈ [f(Q_11)(x_2 − x)(y_2 − y) + f(Q_21)(x − x_1)(y_2 − y) + f(Q_12)(x_2 − x)(y − y_1) + f(Q_22)(x − x_1)(y − y_1)] / ((x_2 − x_1)(y_2 − y_1));
further generating a deconvolution structure;
step 3.2, introducing a network in the network on the basis of the step 3.1, namely replacing each original convolutional layer structure with an MLP convolutional layer structure;
and 3.3, adding a dropout layer between the convolution layer processed in the step 3.2 and the deconvolution layer to prevent over-fitting of the network, and forming a symmetrical convolutional neural network.
Step 3.2 is specifically implemented according to the following steps:
after an original convolutional layer, a plurality of convolutional layers with convolution kernel of 1 × 1 are connected, and the feature map of each convolutional layer to the last convolutional layer is calculated by the following formula (13):
f^t_{x,y,k_t} = f(w^t_{k_t} · a_{x,y} + b^t_{k_t})   (13)
in the formula (13), (x, y) is the pixel index of the feature map, a_{x,y} is the input block centered at (x, y), k_t is the index of the feature map, t is the number of the MLP layer, f is the activation function, w is the weight coefficient and b is the bias;
and outputting the feature map of the current layer through the ReLU activation.
The convolutional neural network is:
the first layer is an input layer, and the input size of this layer is 224 × 224 × 12;
the second layer is a convolutional layer, and the input size of this layer is 112 × 112 × 8;
the third layer is a convolutional layer, and the input size of this layer is 56 × 56 × 16;
the fourth layer is a convolutional layer, and the input size of this layer is 28 × 28 × 32;
the fifth layer is a dropout layer, and the input size of this layer is 28 × 28 × 64;
the sixth layer is a deconvolution layer, and the output size of this layer is 56 × 56 × 32;
the seventh layer is a deconvolution layer, and the output size of this layer is 112 × 112 × 16;
the eighth layer is a deconvolution layer, and the output size of this layer is 224 × 224 × 8;
the ninth layer is an output layer with an output size of 224 × 224 × 1;
each convolutional layer has three convolutional sublayers, with a 1 × 1 convolutional kernel in between each two 3 × 3 convolutional kernels.
Step 4 is specifically implemented according to the following steps:
step 4.1, combining the Gabor characteristics in 8 directions obtained in the step 1, the Canny edge characteristics obtained in the step 2 and the RGB three-channel characteristics of the original image into 12 characteristic channels;
and 4.2, adjusting the image size of the 12 characteristic channels obtained in step 4.1 to 224 x 224, transmitting the 12 characteristic channels into the convolutional neural network constructed in step 3, and training by taking the ground-truth label map of the original image as the teacher signal of the CNN to generate the CNN character model.
Step 5 is specifically implemented according to the following steps:
step 5.1, respectively calculating, for the image to be detected, the Gabor texture features as extracted in step 1 and the Canny edge features as extracted in step 2;
step 5.2, inputting RGB three-channel characteristics, Gabor texture characteristics and Canny edge characteristics of the image to be detected into the CNN model trained in the step 4 to obtain a human body contour heat map;
step 5.3, performing opening and closing operation on the human body contour heat map obtained in the step 5.2, and performing smoothing treatment on the image through a Gaussian low-pass filter to obtain a human body mask;
and 5.4, performing an AND operation on the original image and the human body mask obtained in the step 5.3, namely performing an AND operation on each pixel point of the original image and a corresponding pixel point of the generated human body mask, setting the corresponding position in the original image as 0 if the pixel at the corresponding position of the mask is 0, and taking the original pixel at the corresponding position in the original image to obtain a human body contour image if the pixel in the mask is 1.
The beneficial effect of the invention is that,
(1) according to the human body contour extraction method based on deep learning, the operation of region selection on an original image is not needed, and a series of complex operations such as region combination on an output object are not needed;
(2) according to the human body contour extraction method based on deep learning, Canny edge characteristics are added on the basis of Gabor-CNN, high accuracy is achieved, the detection rate is improved, and the testing time is shortened.
Drawings
FIG. 1 is a schematic diagram of bilinear interpolation;
FIG. 2 is a graph of a convolutional neural network upsampling DAG used by the extraction method of the present invention;
FIG. 3 is a general structure diagram of a CNN used in the extraction method of the present invention;
FIG. 4 is a structural diagram of the extraction method of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to a human body contour extraction method based on deep learning, which is implemented according to the following steps:
step 1, extracting Gabor texture features of an original image;
step 1.1, obtaining a two-dimensional Gabor filter according to a formula (1):
Ψ_{u,v}(z) = (‖k_{u,v}‖² / σ²) · exp(−‖k_{u,v}‖² ‖z‖² / (2σ²)) · [exp(i k_{u,v}·z) − exp(−σ²/2)]   (1)
in the formula (1), Ψ_{u,v} is the two-dimensional Gabor filter, u and v are respectively the orientation and scale of the Gabor kernel, where u is 0, π/4, π/2, 3π/4, π, 5π/4, 6π/4 or 7π/4, k_{u,v} is used for controlling the width of the Gaussian window, z = (x, y) is the spatial position coordinate, σ = 2π is the ratio of the Gaussian window width to the wavelength, and i is the imaginary unit;
wherein k_{u,v}, which determines the direction and wavelength of the oscillating part, is:
k_{u,v} = k_v · e^{i·φ_u}   (2)
in the formula (2), k_v = k_max/F^v is the sampling frequency of the filter, k_max = π/2 is the maximum sampling frequency, F = √2 is the spacing factor used to limit the sampling frequency of the filter, and φ_u = u is the directional selectivity of the filter;
step 1.2, performing a convolution operation between the original image I(x, y) and the two-dimensional Gabor filter obtained in step 1.1 to extract the Gabor feature G_{u,v}(x, y) of the original image at position (x, y):
G_{u,v}(x, y) = I(x, y) * Ψ_{u,v}   (3)
In order to obtain the features of the original image, in particular the local saliency features in multiple directions, a set of two-dimensional Gabor filters (Gabor kernel functions) with the 8 directions u = 0, π/4, π/2, 3π/4, π, 5π/4, 6π/4, 7π/4 is used, with σ = 2π, k_max = π/2 and F = √2, obtaining the Gabor texture features in 8 directions;
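As a non-authoritative illustration of this step, the 8-direction Gabor filtering can be sketched in Python with OpenCV. Note that cv2.getGaborKernel uses a (sigma, lambda, gamma, psi) parameterization rather than the k_{u,v}/σ form of formula (1), so the numeric parameter values below are illustrative assumptions, not the values prescribed above:

import cv2
import numpy as np

def gabor_features(gray, ksize=31):
    # Sketch of step 1: filter a grayscale image with 8 Gabor orientations.
    # OpenCV parameterizes the kernel by (sigma, lambda, gamma, psi) instead of
    # the k_{u,v}/sigma form of formula (1); the values here are assumptions.
    responses = []
    for k in range(8):
        theta = k * np.pi / 4  # u = 0, pi/4, ..., 7*pi/4
        kernel = cv2.getGaborKernel((ksize, ksize), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5, psi=0.0,
                                    ktype=cv2.CV_32F)
        responses.append(cv2.filter2D(gray, cv2.CV_32F, kernel))
    return np.stack(responses, axis=-1)  # H x W x 8 Gabor texture channels

# usage: feats = gabor_features(cv2.imread("person.jpg", cv2.IMREAD_GRAYSCALE))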
step 2, extracting Canny edge characteristics of the original image;
step 2.1, setting weight parameters for RGB three channels to complete gray processing of the original image, wherein the RGB three channel parameter setting expression is as follows:
Gray=R*0.299+G*0.587+B*0.114
step 2.2, processing the gray level image processed in the step 2.1 by using a first derivative of a two-dimensional Gaussian function, wherein the expression of the two-dimensional Gaussian function is as follows:
G(x, y) = (1 / (2πδ²)) · exp(−(x² + y²) / (2δ²))   (4)
in the formula (4), δ is a smoothing parameter, and the larger δ is, the more remarkable the smoothing effect is;
and smoothing the 3 × 3 area of the original image by using a 3 × 3 Gaussian convolution kernel, wherein the Gaussian convolution kernel is as follows:
K = (1/16) · [1 2 1; 2 4 2; 1 2 1]
step 2.3, searching for the positions of strongest gray-intensity change in the gray image processed in step 2.2, and calculating the first derivatives Z_x and Z_y of the gray image in the horizontal direction (abscissa) x and the vertical direction (ordinate) y with the Sobel operator, obtaining the boundary gradient magnitude |Z| and direction β:
|Z| = sqrt(Z_x² + Z_y²)
β = arctan(Z_y / Z_x)
wherein the Sobel operators in the abscissa x and ordinate y directions are:
S_x = [−1 0 1; −2 0 2; −1 0 1],  S_y = [−1 −2 −1; 0 0 0; 1 2 1]
step 2.4, dividing the boundary gradient magnitude |Z| obtained in step 2.3 equally into four gradient regions, each corresponding to one quadrant of the coordinate axes; then traversing all the points of each region one by one along the gradient direction β of each point, and comparing the gradient magnitude |Z| of each point with that of its two neighbours: if the point is larger than both the preceding and the following point it is kept, and if it is smaller than them it is set to zero; this performs non-maximum suppression on the gray image processed in step 2.3, thinning the edges and eliminating non-edge noise points;
step 2.5, setting the high threshold at 70% of the overall gray-level distribution of the gray image processed in step 2.4, and the low threshold at 1/2 of the high threshold; if the gray level of a point processed in step 2.4 is greater than the high threshold, its pixel value is set to 255; if it is less than the low threshold, its pixel value is set to 0; if it lies between the high threshold and the low threshold, the 8 neighbouring pixel values are examined: if no point with the value 255 exists among the 8 neighbours, the pixel value of the point is set to 0, and if a point with the value 255 exists in the neighbouring gradient region, the pixel value of the point is set to 255; the edge feature extraction is completed when all points have been processed;
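A minimal Python/OpenCV sketch of steps 2.1 to 2.5 follows. cv2.Canny performs the Gaussian smoothing, Sobel gradients, non-maximum suppression and hysteresis thresholding internally, so only the grayscale weighting and the threshold choice are reproduced explicitly; reading the high threshold as the 70th percentile of the gray levels and the low threshold as half of it is an assumption:

import cv2
import numpy as np

def canny_edges(bgr):
    # Sketch of step 2: weighted grayscale conversion followed by Canny edges.
    b, g, r = cv2.split(bgr.astype(np.float32))
    gray = (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)   # step 2.1
    high = float(np.percentile(gray, 70))  # assumed reading of the high threshold
    low = high / 2.0                       # assumed reading of the low threshold
    return cv2.Canny(gray, low, high)      # 0/255 edge feature map

# usage: edges = canny_edges(cv2.imread("person.jpg"))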
step 3, building a convolutional neural network framework suitable for human body contour extraction;
step 3.1, modifying the VGG16 network structure on the basis of the VGG16 network model, reducing the 5 convolutional layers of the VGG16 network model to 4, and connecting a pooling layer behind each convolutional layer, using max pooling with a pooling window size of 2 × 2 and a stride of 2;
let P be an unknown pixel point and, as shown in FIG. 1, Q_11, Q_12, Q_21, Q_22 be four surrounding points whose pixel values are known; given the values of the known function f at Q_11 = (x_1, y_1), Q_12 = (x_1, y_2), Q_21 = (x_2, y_1) and Q_22 = (x_2, y_2), linear interpolation in the x direction gives the pixel values f(R_1) and f(R_2) of two intermediate points R_1, R_2, with which the pooled layers are filled back to the size of the original image:
f(R_1) ≈ ((x_2 − x)/(x_2 − x_1))·f(Q_11) + ((x − x_1)/(x_2 − x_1))·f(Q_21)
f(R_2) ≈ ((x_2 − x)/(x_2 − x_1))·f(Q_12) + ((x − x_1)/(x_2 − x_1))·f(Q_22)
wherein R_1 = (x, y_1), R_2 = (x, y_2);
then linear interpolation of R_1, R_2 in the y direction gives the pixel value f(P) of the unknown point P:
f(P) ≈ ((y_2 − y)/(y_2 − y_1))·f(R_1) + ((y − y_1)/(y_2 − y_1))·f(R_2)
the resulting pixel value f(x, y) of the unknown point P is:
f(x, y) ≈ [f(Q_11)(x_2 − x)(y_2 − y) + f(Q_21)(x − x_1)(y_2 − y) + f(Q_12)(x_2 − x)(y − y_1) + f(Q_22)(x − x_1)(y − y_1)] / ((x_2 − x_1)(y_2 − y_1));
thereby generating a deconvolution structure as shown in fig. 2;
The deconvolution process can combine the outputs of multiple stages of the neural network to strengthen the result. It is realized by bilinear interpolation: the pixel value of an intermediate point is obtained from the pixel values of the four surrounding points, so that the pooled layers can be filled back to the size of the original image;
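A minimal sketch of this bilinear filling, assuming the pooled feature map is simply resized back to the target size; in PyTorch, F.interpolate with mode="bilinear" applies the four-neighbour interpolation of the formulas above over the whole map:

import torch
import torch.nn.functional as F

def bilinear_upsample(feat, out_hw):
    # feat: tensor of shape (N, C, h, w); out_hw: target (H, W).
    # Each unknown pixel is interpolated from its four known neighbours.
    return F.interpolate(feat, size=out_hw, mode="bilinear", align_corners=False)

# usage (assumed sizes): bilinear_upsample(torch.randn(1, 64, 28, 28), (56, 56))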
step 3.2, introducing a network in the network on the basis of the step 3.1, namely replacing each original convolutional layer structure with an MLP convolutional layer structure, and the specific process is as follows:
a 1 × 1 convolution kernel is added between every two 3 × 3 convolution kernels, i.e. a convolutional layer with a 1 × 1 kernel follows each original convolutional layer, and according to the principle of the MLP convolutional layer the feature map of each convolutional layer up to the last convolutional layer is calculated by the following formula (13):
f^t_{x,y,k_t} = f(w^t_{k_t} · a_{x,y} + b^t_{k_t})   (13)
in the formula (13), (x, y) is the pixel index of the feature map, i.e. the x and y coordinate axes, a_{x,y} is the input block centered at (x, y), k_t is the index of the feature map, t is the number of the MLP layer, f is the activation function, w is the weight coefficient, and b is the bias;
then, activating and outputting a feature map of the current layer through the ReLU;
step 3.3, adding a dropout layer between the convolution layer processed in the step 3.2 and the deconvolution layer to prevent network overfitting, so as to form a symmetrical convolution neural network, wherein the specific process of preventing network overfitting is as follows: randomly discarding part of parameters in the iteration process of the VGG16 network model in the network, and setting the randomly discarded part of parameters as 0;
the neural network structure is as follows:
the first layer is an input layer, and the input size of this layer is 224 × 224 × 12;
the second layer is a convolutional layer, and the input size of this layer is 112 × 112 × 8;
the third layer is a convolutional layer, and the input size of this layer is 56 × 56 × 16;
the fourth layer is a convolutional layer, and the input size of this layer is 28 × 28 × 32;
the fifth layer is a dropout layer, and the input size of this layer is 28 × 28 × 64;
the sixth layer is a deconvolution layer, and the output size of this layer is 56 × 56 × 32;
the seventh layer is a deconvolution layer, and the output size of this layer is 112 × 112 × 16;
the eighth layer is a deconvolution layer, and the output size of this layer is 224 × 224 × 8;
the ninth layer is an output layer with an output size of 224 × 224 × 1;
each convolution layer has three convolution sublayers, and a 1 × 1 convolution kernel is arranged between every two 3 × 3 convolution kernels;
the feature extraction reduces the size of the input image from 224 × 224 to 28 × 28, and the deconvolution layers then restore the size to generate the feature map of the human body contour of the image;
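A non-authoritative PyTorch sketch of this symmetric structure is given below. The channel and spatial plan (12 → 8 → 16 → 32 → 64 → 32 → 16 → 8 → 1, 224 → 112 → 56 → 28 → 224) follows the layer list above, while the padding, the dropout rate and the use of ConvTranspose2d for the deconvolution layers are assumptions; the text only prescribes 3 × 3 convolutions interleaved with 1 × 1 convolutions, 2 × 2 max pooling with stride 2, and bilinear-style upsampling:

import torch
import torch.nn as nn

def mlp_conv(c_in, c_out):
    # One "MLP convolution" block: 3x3 conv, 1x1 conv, 3x3 conv, each with ReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class ContourNet(nn.Module):
    # Sketch of the 9-layer symmetric network; kernel sizes, padding, dropout
    # rate and ConvTranspose2d upsampling are assumptions, the channel/size
    # plan is taken from the layer list in the text.
    def __init__(self, p_drop=0.5):
        super().__init__()
        self.pool = nn.MaxPool2d(2, stride=2)
        self.enc1 = mlp_conv(12, 8)     # 224 -> 112 after pooling
        self.enc2 = mlp_conv(8, 16)     # 112 -> 56
        self.enc3 = mlp_conv(16, 32)    # 56 -> 28
        self.enc4 = mlp_conv(32, 64)    # stays 28 x 28
        self.drop = nn.Dropout2d(p_drop)
        self.dec1 = nn.ConvTranspose2d(64, 32, 2, stride=2)  # 28 -> 56
        self.dec2 = nn.ConvTranspose2d(32, 16, 2, stride=2)  # 56 -> 112
        self.dec3 = nn.ConvTranspose2d(16, 8, 2, stride=2)   # 112 -> 224
        self.out = nn.Conv2d(8, 1, 1)                        # 224 x 224 x 1 output

    def forward(self, x):
        x = self.pool(self.enc1(x))
        x = self.pool(self.enc2(x))
        x = self.pool(self.enc3(x))
        x = self.drop(self.enc4(x))
        x = torch.relu(self.dec1(x))
        x = torch.relu(self.dec2(x))
        x = torch.relu(self.dec3(x))
        return torch.sigmoid(self.out(x))

# usage: ContourNet()(torch.randn(1, 12, 224, 224)).shape  -> (1, 1, 224, 224)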
step 4, transmitting the original image, the Gabor texture feature map extracted in the step 1 and the Canny edge feature map extracted in the step 2 into the convolutional neural network constructed in the step 3 together for training to generate a CNN character model;
step 4.1, combining the Gabor characteristics in 8 directions obtained in the step 1, the Canny edge characteristics obtained in the step 2 and the RGB three-channel characteristics of the original image into 12 characteristic channels;
step 4.2, adjusting the image size of the 12 characteristic channels obtained in step 4.1 to 224 x 224, then transmitting the 12 characteristic channels into the convolutional neural network constructed in step 3, and training by taking the ground-truth label map of the original image (the pixels of the human body area are 1, and the remaining pixels are 0) as the teacher signal of the CNN to generate the CNN character model; the training process comprises the forward calculation of the neural network model and the backward error transfer calculation of the neural network model, and the forward calculation and the backward error calculation are processed iteratively, with 800 iterations;
the forward calculation, the reverse calculation and the iteration process are all represented by pseudo codes, each iteration is a process of realizing one forward calculation and one error back propagation, and the core pseudo codes in the neural network training process are as follows:
Step 1: initModel(model);          // initialize the neural network model
Step 2: for iter <- 1 to N do
    Step 2.1: forward(model);      // neural network model forward computation
    Step 2.2: backward(model);     // neural network model backward error transfer calculation
    Step 2.3: update(model);       // update the neural network weights
Step 3: return trained model.      // return the trained model
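Under the assumptions of the network sketch above, the loop summarized by this pseudocode could be written in PyTorch as follows; the binary cross-entropy loss, the SGD optimizer and the learning rate are illustrative choices not specified in the text, while the 800 iterations and the 0/1 ground-truth mask as teacher signal come from step 4.2:

import torch
import torch.nn as nn

def train_contour_net(model, loader, iterations=800, lr=1e-3, device="cpu"):
    # Sketch of the forward / backward / update loop of step 4.
    # loader is assumed to yield (inputs, masks) pairs: inputs of shape
    # (N, 12, 224, 224) stacking RGB + Gabor + Canny channels, masks of shape
    # (N, 1, 224, 224) with body pixels 1 and the rest 0.
    model = model.to(device).train()
    criterion = nn.BCELoss()  # heat map vs. 0/1 teacher signal (assumed loss)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    it = 0
    while it < iterations:
        for inputs, masks in loader:
            if it >= iterations:
                break
            inputs, masks = inputs.to(device), masks.to(device)
            optimizer.zero_grad()
            pred = model(inputs)     # forward computation
            loss = criterion(pred, masks)
            loss.backward()          # backward error transfer
            optimizer.step()         # weight update
            it += 1
    return model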
Step 5, testing the structure of the trained CNN character model to obtain a human body contour image;
step 5.1, respectively calculating, for the image to be detected, the Gabor texture features as extracted in step 1 and the Canny edge features as extracted in step 2;
step 5.2, inputting RGB three-channel characteristics, Gabor texture characteristics and Canny edge characteristics of the image to be detected into the CNN model trained in the step 4 to obtain a human body contour heat map;
step 5.3, performing opening and closing operation on the human body contour heat map obtained in the step 5.2, and performing smoothing treatment on the image through a Gaussian low-pass filter to obtain a human body mask;
step 5.4, performing an and operation on the original image and the human body mask obtained in the step 5.3, namely performing an and operation on each pixel point of the original image and a corresponding pixel point of the generated human body mask (a human body contour region is 1, and a non-human body region is 0), setting the corresponding position in the original image to be 0 if the pixel of the corresponding position of the mask is 0, and taking the original pixel from the corresponding position in the original image to obtain a human body contour image if the pixel of the mask is 1;
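A hedged OpenCV sketch of steps 5.3 and 5.4: the heat-map threshold, the structuring-element size and the Gaussian kernel size are assumptions, since the text only specifies opening and closing operations, Gaussian low-pass smoothing, and a per-pixel AND with the original image:

import cv2
import numpy as np

def contour_from_heatmap(original_bgr, heatmap, thresh=0.5):
    # Sketch of steps 5.3-5.4: heat map -> binary human mask -> masked image.
    # heatmap is the (H, W) network output in [0, 1]; the 0.5 threshold,
    # 5x5 structuring element and 5x5 Gaussian kernel are assumptions.
    mask = (heatmap > thresh).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # remove small noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small holes
    smooth = cv2.GaussianBlur(mask.astype(np.float32), (5, 5), 0)
    body_mask = (smooth > 0.5).astype(np.uint8)              # final 0/1 human mask
    return original_bgr * body_mask[:, :, None]               # per-pixel AND with the image

# usage: contour_img = contour_from_heatmap(img, heatmap)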
step 6, recording the overlapping rate and the time consumption of the human body contour image through the testing process of the step 5, and evaluating the human body contour image;
wherein, the overlapping rate is:
S = (A_P ∩ A_GT) / (A_P ∪ A_GT)
examples
The data set source is the Baidu human body image segmentation database, whose data are images containing human bodies shot from various angles. The database contains 5387 training images and the corresponding labeled samples. The invention selects 1000 images as the training set and 500 of the remaining images as the test set. The network input image size was fixed at 224 x 224 in the experiments. In order to evaluate the effect of the method accurately and objectively, and to facilitate comparison with existing methods, the performance of the human contour extraction model of the improved method is measured by the overlap rate, which is defined as follows:
S = (A_P ∩ A_GT) / (A_P ∪ A_GT)
wherein S is the degree of overlap, A_P is the body region predicted by the human contour extraction network, and A_GT is the actual body region. The higher S is, the higher the degree of overlap and the better the human body contour extraction effect.
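The overlap rate above is the standard intersection-over-union of the predicted and actual body regions; on binary masks it can be computed as follows (a small sketch, assuming equal-sized 0/1 arrays):

import numpy as np

def overlap_rate(pred_mask, gt_mask):
    # S = |A_P intersect A_GT| / |A_P union A_GT| for two 0/1 masks of equal shape.
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: define perfect overlap
    return np.logical_and(pred, gt).sum() / union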
The differences of the five methods in terms of input image size, overlapping rate, time consumption and display card are shown in table 1;
TABLE 1 comparison of five human body segmentation methods
(Table 1 is reproduced as an image in the original publication; it compares the input image size, overlap rate, time consumption and graphics card of the five human body segmentation methods.)
In the training process of the neural network, the GPU can be 100 or even 1000 times faster than the CPU: training on a mid-range GTX750 graphics card takes about several days, while running on a high-end i7 CPU would take at least a month. The testing process is different: in the testing stage the CPU speed is still acceptable, at roughly 10 s per image, while the GPU reaches millisecond-level testing speed. As can be seen from Table 1, the method provided by the invention achieves an overlap rate of 92.03% with a single-image test time of 68.84 ms. The Pixel-by-Pixel method reaches an overlap rate of 86.83%, but its test time is too long and it has no real-time capability. The Alex-seg-net method consumes the least time per image, but its overlap rate only reaches 80.2%, which is not ideal. The proposed method is higher than the Pool5-net method in overlap rate but not as good in test time; however, the GTX960 graphics card used by Pool5-net is a mid-to-high-end card whose performance is clearly better than the GTX750 used in this experiment. The method of the invention adds Canny edge features on the basis of Gabor-CNN, which both improves the detection rate and shortens the test time; it can therefore achieve good results even with limited hardware. In summary, the method provided by the invention combines traditional features with deep learning, achieving higher accuracy with a short test time, and can meet the requirements of some practical applications.

Claims (4)

1. A human body contour extraction method based on deep learning is characterized by comprising the following steps:
step 1, extracting Gabor texture features of an original image;
step 2, extracting Canny edge characteristics of the original image;
step 3, building a convolutional neural network framework suitable for human body contour extraction:
step 3.1, modifying the VGG16 network structure on the basis of the VGG16 network model, reducing the 5 convolutional layers of the VGG16 network model to 4, and connecting a pooling layer behind each convolutional layer, using max pooling with a pooling window size of 2 × 2 and a stride of 2;
let P be an unknown pixel point and Q_11, Q_12, Q_21, Q_22 be four surrounding points whose pixel values are known; given the values of the known function f at Q_11 = (x_1, y_1), Q_12 = (x_1, y_2), Q_21 = (x_2, y_1) and Q_22 = (x_2, y_2), linear interpolation in the x direction gives the pixel values f(R_1) and f(R_2) of two intermediate points R_1, R_2, with which the pooled layers are filled back to the size of the original image:
f(R_1) ≈ ((x_2 − x)/(x_2 − x_1))·f(Q_11) + ((x − x_1)/(x_2 − x_1))·f(Q_21)
f(R_2) ≈ ((x_2 − x)/(x_2 − x_1))·f(Q_12) + ((x − x_1)/(x_2 − x_1))·f(Q_22)
wherein R_1 = (x, y_1), R_2 = (x, y_2);
then linear interpolation of R_1, R_2 in the y direction gives the pixel value f(P) of the unknown point P:
f(P) ≈ ((y_2 − y)/(y_2 − y_1))·f(R_1) + ((y − y_1)/(y_2 − y_1))·f(R_2)
the resulting pixel value f(x, y) of the unknown point P is:
f(x, y) ≈ [f(Q_11)(x_2 − x)(y_2 − y) + f(Q_21)(x − x_1)(y_2 − y) + f(Q_12)(x_2 − x)(y − y_1) + f(Q_22)(x − x_1)(y − y_1)] / ((x_2 − x_1)(y_2 − y_1));
further generating a deconvolution structure;
step 3.2, introducing a network in the network on the basis of the step 3.1, namely replacing each original convolutional layer structure with an MLP convolutional layer structure, specifically:
after an original convolutional layer, a plurality of convolutional layers with convolution kernel of 1 × 1 are connected, and the feature map of each convolutional layer to the last convolutional layer is calculated by the following formula (13):
f^t_{x,y,k_t} = f(w^t_{k_t} · a_{x,y} + b^t_{k_t})   (13)
in the formula (13), (x, y) is the pixel index of the feature map, a_{x,y} is the input block centered at (x, y), k_t is the index of the feature map, t is the number of the MLP layer, f is the activation function, w is the weight coefficient and b is the bias;
then, activating and outputting a feature map of the current layer through the ReLU;
3.3, adding a dropout layer between the convolution layer processed in the step 3.2 and the deconvolution layer to prevent over-fitting of the network, and forming a symmetrical convolution neural network;
the convolutional neural network is:
the first layer is an input layer, and the input size of this layer is 224 × 224 × 12;
the second layer is a convolutional layer, and the input size of this layer is 112 × 112 × 8;
the third layer is a convolutional layer, and the input size of this layer is 56 × 56 × 16;
the fourth layer is a convolutional layer, and the input size of this layer is 28 × 28 × 32;
the fifth layer is a dropout layer, and the input size of this layer is 28 × 28 × 64;
the sixth layer is a deconvolution layer, and the output size of this layer is 56 × 56 × 32;
the seventh layer is a deconvolution layer, and the output size of this layer is 112 × 112 × 16;
the eighth layer is a deconvolution layer, and the output size of this layer is 224 × 224 × 8;
the ninth layer is an output layer with an output size of 224 × 224 × 1;
each convolution layer has three convolution sublayers, and a 1 × 1 convolution kernel is arranged between every two 3 × 3 convolution kernels;
step 4, transmitting the original image, the Gabor texture feature map extracted in the step 1 and the Canny edge feature map extracted in the step 2 into the convolutional neural network constructed in the step 3 together for training to generate a CNN character model;
step 4.1, combining the Gabor characteristics in 8 directions obtained in the step 1, the Canny edge characteristics obtained in the step 2 and the RGB three-channel characteristics of the original image into 12 characteristic channels;
step 4.2, adjusting the image size of the 12 characteristic channels obtained in step 4.1 to 224 x 224, then transmitting the 12 characteristic channels into the convolutional neural network constructed in step 3, and training by taking the ground-truth label map of the original image as the teacher signal of the CNN to generate the CNN character model;
step 5, testing the structure of the trained CNN character model to obtain a human body contour image;
and 6, recording the overlapping rate and the time consumption of the human body contour image through the testing process of the step 5, and evaluating the human body contour image.
2. The method for extracting the human body contour based on the deep learning as claimed in claim 1, wherein the step 1 is implemented according to the following steps:
step 1.1, obtaining a two-dimensional Gabor filter according to a formula (1):
Ψ_{u,v}(z) = (‖k_{u,v}‖² / σ²) · exp(−‖k_{u,v}‖² ‖z‖² / (2σ²)) · [exp(i k_{u,v}·z) − exp(−σ²/2)]   (1)
in the formula (1), Ψ_{u,v} is the two-dimensional Gabor filter, u and v are respectively the orientation and scale of the Gabor kernel, where u is 0, π/4, π/2, 3π/4, π, 5π/4, 6π/4 or 7π/4, k_{u,v} is used for controlling the width of the Gaussian window, z = (x, y) is the spatial position coordinate, σ = 2π is the ratio of the Gaussian window width to the wavelength, and i is the imaginary unit;
wherein k_{u,v}, which determines the direction and wavelength of the oscillating part, is:
k_{u,v} = k_v · e^{i·φ_u}   (2)
in the formula (2), k_v = k_max/F^v is the sampling frequency of the filter, k_max = π/2 is the maximum sampling frequency, F = √2 is the spacing factor used to limit the sampling frequency of the filter, and φ_u = u is the directional selectivity of the filter;
step 1.2, performing a convolution operation between the original image I(x, y) and the two-dimensional Gabor filter obtained in step 1.1 to extract the Gabor feature G_{u,v}(x, y) of the original image at position (x, y), obtaining the Gabor texture features in 8 directions:
G_{u,v}(x, y) = I(x, y) * Ψ_{u,v}   (3).
3. the method for extracting the human body contour based on the deep learning as claimed in claim 1, wherein the step 2 is implemented according to the following steps:
step 2.1, setting weight parameters for RGB three channels to complete gray processing of the original image, wherein the RGB three channel parameter setting expression is as follows:
Gray=R*0.299+G*0.587+B*0.114
step 2.2, processing the gray level image processed in the step 2.1 by using a first derivative of a two-dimensional Gaussian function, wherein the expression of the two-dimensional Gaussian function is as follows:
G(x, y) = (1 / (2πδ²)) · exp(−(x² + y²) / (2δ²))   (4)
in the formula (4), δ is a smoothing parameter, and the larger δ is, the more remarkable the smoothing effect is;
and smoothing the 3 × 3 area of the original image by using a 3 × 3 Gaussian convolution kernel, wherein the Gaussian convolution kernel is as follows:
K = (1/16) · [1 2 1; 2 4 2; 1 2 1]
step 2.3, searching for the positions of strongest gray-intensity change in the gray image processed in step 2.2, and calculating the first derivatives Z_x and Z_y of the gray image in the horizontal direction x and the vertical direction y with the Sobel operator, obtaining the boundary gradient magnitude |Z| and direction β:
|Z| = sqrt(Z_x² + Z_y²)
β = arctan(Z_y / Z_x)
wherein the Sobel operators in the abscissa x and ordinate y directions are:
S_x = [−1 0 1; −2 0 2; −1 0 1],  S_y = [−1 −2 −1; 0 0 0; 1 2 1]
step 2.4, dividing the boundary gradient magnitude |Z| obtained in step 2.3 equally into four gradient regions, each corresponding to one quadrant of the coordinate axes; then traversing all the points of each region one by one along the gradient direction β of each point, and comparing the gradient magnitude |Z| of each point with that of its two neighbours: if the point is larger than both the preceding and the following point it is kept, and if it is smaller than them it is set to zero; this performs non-maximum suppression on the gray image processed in step 2.3, thinning the edges and eliminating non-edge noise points;
step 2.5, setting the high threshold at 70% of the overall gray-level distribution of the gray image processed in step 2.4, and the low threshold at 1/2 of the high threshold; if the gray level of a point processed in step 2.4 is greater than the high threshold, its pixel value is set to 255; if it is less than the low threshold, its pixel value is set to 0; if it lies between the high threshold and the low threshold, the 8 neighbouring pixel values are examined: if no point with the value 255 exists among the 8 neighbours, the pixel value of the point is set to 0, and if a point with the value 255 exists in the neighbouring gradient region, the pixel value of the point is set to 255; the edge feature extraction is completed when all points have been processed.
4. The method for extracting the human body contour based on the deep learning as claimed in claim 1, wherein the step 5 is implemented according to the following steps:
step 5.1, respectively calculating the Gabor texture features extracted in the step 1 and the Canny edge features extracted in the step 2 for the image to be detected;
step 5.2, inputting RGB three-channel characteristics, Gabor texture characteristics and Canny edge characteristics of the image to be detected into the CNN model trained in the step 4 to obtain a human body contour heat map;
step 5.3, performing opening and closing operation on the human body contour heat map obtained in the step 5.2, and performing smoothing treatment on the image through a Gaussian low-pass filter to obtain a human body mask;
and 5.4, performing an AND operation on the original image and the human body mask obtained in the step 5.3, namely performing an AND operation on each pixel point of the original image and a corresponding pixel point of the generated human body mask, setting the corresponding position in the original image as 0 if the pixel at the corresponding position of the mask is 0, and taking the original pixel at the corresponding position in the original image to obtain a human body contour image if the pixel in the mask is 1.
CN201810582283.3A 2018-06-07 2018-06-07 Human body contour extraction method based on deep learning Expired - Fee Related CN109033945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810582283.3A CN109033945B (en) 2018-06-07 2018-06-07 Human body contour extraction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810582283.3A CN109033945B (en) 2018-06-07 2018-06-07 Human body contour extraction method based on deep learning

Publications (2)

Publication Number Publication Date
CN109033945A CN109033945A (en) 2018-12-18
CN109033945B true CN109033945B (en) 2021-04-06

Family

ID=64612339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810582283.3A Expired - Fee Related CN109033945B (en) 2018-06-07 2018-06-07 Human body contour extraction method based on deep learning

Country Status (1)

Country Link
CN (1) CN109033945B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109872326B (en) * 2019-01-25 2022-04-05 广西科技大学 Contour detection method based on deep reinforced network jump connection
CN109903301B (en) * 2019-01-28 2021-04-13 杭州电子科技大学 Image contour detection method based on multistage characteristic channel optimization coding
CN109920049B (en) * 2019-02-26 2021-05-04 清华大学 Edge information assisted fine three-dimensional face reconstruction method and system
CN113570052B (en) * 2020-04-28 2023-10-31 北京达佳互联信息技术有限公司 Image processing method, device, electronic equipment and storage medium
CN112102141B (en) * 2020-09-24 2022-04-08 腾讯科技(深圳)有限公司 Watermark detection method, watermark detection device, storage medium and electronic equipment
CN112258440B (en) * 2020-10-29 2024-01-02 北京达佳互联信息技术有限公司 Image processing method, device, electronic equipment and storage medium
CN112257729B (en) * 2020-11-13 2023-10-03 腾讯科技(深圳)有限公司 Image recognition method, device, equipment and storage medium
CN113128614B (en) * 2021-04-29 2023-06-16 西安微电子技术研究所 Convolution method based on image gradient, neural network based on direction convolution and classification method
CN113536968B (en) * 2021-06-25 2022-08-16 天津中科智能识别产业技术研究院有限公司 Method for automatically acquiring boundary coordinates of inner and outer circles of iris
CN113570627B (en) * 2021-07-02 2024-04-16 上海健康医学院 Training method of deep learning segmentation network and medical image segmentation method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105678792A (en) * 2016-02-25 2016-06-15 中南大学 Method and system for extracting body profile

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4699298B2 (en) * 2006-06-28 2011-06-08 富士フイルム株式会社 Human body region extraction method, apparatus, and program
CN103093474B (en) * 2013-01-28 2015-03-25 电子科技大学 Three-dimensional mammary gland ultrasound image partition method based on homoplasmon and partial energy
CN105335716B (en) * 2015-10-29 2019-03-26 北京工业大学 A kind of pedestrian detection method extracting union feature based on improvement UDN
CN106529447B (en) * 2016-11-03 2020-01-21 河北工业大学 Method for identifying face of thumbnail
CN106778481A (en) * 2016-11-15 2017-05-31 上海百芝龙网络科技有限公司 A kind of body heath's monitoring method
CN106781282A (en) * 2016-12-29 2017-05-31 天津中科智能识别产业技术研究院有限公司 A kind of intelligent travelling crane driver fatigue early warning system
CN107301408B (en) * 2017-07-17 2020-06-23 成都通甲优博科技有限责任公司 Human body mask extraction method and device
CN108009472B (en) * 2017-10-25 2020-07-21 五邑大学 Finger back joint print recognition method based on convolutional neural network and Bayes classifier

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105678792A (en) * 2016-02-25 2016-06-15 中南大学 Method and system for extracting body profile

Also Published As

Publication number Publication date
CN109033945A (en) 2018-12-18

Similar Documents

Publication Publication Date Title
CN109033945B (en) Human body contour extraction method based on deep learning
US11908244B2 (en) Human posture detection utilizing posture reference maps
CN108053417B (en) lung segmentation device of 3D U-Net network based on mixed rough segmentation characteristics
Wang et al. Haze concentration adaptive network for image dehazing
CN109035172B (en) Non-local mean ultrasonic image denoising method based on deep learning
CN106408001A (en) Rapid area-of-interest detection method based on depth kernelized hashing
CN112016682B (en) Video characterization learning and pre-training method and device, electronic equipment and storage medium
CN112446862A (en) Dynamic breast ultrasound video full-focus real-time detection and segmentation device and system based on artificial intelligence and image processing method
CN111310609B (en) Video target detection method based on time sequence information and local feature similarity
CN110060225B (en) Medical image fusion method based on rapid finite shear wave transformation and sparse representation
Pandey et al. Segmentation of liver lesions with reduced complexity deep models
Cheng et al. DDU-Net: A dual dense U-structure network for medical image segmentation
Zheng et al. Interactive multi-scale feature representation enhancement for small object detection
CN114049503A (en) Saliency region detection method based on non-end-to-end deep learning network
CN111401209B (en) Action recognition method based on deep learning
CN116884036A (en) Live pig posture detection method, device, equipment and medium based on YOLOv5DA
Yin et al. Super resolution reconstruction of CT images based on multi-scale attention mechanism
CN110706209B (en) Method for positioning tumor in brain magnetic resonance image of grid network
Song et al. Spatial-aware dynamic lightweight self-supervised monocular depth estimation
Jin et al. Deep neural network-based noisy pixel estimation for breast ultrasound segmentation
Wu et al. Two-Stage Progressive Underwater Image Enhancement
Kim et al. Tackling Structural Hallucination in Image Translation with Local Diffusion
Yin et al. Visual Attention and ODE-inspired Fusion Network for image dehazing
Yuan et al. Enhanced target tracking algorithm for autonomous driving based on visible and infrared image fusion
Asha et al. Segmentation of Brain Tumors using traditional Multiscale bilateral Convolutional Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210406