CN111209822A - Face detection method of thermal infrared image - Google Patents


Info

Publication number
CN111209822A
CN111209822A
Authority
CN
China
Prior art keywords
frame
thermal infrared
prediction
neural network
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911394420.1A
Other languages
Chinese (zh)
Inventor
张天序
郭诗嘉
李正涛
苏轩
郭婷
Current Assignee
Nanjing Huatu Information Technology Co ltd
Original Assignee
Nanjing Huatu Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Huatu Information Technology Co ltd filed Critical Nanjing Huatu Information Technology Co ltd
Priority to CN201911394420.1A
Publication of CN111209822A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a face detection method for thermal infrared images, comprising the following steps: (1) acquiring positive samples, negative samples and a test set for training, and framing a face box as a calibration frame on each thermal infrared image of the positive samples; (2) generating training labels; (3) building a convolutional neural network, inputting the training set and the training labels into it for training, and optimizing it with a loss function to obtain the required trained model; (4) inputting a thermal infrared image from the test set and obtaining a face detection frame through the convolutional neural network. By training a convolutional neural network on thermal infrared images, the invention obtains a network that meets the requirements, achieving automatic detection in thermal infrared imagery, accurately framing the face region and reducing the detection error rate.

Description

Face detection method of thermal infrared image
Technical Field
The invention belongs to the technical field of biological feature recognition, and particularly relates to a face detection method.
Background
Face detection obtains the specific positions of all faces in a picture, usually represented by rectangular frames: the object inside a rectangular frame is a face, and the part outside it is background.
Visible-light face detection is widely applied at customs, in stations, and in attendance checking, automatic driving, suspect tracking and other fields. However, visible-light face detection cannot work without an external light source and cannot detect a face wearing a mask. Visible light also cannot be used for liveness detection: since the imaging does not establish that the subject is a real person, such methods are easily deceived by photographs and made-up faces, so the detection results are inaccurate and of limited use.
A thermal infrared image is formed by thermal radiation: based on differences in the infrared radiation of objects, an infrared thermal imager converts the infrared radiation naturally emitted from object surfaces into a visible image. Because different objects, or different parts of the same object, usually differ in thermal radiation characteristics such as temperature and emissivity, objects are distinguished in a thermal infrared image by their differing thermal radiation. Thermal infrared imaging can therefore easily solve liveness detection: the face is a high-temperature object compared with most other objects and appears white in the grayscale image, and because the capillary distribution differs across the facial organs, their thermal radiation differs and the facial features can be made out.
Active near-infrared face recognition is currently on the rise, but this technology requires an active light source and is limited to a distance of 50-100 cm. The active light source also produces obvious reflections on glasses, reducing eye-localization accuracy, and it degrades and attenuates after long use. At present there is no face detection method for thermal infrared images in China.
Disclosure of Invention
In view of the above defects or improvement requirements of the prior art, the present invention provides a face detection method for thermal infrared images, which can clearly frame the face position in the thermal infrared images without any light source, so as to meet the detection requirements for the thermal infrared images.
In order to achieve the above object, according to one aspect of the present invention, there is provided a face detection method for thermal infrared images, comprising the following steps:
(1) taking N thermal infrared images as positive samples and L thermal infrared images that show no face as negative samples to form a training set, and obtaining M thermal infrared images as a test set; framing a face box on each thermal infrared image of the positive samples as a calibration frame; the mark of each thermal infrared image in the positive samples is 1, and the mark of each thermal infrared image in the negative samples is 0;
(2) scaling down the coordinates of the center point of the calibration frame and its width and height values for each thermal infrared image, and storing the scaled center coordinates, the scaled width and height, and the mark of the image together in a separate txt file, giving N txt files in total;
in addition, storing the path of each thermal infrared image in the training set and the marks of all thermal infrared images in the negative samples in another txt file;
in this way, N+1 txt files are obtained as training labels;
(3) building a convolutional neural network, inputting the training set and training labels into it for training, and optimizing it with a loss function to obtain the required trained model of the convolutional neural network;
(4) inputting a thermal infrared image from the test set and obtaining a face detection frame through the convolutional neural network.
Preferably, in step (1), the thermal infrared images are collected with a thermal infrared imager under the following conditions: for each person, videos of the face are recorded with a medium-wave thermal infrared imager at several distances for several set durations; the videos are cut every set number of frames, a set number of photos is selected, and N thermal infrared images are then chosen as the training set.
Preferably, the training labels generated in step (2) are specifically as follows:
(2.1) storing the relative coordinates of the center point of the calibration frame:
centre_x = (x1 + x2) / (2w)
centre_y = (y1 + y2) / (2h)

wherein (x1, y1) and (x2, y2) are the coordinates of two diagonally opposite corners of the calibration frame, which together determine it; x1 and x2 are width (x) coordinates and y1 and y2 are height (y) coordinates in the x-y image coordinate system, with x1 > x2 and y1 > y2; centre_x and centre_y are the width and height coordinates of the center point of the calibration frame in the x-y image coordinate system; w is the width and h the height of the thermal infrared image containing the calibration frame;
(2.2) storing the size of the calibration frame relative to the thermal infrared image containing it:

frame_x = (x1 − x2) / w
frame_y = (y1 − y2) / h

wherein frame_x is the relative width and frame_y the relative height of the calibration frame.
The centre_x, centre_y, frame_x and frame_y values are stored in the same txt file as the mark of the corresponding thermal infrared image in the positive samples; the marks and centre_x, centre_y, frame_x, frame_y values of different positive-sample images are stored in different txt files.
Preferably, the convolutional neural network adopts a Darknet framework and a Yolo network, the Darknet framework is used for performing convolution, maximum pooling and normalization operations on the input thermal infrared image so as to obtain the weight of the convolutional neural network, and the Yolo network is used for processing the weight of the convolutional neural network so as to perform face determination and position regression.
Preferably, the size relationship between the calibration box and the prediction box constructed by the convolutional neural network is as follows:
a_x = d_x + Δ(m_x)
a_y = d_y + Δ(m_y)
a_w = p_w · e^(m_w)
a_h = p_h · e^(m_h)

wherein a_x and a_y are the width and height coordinates of the center of the calibration frame in the u-v image coordinate system, a_w and a_h are the width and height of the calibration frame, Δ(m_x) and Δ(m_y) are the width-direction and height-direction offsets from the center of the prediction box to the center of the calibration frame, d_x and d_y are the width and height coordinates of the center of the prediction box, p_w and p_h are the width and height of the prediction box, m_w and m_h are its width and height scaling ratios, and the Δ function is a sigmoid function.
Preferably, the convolutional neural network constructs six prediction boxes divided between two scales. Sorted by height from largest to smallest, they are prediction boxes I, II, III, IV, V and VI; the first scale is assigned prediction boxes I, III and IV, and the second scale prediction boxes II, IV and VI.
Preferably, in step (3), the loss function used to optimize the convolutional neural network is as follows:

loss = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
     + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
     − Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ĉ_i ln c_i + (1 − ĉ_i) ln(1 − c_i)]
     − λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} [ĉ_i ln c_i + (1 − ĉ_i) ln(1 − c_i)]
     + Σ_{i=0}^{S²} 1_i^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²

where loss is the loss, S² is the number of grid cells of the convolutional neural network and B is the number of prediction boxes per cell; 1_{ij}^{obj} indicates whether the jth anchor box of the ith grid cell is responsible for a target (1 when responsible, 0 when not); 1_{ij}^{noobj} indicates that the jth prediction box of the ith grid cell is not responsible for a target (1 when there is no target, 0 when there is); λ_coord = 5 and λ_noobj = 0.5; x_i and y_i are the width and height coordinates of the center point of the ith prediction box, and x̂_i and ŷ_i those of the ith calibration frame; w_i and h_i are the width and height of the ith prediction box, and ŵ_i and ĥ_i those of the ith calibration frame; c_i is the confidence of the ith prediction box (1 if selected, 0 if not) and ĉ_i the confidence of the ith calibration frame (1 if selected, 0 if not); p_i is the classification probability of a face in the ith prediction box and p̂_i that in the ith calibration frame; c is the face/no-face class and classes is the set of those classes;
after the loss is obtained, the parameters are updated by stochastic gradient descent: the convolutional neural network continually selects the best parameters for the current objective and updates its parameters according to the loss, stopping the updates once the network reaches the required index.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
1) By training a convolutional neural network on thermal infrared images, the invention obtains a network that meets the requirements, achieving automatic detection in thermal infrared imagery, accurately framing the face region and reducing the error rate of face detection.
2) The invention performs face detection with thermal infrared technology and can clearly frame the face position in a thermal infrared image without any light source, meeting the detection requirements for thermal infrared images.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic flow chart of the present invention for obtaining training labels;
FIG. 3 is a flow chart of the loss calculation of the convolutional neural network in the present invention;
FIG. 4 is a thermal infrared image to be detected;
FIG. 5 is a schematic illustration of the thermal infrared image of FIG. 4 after detection;
FIG. 6 is a schematic diagram of three prediction boxes in a first scale;
FIG. 7 is a schematic diagram of three prediction boxes at a second scale;
fig. 8 is a schematic diagram of detection of two faces.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Referring to the attached drawings, the method for detecting the human face of the thermal infrared image comprises the following steps:
(1) Take N thermal infrared images as positive samples and L thermal infrared images showing no face as negative samples to form a training set, and obtain M thermal infrared images as a test set.
To guarantee a sufficient number of thermal infrared images, sufficient experimental data must be collected. Specifically, a medium-wave thermal infrared imager (model TAURUS-110kM, IRCAM, Germany) can be used, with the following test conditions: faces are recorded at distances of 2 m, 3 m and 5 m from the camera, a video of a set duration is recorded for each person, and after each video is cut every set number of frames a set number of photos is selected. For example, 200 people can be filmed, keeping one frame out of every 50; the data include different postures, different scene backgrounds and scenes with an external light source, and extensive experiments guarantee the accuracy of the subsequent use of the face detection model. The thermal infrared images extracted from the videos are then screened and images that do not meet the training requirements are removed: filtering useless data out of the training set prevents the deep network from learning from it and corrupting the real parameters. For example, blurred images, which easily appear during posture changes, are generally removed when the pictures are cut. In this way 140,000 thermal infrared images can be obtained as the training set and M = 60,000 thermal infrared images as the test set; within the training set, N = 35,000 thermal infrared images are selected as positive samples and L = 105,000 as negative samples. The thermal infrared images in the positive samples show faces on which face boxes can be framed; the images in the negative samples show no face, e.g. only devices, clothing, walls, etc.
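As an illustrative sketch (not part of the patent), the frame-sampling and sample-split arithmetic described above can be checked in a few lines of Python; the function name, the clip length and the step value are assumptions chosen for illustration:

```python
def sampled_frame_indices(n_frames: int, step: int = 50) -> list:
    """Indices of the frames kept when one frame out of every `step` is cut from a video."""
    return list(range(0, n_frames, step))

# A 10-second clip at 25 fps yields 250 frames, of which 5 are kept at step=50.
kept = sampled_frame_indices(250, step=50)

# The sample split used above: positives plus negatives must equal the training set.
N_POS, L_NEG, TRAIN_TOTAL = 35_000, 105_000, 140_000
assert N_POS + L_NEG == TRAIN_TOTAL
```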
Then a face box is framed on each thermal infrared image of the positive samples as a calibration frame; the mark of each thermal infrared image in the positive samples is 1 and the mark of each thermal infrared image in the negative samples is 0.
(2) For each thermal infrared image, the coordinates of the center point of the calibration frame and its width and height are scaled down, and the scaled center coordinates, the scaled width and height, and the mark of the image are stored together in a separate txt file, giving N txt files in total;
in addition, the path of each thermal infrared image in the training set and the marks of all thermal infrared images in the negative samples are stored in another txt file;
in this way, a total of N+1 txt files are obtained as training labels, as follows:
(2.1) storing the relative coordinates of the center point of the calibration frame:
centre_x = (x1 + x2) / (2w)
centre_y = (y1 + y2) / (2h)

wherein (x1, y1) and (x2, y2) are the coordinates of two diagonally opposite corners of the calibration frame, which together determine it; x1 and x2 are width (x) coordinates and y1 and y2 are height (y) coordinates in the x-y image coordinate system, with x1 > x2 and y1 > y2; centre_x and centre_y are the width and height coordinates of the center point of the calibration frame in the x-y image coordinate system; w is the width and h the height of the thermal infrared image containing the calibration frame;
(2.2) storing the size of the calibration frame relative to the thermal infrared image containing it:

frame_x = (x1 − x2) / w
frame_y = (y1 − y2) / h

wherein frame_x is the relative width and frame_y the relative height of the calibration frame.
The centre_x, centre_y, frame_x and frame_y values are stored in the same txt file as the mark of the corresponding thermal infrared image in the positive samples; the marks and centre_x, centre_y, frame_x, frame_y values of different positive-sample images are stored in different txt files.
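The label computation of (2.1) and (2.2) can be sketched as follows. This is a minimal illustration assuming the relative-coordinate formulas centre_x = (x1 + x2)/(2w), centre_y = (y1 + y2)/(2h), frame_x = (x1 − x2)/w and frame_y = (y1 − y2)/h; the function name and the example corner values are chosen for illustration only:

```python
def make_label(x1, y1, x2, y2, w, h):
    """Relative center and relative size of a calibration frame.

    (x1, y1) and (x2, y2) are diagonally opposite corners with x1 > x2 and
    y1 > y2; w and h are the width and height of the thermal infrared image.
    """
    centre_x = (x1 + x2) / (2 * w)
    centre_y = (y1 + y2) / (2 * h)
    frame_x = (x1 - x2) / w
    frame_y = (y1 - y2) / h
    return centre_x, centre_y, frame_x, frame_y

# One txt line per positive-sample image: the mark followed by the four values.
mark = 1
values = make_label(300, 260, 180, 120, 640, 480)
line = f"{mark} " + " ".join(f"{v:.6f}" for v in values)
```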
The invention only needs to store the relative coordinate of the central point of the calibration frame and the relative size of the calibration frame, thereby saving the acquisition time of a large number of parameters.
(3) Building a convolutional neural network, inputting a training set and a training label into the convolutional neural network together for training, and optimizing the convolutional neural network by using a loss function so as to obtain a required training model of the convolutional neural network;
the convolutional neural network adopts a Darknet framework, and the Darknet framework is used for performing convolution, maximum pooling and normalization operations on an input thermal infrared image so as to obtain the weight of the convolutional neural network, specifically, the Darknet framework trains a 53-layer network and provides a 106-layer fully-convolutional bottom layer framework. In the forward propagation process, the size of the tensor is transformed by changing the step size of the convolution kernel, such as stride (2, 2), which is equivalent to reducing the side length of the image by half (i.e. reducing the area to 1/4). In the network, 5 times of reduction is needed, 1/2 which reduces the characteristic diagram to the original input size5I.e., 1/32. The input is 416x416 and the output is 13x13(416/32 ═ 13). The backpone would narrow the output profile to 1/32 at the input.
The convolutional neural network also adopts a Yolo network, which processes the network's weights to perform face determination and position regression. By designing a Fast Anchor (fast prediction-box) algorithm, six prediction boxes are constructed and divided between two scales. Sorted by height from largest to smallest they are prediction boxes I, II, III, IV, V and VI; the first scale is assigned prediction boxes I, III and IV, and the second scale prediction boxes II, IV and VI.
The size relationship between the calibration box and the prediction box constructed by the convolutional neural network is as follows:
a_x = d_x + Δ(m_x)
a_y = d_y + Δ(m_y)
a_w = p_w · e^(m_w)
a_h = p_h · e^(m_h)

wherein a_x and a_y are the width and height coordinates of the center of the calibration frame in the u-v image coordinate system, a_w and a_h are the width and height of the calibration frame, Δ(m_x) and Δ(m_y) are the width-direction and height-direction offsets from the center of the prediction box to the center of the calibration frame, d_x and d_y are the width and height coordinates of the center of the prediction box, p_w and p_h are the width and height of the prediction box, and m_w and m_h are its width and height scaling ratios. The Δ function is a sigmoid function; scaling the predicted quantities to within 0-1 helps the network converge quickly. When detecting whether a face is present, the aspect ratio is close to 1:1, so prediction boxes with a large width-height disparity do not appear.
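The four box equations above can be sketched as a small decoding function. This is an illustrative sketch only; the function name is an assumption, and the equations are applied exactly as written (sigmoid offsets for the center, exponential scaling for the size):

```python
import math

def sigmoid(z: float) -> float:
    """The Δ (sigmoid) function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(dx, dy, pw, ph, mx, my, mw, mh):
    """Apply a_x = d_x + Δ(m_x), a_y = d_y + Δ(m_y), a_w = p_w·e^(m_w), a_h = p_h·e^(m_h)."""
    ax = dx + sigmoid(mx)
    ay = dy + sigmoid(my)
    aw = pw * math.exp(mw)
    ah = ph * math.exp(mh)
    return ax, ay, aw, ah

# With zero raw outputs the center shifts by sigmoid(0) = 0.5 and the size is unchanged.
box = decode_box(3.0, 4.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0)  # (3.5, 4.5, 2.0, 2.0)
```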
The convolutional neural network is optimized with the following loss function:

loss = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
     + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
     − Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ĉ_i ln c_i + (1 − ĉ_i) ln(1 − c_i)]
     − λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} [ĉ_i ln c_i + (1 − ĉ_i) ln(1 − c_i)]
     + Σ_{i=0}^{S²} 1_i^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²

In the above formula, the loss for w and h uses the total variance of the square roots, and the confidence loss uses binary cross entropy: row 1 is the sum of squared errors used as the position-prediction loss, row 2 uses the root total variance as the width-and-height loss, rows 3 and 4 use binary cross entropy as the confidence loss, and row 5 uses SSE as the class-probability loss.

where loss is the loss, S² is the number of grid cells of the convolutional neural network and B is the number of prediction boxes per cell; 1_{ij}^{obj} indicates whether the jth anchor box of the ith grid cell is responsible for a target (1 when responsible, 0 when not); 1_{ij}^{noobj} indicates that the jth prediction box of the ith grid cell is not responsible for a target (1 when there is no target, 0 when there is); λ_coord = 5 and λ_noobj = 0.5; x_i and y_i are the width and height coordinates of the center point of the ith prediction box, and x̂_i and ŷ_i those of the ith calibration frame; w_i and h_i are the width and height of the ith prediction box, and ŵ_i and ĥ_i those of the ith calibration frame; c_i is the confidence of the ith prediction box (1 if selected, 0 if not) and ĉ_i the confidence of the ith calibration frame (1 if selected, 0 if not); p_i is the classification probability of a face in the ith prediction box and p̂_i that in the ith calibration frame; c is the face/no-face class and classes is the set of those classes.
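As an illustrative sketch (not the patent's implementation), the contribution of a single box to the loss terms described above can be written out; the full loss sums this over all S²×B boxes, and all names here are assumptions:

```python
import math

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def bce(c: float, c_hat: float, eps: float = 1e-7) -> float:
    """Binary cross entropy used for the confidence terms."""
    c = min(max(c, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(c_hat * math.log(c) + (1.0 - c_hat) * math.log(1.0 - c))

def per_box_loss(pred, truth, responsible: bool) -> float:
    """Loss contribution of one prediction box.

    pred and truth are (x, y, w, h, confidence, p_face); `responsible`
    plays the role of the obj/noobj indicator for this box.
    """
    x, y, w, h, c, p = pred
    xh, yh, wh, hh, ch, ph = truth
    if responsible:
        coord = LAMBDA_COORD * ((x - xh) ** 2 + (y - yh) ** 2)
        size = LAMBDA_COORD * ((math.sqrt(w) - math.sqrt(wh)) ** 2
                               + (math.sqrt(h) - math.sqrt(hh)) ** 2)
        return coord + size + bce(c, ch) + (p - ph) ** 2
    return LAMBDA_NOOBJ * bce(c, ch)

# A perfect responsible prediction incurs an (almost) zero loss.
zero = per_box_loss((0.5, 0.5, 0.2, 0.3, 1.0, 1.0),
                    (0.5, 0.5, 0.2, 0.3, 1.0, 1.0), responsible=True)
```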
After the loss is obtained, the parameters are updated by stochastic gradient descent: the convolutional neural network continually selects the best parameters for the current objective and updates its parameters according to the loss so that its output matches the training labels, stopping the updates once the network reaches the required index.
(4) Input the thermal infrared image to be detected to obtain the face detection result. The invention can process a single image in 0.024 s, with high precision and an accuracy above 98.6%.
In addition, the coordinates mentioned in the invention are coordinates in the u-v image coordinate system; the width of a thermal infrared image or of a frame is its side length in the horizontal direction, and the height is its side length in the vertical direction.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A face detection method for thermal infrared images, characterized by comprising the following steps:
(1) taking N thermal infrared images as positive samples and L thermal infrared images that show no face as negative samples to form a training set, and obtaining M thermal infrared images as a test set; framing a face box on each thermal infrared image of the positive samples as a calibration frame; the mark of each thermal infrared image in the positive samples is 1, and the mark of each thermal infrared image in the negative samples is 0;
(2) scaling down the coordinates of the center point of the calibration frame and its width and height values for each thermal infrared image, and storing the scaled center coordinates, the scaled width and height, and the mark of the image together in a separate txt file, giving N txt files in total;
in addition, storing the path of each thermal infrared image in the training set and the marks of all thermal infrared images in the negative samples in another txt file;
in this way, N+1 txt files are obtained as training labels;
(3) building a convolutional neural network, inputting the training set and training labels into it for training, and optimizing it with a loss function to obtain the required trained model of the convolutional neural network;
(4) inputting a thermal infrared image from the test set and obtaining a face detection frame through the convolutional neural network.
2. The face detection method for thermal infrared images according to claim 1, characterized in that in step (1) the thermal infrared images are collected with a thermal infrared imager under the following conditions: for each person, videos of the face are recorded with a medium-wave thermal infrared imager at several distances for several set durations; the videos are cut every set number of frames, a set number of photos is selected, and the training set and test set are then obtained.
3. The method for detecting a human face in a thermal infrared image according to claim 1, wherein the training labels generated in step (2) are specifically as follows:
(2.1) storing the relative coordinates of the center point of the calibration frame:
centre_x = (x1 + x2) / (2w)
centre_y = (y1 + y2) / (2h)

wherein (x1, y1) and (x2, y2) are the coordinates of two diagonally opposite corners of the calibration frame, which together determine the frame; x1 and x2 are width coordinates in the x-y image coordinate system, y1 and y2 are height coordinates, and x1 > x2, y1 > y2; centre_x and centre_y represent the width and height coordinates of the center point of the calibration frame in the x-y image coordinate system, w represents the width of the thermal infrared image containing the calibration frame, and h represents its height;
(2.2) storing the width and height of the calibration frame relative to the thermal infrared image in which it is located:
frame_x = (x1 − x2) / w
frame_y = (y1 − y2) / h

wherein frame_x represents the relative width of the calibration frame and frame_y represents its relative height;
the above centre_x, centre_y, frame_x and frame_y values are stored in the same txt file as the mark of the corresponding thermal infrared image in the positive samples, and the mark and centre_x, centre_y, frame_x, frame_y of different positive-sample thermal infrared images are stored in different txt files.
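The relative label values of (2.1) and (2.2) can be computed directly from the two diagonal corners. This is a minimal sketch assuming the standard YOLO-style normalization (center and size divided by image size) consistent with the symbols defined in claim 3; the function name is hypothetical:

```python
def label_from_corners(x1, y1, x2, y2, w, h):
    """Normalized calibration-frame label from two diagonal corners.

    Per claim 3, x1 > x2 and y1 > y2; w and h are the width and height
    of the thermal infrared image containing the frame.
    """
    centre_x = (x1 + x2) / (2.0 * w)  # relative width coordinate of the center
    centre_y = (y1 + y2) / (2.0 * h)  # relative height coordinate of the center
    frame_x = (x1 - x2) / w           # relative width of the calibration frame
    frame_y = (y1 - y2) / h           # relative height of the calibration frame
    return centre_x, centre_y, frame_x, frame_y
```

All four outputs fall in [0, 1] for a frame inside the image, which is what the per-image txt files store.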
4. The method according to claim 1, wherein the convolutional neural network employs a Darknet framework and a Yolo network, the Darknet framework is used for performing convolution, max pooling and normalization on the input thermal infrared image to obtain weights of the convolutional neural network, and the Yolo network is used for processing the weights of the convolutional neural network to perform face determination and position regression.
5. The method for detecting a human face in a thermal infrared image according to claim 1, wherein the size relationship between the calibration frame and the prediction frame constructed by the convolutional neural network is as follows:

a_x = d_x + Δ(m_x)
a_y = d_y + Δ(m_y)
a_w = p_w · e^(m_w)
a_h = p_h · e^(m_h)

wherein a_x and a_y respectively represent the width and height coordinates of the center of the calibration frame in the u-v image coordinate system, and a_w and a_h represent the width and height of the calibration frame; Δ(m_x) and Δ(m_y) respectively represent the offsets in the width and height directions from the center of the calibration frame to the center of the prediction frame; d_x and d_y respectively represent the width and height coordinates of the center of the prediction frame; p_w and p_h respectively represent the width and height of the prediction frame; m_w and m_h are the width and height scaling ratios of the prediction frame; and the Δ function is the sigmoid function.
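The claim-5 relationship matches YOLO-style box decoding and can be sketched numerically as follows (the function name is hypothetical; Δ is taken as the sigmoid function, as stated):

```python
import math

def sigmoid(z):
    # the Δ function of claim 5
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(d_x, d_y, p_w, p_h, m_x, m_y, m_w, m_h):
    """Map prediction-frame parameters to the calibration-frame center and size."""
    a_x = d_x + sigmoid(m_x)   # center width coordinate: offset added via Δ
    a_y = d_y + sigmoid(m_y)   # center height coordinate
    a_w = p_w * math.exp(m_w)  # width: prior width scaled by e^(m_w)
    a_h = p_h * math.exp(m_h)  # height: prior height scaled by e^(m_h)
    return a_x, a_y, a_w, a_h
```

With zero offsets and zero scaling logits, the decoded box sits half a unit from (d_x, d_y) and keeps the prior size (p_w, p_h).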
6. The method according to claim 5, wherein six prediction frames are constructed by the convolutional neural network and divided between two scales; sorted from largest to smallest height, the six prediction frames are denoted prediction frame I through prediction frame VI, wherein the first scale is allocated prediction frames I, III and V, and the second scale is allocated prediction frames II, IV and VI.
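The scale assignment of claim 6 can be sketched as an alternating split. Note this assumes the "IV" listed for the first scale in the published text is a typo for "V" (as printed, frame IV would be assigned to both scales); the function name is hypothetical:

```python
def assign_anchors(anchors):
    """Split six prediction (anchor) boxes between two detection scales.

    anchors: list of (width, height) pairs. Boxes are sorted by height,
    largest first, then alternated: I, III, V to the first scale and
    II, IV, VI to the second.
    """
    ordered = sorted(anchors, key=lambda a: a[1], reverse=True)
    return ordered[0::2], ordered[1::2]
```
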
7. The method for detecting a human face in a thermal infrared image according to claim 1, wherein in step (3) the loss function used to optimize the convolutional neural network is as follows:
loss = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
     + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
     + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (c_i − ĉ_i)²
     + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (c_i − ĉ_i)²
     + Σ_{i=0}^{S²} 1_i^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²

wherein loss represents the loss; S² represents the number of grid cells of the convolutional neural network and B the number of prediction boxes per cell; 1_{ij}^{obj} indicates whether the j-th prediction box of the i-th grid cell is responsible for the target, taking the value 0 when it is not responsible and 1 when it is; 1_{ij}^{noobj} indicates that the j-th prediction box of the i-th grid cell is not responsible for any target, taking the value 1 when no target is assigned to it and 0 otherwise; λ_coord = 5 and λ_noobj = 0.5; x_i and y_i respectively represent the width and height coordinates of the center point of the i-th prediction box, and x̂_i and ŷ_i those of the i-th calibration frame; w_i and h_i respectively represent the width and height of the i-th prediction box, and ŵ_i and ĥ_i those of the i-th calibration frame; c_i represents the confidence of the i-th prediction box, with the value 1 for a selected prediction box and 0 otherwise, and ĉ_i represents the confidence of the i-th calibration frame, with the value 1 for a selected calibration frame and 0 otherwise; p_i represents the classification probability of a face in the i-th prediction box and p̂_i that in the i-th calibration frame; c denotes the class (face or no face) and classes denotes the set of these two classes;
and after the loss is obtained, the parameters are updated by stochastic gradient descent: the convolutional neural network repeatedly selects the best parameters for the current target, updates its parameters according to the loss, and stops updating once the required performance index is reached.
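The claim-7 loss can be sketched per box with NumPy. This is a simplified single-class illustration over flattened boxes; the array layout and names are assumptions, not the patent's code:

```python
import numpy as np

LAMBDA_COORD = 5.0   # λ_coord in claim 7
LAMBDA_NOOBJ = 0.5   # λ_noobj in claim 7

def yolo_loss(pred, target, obj_mask):
    """Simplified YOLO loss over flattened boxes.

    pred, target: (N, 6) arrays of [x, y, w, h, confidence, p_face];
    obj_mask: (N,) array, 1 where a box is responsible for a face, else 0.
    """
    noobj_mask = 1.0 - obj_mask
    # coordinate terms; square roots dampen the error on large boxes
    coord = LAMBDA_COORD * np.sum(
        obj_mask * ((pred[:, 0] - target[:, 0]) ** 2
                    + (pred[:, 1] - target[:, 1]) ** 2
                    + (np.sqrt(pred[:, 2]) - np.sqrt(target[:, 2])) ** 2
                    + (np.sqrt(pred[:, 3]) - np.sqrt(target[:, 3])) ** 2))
    # confidence terms, down-weighting boxes responsible for no object
    conf = (np.sum(obj_mask * (pred[:, 4] - target[:, 4]) ** 2)
            + LAMBDA_NOOBJ * np.sum(noobj_mask * (pred[:, 4] - target[:, 4]) ** 2))
    # classification term (face / no face collapses to one probability here)
    cls = np.sum(obj_mask * (pred[:, 5] - target[:, 5]) ** 2)
    return float(coord + conf + cls)
```

A perfect prediction gives zero loss, and a spurious confidence on an empty box contributes only λ_noobj times its squared error.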
CN201911394420.1A 2019-12-30 2019-12-30 Face detection method of thermal infrared image Pending CN111209822A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911394420.1A CN111209822A (en) 2019-12-30 2019-12-30 Face detection method of thermal infrared image


Publications (1)

Publication Number Publication Date
CN111209822A true CN111209822A (en) 2020-05-29

Family

ID=70786541



Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038474A (en) * 2017-12-28 2018-05-15 深圳云天励飞技术有限公司 Face detection method, convolutional neural network parameter training method, device and medium
CN108764057A (en) * 2018-05-03 2018-11-06 武汉高德智感科技有限公司 Far-infrared human face detection method and system based on deep learning
CN109902556A (en) * 2019-01-14 2019-06-18 平安科技(深圳)有限公司 Pedestrian detection method, system, computer equipment and computer-readable storage medium
CN110399905A (en) * 2019-07-03 2019-11-01 常州大学 Detection and description method of safety helmet wearing condition in construction scenes


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cai Chengtao (蔡成涛) et al.: "Ocean Buoy Target Detection Technology" (《海洋浮标目标探测技术》), Harbin Engineering University Press, pages: 51 - 53 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985374A (en) * 2020-08-12 2020-11-24 汉王科技股份有限公司 Face positioning method and device, electronic equipment and storage medium
CN111985374B (en) * 2020-08-12 2022-11-15 汉王科技股份有限公司 Face positioning method and device, electronic equipment and storage medium
CN112199993A (en) * 2020-09-01 2021-01-08 广西大学 Method for identifying transformer substation insulator infrared image detection model in any direction based on artificial intelligence
CN112199993B (en) * 2020-09-01 2022-08-09 广西大学 Method for identifying transformer substation insulator infrared image detection model in any direction based on artificial intelligence
CN112115838A (en) * 2020-09-11 2020-12-22 南京华图信息技术有限公司 Thermal infrared image spectrum fusion human face classification method
CN112115838B (en) * 2020-09-11 2024-04-05 南京华图信息技术有限公司 Face classification method based on thermal infrared image spectrum fusion
CN112232208A (en) * 2020-10-16 2021-01-15 蓝普金睛(北京)科技有限公司 Infrared human face temperature measurement system and method thereof
CN112529947A (en) * 2020-12-07 2021-03-19 北京市商汤科技开发有限公司 Calibration method and device, electronic equipment and storage medium
CN112926478A (en) * 2021-03-08 2021-06-08 新疆爱华盈通信息技术有限公司 Gender identification method, system, electronic device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination