WO2022198751A1 - Rapid facial detection method based on multi-layer preprocessing - Google Patents

Rapid facial detection method based on multi-layer preprocessing

Info

Publication number
WO2022198751A1
WO2022198751A1 · PCT/CN2021/091026 · CN2021091026W
Authority
WO
WIPO (PCT)
Prior art keywords
frame
coordinates
skin color
measured
pixel
Prior art date
Application number
PCT/CN2021/091026
Other languages
French (fr)
Chinese (zh)
Inventor
张晖 (Zhang Hui)
叶子皓 (Ye Zihao)
赵海涛 (Zhao Haitao)
孙雁飞 (Sun Yanfei)
朱洪波 (Zhu Hongbo)
Original Assignee
南京邮电大学 (Nanjing University of Posts and Telecommunications)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京邮电大学 (Nanjing University of Posts and Telecommunications)
Priority to JP2022512825A (published as JP7335018B2)
Publication of WO2022198751A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Definitions

  • The present application relates to the field of target detection, and in particular to a method for detecting faces quickly and accurately through multi-layer preprocessing.
  • Face recognition technology is widely used in monitoring, security, personnel management, and media production. It comprises two parts: face detection and face discrimination. Face detection finds the positions of all faces in an image, while face discrimination determines whether two faces belong to the same person. Face detection is the basis of face recognition technology, because subsequent processing is possible only after the positions of all faces have been found.
  • Face detection, as a subfield of object detection, already has many mature algorithms, such as Haar cascade classifiers, which combine digital image features with classification algorithms, and convolutional neural networks from the field of deep learning.
  • The convolutional neural network, one of the most advanced algorithms at present, performs very well on the face detection problem.
  • Properly designed and fully trained convolutional neural networks can detect faces with high accuracy under various lighting conditions, viewing angles, and even partial occlusion.
  • Exemplary embodiments of the present application provide a fast face detection method based on multi-layer preprocessing, which combines several image processing methods with convolutional neural network technology and aims to solve the problem that convolutional neural network inference is relatively slow.
  • A method for fast face detection based on multi-layer preprocessing is provided, with the following specific operation steps:
  • S102: use the elliptical skin color model to judge, pixel by pixel, whether each pixel of the image obtained in S101 is a skin color pixel, so as to obtain the skin color region, where a pixel is judged to be a skin color pixel when its blue chrominance and red chrominance components satisfy the elliptical skin color model;
  • S104: perform effective search position filtering on the processed skin color region obtained in S103 to get the effective search positions, extract the contours of the effective search positions with a contour extraction technique, and generate one frame to be tested for each contour;
  • S105: use a convolutional neural network with a face detection function to detect the frames to be tested obtained in S104 one by one, and output the face positioning coordinates within each frame;
  • S106: determine the coordinates of the face positioning frame from the coordinates of the frame to be tested and the face positioning coordinates within it.
  • The elliptical skin color model requires Cr(13Cr-10Cb-2900)+Cb(13Cb-1388)+295972 ≤ 0, where Cb and Cr represent the blue and red chrominance components of the pixel, respectively.
  • The step of performing effective search position filtering on the processed skin color region includes:
  • performing effective search position filtering on the processed skin color region using a filter matrix, where the pixel values in the processed skin color region, in the filter matrix, and in the effective search position map satisfy $dst(i,j)=1$ if $\sum_{x=-a}^{a}\sum_{y=-b}^{b} src(i+x,\,j+y)\,f(x,y)\ge t\cdot area$, and $dst(i,j)=0$ otherwise,
  • where dst(i,j) is the pixel value at coordinate (i,j) of the effective search position map dst, src(i+x,j+y) is the pixel value at coordinate (i+x,j+y) of the skin color region src, and f(x,y) is the pixel value at coordinate (x,y) of the filter matrix f,
  • the filter matrix f has size (2a+1)×(2b+1),
  • its center coordinate is (0,0),
  • t is the preset effective search rate (ESR) threshold,
  • and area is the number of pixels with value 1 in the filter matrix f.
  • The upper left corner coordinates (left, top) and lower right corner coordinates (right, bottom) of the frame to be tested are (left, top) = (left′ - b, top′ - a) and (right, bottom) = (right′ + b, bottom′ + a),
  • where (left′, top′) and (right′, bottom′) are the coordinates of the upper left and lower right corners of the contour's circumscribed rectangle, respectively.
  • The effective search rate is defined as the ratio of the area of the skin color region within the frame to be tested to the area of the frame to be tested.
  • The step of converting the image to be detected from the RGB color space to the YCbCr color space includes:
  • performing the color space conversion on the image to be detected using the RGB-to-YCbCr conversion formula (Formula 1 in the description below),
  • where Y, Cb, and Cr represent the luminance, blue chrominance, and red chrominance components of a pixel, respectively, and R, G, and B represent its red, green, and blue components.
  • The step of performing morphological processing on the skin color region includes: removing isolated skin color points and thin line structures through an opening operation.
  • the step of performing morphological processing on the skin color region further includes: filling holes and bridging gaps through a closing operation.
  • The frames to be tested include at least a frame A to be tested and a frame B to be tested, and S104 further includes:
  • merging the frames A and B, where frames A and B are merged if the area of the frame C obtained by merging them is less than or equal to the sum of the areas of frames A and B; otherwise, frames A and B are not merged.
  • The upper left corner coordinates (l_C, t_C) and lower right corner coordinates (r_C, b_C) of the frame C to be tested are (l_C, t_C) = (min(l_A, l_B), min(t_A, t_B)) and (r_C, b_C) = (max(r_A, r_B), max(b_A, b_B)),
  • where (l_A, t_A) and (r_A, b_A) are the coordinates of the upper left and lower right corners of frame A, respectively,
  • and (l_B, t_B) and (r_B, b_B) are the coordinates of the upper left and lower right corners of frame B, respectively.
  • The coordinates of the upper left and lower right corners of the face positioning frame in S106 are (l, t) = (l_C + l′, t_C + t′) and (r, b) = (r_C + r′, b_C + b′), respectively,
  • where (l_C, t_C) and (r_C, b_C) are the coordinates of the upper left and lower right corners of the frame C to be tested, and (l′, t′) and (r′, b′) are the coordinates, output by the convolutional neural network, of the upper left and lower right corners of a face located within frame C.
  • a computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the steps of the method described in any of the foregoing embodiments are implemented.
  • The present application retains the high accuracy of the face detection convolutional neural network while using multi-layer preprocessing to reduce the size of the area the network must search, thereby greatly improving its running speed.
  • FIG. 1 is a flowchart of a fast face detection method based on multi-layer preprocessing according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of effective search position filtering (ESPF) according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of generating a frame to be tested according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of merging frames to be tested according to an embodiment of the present application.
  • Convolutional neural networks can detect faces with high accuracy under various lighting conditions, viewing angles, and even partial occlusion.
  • However, convolutional neural networks also have a shortcoming: fast inference depends heavily on GPUs with powerful floating-point capability. Constrained by cost, size, and power, small edge devices can hardly support fast convolutional neural network inference.
  • a method for fast face detection based on multi-layer preprocessing specifically includes the following operation steps.
  • S101: perform color space conversion on the input image (the image to be detected), from the default RGB color space to the YCbCr color space. YCbCr separates the luminance of a color from its chrominance, which makes it better suited to classifying colors under different lighting conditions.
  • S102: use the elliptical skin color model to judge, pixel by pixel, whether each pixel of the image obtained in S101 is a skin color pixel, so as to obtain the skin color region, where a pixel is judged to be a skin color pixel when its blue chrominance and red chrominance components satisfy the elliptical skin color model.
  • Statistics over large numbers of skin color samples show that skin color in YCbCr space approximately follows an elliptical cylindrical distribution; that is, its distribution in the CbCr plane is close to an ellipse.
  • With a plane rectangular coordinate system whose horizontal axis is Cr and vertical axis is Cb, the skin color ellipse has center (155, 113), semi-major axis 30, semi-minor axis 20, and a 45° counterclockwise inclination. The skin color condition is therefore:
  • $$\frac{\left[(Cr-155)\cos 45^\circ+(Cb-113)\sin 45^\circ\right]^2}{30^2}+\frac{\left[(Cb-113)\cos 45^\circ-(Cr-155)\sin 45^\circ\right]^2}{20^2}\le 1$$
  • A pixel whose Cb and Cr components satisfy this condition can be considered a skin color pixel.
  • Judging every pixel with Formula 3 (the expanded form of this condition) yields the skin color region (or skin color mask).
  • Morphological operations are a family of image processing techniques for the shape features of binarized images.
  • The basic idea is to modify pixel values in the image using a structuring element of a specific shape together with a rule, achieving effects such as noise removal, hole filling, burr trimming, and edge smoothing, in preparation for further image analysis and target recognition.
  • Basic morphological operations include erosion and dilation. Erosion removes fine structures such as noise and burrs, while dilation fills holes and gaps.
  • During erosion, the structuring element slides over the input image pixel by pixel; the input image pixels that face the 1-valued positions of the structuring element are called the corresponding pixels.
  • At each position, the minimum of the corresponding pixels is written to the output pixel facing the structuring element's anchor point: $dst(i,j)=\min_{(x,y):\,E(x,y)=1} src(i+x,\,j+y)$ (Formula 4),
  • where dst, src, and E represent the output image, input image, and structuring element, respectively.
  • The structuring element takes its anchor point as the coordinate center, (i, j) is the current anchor position, and (x, y) is an offset relative to the anchor. Formula 4 shows that during erosion the output pixel at the anchor position is 1 only when the 1-valued region of the structuring element is completely covered by the 1-valued region of the input image. This shrinks the contours of the 1-valued regions, which visually appear eroded.
  • The dilation operation is similar to erosion, except that the minimum becomes the maximum: $dst(i,j)=\max_{(x,y):\,E(x,y)=1} src(i+x,\,j+y)$ (Formula 5).
  • Formula 5 shows that during dilation the output pixel at the anchor position is 0 only when the 1-valued region of the structuring element is completely covered by the 0-valued region of the input image. This expands the contours of the 1-valued regions, which visually appear inflated. Used alone, erosion and dilation can change the area of the skin color regions substantially.
  • Opening and closing apply erosion and dilation successively with the same structuring element.
  • The opening operation erodes first and then dilates; it can break small connections and remove noise.
  • The closing operation dilates first and then erodes; it can connect nearby regions and fill holes. Morphological processing is performed on the obtained skin color region: isolated skin color points and thin line structures are removed by opening, and small holes in the skin color regions are filled and small gaps bridged by closing.
  • Opening and closing have little effect on the area of the skin color regions while still removing noise and filling holes.
  • Applying opening and then closing to the skin color mask obtained in S102 yields the final skin color mask.
  • S104: perform effective search position filtering on the processed skin color region obtained in S103 to get the effective search positions, extract the contours of the effective search positions with a contour extraction technique, and generate one frame to be tested for each contour.
  • Effective Search Position Filtering (ESPF) is applied to the final skin color region to obtain all effective search position pixels.
  • ESPF is a special image filtering operation that uses an ellipse-shaped filter matrix and a filtering rule based on the Effective Search Rate (ESR).
  • The effective search rate is defined as the ratio of the skin color area A_s inside the frame to be tested to the area A_r of that frame: ESR = A_s / A_r.
  • The ESPF computation can be expressed as:
  • $$dst(i,j)=\begin{cases}1, & \displaystyle\sum_{x=-a}^{a}\sum_{y=-b}^{b} src(i+x,\,j+y)\,f(x,y)\ \ge\ t\cdot area\\[4pt] 0, & \text{otherwise}\end{cases}$$
  • where dst, src, and f are the output image, input image, and filter matrix, respectively; the filter matrix has size (2a+1)×(2b+1) and center coordinate (0,0); t is the preset ESR threshold; and area is the number of 1-valued pixels in the filter matrix.
  • The filter matrix used in ESPF filtering is an ellipse matrix in which the 1 values form a regular ellipse inscribed in the rectangle, as shown in the filter matrix of FIG. 2.
  • The output image of ESPF filtering is the effective search position map; contour extraction is then applied to it, and one frame to be tested is generated per contour.
  • The frame to be tested is obtained by expanding the contour's circumscribed rectangle outward by a fixed distance. The four sides of the circumscribed rectangle are tangent to the contour, and each side is parallel to the corresponding image edge. The expansion distance equals half the filter matrix size.
  • Each frame to be tested obtained after ESPF filtering therefore has a relatively high ESR.
  • Non-face skin color parts, such as small skin color areas and long narrow skin color areas, are eliminated by ESPF filtering, and the problem of connected skin color regions is also solved.
  • S105: use a convolutional neural network with a face detection function to detect the frames to be tested obtained in S104 one by one, and output the face positioning coordinates within each frame.
  • FIG. 4 shows the effect of merging frames to be tested: two pairs of heavily overlapping frames are merged, further reducing the area the convolutional neural network must search and improving search efficiency.
  • Finally, the convolutional neural network outputs the coordinates of all face positioning frames inside a frame to be tested, relative to that frame. If the upper left and lower right corners of the frame to be tested are (l_C, t_C) and (r_C, b_C), and the network outputs a face positioning frame with upper left and lower right corners (l′, t′) and (r′, b′), then the actual corners of that face positioning frame are (l, t) = (l_C + l′, t_C + t′) and (r, b) = (r_C + r′, b_C + b′).
  • Although the steps in the flowchart of FIG. 1 are displayed in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on their order, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages; these need not be completed at the same moment, may be executed at different times, and their order of execution is not necessarily sequential: they may be performed in turn or alternately with other steps, or with sub-steps or stages of other steps.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a facial detection method, comprising: performing color space conversion on an input original image; extracting a skin color region from the image using an elliptical skin color model; correcting the skin color region by means of morphological operations; generating candidate detection boxes by means of an effective search position filtering method; merging candidate boxes that overlap excessively; detecting the candidate boxes one by one with a convolutional neural network; and calculating the coordinates of the final face positioning boxes.

Description

A Fast Face Detection Method Based on Multi-Layer Preprocessing
Related Applications
This application claims priority to Chinese patent application No. 2021103222047, entitled "A Fast Face Detection Method Based on Multi-layer Preprocessing" and filed with the Chinese Patent Office on March 25, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of target detection, and in particular to a method for detecting faces quickly and accurately through multi-layer preprocessing.
Background
Face recognition technology is widely used in monitoring, security, personnel management, and media production. It comprises two parts: face detection and face discrimination. Face detection finds the positions of all faces in an image, while face discrimination determines whether two faces belong to the same person. Face detection is the basis of face recognition technology, because subsequent processing is possible only after the positions of all faces have been found.
Face detection, as a subfield of object detection, already has many mature algorithms, such as Haar cascade classifiers, which combine digital image features with classification algorithms, and convolutional neural networks from the field of deep learning. The convolutional neural network, one of the most advanced algorithms at present, performs very well on the face detection problem. Properly designed and fully trained convolutional neural networks can detect faces with high accuracy under various lighting conditions, viewing angles, and even partial occlusion.
Summary
Exemplary embodiments of the present application provide a fast face detection method based on multi-layer preprocessing, which combines several image processing methods with convolutional neural network technology and aims to solve the problem that convolutional neural network inference is relatively slow.
In one aspect of the present application, a fast face detection method based on multi-layer preprocessing is provided, with the following specific operation steps:
S101: convert the image to be detected from the RGB color space to the YCbCr color space;
S102: use the elliptical skin color model to judge, pixel by pixel, whether each pixel of the image obtained in S101 is a skin color pixel, so as to obtain the skin color region, where a pixel is judged to be a skin color pixel when its blue chrominance and red chrominance components satisfy the elliptical skin color model;
S103: perform morphological processing on the skin color region obtained in S102 to obtain a processed skin color region;
S104: perform effective search position filtering on the processed skin color region obtained in S103 to get the effective search positions, extract the contours of the effective search positions with a contour extraction technique, and generate one frame to be tested for each contour;
S105: use a convolutional neural network with a face detection function to detect the frames to be tested obtained in S104 one by one, and output the face positioning coordinates within each frame;
S106: determine the coordinates of the face positioning frame from the coordinates of the frame to be tested and the face positioning coordinates within it.
In one embodiment, the elliptical skin color model requires:
Cr(13Cr-10Cb-2900)+Cb(13Cb-1388)+295972 ≤ 0
where Cb represents the blue chrominance component of the pixel, and Cr represents the red chrominance component of the pixel.
In one embodiment, the step of performing effective search position filtering on the processed skin color region includes:
performing effective search position filtering on the processed skin color region using a filter matrix, where the pixel values in the processed skin color region, the pixel values in the filter matrix, and the pixel values in the effective search positions satisfy the following formula:
$$dst(i,j)=\begin{cases}1, & \displaystyle\sum_{x=-a}^{a}\sum_{y=-b}^{b} src(i+x,\,j+y)\,f(x,y)\ \ge\ t\cdot area\\[4pt] 0, & \text{otherwise}\end{cases}$$
where dst(i,j) is the pixel value at coordinate (i,j) of the effective search position map dst, src(i+x,j+y) is the pixel value at coordinate (i+x,j+y) of the skin color region src, and f(x,y) is the pixel value at coordinate (x,y) of the filter matrix f; the filter matrix f has size (2a+1)×(2b+1) and center coordinate (0,0); t is the preset effective search rate (ESR) threshold; and area is the number of pixels with value 1 in the filter matrix f.
In one embodiment, the upper left corner coordinates (left, top) and the lower right corner coordinates (right, bottom) of the frame to be tested are:
(left, top) = (left′ - b, top′ - a)
(right, bottom) = (right′ + b, bottom′ + a)
where (left′, top′) and (right′, bottom′) are the coordinates of the upper left and lower right corners of the contour's circumscribed rectangle, respectively.
In one embodiment, the effective search rate is defined as the ratio of the area of the skin color region within the frame to be tested to the area of the frame to be tested.
In one embodiment, the step of converting the image to be detected from the RGB color space to the YCbCr color space includes:
performing the color space conversion on the image to be detected using the following formula:
$$\begin{bmatrix}Y\\ Cb\\ Cr\end{bmatrix}=\begin{bmatrix}0.299 & 0.587 & 0.114\\ -0.169 & -0.331 & 0.500\\ 0.500 & -0.419 & -0.081\end{bmatrix}\begin{bmatrix}R\\ G\\ B\end{bmatrix}+\begin{bmatrix}0\\ 128\\ 128\end{bmatrix}$$
where Y, Cb, and Cr represent the luminance, blue chrominance, and red chrominance components of the pixel, respectively, and R, G, and B represent the red, green, and blue components of the pixel, respectively.
In one embodiment, the step of performing morphological processing on the skin color region includes: removing isolated skin color points and thin line structures through an opening operation.
In one embodiment, the step of performing morphological processing on the skin color region further includes: filling holes and bridging gaps through a closing operation.
In one embodiment, the frames to be tested include at least a frame A to be tested and a frame B to be tested, and S104 further includes:
merging the frames A and B, where frames A and B are merged if the area of the frame C obtained by merging them is less than or equal to the sum of the areas of frames A and B; otherwise, frames A and B are not merged.
In one embodiment, the upper left corner coordinates (l_C, t_C) and the lower right corner coordinates (r_C, b_C) of the frame C to be tested are:
(l_C, t_C) = (min(l_A, l_B), min(t_A, t_B))
(r_C, b_C) = (max(r_A, r_B), max(b_A, b_B))
where (l_A, t_A) and (r_A, b_A) are the coordinates of the upper left and lower right corners of frame A, respectively, and (l_B, t_B) and (r_B, b_B) are the coordinates of the upper left and lower right corners of frame B, respectively.
In one embodiment, the coordinates of the upper left and lower right corners of the face positioning frame in S106 are, respectively:
(l, t) = (l_C + l′, t_C + t′)
(r, b) = (r_C + r′, b_C + b′)
where (l_C, t_C) and (r_C, b_C) are the coordinates of the upper left and lower right corners of the frame C to be tested, and (l′, t′) and (r′, b′) are the coordinates, output by the convolutional neural network, of the upper left and lower right corners of a face located within frame C.
In one embodiment, the effective search rate is defined as the ratio of the area of the skin color region within the frame to be tested to the area of the frame to be tested.
In another aspect of the present application, a computer device is provided, including a memory and a processor, the memory storing a computer program, where the processor, when executing the computer program, implements the steps of the method of any of the above embodiments.
In yet another aspect of the present application, a computer-readable storage medium is provided, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method of any of the above embodiments.
Beneficial effects: the present application retains the high accuracy of the face detection convolutional neural network while using multi-layer preprocessing to reduce the size of the area the network must search, thereby greatly improving its running speed.
Brief Description of the Drawings
FIG. 1 is a flowchart of a fast face detection method based on multi-layer preprocessing according to an embodiment of the present application.
FIG. 2 is a schematic diagram of effective search position filtering (ESPF) according to an embodiment of the present application.
FIG. 3 is a schematic diagram of generating frames to be tested according to an embodiment of the present application.
FIG. 4 is a schematic diagram of merging frames to be tested according to an embodiment of the present application.
Detailed Description
As mentioned above, various properly designed and fully trained convolutional neural networks can detect faces very accurately under various lighting conditions, viewing angles, and even partial occlusion. However, convolutional neural networks also have a shortcoming: fast inference depends heavily on GPUs with powerful floating-point capability. Constrained by cost, size, and power, small edge devices can hardly support fast convolutional neural network inference.
To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present application and do not limit it.
The technical solution of the present application is further elaborated below with reference to the accompanying drawings and specific embodiments.
In the embodiment shown in FIG. 1, a fast face detection method based on multi-layer preprocessing specifically includes the following operation steps.
S101: perform color space conversion on the input image (the image to be detected), from the default RGB color space to the YCbCr color space. YCbCr separates the luminance of a color from its chrominance, so it is better suited to classifying colors under different lighting conditions.
Since most image and video encodings in computing are based on the RGB color space, using YCbCr requires first converting from the RGB color space to the YCbCr color space. Because the human eye is not equally sensitive to red, green, and blue, different weights must be given to the red, green, and blue components when computing the luminance Y. The specific conversion formula is:
$$\begin{bmatrix}Y\\ Cb\\ Cr\end{bmatrix}=\begin{bmatrix}0.299 & 0.587 & 0.114\\ -0.169 & -0.331 & 0.500\\ 0.500 & -0.419 & -0.081\end{bmatrix}\begin{bmatrix}R\\ G\\ B\end{bmatrix}+\begin{bmatrix}0\\ 128\\ 128\end{bmatrix}\qquad\text{(Formula 1)}$$
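As a concrete illustration of S101, a minimal NumPy sketch follows. The patent's exact coefficient matrix sits behind a formula image, so the standard full-range BT.601 coefficients are assumed here; OpenCV users can equivalently call cv2.cvtColor(img, cv2.COLOR_RGB2YCrCb), noting that OpenCV orders the channels Y, Cr, Cb.

```python
import numpy as np

def rgb_to_ycbcr(img_rgb: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 RGB image to full-range YCbCr (assumed BT.601)."""
    m = np.array([[ 0.299,  0.587,  0.114],   # Y weights
                  [-0.169, -0.331,  0.500],   # Cb weights
                  [ 0.500, -0.419, -0.081]])  # Cr weights
    ycbcr = img_rgb.astype(np.float64) @ m.T
    ycbcr[..., 1:] += 128.0                   # shift chroma into [0, 255]
    return np.clip(ycbcr, 0, 255).astype(np.uint8)
```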
S102: use the elliptical skin color model to judge, pixel by pixel, whether each pixel of the image obtained in S101 is a skin color pixel, so as to obtain the skin color region; a pixel is judged to be a skin color pixel when its blue chrominance and red chrominance components satisfy the elliptical skin color model.
Statistics over large numbers of skin color samples show that skin color in YCbCr space approximately follows an elliptical cylindrical distribution; that is, its distribution in the CbCr plane is close to an ellipse. According to statistical research, if a plane rectangular coordinate system is set up with Cr on the horizontal axis and Cb on the vertical axis, the skin color ellipse has center (155, 113), semi-major axis 30, semi-minor axis 20, and a 45° counterclockwise inclination. The skin color ellipse condition is therefore:
$$\frac{\left[(Cr-155)\cos 45^\circ+(Cb-113)\sin 45^\circ\right]^2}{30^2}+\frac{\left[(Cb-113)\cos 45^\circ-(Cr-155)\sin 45^\circ\right]^2}{20^2}\le 1\qquad\text{(Formula 2)}$$
With the skin color ellipse model, a pixel whose blue chrominance Cb and red chrominance Cr components form a point inside the skin color ellipse is judged to be a skin color pixel; otherwise it is a non-skin pixel. Simplifying Formula 2 gives the condition for a pixel to be a skin color pixel:
Cr(13Cr-10Cb-2900)+Cb(13Cb-1388)+295972 ≤ 0 (Formula 3).
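The expansion from Formula 2 to Formula 3 can be checked directly (a worked verification, reading the lengths 30 and 20 as semi-axis lengths). Let x = Cr - 155 and y = Cb - 113; with cos 45° = sin 45° = 1/√2, Formula 2 becomes

$$\frac{(x+y)^2}{2\cdot 30^2}+\frac{(x-y)^2}{2\cdot 20^2}\le 1.$$

Multiplying both sides by 7200 gives 4(x+y)² + 9(x-y)² ≤ 7200, i.e. 13x² - 10xy + 13y² ≤ 7200. Substituting x and y back and collecting terms gives Cr(13Cr-10Cb-2900) + Cb(13Cb-1388) + (312325 - 175150 + 165997 - 7200) ≤ 0, whose constant evaluates to 295972, matching Formula 3.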
In S101 the RGB image was converted to YCbCr space; now, any pixel whose Cb and Cr components satisfy Formula 3 is regarded as a skin color pixel. Judging every pixel of the input image with Formula 3 yields the skin color region (or skin color mask).
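A vectorized sketch of this per-pixel test, assuming the (Y, Cb, Cr) channel order produced by the conversion above:

```python
import numpy as np

def skin_mask(ycbcr: np.ndarray) -> np.ndarray:
    """Binary skin mask from Formula 3, evaluated over all pixels at once."""
    cb = ycbcr[..., 1].astype(np.int64)
    cr = ycbcr[..., 2].astype(np.int64)
    skin = cr * (13 * cr - 10 * cb - 2900) + cb * (13 * cb - 1388) + 295972 <= 0
    return skin.astype(np.uint8)  # 1 = skin color pixel, 0 = non-skin
```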
S103: perform morphological processing on the skin color region obtained in S102 to obtain a processed skin color region.
Morphological operations are a family of image processing techniques for the shape features of binarized images. The basic idea is to modify pixel values using a structuring element of a specific shape together with a rule, achieving effects such as noise removal, hole filling, burr trimming, and edge smoothing, in preparation for further image analysis and target recognition. The basic morphological operations are erosion and dilation: erosion removes fine structures such as noise and burrs, while dilation fills holes and gaps. During erosion, the structuring element slides over the input image pixel by pixel; the input image pixels that face the 1-valued positions of the structuring element are called the corresponding pixels, and at each position the minimum of the corresponding pixels is written to the output pixel facing the structuring element's anchor point, expressed as:
$$dst(i,j)=\min_{(x,y):\,E(x,y)=1} src(i+x,\,j+y)\qquad\text{(Formula 4)}$$
where dst, src, and E represent the output image, input image, and structuring element, respectively; the structuring element takes its anchor point as the coordinate center, (i, j) is the current anchor position, and (x, y) is an offset relative to the anchor. Formula 4 shows that during erosion the output pixel at the anchor position is 1 only when the 1-valued region of the structuring element is completely covered by the 1-valued region of the input image. This shrinks the contours of the 1-valued regions, which visually appear eroded. The dilation operation is similar to erosion, except that the minimum becomes the maximum:
$$dst(i,j)=\max_{(x,y):\,E(x,y)=1} src(i+x,\,j+y)\qquad\text{(Formula 5)}$$
Formula 5 shows that during dilation the output pixel at the anchor position is 0 only when the 1-valued region of the structuring element is completely covered by the 0-valued region of the input image. This expands the contours of the 1-valued regions, which visually appear inflated. Used alone, erosion and dilation can change the area of the skin color regions substantially.
To remove noise and fill holes without affecting the size of the skin color regions, opening and closing are used. Opening erodes and then dilates the image with the same structuring element; it breaks thin connections and removes noise. Closing dilates first and then erodes; it connects nearby regions and fills holes. The skin color region is processed morphologically: isolated skin color points and thin line structures are removed by opening, and small holes in the skin color regions are filled and narrow gaps bridged by closing. Opening and closing barely change the area of the skin color regions while still removing noise and filling holes. Applying opening and then closing to the skin color mask obtained in S102 yields the final skin color mask.
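An OpenCV sketch of this step; the 5x5 elliptical structuring element is an assumed size, since the text does not fix one:

```python
import cv2
import numpy as np

def clean_skin_mask(mask: np.ndarray, ksize: int = 5) -> np.ndarray:
    """Opening removes isolated skin points and thin lines; closing then
    fills small holes and bridges narrow gaps (S103)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)
```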
S104: perform effective search position filtering on the processed skin color region obtained in S103 to get the effective search positions, extract the contours of the effective search positions with a contour extraction technique, and generate one frame to be tested for each contour.
Effective Search Position Filtering (ESPF) is applied to the final skin color region to obtain all effective search position pixels. ESPF is a special image filtering operation that uses an ellipse-shaped filter matrix and a filtering rule based on the Effective Search Rate (ESR). The effective search rate is defined as the ratio of the skin color area A_s inside the frame to be tested to the area A_r of the frame, as follows:
$$ESR=\frac{A_s}{A_r}\qquad\text{(Formula 6)}$$
The ESPF computation can be expressed as:
$$dst(i,j)=\begin{cases}1, & \displaystyle\sum_{x=-a}^{a}\sum_{y=-b}^{b} src(i+x,\,j+y)\,f(x,y)\ \ge\ t\cdot area\\[4pt] 0, & \text{otherwise}\end{cases}\qquad\text{(Formula 7)}$$
where dst, src, and f are the output image, input image, and filter matrix, respectively. The filter matrix has size (2a+1)×(2b+1) and center coordinate (0, 0); t is the preset ESR threshold; and area is the number of 1-valued pixels in the filter matrix. The filter matrix used in ESPF filtering is an ellipse matrix in which the 1 values form a regular ellipse inscribed in the rectangle, as shown in the filter matrix of FIG. 2.
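To make the filtering rule concrete, here is a sketch of ESPF as a thresholded correlation over a binary (0/1) skin mask. The elliptical window is built with height 2a+1 and width 2b+1, matching the b-horizontal and a-vertical expansion of Formula 8 below; treating the border as non-skin is an assumption the text does not spell out.

```python
import cv2
import numpy as np

def espf(mask: np.ndarray, a: int, b: int, t: float) -> np.ndarray:
    """Effective Search Position Filtering (Formula 7) on a 0/1 skin mask."""
    # Elliptical filter matrix f: width 2b+1, height 2a+1, values in {0, 1}.
    f = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * b + 1, 2 * a + 1))
    area = float(f.sum())  # number of 1-valued pixels in f
    # filter2D correlates mask with f, so each output pixel holds
    # sum(src(i+x, j+y) * f(x, y)); pixels outside the image count as 0.
    response = cv2.filter2D(mask.astype(np.float32), -1,
                            f.astype(np.float32),
                            borderType=cv2.BORDER_CONSTANT)
    return (response >= t * area).astype(np.uint8)
```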
As shown in FIG. 2, the output image of ESPF filtering is the effective search position map. Contour extraction is then applied to the effective search positions, and one frame to be tested is generated per contour. The frame to be tested is obtained by expanding the contour's circumscribed rectangle outward by a fixed distance; the four sides of the circumscribed rectangle are tangent to the contour, and each side is parallel to the corresponding image edge. The expansion distance equals half the filter matrix size: if the upper left and lower right corners of the contour's circumscribed rectangle are (left′, top′) and (right′, bottom′) and the filter matrix size is (2a+1)×(2b+1), then the corners of the expanded frame to be tested are:
(left, top) = (left′ - b, top′ - a)
(right, bottom) = (right′ + b, bottom′ + a)   (Formula 8)
The finally generated frames to be tested are shown in FIG. 3. Each frame obtained after ESPF filtering has a relatively high ESR. At this point, non-face skin color parts such as small skin color areas and long narrow skin color areas have been eliminated by ESPF filtering, and the problem of connected skin color regions has also been solved.
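The contour-to-frame step can be sketched with OpenCV's contour extraction; clamping to the image bounds is a small practical safeguard added here, not something the text prescribes:

```python
import cv2
import numpy as np

def boxes_from_positions(esp: np.ndarray, a: int, b: int) -> list:
    """One frame to be tested per contour of the effective search position map,
    expanded by half the filter size: b horizontally, a vertically (Formula 8)."""
    h, w = esp.shape[:2]
    contours, _ = cv2.findContours(esp, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, bw, bh = cv2.boundingRect(c)  # circumscribed rectangle (left', top', w, h)
        left, top = max(x - b, 0), max(y - a, 0)
        right, bottom = min(x + bw - 1 + b, w - 1), min(y + bh - 1 + a, h - 1)
        boxes.append((left, top, right, bottom))
    return boxes
```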
S105: use a convolutional neural network with a face detection function to detect the frames to be tested obtained in S104 one by one, and output the face positioning coordinates within each frame.
First, check whether any frames to be tested can be merged, and merge all such frames to obtain the final frames to be tested. Merging two frames A and B means replacing them with a larger frame C; C should completely cover A and B while having the smallest possible area, so its upper left and lower right corners are:
(l_C, t_C) = (min(l_A, l_B), min(t_A, t_B))
(r_C, b_C) = (max(r_A, r_B), max(b_A, b_B))   (Formula 9)
Merging must also satisfy the condition that the total area does not increase, i.e. S_C ≤ S_A + S_B, where the area of a frame is S = (r - l)(b - t). FIG. 4 shows the effect of merging frames to be tested: two pairs of heavily overlapping frames are merged, further reducing the area the convolutional neural network must search and improving search efficiency.
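A minimal sketch of the merge rule on (l, t, r, b) boxes, combining Formula 9 with the area test S_C ≤ S_A + S_B; applying it repeatedly over all pairs until no merge succeeds yields the final frames:

```python
def try_merge(box_a, box_b):
    """Return the union box C of Formula 9 if it is no larger than the two
    areas combined (S_C <= S_A + S_B); otherwise return None."""
    area = lambda bx: (bx[2] - bx[0]) * (bx[3] - bx[1])  # S = (r - l)(b - t)
    c = (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
         max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))
    return c if area(c) <= area(box_a) + area(box_b) else None
```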
S106: determine the coordinates of the face positioning frame from the coordinates of the frame to be tested and the face positioning coordinates within it, to obtain the face detection result.
The convolutional neural network with a face detection function detects each final frame to be tested one by one and outputs the face positioning coordinates within it; these output coordinates are relative to the frame to be tested.
The convolutional neural network outputs the coordinates of all face positioning frames inside a frame to be tested relative to that frame. If the upper left and lower right corners of the frame to be tested are (l_C, t_C) and (r_C, b_C), and the network outputs a face positioning frame with upper left and lower right corners (l′, t′) and (r′, b′), then the actual corners of that face positioning frame are:
(l, t) = (l_C + l′, t_C + t′)
(r, b) = (r_C + r′, b_C + b′)   (Formula 10)
The actual coordinates of each face positioning frame in the image are computed from the frame-to-be-tested coordinates and the face positioning coordinates inside it, and are output as the final face detection result.
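Putting the steps together, here is an end-to-end sketch built from the helper functions in the snippets above. detect_faces is a hypothetical stand-in for any CNN face detector returning (l′, t′, r′, b′) boxes relative to the crop, and the a, b, t defaults are illustrative assumptions rather than values fixed by the text:

```python
import cv2
import numpy as np

def detect(img_bgr: np.ndarray, detect_faces, a: int = 10, b: int = 10, t: float = 0.6):
    """S101-S106: preprocess, generate and merge frames, then run the CNN per frame."""
    # S101: BGR -> YCrCb, then reorder channels to (Y, Cb, Cr).
    ycbcr = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb)[..., [0, 2, 1]]
    mask = skin_mask(ycbcr)                  # S102, Formula 3
    mask = clean_skin_mask(mask)             # S103, opening + closing
    esp = espf(mask, a, b, t)                # S104, Formula 7
    boxes = boxes_from_positions(esp, a, b)  # S104, Formula 8
    merged = True                            # merge frames until stable
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                c = try_merge(boxes[i], boxes[j])
                if c is not None:
                    boxes[j] = c
                    del boxes[i]
                    merged = True
                    break
            if merged:
                break
    results = []
    for (l_c, t_c, r_c, b_c) in boxes:       # S105 + S106
        crop = img_bgr[t_c:b_c + 1, l_c:r_c + 1]
        for (l, t_, r, b_) in detect_faces(crop):
            results.append((l_c + l, t_c + t_, l_c + r, t_c + b_))  # Formula 10
    return results
```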
It should be understood that although the steps in the flowchart of FIG. 1 are displayed in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on their order, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages; these need not be completed at the same moment and may be executed at different times, and their order of execution is not necessarily sequential: they may be performed in turn or alternately with other steps, or with sub-steps or stages of other steps.
A person of ordinary skill in the art can understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be understood as limiting the scope of the invention patent. It should be pointed out that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (13)

  1. A face detection method, comprising:
    S101: converting an image to be detected from the RGB color space to the YCbCr color space;
    S102: using an elliptical skin color model to judge, pixel by pixel, whether each pixel of the image obtained in S101 is a skin color pixel, so as to obtain a skin color region, wherein a pixel is judged to be a skin color pixel when its blue chrominance and red chrominance components satisfy the elliptical skin color model;
    S103: performing morphological processing on the skin color region obtained in S102 to obtain a processed skin color region;
    S104: performing effective search position filtering on the processed skin color region obtained in S103 to obtain effective search positions, extracting contours of the effective search positions with a contour extraction technique, and generating one frame to be tested for each contour;
    S105: using a convolutional neural network with a face detection function to detect the frames to be tested obtained in S104 one by one, and outputting face positioning coordinates within each frame to be tested;
    S106: determining coordinates of a face positioning frame according to the coordinates of the frame to be tested and the face positioning coordinates within it.
  2. The method according to claim 1, wherein the elliptical skin color model requires:
    Cr(13Cr-10Cb-2900)+Cb(13Cb-1388)+295972 ≤ 0
    wherein Cb denotes the blue chrominance component of the pixel, and Cr denotes the red chrominance component of the pixel.
  3. The method according to claim 1, wherein the step of performing effective search position filtering on the processed skin color region comprises:
    performing effective search position filtering on the processed skin color region using a filter matrix, wherein pixel values in the processed skin color region, pixel values in the filter matrix, and pixel values in the effective search positions satisfy the following formula:
    dst(i,j) = 1, if ( Σ(x=-a..a) Σ(y=-b..b) src(i+x,j+y)·f(x,y) ) / area ≥ t
    dst(i,j) = 0, otherwise
    wherein dst(i,j) denotes the pixel value at coordinates (i,j) of the effective search positions dst; src(i+x,j+y) denotes the pixel value at coordinates (i+x,j+y) of the skin color region src; f(x,y) denotes the value at coordinates (x,y) of the filter matrix f, which has size (2a+1)×(2b+1) and center coordinates (0,0); t denotes a preset threshold of the effective search rate ESR; and area denotes the number of pixels of the filter matrix f whose value is 1.
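One vectorized reading of this formula, sketched under the assumption that the filter matrix f is all ones (the claim fixes only its size and that its entries are binary): the inner double sum is a correlation of src with f, which cv2.filter2D computes directly.

```python
import cv2
import numpy as np

def effective_search_filter(src, a, b, t):
    # src: binary skin mask with values 0/1; t: ESR threshold in [0, 1]
    f = np.ones((2 * a + 1, 2 * b + 1), dtype=np.float32)  # filter matrix f
    area = int(f.sum())  # number of 1-valued pixels in f
    # Correlation: acc(i, j) = sum over x, y of src(i+x, j+y) * f(x, y)
    acc = cv2.filter2D(src.astype(np.float32), -1, f,
                       borderType=cv2.BORDER_CONSTANT)
    # dst(i, j) = 1 where the effective search rate reaches the threshold
    return (acc / area >= t).astype(np.uint8)
```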
  4. The method according to claim 3, wherein the upper-left corner coordinates (left, top) and lower-right corner coordinates (right, bottom) of the frame to be measured are:
    (left, top) = (left′-b, top′-a)
    (right, bottom) = (right′+b, bottom′+a)
    wherein (left′, top′) and (right′, bottom′) denote the upper-left and lower-right corner coordinates, respectively, of the contour's bounding rectangle.
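In code, the claim-4 expansion is a single offset by the filter half-sizes; clamping to the image bounds, shown here, is an assumed practical addition rather than part of the claim.

```python
def expand_box(left, top, right, bottom, a, b, width, height):
    # Grow the contour's bounding rectangle by (b, a) on each side so the
    # frame to be measured covers the filter's full support, then clamp.
    return (max(left - b, 0), max(top - a, 0),
            min(right + b, width - 1), min(bottom + a, height - 1))
```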
  5. The method according to claim 3, wherein the effective search rate is defined as the ratio of the area of the skin color region within the frame to be measured to the area of the frame to be measured.
  6. The method according to claim 1, wherein the step of converting the image to be detected from the RGB color space to the YCbCr color space comprises:
    performing the color space conversion on the image to be detected using the following formulas:
    Y = 0.299R + 0.587G + 0.114B
    Cb = -0.1687R - 0.3313G + 0.5B + 128
    Cr = 0.5R - 0.4187G - 0.0813B + 128
    wherein Y, Cb, and Cr denote the luminance, blue chrominance, and red chrominance components of the pixel, respectively, and R, G, and B denote the red, green, and blue components of the pixel, respectively.
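A short sketch of the conversion as reconstructed above; the coefficients are the common BT.601 full-range (JPEG) values, which is an assumption, since the original formula is given in a figure that is not reproduced here.

```python
import numpy as np

# Assumed BT.601 full-range conversion matrix; rows produce Y, Cb, Cr
M = np.array([[ 0.299,   0.587,   0.114 ],
              [-0.1687, -0.3313,  0.5   ],
              [ 0.5,    -0.4187, -0.0813]])

def rgb_to_ycbcr(rgb):
    # rgb: H x W x 3 array with R, G, B channels in [0, 255]
    ycbcr = rgb.astype(np.float64) @ M.T
    ycbcr[..., 1:] += 128.0  # shift Cb and Cr into [0, 255]
    return ycbcr
```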
  7. The method according to claim 1, wherein the step of performing morphological processing on the skin color region comprises:
    removing isolated skin color points and thin line structures by an opening operation.
  8. The method according to claim 7, wherein the step of performing morphological processing on the skin color region further comprises:
    filling holes and bridging gaps by a closing operation.
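A brief OpenCV sketch of claims 7 and 8 together; the 5×5 elliptical structuring element is an assumed choice, as the claims do not fix a kernel.

```python
import cv2

def clean_mask(skin_mask):
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    # Opening (claim 7) removes isolated skin color points and thin lines
    opened = cv2.morphologyEx(skin_mask, cv2.MORPH_OPEN, kernel)
    # Closing (claim 8) fills holes and bridges narrow gaps
    return cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)
```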
  9. The method according to claim 1, wherein the frames to be measured comprise at least a frame to be measured A and a frame to be measured B, and S104 further comprises:
    merging the frames to be measured A and B, wherein the frames A and B are merged if the area of a frame to be measured C obtained by merging them is less than or equal to the sum of the areas of the frames A and B; otherwise, the frames A and B are not merged.
  10. The method according to claim 9, wherein the upper-left corner coordinates (l_C, t_C) and lower-right corner coordinates (r_C, b_C) of the frame to be measured C are:
    (l_C, t_C) = (min(l_A, l_B), min(t_A, t_B))
    (r_C, b_C) = (max(r_A, r_B), max(b_A, b_B))
    wherein (l_A, t_A) and (r_A, b_A) denote the upper-left and lower-right corner coordinates of the frame to be measured A, respectively, and (l_B, t_B) and (r_B, b_B) denote the upper-left and lower-right corner coordinates of the frame to be measured B, respectively.
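The claim-9 criterion and the claim-10 corner formulas fit in a few lines. This sketch assumes frames as (left, top, right, bottom) tuples and returns None when the frames should stay separate.

```python
def merge_frames(frame_a, frame_b):
    la, ta, ra, ba = frame_a
    lb, tb, rb, bb = frame_b
    # Claim 10: candidate frame C is the bounding box of A and B
    lc, tc = min(la, lb), min(ta, tb)
    rc, bc = max(ra, rb), max(ba, bb)
    def area(l, t, r, b):
        return (r - l) * (b - t)
    # Claim 9: merge only if C is no larger than A and B combined
    if area(lc, tc, rc, bc) <= area(*frame_a) + area(*frame_b):
        return (lc, tc, rc, bc)
    return None  # keep A and B as separate frames
```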
  11. The method according to claim 9, wherein the upper-left corner coordinates (l, t) and lower-right corner coordinates (r, b) of the face positioning frame are:
    (l, t) = (l_C+l′, t_C+t′)
    (r, b) = (r_C+r′, b_C+b′)
    wherein (l_C, t_C) and (r_C, b_C) denote the upper-left and lower-right corner coordinates of the frame to be measured C, respectively, and (l′, t′) and (r′, b′) denote the upper-left and lower-right corner coordinates, output by the convolutional neural network, of any face positioning frame within the frame to be measured C.
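Transcribed literally, claim 11 offsets the CNN's upper-left output by C's upper-left corner and its lower-right output by C's lower-right corner; note that under the more common convention of crop-relative coordinates both offsets would be (l_C, t_C), so the last line of this sketch follows the claim as written.

```python
def face_frame_in_image(frame_c, face_in_c):
    # frame_c: (l_C, t_C, r_C, b_C); face_in_c: (l', t', r', b') from the CNN
    lc, tc, rc, bc = frame_c
    l, t, r, b = face_in_c
    return (lc + l, tc + t,  # upper-left corner, per claim 11
            rc + r, bc + b)  # lower-right corner, per claim 11 as written
```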
  12. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 11.
  13. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 11.
PCT/CN2021/091026 2021-03-25 2021-04-29 Rapid facial detection method based on multi-layer preprocessing WO2022198751A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2022512825A JP7335018B2 (en) 2021-03-25 2021-04-29 A Fast Face Detection Method Based on Multilayer Preprocessing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110322204.7 2021-03-25
CN202110322204.7A CN113204991B (en) 2021-03-25 2021-03-25 Rapid face detection method based on multilayer preprocessing

Publications (1)

Publication Number Publication Date
WO2022198751A1 true WO2022198751A1 (en) 2022-09-29

Family

ID=77025720

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091026 WO2022198751A1 (en) 2021-03-25 2021-04-29 Rapid facial detection method based on multi-layer preprocessing

Country Status (3)

Country Link
JP (1) JP7335018B2 (en)
CN (1) CN113204991B (en)
WO (1) WO2022198751A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114694233B (en) * 2022-06-01 2022-08-23 成都信息工程大学 Multi-feature-based method for positioning human face in examination room monitoring video image

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100354875C (en) * 2005-09-29 2007-12-12 上海交通大学 Red eye moving method based on human face detection
US20080107341A1 (en) * 2006-11-02 2008-05-08 Juwei Lu Method And Apparatus For Detecting Faces In Digital Images
CN102324025B (en) * 2011-09-06 2013-03-20 北京航空航天大学 Human face detection and tracking method based on Gaussian skin color model and feature analysis
CN104331690B (en) * 2014-11-17 2017-08-29 成都品果科技有限公司 A kind of colour of skin method for detecting human face and system based on single image
CN108230331A (en) * 2017-09-30 2018-06-29 深圳市商汤科技有限公司 Image processing method and device, electronic equipment, computer storage media
CN109961016B (en) * 2019-02-26 2022-10-14 南京邮电大学 Multi-gesture accurate segmentation method for smart home scene

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040034611A1 (en) * 2002-08-13 2004-02-19 Samsung Electronics Co., Ltd. Face recognition method using artificial neural network and apparatus thereof
CN103632132A (en) * 2012-12-11 2014-03-12 广西工学院 Face detection and recognition method based on skin color segmentation and template matching
CN106485222A (en) * 2016-10-10 2017-03-08 上海电机学院 A kind of method for detecting human face being layered based on the colour of skin
CN110706295A (en) * 2019-09-10 2020-01-17 中国平安人寿保险股份有限公司 Face detection method, face detection device and computer-readable storage medium
CN111191532A (en) * 2019-12-18 2020-05-22 深圳供电局有限公司 Face recognition method and device based on construction area and computer equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QU SHIRU, XIONG BO: "A New and Better Face Detection Algorithm Using Gabor Filter and Neural Network", JOURNAL OF NORTHWESTERN POLYTECHNICAL UNIVERSITY, XIBEI GONGYE DAXUE, SHAANXI, CN, vol. 29, no. 5, 31 October 2011 (2011-10-31), CN, pages 690-694, XP055969052, ISSN: 1000-2758 *
ZHU ZHENGPING, SUN CUAN-QING, WANG YANG-PING: "Face Detection Method based on Complexion and Template Matching", ZIDONGHUA YU YIQI YIBIAO - PROCESS AUTOMATION INSTRUMENTATION, CHONGQING GONGYE ZIDONGHUA YIBIAO YANJIUSUO, CN, no. 6, 31 December 2008 (2008-12-31), CN, XP055969051, ISSN: 1001-9227 *

Also Published As

Publication number Publication date
JP7335018B2 (en) 2023-08-29
CN113204991A (en) 2021-08-03
JP2023522501A (en) 2023-05-31
CN113204991B (en) 2022-07-15

Similar Documents

Publication Publication Date Title
Li et al. Multi-angle head pose classification when wearing the mask for face recognition under the COVID-19 coronavirus epidemic
US10783354B2 (en) Facial image processing method and apparatus, and storage medium
CA2867365C (en) Method, system and computer storage medium for face detection
CN110084135B (en) Face recognition method, device, computer equipment and storage medium
Yan et al. One extended OTSU flame image recognition method using RGBL and stripe segmentation
US8358813B2 (en) Image preprocessing
CN110163842B (en) Building crack detection method and device, computer equipment and storage medium
WO2018040756A1 (en) Vehicle body colour identification method and device
US8358812B2 (en) Image Preprocessing
JP6932402B2 (en) Multi-gesture fine division method for smart home scenes
US8244004B2 (en) Image preprocessing
Hajraoui et al. Face detection algorithm based on skin detection, watershed method and gabor filters
Kalbkhani et al. An efficient algorithm for lip segmentation in color face images based on local information
WO2022198751A1 (en) Rapid facial detection method based on multi-layer preprocessing
CN111709305B (en) Face age identification method based on local image block
Daithankar et al. Analysis of skin color models for face detection
Parente et al. Assessing facial image accordance to ISO/ICAO requirements
Yi et al. Face detection method based on skin color segmentation and facial component localization
CN114038030A (en) Image tampering identification method, device and computer storage medium
Das et al. A novel approach towards detecting faces and gender using skin segmentation and template matching
Alrjebi et al. Two directional multiple colour fusion for face recognition
Huang et al. Eye detection based on skin color analysis with different poses under varying illumination environment
CN111209922B (en) Image color system style marking method, device, equipment and medium based on svm and opencv
Zhang et al. A method of facial wearable items recognition
WO2024066050A1 (en) Facial recognition method and apparatus based on visual template and pyramid strategy

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2022512825

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21932368

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21932368

Country of ref document: EP

Kind code of ref document: A1