CN111914601A - Efficient batch face recognition and matting system based on deep learning - Google Patents

Efficient batch face recognition and matting system based on deep learning

Info

Publication number
CN111914601A
Authority
CN
China
Prior art keywords
face
batch
module
matting
image
Prior art date
Legal status
Pending
Application number
CN201910387472.XA
Other languages
Chinese (zh)
Inventor
陈支泽
朱振宇
Current Assignee
Nanjing Shineng Intelligent Technology Co ltd
Original Assignee
Nanjing Shineng Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Shineng Intelligent Technology Co ltd filed Critical Nanjing Shineng Intelligent Technology Co ltd
Priority to CN201910387472.XA
Publication of CN111914601A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field, by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/30 Noise filtering

Abstract

The invention discloses an efficient batch face recognition and matting system based on deep learning, comprising: a multi-thread video decoding module that acquires pedestrian video streams from multiple cameras and decodes each stream in its own thread to obtain a digital representation of each image; a face detection module that maintains a queue for receiving the decoded images, splices multiple images into a batch, and applies an improved cascade face detection network to obtain face regions and face key point coordinates; a face recognition module that extracts face features with an improved lightweight neural network and performs comparison and recognition; and a face matting module that crops, for each recognized face, a region centered on the face with user-specified width and height, blurring any other faces contained in the region to protect their portrait rights and privacy.

Description

Efficient batch face recognition and matting system based on deep learning
Technical Field
The invention relates to the field of face detection and recognition, in particular to an efficient batch face recognition and matting system based on deep learning.
Background
Face detection and recognition in real scenes has long been a hot topic in computer vision. The difficulty lies in the fact that faces in real scenes are accompanied by interference from complex poses, lighting, expressions, occlusion and other factors, so designing a robust and fast algorithm to detect and recognize faces is a key problem that urgently needs to be solved. Traditional face detection and recognition methods extract regions of interest using hand-crafted local descriptors such as LBP, HOG and Gabor features, classify them with an ensemble classifier, and judge the detection and recognition results. However, hand-crafted feature extraction is not robust enough, struggles to cope with noise variation in real scenes, and is not efficient enough to meet real-time requirements. In recent years, face detection and recognition techniques based on deep learning have achieved remarkable results. Compared with traditional methods, deep learning approaches require no manual feature design: they extract features through multiple levels of neurons and nonlinear activation units, and learn from large numbers of samples in the training stage, greatly improving detection and recognition accuracy. On the other hand, owing to the development of modern parallel computing units and parallel techniques, their real-time performance can meet scene requirements, making deep learning an important means of bringing face detection and recognition into practical deployment.
MTCNN is an important deep learning-based face detection method that finds face regions and face key points in an image through multiple stages of candidate boxes. The original MTCNN network structure is relatively inefficient: when image resolutions in real scenes are large, inference time struggles to meet real-time requirements. In particular, when a single server accesses multiple cameras for face detection, sequentially detecting the images from each camera is very slow; if the detected faces are then recognized sequentially with an ordinary convolutional neural network, the running speed drops further. In some scenes it is also difficult to efficiently crop a background region containing a face and push it to the user while protecting the portrait rights and privacy of other people in the background.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an efficient batch face recognition and matting system based on deep learning, which can rapidly process face images collected by multiple cameras in batches, complete detection and recognition tasks, and crop the recognized face within a region containing some background; if the region contains other faces, they can be mosaicked to protect portrait rights and privacy. The system comprises a multi-thread video decoding module, a face detection module, a face recognition module and a face matting module.
In a first aspect, the multi-thread video decoding module receives the image streams collected by the cameras using multiple independent threads, decodes each image into the matrix form used inside the computer, and inserts it into a shared queue.
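As an illustration of this aspect, the following is a minimal sketch of per-camera decoding threads feeding one shared queue, assuming OpenCV handles the stream decoding; the stream URLs and queue size are hypothetical, not taken from the patent:

    import threading, queue
    import cv2  # OpenCV decodes RTSP/H.265 streams via its FFmpeg backend

    frame_queue = queue.Queue(maxsize=256)  # shared queue of decoded frames

    def decode_stream(url, cam_id):
        """Decode one camera stream in its own thread."""
        cap = cv2.VideoCapture(url)
        while cap.isOpened():
            ok, frame = cap.read()       # frame is an HxWx3 matrix
            if not ok:
                break
            frame_queue.put((cam_id, frame))
        cap.release()

    urls = ["rtsp://camera0/stream", "rtsp://camera1/stream"]  # hypothetical
    for i, u in enumerate(urls):
        threading.Thread(target=decode_stream, args=(u, i), daemon=True).start()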
In a second aspect, the present invention provides a face detection module for detecting a face contained in an image, including:
(1) copying the decoded image queue once and applying median filtering to the copies; the median filtering reduces salt-and-pepper noise in the images, avoiding false detections and speeding up detection and subsequent recognition;
(2) splicing the filtered images row-wise into one complete batch for the face detection module (see the sketch after this list); the purpose of batching is that, on hardware with multi-core computing units, detecting a batch of face data at once with parallel computing is far more efficient than detecting single images sequentially;
(3) designing a new cascade detection network, Alpha-MTCNN, which changes the minimum input of the PNet network in the original MTCNN from 12 × 12 to 24 × 24, satisfying the minimum face size required in real scenes while narrowing the range of the image pyramid to speed up multi-scale detection;
(4) changing the sizes and numbers of convolution kernels in PNet, RNet and ONet, reducing convolution computation with depthwise separable convolution, and replacing the original max pooling with strided convolutions; BN layers are introduced to counter the shifts in data distribution caused by frequent parameter updates in the network;
(5) attaching a separate LNet network after the ONet to extract the face key points.
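As referenced in step (2), here is a minimal sketch of row-wise splicing of decoded frames into one large image for batch detection; the common frame size is an assumption, not the patent's exact scheme:

    import cv2
    import numpy as np

    def splice_rows(frames, size=(1280, 720)):
        """Resize frames to a common size and stack them vertically."""
        resized = [cv2.resize(f, size) for f in frames]
        return np.concatenate(resized, axis=0)  # shape: (n*720, 1280, 3)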
In a third aspect, the present invention provides a face recognition module for recognizing the detected faces, including:
(1) aligning the detected faces, scaling each aligned face to a size of 112 × 112, and collecting all resulting faces into one batch;
(2) feeding the recognition network a four-dimensional tensor of shape batch × 3 × 112 × 112, where batch is the number of aligned faces, 3 is the number of RGB image channels, and 112 × 112 is the image width and height; extracting feature vectors in batch with the improved, efficient lightweight neural network MobileFaceNet on a GPU or a server with many computing cores maximizes the utilization of parallel computing units, compared with extracting features from single images sequentially (a tensor-assembly sketch follows this list);
(3) normalizing the extracted face feature vectors, comparing them by cosine distance with the face vectors of known face labels in the base library, and obtaining the recognized face label with a nearest neighbor classifier.
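A minimal sketch of assembling the batch × 3 × 112 × 112 input tensor from aligned face crops; the channel order conversion and normalization constants are assumptions for illustration:

    import cv2
    import numpy as np

    def to_batch_tensor(aligned_faces):
        """aligned_faces: list of 112x112 BGR crops -> (N, 3, 112, 112) float32."""
        batch = np.stack([cv2.cvtColor(f, cv2.COLOR_BGR2RGB) for f in aligned_faces])
        batch = batch.astype(np.float32) / 127.5 - 1.0   # scale to [-1, 1] (assumed)
        return batch.transpose(0, 3, 1, 2)               # NHWC -> NCHW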
In a fourth aspect, the present invention provides a face matting module for cropping the region around a face of interest in an image, comprising:
(1) taking the mean of the four corner coordinates of the recognized face rectangle as the center point and expanding to the specified width and height to form a rectangular region;
(2) polling all other, non-interesting faces and computing the intersection over union (IoU) between each such face region and the rectangular region above;
(3) if the IoU is greater than 0, applying median filtering to that face; the other, unrecognized faces are thereby blurred, protecting their portrait rights.
Drawings
FIG. 1 is a basic flow diagram of the present invention;
FIG. 2 is a basic flow chart of the face detection method of the present invention;
FIG. 3 is a diagram of a novel PNet face detection network architecture in accordance with the present invention;
FIG. 4 is a diagram of a novel RNet face detection network architecture in accordance with the present invention;
FIG. 5 is a diagram of a novel ONet face detection network architecture in accordance with the present invention;
FIG. 6 is a diagram of a novel LNet face key point detection network architecture in accordance with the present invention;
FIG. 7 is a schematic view of face matting.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the specific embodiments. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
FIG. 1 is a schematic diagram of the efficient batch face recognition and matting system based on deep learning according to the present invention. Referring to FIG. 1, the system provided in this embodiment includes a multi-thread video decoding module, a face detection module, a face recognition module, and a face matting module. First, the video decoding module receives image data transmitted by multiple cameras; each camera is decoded in its own separate thread, which prevents artifacts such as screen corruption and frame disorder caused by delay. The decoding protocol can be H.265 or another protocol. The image data decoded by each thread is passed to the subsequent face detection module.
In this implementation, the face detection module completes the detection task after image decoding, obtaining the 4 coordinates of the face rectangle and the 5 coordinates of the face key points, where the 4 coordinates correspond to the four corners of the face rectangle and the 5 key points are the two eyes, the nose, and the two corners of the mouth. The face detection module comprises image batching, image smoothing, and the four novel multi-stage networks PNet, RNet, ONet and LNet.
The detection steps of the face detection module are shown in FIG. 2. The image is first smoothed and then sampled at multiple Gaussian scales; the processed image is fed to the novel PNet network to obtain preliminary candidate boxes, and Non-Maximum Suppression (NMS) is applied to the faces detected by PNet to filter repeated candidate regions. The filtered candidate boxes are passed to RNet to further screen out non-face regions, followed by another round of NMS. The candidates from RNet are refined by ONet into final candidate boxes, which are processed by NMS once more to obtain the final face regions. The LNet attached after ONet then extracts the face key points.
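Since NMS is applied after every stage of the cascade, a minimal sketch of standard greedy NMS over (x1, y1, x2, y2) boxes with scores may clarify this filtering step; the 0.5 IoU threshold is an assumption, not the patent's setting:

    import numpy as np

    def nms(boxes, scores, iou_thresh=0.5):
        """Greedy non-maximum suppression; returns indices of kept boxes."""
        order = scores.argsort()[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # IoU of the top-scoring box against the remaining boxes
            x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
            y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
            x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
            y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
            inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
            iou = inter / (area_i + areas - inter)
            order = order[1:][iou <= iou_thresh]  # drop heavily overlapping boxes
        return keep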
In a specific implementation, the face detection module runs the detection task in an independent thread that holds a shared queue receiving the images produced by the decoding threads. When the detector is idle, the images in the shared queue are spliced row-wise into one large image. The purpose of batching is to suit parallel computing cores, greatly improving detection efficiency and the utilization of hardware resources; parallel computing modes include but are not limited to multi-core, multi-thread and single-instruction-multiple-data, and are not detailed here. The large batched image is copied once, and the subsequent detection processing is applied to the copy: the preprocessing operations in the detection task lose some image information, while the later recognition stage needs that information intact.
Median filtering is a nonlinear smoothing technique that sets the gray value of each pixel to the median of the gray values in a neighborhood window around that pixel; its effect on impulse and salt-and-pepper noise is pronounced. Median filtering is used in this implementation to smooth the image and remove its noise. When the scene facing the camera is complex, for example under dim light or with many dense points, introducing the filtering operation reduces the number of preliminary face candidate boxes to some extent, which both lowers the false detection rate and speeds up detection.
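A one-line illustration of this preprocessing step with OpenCV; the 5 × 5 kernel size and input path are assumptions:

    import cv2
    frame = cv2.imread("frame.jpg")       # hypothetical decoded frame
    smoothed = cv2.medianBlur(frame, 5)   # each pixel -> median of its 5x5 neighborhood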
Optionally, the image smoothing preprocessing is not limited to median filtering; Gaussian filtering and mean filtering may also be used, and the specific choice can be tried out for the specific scene, which is not described in detail here.
In this embodiment, the face detection task uses a multi-stage Convolutional Neural Network (CNN). A convolutional neural network is a feedforward network composed of multiple layers of convolutional filters and activation units; it has driven major breakthroughs in computer vision because it requires no hand-designed feature extraction and performs prediction accurately when training data is abundant. MTCNN is a multi-stage convolutional network for face detection, but the original version runs inefficiently and is easily disturbed by noise into false detections. The invention improves and optimizes MTCNN in image input preprocessing, network structure and other respects, producing a new efficient face detection network, Alpha-MTCNN. Alpha-MTCNN passes the input image through the novel PNet and RNet networks in turn; the novel ONet network outputs the region coordinates of all faces in the image, and the novel LNet network outputs the faces' key point coordinates.
Specifically, the Alpha-MTCNN method changes the minimum input of the original PNet network from 12 × 12 to 24 × 24, satisfying the minimum face size required in real scenes, and narrows the range of the image Gaussian pyramid to speed up multi-scale detection. It changes the sizes and numbers of convolution kernels in PNet, RNet and ONet, reduces convolution computation with depthwise separable convolution, and replaces the original max pooling with strided convolutions. In addition, Batch Normalization (BN) is applied in each neural network to counter the shifts in data distribution caused by frequent parameter updates during training, accelerating training convergence; the steps are as follows:
(1) calculating the mean of the batch data:
$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$
(2) calculating the batch data variance:
$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$
(3) normalizing the input data:
$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
(4) performing scale transformation and shift:
$y_i = \gamma \hat{x}_i + \beta$
where $m$ is the batch size, $\epsilon$ is a small constant for numerical stability, and $\gamma, \beta$ are learned parameters.
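For illustration, a minimal numpy sketch of the four steps above on one mini-batch; the default γ, β and ε values are assumptions:

    import numpy as np

    def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
        """x: (m, features). Implements the four BN steps above."""
        mu = x.mean(axis=0)                    # (1) mini-batch mean
        var = x.var(axis=0)                    # (2) mini-batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)  # (3) normalize
        return gamma * x_hat + beta            # (4) scale and shift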
In particular, the details of the new PNet network structure used above are shown in FIG. 3. The new PNet is a fully convolutional network that accepts any input larger than 24 × 24. A 3 × 3 convolution with stride 2 is applied to the input image; the resulting feature map goes through a depthwise separable convolution, i.e., a grouped 3 × 3 convolution followed by an ordinary 1 × 1 convolution; the new feature map goes through another depthwise separable convolution with the same kernel size; on that basis, a depthwise separable convolution with stride 2 is applied; finally, one more depthwise separable convolution with the same kernel size. After these convolution operations, a softmax classification layer and an ordinary regression layer yield the probability that a face is present and the coordinates of the face rectangle.
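A minimal PyTorch sketch of the depthwise separable convolution block this description relies on, i.e., a grouped 3 × 3 convolution followed by an ordinary 1 × 1 convolution, each with BN; the channel counts, activation choice and bias settings are assumptions, not the patent's exact layers:

    import torch.nn as nn

    def depthwise_separable(c_in, c_out, stride=1):
        """3x3 depthwise (grouped) conv + 1x1 pointwise conv, each with BN."""
        return nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in, bias=False),
            nn.BatchNorm2d(c_in),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_out, 1, bias=False),   # pointwise conv mixes channels
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    block = depthwise_separable(16, 32, stride=2)  # e.g., a stride-2 stage (hypothetical widths)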
Details of the new RNet network structure are shown in FIG. 4. The new RNet is a fully convolutional network that accepts the face candidate boxes surviving PNet filtering, scaled to 24 × 24, as input. A 2 × 2 convolution with stride 2 is first applied to the input image; it is then followed in sequence by four depthwise separable convolutions of kernel size 3 × 3, the third of which uses stride 2. After these convolution operations, a softmax classification layer and an ordinary regression layer yield the face probability and the coordinates of the face rectangle.
Details of the new ONet network structure are shown in FIG. 5. The new ONet is a fully convolutional network that accepts the face candidate boxes surviving RNet filtering, scaled to 48 × 48, as input. A 3 × 3 convolution with stride 2 and padding 1 is applied first; then three depthwise separable convolutions with kernel size 3 × 3, stride 2 and padding 1 in sequence; finally, one depthwise separable convolution layer with kernel size 3 × 3. After these convolution operations, a softmax classification layer and an ordinary regression layer yield the face probability and the face rectangle coordinates; at this point the final face detection region coordinates are obtained.
Details of the new LNet network structure are shown in FIG. 6. The new LNet is a fully convolutional network that accepts the face candidate boxes surviving ONet filtering and extracts the face key point coordinates. Each candidate box is scaled to 48 × 48 as input. A 3 × 3 convolution with stride 2 and padding 1 is applied first; then a depthwise separable convolution layer with kernel size 2 × 2, stride 2 and padding 1; then three depthwise separable convolutions with kernel size 3 × 3, stride 2 and padding 1 in sequence; finally, a depthwise separable convolution with kernel size 3 × 3. After these convolution operations, a regression layer yields the coordinate positions of the face key points.
The network structures used in this implementation include but are not limited to the above; the specific number of convolution kernels per layer in each network can be tuned to the specific data and scenario, and is not described in detail here.
The face recognition module of the invention completes the tasks of feature extraction and comparison after detection.
Preferably, the face recognition module aligns the detected faces: using the 5 detected face key point coordinates, a similarity transformation maps each face into a uniform pose, the aligned faces are scaled to 112 × 112, and all detected faces form one batch. The purpose of batching is to suit parallel computing cores, greatly improving face recognition efficiency and the utilization of hardware resources. Parallel computing modes include but are not limited to multi-core, multi-thread and single-instruction-multiple-data, and are not described in detail here.
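A minimal sketch of this key-point-based alignment using a similarity transform from scikit-image; the 5 reference landmark coordinates are a commonly used 112 × 112 template and are an assumption, not values given in the patent:

    import numpy as np
    from skimage import transform

    # Assumed 5-point reference template (eyes, nose, mouth corners) for 112x112
    REF = np.array([[38.29, 51.70], [73.53, 51.50], [56.02, 71.74],
                    [41.55, 92.37], [70.73, 92.20]], dtype=np.float32)

    def align_face(img, landmarks):
        """landmarks: (5, 2) detected key points -> aligned 112x112 crop."""
        tform = transform.SimilarityTransform()
        tform.estimate(landmarks, REF)   # fit rotation + scale + translation
        return transform.warp(img, tform.inverse, output_shape=(112, 112))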
In this embodiment, the input format of the face recognition network is a four-dimensional tensor, batch × 3 × 112 × 112, where batch is the number of aligned faces, 3 is the number of RGB image channels, and 112 × 112 is the image width and height. The implementation extracts feature vectors in batch with the efficient, improved lightweight neural network MobileFaceNet; the MobileFaceNet used outputs a fully connected 512-dimensional feature vector to strengthen the image representation and give stronger discrimination between faces, improving face comparison accuracy.
The face recognition module normalizes the extracted 512-dimensional face feature vectors, compares them with the face vectors of known face labels, calculates the cosine distance, and obtains the recognized face label with a nearest neighbor classifier. The cosine distance is computed from the cosine of the angle between feature vectors $a$ and $b$:
$\cos\theta = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$
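A minimal sketch of this comparison step; the normalization and nearest neighbor search follow the description above, while the function and variable names are hypothetical and the gallery rows are assumed pre-normalized:

    import numpy as np

    def identify(feature, gallery, labels):
        """feature: (512,) query; gallery: (N, 512) known-label vectors (L2-normalized)."""
        feature = feature / np.linalg.norm(feature)   # normalize the query vector
        cos = gallery @ feature                       # cosine similarity to each entry
        best = int(np.argmax(cos))                    # nearest neighbor classifier
        return labels[best], 1.0 - cos[best]          # recognized label, cosine distance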
the face matting module of the invention needs to find the area with the specified size containing the face (the interesting face) from the original image for each identified face, and the user can select to push the face area. If other faces (non-interesting faces) are found in the area, a blurring process is needed to protect the portrait right and privacy of the user. The method comprises the following specific steps:
(1) the method comprises the steps of taking the average value of four coordinate positions of an identified face rectangle as a central point, taking specified width and height as boundaries for expansion, wherein the width and height can be set by a user to form an expanded rectangular area;
(2) polling all other non-interesting faces, comparing the intersection ratios of these face regions and the above rectangular regions IoU;
(3) if IoU is greater than 0, median filtering is carried out on the face, and at the moment, other faces which are not recognized are blurred, so that the portrait right is protected.
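A minimal sketch of steps (1) to (3), assuming boxes given as (x1, y1, x2, y2) pixel coordinates; the crop size and blur kernel are hypothetical:

    import cv2

    def iou(a, b):
        """Intersection over union of two (x1, y1, x2, y2) boxes."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    def matte_face(img, face_box, other_boxes, out_w=400, out_h=400):
        """Crop a region centered on the face; blur any other face inside it."""
        cx = (face_box[0] + face_box[2]) / 2           # (1) center from box corners
        cy = (face_box[1] + face_box[3]) / 2
        x1, y1 = int(cx - out_w / 2), int(cy - out_h / 2)
        region = [x1, y1, x1 + out_w, y1 + out_h]
        for b in other_boxes:                          # (2) poll non-interesting faces
            if iou(region, b) > 0:                     # (3) blur overlapping faces in place
                bx1, by1, bx2, by2 = map(int, b)
                img[by1:by2, bx1:bx2] = cv2.medianBlur(img[by1:by2, bx1:bx2], 31)
        return img[max(y1, 0):y1 + out_h, max(x1, 0):x1 + out_w]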
FIG. 7 demonstrates the face matting effect achieved by the present invention: the left side shows the original image, and the right side the image after face matting, in which the other faces are blurred. Through the face matting module, the invention crops out the face of the identified person of interest; the expanded rectangular region contains not only the face but also part of the background image, so the crop does not look abrupt, improving user experience to a certain extent. The non-interesting faces are mosaicked in the pushed picture, ensuring that the users' portrait rights and privacy are not disclosed.

Claims (14)

1. An efficient batch face recognition and matting system based on deep learning, characterized by comprising the following modules:
a multi-thread video decoding module, which acquires face video from multiple camera channels and decodes each video stream in a single thread to obtain a matrix representation of each image;
a face detection module, which maintains a queue receiving the images obtained by multi-thread decoding, splices multiple images to facilitate batch processing, and uses a newly designed improved cascade face detection network, Alpha-MTCNN (Alpha Multi-task Convolutional Neural Network), to extract face regions and face key point coordinates;
a face recognition module, which extracts face features using the improved lightweight neural network MobileFaceNet and performs comparison and recognition;
and a face matting module, which crops, for each face image of interest, a region centered on the face with the height and width specified by the user, and blurs any other faces contained in the region to protect the users' portrait rights and privacy.
2. The efficient batch face recognition and matting system based on deep learning of claim 1, wherein the video decoding module receives pedestrian videos of different regions collected by multiple high-definition cameras, decodes them using multiple independent threads on a shared server, and places the decoded images in a shared queue.
3. The efficient batch face recognition and matting system based on deep learning of claim 1, wherein the face detection module performs median filtering on the input face image to reduce noise, thereby reducing redundant face candidate boxes in the cascade network and improving the speed of face detection.
4. The efficient batch face recognition and matting system based on deep learning of claim 1, wherein the face detection module splices the images taken from the image queue row-wise into batches and performs batch detection on a graphics processing unit (GPU) or a server with multiple computing cores.
5. The efficient batch face recognition and matting system based on deep learning of claim 3, wherein the face detection module smooths the image with median filtering before face detection; the purpose is to smooth the image and remove its noise, and when the scene facing the camera is complex, for example under dim light or with many dense points, introducing the filtering operation reduces the number of preliminary face candidate boxes to some extent, which both lowers the false detection rate and speeds up detection.
6. The efficient batch face recognition and matting system based on deep learning of claim 3, wherein the face detection module improves the original cascade face detection network MTCNN (Multi-task Convolutional Neural Network) in multiple respects to obtain Alpha-MTCNN.
7. The efficient batch face recognition and matting system based on deep learning of claim 6, wherein the Alpha-MTCNN method changes the minimum input of the original PNet network from 12 × 12 to 24 × 24, satisfying the minimum face size required in real scenes, and narrows the range of the multi-scale Gaussian image pyramid, thereby speeding up multi-scale detection.
8. The efficient batch face recognition and matting system based on deep learning of claim 6, wherein the Alpha-MTCNN method changes the sizes and numbers of convolution kernels in PNet, RNet and ONet, reduces convolution computation using depthwise separable convolution, and replaces the original max pooling with strided convolutions during convolution; Batch Normalization (BN) is used to counter the shifts in data distribution caused by frequent parameter updates during training, with the following steps:
(1) calculating the mean of the batch data:
$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$
(2) calculating the batch data variance:
$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$
(3) normalizing the input data:
$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
(4) performing scale transformation and shift:
$y_i = \gamma \hat{x}_i + \beta$
9. The system of claim 6, wherein in Alpha-MTCNN a separate LNet network is attached after the ONet for face key point extraction.
10. The efficient batch face recognition and matting system based on deep learning of claim 1, wherein the face detection module returns all detected face region coordinates and face key point coordinates in batch to the subsequent recognition module.
11. The system of claim 1, wherein the face recognition module aligns the detected faces using a similarity transformation, scales the aligned faces to 112 × 112, and combines all the detected faces into one batch.
12. The efficient batch face recognition and matting system based on deep learning according to claim 10, wherein the input format of the face recognition module's network is a four-dimensional tensor, batch × 3 × 112 × 112, where batch is the number of aligned faces, 3 is the number of RGB image channels, and 112 × 112 is the image width and height; extracting feature vectors in batch with the improved efficient lightweight neural network MobileFaceNet maximizes the efficiency of parallel computing units on a GPU or a server with multiple computing cores, compared with extracting features from single images sequentially.
13. The efficient batch face recognition and matting system based on deep learning of claim 1, wherein the face recognition module normalizes the extracted face feature vectors, compares them with the face vectors of known face labels, calculates cosine distances, and obtains the recognized face label using a nearest neighbor classifier.
14. The efficient batch face recognition and matting system according to claim 1, wherein for each recognized face the face matting module finds a region of the specified size containing that face in the original image, the specific implementation comprising:
taking the mean of the four corner coordinates of the recognized face rectangle as the center point and expanding to the specified width and height to form a rectangular region;
polling all other non-interesting faces and computing the intersection over union (IoU) between each such face region and the rectangular region above;
if the IoU is greater than 0, applying median filtering to that face, blurring the other non-interesting faces and protecting portrait rights and privacy.
CN201910387472.XA 2019-05-10 2019-05-10 Efficient batch face recognition and matting system based on deep learning Pending CN111914601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910387472.XA CN111914601A (en) 2019-05-10 2019-05-10 Efficient batch face recognition and matting system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910387472.XA CN111914601A (en) 2019-05-10 2019-05-10 Efficient batch face recognition and matting system based on deep learning

Publications (1)

Publication Number Publication Date
CN111914601A (en) 2020-11-10

Family

ID=73242521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910387472.XA Pending CN111914601A (en) 2019-05-10 2019-05-10 Efficient batch face recognition and matting system based on deep learning

Country Status (1)

Country Link
CN (1) CN111914601A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712489A (en) * 2020-12-31 2021-04-27 北京澎思科技有限公司 Method, system and computer readable storage medium for image processing
CN112734682A (en) * 2020-12-31 2021-04-30 杭州艾芯智能科技有限公司 Face detection surface vector data acceleration method, system, computer device and storage medium
CN112734682B (en) * 2020-12-31 2023-08-01 杭州芯炬视人工智能科技有限公司 Face detection surface vector data acceleration method, system, computer device and storage medium
CN113807327A (en) * 2021-11-18 2021-12-17 武汉博特智能科技有限公司 Deep learning side face image processing method and system based on light compensation
CN116030524A (en) * 2023-02-09 2023-04-28 摩尔线程智能科技(北京)有限责任公司 Face recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20201110)