CN114463800A - Multi-scale feature fusion face detection and segmentation method based on generalized intersection over union - Google Patents

Multi-scale feature fusion face detection and segmentation method based on generalized intersection over union

Info

Publication number
CN114463800A
Authority
CN
China
Prior art keywords
face
loss
target
mask
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011251701.4A
Other languages
Chinese (zh)
Inventor
吕巨建
林凯瀚
赵慧民
陈荣军
熊建斌
战荫伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN202011251701.4A priority Critical patent/CN114463800A/en
Publication of CN114463800A publication Critical patent/CN114463800A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20092 Interactive image processing based on input by user
    • G06T2207/20104 Interactive definition of region of interest [ROI]
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Abstract

The invention discloses a multi-scale feature fusion face detection and segmentation method based on the generalized intersection over union, which comprises the following steps: S1: preprocess the face image to be detected, input it into a Mask R-CNN model, and extract the corresponding feature map through a pre-trained deep neural network model; S2: generate candidate regions on the feature map through a region proposal network of preset size; S3: align the pixels of the input image with the feature map using candidate region matching (RoIAlign), and obtain a corresponding fixed-size feature map; S4: finally, classify the candidate regions and locate the bounding boxes using fully connected layers, predict pixel points using a fully convolutional network, and generate the corresponding binary mask to segment the face target from the background image. The invention improves recognition precision, brings the localization of image pixel points after multi-target face detection to pixel-level accuracy, and can acquire accurate face information from complex surveillance footage.

Description

Multi-scale feature fusion face detection and segmentation method based on generalized intersection over union
Technical Field
The invention belongs to the technical field of machine vision, and particularly relates to a multi-scale feature fusion face detection and segmentation method based on the generalized intersection over union.
Background
In recent years, with the rapid development and popularization of intelligent hardware devices such as smartphones, high-performance computers and intelligent robots, artificial intelligence technology has been applied to many aspects of daily life, such as autonomous driving, electronic commerce, intelligent robots, network security and smart homes, bringing great convenience to people's work and life. Human beings are the main service objects of artificial intelligence technology, so the acquisition of face information is particularly important, and the importance of face detection as a key link in acquiring face information is self-evident.
Face detection generally comprises two processes, face identification and face localization: the face is detected and located in an image or video through image processing, machine learning and other related technologies, thereby obtaining face information. Face detection is the first step of face-related applications and also their most critical link, and its effectiveness directly influences the performance of subsequent applications. Face detection has therefore become a research hotspot in the field of artificial intelligence, with very broad application prospects. Over the course of face detection research, existing work can be divided into methods based on traditional hand-crafted features and methods based on convolutional neural networks. Face detection methods based on traditional hand-crafted features mostly rely on a sliding-window framework or on matching feature points, and have an obvious speed advantage; face detection methods based on deep learning mainly use convolutional neural networks to extract features, achieve good results in terms of accuracy and multi-target detection, and, compared with traditional machine learning algorithms, can trade a small increase in computation time for a large improvement in accuracy, so deep-learning-based face detection algorithms have become the mainstream research direction for multi-target face detection.
The prior art is as follows. In terms of traditional hand-crafted features, face detection entered a practical stage with Viola-Jones, the first real-time and effective face detection method. The Viola-Jones algorithm uses Haar features for feature expression and then performs face detection through AdaBoost and a cascade structure, achieving real-time face detection in common scenes. However, the algorithm suffers from large feature dimensionality, a low recognition rate in complex scenes, and other shortcomings. In view of these disadvantages, researchers designed more elaborate hand-crafted features, such as Histogram of Oriented Gradients (HOG) features, Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF) and Local Binary Pattern (LBP) features, and implemented face detection in combination with classifiers such as the Support Vector Machine (SVM). In addition, the Deformable Part Model (DPM) proposed by Felzenszwalb et al. in 2010 is one of the important advances of the traditional hand-crafted feature approach; it adopts a multi-component strategy and combines improved HOG features with an SVM to improve target detection under different angles and deformations, achieving important breakthroughs in tasks such as face recognition and pedestrian detection. Methods based on traditional hand-crafted features realize real-time face detection well, but have certain defects, such as rather complex manual feature design, poor detection stability, and the untargeted nature of the sliding-window strategy, so there remains considerable room for improvement.
In terms of convolutional neural networks, as early as 1994 Vaillant et al. proposed using neural networks to detect faces; this method trained two convolutional neural networks (CNNs) for face detection, where one CNN classifies whether each pixel is part of a face and the other then outputs the exact face position. Subsequently, Rowley et al. proposed a neural network for face detection that determines, through a sliding window, whether a face is contained. With the use of deep convolutional neural networks by Krizhevsky et al. and their remarkable achievements in the ImageNet competitions, deep convolutional neural networks began to be applied to the field of face detection. The literature proposes a deep neural network for fast multi-scale face detection, which consists of a proposal sub-network and a detection sub-network: in the proposal sub-network, detection is performed at multiple output layers to match objects of different scales, and these detectors of complementary scales are combined into a multi-scale detector, improving the detection of multi-scale objects. The literature also combines the advantages of filtered channel features and deep CNNs to propose a Convolutional Channel Features (CCF) method, which has lower computation and storage costs than general end-to-end CNN methods.
Face detection methods based on deep convolutional neural networks achieve good results in terms of accuracy and multi-target detection, and can trade a small amount of extra computation time for a large improvement in accuracy, so deep-learning-based face detection algorithms have become the mainstream research direction of face detection.
Existing multi-target face detection algorithms mainly realize face detection and the localization of face target boxes. The extracted face target features have large dimensionality and coarse spatial quantization, accurate localization cannot be achieved, and a certain amount of background noise remains, which is not conducive to further image processing and makes it difficult to apply some efficient and practical image processing technologies (such as face image super-resolution reconstruction and face image correction) to surveillance video. Therefore, a multimedia-oriented multi-target face detection and segmentation method is urgently needed.
However, in mainstream prior-art research, face detection mainly achieves the classification of face versus background and the localization of a face bounding box; that is, a face is detected in the image under test and located by a bounding box. Within a detected face bounding box, however, the face usually occupies only part of the box, and the excess background image inside the box introduces redundant information, so the extracted face features suffer from heavy background noise, coarse spatial quantization and large feature dimensionality. This limits the application of several practical face-related technologies (e.g., face recognition, facial expression recognition, face super-resolution reconstruction, face pose correction). Moreover, most face segmentation methods are independent of the detection method and segment the image directly, which easily causes segmentation errors and low efficiency. Therefore, a multi-scale feature fusion face detection and segmentation method based on the generalized intersection over union is proposed to solve the problems mentioned in the background art.
Disclosure of Invention
The invention aims to provide a multi-scale feature fusion face detection and segmentation method based on the generalized intersection over union, so as to solve the problems raised in the background art.
In order to achieve this purpose, the invention provides the following technical scheme: a multi-scale feature fusion face detection and segmentation method based on the generalized intersection over union, comprising the following steps:
S1: preprocess the face image to be detected, input it into a Mask R-CNN model, and extract the corresponding feature map through a pre-trained deep neural network model;
S2: generate candidate regions on the feature map through a region proposal network of preset size;
S3: align the pixels of the input image with the feature map using candidate region matching (RoIAlign), and obtain a corresponding fixed-size feature map;
S4: finally, classify the candidate regions and locate the bounding boxes using fully connected layers, predict pixel points using a fully convolutional network, and generate the corresponding binary mask to segment the face target from the background image.
In the Mask R-CNN loss function of the Mask R-CNN model, the generalized intersection over union function replaces the traditional smooth L1 function in the bounding box regression loss, improving the detection precision for multi-target faces.
A multi-scale feature fusion strategy is adopted in the FPN: a reverse, bottom-up side connection path is added for multi-scale feature fusion, improving small-scale face detection performance.
The Mask R-CNN model completes three tasks within the same network architecture, namely detection and localization of target position information, classification of target versus background, and segmentation of the target from the background, so the loss function of this network architecture comprises three parts: localization loss, classification loss and segmentation loss.
The network global loss function is defined as follows: L = L_{cls} + L_{box} + L_{mask}, where L_{cls} is the classification loss, L_{box} is the localization loss, and L_{mask} is the segmentation loss.
Compared with the prior art, the invention has the following beneficial effects: in the method for face detection and segmentation based on the generalized intersection over union and multi-scale feature fusion, recognition precision is improved through the RoIAlign (candidate region matching) algorithm, and the localization of image pixel points after multi-target face detection reaches pixel-level accuracy, thereby meeting the precision requirement of the instance segmentation technique on pixel points.
The invention can perform instance segmentation on multi-target face images from surveillance video through an FCN (Fully Convolutional Network) algorithm, draw a binary face mask, and segment the face image from the background image, thereby reducing the interference of background noise and acquiring accurate face information from complex surveillance footage.
The invention screens prediction results through an MOB (Mask of Bounding Box) algorithm, improving recognition accuracy.
Drawings
FIG. 1 is a schematic flow chart of the detection algorithm of the present invention;
FIG. 2 is a schematic diagram of a GIoU of the present invention;
FIG. 3 is a schematic structural diagram of a ResNet101+ FPN framework according to the present invention;
FIG. 4 is a schematic structural diagram of the multi-scale feature fusion network of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1: the invention provides a multi-scale feature fusion face detection and segmentation method based on the generalized intersection over union, as shown in FIGS. 1-4, which comprises the following steps:
S1: preprocess the face image to be detected, input it into a Mask R-CNN model, and extract the corresponding feature map through a pre-trained deep neural network model;
S2: generate candidate regions on the feature map through a region proposal network of preset size;
S3: align the pixels of the input image with the feature map using candidate region matching (RoIAlign), and obtain a corresponding fixed-size feature map;
S4: finally, classify the candidate regions and locate the bounding boxes using fully connected layers, predict pixel points using a fully convolutional network, and generate the corresponding binary mask to segment the face target from the background image.
The invention relates to a multi-scale feature fusion face detection and segmentation method based on the generalized intersection over union; the specific embodiment of the overall detection method is as follows:
the network framework is expanded based on a Mask R-CNN model, the whole framework is shown in figure 1, firstly, a human face image to be detected is preprocessed and input into the model, and a corresponding characteristic diagram is extracted through a pre-trained deep neural network model; secondly, generating a candidate Region (Region of Interest, RoI) on the feature map through a Region suggestion Network (RPN) with a preset size; then, matching and corresponding the input image with the pixels of the feature map and acquiring a corresponding fixed-size feature map by using candidate Region matching (Region of Interest Align, RoIAlign); and finally, classifying the candidate area and positioning the boundary box by using a full connection layer, predicting pixel points by using a Full Convolution Network (FCN), and generating a corresponding binary mask to segment the face target from the background image.
Specifically, this document improves on the deficiencies of existing methods in multi-target face detection and small-scale face detection tasks. The Generalized Intersection over Union (GIoU) is used as the bounding box loss function to improve the detection precision of multi-target faces, and a multi-scale feature fusion strategy is adopted in the FPN network to improve small-scale face detection performance.
Mask R-CNN is one of the best current target detection and segmentation models; it completes three tasks within the same network architecture, namely detection and localization of target position information, classification of target versus background, and segmentation of the target from the background. Therefore, the overall network loss function includes three parts, namely localization loss, classification loss and segmentation loss, and is defined as follows:
L = L_{cls} + L_{box} + L_{mask}
where L_{cls} is the classification loss, L_{box} is the localization loss, and L_{mask} is the segmentation loss. Specifically, in the classification task, L_{cls} is:

L_{cls} = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)

where i corresponds to the i-th anchor point, N_{cls} is the number of classification samples, p_i is the predicted probability that the anchor is a target, and p_i^* is the label value, with p_i^* = 1 for a positive sample and p_i^* = 0 for a negative sample. L_{cls} is the binary cross-entropy loss:

L_{cls}(p_i, p_i^*) = -\left[ p_i^* \log p_i + (1 - p_i^*) \log(1 - p_i) \right]

For the segmentation branch, if a candidate box is detected as a given class, only the cross-entropy of that class is used as the error value, and the loss values of the other classes are not counted, which avoids competition among classes. The formula is:

L_{mask} = -\frac{1}{m^2} \sum_{1 \le i,j \le m} \left[ y_{ij} \log \hat{y}_{ij}^k + (1 - y_{ij}) \log(1 - \hat{y}_{ij}^k) \right]

where y_{ij} is the label value at coordinate point (i, j) of the m × m region and \hat{y}_{ij}^k is the predicted value of the k-th class at that point.
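As a concrete illustration of the two cross-entropy terms above, the following NumPy sketch computes the anchor classification loss and the per-class mask loss; the array shapes and the epsilon clamp are assumptions added for numerical stability, not part of the original formulation.

```python
import numpy as np

EPS = 1e-7  # assumed clamp to avoid log(0); not part of the formulas above

def bce(p, p_star):
    """Binary cross-entropy L(p, p*) for predicted probabilities p and 0/1 labels p*."""
    p = np.clip(p, EPS, 1.0 - EPS)
    return -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))

def classification_loss(p, p_star):
    """L_cls = (1 / N_cls) * sum_i L(p_i, p_i*): mean BCE over the anchors."""
    return bce(np.asarray(p), np.asarray(p_star)).mean()

def mask_loss(y_hat, y, k):
    """L_mask: mean BCE over the m x m mask of the detected class k only,
    so the other classes contribute no loss and do not compete."""
    return bce(y_hat[k], np.asarray(y)).mean()  # y_hat: (num_classes, m, m); y: (m, m)
```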
For the localization loss, unlike the original Mask R-CNN model, and in order to better reflect the overlap between the predicted and ground-truth bounding boxes, including the case where the two boxes do not intersect, the Generalized Intersection over Union (GIoU) function is used here as the loss function.
Specifically, as shown in FIG. 2, GIoU is defined as follows. Let the black rectangle and the blue rectangle be the prediction box and the ground-truth box respectively, with coordinates

B^p = (x_1^p, y_1^p, x_2^p, y_2^p) and B^g = (x_1^g, y_1^g, x_2^g, y_2^g)

where x_2 > x_1 and y_2 > y_1. Their areas are respectively:

A^p = (x_2^p - x_1^p)(y_2^p - y_1^p)
A^g = (x_2^g - x_1^g)(y_2^g - y_1^g)

The rectangle B^I at their intersection is given by:

x_1^I = \max(x_1^p, x_1^g), \quad x_2^I = \min(x_2^p, x_2^g)
y_1^I = \max(y_1^p, y_1^g), \quad y_2^I = \min(y_2^p, y_2^g)

and its area is:

I = (x_2^I - x_1^I)(y_2^I - y_1^I) if x_2^I > x_1^I and y_2^I > y_1^I, and I = 0 otherwise.

Similarly, the minimum bounding box B^C of the prediction box and the ground-truth box, indicated by the dotted line, is given by:

x_1^C = \min(x_1^p, x_1^g), \quad x_2^C = \max(x_2^p, x_2^g)
y_1^C = \min(y_1^p, y_1^g), \quad y_2^C = \max(y_2^p, y_2^g)

with area:

A^C = (x_2^C - x_1^C)(y_2^C - y_1^C)

From the definition of the IoU function, the IoU value of the prediction box and the ground-truth box is:

IoU = I / U, where U = A^p + A^g - I.

By considering both the overlap between the two bounding boxes and, through the minimum bounding box, the case in which the two boxes do not intersect, GIoU is defined as:

GIoU = IoU - (A^C - U) / A^C

Accordingly, when GIoU is used here as the loss function, the localization loss is set to:

L_{box} = L_{GIoU} = 1 - GIoU
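A direct NumPy transcription of this derivation is sketched below; boxes are assumed to be given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1, matching the coordinate convention above.

```python
import numpy as np

def giou_loss(box_p, box_g):
    """L_box = 1 - GIoU for boxes given as (x1, y1, x2, y2)."""
    # Areas A^p and A^g of the prediction box and the ground-truth box.
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])

    # Intersection rectangle B^I; its area is zero when the boxes are disjoint.
    x1_i, y1_i = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    x2_i, y2_i = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, x2_i - x1_i) * max(0.0, y2_i - y1_i)

    # Minimum bounding box B^C of the two boxes.
    x1_c, y1_c = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    x2_c, y2_c = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    area_c = (x2_c - x1_c) * (y2_c - y1_c)

    union = area_p + area_g - inter              # U = A^p + A^g - I
    giou = inter / union - (area_c - union) / area_c
    return 1.0 - giou

# Disjoint boxes: IoU is 0, but GIoU still reflects how far apart they are.
print(giou_loss(np.array([0.0, 0.0, 2.0, 2.0]), np.array([3.0, 3.0, 5.0, 5.0])))  # 1.68
```

On two disjoint boxes the sketch returns a loss greater than 1, whereas a plain IoU loss saturates for every disjoint pair; this is the property that motivates using GIoU for the localization loss.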
in the Mask R-CNN model, ResNet101+ FPN frame is shown in FIG. 3, and assuming that the input image size is 224 × 224, feature maps C1, C2, C3, C4 and C5 can be obtained through Conv1, Conv2_ x, Conv3_ x, Conv4_ x and Conv5_ x of ResNet 101.
Due to the large size of C1, a large amount of computational power is involved. Therefore, the FPN adopts the characteristic diagrams of C2, C3, C4 and C5. First, C5 is reduced in dimensionality by 1 × 1 convolution to obtain P5, then P5 is upsampled to the same size as C4 and added to the results of 1 × 1 convolution with C4 to obtain P4.
Similarly, P4, P3, and P2 are available from top to bottom, and finally a 3 × 3 convolution operation is performed after P2, P3, P4, and P5 to reduce aliasing effects of upsampling.
Furthermore, the maximum pooling with step size 2 at P5 resulted in P6. Through the above operations, 5 feature maps P2, P3, P4, P5 and P6 fused with feature information of different levels are finally obtained, and are input into the RPN network for the next operation.
It can be seen that the FPN network has a top-down structure with lateral connections, so that shallow features also carry deep feature information, which improves the richness of the network's feature information and its multi-target detection accuracy.
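The top-down construction of P2-P6 just described can be sketched in PyTorch as follows. The channel widths (256 outputs; C2-C5 with 256-2048 channels) follow common ResNet-101 + FPN conventions and are assumptions here, not values fixed by this description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Sketch of the P2-P6 construction described above; the channel sizes
    assume a standard ResNet-101 backbone (C2..C5 with 256..2048 channels)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions reduce C2..C5 to a common channel width.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions after fusion reduce the aliasing effects of upsampling.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        # Top-down: upsample to the next size and add the lateral 1x1 result.
        p4 = self.lateral[2](c4) + F.interpolate(p5, size=c4.shape[-2:])
        p3 = self.lateral[1](c3) + F.interpolate(p4, size=c3.shape[-2:])
        p2 = self.lateral[0](c2) + F.interpolate(p3, size=c2.shape[-2:])
        p2, p3, p4, p5 = (self.smooth[i](p) for i, p in enumerate((p2, p3, p4, p5)))
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)  # P6: stride-2 pooling of P5
        return p2, p3, p4, p5, p6
```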
However, the FPN network still has certain disadvantages. First, because the lateral connections of the FPN network only have a top-down path, the output feature maps mainly contain feature information of the current layer and deeper layers, and feature information at different scales is not fully utilized.
Second, shallow features carry accurate global position information, but ResNet-101 has 101 layers, so the transmission distance between shallow and deep feature information is long, and valuable information may be lost along the way. Since large-scale targets are mainly located in the shallow feature maps, these disadvantages have little influence on large-scale targets. For small-scale face targets, however, ignoring feature information at different scales directly harms detection accuracy.
To address these shortcomings of the FPN, the invention adopts a multi-scale feature fusion strategy to improve the detection of small-scale face targets. Specifically, a bottom-up feature fusion path is added to the FPN network, which shortens the transmission distance between deep and shallow features, realizes their fusion, and improves the detection precision of small-scale face targets.
The multi-scale feature fusion network is shown in FIG. 4, in which C2-C5 and P2-P5, represented by blue rectangles, denote the same FPN network as before, and the bottom-up feature fusion path is shown as N2-N5. First, N2 is copied directly from P2 and passed through a 3 × 3 convolution with stride 2 to obtain a feature map of the same size as P3; this feature map is added to P3, and the new feature map is passed through a 3 × 3 convolution with stride 1 to obtain N3. Similarly, N4 and N5 are obtained from bottom to top, where all feature maps use 256 channels. Through this operation, the newly generated feature maps N2-N5 fuse shallow and deep feature information and can better handle the small-scale face detection task.
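Under the same assumptions as the previous sketch (256-channel feature maps throughout), the added bottom-up path can be written as:

```python
import torch.nn as nn

class BottomUpFusion(nn.Module):
    """Sketch of the bottom-up N2-N5 path described above; all feature maps
    are assumed to use 256 channels, as stated in the text."""

    def __init__(self, channels=256):
        super().__init__()
        # Stride-2 3x3 convolutions downsample N_i to the size of P_{i+1}.
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1) for _ in range(3))
        # Stride-1 3x3 convolutions produce N3-N5 after the element-wise addition.
        self.fuse = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1) for _ in range(3))

    def forward(self, p2, p3, p4, p5):
        n2 = p2  # N2 is copied directly from P2
        n3 = self.fuse[0](self.down[0](n2) + p3)
        n4 = self.fuse[1](self.down[1](n3) + p4)
        n5 = self.fuse[2](self.down[2](n4) + p5)
        return n2, n3, n4, n5
```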
The disadvantages of conventional hand-crafted feature methods are:
1. relatively low detection precision, prone to false detections and inaccurate target box positions;
2. small targets are prone to missed and false detections;
3. only target box localization and classification are performed on multi-target faces; the face image is not segmented from the background image;
4. no fine segmentation is achieved, leaving background noise interference.
The above disadvantages arise for the following reasons:
1. the adopted models are relatively simple, so overfitting easily occurs;
2. no deep extraction of face features is performed, and prediction results are not screened;
3. no instance segmentation technique is applied to segment the face from the background image; only a single detection task is performed.
Disadvantages of existing deep learning methods (R-CNN, Fast R-CNN):
1. slow recognition speed;
2. low detection precision for face images in unusual poses;
3. only target box localization and classification are performed on multi-target faces; the face image is not segmented from the background image;
4. no fine segmentation is achieved, leaving background noise interference.
The above disadvantages arise for the following reasons:
1. with a deep learning network architecture, feature extraction is performed for every generated candidate box, which increases computation time;
2. RoI Pooling (Region of Interest Pooling) performs two rounding operations, which introduces quantization error and lowers the pixel-level localization accuracy of the image (see the sketch after this list);
3. prediction results are not screened, which increases the false detection rate;
4. no image segmentation algorithm is used to segment the face from the background image; only a single detection task is performed.
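The quantization error in point 2 can be illustrated numerically. The following sketch contrasts the coordinate rounding of RoI Pooling with the exact floating-point positions kept by RoIAlign; the numbers are a made-up example, not values from the invention.

```python
# Made-up example: an RoI on a 224x224 image mapped onto a feature map with
# stride 16 and then split into 2 bins along x.
roi_x1, roi_x2 = 17.0, 150.0
stride, bins = 16, 2

# RoI Pooling: two rounding steps accumulate quantization error.
fx1, fx2 = int(roi_x1 / stride), int(roi_x2 / stride)  # 1st rounding: 1, 9
bin_w = (fx2 - fx1) // bins                            # 2nd rounding: bin width 4

# RoIAlign: coordinates stay floating-point; feature values at non-integer
# positions are read off by bilinear interpolation (interpolation not shown).
gx1, gx2 = roi_x1 / stride, roi_x2 / stride            # 1.0625, 9.375
exact_bin_w = (gx2 - gx1) / bins                       # 4.15625, no rounding

print(fx1, fx2, bin_w)        # 1 9 4                (quantized twice)
print(gx1, gx2, exact_bin_w)  # 1.0625 9.375 4.15625 (exact)
```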
The invention introduces instance segmentation on the basis of conventional surveillance-video face detection and uses a fully convolutional network to segment the face image from the background image; the application of instance segmentation to multi-target face detection on surveillance video falls within the protection scope of the invention.
The generalized intersection over union function is adopted to replace the traditional smooth L1 function in the bounding box regression loss, improving the detection precision of multi-target faces; the application of the generalized intersection over union function to multi-target face detection and segmentation with better detection and segmentation effects falls within the protection scope of the invention.
The invention adopts a multi-scale feature fusion strategy in the FPN network, adding a reverse, bottom-up side connection path for multi-scale feature fusion and improving small-scale face detection performance. The application of the MOB algorithm to multi-target face detection on surveillance video falls within the protection scope of the invention.
To sum up, compared with the prior art:
according to the method, the identification precision is improved through a ROIAlign (candidate region matching) algorithm, and the positioning precision of the image pixel points after the multi-target face detection reaches the pixel level, so that the requirement of an example segmentation technology on the precision of the pixel points is met.
The invention can perform instance segmentation on multi-target face images from surveillance video through an FCN (Fully Convolutional Network) algorithm, draw a binary face mask, and segment the face image from the background image, thereby reducing the interference of background noise and acquiring accurate face information from complex surveillance footage.
The invention screens prediction results through an MOB (Mask of Bounding Box) algorithm, improving recognition accuracy.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims (5)

1. A multi-scale feature fusion face detection and segmentation method based on the generalized intersection over union, characterized in that the method comprises the following steps:
S1: preprocess the face image to be detected, input it into a Mask R-CNN model, and extract the corresponding feature map through a pre-trained deep neural network model;
S2: generate candidate regions on the feature map through a region proposal network of preset size;
S3: align the pixels of the input image with the feature map using candidate region matching (RoIAlign), and obtain a corresponding fixed-size feature map;
S4: finally, classify the candidate regions and locate the bounding boxes using fully connected layers, predict pixel points using a fully convolutional network, and generate the corresponding binary mask to segment the face target from the background image.
2. The multi-scale feature fusion face detection and segmentation method based on the generalized intersection over union according to claim 1, characterized in that: in the Mask R-CNN loss function of the Mask R-CNN model, the generalized intersection over union function replaces the traditional smooth L1 function in the bounding box regression loss, improving the detection precision for multi-target faces.
3. The multi-scale feature fusion face detection and segmentation method based on the generalized intersection over union according to claim 1, characterized in that: a multi-scale feature fusion strategy is adopted in the FPN, in which a reverse, bottom-up side connection path is added for multi-scale feature fusion, improving small-scale face detection performance.
4. The multi-scale feature fusion face detection and segmentation method based on the generalized intersection over union according to claim 1, characterized in that: the Mask R-CNN model completes three tasks within the same network architecture, namely detection and localization of target position information, classification of target versus background, and segmentation of the target from the background, so the loss function of this network architecture comprises three parts: localization loss, classification loss and segmentation loss.
5. The multi-scale feature fusion face detection and segmentation method based on the generalized intersection over union according to claim 4, characterized in that: the network global loss function is defined as L = L_{cls} + L_{box} + L_{mask}, where L_{cls} is the classification loss, L_{box} is the localization loss, and L_{mask} is the segmentation loss.
CN202011251701.4A 2020-11-10 2020-11-10 Multi-scale feature fusion face detection and segmentation method based on generalized intersection over union Pending CN114463800A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011251701.4A CN114463800A (en) 2020-11-10 Multi-scale feature fusion face detection and segmentation method based on generalized intersection over union

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011251701.4A CN114463800A (en) 2020-11-10 Multi-scale feature fusion face detection and segmentation method based on generalized intersection over union

Publications (1)

Publication Number Publication Date
CN114463800A 2022-05-10

Family

ID=81403948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011251701.4A CN114463800A (en) 2020-11-10 Multi-scale feature fusion face detection and segmentation method based on generalized intersection over union

Country Status (1)

Country Link
CN (1) CN114463800A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973386A (en) * 2022-08-01 2022-08-30 成都市威虎科技有限公司 Construction site scene face target detection method for deeply mining mixed features
CN114973386B (en) * 2022-08-01 2022-11-04 成都市威虎科技有限公司 Construction site scene face target detection method for deeply mining mixed features
CN116311482A (en) * 2023-05-23 2023-06-23 中国科学技术大学 Face fake detection method, system, equipment and storage medium
CN116311482B (en) * 2023-05-23 2023-08-29 中国科学技术大学 Face fake detection method, system, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination