WO2024011859A1 - A face detection method and device based on a neural network - Google Patents

A face detection method and device based on a neural network

Info

Publication number
WO2024011859A1
WO2024011859A1 (PCT/CN2022/141600)
Authority
WO
WIPO (PCT)
Prior art keywords
convolution
image
network
recognized
layer
Prior art date
Application number
PCT/CN2022/141600
Other languages
English (en)
French (fr)
Inventor
刘辛
刘辉
张瑞
刘振亚
韩飞
于光远
Original Assignee
天翼云科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 天翼云科技有限公司 filed Critical 天翼云科技有限公司
Publication of WO2024011859A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Definitions

  • the present application relates to the field of machine learning technology, and in particular to a face detection method and device based on neural networks.
  • Facial recognition technology is widely used in finance, security, smart transportation, livelihood services and other fields. It can not only make identity authentication more accurate and convenient, but also enable the identification and tracking of key personnel.
  • face detection models based on neural networks are mostly applied on the server side.
  • the mobile side has become an increasingly important algorithm deployment platform.
  • the deployment of face detection models based on neural networks on mobile terminals has two major pain points: large storage space occupied by the model and long inference delay time, which greatly restricts the deployment and application of face detection models on mobile terminals.
  • This application provides a face detection method and device based on neural networks to reduce the storage space and inference delay time of the face detection model on the mobile terminal while ensuring the accuracy of face detection.
  • embodiments of the present application provide a face detection method based on a neural network.
  • the method includes: passing an image to be recognized through multiple processing units of a first cascade network to predict N first candidate windows in the image to be recognized;
  • each first candidate window is a preliminarily predicted region containing a face image;
  • passing the N first candidate windows through multiple processing units of a second cascade network to predict M second candidate windows in the image to be recognized;
  • each second candidate window is a refined predicted region containing a face image;
  • passing the M second candidate windows through multiple processing units of a third cascade network to determine a target window containing a face image in the image to be recognized and the face feature points in the target window; wherein the convolution operation of at least one processing unit is a depthwise separable convolution operation.
  • in a cascaded network, when the input image contains few targets, the runtime latency of the network has a clear advantage.
  • compared with conventional convolution, depthwise separable convolution significantly reduces the parameters of the face detection model while maintaining its accuracy, which in turn significantly reduces the amount of computation during inference on the mobile terminal.
  • the at least one processing unit includes a fusion layer; the fusion layer is obtained by fusing a trained convolution layer and a trained batch normalization layer.
  • the batch normalization layer can speed up the convergence speed of the face detection model during training, enhance the generalization ability of the model, and also reduce the degree of overfitting of the network.
  • the batch normalization layer is integrated into the convolution layer without increasing the calculation amount of the face detection model during mobile terminal inference.
  • the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel.
  • when training the face detection model, increasing the number of convolution kernels in the network, and hence the number of middle-layer channels, improves the feature extraction and expression capabilities of the network.
  • when the model is deployed to the mobile terminal, redundant convolution kernels are pruned according to their weights, which reduces the amount of computation and memory usage of the face detection model during inference on the mobile terminal.
  • the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel, including: performing convolution kernel pruning on the first cascade network, the second cascade network and the third cascade network in sequence, where any cascade network is processed as follows: according to the weight of each convolution kernel, the M kernels with the smallest weights are pruned; after the pruned cascade network is trained, pruning continues until a set requirement is met; once the set requirement is met, the channels corresponding to the pruned kernels are deleted.
  • the larger the weight of a convolution kernel, the greater its impact on the overall accuracy of the face detection model; pruning away kernels with small weights therefore reduces the amount of computation during inference on the mobile terminal without reducing detection accuracy.
  • an iterative strategy of staged pruning and retraining is used: each round prunes only part of the convolution kernels and then retrains the network, which allows the parameter distribution of the network to adapt to the pruned face detection model while restoring its detection accuracy.
  • a quantization node is included before each processing layer of the processing unit, and the quantization node is used to convert FP32 input data into INT8/UINT8 data;
  • an inverse quantization node is included after each processing layer, and is used to restore INT8/UINT8 output data to FP32 data.
  • the continuous inverse quantization nodes and the quantization nodes are merged into one re-quantization node.
  • spatially continuous inverse quantization nodes and quantization nodes can be merged into one re-quantization node, thereby reducing the additional calculation amount caused by continuous quantization and inverse quantization.
  • At least one parameter in the at least one processing unit is subjected to model quantization processing.
  • FP32 type parameters are converted into INT8/UINT8 type parameters to achieve storage compression of the face detection model.
  • embodiments of the present application provide a face detection device based on a neural network, including:
  • a prediction module, configured to pass the image to be recognized through multiple processing units of the first cascade network and predict N first candidate windows in the image to be recognized; each first candidate window is a preliminarily predicted region containing a face image;
  • the prediction module is also used to pass the N first candidate windows through multiple processing units of the second cascade network and predict M second candidate windows in the image to be recognized; each second candidate window is a refined predicted region containing a face image;
  • a determination module, configured to pass the M second candidate windows through multiple processing units of the third cascade network to determine a target window containing a face image in the image to be recognized and the face feature points in the target window; wherein the convolution operation of at least one processing unit is a depthwise separable convolution operation.
  • the at least one processing unit includes a fusion layer; the fusion layer is obtained by fusing a trained convolution layer and a trained batch normalization layer.
  • the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel.
  • the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel, including: performing convolution kernel pruning on the first cascade network, the second cascade network and the third cascade network in sequence, where any cascade network is processed as follows: according to the weight of each convolution kernel, the M kernels with the smallest weights are pruned; after the pruned cascade network is trained, pruning continues until a set requirement is met; once the set requirement is met, the channels corresponding to the pruned kernels are deleted.
  • a quantization node is included before each processing layer of the processing unit, and the quantization node is used to convert FP32 input data into INT8/UINT8 data;
  • an inverse quantization node is included after each processing layer, and is used to restore INT8/UINT8 output data to FP32 data.
  • the continuous inverse quantization nodes and the quantization nodes are merged into one re-quantization node.
  • At least one parameter in the at least one processing unit is subjected to model quantization processing.
  • embodiments of the present application also provide a computing device, including:
  • a memory, used to store program instructions;
  • a processor configured to call program instructions stored in the memory, and execute the method described in any possible design of the first aspect according to the obtained program instructions.
  • embodiments of the present application also provide a computer-readable storage medium in which computer-readable instructions are stored.
  • when a computer reads and executes the computer-readable instructions, the method described in any possible design of the first aspect is implemented.
  • Figure 1 is a schematic diagram of a face detection model provided by an embodiment of the present application.
  • Figure 2 is a schematic flowchart of a neural network-based face detection method provided by an embodiment of the present application
  • Figure 3 is a schematic flowchart of a method for cutting redundant convolution kernels provided by an embodiment of the present application
  • Figure 4 is a schematic diagram of nodes related to quantification provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a specific face detection model provided by an embodiment of the present application.
  • Figure 6 is a schematic structural diagram of a face detection device based on a neural network provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • multiple refers to two or more than two. Words such as “first” and “second” are only used for the purpose of differentiating descriptions and cannot be understood as indicating or implying relative importance or order.
  • Figure 1 exemplarily shows a face detection model applicable to the embodiment of the present application.
  • the face detection model includes three cascade networks: a first cascade network 110, a second cascade network 120 and a third cascade network 130.
  • Each cascade network includes multiple processing units for processing the image to be recognized.
  • the processing unit may include at least one of a convolution layer, a fusion layer, an activation function layer and a fully connected layer.
  • each arrow in the figure represents one processing step applied by a processing unit to the image to be recognized.
  • the image to be detected undergoes pyramid transformation to generate a multi-scale image, which is input into the first cascade network; through the multiple processing units of the first cascade network, the face regions in the image to be detected are extracted.
  • the candidate boxes output by the first cascade network are input into the second cascade network, and the face regions in the image to be detected are refined through the multiple processing units of the second cascade network.
  • the refined candidate boxes output by the second cascade network are input into the third cascade network, and the final face rectangle and the positions of the face key points in the image to be detected are determined through the multiple processing units of the third cascade network.
  • Each cascade network outputs three parts, face classification (face classification), prediction box regression (bounding box) and key point regression (facial landmark localization).
  • face classification is the probability that the image region corresponding to the rectangular box is a face image
  • bounding box regression is the position information of the detected face rectangle
  • key point regression is the position information of the five key points of the face within the rectangle.
  • the five key points are the positions of the two eyes, the tip of the nose and the two corners of the mouth.
  • Figure 2 schematically shows a face detection method based on a neural network provided by an embodiment of the present application. This method can be specifically executed by the above face detection model. As shown in Figure 2, the method includes the following steps:
  • Step 201 Pass the image to be recognized through multiple processing units of the first cascade network to predict N first candidate windows in the image to be recognized.
  • each first candidate window is a preliminary prediction area with a face image.
  • Step 202 Pass the N first candidate windows through multiple processing units of the second cascade network to predict M second candidate windows in the image to be recognized.
  • each second candidate window is a refined predicted region containing a face image.
  • Step 203 Pass the M second candidate windows through multiple processing units of the third cascade network to determine the target window with the face image of the image to be recognized and the face feature points in the target window.
  • the convolution operation of at least one processing unit is a depthwise separable convolution (Depthwise Separable Convolution) operation.
  • Depthwise separable convolution includes depthwise convolution (Depthwise Convolution) and pointwise convolution (Pointwise Convolution).
  • in a cascaded network, when the input image contains few targets, the runtime latency of the network has a clear advantage.
  • compared with conventional convolution, depthwise separable convolution significantly reduces the parameters of the face detection model while maintaining its accuracy, which in turn significantly reduces the amount of computation during inference on the mobile terminal.
  • At least one processing unit in the above three cascade networks includes a fusion layer, which is obtained by fusing the trained convolution layer and the trained batch normalization (Batch Normalization) layer.
  • adding a Batchnorm layer can speed up the convergence of the network during training and enhance the generalization ability of the model.
  • when the training data are too few or the network capacity is too small, it can also reduce the degree of overfitting of the network.
  • however, the Batchnorm layer needs to store parameters and also occupies the computing resources of the mobile terminal; this application therefore fuses the Batchnorm layer with the adjacent convolution layer to reduce the amount of computation required by the face detection model during mobile inference.
  • the method for fusing the Batchnorm layer and the convolution layer is as follows:
  • Step 1: compute the mean of the input: μ = (1/n)·Σ_{i=1..n} x_i
  • Step 2: compute the variance of the input: σ² = (1/n)·Σ_{i=1..n} (x_i − μ)²
  • Step 3: generate the new convolution layer weights: ω′ = γ·ω / √(σ² + ε)
  • Step 4: generate the new convolution layer bias: b′ = γ·(b − μ) / √(σ² + ε) + β
  • Step 5: recompute the convolution: y ← ω′·x + b′
  • when training the face detection model, the number of convolution kernels in the network can be increased, thereby increasing the number of middle-layer channels and improving the feature extraction and expression capabilities of the network.
  • when the convolution operation is a depthwise separable convolution, the number of middle-layer channels of the network can be increased by increasing the number of convolution kernels in the pointwise convolution.
  • after training, the convolution kernels in the convolution layers are pruned according to the weight of each convolution kernel, and the pruned face detection model is then deployed to the mobile terminal.
  • the convolution kernels of the first cascade network, the second cascade network and the third cascade network can be pruned in sequence using iterative pruning.
  • any cascade network is processed as follows: according to the weight of each convolution kernel, the M kernels with the smallest weights are pruned; after the pruned cascade network is trained, pruning continues until a set requirement is met. Once the set requirement is met, the channels corresponding to the pruned kernels are deleted.
  • Figure 3 schematically illustrates a specific method for cropping redundant convolution kernels provided by an embodiment of the present application. As shown in Figure 3, the method includes the following steps:
  • Step 301 Set the cropping ratio and the number of iterations.
  • Step 302 Calculate the weight of each convolution kernel.
  • the accumulated absolute value of the parameters in each convolution kernel can be used as the weight of that kernel.
  • the weight of a convolution kernel is calculated as shown in Formula 1: s_n = Σ_{i,j,k} |w_{i,j,k}|
  • W_n and s_n respectively denote the n-th convolution kernel matrix in the network and its weight;
  • w_{i,j,k} denotes an element of the convolution kernel matrix;
  • i, j, k index the three dimensions of the convolution kernel.
  • Step 303 Arrange the weights of all convolution kernels in the network.
  • Step 304 Mark the m convolution kernels with the smallest weights and the channels related to those kernels.
  • Step 305 Retrain the trimmed cascade network.
  • Step 306 Determine whether the proportion of marked convolution kernels reaches a preset proportion.
  • if the proportion of marked convolution kernels has not reached the preset ratio, step 302 is executed to perform the next round of pruning; if it has, step 307 is executed.
  • Step 307 Delete all marked convolution kernels and channels.
  • each cascade network can be retrained on the training set so that the network's parameter distribution adapts to the pruned structure while the accuracy of the face detection model is restored. After the pruning ratio is preset, the network is not pruned to the specified ratio in one step; instead, an iterative strategy of staged pruning and retraining is adopted, in which each iteration prunes only part of the network structure and retrains it, reducing the difficulty of restoring model accuracy during retraining.
  • each processing layer of the processing unit includes a quantization node before it, which is used to convert input data of 32-bit floating point (FP32) type into 8-bit fixed-point (INT8/UINT8) type.
  • asymmetric sub-channel quantization and retraining can be used to convert FP32 type data to INT8/UINT8 type data.
  • An inverse quantization node is included after each processing layer of the processing unit.
  • the inverse quantization node is used to restore INT8/UINT8 type output data to FP32 type data.
  • the continuous inverse quantization nodes and quantization nodes can be merged into one requantization node, thereby reducing the cost caused by continuous quantization and inverse quantization. Additional calculations.
  • FIG. 4 exemplarily shows the above three types of quantization-related nodes inserted in each cascade network.
  • a quantization node 401 is provided before each processing layer to convert the input FP32 data into INT8/UINT8 data before it is fed into each processing layer for processing.
  • an inverse quantization node 402 is provided after each processing layer to convert the output INT8/UINT8 data back into FP32 data.
  • the spatially continuous inverse quantization node 402 and quantization node 401 are merged into one requantization node 403.
  • FP32 type data is converted into INT8/UINT8 type data, which can further compress the face detection model and achieve storage compression and inference acceleration of the model after the face detection model is deployed to the mobile terminal.
  • model quantization processing includes weight value quantization and activation value quantification.
  • weight quantization quantizes the parameters stored by the model, including weights and bias values. This operation only needs to be performed once, when the face detection model is serialized for storage on the mobile terminal, and does not need to be performed during inference.
  • activation quantization dynamically quantizes the activation values output by each layer of the network.
  • the main approach to activation quantization is to run the quantized face detection model for one epoch on a pre-selected validation set and determine the quantization parameters from the floating-point range of each layer's output activations; the determined quantization parameters no longer change during inference.
  • the face key point detection branch in the network can be retained when training the face detection model, to improve the face detection performance of the network.
  • when the model is deployed to the mobile terminal, the face key point detection branch is pruned away to further reduce the computational load of the network.
  • the loss function of the face detection model can be as shown in Formula 2: L = α·FL_classification + β·L_box + L_landmark
  • FL_classification is the face classification loss with weight α = 2; L_box is the regression loss of the prediction box with weight β = 1.5; L_landmark represents the regression loss of the face key points.
  • the face classification loss uses the focal loss function (Focal Loss), and all sample data are used to compute the loss so that samples are balanced during training.
  • the loss function for face classification is shown in Formula 3: FL = −α·(1 − ŷ)^γ·y·log(ŷ) − (1 − α)·ŷ^γ·(1 − y)·log(1 − ŷ), where y ∈ {0, 1} is the ground-truth label, ŷ is the network's prediction, α = 0.25 balances positive and negative samples, and γ = 2 balances easy and hard samples.
  • the regression loss of the prediction box uses the Euclidean loss, as shown in Formula 4: L_box = ‖ŷ_box − y_box‖₂²
  • y_box is the ground-truth coordinate of the sample and ŷ_box is the coordinate offset predicted by the network; both are represented as four-tuples (x, y, h, w), where x and y are the horizontal and vertical coordinates of the top-left corner of the detection box, and h and w are its height and width.
  • the face key points include the coordinates of the left and right eyes, the nose, and the left and right corners of the mouth, so the landmark regression of Formula 5, L_landmark = ‖ŷ_landmark − y_landmark‖₂², is computed over ŷ_landmark ∈ R^10.
  • This application uses the Tensorflow-1.10 deep learning framework when training the face detection model.
  • data augmentation is performed using flipping, cropping, transforming brightness and contrast.
  • Figure 5 exemplarily shows a specific face detection model provided by the embodiment of the present application.
  • the face detection model includes three cascaded networks.
  • after pyramid transformation, the image to be detected is converted into an image with a size of 12×12×3.
  • the image to be detected with a size of 12×12×3 is used as the input of the first cascade network.
  • after the first depthwise separable convolution operation, 16 images with a size of 5×5 are generated.
  • after the second, 24 images of size 3×3 are generated.
  • after the third, 16 images of size 1×1 are generated.
  • through two 1×1×16 convolution kernels, two 1×1 feature maps are generated for classification; through four 1×1×16 convolution kernels, four 1×1 feature maps are generated for prediction box regression; and through ten 1×1×16 convolution kernels, ten 1×1 feature maps are generated for key point regression.
  • after pyramid transformation, the prediction boxes output by the first cascade network are converted into an image to be detected with a size of 24×24×3.
  • the image to be detected with a size of 24×24×3 is used as the input of the second cascade network.
  • after the first depthwise separable convolution operation, 48 images with a size of 11×11 are generated.
  • after the second, 64 images of size 4×4 are generated.
  • after the third, 64 images of size 3×3 are generated.
  • through two 1×1×16 convolution kernels, two 1×1 feature maps are generated for classification; through four 1×1×16 convolution kernels, four 1×1 feature maps are generated for prediction box regression; and through ten 1×1×16 convolution kernels, ten 1×1 feature maps are generated for key point regression.
  • after pyramid transformation, the prediction boxes output by the second cascade network are converted into an image to be detected with a size of 48×48×3.
  • the image to be detected with a size of 48×48×3 is used as the input of the third cascade network.
  • after the first depthwise separable convolution operation, 48 images with a size of 23×23 are generated.
  • after the second, 96 images of size 10×10 are generated.
  • after the third, 96 images of size 4×4 are generated.
  • after the fourth, 128 images of size 3×3 are generated.
  • This application provides a face detection method based on neural networks.
  • by replacing conventional convolution with depthwise separable convolution, fusing the batch normalization layer into the convolution layer, pruning redundant convolution kernels and inserting quantization nodes into the network, the parameters of the face detection model can be significantly reduced while maintaining its accuracy, thereby reducing the storage space and inference latency of the face detection model on the mobile terminal while ensuring face detection accuracy.
  • Figure 6 exemplarily shows a neural-network-based face detection device provided by an embodiment of the present application; the device may be a mobile terminal device.
  • the face detection device 600 includes:
  • the prediction module 601 is used to pass the image to be recognized through multiple processing units of the first cascade network and predict N first candidate windows in the image to be recognized; each first candidate window is a preliminarily predicted region containing a face image;
  • the prediction module 601 is also configured to pass the N first candidate windows through multiple processing units of the second cascade network to predict M second candidate windows in the image to be recognized; each second candidate window is a refined predicted region containing a face image;
  • the determination module 602 is configured to pass the M second candidate windows through multiple processing units of the third cascade network to determine a target window containing a face image in the image to be recognized and the face feature points in the target window; wherein the convolution operation of at least one processing unit is a depthwise separable convolution operation.
  • the at least one processing unit includes a fusion layer; the fusion layer is obtained by fusing a trained convolution layer and a trained batch normalization layer.
  • the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel.
  • the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel, including: performing convolution kernel pruning on the first cascade network, the second cascade network and the third cascade network in sequence, where any cascade network is processed as follows: according to the weight of each convolution kernel, the M kernels with the smallest weights are pruned; after the pruned cascade network is trained, pruning continues until a set requirement is met; once the set requirement is met, the channels corresponding to the pruned kernels are deleted.
  • a quantization node is included before each processing layer of the processing unit, and the quantization node is used to convert FP32 input data into INT8/UINT8 data;
  • an inverse quantization node is included after each processing layer, and is used to restore INT8/UINT8 output data to FP32 data.
  • the continuous inverse quantization nodes and the quantization nodes are merged into one re-quantization node.
  • the embodiment of the present application provides a computing device, as shown in Figure 7, including at least one processor 701, and a memory 702 connected to the at least one processor.
  • the specific connection medium between the processor 701 and the memory 702 is not limited in the embodiments of the present application; in Figure 7, a bus connection between the processor 701 and the memory 702 is taken as an example.
  • the bus can be divided into address bus, data bus, control bus, etc.
  • the memory 702 stores instructions that can be executed by at least one processor 701. By executing the instructions stored in the memory 702, at least one processor 701 can execute the above neural network-based face detection method.
  • the processor 701 is the control center of the computing device. It can use various interfaces and lines to connect the various parts of the computer device, and performs resource configuration by running or executing the instructions stored in the memory 702 and calling the data stored in the memory 702.
  • the processor 701 may include one or more processing units.
  • the processor 701 may integrate an application processor and a modem processor.
  • the application processor mainly handles the operating system, user interface, application programs, etc., while the modem processor mainly handles wireless communication. It can be understood that the modem processor may not be integrated into the processor 701.
  • the processor 701 and the memory 702 can be implemented on the same chip, and in some embodiments, they can also be implemented on separate chips.
  • the processor 701 can be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of this application.
  • a general-purpose processor may be a microprocessor or any conventional processor; the steps of the methods disclosed in connection with the embodiments of the present application can be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
  • the memory 702 can be used to store non-volatile software programs, non-volatile computer executable programs and modules.
  • the memory 702 may include at least one type of storage medium, for example, may include flash memory, hard disk, multimedia card, card-type memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Magnetic Memory, Disk , CD, etc.
  • the memory 702 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 702 in the embodiment of the present application can also be a circuit or any other device capable of realizing a storage function, used to store program instructions and/or data.
  • embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer-executable program.
  • the computer-executable program is used to cause the computer to execute any of the neural-network-based face detection methods listed above.
  • embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Abstract

A face detection method and device based on a neural network. The method includes: passing an image to be recognized through multiple processing units of a first cascade network to predict N first candidate windows in the image to be recognized, each first candidate window being a preliminarily predicted region containing a face image; passing the N first candidate windows through multiple processing units of a second cascade network to predict M second candidate windows in the image to be recognized, each second candidate window being a refined predicted region containing a face image; and passing the M second candidate windows through multiple processing units of a third cascade network to determine a target window containing a face image in the image to be recognized and the face feature points in the target window; wherein the convolution operation of at least one processing unit is a depthwise separable convolution operation.

Description

A face detection method and device based on a neural network
Technical Field
This application relates to the field of machine learning technology, and in particular to a face detection method and device based on a neural network.
Background
Facial recognition technology is widely used in finance, security, smart transportation, livelihood services and other fields. It not only makes identity authentication more accurate and convenient, but also enables the identification and tracking of key personnel.
Technical Problem
At present, face detection models based on neural networks are mostly deployed on the server side. However, with the rise of the mobile internet and the improvement of hardware performance in recent years, the mobile terminal has become an increasingly important platform for algorithm deployment. Yet deploying neural-network-based face detection models on mobile terminals suffers from two major pain points: the model occupies a large amount of storage space, and the inference latency is long. This greatly restricts the deployment and application of face detection models on mobile terminals.
Therefore, a solution is urgently needed that reduces the storage space and inference latency of the face detection model on the mobile terminal while ensuring face detection accuracy.
Technical Solution
This application provides a face detection method and device based on a neural network, so as to reduce the storage space and inference latency of the face detection model on the mobile terminal while ensuring face detection accuracy.
In a first aspect, an embodiment of this application provides a face detection method based on a neural network. The method includes: passing an image to be recognized through multiple processing units of a first cascade network to predict N first candidate windows in the image to be recognized, each first candidate window being a preliminarily predicted region containing a face image; passing the N first candidate windows through multiple processing units of a second cascade network to predict M second candidate windows in the image to be recognized, each second candidate window being a refined predicted region containing a face image; and passing the M second candidate windows through multiple processing units of a third cascade network to determine a target window containing a face image in the image to be recognized and the face feature points in the target window; wherein the convolution operation of at least one processing unit is a depthwise separable convolution operation.
In the above technical solution, in a cascaded network, when the input image contains few targets, the runtime latency of the network has a clear advantage. Moreover, compared with conventional convolution, depthwise separable convolution significantly reduces the parameters of the face detection model while maintaining its accuracy, which in turn significantly reduces the amount of computation during inference on the mobile terminal.
Optionally, the at least one processing unit includes a fusion layer; the fusion layer is obtained by fusing a trained convolution layer and a trained batch normalization layer.
In the above technical solution, the batch normalization layer speeds up the convergence of the face detection model during training, enhances the generalization ability of the model, and also reduces the degree of overfitting of the network. When the face detection model is deployed to the mobile terminal, fusing the batch normalization layer into the convolution layer avoids increasing the amount of computation during inference on the mobile terminal.
Optionally, the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel.
In the above technical solution, when training the face detection model, increasing the number of convolution kernels in the network increases the number of middle-layer channels, which improves the feature extraction and expression capabilities of the network. When the face detection model is deployed to the mobile terminal, redundant convolution kernels are pruned according to their weights, which reduces the amount of computation and memory usage during inference on the mobile terminal.
Optionally, the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel, including: performing convolution kernel pruning on the first cascade network, the second cascade network and the third cascade network in sequence, where any cascade network is processed as follows: according to the weight of each convolution kernel, the M kernels with the smallest weights are pruned; after the pruned cascade network is trained, pruning continues until a set requirement is met; once the set requirement is met, the channels corresponding to the pruned kernels are deleted.
In the above technical solution, the larger the weight of a convolution kernel, the greater its impact on the overall accuracy of the face detection model. Pruning away kernels with small weights therefore reduces the amount of computation during inference on the mobile terminal without reducing detection accuracy. Moreover, when pruning the kernels, an iterative strategy of staged pruning and retraining is used: each round prunes only part of the kernels and then retrains the network, which allows the parameter distribution of the network to adapt to the pruned face detection model while restoring its detection accuracy.
Optionally, a quantization node is included before each processing layer of the processing unit, the quantization node being used to convert FP32 input data into INT8/UINT8 data; an inverse quantization node is included after each processing layer of the processing unit, the inverse quantization node being used to restore INT8/UINT8 output data to FP32 data.
In the above technical solution, by inserting quantization nodes into the network to convert FP32 data into INT8/UINT8 data, storage compression and inference acceleration of the face detection model can be achieved after it is deployed to the mobile terminal.
Optionally, if a consecutive inverse quantization node and quantization node exist between two processing layers, the consecutive inverse quantization node and quantization node are merged into one re-quantization node.
In the above technical solution, spatially consecutive inverse quantization nodes and quantization nodes can be merged into one re-quantization node, which reduces the extra computation caused by consecutive quantization and dequantization.
Optionally, at least one parameter in the at least one processing unit undergoes model quantization.
In the above technical solution, FP32 parameters are converted into INT8/UINT8 parameters to achieve storage compression of the face detection model.
In a second aspect, an embodiment of this application provides a face detection device based on a neural network, including:
a prediction module, configured to pass an image to be recognized through multiple processing units of a first cascade network to predict N first candidate windows in the image to be recognized, each first candidate window being a preliminarily predicted region containing a face image;
the prediction module being further configured to pass the N first candidate windows through multiple processing units of a second cascade network to predict M second candidate windows in the image to be recognized, each second candidate window being a refined predicted region containing a face image;
a determination module, configured to pass the M second candidate windows through multiple processing units of a third cascade network to determine a target window containing a face image in the image to be recognized and the face feature points in the target window; wherein the convolution operation of at least one processing unit is a depthwise separable convolution operation.
Optionally, the at least one processing unit includes a fusion layer; the fusion layer is obtained by fusing a trained convolution layer and a trained batch normalization layer.
Optionally, the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel.
Optionally, the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel, including: performing convolution kernel pruning on the first cascade network, the second cascade network and the third cascade network in sequence, where any cascade network is processed as follows: according to the weight of each convolution kernel, the M kernels with the smallest weights are pruned; after the pruned cascade network is trained, pruning continues until a set requirement is met; once the set requirement is met, the channels corresponding to the pruned kernels are deleted.
Optionally, a quantization node is included before each processing layer of the processing unit, the quantization node being used to convert FP32 input data into INT8/UINT8 data; an inverse quantization node is included after each processing layer of the processing unit, the inverse quantization node being used to restore INT8/UINT8 output data to FP32 data.
Optionally, if a consecutive inverse quantization node and quantization node exist between two processing layers, the consecutive inverse quantization node and quantization node are merged into one re-quantization node.
Optionally, at least one parameter in the at least one processing unit undergoes model quantization.
In a third aspect, an embodiment of this application further provides a computing device, including:
a memory, configured to store program instructions;
a processor, configured to call the program instructions stored in the memory and execute the method described in any possible design of the first aspect according to the obtained program instructions.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium in which computer-readable instructions are stored; when a computer reads and executes the computer-readable instructions, the method described in any possible design of the first aspect is implemented.
Description of the Drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic diagram of a face detection model provided by an embodiment of this application;
Figure 2 is a schematic flowchart of a neural-network-based face detection method provided by an embodiment of this application;
Figure 3 is a schematic flowchart of a method for pruning redundant convolution kernels provided by an embodiment of this application;
Figure 4 is a schematic diagram of quantization-related nodes provided by an embodiment of this application;
Figure 5 is a schematic diagram of a specific face detection model provided by an embodiment of this application;
Figure 6 is a schematic structural diagram of a neural-network-based face detection device provided by an embodiment of this application;
Figure 7 is a schematic structural diagram of a computing device provided by an embodiment of this application.
Embodiments of the Invention
To make the purpose, technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
In the embodiments of this application, "multiple" means two or more. Words such as "first" and "second" are used only to distinguish between descriptions and cannot be understood as indicating or implying relative importance or order.
Figure 1 shows an example of a face detection model to which the embodiments of this application are applicable. The face detection model includes three cascade networks: a first cascade network 110, a second cascade network 120 and a third cascade network 130.
Each cascade network includes multiple processing units for processing the image to be recognized. A processing unit may include at least one of a convolution layer, a fusion layer, an activation function layer and a fully connected layer. Each arrow in the figure represents one processing step applied by a processing unit to the image.
The image to be detected undergoes pyramid transformation to generate a multi-scale image, which is input into the first cascade network; through the multiple processing units of the first cascade network, the face regions in the image to be detected are extracted. The candidate boxes output by the first cascade network are input into the second cascade network, and the face regions in the image to be detected are refined through its multiple processing units. The refined candidate boxes output by the second cascade network are input into the third cascade network, and the final face rectangle and the positions of the face key points in the image to be detected are determined through the multiple processing units of the third cascade network.
Each cascade network outputs three parts: face classification, bounding box regression and facial landmark localization.
Face classification is the probability that the image region corresponding to the rectangular box is a face image; bounding box regression is the position information of the detected face rectangle; and landmark regression is the position information of the five key points of the face within the rectangle. The five key points are the positions of the two eyes, the tip of the nose and the two corners of the mouth.
It should be noted that the structure of the face detection model shown in Figure 1 is only an example, and the embodiments of this application do not specifically limit it.
Figure 2 shows an example of a face detection method based on a neural network provided by an embodiment of this application; the method may be executed by the above face detection model. As shown in Figure 2, the method includes the following steps:
Step 201: Pass the image to be recognized through multiple processing units of the first cascade network to predict N first candidate windows in the image to be recognized.
In the above step, each first candidate window is a preliminarily predicted region containing a face image.
Step 202: Pass the N first candidate windows through multiple processing units of the second cascade network to predict M second candidate windows in the image to be recognized.
In the above step, each second candidate window is a refined predicted region containing a face image.
Step 203: Pass the M second candidate windows through multiple processing units of the third cascade network to determine the target window containing a face image in the image to be recognized and the face feature points in the target window.
In the above three cascade networks, the convolution operation of at least one processing unit is a depthwise separable convolution (Depthwise Separable Convolution) operation. A depthwise separable convolution consists of a depthwise convolution (Depthwise Convolution) and a pointwise convolution (Pointwise Convolution).
In a cascaded network, when the input image contains few targets, the runtime latency of the network has a clear advantage. Moreover, compared with conventional convolution, depthwise separable convolution significantly reduces the parameters of the face detection model while maintaining its accuracy, which in turn significantly reduces the amount of computation during inference on the mobile terminal.
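For illustration only, the following sketch (not part of the application itself; the shapes, names and parameter-count example are assumptions) shows how a depthwise separable convolution splits a standard convolution into a per-channel depthwise filter plus a 1×1 pointwise mix, and why the parameter count drops:

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Minimal (slow) depthwise separable convolution, 'valid' padding, stride 1.

    x          : input feature map, shape (H, W, C_in)
    dw_kernels : one k x k filter per input channel, shape (k, k, C_in)
    pw_kernels : 1 x 1 filters mixing the channels, shape (C_in, C_out)
    """
    k, _, c_in = dw_kernels.shape
    h, w = x.shape[0] - k + 1, x.shape[1] - k + 1
    # Depthwise step: each input channel is filtered independently.
    dw = np.zeros((h, w, c_in))
    for c in range(c_in):
        for i in range(h):
            for j in range(w):
                dw[i, j, c] = np.sum(x[i:i + k, j:j + k, c] * dw_kernels[:, :, c])
    # Pointwise step: a 1x1 convolution mixes the channels.
    return dw @ pw_kernels  # shape (h, w, C_out)

# Parameter comparison for k=3, C_in=16, C_out=32 (illustrative numbers):
k, c_in, c_out = 3, 16, 32
standard = k * k * c_in * c_out           # 4608 weights
separable = k * k * c_in + c_in * c_out   # 144 + 512 = 656 weights, ~7x fewer
print(standard, separable)
```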
In one possible design, at least one processing unit in the above three cascade networks includes a fusion layer, which is obtained by fusing a trained convolution layer and a trained batch normalization (Batch Normalization) layer.
When training the face detection model, adding a Batchnorm layer speeds up the convergence of the network during training and enhances the generalization ability of the model; when the training data are too few or the network capacity is too small, it also reduces the degree of overfitting of the network. However, when the face detection model is deployed to the mobile terminal, the Batchnorm layer needs to store parameters and also occupies the computing resources of the mobile terminal. Therefore, this application fuses the Batchnorm layer with the adjacent convolution layer to reduce the amount of computation required during mobile inference.
As an example, the method for fusing the Batchnorm layer with the convolution layer is as follows:
Step 1: Compute the mean of the input:
μ = (1/n)·Σ_{i=1..n} x_i
Step 2: Compute the variance of the input:
σ² = (1/n)·Σ_{i=1..n} (x_i − μ)²
Step 3: Generate the new convolution layer weights:
ω′ = γ·ω / √(σ² + ε)
Step 4: Generate the new convolution layer bias:
b′ = γ·(b − μ) / √(σ² + ε) + β
Step 5: Recompute the convolution:
y ← ω′·x + b′
where the input x = {x_1, x_2, …, x_n} is the input of the original convolution layer; ω is the weight of the original convolution layer; b is the bias of the original convolution layer; γ is the scaling parameter of the original Batchnorm layer; β is the shift parameter of the original Batchnorm layer; and ε is a small constant for numerical stability.
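A minimal sketch of this fusion, assuming channel-last convolution weights and per-output-channel BN statistics (the function and variable names are illustrative, not from the application):

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a trained BatchNorm layer into the preceding convolution.

    w     : conv weights, shape (k, k, C_in, C_out)
    b     : conv bias, shape (C_out,)
    gamma, beta, mean, var : trained BN parameters, shape (C_out,) each
    Returns (w', b') such that conv'(x) == BN(conv(x)).
    """
    scale = gamma / np.sqrt(var + eps)      # gamma / sqrt(sigma^2 + eps)
    w_fused = w * scale                     # broadcasts over the output-channel axis
    b_fused = (b - mean) * scale + beta     # b' = gamma*(b - mu)/sqrt(sigma^2+eps) + beta
    return w_fused, b_fused
```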
In one possible design, when training the face detection model, the number of convolution kernels in the network can be increased, thereby increasing the number of channels in the middle layers of the network and improving its feature extraction and expression capabilities. Further, when the convolution operation is a depthwise separable convolution, the number of middle-layer channels can be increased by increasing the number of kernels in the pointwise convolution.
In one possible design, after the face detection model is trained, the convolution kernels in the convolution layers are pruned according to the weight of each kernel, and the pruned face detection model is then deployed to the mobile terminal. Specifically, iterative pruning can be used to prune the first cascade network, the second cascade network and the third cascade network in sequence, where any cascade network is processed as follows: according to the weight of each convolution kernel, the M kernels with the smallest weights are pruned; after the pruned cascade network is trained, pruning continues until a set requirement is met. Once the set requirement is met, the channels corresponding to the pruned kernels are deleted.
Figure 3 shows a specific method for pruning redundant convolution kernels provided by an embodiment of this application. As shown in Figure 3, the method includes the following steps:
Step 301: Set the pruning ratio and the number of iterations.
Step 302: Compute the weight of each convolution kernel.
The accumulated absolute value of the parameters in each convolution kernel can be used as the weight of that kernel. Specifically, the weight of a convolution kernel is computed as shown in Formula 1:
s_n = Σ_{i,j,k} |w_{i,j,k}|      (Formula 1)
where W_n and s_n respectively denote the n-th convolution kernel matrix in the network and its weight; w_{i,j,k} denotes an element of the kernel matrix; and i, j, k index the three dimensions of the kernel.
Step 303: Sort the weights of all convolution kernels in the network.
Step 304: Mark the m convolution kernels with the smallest weights and the channels related to those m kernels.
Step 305: Retrain the pruned cascade network.
Step 306: Determine whether the proportion of marked convolution kernels has reached the preset ratio.
If the proportion of marked kernels has not reached the preset ratio, step 302 is executed to perform the next round of pruning; if it has, step 307 is executed.
Step 307: Delete all marked convolution kernels and channels.
The larger the weight of a convolution kernel, the greater its impact on the overall accuracy of the network. Pruning away part of the kernels that have little impact on overall accuracy therefore reduces the amount of computation and memory usage during mobile inference while preserving the detection accuracy of the face detection model. Usually the accuracy of the face detection model drops to some extent after one round of pruning; each cascade network can be retrained on the training set so that the network's parameter distribution adapts to the pruned structure while the model's accuracy is restored. After the pruning ratio of the network is preset, the network is not pruned to the specified ratio in one step; instead, an iterative strategy of staged pruning and retraining is adopted, in which each iteration prunes only part of the network structure and retrains it, reducing the difficulty of restoring model accuracy during retraining.
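One round of the marking step (steps 302–304) might be sketched as follows, using Formula 1's per-kernel L1 weight; the array layout and names are assumptions:

```python
import numpy as np

def mark_smallest_kernels(weights, m):
    """One pruning round: mark the m kernels with the smallest L1 weight.

    weights : conv weights of one layer, shape (k, k, C_in, C_out),
              one kernel per output channel.
    Returns the output-channel indices of the m kernels to prune
    (Formula 1: s_n = sum of |w_{i,j,k}| over the kernel).
    """
    s = np.abs(weights).reshape(-1, weights.shape[-1]).sum(axis=0)  # one score per kernel
    return np.argsort(s)[:m]

# Iterative schedule: mark a few kernels, retrain, repeat until the preset
# ratio is reached, then physically delete the marked kernels and channels.
```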
In one possible design, a quantization node is included before each processing layer of the processing unit; the quantization node converts 32-bit floating-point (FP32) input data into 8-bit fixed-point (INT8/UINT8) data. Specifically, asymmetric per-channel quantization with retraining can be used to convert FP32 data to INT8/UINT8 data.
An inverse quantization node is included after each processing layer of the processing unit; the inverse quantization node restores INT8/UINT8 output data to FP32 data.
Further, if a consecutive inverse quantization node and quantization node exist between two processing layers, they can be merged into one re-quantization node, which reduces the extra computation caused by consecutive quantization and dequantization.
Figure 4 shows the above three kinds of quantization-related nodes inserted into each cascade network. As can be seen from the figure, a quantization node 401 is placed before each processing layer to convert the input FP32 data into INT8/UINT8 data before it is fed into the processing layer. An inverse quantization node 402 is placed after each processing layer to convert the output INT8/UINT8 data back into FP32 data. Between two processing layers, a spatially consecutive inverse quantization node 402 and quantization node 401 are merged into one re-quantization node 403.
In the above technical solution, converting FP32 data into INT8/UINT8 data further compresses the face detection model, achieving storage compression and inference acceleration after the model is deployed to the mobile terminal.
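A minimal sketch of the three node types under asymmetric scale/zero-point quantization (the names and the UINT8 range are assumptions); note how the re-quantization node collapses a dequantize–quantize pair into a single rescale:

```python
import numpy as np

def quantize(x, scale, zero_point):
    """FP32 -> UINT8 with an asymmetric scale/zero-point (node 401)."""
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """UINT8 -> FP32 (node 402)."""
    return (q.astype(np.float32) - zero_point) * scale

def requantize(q, s_in, z_in, s_out, z_out):
    """Merged dequantize->quantize (node 403): one rescale instead of a
    full round trip through FP32 between two processing layers."""
    v = (q.astype(np.float32) - z_in) * s_in / s_out
    return np.clip(np.round(v) + z_out, 0, 255).astype(np.uint8)
```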
In one possible design, at least one parameter in at least one processing unit undergoes model quantization. Model quantization includes weight quantization and activation quantization. Weight quantization quantizes the parameters stored by the model, including weights and bias values; this operation only needs to be performed once, when the face detection model is serialized for storage upon deployment to the mobile terminal, and does not need to be performed during inference. Activation quantization dynamically quantizes the activation values output by each layer of the network. Because the data distribution of the output activations differs greatly from layer to layer, the main approach to activation quantization is to run the face detection model for one epoch on a pre-selected validation set and determine the quantization parameters from the floating-point range of each layer's output activations; once determined, the quantization parameters no longer change during inference.
In one possible design, the face key point detection branch in the network can be retained when training the face detection model, to improve the face detection performance of the network. When the face detection model is deployed to the mobile terminal, the face key point detection branch is pruned away to further reduce the computational load of the network.
In one possible design, the loss function of the face detection model can be as shown in Formula 2:
L = α·FL_classification + β·L_box + L_landmark      (Formula 2)
where FL_classification denotes the face classification loss, with weight α = 2; L_box denotes the regression loss of the prediction box, with weight β = 1.5; and L_landmark denotes the regression loss of the face key points.
(a) Face classification loss
The face classification loss uses the focal loss function (Focal Loss), and all sample data are used to compute the loss so that samples are balanced during training. The loss function for face classification is shown in Formula 3:
FL = −α·(1 − ŷ)^γ·y·log(ŷ) − (1 − α)·ŷ^γ·(1 − y)·log(1 − ŷ)      (Formula 3)
where y ∈ {0, 1} denotes the ground-truth label of the sample; ŷ denotes the network's prediction for the sample; α = 0.25 is the balance factor between positive and negative samples; and γ = 2 is the balance factor between easy and hard samples.
(b) Regression loss of the prediction box
The regression loss of the prediction box uses the Euclidean loss, as shown in Formula 4:
L_box = ‖ŷ_box − y_box‖₂²      (Formula 4)
where y_box denotes the ground-truth coordinates of the sample and ŷ_box denotes the coordinate offset predicted by the network for the sample; both y_box and ŷ_box are represented as four-tuples (x, y, h, w), where x and y denote the horizontal and vertical coordinates of the top-left corner of the detection box, and h and w denote its height and width.
(c) Regression loss of the face key points
The key point prediction uses the Euclidean loss, as shown in Formula 5:
L_landmark = ‖ŷ_landmark − y_landmark‖₂²      (Formula 5)
where y_landmark and ŷ_landmark respectively denote the ground-truth coordinates and the network's predicted coordinates. The face key points include the coordinates of the left and right eyes, the nose, and the left and right corners of the mouth, so ŷ_landmark ∈ R^10.
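Formulas 2–5 can be sketched together as follows; this is a NumPy sketch under the stated weights (α = 2, β = 1.5; focal parameters α = 0.25, γ = 2), and the function names are illustrative:

```python
import numpy as np

def focal_loss(y, p, alpha=0.25, gamma=2.0, eps=1e-7):
    """Formula 3: binary focal loss; y in {0, 1}, p = predicted face probability."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.mean(-alpha * (1 - p) ** gamma * y * np.log(p)
                   - (1 - alpha) * p ** gamma * (1 - y) * np.log(1 - p))

def euclidean_loss(y_true, y_pred):
    """Formulas 4 and 5: squared L2 distance; boxes are (x, y, h, w)
    four-tuples, landmarks are 10-dimensional (5 points x 2 coordinates)."""
    return np.mean(np.sum((y_pred - y_true) ** 2, axis=-1))

def total_loss(y_cls, p_cls, box_t, box_p, lmk_t, lmk_p, alpha=2.0, beta=1.5):
    """Formula 2: L = alpha * FL_classification + beta * L_box + L_landmark."""
    return (alpha * focal_loss(y_cls, p_cls)
            + beta * euclidean_loss(box_t, box_p)
            + euclidean_loss(lmk_t, lmk_p))
```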
When training the face detection model, this application uses the Tensorflow-1.10 deep learning framework, with the WIDER FACE, CelebA, LFW and FDDB training sets and the SGD optimizer (Momentum = 0.9; Weight_decay = 0.0005; Learning rate = 0.01). In addition, flipping, cropping, and brightness and contrast transformations are used for data augmentation.
For a better understanding of the embodiments of this application, the technical solution provided by this application is described below with a specific example. Figure 5 shows a specific face detection model provided by an embodiment of this application; the face detection model includes three cascade networks.
After pyramid transformation, the image to be detected is converted into an image of size 12×12×3, which is used as the input of the first cascade network. After the first depthwise separable convolution operation, 16 images of size 5×5 are generated. After the second depthwise separable convolution operation, 24 images of size 3×3 are generated. After the third depthwise separable convolution operation, 16 images of size 1×1 are generated. Finally, two 1×1×16 convolution kernels generate two 1×1 feature maps for classification; four 1×1×16 kernels generate four 1×1 feature maps for prediction box regression; and ten 1×1×16 kernels generate ten 1×1 feature maps for key point regression.
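Read as a network definition, the first cascade network above might look like the following tf.keras sketch; the stride-2 first layer and 'valid' padding are assumptions chosen so the feature-map sizes match the 5×5 → 3×3 → 1×1 progression in the text, and the layer names are illustrative:

```python
import tensorflow as tf

def first_cascade_network():
    """Sketch of the 12x12x3 first cascade network described above."""
    x = tf.keras.Input(shape=(12, 12, 3))
    h = tf.keras.layers.SeparableConv2D(16, 3, strides=2, activation='relu')(x)  # 5x5x16
    h = tf.keras.layers.SeparableConv2D(24, 3, activation='relu')(h)             # 3x3x24
    h = tf.keras.layers.SeparableConv2D(16, 3, activation='relu')(h)             # 1x1x16
    cls = tf.keras.layers.Conv2D(2, 1, activation='softmax')(h)   # face classification
    box = tf.keras.layers.Conv2D(4, 1)(h)                         # prediction box regression
    lmk = tf.keras.layers.Conv2D(10, 1)(h)                        # key point regression
    return tf.keras.Model(x, [cls, box, lmk])
```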
After pyramid transformation, the prediction boxes output by the first cascade network are converted into an image to be detected of size 24×24×3, which is used as the input of the second cascade network. After the first depthwise separable convolution operation, 48 images of size 11×11 are generated; after the second, 64 images of size 4×4; and after the third, 64 images of size 3×3. Finally, two 1×1×16 convolution kernels generate two 1×1 feature maps for classification; four 1×1×16 kernels generate four 1×1 feature maps for prediction box regression; and ten 1×1×16 kernels generate ten 1×1 feature maps for key point regression.
After pyramid transformation, the prediction boxes output by the second cascade network are converted into an image to be detected of size 48×48×3, which is used as the input of the third cascade network. After the first depthwise separable convolution operation, 48 images of size 23×23 are generated; after the second, 96 images of size 10×10; after the third, 96 images of size 4×4; and after the fourth, 128 images of size 3×3. Finally, two 1×1×16 convolution kernels generate two 1×1 feature maps for classification; four 1×1×16 kernels generate four 1×1 feature maps for prediction box regression; and ten 1×1×16 kernels generate ten 1×1 feature maps for key point regression. The final face classification, prediction box regression and key point regression are output.
This application provides a face detection method based on a neural network. By replacing conventional convolution with depthwise separable convolution, fusing the batch normalization layer into the convolution layer, pruning redundant convolution kernels and inserting quantization nodes into the network, the parameters of the face detection model can be significantly reduced while maintaining its accuracy, thereby reducing the storage space and inference latency of the face detection model on the mobile terminal while ensuring face detection accuracy.
Based on the same technical concept, Figure 6 shows a neural-network-based face detection device provided by an embodiment of this application; the device may be a mobile terminal device. As shown in Figure 6, the face detection device 600 includes:
a prediction module 601, configured to pass an image to be recognized through multiple processing units of a first cascade network to predict N first candidate windows in the image to be recognized, each first candidate window being a preliminarily predicted region containing a face image;
the prediction module 601 being further configured to pass the N first candidate windows through multiple processing units of a second cascade network to predict M second candidate windows in the image to be recognized, each second candidate window being a refined predicted region containing a face image;
a determination module 602, configured to pass the M second candidate windows through multiple processing units of a third cascade network to determine a target window containing a face image in the image to be recognized and the face feature points in the target window; wherein the convolution operation of at least one processing unit is a depthwise separable convolution operation.
Optionally, the at least one processing unit includes a fusion layer; the fusion layer is obtained by fusing a trained convolution layer and a trained batch normalization layer.
Optionally, the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel.
Optionally, the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel, including: performing convolution kernel pruning on the first cascade network, the second cascade network and the third cascade network in sequence, where any cascade network is processed as follows: according to the weight of each convolution kernel, the M kernels with the smallest weights are pruned; after the pruned cascade network is trained, pruning continues until a set requirement is met; once the set requirement is met, the channels corresponding to the pruned kernels are deleted.
Optionally, a quantization node is included before each processing layer of the processing unit, the quantization node being used to convert FP32 input data into INT8/UINT8 data; an inverse quantization node is included after each processing layer of the processing unit, the inverse quantization node being used to restore INT8/UINT8 output data to FP32 data.
Optionally, if a consecutive inverse quantization node and quantization node exist between two processing layers, the consecutive inverse quantization node and quantization node are merged into one re-quantization node.
Optionally, at least one parameter in the at least one processing unit undergoes model quantization. Based on the same technical concept, an embodiment of this application provides a computing device, as shown in Figure 7, including at least one processor 701 and a memory 702 connected to the at least one processor. The embodiments of this application do not limit the specific connection medium between the processor 701 and the memory 702; in Figure 7, a bus connection between the processor 701 and the memory 702 is taken as an example. The bus can be divided into an address bus, a data bus, a control bus, and so on.
In the embodiments of this application, the memory 702 stores instructions executable by the at least one processor 701; by executing the instructions stored in the memory 702, the at least one processor 701 can execute the above face detection method based on a neural network.
The processor 701 is the control center of the computing device; it can use various interfaces and lines to connect the various parts of the computer device, and performs resource configuration by running or executing the instructions stored in the memory 702 and calling the data stored in the memory 702. Optionally, the processor 701 may include one or more processing units, and may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs and the like, while the modem processor mainly handles wireless communication. It can be understood that the modem processor may not be integrated into the processor 701. In some embodiments, the processor 701 and the memory 702 can be implemented on the same chip; in some embodiments, they can also be implemented on separate chips.
The processor 701 can be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in connection with the embodiments of this application can be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
As a non-volatile computer-readable storage medium, the memory 702 can be used to store non-volatile software programs, non-volatile computer-executable programs and modules. The memory 702 may include at least one type of storage medium, for example flash memory, hard disk, multimedia card, card-type memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), programmable read-only memory (Programmable Read Only Memory, PROM), read-only memory (Read Only Memory, ROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), magnetic memory, magnetic disk, optical disc, and so on. The memory 702 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 702 in the embodiments of this application can also be a circuit or any other device capable of realizing a storage function, used to store program instructions and/or data.
Based on the same technical concept, an embodiment of this application further provides a computer-readable storage medium storing a computer-executable program, the computer-executable program being used to cause a computer to execute the neural-network-based face detection method described in any of the above manners.
Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although preferred embodiments of this application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of this application.
Obviously, those skilled in the art can make various changes and variations to this application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to include them.

Claims (10)

  1. A face detection method based on a neural network, characterized in that the method includes:
    passing an image to be recognized through multiple processing units of a first cascade network to predict N first candidate windows in the image to be recognized; each first candidate window is a preliminarily predicted region containing a face image;
    passing the N first candidate windows through multiple processing units of a second cascade network to predict M second candidate windows in the image to be recognized; each second candidate window is a refined predicted region containing a face image;
    passing the M second candidate windows through multiple processing units of a third cascade network to determine a target window containing a face image in the image to be recognized and the face feature points in the target window;
    wherein the convolution operation of at least one processing unit is a depthwise separable convolution operation.
  2. The method according to claim 1, characterized in that the at least one processing unit includes a fusion layer; the fusion layer is obtained by fusing a trained convolution layer and a trained batch normalization layer.
  3. The method according to claim 2, characterized in that the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel.
  4. The method according to claim 3, characterized in that the convolution kernels in the trained convolution layer are obtained by pruning according to the weight of each convolution kernel, including:
    performing convolution kernel pruning on the first cascade network, the second cascade network and the third cascade network in sequence, where any cascade network is processed as follows:
    according to the weight of each convolution kernel, pruning the M convolution kernels with the smallest weights;
    after training the pruned cascade network, continuing convolution kernel pruning until a set requirement is met;
    after the set requirement is met, deleting the channels corresponding to the pruned convolution kernels.
  5. The method according to claim 1, characterized in that a quantization node is included before each processing layer of the processing unit, the quantization node being used to convert FP32 input data into INT8/UINT8 data;
    an inverse quantization node is included after each processing layer of the processing unit, the inverse quantization node being used to restore INT8/UINT8 output data to FP32 data.
  6. The method according to claim 5, characterized in that if a consecutive inverse quantization node and quantization node exist between two processing layers, the consecutive inverse quantization node and quantization node are merged into one re-quantization node.
  7. The method according to any one of claims 1 to 6, characterized in that at least one parameter in the at least one processing unit undergoes model quantization.
  8. A face detection device based on a neural network, characterized by including:
    a prediction module, configured to pass an image to be recognized through multiple processing units of a first cascade network to predict N first candidate windows in the image to be recognized; each first candidate window is a preliminarily predicted region containing a face image;
    the prediction module being further configured to pass the N first candidate windows through multiple processing units of a second cascade network to predict M second candidate windows in the image to be recognized; each second candidate window is a refined predicted region containing a face image;
    the prediction module being further configured to pass the M second candidate windows through multiple processing units of a third cascade network to determine a target window containing a face image in the image to be recognized and the face feature points in the target window;
    wherein the convolution operation of at least one processing unit is a depthwise separable convolution operation.
  9. A computing device, characterized by including:
    a memory, configured to store program instructions;
    a processor, configured to call the program instructions stored in the memory and execute the method according to any one of claims 1 to 7 in accordance with the obtained program instructions.
  10. A computer-readable storage medium, characterized by including computer-readable instructions which, when read and executed by a computer, cause the method according to any one of claims 1 to 7 to be implemented.
PCT/CN2022/141600 2022-07-13 2022-12-23 A face detection method and device based on a neural network WO2024011859A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210828507.0A CN115273183A (zh) 2022-07-13 2022-07-13 A face detection method and device based on a neural network
CN202210828507.0 2022-07-13

Publications (1)

Publication Number Publication Date
WO2024011859A1 true WO2024011859A1 (zh) 2024-01-18

Family

ID=83766322

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141600 WO2024011859A1 (zh) 2022-07-13 2022-12-23 A face detection method and device based on a neural network

Country Status (2)

Country Link
CN (1) CN115273183A (zh)
WO (1) WO2024011859A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273183A (zh) * 2022-07-13 2022-11-01 天翼云科技有限公司 一种基于神经网络的人脸检测方法和装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018003212A1 (ja) * 2016-06-30 2018-01-04 クラリオン株式会社 Object detection device and object detection method
CN107871134A (zh) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A face detection method and device
CN110263774A (zh) * 2019-08-19 2019-09-20 珠海亿智电子科技有限公司 A face detection method
CN110717481A (zh) * 2019-12-12 2020-01-21 浙江鹏信信息科技股份有限公司 A method for implementing face detection using a cascaded convolutional neural network
CN115273183A (zh) 2022-07-13 2022-11-01 天翼云科技有限公司 A face detection method and device based on a neural network


Also Published As

Publication number Publication date
CN115273183A (zh) 2022-11-01

Similar Documents

Publication Publication Date Title
US20210012198A1 (en) Method for training deep neural network and apparatus
WO2019228317A1 (zh) 人脸识别方法、装置及计算机可读介质
WO2021077984A1 (zh) 对象识别方法、装置、电子设备及可读存储介质
WO2021068323A1 (zh) 多任务面部动作识别模型训练方法、多任务面部动作识别方法、装置、计算机设备和存储介质
EP4099220A1 (en) Processing apparatus, method and storage medium
CN106897746B (zh) 数据分类模型训练方法和装置
US20220172518A1 (en) Image recognition method and apparatus, computer-readable storage medium, and electronic device
CN109829448B (zh) 人脸识别方法、装置及存储介质
US20080187213A1 (en) Fast Landmark Detection Using Regression Methods
CN111488985B (zh) 深度神经网络模型压缩训练方法、装置、设备、介质
WO2021184902A1 (zh) 图像分类方法、装置、及其训练方法、装置、设备、介质
CN109492674B (zh) 用于目标检测的ssd框架的生成方法及装置
US20240143977A1 (en) Model training method and apparatus
CN109961107B (zh) 目标检测模型的训练方法、装置、电子设备及存储介质
CN111368672A (zh) 一种用于遗传病面部识别模型的构建方法及装置
WO2023284182A1 (en) Training method for recognizing moving target, method and device for recognizing moving target
Gao et al. Face detection algorithm based on improved TinyYOLOv3 and attention mechanism
WO2022152104A1 (zh) 动作识别模型的训练方法及装置、动作识别方法及装置
KR20210100592A (ko) 휴리스틱 가우스 클라우드 변환에 기반하는 얼굴인식 기술
WO2024011859A1 (zh) 一种基于神经网络的人脸检测方法和装置
WO2023109361A1 (zh) 用于视频处理的方法、系统、设备、介质和产品
CN112561926A (zh) 三维图像分割方法、系统、存储介质及电子设备
CN113011568A (zh) 一种模型的训练方法、数据处理方法及设备
WO2022156475A1 (zh) 神经网络模型的训练方法、数据处理方法及装置
CN113487610B (zh) 疱疹图像识别方法、装置、计算机设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22950966

Country of ref document: EP

Kind code of ref document: A1