CN112115783B - Depth knowledge migration-based face feature point detection method, device and equipment - Google Patents

Depth knowledge migration-based face feature point detection method, device and equipment

Info

Publication number
CN112115783B
CN112115783B (application CN202010809064.1A)
Authority
CN
China
Prior art keywords
face
network
feature
student
teacher
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010809064.1A
Other languages
Chinese (zh)
Other versions
CN112115783A (en)
Inventor
吕科
高鹏程
薛健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences filed Critical University of Chinese Academy of Sciences
Priority to CN202010809064.1A priority Critical patent/CN112115783B/en
Publication of CN112115783A publication Critical patent/CN112115783A/en
Application granted granted Critical
Publication of CN112115783B publication Critical patent/CN112115783B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The embodiment of the invention discloses a face feature point detection method, device and equipment based on depth knowledge migration, wherein the method comprises the following steps: providing a face data set, and cropping face images according to the face detection frames or the bounding boxes of the face feature points provided by the face data set to obtain a training set, a verification set and a test set; inputting test samples and training samples into an initial face alignment network framework; training a teacher network and a student network in the initial face alignment network framework with PyTorch until the loss function and the maximum iteration number meet a preset condition, generating a training model; freezing the model parameters of the teacher network, extracting the deep dark knowledge learned by the teacher network, and transferring it to the student network to generate the final face alignment network model; and inputting an RGB face image from a natural scene into the final face alignment network model, and outputting the face feature point detection result. The method attains high face feature point detection accuracy while keeping model parameters and computational complexity low.

Description

Depth knowledge migration-based face feature point detection method, device and equipment
Technical Field
The embodiment of the invention relates to the field of computer vision and digital image processing, in particular to a face feature point detection method, device and equipment based on depth knowledge migration.
Background
Existing face feature point detection methods cannot effectively handle feature point localization in natural scenes. Complex methods carry huge model parameters and high computational complexity and cannot meet running-speed requirements, while simple methods cannot cope with interference from factors such as extreme pose, variable illumination and severe occlusion in natural scenes, so their accuracy falls short of application requirements.
Disclosure of Invention
The embodiment of the invention aims to provide a face feature point detection method, device and equipment based on depth knowledge migration, so as to solve the problems of high computational complexity, low running speed and low accuracy in existing face feature point detection.
In order to achieve the above purpose, the embodiment of the present invention mainly provides the following technical solutions:
in a first aspect, an embodiment of the present invention provides a face feature point detection method based on depth knowledge migration, including:
s1: providing a face data set containing face feature point labels, and cropping face images according to the face detection frames or the bounding boxes of the face feature points provided by the face data set to obtain a training set, a verification set and a test set;
s2: obtaining training samples from the training set, obtaining test samples from the test set, and inputting the test samples and the training samples into an initial face alignment network framework;
s3: setting parameters of a convolutional neural network, and training a teacher network and a student network in the initial face alignment network framework with PyTorch until the loss function and the maximum iteration number meet preset conditions, generating a training model;
s4: freezing model parameters of a teacher network, extracting deep dark knowledge learned by the teacher network, transmitting the deep dark knowledge to the student network, and supervising the training process of the student network to generate a final face alignment network model;
s5: and inputting the RGB face image in the natural scene into the final face alignment network model, and outputting a face feature point detection result.
In one embodiment of the present invention, step S1 includes:
s1-1: providing a WFLW data set, wherein the WFLW data set comprises N training pictures and M test pictures, each picture is provided with a picture tag comprising face frame information, face feature point position information and several kinds of attribute information, and N and M are positive integers greater than zero;
s1-2: cropping face images according to the face detection frames provided by the face data set, perturbing the face detection frames, and applying random rotation, size scaling and flipping to the face images for data augmentation, obtaining the training set, the verification set and the test set.
In one embodiment of the invention, the initial face alignment network framework is generated by:
generating the teacher network by adopting an encoder-decoder network structure, wherein the decoder of the teacher network comprises three combinations of an up-sampling layer and a convolution layer; the encoder performs feature extraction and encoding on the input image, retaining the feature extraction part of the original network while removing the final average pooling layer, the fully connected layer used for classification and the final dimension-raising 1×1 convolution layer;
adding the decoder after the encoder, spatially up-sampling the image features extracted by the encoder to obtain feature maps, converting the channel dimension of the feature maps into the number of face feature points, and computing the expected face feature point coordinates on each transformed feature map with a spatial softargmax operation;
providing a student network with the EfficientFAN structure for the final face feature point detection, wherein the decoder of the student network likewise comprises three combinations of an up-sampling layer and a convolution layer; EfficientNet-B0 serves as the backbone of the student network encoder, with the final average pooling layer, the fully connected layer used for classification and the final dimension-raising 1×1 convolution layer of EfficientNet-B0 removed;
adding a 1×1 convolution layer after the student network decoder, converting the channel number of the feature map obtained by the decoder's up-sampling into the number of face feature points, and computing the face feature point coordinates on the converted feature map with the spatial softargmax operation.
In one embodiment of the present invention, step S3 includes:
training the teacher network and the student network separately, using the feature point loss function $L_P$ to optimize network parameters, wherein $L_P$ is computed with the Wing loss function, expressed as follows:

$$L_P = \sum_{i=1}^{2N} f\big(|P_i - G_i|\big), \qquad f(x) = \begin{cases} \omega \ln\!\left(1 + x/\epsilon\right), & x < \omega \\ x - C, & \text{otherwise} \end{cases}$$

wherein $P \in \mathbb{R}^{1 \times 2N}$ is the predicted face feature point coordinate vector, $G \in \mathbb{R}^{1 \times 2N}$ is the ground-truth face feature point coordinate vector, $N$ is the number of face feature points, $\omega$ and $\epsilon$ are preset parameters of $f(x)$, and $C = \omega - \omega \ln(1 + \omega/\epsilon)$ is a constant.
In one embodiment of the present invention, in step S4, extracting deep dark knowledge learned by the teacher network includes:
extracting pixel distribution information on the feature maps based on the feature alignment knowledge distillation method, and aligning the pixel distributions of the feature maps of the teacher network and the student network, wherein the feature alignment knowledge distillation loss function is as follows:

$$L_{FA} = \big\lVert A - \phi(B) \big\rVert_2^2$$

wherein $A$ and $B$ are the feature maps of the teacher network and the student network at the same stage, respectively, and $\phi(\cdot)$ is a 1×1 convolution layer used to align the channel dimensions of the two feature maps $A$ and $B$.
In one embodiment of the present invention, in step S4, transferring the deep dark knowledge to the student network includes:
extracting face structure information at different scales with a knowledge distillation method based on block similarity, and transferring the structured information of the face image from the teacher network to the student network.
In a second aspect, an embodiment of the present invention further provides a face feature point detection device based on depth knowledge migration, including:
a providing module, configured to provide a face data set containing face feature point labels, and to crop face images according to the face detection frames or the bounding boxes of the face feature points provided by the face data set to obtain a training set, a verification set and a test set;
an output module;
a control processing module, configured to obtain training samples from the training set, obtain test samples from the test set, and input the test samples and the training samples into an initial face alignment network framework; the control processing module is further configured to set parameters of a convolutional neural network and train a teacher network and a student network in the initial face alignment network framework with PyTorch until the loss function and the maximum iteration number meet preset conditions, generating a training model; the control processing module is further configured to freeze the model parameters of the teacher network, extract the deep dark knowledge learned by the teacher network, transfer the deep dark knowledge to the student network, and supervise the training process of the student network to generate the final face alignment network model; the control processing module is further configured to input an RGB face image from a natural scene into the final face alignment network model and to output the face feature point detection result through the output module.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: at least one processor and at least one memory; the memory is used for storing one or more program instructions; the processor is configured to execute one or more program instructions to perform the face feature point detection method based on depth knowledge migration according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium containing one or more program instructions which, when executed, perform the face feature point detection method based on depth knowledge migration according to the first aspect.
The technical scheme provided by the embodiment of the invention has at least the following advantages:
According to the face feature point detection method, device and equipment based on depth knowledge migration provided by the embodiments of the invention, EfficientFAN is adopted as a simple and effective lightweight model; its decoder structure, based on up-sampling and depthwise separable convolution, quickly restores the spatial resolution of the feature maps while effectively preserving their spatial information.
Compared with current state-of-the-art large, complex models, the invention reaches comparable face feature point detection accuracy while markedly reducing model parameters and computational complexity.
The invention uses knowledge distillation and a knowledge migration module to improve the face feature point localization accuracy of the student network EfficientFAN: a block similarity knowledge distillation method is proposed to learn multi-scale structural information of the face, and it is combined with feature alignment knowledge distillation, which learns the pixel distribution information on the feature maps, to jointly supervise and guide the training of EfficientFAN. Without changing the network structure or increasing the model parameters, EfficientFAN obtains more accurate face feature point detection results through this knowledge migration method. Experimental results on public data sets show that EfficientFAN is a simple and effective face feature point detection network and that the knowledge distillation method effectively improves detection accuracy. Overall, EfficientFAN delivers excellent performance, combining accuracy and speed.
Drawings
Fig. 1 is a flowchart of a face feature point detection method based on depth knowledge migration.
Fig. 2 is a block diagram of a face feature point detection device based on depth knowledge migration according to the present invention.
Detailed Description
Further advantages and effects of the present invention will become apparent to those skilled in the art from the disclosure of the present invention, which is described by the following specific examples.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In the description of the present invention, it is to be understood that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly stated and limited otherwise, the term "connected" is to be construed broadly and may, for example, denote a direct connection or an indirect connection through an intermediary. The specific meaning of the above terms in the present invention will be understood by those of ordinary skill in the art according to the specific circumstances.
Fig. 1 is a flowchart of a face feature point detection method based on depth knowledge migration. As shown in fig. 1, the face feature point detection method based on depth knowledge migration of the present invention includes:
s1: providing a face data set containing face feature point labels, and cropping face images according to the face detection frames or the bounding boxes of the face feature points provided by the face data set to obtain a training set, a verification set and a test set.
Specifically, step S1 includes:
s1-1: A WFLW dataset is provided. The dataset originates from the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018 and contains 10000 pictures (7500 training pictures and 2500 test pictures). Each picture tag provides face frame information, the positions of 98 face feature points, and 6 kinds of attribute information (pose, expression, illumination, make-up, occlusion, blur), and the dataset is divided into 6 subsets according to the image attribute information.
S1-2: cutting a face image according to a face detection frame provided by a face data set, disturbing the face detection frame, and applying random rotation, size scaling and overturning to the face image so as to enhance data and obtain a training set, a verification set and a test set.
S2: training samples are obtained from the training set, test samples are obtained from the testing set, and the test samples and the training samples are input into the initial face alignment network framework.
Specifically, the teacher network adopts an encoder-decoder network architecture, using EfficientNet-B7 as the backbone of its encoder. The encoder performs feature extraction and encoding on the input image: only the feature extraction part of the original network is preserved, removing the final average pooling layer and the fully connected layer used for classification as well as the final dimension-raising 1×1 convolution layer, so that features are extracted from the last inverted residual module. Compared with the feature map after the 1×1 convolution layer, the feature map extracted by the teacher network has fewer channels (640 vs. 2048); it therefore retains more of the original feature information, loses no information through dimension raising, and, being low-dimensional, is better suited to analysis by the decoder.
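As an illustrative sketch, such a head-stripped encoder can be derived from a stock EfficientNet-B7; the torchvision packaging assumed below (a `features` stack whose last stage is the dimension-raising 1×1 convolution) is an assumption and not part of the original disclosure.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b7

class TeacherEncoder(nn.Module):
    """EfficientNet-B7 backbone with the classification head removed:
    the final dimension-raising 1x1 conv, average pooling and fully
    connected layer are dropped, so features come from the last
    inverted residual block (640 channels)."""
    def __init__(self, pretrained: bool = True):
        super().__init__()
        backbone = efficientnet_b7(weights="DEFAULT" if pretrained else None)
        # backbone.features[-1] is the 1x1 conv that raises the channel
        # dimension for classification; keep everything before it.
        self.features = nn.Sequential(*list(backbone.features.children())[:-1])

    def forward(self, x):
        return self.features(x)

encoder = TeacherEncoder(pretrained=False)
print(encoder(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 640, 8, 8])
```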
A decoder is added after the last inverted residual module of EfficientNet-B7 to spatially up-sample the image features extracted by the encoder. The spatial dimension of the feature map is raised by a more natural up-sampling method: the combination of an up-sampling layer and a convolution layer replaces deconvolution, first up-sampling the feature map spatially with a generic up-sampling method and then applying a convolution on the up-sampled feature map to enrich its transformation.
The invention uses a combination of three up-sampling layers and convolution layers as the decoder of the face alignment network, added after the encoder. Depthwise separable convolutions replace conventional convolution operations in the network model, reducing the amount of computation in the up-sampling process.
Specifically, the scale factor of each up-sampling layer is set to 2, that is, the length and width of the feature map obtained by each up-sampling are double those of the input feature map, and the up-sampling of the feature map is realized with the nearest-neighbour interpolation algorithm. A 1×1 convolution layer after the decoder generates spatial heatmaps and converts the channel dimension of the feature map into the number of face feature points. The expected face feature point coordinates are then computed on each transformed feature map with the spatial softargmax operation.
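A sketch of such a decoder is given below; the 128-channel width matches the student network described later, while the use of BatchNorm/ReLU inside the depthwise separable convolutions and the 98-point default are assumptions.

```python
import torch.nn as nn

def ds_conv(cin, cout, k=3):
    """Depthwise separable convolution: depthwise k x k + pointwise 1x1."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, k, padding=k // 2, groups=cin, bias=False),
        nn.BatchNorm2d(cin), nn.ReLU(inplace=True),
        nn.Conv2d(cin, cout, 1, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class Decoder(nn.Module):
    """Three (x2 nearest-neighbour upsample, depthwise separable conv)
    stages, then a 1x1 conv mapping channels to one heatmap per point."""
    def __init__(self, cin, width=128, n_points=98):
        super().__init__()
        stages = []
        for _ in range(3):
            stages += [nn.Upsample(scale_factor=2, mode="nearest"),
                       ds_conv(cin, width)]
            cin = width
        self.stages = nn.Sequential(*stages)
        self.head = nn.Conv2d(width, n_points, kernel_size=1)

    def forward(self, x):                  # (B, cin, H, W)
        return self.head(self.stages(x))   # (B, n_points, 8H, 8W)
```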
The spatial softargmax operation can be divided into two steps. In the first step, the output feature map $H$ is normalized with a softmax operation, which can be expressed as:

$$M(x, y) = \frac{\exp\big(H(x, y)\big)}{\sum_{x'} \sum_{y'} \exp\big(H(x', y')\big)}$$

where $x, y$ are pixel indexes, $\exp$ denotes the exponential function, and $M$ is the normalized feature map. In the second step, the coordinates $P_l$ of feature point $l$ are finally expressed as the expected pixel coordinates under $M$:

$$P_l = \left( \sum_x \sum_y x\, M(x, y),\ \sum_x \sum_y y\, M(x, y) \right)$$
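The two steps translate directly into a differentiable operator over a batch of heatmaps; a sketch:

```python
import torch

def spatial_softargmax(heatmaps):
    """heatmaps: (B, N, H, W) -> coordinates (B, N, 2) in pixel units.
    Step 1: softmax-normalise each heatmap into a distribution M.
    Step 2: take the expected (x, y) coordinate under M."""
    b, n, h, w = heatmaps.shape
    m = torch.softmax(heatmaps.reshape(b, n, -1), dim=-1).reshape(b, n, h, w)
    xs = torch.arange(w, dtype=m.dtype, device=m.device)
    ys = torch.arange(h, dtype=m.dtype, device=m.device)
    ex = (m.sum(dim=2) * xs).sum(dim=-1)  # marginalise rows, then E[x]
    ey = (m.sum(dim=3) * ys).sum(dim=-1)  # marginalise columns, then E[y]
    return torch.stack([ex, ey], dim=-1)
```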
A small and lightweight student network, called the Efficient Face Alignment Network (EfficientFAN), has a network structure similar to the teacher network and is used for the final face feature point detection. EfficientNet-B0 serves as the backbone of the student network's encoder. As with the teacher network, the encoder of the student network removes the final average pooling layer and fully connected layer used for classification in EfficientNet-B0, as well as the final dimension-raising 1×1 convolution layer.
Likewise, a combination of three up-sampling layers and convolution layers is used as the decoder of the student network, added after the encoder. The scale factor of each up-sampling layer is 2 and the number of output channels of each convolution layer is 128. A 1×1 convolution layer is added after the decoder of the student network, converting the channel number of the feature map obtained by the decoder's up-sampling from 128 to the number of face feature points.
Finally, the face feature point coordinates are computed on the converted feature map with the spatial softargmax operation.
Table 1 Student network structure
The specific structure of the student network is shown in Table 1, where MBConv denotes the mobile inverted bottleneck module (Mobile Inverted Bottleneck) used by EfficientNet, DSConv denotes a depthwise separable convolution, and k denotes the size of the convolution kernel.
The teacher network located above and the student network located below are organically linked together through a knowledge migration (Knowledge Transfer) module.
The high-efficiency face alignment network based on depth knowledge migration uses two knowledge distillation methods, so that different types of dark knowledge are migrated from a teacher network to a student network EfficientFAN.
The knowledge distillation method for feature alignment extracts pixel distribution information on the feature map, and aligns pixel distribution of the teacher network and the student network feature map, so that the feature map distribution of the student network is close to the distribution of the teacher network.
Correspondingly, the knowledge distillation method of the block similarity extracts the face structure information under different scales, and the structured information of the face image is transmitted to the student network from the teacher network, so that the simple student network can learn the face structure information of the current image.
Feature alignment distillation aligns the channel dimensions of the feature maps at the same stage of the teacher network and the student network, and directly uses the difference between the teacher feature map and the aligned student feature map as supervision information in the student network training process.
S3: setting parameters of a convolutional neural network, and training a teacher network and a student network in an initial face alignment network framework by using Pytorch until a loss function and the maximum iteration number meet preset conditions to generate a training model.
Specifically, the teacher network and the student network are first trained separately, using the feature point loss function $L_P$ to optimize network parameters. The feature point loss function $L_P$ is computed with the Wing loss function, which can be expressed as follows:

$$L_P = \sum_{i=1}^{2N} f\big(|P_i - G_i|\big), \qquad f(x) = \begin{cases} \omega \ln\!\left(1 + x/\epsilon\right), & x < \omega \\ x - C, & \text{otherwise} \end{cases}$$

wherein $P \in \mathbb{R}^{1 \times 2N}$ is the predicted face feature point coordinate vector, $G \in \mathbb{R}^{1 \times 2N}$ is the ground-truth face feature point coordinate vector, and $N$ is the number of face feature points. $f(x)$ is a specially designed loss function: for small errors it behaves as a logarithmic loss function with an offset, and for larger errors as an L1 loss function; $\omega$ and $\epsilon$ are preset parameters of $f(x)$, and $C = \omega - \omega \ln(1 + \omega/\epsilon)$ is a constant.
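A sketch of this loss follows; the parameter values ω = 10 and ε = 2 are taken from the original Wing loss paper and are assumptions here, not values fixed by this disclosure.

```python
import math
import torch

def wing_loss(pred, target, omega=10.0, epsilon=2.0):
    """Wing loss over landmark coordinate vectors of shape (B, 2N):
    logarithmic for errors below omega, L1 minus a constant above."""
    x = (pred - target).abs()
    C = omega - omega * math.log(1.0 + omega / epsilon)
    loss = torch.where(x < omega,
                       omega * torch.log(1.0 + x / epsilon),
                       x - C)
    return loss.mean()
```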
S4: the model parameters of the teacher network are frozen, deep dark knowledge learned by the teacher network is extracted, the deep dark knowledge is transmitted to the student network, and the training process of the student network is supervised to generate a final face alignment network model.
Specifically, the feature alignment knowledge distillation method extracts the pixel distribution information on the feature maps and aligns the pixel distributions of the feature maps of the teacher network and the student network, so that the feature map distribution of the student network approaches that of the teacher network. The feature alignment knowledge distillation loss function can be defined as follows:

$$L_{FA} = \big\lVert A - \phi(B) \big\rVert_2^2$$

wherein $A$ and $B$ are the feature maps of the teacher network and the student network at the same stage, respectively, and $\phi(\cdot)$ is a 1×1 convolution layer used to align the channel dimensions of the two feature maps $A$ and $B$.
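A sketch of this term, with the mean squared difference assumed as the distance between the teacher feature map and the channel-aligned student feature map:

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignLoss(nn.Module):
    """L_FA: align the student feature map B (C' channels) to the teacher
    feature map A (C channels, same spatial size) with a 1x1 conv phi,
    then penalise the pixel-wise difference."""
    def __init__(self, c_student, c_teacher):
        super().__init__()
        self.phi = nn.Conv2d(c_student, c_teacher, kernel_size=1)

    def forward(self, feat_teacher, feat_student):
        # The teacher is frozen, so gradients flow only through the student.
        return F.mse_loss(self.phi(feat_student), feat_teacher.detach())
```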
The knowledge distillation method of the block similarity extracts the face structure information under different scales, and transmits the structured information of the face image to the student network from the teacher network, so that the simple student network can learn the face structure information of the current image.
Relationship graphs of different scales are constructed for the input feature maps, and similarity matrices are computed on the constructed graphs. A feature map of size $H \times W$ can be partitioned into local blocks of different sizes, and the size generally satisfies $H = W = 2^n$. Taking the whole feature map as a connected domain, relationship graphs are built with local blocks of different sizes as nodes, where the nodes can be set as local blocks of size $2^k \times 2^k$, $k = 0, 1, \ldots, n-1$. For a $2^n \times 2^n$ feature map, the relationship graph built with nodes of size $2^k \times 2^k$ contains $2^{n-k} \times 2^{n-k}$ local blocks, i.e. relationship nodes. For simplicity, each $2^k \times 2^k$ local block is aggregated into a $1 \times 1$ graph node with an average pooling operation. For a feature map with $C$ channels, the vectorization of the $i$-th node of the constructed relationship graph can be expressed as $f_i \in \mathbb{R}^C$. The similarity between nodes in the relationship graph is measured with the cosine similarity of their vectors; the similarity $a_{ij}$ between the $i$-th node vector $f_i$ and the $j$-th node vector $f_j$ is computed as:

$$a_{ij} = \frac{f_i^\top f_j}{\lVert f_i \rVert_2 \, \lVert f_j \rVert_2}$$

Specifically, the intermediate feature maps of the teacher network and the student network at the same stage have the same resolution but different channel numbers. Assume the feature map of the teacher network is $A \in \mathbb{R}^{C \times H \times W}$ and that of the student network is $B \in \mathbb{R}^{C' \times H \times W}$. In the relationship graph constructed on a feature map with $2^k \times 2^k$ local blocks as nodes, the number of nodes is $4^{n-k}$, and computing the pairwise similarities between nodes yields a similarity matrix of size $4^{n-k} \times 4^{n-k}$. Let $a_{ij}^{T,k}$ denote the cosine similarity between the $i$-th and $j$-th nodes of the relationship graph built on the teacher network feature map with $2^k \times 2^k$ local blocks as nodes, and let $a_{ij}^{S,k}$ denote the corresponding similarity for the student network feature map. The loss function of the block similarity knowledge distillation method can then be summarized as follows, where the size of the feature map satisfies $H = W = 2^n$:

$$L_{PS} = \sum_{k=0}^{n-1} \frac{1}{4^{n-k} \cdot 4^{n-k}} \sum_{i=1}^{4^{n-k}} \sum_{j=1}^{4^{n-k}} \left( a_{ij}^{T,k} - a_{ij}^{S,k} \right)^2$$
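A sketch of the block similarity term over a set of scales, aggregating blocks by average pooling and comparing the teacher and student cosine-similarity matrices with a mean squared difference as reconstructed above:

```python
import torch.nn.functional as F

def block_similarity(feat, k):
    """Cosine-similarity matrix of the relationship graph whose nodes are
    2^k x 2^k local blocks of a (B, C, H, W) feature map."""
    pooled = F.avg_pool2d(feat, kernel_size=2 ** k)      # aggregate blocks
    b, c = pooled.shape[:2]
    nodes = pooled.reshape(b, c, -1).transpose(1, 2)     # (B, M, C)
    nodes = F.normalize(nodes, dim=-1)                   # unit node vectors
    return nodes @ nodes.transpose(1, 2)                 # (B, M, M)

def block_similarity_loss(feat_teacher, feat_student, scales=(0, 1, 2)):
    """L_PS: match teacher and student similarity matrices over scales."""
    loss = 0.0
    for k in scales:
        a_t = block_similarity(feat_teacher.detach(), k)
        a_s = block_similarity(feat_student, k)
        loss = loss + F.mse_loss(a_s, a_t)
    return loss
```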
Combining the feature alignment knowledge distillation method and the block similarity knowledge distillation method, the knowledge transfer loss function $L_{KT}$ is introduced as part of the network training loss to supervise the training process of the student network. The student network thus learns not only the ground-truth label information provided by the annotated face feature point coordinates, but also the finer face structure knowledge and data distribution knowledge extracted from the teacher network. To optimize the performance of the student network EfficientFAN with the knowledge migration module and knowledge distillation, the parameters of the pre-trained teacher network are kept frozen, the knowledge transfer loss function $L_{KT}$ is added to the training loss, and the dark knowledge distilled from the teacher network during EfficientFAN training is transferred to the student network, improving its face feature point localization accuracy. The loss function finally used to optimize the student network EfficientFAN combines the feature point loss function $L_P$ with $L_{KT}$:

$$L = L_P + \lambda L_{KT}, \qquad L_{KT} = \sum_d \left( L_{PS}^{(d)} + L_{FA}^{(d)} \right)$$

where $\lambda$ is an adjustable weight parameter balancing the effects of the two loss functions, and $L_{PS}^{(d)}$ and $L_{FA}^{(d)}$ are, respectively, the block similarity knowledge distillation loss function and the feature alignment knowledge distillation loss function of decoder stage $d$.
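Reusing the sketches above, one training step of the student under the combined loss might look as follows; the assumption that both networks return their per-stage decoder feature maps alongside the predicted coordinates, and the weight λ = 0.5, are illustrative and not part of the original disclosure.

```python
import torch

lam = 0.5  # assumed weight balancing L_P against L_KT

def train_step(teacher, student, align_losses, images, gt_landmarks, optimizer):
    """One step of L = L_P + lam * sum_d (L_PS^d + L_FA^d).
    `align_losses` holds one FeatureAlignLoss per decoder stage; its 1x1
    convs are trained jointly, so register them with the optimizer."""
    with torch.no_grad():                  # teacher parameters stay frozen
        t_feats, _ = teacher(images)
    s_feats, pred = student(images)
    loss = wing_loss(pred, gt_landmarks)   # supervised landmark term L_P
    for t_f, s_f, fa in zip(t_feats, s_feats, align_losses):
        loss = loss + lam * (block_similarity_loss(t_f, s_f) + fa(t_f, s_f))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```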
S5: and inputting the RGB face image in the natural scene into a final face alignment network model, and outputting a face feature point detection result.
According to the face feature point detection method based on depth knowledge migration provided by the embodiments of the invention, EfficientFAN is adopted as a simple and effective lightweight model; its decoder structure, based on up-sampling and depthwise separable convolution, quickly restores the spatial resolution of the feature maps while effectively preserving their spatial information.
Compared with current state-of-the-art large, complex models, the invention reaches comparable face feature point detection accuracy while markedly reducing model parameters and computational complexity.
The invention uses knowledge distillation and a knowledge migration module to improve the face feature point localization accuracy of the student network EfficientFAN: a block similarity knowledge distillation method is proposed to learn multi-scale structural information of the face, and it is combined with feature alignment knowledge distillation, which learns the pixel distribution information on the feature maps, to jointly supervise and guide the training of EfficientFAN. Without changing the network structure or increasing the model parameters, EfficientFAN obtains more accurate face feature point detection results through this knowledge migration method. Experimental results on public data sets show that EfficientFAN is a simple and effective face feature point detection network and that the knowledge distillation method effectively improves detection accuracy. Overall, EfficientFAN delivers excellent performance, combining accuracy and speed.
Fig. 2 is a block diagram of a face feature point detection device based on depth knowledge migration according to the present invention. As shown in fig. 2, the face feature point detection device based on depth knowledge migration of the present invention includes a providing module 100, an output module 200 and a control processing module 300.
The providing module 100 is configured to provide a face data set containing face feature point labels and to crop face images according to the face detection frames or the bounding boxes of the face feature points provided by the face data set to obtain a training set, a verification set and a test set. The control processing module 300 is configured to obtain training samples from the training set, obtain test samples from the test set, and input the test samples and the training samples into the initial face alignment network framework. The control processing module 300 is further configured to set parameters of the convolutional neural networks and to train the teacher network and the student network in the initial face alignment network framework with PyTorch until the loss function and the maximum iteration number meet preset conditions, generating a training model. The control processing module 300 is further configured to freeze the model parameters of the teacher network, extract the deep dark knowledge learned by the teacher network, transfer the deep dark knowledge to the student network, and supervise the training process of the student network to generate the final face alignment network model. The control processing module 300 is further configured to input an RGB face image from a natural scene into the final face alignment network model and to output the face feature point detection result through the output module.
It should be noted that the specific implementation of the face feature point detection device based on depth knowledge migration in the embodiment of the present invention is similar to that of the face feature point detection method based on depth knowledge migration; please refer to the description of the method, which is not repeated here to reduce redundancy.
In addition, other structures and functions of the facial feature point detection device based on depth knowledge migration according to the embodiments of the present invention are known to those skilled in the art, and in order to reduce redundancy, description is omitted.
The embodiment of the invention also provides electronic equipment, which comprises: at least one processor and at least one memory; the memory is used for storing one or more program instructions; the processor is configured to execute one or more program instructions to perform the face feature point detection method based on depth knowledge migration according to the first aspect.
The disclosed embodiments provide a computer readable storage medium having stored therein computer program instructions that, when executed on a computer, cause the computer to perform the above-described depth knowledge migration-based face feature point detection method.
In the embodiment of the invention, the processor may be an integrated circuit chip with signal processing capability. The processor may be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field programmable gate array (Field Programmable Gate Array, FPGA for short), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components.
The disclosed methods, steps, and logic blocks in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The processor reads the information in the storage medium and, in combination with its hardware, performs the steps of the above method.
The storage medium may be memory, for example, may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.
The nonvolatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable ROM (Electrically EPROM, EEPROM), or a flash Memory.
The volatile memory may be a random access memory (Random Access Memory, RAM for short) which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (Double Data Rate SDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), and direct memory bus RAM (Direct Rambus RAM, DRRAM).
The storage media described in embodiments of the present invention are intended to comprise, without being limited to, these and any other suitable types of memory.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in a combination of hardware and software. When the software is applied, the corresponding functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on the computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention in further detail, and are not to be construed as limiting the scope of the invention, but are merely intended to cover any modifications, equivalents, improvements, etc. based on the teachings of the invention.

Claims (5)

1. A face feature point detection method based on depth knowledge migration, characterized by comprising the following steps:
s1: providing a face data set containing face feature point labels, and cropping face images according to the face detection frames or the bounding boxes of the face feature points provided by the face data set to obtain a training set, a verification set and a test set;
s2: obtaining training samples from the training set, obtaining test samples from the test set, and inputting the test samples and the training samples into an initial face alignment network framework;
s3: setting parameters of a convolutional neural network, and training a teacher network and a student network in the initial face alignment network framework with PyTorch until the loss function and the maximum iteration number meet preset conditions, generating a training model;
s4: freezing model parameters of a teacher network, extracting deep dark knowledge learned by the teacher network, transmitting the deep dark knowledge to the student network, and supervising the training process of the student network to generate a final face alignment network model;
s5: inputting RGB face images in a natural scene into the final face alignment network model, and outputting a face feature point detection result;
the initial face alignment network framework is generated by:
generating a teacher network by adopting an encoder-decoder network structure, wherein the decoder of the teacher network comprises three combinations of an up-sampling layer and a convolution layer; the encoder performs feature extraction and encoding on the input image, retaining the feature extraction part of the original network while removing the final average pooling layer, the fully connected layer used for classification and the final dimension-raising 1×1 convolution layer;
adding the decoder after the encoder, spatially up-sampling the image features extracted by the encoder to obtain feature maps, converting the channel dimension of the feature maps into the number of face feature points, and computing the expected face feature point coordinates on each transformed feature map with a spatial softargmax operation;
providing a student network with the EfficientFAN structure for the final face feature point detection, wherein the decoder of the student network likewise comprises three combinations of an up-sampling layer and a convolution layer; EfficientNet-B0 serves as the backbone of the student network encoder, with the final average pooling layer, the fully connected layer used for classification and the final dimension-raising 1×1 convolution layer of EfficientNet-B0 removed;
adding a 1×1 convolution layer after the student network decoder, converting the channel number of the feature map obtained by the decoder's up-sampling into the number of face feature points, and computing the face feature point coordinates on the converted feature map with the spatial softargmax operation;
the step S3 comprises the following steps:
training the teacher network and the student network separately, using the feature point loss function $L_P$ to optimize network parameters, wherein $L_P$ is computed with the Wing loss function, expressed as follows:

$$L_P = \sum_{i=1}^{2N} f\big(|P_i - G_i|\big), \qquad f(x) = \begin{cases} \omega \ln\!\left(1 + x/\epsilon\right), & x < \omega \\ x - C, & \text{otherwise} \end{cases}$$

wherein $P \in \mathbb{R}^{1 \times 2N}$ is the predicted face feature point coordinate vector, $G \in \mathbb{R}^{1 \times 2N}$ is the ground-truth face feature point coordinate vector, $N$ is the number of face feature points, $\omega$ and $\epsilon$ are preset parameters of $f(x)$, and $C = \omega - \omega \ln(1 + \omega/\epsilon)$ is a constant;
in step S4, extracting deep dark knowledge learned by the teacher network, including:
extracting pixel distribution information on the feature maps based on the feature alignment knowledge distillation method, and aligning the pixel distributions of the feature maps of the teacher network and the student network, wherein the feature alignment knowledge distillation loss function is as follows:

$$L_{FA} = \big\lVert A - \phi(B) \big\rVert_2^2$$

wherein $A$ and $B$ are the feature maps of the teacher network and the student network at the same stage, respectively, and $\phi(\cdot)$ is a 1×1 convolution layer used to align the channel dimensions of the two feature maps $A$ and $B$;
in step S4, transferring the deep dark knowledge to the student network includes:
extracting face structure information at different scales with a knowledge distillation method based on block similarity, and transferring the structured information of the face image from the teacher network to the student network.
2. The face feature point detection method based on depth knowledge migration of claim 1, wherein step S1 comprises:
s1-1: providing a WFLW data set, wherein the WFLW data set comprises N training pictures and M test pictures, each picture is provided with a picture tag comprising face frame information, face feature point position information and several kinds of attribute information, and N and M are positive integers greater than zero;
s1-2: cropping face images according to the face detection frames provided by the face data set, perturbing the face detection frames, and applying random rotation, size scaling and flipping to the face images for data augmentation, obtaining the training set, the verification set and the test set.
3. A face feature point detection device based on depth knowledge migration, characterized by comprising:
a providing module, configured to provide a face data set containing face feature point labels, and to crop face images according to the face detection frames or the bounding boxes of the face feature points provided by the face data set to obtain a training set, a verification set and a test set;
an output module;
a control processing module, configured to obtain training samples from the training set, obtain test samples from the test set, and input the test samples and the training samples into an initial face alignment network framework; the control processing module is further configured to set parameters of a convolutional neural network and train a teacher network and a student network in the initial face alignment network framework with PyTorch until the loss function and the maximum iteration number meet preset conditions, generating a training model; the control processing module is further configured to freeze the model parameters of the teacher network, extract the deep dark knowledge learned by the teacher network, transfer the deep dark knowledge to the student network, and supervise the training process of the student network to generate the final face alignment network model; the control processing module is further configured to input an RGB face image from a natural scene into the final face alignment network model and to output the face feature point detection result through the output module;
the initial face alignment network framework is generated by:
generating a teacher network by adopting an encoder-decoder network structure, wherein the decoder of the teacher network comprises three combinations of an up-sampling layer and a convolution layer; the encoder performs feature extraction and encoding on the input image, retaining the feature extraction part of the original network while removing the final average pooling layer, the fully connected layer used for classification and the final dimension-raising 1×1 convolution layer;
adding the decoder after the encoder, spatially up-sampling the image features extracted by the encoder to obtain feature maps, converting the channel dimension of the feature maps into the number of face feature points, and computing the expected face feature point coordinates on each transformed feature map with a spatial softargmax operation;
providing a student network with the EfficientFAN structure for the final face feature point detection, wherein the decoder of the student network likewise comprises three combinations of an up-sampling layer and a convolution layer; EfficientNet-B0 serves as the backbone of the student network encoder, with the final average pooling layer, the fully connected layer used for classification and the final dimension-raising 1×1 convolution layer of EfficientNet-B0 removed;
adding a 1×1 convolution layer after the student network decoder, converting the channel number of the feature map obtained by the decoder's up-sampling into the number of face feature points, and computing the face feature point coordinates on the converted feature map with the spatial softargmax operation;
training the teacher network and the student network separately, using the feature point loss function $L_P$ to optimize network parameters, wherein $L_P$ is computed with the Wing loss function, expressed as follows:

$$L_P = \sum_{i=1}^{2N} f\big(|P_i - G_i|\big), \qquad f(x) = \begin{cases} \omega \ln\!\left(1 + x/\epsilon\right), & x < \omega \\ x - C, & \text{otherwise} \end{cases}$$

wherein $P \in \mathbb{R}^{1 \times 2N}$ is the predicted face feature point coordinate vector, $G \in \mathbb{R}^{1 \times 2N}$ is the ground-truth face feature point coordinate vector, $N$ is the number of face feature points, $\omega$ and $\epsilon$ are preset parameters of $f(x)$, and $C = \omega - \omega \ln(1 + \omega/\epsilon)$ is a constant;
wherein extracting the deep dark knowledge learned by the teacher network includes:
extracting pixel distribution information on the feature maps based on the feature alignment knowledge distillation method, and aligning the pixel distributions of the feature maps of the teacher network and the student network, wherein the feature alignment knowledge distillation loss function is as follows:

$$L_{FA} = \big\lVert A - \phi(B) \big\rVert_2^2$$

wherein $A$ and $B$ are the feature maps of the teacher network and the student network at the same stage, respectively, and $\phi(\cdot)$ is a 1×1 convolution layer used to align the channel dimensions of the two feature maps $A$ and $B$;
and wherein transferring the deep dark knowledge to the student network includes:
extracting face structure information at different scales with a knowledge distillation method based on block similarity, and transferring the structured information of the face image from the teacher network to the student network.
4. An electronic device, the electronic device comprising: at least one processor and at least one memory;
the memory is used for storing one or more program instructions;
the processor is configured to execute one or more program instructions to perform the depth knowledge migration-based face feature point detection method according to any one of claims 1-2.
5. A computer readable storage medium, wherein the computer readable storage medium contains one or more program instructions for performing the depth knowledge migration-based face feature point detection method according to any one of claims 1-2.
CN202010809064.1A 2020-08-12 2020-08-12 Depth knowledge migration-based face feature point detection method, device and equipment Active CN112115783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010809064.1A CN112115783B (en) 2020-08-12 2020-08-12 Depth knowledge migration-based face feature point detection method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010809064.1A CN112115783B (en) 2020-08-12 2020-08-12 Depth knowledge migration-based face feature point detection method, device and equipment

Publications (2)

Publication Number Publication Date
CN112115783A CN112115783A (en) 2020-12-22
CN112115783B true CN112115783B (en) 2023-11-14

Family

ID=73805270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010809064.1A Active CN112115783B (en) 2020-08-12 2020-08-12 Depth knowledge migration-based face feature point detection method, device and equipment

Country Status (1)

Country Link
CN (1) CN112115783B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634441B (en) * 2020-12-28 2023-08-22 深圳市人工智能与机器人研究院 3D human body model generation method, system and related equipment
CN112767320A (en) * 2020-12-31 2021-05-07 平安科技(深圳)有限公司 Image detection method, image detection device, electronic equipment and storage medium
CN112633406A (en) * 2020-12-31 2021-04-09 天津大学 Knowledge distillation-based few-sample target detection method
CN112734632B (en) * 2021-01-05 2024-02-27 百果园技术(新加坡)有限公司 Image processing method, device, electronic equipment and readable storage medium
CN112418195B (en) * 2021-01-22 2021-04-09 电子科技大学中山学院 Face key point detection method and device, electronic equipment and storage medium
CN112819050B (en) * 2021-01-22 2023-10-27 北京市商汤科技开发有限公司 Knowledge distillation and image processing method, apparatus, electronic device and storage medium
CN113052144B (en) * 2021-04-30 2023-02-28 平安科技(深圳)有限公司 Training method, device and equipment of living human face detection model and storage medium
CN113343979B (en) * 2021-05-31 2022-11-08 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for training a model
CN113343898B (en) * 2021-06-25 2022-02-11 江苏大学 Mask shielding face recognition method, device and equipment based on knowledge distillation network
CN113470099B (en) * 2021-07-09 2022-03-25 北京的卢深视科技有限公司 Depth imaging method, electronic device and storage medium
CN113628635B (en) * 2021-07-19 2023-09-15 武汉理工大学 Voice-driven speaker face video generation method based on teacher student network
CN113487614B (en) * 2021-09-08 2021-11-30 四川大学 Training method and device for fetus ultrasonic standard section image recognition network model
CN113947801B (en) * 2021-12-21 2022-07-26 中科视语(北京)科技有限公司 Face recognition method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108363962A (en) * 2018-01-25 2018-08-03 南京邮电大学 A kind of method for detecting human face and system based on multi-level features deep learning
WO2019128646A1 (en) * 2017-12-28 2019-07-04 深圳励飞科技有限公司 Face detection method, method and device for training parameters of convolutional neural network, and medium
CN110414400A (en) * 2019-07-22 2019-11-05 中国电建集团成都勘测设计研究院有限公司 A kind of construction site safety cap wearing automatic testing method and system
CN110674714A (en) * 2019-09-13 2020-01-10 东南大学 Human face and human face key point joint detection method based on transfer learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019128646A1 (en) * 2017-12-28 2019-07-04 深圳励飞科技有限公司 Face detection method, method and device for training parameters of convolutional neural network, and medium
CN108363962A (en) * 2018-01-25 2018-08-03 南京邮电大学 A kind of method for detecting human face and system based on multi-level features deep learning
CN110414400A (en) * 2019-07-22 2019-11-05 中国电建集团成都勘测设计研究院有限公司 A kind of construction site safety cap wearing automatic testing method and system
CN110674714A (en) * 2019-09-13 2020-01-10 东南大学 Human face and human face key point joint detection method based on transfer learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Face recognition based on deep convolutional neural network and center loss; 张延安, 王宏玉, 徐方; Science Technology and Engineering, No. 35; full text *
Facial expression recognition based on transfer convolutional neural network; 刘伦豪杰, 王晨辉, 卢慧, 王家豪; Computer Knowledge and Technology, No. 07; full text *

Also Published As

Publication number Publication date
CN112115783A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
CN112115783B (en) Depth knowledge migration-based face feature point detection method, device and equipment
CN112287978A (en) Hyperspectral remote sensing image classification method based on self-attention context network
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN110569738A (en) natural scene text detection method, equipment and medium based on dense connection network
CN111582044A (en) Face recognition method based on convolutional neural network and attention model
CN115496928B (en) Multi-modal image feature matching method based on multi-feature matching
CN112001931A (en) Image segmentation method, device, equipment and storage medium
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN115424059B (en) Remote sensing land utilization classification method based on pixel level contrast learning
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN113807214B (en) Small target face recognition method based on deit affiliated network knowledge distillation
Wu et al. STR transformer: a cross-domain transformer for scene text recognition
Yang et al. Robust visual tracking using adaptive local appearance model for smart transportation
CN117115880A (en) Lightweight face key point detection method based on heavy parameterization
CN114612961B (en) Multi-source cross-domain expression recognition method and device and storage medium
CN115953625A (en) Vehicle detection method based on characteristic diagram double-axis Transformer module
CN113807218B (en) Layout analysis method, device, computer equipment and storage medium
CN114663751A (en) Power transmission line defect identification method and system based on incremental learning technology
CN109871835B (en) Face recognition method based on mutual exclusion regularization technology
CN113706450A (en) Image registration method, device, equipment and readable storage medium
CN111274893A (en) Aircraft image fine-grained identification method based on component segmentation and feature fusion
Yian et al. Improved deeplabv3+ network segmentation method for urban road scenes
CN117058437B (en) Flower classification method, system, equipment and medium based on knowledge distillation
Wang et al. Image Semantic Segmentation Algorithm Based on Self-learning Super-Pixel Feature Extraction
CN115359304B (en) Single image feature grouping-oriented causal invariance learning method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant