CN111860214A - Face detection method, training method and device of model thereof and electronic equipment


Info

Publication number
CN111860214A
Authority
CN
China
Prior art keywords
feature map
face
network
inputting
outputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010611503.8A
Other languages
Chinese (zh)
Inventor
路洪运
许健
田波
肖鑫
窦宏辰
韩博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd filed Critical Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202010611503.8A
Publication of CN111860214A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the disclosure provide a face detection method, and a training method, apparatus, and device for a model thereof. The training method comprises the following steps: acquiring a training sample and a preset initial model, wherein the training sample comprises a face image and real face region position information of the face image, and the initial model comprises a basic network, an enhancement network and a prediction network; inputting the face image into the basic network of the initial model, and outputting a first feature map; inputting the first feature map into the enhancement network, and outputting an enhanced feature map; inputting the enhanced feature map into the prediction network to obtain predicted face region position information of the face image; and substituting the predicted face region position information of the face image and the real face region position information of the face image into a preset loss function for calculation, and updating the initial model according to the calculated loss value to obtain a face detection model. According to the embodiments of the disclosure, the accuracy of face detection can be improved.

Description

Face detection method, training method and device of model thereof and electronic equipment
Technical Field
The present disclosure relates to the field of face detection technologies, and more particularly, to a training method for a face detection model, a face detection method, a training apparatus for a face detection model, a face detection apparatus, an electronic device, and a computer-readable storage medium.
Background
With the growth of computing power, and especially with the rapid development of the internet and 5G technology, face recognition technology has been applied ever more widely and has become an important technical means for face payment, image auditing, and personnel retrieval. Face detection provides the recognition area for face recognition and separates the background from the foreground, so the positioning accuracy of face detection greatly influences the accuracy of face recognition.
Therefore, it is necessary to provide an accurate face detection method to improve the accuracy of face recognition.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a new technical solution for face detection.
According to a first aspect of the present disclosure, there is provided a training method of a face detection model, the method including:
acquiring a training sample and a preset initial model; the training sample comprises a face image and real face region position information of the face image; the initial model comprises a basic network, an enhanced network and a prediction network;
inputting the face image into a basic network of the initial model, and outputting a first feature map;
inputting the first feature map into the enhancement network, and outputting an enhancement feature map;
inputting the enhanced feature map into the prediction network to obtain the position information of the predicted face region of the face image;
and substituting the real face region position information of the face image and the predicted face region position information of the face image into a preset loss function for calculation, and updating the initial model according to the calculated loss value to obtain the face detection model.
Optionally, wherein the enhanced network comprises: a first convolution layer, a variable convolution layer and an up-sampling convolution layer which are connected in sequence;
the inputting the first feature map into the enhancement network and outputting an enhancement feature map comprises:
inputting the first feature map into the first convolution layer to obtain a first intermediate feature map;
inputting the first intermediate feature map into the variable convolution layer to obtain a second intermediate feature map;
and inputting the second intermediate feature map into the up-sampling convolution layer, and outputting the enhanced feature map.
Optionally, wherein the base network comprises: a shallow network and a deep network;
the inputting the face image into the basic network of the initial model and outputting a first feature map comprises:
inputting the face image into the shallow network, and outputting a shallow feature map;
inputting the shallow feature map into the deep network, and outputting the first feature map.
Optionally, wherein the shallow network comprises a second convolutional layer and a third convolutional layer; the deep network includes a fourth convolutional layer, a fifth convolutional layer and a sixth convolutional layer;
inputting the face image into the shallow network and outputting a shallow feature map; inputting the shallow feature map into the deep network, and outputting the first feature map, including:
inputting the face image into the second convolution layer and outputting a second convolution layer feature map;
inputting the second convolutional layer feature map into the third convolutional layer, and outputting the shallow layer feature map;
inputting the shallow feature map into the fourth convolutional layer, and outputting a fourth convolutional layer feature map;
inputting the fourth convolutional layer feature map into the fifth convolutional layer and outputting a fifth convolutional layer feature map;
inputting the fifth convolutional layer feature map into the sixth convolutional layer, and outputting the first feature map.
Optionally, wherein the enhanced network comprises: a first enhancement sub-network, a second enhancement sub-network, and a third enhancement sub-network;
the inputting the first feature map into the enhancement network and outputting an enhancement feature map comprises:
inputting the first feature map into the first enhancement sub-network, and outputting the enhanced first feature map;
fusing the fifth convolutional layer feature map and the enhanced first feature map to obtain a first fused feature map;
inputting the first fused feature map into the second enhancement sub-network, and outputting an enhanced first fused feature map;
fusing the fourth convolution layer feature map and the enhanced first fusion feature map to obtain a second fusion feature map;
and inputting the second fused feature map into the third enhancement sub-network, and outputting the enhanced feature map.
Optionally, the real face region position information of the face image includes real face region information of the face image and real face position information of the face image;
the predicted face region position information of the face image comprises the predicted face region information of the face image and the predicted face position information of the face image;
the prediction network includes: a thermodynamic map prediction sub-network and a regional position prediction sub-network;
the inputting the enhanced feature map into the prediction network to obtain the position information of the predicted face region of the face image includes:
inputting the enhanced feature map into the thermodynamic diagram prediction sub-network, and outputting predicted face position information of the face image;
and inputting the enhanced feature map into the regional position prediction sub-network, and outputting the predicted face region information of the face image.
Optionally, the substituting the predicted face region position information of the face image and the real face region position information of the face image into a preset loss function to calculate, and updating the initial model according to a loss value obtained by calculation to obtain the face detection model includes:
substituting the predicted face region information of the face image and the corresponding real face region information in the real face region position information of the face image into a first loss function, substituting the predicted face position information of the face image and the corresponding real face position information in the real face region position information of the face image into a second loss function for calculation, and updating the initial model according to the calculated loss value to obtain the face detection model.
According to a second aspect of the present disclosure, there is provided a face detection method, including:
acquiring an image to be detected, wherein the image to be detected comprises at least one face;
inputting the image to be detected into a face detection model, and outputting the region position information of the detected face;
the face detection model comprises a basic network, an enhanced network and a prediction network;
the basic network is used for extracting the characteristics of the image to be detected and outputting a first characteristic diagram;
the enhancement network is used for outputting an enhancement feature map according to the first feature map;
and the prediction network is used for obtaining the region position information of the face in the image to be detected according to the enhanced feature map.
According to a third aspect of the present disclosure, there is provided a training apparatus for a face detection model, the apparatus including:
the acquisition module is used for acquiring a training sample and a preset initial model; the training sample comprises a face image and real face region position information of the face image; the initial model comprises a basic network, an enhanced network and a prediction network;
the feature extraction module is used for inputting the face image into a basic network of the initial model and outputting a first feature map;
the enhancement module is used for inputting the first feature map into the enhancement network and outputting an enhanced feature map;
the prediction module is used for inputting the enhanced feature map into the prediction network to obtain the position information of the predicted face region of the face image;
and the updating module is used for substituting the predicted face region position information of the face image and the real face region position information of the face image into a preset loss function for calculation, and updating the initial model according to the calculated loss value to obtain the face detection model.
According to a fourth aspect of the present disclosure, there is provided a face detection apparatus comprising:
the acquisition module is used for acquiring an image to be detected, wherein the image to be detected comprises at least one face;
the output module is used for inputting the image to be detected into a face detection model and outputting the region position information of the detected face;
the face detection model comprises a basic network, an enhanced network and a prediction network;
the basic network is used for extracting the characteristics of the image to be detected and outputting a first characteristic diagram;
the enhancement network is used for outputting an enhancement feature map according to the first feature map;
and the prediction network is used for obtaining the region position information of the face in the image to be detected according to the enhanced feature map.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising a processor and a memory; the memory stores machine executable instructions executable by the processor; the processor executes the machine executable instructions to implement the method of training a face detection model according to any one of the first aspect of the present disclosure.
According to a sixth aspect of the present disclosure, there is provided an electronic device comprising a processor and a memory; the memory stores machine executable instructions executable by the processor; the processor executes the machine executable instructions to implement the face detection method of the second aspect of the present disclosure.
According to one embodiment of the disclosure, a training sample and a preset initial model are obtained; the training sample comprises a face image and real face region position information of the face image; the initial model comprises a basic network, an enhancement network and a prediction network; the face image is input into the basic network of the initial model to output a first feature map; the first feature map is input into the enhancement network to output an enhanced feature map; the enhanced feature map is input into the prediction network to obtain the predicted face region position information of the face image; and the predicted face region position information of the face image and the real face region position information of the face image are substituted into a preset loss function for calculation, and the initial model is updated according to the calculated loss value to obtain the face detection model. According to this embodiment of the disclosure, anchor points do not need to be set manually, which avoids the influence of human factors on detection precision, and the fusion enhancement of features greatly improves the accuracy of face detection.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic diagram of a composition structure of an electronic device to which a training method of a face detection model according to an embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of a training method of a face detection model according to an embodiment of the present disclosure;
FIG. 3 shows a schematic structural diagram of a face detection model according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a training apparatus for a face detection model according to an embodiment of the present disclosure;
fig. 5 shows a schematic structural diagram of an electronic device of a first embodiment of the present disclosure;
fig. 6 shows a schematic flow chart of a face detection method according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a face detection apparatus according to an embodiment of the present disclosure;
fig. 8 shows a schematic structural diagram of an electronic device according to a second embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
< hardware configuration >
Fig. 1 is a schematic diagram of a composition structure of an electronic device to which a training method of a face detection model according to an embodiment of the present disclosure may be applied.
As shown in fig. 1, the electronic device 1000 of the present embodiment may include a processor 1010, a memory 1020, an interface device 1030, a communication device 1040, a display device 1050, an input device 1060, a speaker 1070, a microphone 1080, and the like.
The processor 1010 may be a central processing unit CPU, a microprocessor MCU, or the like. The memory 1020 includes, for example, a ROM (read only memory), a RAM (random access memory), a nonvolatile memory such as a hard disk, and the like. The interface device 1030 includes, for example, a USB interface, a headphone interface, and the like. The communication device 1040 can perform wired or wireless communication, for example. The display device 1050 is, for example, a liquid crystal display panel, a touch panel, or the like. The input device 1060 may include, for example, a touch screen, a keyboard, and the like.
The electronic device 1000 may output audio information through the speaker 1070. The electronic device 1000 can pick up voice information input by a user through the microphone 1080.
The electronic device 1000 may be a smart phone, a laptop, a desktop computer, a tablet computer, or the like.
In this embodiment, the electronic device 1000 may obtain a training sample and a preset initial model; the training sample comprises a face image and real face region position information of the face image; the initial model comprises a basic network, an enhanced network and a prediction network; inputting the face image into a basic network of an initial model, and outputting a first feature map; inputting the first feature map into an enhancement network, and outputting an enhancement feature map; inputting the enhanced feature map into a prediction network to obtain the position information of a predicted face region of the face image; and substituting the predicted face region position information of the face image and the real face region position information of the face image into a preset loss function for calculation, and updating the initial model according to the calculated loss value to obtain the face detection model.
In this embodiment, the memory 1020 of the electronic device 1000 is configured to store instructions for controlling the processor 1010 to operate so as to support a training method for implementing a face detection model according to any embodiment of the present disclosure.
It should be understood by those skilled in the art that, although fig. 1 illustrates a plurality of components of the electronic device 1000, the electronic device 1000 of the embodiments of the present disclosure may involve only some of them, for example, the processor 1010, the memory 1020, the display device 1050, and the input device 1060.
Those skilled in the art can design the instructions according to the solutions disclosed herein. How instructions control the operation of a processor is well known in the art and is not described in detail here.
< first embodiment >
< method >
The present embodiment provides a training method for a face detection model, which may be executed by the electronic device 1000 shown in fig. 1, for example.
As shown in fig. 2, the method includes the following steps 2100 to 2500:
Step 2100, acquiring a training sample and a preset initial model; the training sample comprises a face image and real face region position information of the face image; the initial model includes a basic network, an enhancement network, and a prediction network.
The real face region position information of the face image may specifically include real face position information of the face image and real face region information of the face image. The real face position information of the face image is, for example, the coordinates of the center point of the face, and the real face region information of the face image is, for example, a circular Gaussian distribution region centered on those coordinates.
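As an illustration of this kind of label, the following sketch rasterizes one face center as a circular Gaussian region on a target heatmap, in the style of center-point detectors; the map size, the radius, and the radius-to-sigma heuristic are assumptions of this sketch, not values fixed by the embodiment.

```python
import numpy as np

def gaussian_heatmap(map_h, map_w, center, radius):
    # Rasterize one face as a circular Gaussian centered on (cx, cy).
    # The embodiment only states that the real face region is a circular
    # Gaussian distribution around the face center point; the radius rule
    # used here is an illustrative assumption.
    cx, cy = center
    sigma = radius / 3.0  # common heuristic: the radius spans about 3 sigma
    ys, xs = np.ogrid[:map_h, :map_w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    g = g.astype(np.float32)
    g[g < np.finfo(np.float32).eps] = 0.0  # clip the far tail of the Gaussian
    return g

# Example: a 128 x 128 target map with one face centered at (64, 40).
target = gaussian_heatmap(128, 128, center=(64, 40), radius=12)
```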
Step 2200, inputting the face image to the basic network of the initial model, and outputting a first feature map.
Wherein the base network comprises a shallow network and a deep network. In this step, the electronic device 1000 may specifically input the face image into the shallow network and output a shallow feature map, and then input the shallow feature map into the deep network and output the first feature map.
The basic network is used for identifying the target features in the face images of the training samples. In practical applications, the basic network may be a resnet18 network, a resnet34 network, a resnet50 network, a resnet101 network, a mobilenet network, an Inception network, or the like.
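For instance, shallow and deep feature maps can be pulled out of a standard torchvision resnet18 as follows; the choice of which resnet stages play the roles of the shallow and deep networks is an assumption of this sketch, since the embodiment does not fix that correspondence.

```python
import torch
from torchvision.models import resnet18
from torchvision.models.feature_extraction import create_feature_extractor

# Assumed mapping: resnet18 stage "layer2" as the shallow network output
# and "layer4" as the deep network output (the first feature map).
backbone = create_feature_extractor(
    resnet18(weights=None),
    return_nodes={"layer2": "shallow", "layer4": "deep"},
)

feats = backbone(torch.randn(1, 3, 512, 512))
print(feats["shallow"].shape)  # torch.Size([1, 128, 64, 64])
print(feats["deep"].shape)     # torch.Size([1, 512, 16, 16])
```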
Step 2300, inputting the first feature map into the enhancement network, and outputting an enhancement feature map.
The enhancement network is used for enhancing the features in the first feature map so that the features can better represent the image or the target in the image.
The enhancement network may include a first convolution layer, a variable convolution layer, and an up-sampling convolution layer connected in sequence. The process of the electronic device 1000 inputting the first feature map into the enhancement network includes: inputting the first feature map into the first convolution layer to obtain a first intermediate feature map; inputting the first intermediate feature map into the variable convolution layer to obtain a second intermediate feature map; and inputting the second intermediate feature map into the up-sampling convolution layer and outputting the enhanced feature map.
Illustratively, the first convolution layer may be a 1 × 1 convolution layer, which is used to enhance the expression of nonlinear features. The variable convolution layer is used to enlarge the receptive field, avoiding the situation in which a large feature output scale shrinks the receptive field so that the context information of a large target cannot be acquired. The up-sampling convolution layer enlarges the deep feature size to match the shallow feature size, so that features of the two different scales, deep and shallow, can be fused by addition.
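A minimal PyTorch sketch of one such enhancement sub-network follows. Reading the "variable convolution layer" as a deformable convolution (torchvision's DeformConv2d) is an assumption, as are the 3 × 3 kernel, the channel widths, and the transposed convolution used for up-sampling; the batch normalization before fusion reflects the note later in this section.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class EnhanceBlock(nn.Module):
    # 1x1 conv -> deformable ("variable") conv -> up-sampling conv,
    # connected in sequence as described above. All sizes are assumptions.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # nonlinear feature mixing
        # A deformable conv needs per-position sampling offsets: 2 per tap of a 3x3 kernel.
        self.offset = nn.Conv2d(out_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(out_ch, out_ch, kernel_size=3, padding=1)
        # Transposed conv doubles the spatial size so the deep feature can be
        # added to the next (shallower, larger) feature map.
        self.up = nn.ConvTranspose2d(out_ch, out_ch, kernel_size=2, stride=2)
        self.bn = nn.BatchNorm2d(out_ch)  # normalize before fusion (see the batchnorm note below)

    def forward(self, x):
        x = torch.relu(self.reduce(x))
        x = torch.relu(self.deform(x, self.offset(x)))
        return self.bn(self.up(x))
```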
Step 2400, inputting the enhanced feature map into the prediction network to obtain the position information of the predicted face region of the face image.
The prediction network is used for predicting the predicted face region position information of the face image of the training sample, where the predicted face region position information of the face image comprises the predicted face region information of the face image and the predicted face position information of the face image. The prediction network may include a thermodynamic diagram (heatmap) prediction sub-network and a regional position prediction sub-network. In this step, the electronic device 1000 may specifically input the enhanced feature map into the thermodynamic diagram prediction sub-network and output the predicted face position information of the face image, and input the enhanced feature map into the regional position prediction sub-network and output the predicted face region information of the face image.
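The two prediction heads might look like the following sketch; the head depth, the 3 × 3 kernels, and the sigmoid activations are assumptions (the sigmoid on the width/height head matches the statement below that the width/height output is a proportional value between 0 and 1).

```python
import torch.nn as nn

class PredictionNet(nn.Module):
    # Thermodynamic-diagram (heatmap) sub-network plus regional position sub-network.
    def __init__(self, in_ch):
        super().__init__()
        self.heat_head = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 1, 1), nn.Sigmoid())  # per-cell face-center probability
        self.wh_head = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 2, 1), nn.Sigmoid())  # normalized (w, h) in [0, 1]

    def forward(self, feat):
        return self.heat_head(feat), self.wh_head(feat)
```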
Step 2500, substituting the position information of the predicted face region of the face image and the position information of the real face region of the face image into a preset loss function for calculation, and updating the initial model according to the calculated loss value to obtain the face detection model.
In this step, the electronic device 1000 may substitute the predicted face region information of the face image and the real face region information of the face image into a first loss function, substitute the predicted face position information of the face image and the real face position information of the face image into a second loss function, and update the initial model according to the loss values obtained by calculation, so as to obtain the face detection model. Illustratively, the first loss function may be a smooth L1 loss function, and the second loss function may be a focal loss function.
For example, the gradients of the parameters to be converged in the initial model may be calculated based on the loss value and a back-propagation algorithm, and the parameters may then be updated based on the gradients and a gradient descent algorithm. In an actual training process, the electronic device 1000 updates the parameters to be converged in the initial model multiple times based on multiple training samples until convergence, so as to obtain a converged model, i.e., the face detection model. It is understood that, in practical applications, a suitable algorithm may be selected as needed to update the initial model; this embodiment is not particularly limited in this respect.
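A sketch of the combined objective is given below: a focal loss on the classification heatmap (the second loss function) plus a smooth L1 loss on the width/height map at ground-truth centers (the first loss function). The penalty-reduced focal form with exponents 2 and 4 follows CenterNet-style detectors and, like the equal weighting of the two terms, is an assumption.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_heat, true_heat, pred_wh, true_wh, center_mask):
    # pred_heat/true_heat: (N, 1, H, W); pred_wh/true_wh: (N, 2, H, W);
    # center_mask: (N, 1, H, W), 1 at ground-truth face centers, else 0.
    eps = 1e-6
    pred_heat = pred_heat.clamp(eps, 1.0 - eps)
    pos = (true_heat == 1.0).float()
    neg = 1.0 - pos
    # Penalty-reduced focal loss on the heatmap (assumed exponents 2 and 4).
    focal = -(pos * (1 - pred_heat) ** 2 * torch.log(pred_heat)
              + neg * (1 - true_heat) ** 4 * pred_heat ** 2 * torch.log(1 - pred_heat))
    focal = focal.sum() / pos.sum().clamp(min=1.0)
    # Smooth L1 on width/height, evaluated only at ground-truth centers.
    wh = F.smooth_l1_loss(pred_wh * center_mask, true_wh * center_mask, reduction="sum")
    wh = wh / center_mask.sum().clamp(min=1.0)
    return focal + wh  # equal weighting is an assumption

# One update step, as described above:
#   loss = detection_loss(...); loss.backward(); optimizer.step(); optimizer.zero_grad()
```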
In one example, as shown in fig. 3, the basic network may be a resnet18 network, which offers a relative balance between speed and accuracy. The shallow network comprises a second convolutional layer and a third convolutional layer; the deep network includes a fourth convolutional layer, a fifth convolutional layer and a sixth convolutional layer. The enhancement network may include: a first enhancement sub-network, a second enhancement sub-network, and a third enhancement sub-network.
The electronic device 1000 inputs the face image into the second convolutional layer and outputs a second convolutional layer feature map; inputs the second convolutional layer feature map into the third convolutional layer and outputs the shallow feature map; inputs the shallow feature map into the fourth convolutional layer and outputs a fourth convolutional layer feature map; inputs the fourth convolutional layer feature map into the fifth convolutional layer and outputs a fifth convolutional layer feature map; and inputs the fifth convolutional layer feature map into the sixth convolutional layer and outputs the first feature map.
As can be seen from fig. 3, the electronic device 1000 inputs the first feature map output from the sixth convolutional layer into the first enhancement sub-network to obtain an enhanced first feature map; then fuses the fifth convolutional layer feature map and the enhanced first feature map to obtain a first fused feature map; inputs the first fused feature map into the second enhancement sub-network and outputs an enhanced first fused feature map; fuses the fourth convolutional layer feature map and the enhanced first fused feature map to obtain a second fused feature map; and inputs the second fused feature map into the third enhancement sub-network and outputs the enhanced feature map.
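Using the EnhanceBlock sketched earlier, the top-down fusion of fig. 3 might be composed as follows; the channel widths assume the resnet18 stages above (128/256/512 channels), and the 1 × 1 lateral convolutions that align channel counts before the element-wise addition are an implementation assumption not spelled out in the text.

```python
import torch.nn as nn

class EnhancementNet(nn.Module):
    # Three enhancement sub-networks with two fusion (addition) points,
    # following fig. 3. EnhanceBlock is the sketch given earlier.
    def __init__(self, ch=128):
        super().__init__()
        self.enhance1 = EnhanceBlock(512, ch)  # on the first feature map (sixth conv layer)
        self.enhance2 = EnhanceBlock(ch, ch)   # on the first fused feature map
        self.enhance3 = EnhanceBlock(ch, ch)   # on the second fused feature map
        # Assumed 1x1 lateral convs so the addition operands have equal channels.
        self.lat5 = nn.Conv2d(256, ch, 1)      # fifth convolutional layer feature map
        self.lat4 = nn.Conv2d(128, ch, 1)      # fourth convolutional layer feature map

    def forward(self, c4, c5, c6):
        p6 = self.enhance1(c6)       # enhanced first feature map, up-sampled 2x
        f1 = self.lat5(c5) + p6      # first fused feature map
        p5 = self.enhance2(f1)
        f2 = self.lat4(c4) + p5      # second fused feature map
        return self.enhance3(f2)     # enhanced feature map for the prediction network
```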
Wherein each of the enhancement sub-networks comprises a first convolution layer, a variable convolution layer, and an up-sampling convolution layer connected in sequence. The process of the electronic device 1000 inputting the first feature map, the first fused feature map, or the second fused feature map into the corresponding enhancement sub-network includes: inputting the feature map into the first convolution layer of the corresponding enhancement sub-network to obtain a first intermediate feature map; inputting the first intermediate feature map into the variable convolution layer to obtain a second intermediate feature map; and inputting the second intermediate feature map into the up-sampling convolution layer and outputting the corresponding enhanced feature map.
Illustratively, the first convolution layer may be a 1 × 1 convolution layer, which is used to enhance the expression of nonlinear features. The variable convolution layer is used to enlarge the receptive field, avoiding the situation in which a large feature output scale shrinks the receptive field so that the context information of a large target cannot be acquired. The up-sampling convolution layer enlarges the deep feature size to match the shallow feature size, so that features of the deep and shallow scales can be fused by addition.
It should be noted that, to avoid deep features being discarded during fusion because feature values from different stages differ too greatly, in this embodiment a batch normalization function (batchnorm) is used to subtract the mean from, and divide by the variance of, the features of each convolution layer, thereby normalizing the output features.
The thermodynamic diagram prediction sub-network applies a max-pooling algorithm to the input enhanced feature map to obtain a classification thermodynamic diagram and to determine the positions of the feature maxima; these positions serve as the predicted face position information of the face image. The higher the value of a point in the classification thermodynamic diagram, the more likely that point is a face center; whether a face is present at the point is finally determined according to this value.
The regional position prediction sub-network processes the input enhanced feature map, for example using a target-box regression algorithm, to obtain a target-box regression feature map. In this embodiment, the classification thermodynamic diagram and the target-box regression feature map have the same dimensions. The value in the target-box regression feature map at the same position as a feature maximum in the classification thermodynamic diagram is selected as the width and height information. Since this width and height information is a proportional value, equivalent to a normalized value between 0 and 1, decoding the width and height consists of multiplying the proportional values by the width and height of the original image to obtain the actual face width and height, which is output as the predicted face region information of the face image.
Finally, the electronic device 1000 determines the predicted face position information of the face image and the predicted face region information of the face image as the predicted face region position information of the face image.
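The decoding just described can be sketched as follows; the 3 × 3 max-pooling peak test, the score threshold, and the output stride of 4 are assumed values, while the multiplication of the proportional width/height by the original image size follows the text above.

```python
import torch
import torch.nn.functional as F

def decode(heatmap, wh_map, img_w, img_h, stride=4, thresh=0.3):
    # heatmap: (1, 1, H, W) classification thermodynamic diagram;
    # wh_map: (1, 2, H, W) target-box regression feature map with normalized (w, h).
    heat = heatmap[0, 0]
    # A cell is a face center if it is a local maximum (3x3 max-pool test)
    # and its score exceeds the threshold.
    peaks = (heat == F.max_pool2d(heatmap, 3, 1, 1)[0, 0]) & (heat > thresh)
    ys, xs = torch.nonzero(peaks, as_tuple=True)
    boxes = []
    for y, x in zip(ys.tolist(), xs.tolist()):
        w = wh_map[0, 0, y, x].item() * img_w  # proportional width -> pixels
        h = wh_map[0, 1, y, x].item() * img_h  # proportional height -> pixels
        cx, cy = x * stride, y * stride        # center back in image coordinates
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, heat[y, x].item()))
    return boxes
```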
The training of the face detection model of the present embodiment has been described above with reference to the drawings and examples. In this method, a training sample and a preset initial model are obtained; the training sample comprises a face image and real face region position information of the face image; the initial model comprises a basic network, an enhancement network and a prediction network; the face image is input into the basic network of the initial model to output a first feature map; the first feature map is input into the enhancement network to output an enhanced feature map; the enhanced feature map is input into the prediction network to obtain the predicted face region position information of the face image; and the predicted face region position information of the face image and the real face region position information of the face image are substituted into a preset loss function for calculation, and the initial model is updated according to the calculated loss value to obtain the face detection model. According to this embodiment of the disclosure, anchor points do not need to be set manually, which avoids the influence of human factors on detection precision, and the fusion enhancement of features greatly improves the accuracy of face detection.
< apparatus >
The present embodiment provides a training apparatus for a face detection model, for example, the training apparatus 4000 for a face detection model shown in fig. 4.
As shown in fig. 4, the training apparatus 4000 for face detection model may include an obtaining module 4100, a feature extraction module 4200, an enhancement module 4300, a prediction module 4400, and an update module 4500.
The acquiring module 4100 is configured to acquire a training sample and a preset initial model; the training sample comprises a face image and real face region position information of the face image; the initial model includes a basic network, an enhancement network, and a prediction network.
The feature extraction module 4200 is configured to input the facial image to a basic network of the initial model, and output a first feature map.
The enhancement module 4300 is configured to input the first feature map into the enhancement network and output an enhanced feature map.
The prediction module 4400 is configured to input the enhanced feature map into the prediction network, so as to obtain position information of a predicted face region of the face image.
The updating module 4500 is configured to substitute the predicted face region position information of the face image and the real face region position information of the face image into a preset loss function for calculation, and update the initial model according to the calculated loss value, so as to obtain the face detection model.
In one example, the enhancement network includes: a first convolution layer, a variable convolution layer, and an up-sampling convolution layer connected in sequence. The enhancement module 4300 is configured to: input the first feature map into the first convolution layer to obtain a first intermediate feature map; input the first intermediate feature map into the variable convolution layer to obtain a second intermediate feature map; and input the second intermediate feature map into the up-sampling convolution layer and output the enhanced feature map.
In one example, the base network includes a shallow network and a deep network. The feature extraction module 4200 is specifically configured to: input the face image into the shallow network and output a shallow feature map; and input the shallow feature map into the deep network and output the first feature map.
The shallow network may include a second convolutional layer and a third convolutional layer, and the deep network includes a fourth convolutional layer, a fifth convolutional layer, and a sixth convolutional layer. Accordingly, the feature extraction module 4200 may be specifically configured to: input the face image into the second convolutional layer and output a second convolutional layer feature map; input the second convolutional layer feature map into the third convolutional layer and output the shallow feature map; input the shallow feature map into the fourth convolutional layer and output a fourth convolutional layer feature map; input the fourth convolutional layer feature map into the fifth convolutional layer and output a fifth convolutional layer feature map; and input the fifth convolutional layer feature map into the sixth convolutional layer and output the first feature map.
In one example, the enhancement network includes: a first enhancement sub-network, a second enhancement sub-network, and a third enhancement sub-network. The enhancement module 4300 may be specifically configured to: input the first feature map into the first enhancement sub-network and output an enhanced first feature map; fuse the fifth convolutional layer feature map and the enhanced first feature map to obtain a first fused feature map; input the first fused feature map into the second enhancement sub-network and output an enhanced first fused feature map; fuse the fourth convolutional layer feature map and the enhanced first fused feature map to obtain a second fused feature map; and input the second fused feature map into the third enhancement sub-network and output the enhanced feature map.
In one example, the real face region position information of the face image includes real face region information of the face image and real face position information of the face image; the predicted face region position information of the face image includes predicted face region information of the face image and predicted face position information of the face image.
The prediction network may include: a thermodynamic diagram prediction sub-network and a regional position prediction sub-network. The prediction module 4400 may be specifically configured to: input the enhanced feature map into the thermodynamic diagram prediction sub-network and output the predicted face position information of the face image; and input the enhanced feature map into the regional position prediction sub-network and output the predicted face region information of the face image.
In one example, the update module 4500 may be specifically configured to: substitute the predicted face region information of the face image and the corresponding real face region information of the face image into a first loss function, substitute the predicted face position information of the face image and the corresponding real face position information of the face image into a second loss function for calculation, and update the initial model according to the calculated loss values to obtain the face detection model.
The training device of the face detection model of this embodiment may be used to implement the method technical solution of this embodiment, and its implementation principle and technical effect are similar, which are not described herein again.
< apparatus >
In this embodiment, an electronic device is further provided, where the electronic device includes a training apparatus 4000 for a face detection model described in the embodiments of the present disclosure; alternatively, the electronic device is the electronic device 5000 shown in fig. 5, which includes a processor 5200 and a memory 5100:
the memory 5100 stores machine-executable instructions executable by the processor; the processor 5200 executes the machine-executable instructions to implement the training method of a face detection model according to any embodiment of the present disclosure.
< computer-readable storage Medium embodiment >
The present embodiments provide a computer-readable storage medium having stored therein an executable command that, when executed by a processor, performs a method described in any of the method embodiments of the present disclosure.
< second embodiment >
< method >
This embodiment provides a face detection method that applies the face detection model trained in the first embodiment to detect faces in an image to be detected.
Specifically, as shown in fig. 6, the method includes the following steps 6100 to 6200:
step 6100, acquiring an image to be detected, wherein the image to be detected includes at least one human face.
Step 6200, inputting the image to be detected into a face detection model, and outputting the region position information of the detected face.
The face detection model comprises a basic network, an enhanced network and a prediction network; the basic network is used for extracting the characteristics of the image to be detected and outputting a first characteristic diagram; the enhancement network is used for outputting an enhancement feature map according to the first feature map; the prediction network is used for obtaining the regional position information of the face in the image to be detected according to the enhanced feature map.
After the input image to be detected passes through the face detection model for feature extraction, feature enhancement, and feature prediction in sequence, the region position information of all faces in the image to be detected is output. In practical applications, while the region position information of a face is displayed, the face can also be marked on the image to be detected in the form of a target box, as in the sketch below.
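As a hypothetical end-to-end usage of the sketches from the first embodiment (the composed `model` and the `decode` helper are the assumed names from those sketches, not names given in the text):

```python
import torch

model.eval()  # model is assumed to chain the basic, enhancement and prediction networks
image_tensor = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed image to be detected
with torch.no_grad():
    heatmap, wh_map = model(image_tensor)   # feature extraction -> enhancement -> prediction
for x1, y1, x2, y2, score in decode(heatmap, wh_map, img_w=512, img_h=512):
    print(f"face target box ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}), score {score:.2f}")
```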
According to the face detection method, faces in the image to be detected are detected by a pre-trained face detection model, so anchor points do not need to be set manually and the influence of human factors on detection precision is avoided; at the same time, feature fusion enhancement greatly improves the accuracy of face detection.
< apparatus >
The present embodiment provides a face detection apparatus, which is, for example, a face detection apparatus 7000 shown in fig. 7.
As shown in fig. 7, the face detection apparatus 7000 may include: an acquisition module 7100 and an output module 7200.
The obtaining module 7100 is used for obtaining an image to be detected, wherein the image to be detected comprises at least one face.
The output module 7200 is configured to input the image to be detected to the face detection model, and output the region position information of the detected face.
The face detection model comprises a basic network, an enhanced network and a prediction network; the basic network is used for extracting the characteristics of the image to be detected and outputting a first characteristic diagram; the enhancement network is used for outputting an enhancement feature map according to the first feature map; the prediction network is used for obtaining the regional position information of the face in the image to be detected according to the enhanced feature map.
The face detection apparatus of this embodiment may be configured to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects thereof are similar, and are not described herein again.
< apparatus >
In this embodiment, an electronic device is further provided, where the electronic device includes the face detection apparatus 7000 described in the embodiments of the present disclosure; alternatively, the electronic device is the electronic device 8000 shown in fig. 8, and includes a processor 8200 and a memory 8100:
memory 8100 stores machine-executable instructions executable by the processor; processor 8200 executes the machine-executable instructions to implement the face detection method according to any one of the embodiments.
< computer-readable storage Medium embodiment >
The present embodiments provide a computer-readable storage medium having stored therein an executable command that, when executed by a processor, performs a method described in any of the method embodiments of the present disclosure.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA) can execute the computer-readable program instructions and implement aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are equivalent.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the present disclosure is defined by the appended claims.

Claims (12)

1. A method of training a face detection model, the method comprising:
acquiring a training sample and a preset initial model; the training sample comprises a face image and real face region position information of the face image; the initial model comprises a basic network, an enhanced network and a prediction network;
inputting the face image into a basic network of the initial model, and outputting a first feature map;
inputting the first feature map into the enhancement network, and outputting an enhancement feature map;
inputting the enhanced feature map into the prediction network to obtain the position information of the predicted face region of the face image;
and substituting the real face region position information of the face image and the predicted face region position information of the face image into a preset loss function for calculation, and updating the initial model according to the calculated loss value to obtain the face detection model.
2. The method of claim 1, wherein the enhancement network comprises: a first convolutional layer, a deformable convolutional layer and an up-sampling convolutional layer which are connected in sequence;
the inputting the first feature map into the enhancement network and outputting an enhanced feature map comprises:
inputting the first feature map into the first convolutional layer to obtain a first intermediate feature map;
inputting the first intermediate feature map into the deformable convolutional layer to obtain a second intermediate feature map;
and inputting the second intermediate feature map into the up-sampling convolutional layer, and outputting the enhanced feature map.
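
Read this way, claim 2's chain could look like the sketch below, assuming the claimed layers map to a plain convolution, a deformable convolution (torchvision.ops.DeformConv2d), and a stride-2 transposed convolution; the channel count and kernel sizes are assumptions.

    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class EnhanceSubNet(nn.Module):
        """Sketch of claim 2: first conv -> deformable conv -> up-sampling conv."""
        def __init__(self, channels=256):
            super().__init__()
            self.first_conv = nn.Conv2d(channels, channels, 3, padding=1)
            # DeformConv2d needs per-position sampling offsets; a small side
            # convolution predicts them (2 offsets per tap of a 3x3 kernel = 18).
            self.offset_conv = nn.Conv2d(channels, 18, 3, padding=1)
            self.deform_conv = DeformConv2d(channels, channels, 3, padding=1)
            # A stride-2 transposed convolution doubles the spatial resolution.
            self.upsample_conv = nn.ConvTranspose2d(channels, channels, 2, stride=2)

        def forward(self, first_feature_map):
            first_intermediate = self.first_conv(first_feature_map)
            second_intermediate = self.deform_conv(
                first_intermediate, self.offset_conv(first_intermediate))
            return self.upsample_conv(second_intermediate)
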
3. The method of claim 1, wherein the basic network comprises: a shallow network and a deep network;
the inputting the face image into the basic network of the initial model and outputting a first feature map comprises:
inputting the face image into the shallow network, and outputting a shallow feature map;
inputting the shallow feature map into the deep network, and outputting the first feature map.
4. The method of claim 3, wherein the shallow network comprises a second convolutional layer and a third convolutional layer, and the deep network comprises a fourth convolutional layer, a fifth convolutional layer and a sixth convolutional layer;
the inputting the face image into the shallow network and outputting a shallow feature map, and the inputting the shallow feature map into the deep network and outputting the first feature map, comprise:
inputting the face image into the second convolutional layer, and outputting a second convolutional layer feature map;
inputting the second convolutional layer feature map into the third convolutional layer, and outputting the shallow feature map;
inputting the shallow feature map into the fourth convolutional layer, and outputting a fourth convolutional layer feature map;
inputting the fourth convolutional layer feature map into the fifth convolutional layer, and outputting a fifth convolutional layer feature map;
inputting the fifth convolutional layer feature map into the sixth convolutional layer, and outputting the first feature map.
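
Claims 3 and 4 together describe a five-convolution backbone; a sketch under assumed strides and channel widths follows (the fourth- and fifth-layer maps are also returned because claim 5 fuses them later).

    import torch.nn as nn

    class BaseNet(nn.Module):
        """Sketch of claims 3-4: shallow (conv2, conv3) then deep (conv4-conv6)."""
        def __init__(self):
            super().__init__()
            self.conv2 = nn.Conv2d(3, 64, 3, stride=2, padding=1)     # shallow network
            self.conv3 = nn.Conv2d(64, 128, 3, stride=2, padding=1)   # -> shallow feature map
            self.conv4 = nn.Conv2d(128, 256, 3, stride=2, padding=1)  # deep network
            self.conv5 = nn.Conv2d(256, 256, 3, stride=2, padding=1)
            self.conv6 = nn.Conv2d(256, 256, 3, stride=2, padding=1)  # -> first feature map

        def forward(self, face_image):
            shallow_feature_map = self.conv3(self.conv2(face_image))
            conv4_map = self.conv4(shallow_feature_map)
            conv5_map = self.conv5(conv4_map)
            first_feature_map = self.conv6(conv5_map)
            return conv4_map, conv5_map, first_feature_map
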
5. The method of claim 4, wherein the enhancement network comprises: a first enhancement sub-network, a second enhancement sub-network and a third enhancement sub-network;
the inputting the first feature map into the enhancement network and outputting an enhanced feature map comprises:
inputting the first feature map into the first enhancement sub-network, and outputting an enhanced first feature map;
fusing the fifth convolutional layer feature map and the enhanced first feature map to obtain a first fused feature map;
inputting the first fused feature map into the second enhancement sub-network, and outputting an enhanced first fused feature map;
fusing the fourth convolutional layer feature map and the enhanced first fused feature map to obtain a second fused feature map;
and inputting the second fused feature map into the third enhancement sub-network, and outputting the enhanced feature map.
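
Claim 5 then traces a top-down fusion path. The sketch below assumes element-wise addition as the fusion operator (the claim does not fix one) and enhancement sub-networks that double the spatial resolution, as in the EnhanceSubNet sketch above, so each enhanced map aligns with the backbone map it is fused with.

    def enhance(enhance1, enhance2, enhance3, conv4_map, conv5_map, first_feature_map):
        # Hypothetical claim-5 path; enhance1..enhance3 are the three
        # enhancement sub-networks (e.g. EnhanceSubNet instances).
        e1 = enhance1(first_feature_map)   # enhanced first feature map
        fused1 = conv5_map + e1            # first fused feature map (addition assumed)
        e2 = enhance2(fused1)              # enhanced first fused feature map
        fused2 = conv4_map + e2            # second fused feature map
        return enhance3(fused2)            # enhanced feature map
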
6. The method of claim 1, wherein:
the real face region position information of the face image comprises real face region information of the face image and real face position information of the face image;
the predicted face region position information of the face image comprises predicted face region information of the face image and predicted face position information of the face image;
the prediction network comprises: a heatmap prediction sub-network and a region position prediction sub-network;
the inputting the enhanced feature map into the prediction network to obtain the predicted face region position information of the face image comprises:
inputting the enhanced feature map into the heatmap prediction sub-network, and outputting the predicted face position information of the face image;
and inputting the enhanced feature map into the region position prediction sub-network, and outputting the predicted face region information of the face image.
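
A minimal sketch of the claim-6 two-headed prediction network follows; the 1x1-convolution heads and the output channel counts (one heatmap channel for face positions, four channels for region geometry) are assumptions.

    import torch
    import torch.nn as nn

    class PredictNet(nn.Module):
        """Sketch of claim 6: heatmap sub-network + region position sub-network."""
        def __init__(self, channels=256):
            super().__init__()
            self.heatmap_head = nn.Conv2d(channels, 1, 1)  # face position heatmap
            self.region_head = nn.Conv2d(channels, 4, 1)   # e.g. box sizes/offsets

        def forward(self, enhanced_feature_map):
            position = torch.sigmoid(self.heatmap_head(enhanced_feature_map))
            region = self.region_head(enhanced_feature_map)
            return position, region
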
7. The method of claim 6, wherein the substituting the predicted face region position information of the face image and the real face region position information of the face image into a preset loss function for calculation, and updating the initial model according to the calculated loss value to obtain the face detection model comprises:
substituting the predicted face position information of the face image and the real face position information of the face image into a first loss function for calculation, substituting the predicted face region information of the face image and the real face region information of the face image into a second loss function for calculation, and updating the initial model according to the calculated loss values to obtain the face detection model.
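
The two-part loss of claim 7 could be assembled as below. Binary cross-entropy on the position heatmap and an L1 penalty on the region geometry are common choices for heatmap-based detectors, but they are assumptions here; the claim only names a first and a second loss function.

    import torch.nn.functional as F

    def total_loss(pred_position, real_position, pred_region, real_region):
        # First loss: predicted vs. real face position information (heatmap).
        first_loss = F.binary_cross_entropy(pred_position, real_position)
        # Second loss: predicted vs. real face region information (geometry).
        second_loss = F.l1_loss(pred_region, real_region)
        return first_loss + second_loss
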
8. A face detection method, comprising:
acquiring an image to be detected, wherein the image to be detected comprises at least one face;
inputting the image to be detected into a face detection model, and outputting the region position information of the detected face;
the face detection model comprises a basic network, an enhancement network and a prediction network;
the basic network is used for extracting features of the image to be detected and outputting a first feature map;
the enhancement network is used for outputting an enhanced feature map according to the first feature map;
and the prediction network is used for obtaining the region position information of the face in the image to be detected according to the enhanced feature map.
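
At inference time (claim 8) the three trained networks are simply chained. The sketch below reuses the signatures of the earlier sketches and assumes a hypothetical 0.5 confidence threshold for reading candidate faces out of the position heatmap.

    import torch

    @torch.no_grad()
    def detect_faces(base_net, enhance_fn, predict_net, image, threshold=0.5):
        conv4_map, conv5_map, first_feature_map = base_net(image)
        enhanced = enhance_fn(conv4_map, conv5_map, first_feature_map)
        position, region = predict_net(enhanced)
        keep = position > threshold            # candidate face locations
        return keep.nonzero(), region          # region position information
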
9. An apparatus for training a face detection model, the apparatus comprising:
the acquisition module is used for acquiring a training sample and a preset initial model; the training sample comprises a face image and real face region position information of the face image; the initial model comprises a basic network, an enhancement network and a prediction network;
the feature extraction module is used for inputting the face image into the basic network of the initial model and outputting a first feature map;
the enhancement module is used for inputting the first feature map into the enhancement network and outputting an enhanced feature map;
the prediction module is used for inputting the enhanced feature map into the prediction network to obtain predicted face region position information of the face image;
and the updating module is used for substituting the predicted face region position information of the face image and the real face region position information of the face image into a preset loss function for calculation, and updating the initial model according to the calculated loss value to obtain the face detection model.
10. A face detection apparatus, comprising:
the acquisition module is used for acquiring an image to be detected, wherein the image to be detected comprises at least one face;
the output module is used for inputting the image to be detected into a face detection model and outputting the region position information of the detected face;
the face detection model comprises a basic network, an enhancement network and a prediction network;
the basic network is used for extracting features of the image to be detected and outputting a first feature map;
the enhancement network is used for outputting an enhanced feature map according to the first feature map;
and the prediction network is used for obtaining the region position information of the face in the image to be detected according to the enhanced feature map.
11. An electronic device comprising a processor and a memory; the memory stores machine executable instructions executable by the processor; the processor executes the machine executable instructions to implement the method of training a face detection model of any one of claims 1-7.
12. An electronic device comprising a processor and a memory; the memory stores machine executable instructions executable by the processor; the processor executes the machine-executable instructions to implement the face detection method of claim 8.
CN202010611503.8A 2020-06-29 2020-06-29 Face detection method, training method and device of model thereof and electronic equipment Pending CN111860214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010611503.8A CN111860214A (en) 2020-06-29 2020-06-29 Face detection method, training method and device of model thereof and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010611503.8A CN111860214A (en) 2020-06-29 2020-06-29 Face detection method, training method and device of model thereof and electronic equipment

Publications (1)

Publication Number Publication Date
CN111860214A true CN111860214A (en) 2020-10-30

Family

ID=72989380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010611503.8A Pending CN111860214A (en) 2020-06-29 2020-06-29 Face detection method, training method and device of model thereof and electronic equipment

Country Status (1)

Country Link
CN (1) CN111860214A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818975A (en) * 2021-01-27 2021-05-18 北京金山数字娱乐科技有限公司 Text detection model training method and device and text detection method and device

Similar Documents

Publication Publication Date Title
CN109858445B (en) Method and apparatus for generating a model
CN109829432B (en) Method and apparatus for generating information
CN109981787B (en) Method and device for displaying information
CN113837934B (en) Image generation method and device, electronic equipment and storage medium
WO2020047261A1 (en) Active image depth prediction
CN109658346B (en) Image restoration method and device, computer-readable storage medium and electronic equipment
CN110288625B (en) Method and apparatus for processing image
CN115376211B (en) Lip driving method, lip driving model training method, device and equipment
CN110633717A (en) Training method and device for target detection model
CN115511779B (en) Image detection method, device, electronic equipment and storage medium
CN113378855A (en) Method for processing multitask, related device and computer program product
CN113627536A (en) Model training method, video classification method, device, equipment and storage medium
CN114898177B (en) Defect image generation method, model training method, device, medium and product
CN107016055A (en) Method, equipment and electronic equipment for excavating entity alias
CN111104874B (en) Face age prediction method, training method and training device for model, and electronic equipment
CN114546460A (en) Firmware upgrading method and device, electronic equipment and storage medium
CN111860214A (en) Face detection method, training method and device of model thereof and electronic equipment
CN116205819B (en) Character image generation method, training method and device of deep learning model
CN111105440B (en) Tracking method, device, equipment and storage medium for target object in video
CN110335237B (en) Method and device for generating model and method and device for recognizing image
WO2023097952A1 (en) Pre-trained model publishing method and apparatus, electronic device, storage medium, and computer program product
CN111784726A (en) Image matting method and device
EP4156124A1 (en) Dynamic gesture recognition method and apparatus, and device and storage medium
CN114327718B (en) Interface display method, device, equipment and medium
CN108038863B (en) Image segmentation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination