CN108921017B - Face detection method and system - Google Patents

Info

Publication number
CN108921017B
Authority
CN
China
Prior art keywords
feature
class
feature map
type
layer
Prior art date
Legal status
Active
Application number
CN201810506447.4A
Other languages
Chinese (zh)
Other versions
CN108921017A (en)
Inventor
王鲁许
董远
白洪亮
熊风烨
Current Assignee
Suzhou Feisou Technology Co., Ltd.
Original Assignee
Suzhou Feisou Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Suzhou Feisou Technology Co., Ltd.
Priority to CN201810506447.4A
Publication of CN108921017A
Application granted
Publication of CN108921017B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation

Abstract

The application provides a face detection method and system. The face detection method comprises the following steps: processing at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures; extracting feature information of the first class of feature maps and feature information of the second class of feature maps; acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first class of feature maps and the feature information of the second class of feature maps in the sample picture; updating the weight value of each convolution layer according to the matching degree of the feature map corresponding to the last convolution layer and the target image; and generating a face detection model according to the weight value of each convolution layer. Because the first class feature extraction layer lies near the front of the network, its feature plane is relatively large, which makes it suitable for detecting small faces; extracting the feature information of the first class of feature maps therefore improves the small face detection capability.

Description

Face detection method and system
Technical Field
The present application relates to the field of image detection technologies, and in particular, to a face detection method and system.
Background
Face detection is the process of locating face regions in an image. In practical applications, face detection is mainly used in face recognition systems, where recognition is then carried out on the detected face regions. The SSD (Single Shot MultiBox Detector) is widely used in face detection as a fast target detection framework; owing to the sizes of its feature extraction layers, an SSD network detects large objects with high accuracy.
The SSD is a single-stage detection convolutional neural network framework consisting mainly of two parts: a base network layer (such as VGG) at the front end, and feature extraction layers added on top of the base network layer. When the SSD network detects a picture, feature vectors are extracted from the feature planes of the convolution layers, and faces are detected according to these feature vectors. The SSD network has many structural layers and is relatively deep; the feature plane size shrinks layer by layer while the size of the corresponding default boxes grows, so the network detects large objects with high accuracy. For small objects, however, little feature information survives the many convolution operations to reach the feature extraction layers, so performance on small faces is poor.
Disclosure of Invention
In view of this, embodiments of the present application provide a face detection method and system to solve the problem that an SSD network performs poorly at detecting small faces.
The embodiment of the application adopts the following technical scheme:
the embodiment of the application provides a face detection method, which comprises the following steps:
performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the sample picture;
updating the weight value of each convolution layer according to the matching degree of a feature map corresponding to the last convolution layer and a target image, wherein the last convolution layer is a convolution layer located in an end output layer of the at least two-layer network structure;
and generating a face detection model according to the weight value of each convolution layer.
The embodiment of the application also provides a face detection method, which comprises the following steps:
performing convolution processing on an input picture to be detected through at least two layers of network structures based on a face detection model to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
and acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the picture to be detected.
An embodiment of the present application further provides a face detection system, including:
the processing unit is used for performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first-class feature extraction layer and at least one second-class feature extraction layer, the feature map corresponding to the first-class feature extraction layer is a first-class feature map, the feature map corresponding to the second-class feature extraction layer is a second-class feature map, and the first-class feature extraction layer is positioned in front of the second-class feature extraction layer;
the first extraction unit is used for extracting the feature information of the first class feature map;
the second extraction unit is used for extracting the feature information of the second type of feature map;
the detection unit is used for acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first class of feature map and the feature information of the second class of feature map in the sample picture;
the updating unit is used for updating the weight value of each convolution layer according to the matching degree of a characteristic graph corresponding to the last convolution layer and a target image, wherein the last convolution layer is a convolution layer positioned on the tail end output layer in the at least two-layer network structure;
and the generating unit is used for generating a face detection model according to the weight value of each convolution layer.
An embodiment of the present application further provides a face detection system, including:
the processing unit is used for performing convolution processing on an input picture to be detected through at least two layers of network structures based on a face detection model to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first-class feature extraction layer and at least one second-class feature extraction layer, the feature map corresponding to the first-class feature extraction layer is a first-class feature map, the feature map corresponding to the second-class feature extraction layer is a second-class feature map, and the first-class feature extraction layer is positioned in front of the second-class feature extraction layer;
the first extraction unit is used for extracting the feature information of the first class feature map;
the second extraction unit is used for extracting the feature information of the second type of feature map;
and the detection unit is used for acquiring the detection position coordinates of the face frame in the picture to be detected according to the corresponding position coordinates of the characteristic information of the first class of characteristic diagram and the characteristic information of the second class of characteristic diagram.
Embodiments of the present application also provide an electronic system, including at least one processor and a memory, where the memory stores a program configured to be executed by the at least one processor to perform the following steps:
performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the sample picture;
updating the weight value of each convolution layer according to the matching degree of a feature map corresponding to the last convolution layer and a target image, wherein the last convolution layer is a convolution layer located in an end output layer of the at least two-layer network structure;
and generating a face detection model according to the weight value of each convolution layer.
Embodiments of the present application also provide a computer-readable storage medium, containing a program for use in conjunction with an electronic system, the program being executable by a processor to perform the steps of:
performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the sample picture;
updating the weight value of each convolution layer according to the matching degree of a feature map corresponding to the last convolution layer and a target image, wherein the last convolution layer is a convolution layer located in an end output layer of the at least two-layer network structure;
and generating a face detection model according to the weight value of each convolution layer.
Embodiments of the present application also provide an electronic system, including at least one processor and a memory, where the memory stores a program configured to be executed by the at least one processor to perform the following steps:
performing convolution processing on an input picture to be detected through at least two layers of network structures based on a face detection model to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
and acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the picture to be detected.
Embodiments of the present application also provide a computer-readable storage medium, containing a program for use in conjunction with an electronic system, the program being executable by a processor to perform the steps of:
performing convolution processing on an input picture to be detected through at least two layers of network structures based on a face detection model to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
and acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the picture to be detected.
The embodiment of the application adopts at least one technical scheme which can achieve the following beneficial effects:
processing at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first class feature extraction layer and at least one second class feature extraction layer, the feature map corresponding to the first class feature extraction layer is a first class feature map, the feature map corresponding to the second class feature extraction layer is a second class feature map, and the first class feature extraction layer is positioned in front of the second class feature extraction layer; extracting feature information of the first class of feature maps and feature information of the second class of feature maps; acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first class of feature maps and the feature information of the second class of feature maps in the sample picture; updating the weight value of each convolution layer according to the matching degree of a feature map corresponding to the last convolution layer and a target image, wherein the last convolution layer is a convolution layer located in an end output layer of the at least two-layer network structure; and generating a face detection model according to the weight value of each convolution layer. Because the first class feature extraction layer lies near the front of the network, its feature plane is relatively large, making it suitable for detecting small faces; extracting the feature information of the first class of feature maps increases the feature depth available for small faces and thereby improves the small face detection capability.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a diagram of a conventional SSD network architecture;
FIG. 2 is a schematic flow chart of a face detection method according to the present invention;
FIG. 3 is a schematic diagram of an SSD network architecture implemented by applying the face detection method of the present invention;
FIG. 4 is a schematic diagram illustrating a convolutional layer merging principle in the face detection method according to the present invention;
FIG. 5 is a schematic flow chart of a face detection method according to the present invention;
FIG. 6 is a schematic flow chart of an embodiment of the face detection method of the present invention;
FIG. 7 is a schematic structural diagram of a face detection system according to the present invention;
fig. 8 is a schematic structural diagram of the face detection system according to the present invention.
Detailed Description
As shown in fig. 1, an existing SSD network structure mainly comprises two parts: a base network layer (VGG-16, in the dashed box), which consists of the first 5 convolutional stages of VGG-16, and the feature extraction layers newly added by the SSD network (i.e., convolution layers added on top of the base network layer) for extracting high-level feature information. Convolutional layers Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 are the main feature extraction layers. The output of each of these layers is convolved with two 3×3 convolution kernels to obtain feature values: one convolution outputs probability values for classification, with each default box producing 2 probability values; the other outputs relative position coordinates for regression, with each default box producing 4 relative coordinate values (x, y, w, h). In addition, these 6 convolutional layers pass through a prior box layer that generates the original coordinates of the default boxes, with a given number of default boxes per layer. Finally, the three computation results are combined and passed to the loss layer to calculate a loss value, which is fed back to adjust the learning parameters. Pictures are then detected with the trained model. During detection, the feature plane size shrinks progressively while the corresponding detection area grows, so the SSD network is accurate at detecting large objects; but for target objects below 50×50 pixels (for example, small faces), effective information becomes scarcer and scarcer after multi-layer convolution, so detection performance is poor.
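For illustration only, the following is a minimal PyTorch sketch of the per-layer detection heads described above. It is not the patented implementation; the channel counts and the number of default boxes per location are assumptions in the style of a 300×300 SSD.

```python
# Hypothetical sketch of the per-layer SSD heads: for each feature plane,
# one 3x3 convolution for class scores and one for box offsets.
import torch
import torch.nn as nn

class SSDHead(nn.Module):
    def __init__(self, in_channels: int, num_boxes: int, num_classes: int = 2):
        super().__init__()
        # Classification branch: a probability per class for each default box.
        self.cls = nn.Conv2d(in_channels, num_boxes * num_classes,
                             kernel_size=3, padding=1)
        # Regression branch: 4 relative coordinates (x, y, w, h) per default box.
        self.loc = nn.Conv2d(in_channels, num_boxes * 4,
                             kernel_size=3, padding=1)

    def forward(self, feature_map: torch.Tensor):
        return self.cls(feature_map), self.loc(feature_map)

# One head per feature extraction layer (Conv4_3, Conv7, ..., Conv11_2);
# in_channels values below are the usual SSD300 ones, assumed here.
heads = nn.ModuleList([
    SSDHead(512, 4), SSDHead(1024, 6), SSDHead(512, 6),
    SSDHead(256, 6), SSDHead(256, 4), SSDHead(256, 4),
])
```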
In view of the above problems, and to achieve the purpose of the present application, an embodiment of the present application provides a face detection method and a face detection system capable of detecting small faces (picture regions below 50×50 pixels) based on an SSD network. By extracting feature information of the first class feature map of the first class feature extraction layer (i.e., a convolution layer in the base network layer), performing prediction-type processing on that feature information, and determining the face feature vectors in the sample picture, the small face detection capability is improved. Normalizing and convolving the feature plane of the first class feature map increases the feature depth of the extracted small target image, making full use of the relatively large feature plane of the first class feature extraction layer, which is well suited to detecting small faces, and thereby improving small face detection accuracy.
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 2 is a schematic flow chart of a face detection method provided in the embodiment of the present application. The method may be as follows. The execution subject of this embodiment may be a face detection system, or a detection component in a face recognition system.
Step S101: performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is located in front of the second type feature extraction layer.
In this embodiment of the present application, when the at least two layers of network structures process the sample picture, a first class feature map from the base network layer and a second class feature map from the feature extraction layers are obtained; the first class feature extraction layer may be a convolution layer located in the base network layer. The convolution layers from which feature maps are obtained when processing the sample picture may include, in addition to the convolution layers of the base network layer, convolution layers added on top of the base network layer (i.e., the feature extraction layers). As the number of layers increases, the feature plane size of the convolution layers decreases layer by layer, while the detection area corresponding to the feature information acquired from deeper convolution layers grows larger.
Further, before the sample picture is processed, it may be adjusted to a preset size, for example 300×300 pixels, so that the sample picture size meets the requirements of the SSD network structure.
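A minimal preprocessing sketch of this resizing step is shown below; the 300×300 size follows the example above, and the channel layout conversion is an assumption about how the picture is fed to the network.

```python
# Resize every sample picture to the preset size before feeding the network.
import cv2
import numpy as np

def preprocess(image: np.ndarray, size: int = 300) -> np.ndarray:
    """Resize an HWC uint8 image to size x size and convert to CHW float32."""
    resized = cv2.resize(image, (size, size))
    return resized.astype(np.float32).transpose(2, 0, 1)  # CHW for the network
```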
Step S102: and extracting the feature information of the first class feature map.
In this embodiment of the present application, extracting feature information of the first-class feature extraction layer includes:
compressing the first class feature map;
and performing convolution processing on the compressed first-class feature map to acquire feature information of the first-class feature map.
Specifically, taking the Conv3_3 layer as the first class feature extraction layer: a separate feature extraction layer is added for the Conv3_3 layer. Since the Conv3_3 layer lies near the front of the network and its feature plane is relatively large, containing relatively low-level, simple features, the original SSD network does not perform feature extraction on this layer alone. Because the features of the base network layer make small faces easier to detect, the face detection method provided by the application adds a separate feature extraction layer to the Conv3_3 layer to improve small face detection capability. The feature extraction structure of this layer is shown in fig. 3: the Conv3_3 output is processed by a Norm layer and then a 3×3 convolution layer; the SSD network then extracts the prior boxes, the classification probabilities and the relative positions to obtain feature vectors, which are finally input to the loss layer to calculate a loss value.
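As a sketch of this extra Conv3_3 branch, the following assumes the Norm layer is a channel-wise L2 normalization with a learned scale (the treatment SSD commonly applies to Conv4_3; the patent does not specify the Norm variant), followed by the 3×3 convolution head.

```python
# Hypothetical Conv3_3 branch: L2-style Norm with learned scale + 3x3 conv.
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2Norm(nn.Module):
    def __init__(self, channels: int, scale: float = 20.0):  # scale is assumed
        super().__init__()
        self.weight = nn.Parameter(torch.full((channels,), scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, p=2, dim=1)             # L2-normalize across channels
        return x * self.weight.view(1, -1, 1, 1)   # rescale with learned weights

conv3_3_norm = L2Norm(256)   # Conv3_3 of VGG-16 has 256 channels
conv3_3_head = nn.Conv2d(256, 256, kernel_size=3, padding=1)
```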
Normalizing and convolving the first class feature map increases the feature depth of the extracted image, making full use of the relatively large feature plane of the first class feature extraction layer, which is well suited to detecting small faces, and thereby improving small face detection accuracy.
Furthermore, the original Softmax Loss used for classification can be replaced with Focal Loss; experiments show that Focal Loss improves the network's ability to learn from hard samples and thus improves its performance.
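A minimal sketch of Focal Loss for binary face/background classification follows; the alpha and gamma values are the commonly used defaults, which the patent does not specify.

```python
# Focal Loss: down-weights easy samples so hard samples dominate learning.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """logits: (N,) raw scores; targets: (N,) in {0, 1}."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()    # modulated cross-entropy
```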
It should be noted that the first-class feature extraction layer may be any one of the underlying network layers, and is not limited to the Conv3_3 layer.
In the embodiment of the present application, while extracting the feature information of the first-class feature extraction layer, the feature vector of each convolution layer subsequent to the first-class feature extraction layer may also be extracted.
Specifically, the feature vectors of the convolution layers of the base network layer may be extracted in the same way as the feature information of the first class feature extraction layer, while the feature vectors of the feature extraction layers may be acquired by convolving each layer's output with two 3×3 convolution kernels to obtain feature values.
Step S103: and extracting the feature information of the second class of feature map.
In this embodiment of the present application, extracting feature information of the second type of feature map includes:
merging the second-class feature graphs corresponding to at least two second-class feature extraction layers;
and extracting corresponding feature information from the merged second-class feature map.
In this embodiment, the second class feature extraction layer may be the convolutional layer Conv4_3 located in the base network layer; the feature information of the corresponding second class feature map may be acquired by convolving the layer's output with two 3×3 convolution kernels to obtain feature values.
Further, merging the second-class feature graphs corresponding to the at least two second-class feature extraction layers, including:
carrying out downsampling processing on one second-class feature map with relatively large feature plane size in any two second-class feature maps, and merging the second-class feature map obtained through downsampling processing with the other second-class feature map in any two second-class feature maps; or
And performing deconvolution processing on one second-class feature map with relatively small feature plane size in any two second-class feature maps, and merging the second-class feature map obtained through deconvolution processing with the other second-class feature map in any two second-class feature maps.
Taking the merging of the second class feature maps corresponding to two second class feature extraction layers as an example: feature information may be acquired by convolving the output of each second class feature extraction layer with two 3×3 convolution kernels to obtain feature values, or it may be obtained by merging the maps and extracting the corresponding feature information jointly. For example, a down-sampled (pooled) version of one second class feature map is merged with another second class feature map to extract feature vectors, or a deconvolved version of one second class feature map is merged with another second class feature map to extract feature information. It should be noted that a second class feature map may also be merged with the first class feature map to obtain the feature information of the feature extraction layer. Extracting feature information jointly in this merged manner improves the accuracy and recall of whole-image detection.
Here, down-sampling means taking one sample out of every several samples of a sequence; the new sequence thus obtained is a down-sampled version of the original. Deconvolution, as the name implies, is the inverse of the convolution operation: convolution maps an input picture to output features, its theoretical basis being the translation invariance among statistical invariances, and it reduces dimensionality; deconvolution maps features back toward the input picture and serves to restore it.
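The following is a sketch of the two pairwise merge options described above. It assumes the merged maps are combined by channel concatenation and that the deconvolution doubles the feature plane size; the patent does not fix either choice for the pairwise case.

```python
# Pairwise merging of two second-class feature maps.
import torch
import torch.nn as nn

def merge_by_downsampling(large: torch.Tensor, small: torch.Tensor) -> torch.Tensor:
    """Down-sample the larger feature map to the smaller one's size, then merge."""
    pooled = nn.functional.adaptive_max_pool2d(large, small.shape[-2:])
    return torch.cat([pooled, small], dim=1)

def merge_by_deconvolution(small: torch.Tensor, large: torch.Tensor,
                           deconv: nn.ConvTranspose2d) -> torch.Tensor:
    """Deconvolve (upsample) the smaller feature map, then merge with the larger."""
    upsampled = deconv(small)   # assumes deconv output matches large's size
    return torch.cat([upsampled, large], dim=1)

# Example: a stride-2 deconvolution that doubles the feature-plane size.
deconv = nn.ConvTranspose2d(in_channels=512, out_channels=512,
                            kernel_size=2, stride=2)
```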
In practical application, when there are more than two second class feature maps corresponding to second class feature extraction layers, the feature vectors can still be extracted by merging them pairwise as described above.
In this embodiment of the present application, merging second-class feature maps corresponding to at least three second-class feature extraction layers includes:
after the second-class feature map obtained through the downsampling processing is combined with the other second-class feature map in the any two second-class feature maps, the combined second-class feature map is further combined with the other second-class feature map subjected to the deconvolution processing, and the feature plane size of the other second-class feature map is smaller than the feature plane sizes of the any two second-class feature maps;
and in the process of merging the second-class feature map obtained by the down-sampling processing with the other one of the two arbitrary second-class feature maps, the down-sampling processing is carried out on one second-class feature map with a relatively large feature plane size in any three second-class feature maps.
When at least three second class feature maps are merged, corresponding feature information can be extracted from the three merged maps. Seven feature extraction layers are taken as an example for a detailed description: a down-sampled version of one second class feature map is merged with a deconvolved second class feature map and another second class feature map to extract feature vectors. As shown in figs. 3-4, Conv3_3 is the first class feature extraction layer, and Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 are second class feature extraction layers. Starting from the Conv4_3 layer, the Conv4_3, Conv7, Conv8_2, Conv9_2 and Conv10_2 layers are down-sampled, and starting from the Conv11_2 layer, deconvolution proceeds in the reverse direction. Each intermediate layer is then merged with the down-sampled previous layer and the deconvolved next layer, similar to a sandwich structure, yielding 4 merged feature extraction layers. These 4 layers carry both the relatively simple lower-level features and the relatively complex higher-level features, so their expressive power is much better than that of the original network structure. The Conv4_3 layer and the last Conv11_2 layer perform feature vector extraction separately, giving 6 feature extraction layers in total apart from the first class feature extraction layer.
It should be noted that, in practical application, the feature information of the first class feature extraction layer itself, the feature information of the last second class feature map itself, and the feature information of the merged second class feature maps all need to be extracted. When there are more than three second class feature maps, feature vectors can still be extracted by merging them three at a time. The three merged second class feature maps may be adjacent to one another, or may be any three second class feature maps.
In practical application, normalizing and convolving the feature plane of the first class feature extraction layer increases the feature depth of the extracted image, making full use of the relatively large feature plane of the first class feature extraction layer, which is well suited to detecting small faces, and thereby improving small face detection accuracy.
In this embodiment of the present application, merging second-class feature maps corresponding to at least three second-class feature extraction layers includes:
after the second-class feature map obtained through deconvolution processing is merged with the other second-class feature map in the any two second-class feature maps, the merged second-class feature map is further merged with the other second-class feature map subjected to downsampling processing, and the feature plane size of the other second-class feature map is larger than the feature plane sizes of the any two second-class feature maps;
and in the process of merging the second-class feature map obtained by the deconvolution processing with the other one of the two arbitrary second-class feature maps, performing deconvolution processing on one of the three arbitrary second-class feature maps with a relatively small feature plane size.
It should be noted that, in practical application, the feature vector of the first class feature extraction layer itself, the feature vector of the last second class feature map itself, and the feature vectors of the merged second class feature maps all need to be extracted. When there are more than three second class feature maps, feature vectors can still be extracted by merging them three at a time. The three merged second class feature maps may be adjacent to one another, or may be any three second class feature maps.
In the embodiment of the present application, the any three second class feature maps are pairwise adjacent second class feature maps.
Referring to the structure of fig. 4, merging three second class feature maps means convolving each of the three layers with its own 3×3 convolution kernel, passing the results sequentially through a BN (Batch Normalization) layer and an Eltw Product (element-wise product) layer, and outputting the final merged feature layer through a PReLU layer. A prior box and the classification probability values are then extracted by the SSD, and finally the obtained values are input into the loss layer to calculate a loss value.
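A sketch of this three-way merge is given below. It assumes the down-sampled previous layer, the middle layer and the deconvolved next layer have already been brought to the same spatial size and channel count, and that "Eltw Product" multiplies the three branches element-wise; the exact wiring in fig. 4 may differ.

```python
# Sandwich merge: per-branch 3x3 conv + BN, element-wise product, PReLU.
import torch
import torch.nn as nn

class SandwichMerge(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(3)
        )
        self.bns = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(3))
        self.prelu = nn.PReLU(channels)

    def forward(self, prev_down: torch.Tensor, mid: torch.Tensor,
                next_up: torch.Tensor) -> torch.Tensor:
        a, b, c = (bn(conv(x)) for conv, bn, x in
                   zip(self.convs, self.bns, (prev_down, mid, next_up)))
        return self.prelu(a * b * c)   # Eltw Product of the three branches
```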
Step S104: and acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the sample picture.
In this embodiment of the present application, obtaining detection position coordinates of a face frame according to corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the sample picture includes:
acquiring first position information of the feature information of the first type of feature map in the sample picture according to the mapping relation between the first type of feature map and the sample picture;
acquiring second position information of the feature information of the second type of feature map in the sample picture according to the mapping relation between the second type of feature map and the sample picture;
and combining the first position information and the second position information according to the score data corresponding to the first position information and the score data corresponding to the second position information to obtain the detection position coordinates of the face frame.
It should be noted that, in the embodiment of the present application, the number of anchors per feature point is set to 6 for each feature extraction layer, whereas the original SSD network defaults to 4; this number directly affects the detection performance of the SSD network. Modifying the number of anchors can improve the recall rate and accuracy of the SSD network. Recall (also called the recall ratio) is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the document library; it measures how completely a retrieval system finds relevant results.
An anchor is a structure in the SSD network that exists in the network layers used for feature extraction; the number of anchors directly influences the detection effect.
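As an illustration, the sketch below generates 6 default boxes (anchors) per feature point for one feature plane, as described above. The scale and aspect-ratio values are assumptions in the spirit of SSD's prior-box layer, not values from the patent.

```python
# Default-box generation: 6 boxes per feature point on one feature plane.
import itertools
import torch

def default_boxes(fmap_size: int, scale: float,
                  aspect_ratios=(1.0, 2.0, 0.5, 3.0, 1 / 3)):
    """Return (fmap_size*fmap_size*6, 4) boxes as (cx, cy, w, h) in [0, 1]."""
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
        for ar in aspect_ratios:                 # 5 boxes from aspect ratios
            boxes.append([cx, cy, scale * ar ** 0.5, scale / ar ** 0.5])
        boxes.append([cx, cy, scale * 1.3, scale * 1.3])  # 6th, larger box
    return torch.tensor(boxes)
```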
In the embodiment of the application, prediction category processing is carried out on the feature information to obtain a face classification value, which is compared with a target value: if the face classification value falls within the target threshold range, the feature vector can be used as face feature information; if it does not, the image region corresponding to the feature information is not a face.
Step S105: and updating the weight value of each convolution layer according to the matching degree of the characteristic graph corresponding to the last convolution layer and the target image, wherein the last convolution layer is a convolution layer positioned on the tail end output layer in the at least two-layer network structure.
Step S106: and generating a face detection model according to the weight value of each convolution layer.
In this embodiment, the feature map output by the last convolution layer of the at least two layers of network structures is matched against the target image to obtain a corresponding matching degree (i.e., a regression gradient); the weight value and bias of each convolution layer are adjusted according to this matching degree, so that the adjusted convolution layers are better suited to detecting small faces and the small face detection accuracy improves. A face detection model is then generated from the finally determined weight values and biases of the convolution layers.
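A hedged sketch of one training iteration follows: the loss layer compares the network output with the target, and backpropagation updates the weights and biases of every convolution layer. The optimizer choice and learning rate are assumptions; the patent only states that weights are updated from the matching degree (the regression gradient).

```python
# One training step: compute loss, backpropagate, update all layer weights.
import torch

def train_step(model, criterion, optimizer, sample, target):
    optimizer.zero_grad()
    output = model(sample)             # detections from all feature layers
    loss = criterion(output, target)   # matching degree vs. the target image
    loss.backward()                    # propagate the regression gradient
    optimizer.step()                   # update each convolution layer's weights
    return loss.item()

# e.g. optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```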
Based on the same inventive concept, as shown in fig. 5, the present invention further provides a face detection method, including:
step S201: performing convolution processing on an input picture to be detected through at least two layers of network structures based on a face detection model to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
step S202: extracting feature information of the first class feature map;
further, extracting feature information of the first class feature map includes:
compressing the first class feature map;
and performing convolution processing on the compressed first-class feature map to acquire feature information of the first-class feature map.
Step S203: extracting feature information of the second class of feature maps;
further, extracting feature information of the second class of feature maps includes:
merging the second-class feature graphs corresponding to at least two second-class feature extraction layers;
and extracting corresponding feature information from the merged second-class feature map.
Step S204: and acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the picture to be detected.
In this embodiment of the application, obtaining the detection position of the face frame according to the corresponding position coordinates of the feature information of the first class of feature map and the feature information of the second class of feature map in the picture to be detected includes:
acquiring first position information of the feature information of the first class feature map in the picture to be detected according to the mapping relation between the first class feature map and the picture to be detected;
acquiring second position information of the feature information of the second class feature map in the picture to be detected according to the mapping relation between the second class feature map and the picture to be detected;
and combining the first position information and the second position information by using a Soft-NMS module according to the score data corresponding to the first position information and the score data corresponding to the second position information to obtain the detection position coordinates of the face frame.
In practical application, the original NMS module loses faces when faces overlap, so its detection results are not good enough. Replacing the original NMS module with a Soft-NMS module therefore reduces missed detections in overlapping-face situations and improves face detection accuracy.
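The following is a sketch of linear Soft-NMS: instead of deleting boxes that overlap the current best box (as plain NMS does), their scores are decayed by the overlap, so overlapping faces are less likely to be lost. The thresholds are assumptions, and the input is assumed non-empty.

```python
# Linear Soft-NMS over (x1, y1, x2, y2) boxes with confidence scores.
import numpy as np

def iou(box, boxes):
    """box: (4,); boxes: (N, 4). Returns (N,) IoU values."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def soft_nms(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    boxes, scores = boxes.copy(), scores.copy()
    keep = []
    while scores.max() > score_thresh:
        best = scores.argmax()
        keep.append(boxes[best])
        overlaps = iou(boxes[best], boxes)
        # Decay (rather than delete) the scores of overlapping boxes.
        scores *= np.where(overlaps > iou_thresh, 1.0 - overlaps, 1.0)
        scores[best] = 0.0   # remove the chosen box from further rounds
    return np.array(keep)
```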
By extracting and processing the feature information of the first class feature extraction layer (i.e., a convolution layer in the base network layer), the face feature information in the picture is determined, thereby improving the small face detection capability. Normalizing and convolving the feature plane of the first class feature map increases the feature depth of the extracted small target image, making full use of the relatively large feature plane of the first class feature extraction layer, which is well suited to detecting small faces, and thus improving small face detection accuracy.
In one or more embodiments of the present application, merging second-class feature maps corresponding to at least two second-class feature extraction layers includes:
carrying out downsampling processing on one second-class feature map with relatively large feature plane size in any two second-class feature maps, and merging the second-class feature map obtained through downsampling processing with the other second-class feature map in any two second-class feature maps; or
And performing deconvolution processing on one second-class feature map with relatively small feature plane size in any two second-class feature maps, and merging the second-class feature map obtained through deconvolution processing with the other second-class feature map in any two second-class feature maps.
In one or more embodiments of the present application, merging second-class feature maps corresponding to at least three second-class feature extraction layers includes:
after the second-class feature map obtained through the downsampling processing is combined with the other second-class feature map in the any two second-class feature maps, the combined second-class feature map is further combined with the other second-class feature map subjected to the deconvolution processing, and the feature plane size of the other second-class feature map is smaller than the feature plane sizes of the any two second-class feature maps;
and in the process of merging the second-class feature map obtained by the down-sampling processing with the other one of the two arbitrary second-class feature maps, the down-sampling processing is carried out on one second-class feature map with a relatively large feature plane size in any three second-class feature maps.
In one or more embodiments of the present application, merging second-class feature maps corresponding to at least three second-class feature extraction layers includes:
after the second-class feature map obtained through deconvolution processing is merged with the other second-class feature map in the any two second-class feature maps, the merged second-class feature map is further merged with the other second-class feature map subjected to downsampling processing, and the feature plane size of the other second-class feature map is larger than the feature plane sizes of the any two second-class feature maps;
and in the process of merging the second-class feature map obtained by the deconvolution processing with the other one of the two arbitrary second-class feature maps, performing deconvolution processing on one of the three arbitrary second-class feature maps with a relatively small feature plane size.
In one or more embodiments of the present application, the any three second class feature maps are pairwise adjacent second class feature maps.
As a preferred embodiment, as shown in fig. 6, the face detection method may include two stages, namely, a training stage and a detection stage.
The training phase comprises the following steps:
step 1: adjusting the sample picture to a preset size;
step 2: performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
and step 3: extracting feature information of a first class of feature maps and extracting feature information of a second class of feature maps;
and 4, step 4: acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the sample picture;
and 5: updating the weight value of each convolution layer according to the matching degree of a characteristic graph corresponding to the last convolution layer of at least two layers of network structures and a target image;
step 6: and generating a face detection model according to the weight value of each convolution layer.
The detection stage comprises the following steps:
and 7: initializing at least two layers of network structures by using a face detection model obtained by training;
and 8: adjusting the picture to be detected to a preset size;
and step 9: performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two characteristic graphs corresponding to at least two convolution layers of the at least two layers of network structures;
step 10: extracting feature information of a first class of feature maps and extracting feature information of a second class of feature maps;
step 11: and acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the picture to be detected.
Based on the same inventive concept, fig. 7 is a face detection system provided by the present invention, which includes:
the processing unit 11 is configured to perform convolution processing on at least one input sample picture through at least two layers of network structures based on a face detection model to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, where the at least two convolution layers include at least one first-class feature extraction layer and at least one second-class feature extraction layer, a feature map corresponding to the first-class feature extraction layer is a first-class feature map, a feature map corresponding to the second-class feature extraction layer is a second-class feature map, and the first-class feature extraction layer is located before the second-class feature extraction layer;
a first extraction unit 12, configured to extract feature information of the first class feature map;
a second extraction unit 13, configured to extract feature information of the second type feature map;
the detection unit 14 is configured to obtain a detection position coordinate of the face frame according to the corresponding position coordinates of the feature information of the first class of feature map and the feature information of the second class of feature map in the sample picture;
an updating unit 15, configured to update a weight value of each convolution layer according to a matching degree between a feature map corresponding to a last convolution layer and a target image, where the last convolution layer is a convolution layer located in a last output layer of the at least two-layer network structure;
and a generating unit 16, configured to generate a face detection model according to the weight value of each convolution layer.
In one or more embodiments of the present application, the first extraction unit 12 is configured to extract feature information of the first class feature map, including:
compressing the feature plane of the first-class feature extraction layer;
and performing convolution processing on the compressed first-class feature map to acquire feature information of a feature plane of the first-class feature map.
In one or more embodiments of the present application, the second extracting unit 13 is configured to extract feature information of the second class of feature maps, including:
merging the second class feature maps corresponding to at least two second class feature extraction layers;
and extracting corresponding feature information from the merged second-class feature map.
In one or more embodiments of the present application, the second extraction unit 13 merges the second-class feature maps corresponding to at least two second-class feature extraction layers by:
downsampling the one of any two second-class feature maps whose feature plane size is relatively large, and merging the downsampled second-class feature map with the other of the two second-class feature maps; or
deconvolving the one of any two second-class feature maps whose feature plane size is relatively small, and merging the deconvolved second-class feature map with the other of the two second-class feature maps.
In one or more embodiments of the present application, the second extraction unit 13 merges the second-class feature maps corresponding to at least three second-class feature extraction layers by:
after the downsampled second-class feature map has been merged with the other of the two second-class feature maps, further merging the result with a third second-class feature map that has been deconvolved, the feature plane size of this third feature map being smaller than the feature plane sizes of the two second-class feature maps;
wherein, in this case, the feature map that is downsampled is the one of the three second-class feature maps whose feature plane size is relatively large.
In one or more embodiments of the present application, the second extraction unit 13 may instead merge the second-class feature maps corresponding to at least three second-class feature extraction layers by:
after the deconvolved second-class feature map has been merged with the other of the two second-class feature maps, further merging the result with a third second-class feature map that has been downsampled, the feature plane size of this third feature map being larger than the feature plane sizes of the two second-class feature maps;
wherein, in this case, the feature map that is deconvolved is the one of the three second-class feature maps whose feature plane size is relatively small.
In one or more embodiments of the present application, the three second-class feature maps are pairwise adjacent second-class feature maps.
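A minimal sketch of merging two second-class feature maps of different feature plane sizes is given below. Channel-wise concatenation is assumed as the merge operation, and nearest-neighbour upsampling stands in for a learned deconvolution; both choices are assumptions, as the embodiment leaves the exact operators open.

```python
import torch
import torch.nn.functional as F

def merge_two_feature_maps(large: torch.Tensor, small: torch.Tensor,
                           mode: str = "down") -> torch.Tensor:
    """Merge two second-class feature maps whose feature planes differ in size.

    mode="down": downsample the larger map to the smaller plane, then merge;
    mode="up":   upsample the smaller map to the larger plane (a stand-in for
                 deconvolution), then merge. Assumes the planes differ by 2x
                 when downsampling.
    """
    if mode == "down":
        resized = F.max_pool2d(large, kernel_size=2, stride=2)  # match the smaller plane
        return torch.cat([resized, small], dim=1)               # merge by concatenation
    resized = F.interpolate(small, size=large.shape[2:], mode="nearest")
    return torch.cat([large, resized], dim=1)
```

The three-map variants described above would simply apply this pairwise merge twice, in the order the embodiment specifies.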
In one or more embodiments of the present application, the detection unit 14 is configured to obtain the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first-class feature map and the feature information of the second-class feature map in the sample picture by:
acquiring first position information of the feature information of the first type of feature map in the sample picture according to the mapping relation between the first type of feature map and the sample picture;
acquiring second position information of the feature information of the second type of feature map in the sample picture according to the mapping relation between the second type of feature map and the sample picture;
and combining the first position information and the second position information according to the score data corresponding to the first position information and the score data corresponding to the second position information to obtain the detection position coordinates of the face frame.
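The mapping relation between a feature map and the sample picture can be illustrated, under the assumption that it is the cumulative stride of the corresponding layer, by the following sketch:

```python
def map_to_picture(fx: int, fy: int, stride: int) -> tuple:
    """Map a feature-map cell (fx, fy) back to sample-picture coordinates,
    assuming each cell covers a stride x stride region of the picture and
    the position is taken at the region's centre (an assumption)."""
    return fx * stride + stride // 2, fy * stride + stride // 2
```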
It should be noted that, in this embodiment of the application, at least two feature maps corresponding to at least two convolution layers of at least two layers of network structures are obtained by performing convolution processing on at least one input sample picture through the network structures, where the at least two convolution layers include at least one first-class feature extraction layer and at least one second-class feature extraction layer, the feature map corresponding to the first-class feature extraction layer is a first-class feature map, the feature map corresponding to the second-class feature extraction layer is a second-class feature map, and the first-class feature extraction layer is located before the second-class feature extraction layer; the feature information of the first-class feature map and of the second-class feature map is extracted; the detection position coordinates of the face frame are acquired according to the corresponding position coordinates of that feature information in the sample picture; the weight value of each convolution layer is updated according to the matching degree between the feature map corresponding to the last convolution layer and a target image, the last convolution layer being the convolution layer located in the end output layer of the at-least-two-layer network structure; and a face detection model is generated according to the weight values of the convolution layers. Because the first-class feature extraction layer lies early in the network, its feature plane size is relatively large, which suits it to detecting small faces; extracting the feature information of the first-class feature map deepens the features available for small faces and thereby improves the small-face detection capability. Merging the convolution layers in the manner described above improves the recall rate and accuracy of the system's detection.
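The weight-update step can be sketched as an ordinary gradient update in which a loss function stands in for the (inverse of the) matching degree between the last convolution layer's feature map and the target image; the choice of loss function and optimizer below is an assumption for illustration only.

```python
import torch

def training_step(model, optimizer, sample_pictures, target_maps, loss_fn):
    """One update of every convolution layer's weight values (sketch).

    A low loss is read as a high matching degree between the feature map of
    the last convolution layer (the end output layer) and the target image.
    """
    optimizer.zero_grad()
    last_feature_map = model(sample_pictures)       # feature map of the last convolution layer
    loss = loss_fn(last_feature_map, target_maps)   # stand-in for the matching degree
    loss.backward()                                 # gradients for each convolution layer
    optimizer.step()                                # update each layer's weight values
    return loss.item()
```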
Based on the same inventive concept, fig. 8 shows a face detection system provided by the present invention, which includes:
the processing unit 21 is configured to perform convolution processing on an input picture to be detected through at least two layers of network structures based on a face detection model to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, where the at least two convolution layers include at least one first-class feature extraction layer and at least one second-class feature extraction layer, a feature map corresponding to the first-class feature extraction layer is a first-class feature map, a feature map corresponding to the second-class feature extraction layer is a second-class feature map, and the first-class feature extraction layer is located before the second-class feature extraction layer;
a first extraction unit 22, configured to extract feature information of the first class feature map;
a second extracting unit 23, configured to extract feature information of the second type feature map;
the detection unit 24 is configured to obtain a detection position coordinate of the face frame according to the corresponding position coordinate of the feature information of the first class of feature map and the feature information of the second class of feature map in the picture to be detected.
In one or more embodiments of the present application, the first extraction unit 22 is configured to extract the feature information of the first-class feature map by:
compressing the first class feature map;
and performing convolution processing on the compressed first-class feature map to acquire feature information of the first-class feature map.
In one or more embodiments of the present application, the second extraction unit 23 is configured to extract the feature information of the second-class feature map by:
merging the second-class feature graphs corresponding to at least two second-class feature extraction layers;
and extracting corresponding feature information from the merged second-class feature map.
In one or more embodiments of the present application, the second extraction unit 23 merges the second-class feature maps corresponding to at least two second-class feature extraction layers by:
downsampling the one of any two second-class feature maps whose feature plane size is relatively large, and merging the downsampled second-class feature map with the other of the two second-class feature maps; or
deconvolving the one of any two second-class feature maps whose feature plane size is relatively small, and merging the deconvolved second-class feature map with the other of the two second-class feature maps.
In one or more embodiments of the present application, the second extraction unit 23 merges the second-class feature maps corresponding to at least three second-class feature extraction layers by:
after the downsampled second-class feature map has been merged with the other of the two second-class feature maps, further merging the result with a third second-class feature map that has been deconvolved, the feature plane size of this third feature map being smaller than the feature plane sizes of the two second-class feature maps;
wherein, in this case, the feature map that is downsampled is the one of the three second-class feature maps whose feature plane size is relatively large.
In one or more embodiments of the present application, the second extraction unit 23 may instead merge the second-class feature maps corresponding to at least three second-class feature extraction layers by:
after the deconvolved second-class feature map has been merged with the other of the two second-class feature maps, further merging the result with a third second-class feature map that has been downsampled, the feature plane size of this third feature map being larger than the feature plane sizes of the two second-class feature maps;
wherein, in this case, the feature map that is deconvolved is the one of the three second-class feature maps whose feature plane size is relatively small.
In one or more embodiments of the present application, the three second-class feature maps are pairwise adjacent second-class feature maps.
In one or more embodiments of the present application, the detection unit 24 is configured to obtain the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first-class feature map and the feature information of the second-class feature map in the picture to be detected by:
acquiring first position information of the feature information of the first-class feature map in the picture to be detected according to the mapping relation between the first-class feature map and the picture to be detected;
acquiring second position information of the feature information of the second-class feature map in the picture to be detected according to the mapping relation between the second-class feature map and the picture to be detected;
and combining the first position information and the second position information with a Soft-NMS module, according to the score data corresponding to each, to obtain the detection position coordinates of the face frame.
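Soft-NMS (Bodla et al., 2017) decays the scores of boxes that overlap a higher-scoring box instead of discarding them outright. The Gaussian-decay sketch below illustrates how the first and second position information can be combined according to their score data; the sigma and score threshold are assumed values.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: rather than suppressing every box that overlaps the
    best remaining box, decay its score by exp(-iou^2 / sigma); drop boxes
    whose decayed score falls below score_thresh. Returns kept indices."""
    scores = np.asarray(scores, dtype=np.float64).copy()
    remaining = list(range(len(scores)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        keep.append(best)
        remaining.remove(best)
        for i in remaining:
            scores[i] *= np.exp(-iou(boxes[best], boxes[i]) ** 2 / sigma)
        remaining = [i for i in remaining if scores[i] >= score_thresh]
    return keep
```

Because overlapping candidates are down-weighted rather than deleted, heavily occluded faces are less likely to be lost, which is consistent with the recall improvement noted below.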
It should be noted that, in this embodiment of the application, merging the convolution layers through the second extraction unit 23 improves the detection performance of the system, and combining the position coordinates of the face features with the Soft-NMS module to generate the detection position coordinates of the face frame improves the recall rate and accuracy of the system's detection.
Based on the same inventive concept, the present invention provides an electronic system comprising at least one processor and a memory, the memory storing a program configured to be executed by at least one of the processors to perform the following steps:
performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the sample picture;
updating the weight value of each convolution layer according to the matching degree of a feature map corresponding to the last convolution layer and a target image, wherein the last convolution layer is a convolution layer located in an end output layer of the at least two-layer network structure;
and generating a face detection model according to the weight value of each convolution layer.
Based on the same inventive concept, the present invention provides a computer-readable storage medium including a program for use in conjunction with an electronic system, the program being executable by a processor to perform the steps of:
performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the sample picture;
updating the weight value of each convolution layer according to the matching degree of a feature map corresponding to the last convolution layer and a target image, wherein the last convolution layer is a convolution layer located in an end output layer of the at least two-layer network structure;
and generating a face detection model according to the weight value of each convolution layer.
Based on the same inventive concept, the present invention provides an electronic system comprising at least one processor and a memory, the memory storing a program configured to be executed by at least one of the processors to perform the following steps:
performing convolution processing on an input picture to be detected through at least two layers of network structures based on a face detection model to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
and acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the picture to be detected.
Based on the same inventive concept, the present invention provides a computer-readable storage medium including a program for use in conjunction with an electronic system, the program being executable by a processor to perform the steps of:
performing convolution processing on an input picture to be detected through at least two layers of network structures based on a face detection model to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
and acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the picture to be detected.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, their description is brief; for relevant points, reference may be made to the corresponding parts of the method embodiments.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (18)

1. A face detection method, comprising:
performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the sample picture;
updating the weight value of each convolution layer according to the matching degree of a feature map corresponding to the last convolution layer and a target image, wherein the last convolution layer is a convolution layer located in an end output layer of the at least two-layer network structure;
and generating a face detection model according to the weight value of each convolution layer.
2. The face detection method of claim 1, wherein extracting feature information of the first class of feature maps comprises:
compressing the first class feature map;
and performing convolution processing on the compressed first-class feature map to acquire feature information of the first-class feature map.
3. The face detection method of claim 1, wherein extracting the feature information of the second class of feature maps comprises:
merging the second-class feature maps corresponding to at least two second-class feature extraction layers;
and extracting corresponding feature information from the merged second-class feature map.
4. The face detection method of claim 3, wherein merging the second-class feature maps corresponding to at least two second-class feature extraction layers comprises:
downsampling the one of any two second-class feature maps whose feature plane size is relatively large, and merging the downsampled second-class feature map with the other of the two second-class feature maps; or
deconvolving the one of any two second-class feature maps whose feature plane size is relatively small, and merging the deconvolved second-class feature map with the other of the two second-class feature maps.
5. The face detection method of claim 4, wherein merging the second-class feature maps corresponding to at least three second-class feature extraction layers comprises:
after the downsampled second-class feature map has been merged with the other of the two second-class feature maps, further merging the result with a third second-class feature map that has been deconvolved, the feature plane size of this third feature map being smaller than the feature plane sizes of the two second-class feature maps;
wherein the feature map that is downsampled is the one of the three second-class feature maps whose feature plane size is relatively large.
6. The face detection method of claim 4, wherein merging the second-class feature maps corresponding to at least three second-class feature extraction layers comprises:
after the deconvolved second-class feature map has been merged with the other of the two second-class feature maps, further merging the result with a third second-class feature map that has been downsampled, the feature plane size of this third feature map being larger than the feature plane sizes of the two second-class feature maps;
wherein the feature map that is deconvolved is the one of the three second-class feature maps whose feature plane size is relatively small.
7. The face detection method according to claim 5, wherein the three second-class feature maps are pairwise adjacent second-class feature maps.
8. The method according to claim 1, wherein obtaining the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first class of feature map and the feature information of the second class of feature map in the sample picture comprises:
acquiring first position information of the feature information of the first type of feature map in the sample picture according to the mapping relation between the first type of feature map and the sample picture;
acquiring second position information of the feature information of the second type of feature map in the sample picture according to the mapping relation between the second type of feature map and the sample picture;
and combining the first position information and the second position information according to the score data corresponding to the first position information and the score data corresponding to the second position information to obtain the detection position coordinates of the face frame.
9. A face detection system, comprising:
the processing unit is used for performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first-class feature extraction layer and at least one second-class feature extraction layer, the feature map corresponding to the first-class feature extraction layer is a first-class feature map, the feature map corresponding to the second-class feature extraction layer is a second-class feature map, and the first-class feature extraction layer is positioned in front of the second-class feature extraction layer;
the first extraction unit is used for extracting the feature information of the first class feature map;
the second extraction unit is used for extracting the feature information of the second type of feature map;
the detection unit is used for acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first class of feature map and the feature information of the second class of feature map in the sample picture;
the updating unit is used for updating the weight value of each convolution layer according to the matching degree of a characteristic graph corresponding to the last convolution layer and a target image, wherein the last convolution layer is a convolution layer positioned on the tail end output layer in the at least two-layer network structure;
and the generating unit is used for generating a face detection model according to the weight value of each convolution layer.
10. The face detection system of claim 9, wherein the first extraction unit is configured to extract the feature information of the first-class feature map by:
compressing the feature plane of the first-class feature extraction layer;
and performing convolution processing on the compressed first-class feature map to acquire feature information of a feature plane of the first-class feature map.
11. The face detection system of claim 9, wherein the second extraction unit is configured to extract the feature information of the second-class feature map by:
merging the second-class feature maps corresponding to at least two second-class feature extraction layers;
and extracting the corresponding feature information from the merged second-class feature map.
12. The face detection system of claim 11, wherein the second extraction unit merges the second-class feature maps corresponding to at least two second-class feature extraction layers by:
downsampling the one of any two second-class feature maps whose feature plane size is relatively large, and merging the downsampled second-class feature map with the other of the two second-class feature maps; or
deconvolving the one of any two second-class feature maps whose feature plane size is relatively small, and merging the deconvolved second-class feature map with the other of the two second-class feature maps.
13. The face detection system of claim 12, wherein the second extraction unit merges the second-class feature maps corresponding to at least three second-class feature extraction layers by:
after the downsampled second-class feature map has been merged with the other of the two second-class feature maps, further merging the result with a third second-class feature map that has been deconvolved, the feature plane size of this third feature map being smaller than the feature plane sizes of the two second-class feature maps;
wherein the feature map that is downsampled is the one of the three second-class feature maps whose feature plane size is relatively large.
14. The face detection system of claim 12, wherein the second extraction unit merges the second-class feature maps corresponding to at least three second-class feature extraction layers by:
after the deconvolved second-class feature map has been merged with the other of the two second-class feature maps, further merging the result with a third second-class feature map that has been downsampled, the feature plane size of this third feature map being larger than the feature plane sizes of the two second-class feature maps;
wherein the feature map that is deconvolved is the one of the three second-class feature maps whose feature plane size is relatively small.
15. The face detection system of claim 13, wherein the three second-class feature maps are pairwise adjacent second-class feature maps.
16. The face detection system according to claim 9, wherein the detection unit is configured to obtain the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first-class feature map and the feature information of the second-class feature map in the sample picture by:
acquiring first position information of the feature information of the first type of feature map in the sample picture according to the mapping relation between the first type of feature map and the sample picture;
acquiring second position information of the feature information of the second type of feature map in the sample picture according to the mapping relation between the second type of feature map and the sample picture;
and combining the first position information and the second position information according to the score data corresponding to the first position information and the score data corresponding to the second position information to obtain the detection position coordinates of the face frame.
17. An electronic system comprising at least one processor and a memory, the memory storing a program configured to be executed by at least one of the processors to:
performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the sample picture;
updating the weight value of each convolution layer according to the matching degree of a feature map corresponding to the last convolution layer and a target image, wherein the last convolution layer is a convolution layer located in an end output layer of the at least two-layer network structure;
and generating a face detection model according to the weight value of each convolution layer.
18. A computer readable storage medium containing a program for use in conjunction with an electronic system, the program being executable by a processor to perform the steps of:
performing convolution processing on at least one input sample picture through at least two layers of network structures to obtain at least two feature maps corresponding to at least two convolution layers of the at least two layers of network structures, wherein the at least two convolution layers comprise at least one first type feature extraction layer and at least one second type feature extraction layer, the feature map corresponding to the first type feature extraction layer is a first type feature map, the feature map corresponding to the second type feature extraction layer is a second type feature map, and the first type feature extraction layer is positioned in front of the second type feature extraction layer;
extracting feature information of the first class feature map;
extracting feature information of the second class of feature maps;
acquiring the detection position coordinates of the face frame according to the corresponding position coordinates of the feature information of the first type of feature map and the feature information of the second type of feature map in the sample picture;
updating the weight value of each convolution layer according to the matching degree of a feature map corresponding to the last convolution layer and a target image, wherein the last convolution layer is a convolution layer located in an end output layer of the at least two-layer network structure;
and generating a face detection model according to the weight value of each convolution layer.
CN201810506447.4A 2018-05-24 2018-05-24 Face detection method and system Active CN108921017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810506447.4A CN108921017B (en) 2018-05-24 2018-05-24 Face detection method and system


Publications (2)

Publication Number Publication Date
CN108921017A CN108921017A (en) 2018-11-30
CN108921017B true CN108921017B (en) 2021-05-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20210416
Address after: 215123 unit 2-b702, creative industry park, No. 328, Xinghu street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant after: SUZHOU FEISOU TECHNOLOGY Co.,Ltd.
Address before: 100876 Beijing, Haidian District, 10 West Road, Beijing, 12 Beijing, North Post Science and technology exchange center, room 1216
Applicant before: BEIJING FEISOU TECHNOLOGY Co.,Ltd.
GR01 Patent grant