CN108229281B - Neural network generation method, face detection device and electronic equipment


Info

Publication number
CN108229281B
CN108229281B
Authority
CN
China
Prior art keywords
sub
face
range
target scale
neural network
Prior art date
Legal status
Active
Application number
CN201710277489.0A
Other languages
Chinese (zh)
Other versions
CN108229281A (en)
Inventor
杨硕
熊元骏
吕健勤
汤晓鸥
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201710277489.0A priority Critical patent/CN108229281B/en
Publication of CN108229281A publication Critical patent/CN108229281A/en
Application granted granted Critical
Publication of CN108229281B publication Critical patent/CN108229281B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Abstract

Embodiments of the invention provide a neural network generation method, a face detection method and apparatus, and electronic equipment. The neural network generation method includes: dividing a target scale range to obtain a plurality of sub-target scale ranges, where the target scale range is the range of face sizes used by the neural network for face detection; determining a sub-neural network corresponding to each of the sub-target scale ranges; and fusing the sub-neural networks corresponding to the sub-target scale ranges to obtain the neural network. A neural network generated in this way can detect faces over a very large range of face sizes in an image, improving both the accuracy and the efficiency of face detection.

Description

Neural network generation method, face detection device and electronic equipment
Technical Field
Embodiments of the invention relate to artificial intelligence technology, and in particular to a neural network generation method, apparatus and electronic equipment, and to a face detection method, apparatus and electronic equipment.
Background
Face detection is one of the most important research directions in computer vision and is the basis of many face analysis technologies, such as face keypoint detection and face recognition. Modeling face detection is challenging because faces must be detected under many different conditions: different poses, different degrees of occlusion, different expressions, different sizes, and different lighting. On the one hand, facial appearance varies greatly between people, and even for the same person the appearance changes considerably with pose, occlusion, expression and ambient light. On the other hand, in a video surveillance system such as a subway, a face far from the camera is usually small while a face near the camera is usually large, and faces of different sizes look very different: in a small face only the contour is visible and the facial features cannot be made out, whereas in a large face the facial features are clearly visible. In addition, because face detection underlies many other face analysis technologies, it needs to run in real time.
Existing face detection methods handle faces with different poses and different degrees of occlusion reasonably well, but they detect faces of widely varying sizes inefficiently.
Disclosure of Invention
The embodiment of the invention aims to provide a technical scheme of neural network generation and a technical scheme of face detection.
According to a first aspect of the embodiments of the present invention, there is provided a method for generating a neural network, the method including: dividing a target scale range to obtain a plurality of sub-target scale ranges, where the target scale range is the range of face sizes used by the neural network for face detection; determining sub-neural networks respectively corresponding to the plurality of sub-target scale ranges; and fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges to obtain the neural network.
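For orientation, the following is a minimal sketch of this three-step flow, assuming the split points used in the later embodiments (40 and 140 pixels) and representing each sub-neural network only by its scale range and an assumed pooling downsampling stride; the helper names and data structures are illustrative assumptions, not the patent's reference implementation.

```python
from typing import List, Tuple

def divide_scale_range(lo: int, hi: int, cuts: List[int]) -> List[Tuple[int, int]]:
    """Step 1: split the target scale range [lo, hi] at the given cut points."""
    bounds = [lo] + list(cuts) + [hi]
    return list(zip(bounds[:-1], bounds[1:]))

def build_sub_network(scale_range: Tuple[int, int]) -> dict:
    """Step 2 (stand-in): a sub-network is summarised here by its scale range
    and a per-range pooling downsampling stride (values assumed)."""
    strides = {(10, 40): 8, (40, 140): 16, (140, 1300): 32}
    return {"range": scale_range, "stride": strides.get(scale_range)}

def fuse_sub_networks(sub_nets: List[dict]) -> dict:
    """Step 3 (stand-in): fuse the sub-networks into one detector; in the
    patent this is done by sharing network parameters."""
    return {"branches": sub_nets, "shared_parameters": True}

# Target scale range of 10..1300 pixels, split at 40 and 140 pixels.
network = fuse_sub_networks(
    [build_sub_network(r) for r in divide_scale_range(10, 1300, [40, 140])]
)
```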
Optionally, dividing the target scale range to obtain a plurality of sub-target scale ranges includes: uniformly dividing the target scale range into a plurality of sub-ranges; for each sub-range, extracting from the training data set a plurality of face samples falling within the current sub-range; and determining the plurality of sub-target scale ranges according to the feature values of the face samples in each sub-range.
Optionally, determining the plurality of sub-target scale ranges according to the feature values of the face samples in each sub-range includes: for each face sample in each sub-range, generating a color histogram of the face sample based on its pixel values; computing the chi-square distance between the color histograms of each pair of face samples in the sub-range to obtain a face sample distance matrix for the sub-range; generating a feature value describing the appearance variation of the sub-range from the face sample distance matrix; and determining the plurality of sub-target scale ranges according to the appearance-variation feature value of each sub-range.
Optionally, before the color histogram of each face sample in each sub-range is generated, the method further includes: scaling the size of each face sample in the sub-range according to the lower bound of the current sub-range.
Optionally, the dividing the target scale range to obtain a plurality of sub-target scale ranges includes: and dividing the target scale range based on the characteristic points of the human face to obtain a plurality of sub-target scale ranges.
Optionally, the determining the sub-neural networks corresponding to the plurality of sub-target scale ranges respectively includes: determining a pooling down-sampling step length corresponding to each sub-target scale range; and determining a sub-neural network corresponding to the sub-target scale range based on the pooling down-sampling step length.
Optionally, fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges to obtain the neural network includes: fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges based on shared network parameters to obtain the neural network.
According to a second aspect of the embodiments of the present invention, there is provided a face detection method, including: scaling a face image to be detected to obtain a scaled face image; and detecting the scaled face image through a neural network to obtain a face detection result for the face image to be detected, where the neural network is generated according to the method of the first aspect of the embodiments of the present invention.
Optionally, scaling the face image to be detected to obtain a scaled face image includes: scaling the face image to be detected according to the upper bound of the target scale range of the neural network.
Optionally, the face image to be detected is a still image, or a video image in a video frame sequence.
Optionally, the sequence of video frames is a sequence of video frames in a live broadcast.
According to a third aspect of the embodiments of the present invention, there is provided an apparatus for generating a neural network, the apparatus including: the dividing module is used for dividing a target scale range to obtain a plurality of sub-target scale ranges, wherein the target scale range is used for face detection of a neural network; the determining module is used for determining the sub-neural networks corresponding to the plurality of sub-target scale ranges respectively; and the fusion module is used for fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges to obtain the neural network.
Optionally, the dividing module includes: the first dividing unit is used for uniformly dividing the target scale range into a plurality of sub-ranges; an extraction unit for extracting, for each sub-range, a plurality of face samples falling within the current sub-range in the training data set; and the first determining unit is used for determining the plurality of sub-target scale ranges according to the characteristic value of each face sample in each sub-range.
Optionally, the first determining unit is specifically configured to: for each face sample in each sub-range, generate a color histogram of the face sample based on its pixel values; compute the chi-square distance between the color histograms of each pair of face samples in the sub-range to obtain a face sample distance matrix for the sub-range; generate a feature value describing the appearance variation of the sub-range from the face sample distance matrix; and determine the plurality of sub-target scale ranges according to the appearance-variation feature value of each sub-range.
Optionally, the first determining unit is further configured to: before generating the color histogram of the face sample based on the pixel values of the face sample, scaling the size of the face sample according to the lower bound value of the current sub-range for each face sample in each sub-range.
Optionally, the dividing module further includes: and the second dividing unit is used for dividing the target scale range based on the characteristic points of the human face to obtain the plurality of sub-target scale ranges.
Optionally, the determining module includes: the second determining unit is used for determining the pooling down-sampling step length corresponding to each sub-target scale range; and determining a sub-neural network corresponding to the sub-target scale range based on the pooling down-sampling step length.
Optionally, the fusion module includes: and the fusion unit is used for fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges based on the shared network parameters to obtain the neural network.
According to a fourth aspect of the embodiments of the present invention, there is provided a face detection apparatus, the apparatus including: a scaling module, configured to scale a face image to be detected to obtain a scaled face image; and a detection module, configured to detect the scaled face image through a neural network to obtain a face detection result for the face image to be detected, where the neural network is generated by the apparatus according to the third aspect of the embodiments of the present invention.
Optionally, the scaling module includes: and the scaling unit is used for scaling the face image to be detected according to the upper bound of the target scale range of the neural network.
Optionally, the face image to be detected is a still image, or a video image in a video frame sequence.
Optionally, the sequence of video frames is a sequence of video frames in a live broadcast.
According to a fifth aspect of the embodiments of the present invention, there is provided an electronic apparatus, including: a first processor, a first memory, a first communication element and a first communication bus, where the first processor, the first memory and the first communication element communicate with each other through the first communication bus; the first memory is configured to store at least one executable instruction, and the executable instruction causes the first processor to perform the operations corresponding to any one of the neural network generation methods provided in the first aspect of the embodiments of the present invention.
According to a sixth aspect of the embodiments of the present invention, there is provided an electronic apparatus, including: a second processor, a second memory, a second communication element and a second communication bus, where the second processor, the second memory and the second communication element communicate with each other through the second communication bus; the second memory is configured to store at least one executable instruction, and the executable instruction causes the second processor to perform the operations corresponding to any one of the face detection methods provided in the second aspect of the embodiments of the present invention.
According to a seventh aspect of the embodiments of the present invention, there is provided a computer-readable storage medium storing: executable instructions for dividing a target scale range to obtain a plurality of sub-target scale ranges; executable instructions for determining the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges; and executable instructions for fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges to obtain the neural network.
According to an eighth aspect of the embodiments of the present invention, there is provided another computer-readable storage medium storing: executable instructions for scaling a face image to be detected to obtain a scaled face image; and executable instructions for detecting the scaled face image through a neural network to obtain a face detection result for the face image to be detected.
According to the technical solutions provided by the embodiments of the present invention, a target scale range is divided into a plurality of sub-target scale ranges, a sub-neural network corresponding to each sub-target scale range is determined, and the sub-neural networks corresponding to the sub-target scale ranges are fused to obtain a neural network for face detection. The resulting neural network can detect faces over a very large range of face sizes, improving both the accuracy and the efficiency of face detection in face images.
Drawings
Fig. 1 is a flowchart of a method for generating a neural network according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a method for generating a neural network according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a specific scenario in which the method embodiment of FIG. 2 is applied;
FIG. 4 is a flowchart of a face detection method according to a third embodiment of the present invention;
FIG. 5 is a flow chart of a face detection method according to a fourth embodiment of the invention;
fig. 6 is a block diagram of a generation apparatus of a neural network according to a fifth embodiment of the present invention;
fig. 7 is a block diagram of a generation apparatus of a neural network according to a sixth embodiment of the present invention;
fig. 8 is a block diagram of a face detection apparatus according to a seventh embodiment of the present invention;
fig. 9 is a block diagram of a face detection apparatus according to an eighth embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to a ninth embodiment of the present invention;
fig. 11 is a schematic structural diagram of an electronic device according to a tenth embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention is provided in conjunction with the accompanying drawings (like numerals indicate like elements throughout the several views) and examples. The following examples are intended to illustrate the invention but not to limit its scope.
It will be understood by those skilled in the art that the terms "first," "second," and the like in the embodiments of the present invention are used merely to distinguish one element, step, device or module from another, and do not denote any particular technical meaning or logical order between them.
Example one
Fig. 1 is a flowchart of a method for generating a neural network according to a first embodiment of the present invention.
Referring to fig. 1, in step S101, a target scale range is divided into a plurality of sub-target scale ranges.
The target scale range is the scale range over which the neural network to be generated will perform face detection. In the embodiments of the present invention, the size of a face in a face image may be defined as the side length of the face's bounding box, measured in pixels, so the target scale range of the neural network to be generated is the range in which this side length lies. For example, the target scale range may be 10 to 1300 pixels, and the sub-target scale ranges may be 10 to 500 pixels and 500 to 1300 pixels. The union of the sub-target scale ranges should contain the target scale range. In a specific embodiment, the target scale range of the neural network to be generated may be given in advance, so that the generated network can detect faces over a very large range of face sizes.
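As a small illustration of this definition, assuming axis-aligned boxes given as (x1, y1, x2, y2) in pixels, a representation the patent does not prescribe, the face size and its range test might look like this:

```python
def face_size(box):
    """Face size = side length of the face's bounding box, in pixels;
    for a non-square box the longer side is taken here (an assumption)."""
    x1, y1, x2, y2 = box
    return max(x2 - x1, y2 - y1)

def in_scale_range(box, lo=10, hi=1300):
    """Check whether a face falls inside the target scale range."""
    return lo <= face_size(box) <= hi
```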
Specifically, the neural network to be generated may be any neural network capable of feature extraction or target object detection, including but not limited to a convolutional neural network, a reinforcement learning neural network, the generator network of a generative adversarial network, and the like.
In step S102, the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges are determined.
In a specific embodiment, the sub-neural network corresponding to each sub-target scale range may be determined according to that sub-target scale range. Designing a dedicated sub-neural network for each sub-target scale range improves detection performance for faces whose sizes fall within that range. Each sub-neural network may be any neural network capable of feature extraction or target object detection, including but not limited to a convolutional neural network, a reinforcement learning neural network, the generator network of a generative adversarial network, and the like.
In step S103, the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges are merged to obtain the neural network.
In this embodiment, the sub-neural networks corresponding to the sub-target scale ranges share the same parameters, and the sub-neural networks can be fused according to these shared parameters to obtain a neural network for face detection.
According to the method for generating a neural network provided by this embodiment, the target scale range is divided into a plurality of sub-target scale ranges, the sub-neural network corresponding to each sub-target scale range is determined, and the sub-neural networks corresponding to the sub-target scale ranges are fused to obtain the neural network for face detection. The generated neural network can detect faces over a very large range of face sizes, improving both the accuracy and the efficiency of face detection in face images.
The generation method of the neural network of the present embodiment may be performed by any suitable device having data processing capability, including but not limited to: terminal equipment, a server and the like.
Example two
Fig. 2 is a flowchart of a method for generating a neural network according to a second embodiment of the present invention.
Referring to fig. 2, in step S201, the target scale range is divided based on the feature points of the human face, so as to obtain the plurality of sub-target scale ranges.
In one possible embodiment, assuming the target scale range is [n, m], it could be divided into m-n+1 sub-target scale ranges, where n and m are integers greater than or equal to 1 and n is much smaller than m. However, such a division produces far too many sub-target scale ranges, and therefore far too many sub-neural networks, resulting in a high degree of redundancy. Moreover, the appearance of a face does not change uniformly with face size, so such an embodiment is undesirable. The inventors of the present application observed that a small face (smaller than 40 pixels) loses most of its appearance information and is characterized mainly by its rigid structure and its surroundings. Medium faces (between 40 and 140 pixels) vary greatly, because such faces are usually not the photographer's subject and therefore appear in different poses and viewing directions. Large faces (above 140 pixels) typically vary little, because they are the photographer's subject and are usually in a frontal or profile pose. Based on these feature points of faces of different sizes, the target scale range can be divided into a small, a medium and a large range, a division that also takes the network structure into account. Detection accuracy for faces in face images can thereby be improved.
Optionally, dividing the target scale range to obtain the plurality of sub-target scale ranges includes: uniformly dividing the target scale range into a plurality of sub-ranges; for each sub-range, randomly extracting from the training data set of the neural network a plurality of face samples falling within the current sub-range; and determining the plurality of sub-target scale ranges according to the feature values of the face samples in each sub-range. Determining the plurality of sub-target scale ranges according to these feature values includes: for each face sample in each sub-range, generating a color histogram of the face sample based on its pixel values; computing the chi-square distance between the color histograms of each pair of face samples in the sub-range to obtain a face sample distance matrix for the sub-range; generating a feature value describing the appearance variation of the sub-range from the distance matrix; and determining the plurality of sub-target scale ranges according to the appearance-variation feature value of each sub-range. For each sub-range, the pixel values of each face sample in the sub-range are collected, and the color histogram of each face sample is obtained from these pixel values. Optionally, before the color histograms are generated, the size of each face sample in each sub-range is scaled according to the lower bound of the current sub-range; specifically, each face sample in the current sub-range is scaled to the size of the lower bound of that sub-range.
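The following is a sketch of the appearance-variation measurement just described, using NumPy; the bin count, the exact chi-square formula and the use of the Frobenius norm are assumptions, since the text fixes none of them.

```python
import numpy as np

def color_histogram(face: np.ndarray, bins: int = 32) -> np.ndarray:
    """Per-channel color histogram of a face crop (H x W x 3, uint8),
    normalised so the bins sum to 1. The face is assumed to be already
    rescaled to the lower-bound size of its sub-range."""
    hist = np.concatenate(
        [np.histogram(face[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    ).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

def chi_square(h1: np.ndarray, h2: np.ndarray, eps: float = 1e-10) -> float:
    """Chi-square distance between two histograms."""
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def appearance_variation(faces) -> float:
    """Feature value of one sub-range: norm of the pairwise chi-square
    distance matrix of its face samples (Frobenius norm assumed)."""
    hists = [color_histogram(f) for f in faces]
    n = len(hists)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = chi_square(hists[i], hists[j])
    return float(np.linalg.norm(d))
```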
In a specific embodiment, the target scale range is divided evenly into k sub-ranges (k ≥ 100). For each sub-range, s face samples falling within the sub-range (s ≥ 300) are randomly extracted from the training data set of the neural network to be generated, and each face sample is scaled to the lower-bound size of the sub-range. For each face sample, a color histogram is computed from its pixel values; this histogram serves as the feature describing the face sample. For each sub-range, the chi-square distance between the color histograms of each pair of face samples in the sub-range is computed, and the norm of the resulting distance matrix is taken as the feature value describing the magnitude of appearance variation in that sub-range. These feature values are computed for all sub-ranges, normalized to the range [0, 1], and output as a histogram. The feature values are then divided: for example, if b sub-target scale ranges are desired, the appearance-variation feature values can be divided into b shares, and the sub-ranges covered by each share are output as one sub-target scale range.
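One plausible reading of "dividing the feature values into b shares" is an equal-mass split of the appearance-variation values over the k uniform sub-ranges; the sketch below follows that assumed reading.

```python
import numpy as np

def split_into_sub_target_ranges(bounds, variation, b):
    """Merge k uniform sub-ranges into b sub-target scale ranges so that
    each range carries roughly the same share of total appearance variation.
    `bounds` is a list of k (lower, upper) pairs; `variation` holds the k
    feature values (e.g. from appearance_variation above)."""
    mass = np.asarray(variation, dtype=np.float64)
    cum = np.cumsum(mass / mass.sum())
    merged, start = [], 0
    for g in range(1, b + 1):
        end = int(np.searchsorted(cum, g / b))   # last sub-range of share g
        end = min(max(end, start), len(bounds) - 1)
        merged.append((bounds[start][0], bounds[end][1]))
        start = end + 1
        if start >= len(bounds):
            break
    return merged
```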
In step S202, the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges are determined.
In this embodiment, determining the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges includes: determining, for each sub-target scale range, a corresponding pooling down-sampling step size; and determining the sub-neural network corresponding to the sub-target scale range based on that step size. Specifically, the pooling down-sampling step size of the convolutional layer of the sub-neural network is chosen according to the sub-target scale range, so that the absolute difference between each bound of the sub-target scale range, after mapping onto that convolutional layer, and the size of the preconfigured template used by the spatial pooling operation for feature extraction is smaller than a preset value. This improves face detection performance.
The reason for requiring the mapped bounds to be within the preset value of the template size is that the mapped scale sub-range should be neither much larger nor much smaller than the template: when the scale sub-range mapped onto the convolutional layer of the corresponding sub-neural network roughly matches the preconfigured template size, face detection performance improves. The preset value, generally 2, can be obtained by those skilled in the art through testing.
Specifically, the template size is the common size to which face features of different sizes are pooled by the spatial pyramid pooling operation; it defines not only the size of the face features after pooling but also which convolutional layer of a sub-neural network the face detection features are extracted from. The template is, for example, c × c (c = 3 or c = 5). Assuming the sub-target scale range is [a, b] and the pooling down-sampling step size of the sub-neural network's convolutional layer is n, the scale sub-range mapped onto that convolutional layer is [a/n, b/n]. Once the pooling down-sampling step size of the convolutional layer is determined for a sub-target scale range, the corresponding sub-neural network is obtained; its structure comprises that convolutional layer and all network layers before it.
For example, when the preconfigured template size is 5 × 5, the best detection performance for small faces (between 10 and 40 pixels) is achieved by a sub-neural network with a pooling down-sampling step size of 8, because the scale sub-range of small faces projected onto the feature map is then between 2 and 5 pixels, close to the preconfigured template size.
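A sketch of this step-size choice, assuming a fixed candidate set of strides (4, 8, 16, 32, matching the blocks of fig. 3) and picking the one whose mapped bounds deviate least from the template size; the patent itself only requires the deviation to stay below the preset value.

```python
def choose_stride(sub_range, template=5, candidates=(4, 8, 16, 32)):
    """Pick a pooling down-sampling stride n for a sub-target scale range
    [a, b] so that the mapped range [a/n, b/n] stays close to the template
    size; here the worst deviation over both bounds is minimised."""
    a, b = sub_range
    return min(candidates,
               key=lambda n: max(abs(a / n - template), abs(b / n - template)))

# For the small-face range of the example, this yields a stride of 8:
assert choose_stride((10, 40)) == 8
```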
In step S203, the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges are merged to obtain the neural network.
Specifically, this step includes: fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges based on shared network parameters to obtain the neural network. That is, the sub-neural networks corresponding to the sub-target scale ranges are fused, and the neural network for face detection is obtained from the fused sub-neural networks, where all the sub-neural networks share the same network parameters. This reduces the redundancy of the sub-neural networks corresponding to the individual sub-target scale ranges. In a particular embodiment, the shared network parameters do not include the pooling down-sampling step size.
In the present embodiment, after the neural network for face detection is generated according to steps S201 to S203, it is trained end to end. Specifically, the training sample set is sampled according to the divided sub-target scale ranges, and the positive and negative samples of each sub-target scale range are used to train the detection branch for that range within the single fused neural network. The errors of the detection branches for the different sub-target scale ranges are computed in parallel and then back-propagated with the gradients to train the network. The training supervision information may come from, but is not limited to, face bounding boxes, face feature points, face attributes, and the like. Specifically, for each sub-target scale range, positive and negative samples within the range are extracted from the training sample set. The negative samples come from two sources: one part is pure background, i.e., samples that do not overlap any annotated face in the sub-target scale range; the other part consists of samples with only a small overlap with an annotated face in the range, e.g., an overlap smaller than 1/3 of the union. Negative samples from the two sources are added to the final negative sample set with equal probability. The positive samples are those with a large overlap with an annotated face in the sub-target scale range, e.g., an overlap greater than 1/2 of the union. A sample that largely overlaps an annotated face outside the sub-target scale range is ignored.
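A sketch of these sample-selection rules follows; `iou` (intersection over union, the "overlap over union" of the text) is passed in rather than invented, the 1/2 and 1/3 thresholds come from the text, and the equal-probability mixing of the two negative sources is left out for brevity.

```python
def label_samples(samples, faces_in_range, faces_out_of_range, iou):
    """Assign candidate boxes to the positive/negative sets of one detection
    branch. `iou(a, b)` computes intersection over union of two boxes."""
    positives, negatives = [], []
    for s in samples:
        best_in = max((iou(s, f) for f in faces_in_range), default=0.0)
        best_out = max((iou(s, f) for f in faces_out_of_range), default=0.0)
        if best_in > 0.5:
            positives.append(s)        # large overlap with an in-range face
        elif best_out > 0.5:
            continue                   # mostly covers an out-of-range face: ignored
        elif best_in == 0.0:
            negatives.append(s)        # pure background
        elif best_in < 1.0 / 3.0:
            negatives.append(s)        # small overlap with an in-range face
    return positives, negatives
```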
Fig. 3 is a schematic diagram of a specific scenario in which the method embodiment of fig. 2 is applied. As shown in fig. 3, the complete face detection pipeline is depicted: the architecture contains 3 sub-neural networks with different spatial pooling step sizes and depths, and these 3 sub-neural networks are merged into a single backbone network by parameter sharing. The backbone network has the same structure as ResNet-50. Face features can be extracted from the last layer of each block of the backbone network (e.g., res2x, res3x, res4x, res5x) using a region-of-interest pooling operation, and these features are used to train the region proposal network (RPN) and Faster R-CNN. The target scale detection range of the backbone network is 10 to 1300 pixels; the sub-target scale ranges of the 3 sub-neural networks are 10 to 40 pixels, 40 to 140 pixels, and 140 to 1300 pixels; and the template size for the spatial pooling operation is 5 × 5. When a face image is input to the backbone network, features from the last layer of block res2x (step size 4) and of block res3x (step size 8) can be extracted to detect face 1, whose size falls between 10 and 40 pixels; features from the last layer of block res3x (step size 8) and of block res4x (step size 16) can be extracted to detect face 2, whose size falls between 40 and 140 pixels; and features from the last layer of block res4x (step size 16) and of block res5x (step size 32) can be extracted to detect face 3, whose size falls between 140 and 1300 pixels. In the figure, multi-scale features are used for face detection. Face detection can also be performed without multi-scale features: for example, with block res2x removed from the backbone, features from the last layer of res3x (step size 8) can be extracted directly to detect faces between 10 and 40 pixels, features from the last layer of res4x (step size 16) to detect faces between 40 and 140 pixels, and features from the last layer of res5x (step size 32) to detect faces between 140 and 1300 pixels. In a specific embodiment, detection with multi-scale features performs slightly better than detection without them.
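Below is a rough PyTorch sketch of tapping the last layer of each backbone block at strides 4/8/16/32, as in the scenario of fig. 3; torchvision's layer1..layer4 are taken to correspond to res2x..res5x, and the RoI pooling, RPN and Faster R-CNN heads are omitted. It is an illustration of the feature-tapping idea, not the patent's trained network.

```python
import torch
import torchvision

backbone = torchvision.models.resnet50()  # same structure as the fused backbone

def multi_scale_features(image: torch.Tensor):
    """image: (N, 3, H, W). Returns feature maps at strides 4, 8, 16, 32."""
    x = backbone.conv1(image)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    c2 = backbone.layer1(x)   # res2x, stride 4  (with c3: faces of 10-40 px)
    c3 = backbone.layer2(c2)  # res3x, stride 8
    c4 = backbone.layer3(c3)  # res4x, stride 16 (with c3: faces of 40-140 px)
    c5 = backbone.layer4(c4)  # res5x, stride 32 (with c4: faces of 140-1300 px)
    return c2, c3, c4, c5

features = multi_scale_features(torch.randn(1, 3, 640, 640))
```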
According to the method for generating a neural network provided by this embodiment, the target scale range is divided based on the feature points of the human face to obtain a plurality of sub-target scale ranges, the sub-neural network corresponding to each sub-target scale range is determined from that range, and the sub-neural networks are fused by parameter sharing to obtain the neural network for face detection. The method improves both the accuracy and the efficiency of face detection in face images.
The neural network generation method of the present embodiment may be performed by any suitable device having data processing capabilities, including but not limited to: terminal equipment, a server and the like.
Example three
Fig. 4 is a flowchart of a face detection method according to a third embodiment of the present invention.
Referring to fig. 4, in step S301, a face image to be detected is scaled to obtain a scaled face image.
Because the face image to be detected can be of arbitrary size, its size may not fall within the target scale range of the neural network used for face detection; in that case the image needs to be scaled so that its size falls within the target scale range of the neural network.
In step S302, the scaled face image is detected by the neural network, so as to obtain a face detection result of the face image to be detected.
The neural network is generated according to the method described in the first embodiment or the second embodiment. The face detection result includes the size and position of each face in the face image.
The exemplary embodiment of the present invention provides a face detection method, which includes scaling a face image to be detected to obtain a scaled face image, and detecting the scaled face image through a neural network to obtain a face detection result for the face image to be detected. Such a method can detect faces over a very large range of face sizes and improves both the accuracy and the efficiency of face detection in face images.
The face detection method of the present embodiment may be performed by any suitable device having data processing capabilities, including but not limited to: terminal equipment, a server and the like.
Example four
Fig. 5 is a flowchart of a face detection method according to a fourth embodiment of the present invention.
Referring to fig. 5, in step S401, the face image to be detected is scaled according to the upper bound of the target scale range of the neural network.
Specifically, the face image to be detected is scaled according to the upper bound of the target scale range of the neural network, so that the long edge of the scaled face image is smaller than or equal to that upper bound.
The face image to be detected may be rectangular or square; as long as the long edge of the scaled image is smaller than or equal to the upper bound of the target scale range of the neural network, the scaled face image falls within the target scale range. Specifically, the face image to be detected may be a captured still image, a video image in a video frame sequence, or a synthesized image. The video frame sequence may be a sequence of video frames in a live broadcast.
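A sketch of this scaling step with OpenCV follows, assuming only downscaling is needed (whether small images are also upscaled is not specified here).

```python
import cv2

def scale_to_upper_bound(image, upper_bound=1300):
    """Shrink the image so its long edge is <= the upper bound of the
    network's target scale range; returns the image and the scale factor."""
    h, w = image.shape[:2]
    long_edge = max(h, w)
    if long_edge <= upper_bound:
        return image, 1.0
    scale = upper_bound / long_edge
    resized = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))),
                         interpolation=cv2.INTER_LINEAR)
    return resized, scale
```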
In step S402, the scaled face image is detected by the neural network, and a face detection result of the face image to be detected is obtained.
Since this step is the same as step S302 in the third embodiment, it is not described herein again.
The face detection method of the embodiments of the present invention has important applications, such as video surveillance in security, the autofocus systems of mobile phones and digital cameras, electronic photo albums, and face recognition systems. In addition, it can help improve face tracking, face recognition and face keypoint detection.
The exemplary embodiment of the present invention provides a face detection method, which includes scaling a face image to be detected according to the upper bound of the target scale range of the neural network, so that the long edge of the scaled face image is smaller than or equal to that upper bound, and detecting the scaled face image through the neural network to obtain a face detection result for the face image to be detected. Such a method improves both the accuracy and the efficiency of face detection in face images.
The face detection method of the present embodiment may be performed by any suitable device having data processing capabilities, including but not limited to: terminal equipment, a server and the like.
Example five
Based on the same technical concept, fig. 6 is a block diagram illustrating the structure of a neural network generation apparatus according to a fifth embodiment of the present invention. The apparatus can be used to perform the neural network generation method described in the first embodiment.
Referring to fig. 6, the generation apparatus of the neural network includes a division module 501, a determination module 502, and a fusion module 503.
The dividing module 501 is configured to divide a target scale range to obtain a plurality of sub-target scale ranges, where the target scale range is used for face detection by a neural network;
a determining module 502, configured to determine sub-neural networks corresponding to the multiple sub-target scale ranges respectively;
and a fusion module 503, configured to fuse the sub-neural networks corresponding to the multiple sub-target scale ranges, respectively, to obtain the neural network.
According to the neural network generation apparatus provided by this embodiment, the target scale range is divided into a plurality of sub-target scale ranges, the sub-neural network corresponding to each sub-target scale range is determined, and the sub-neural networks corresponding to the sub-target scale ranges are fused to obtain the neural network for face detection. The generated neural network can detect faces over a very large range of face sizes and improves both the accuracy and the efficiency of face detection in face images.
Example six
Based on the same technical concept, fig. 7 is a block diagram illustrating the structure of a neural network generation apparatus according to a sixth embodiment of the present invention. The apparatus can be used to perform the neural network generation method described in the second embodiment.
Referring to fig. 7, the generation apparatus of the neural network includes a dividing module 601, a determining module 602, and a fusing module 603. The dividing module 601 is configured to divide a target scale range to obtain a plurality of sub-target scale ranges, where the target scale range is used for face detection by a neural network; a determining module 602, configured to determine sub-neural networks corresponding to the multiple sub-target scale ranges respectively; and a fusion module 603, configured to fuse the sub-neural networks respectively corresponding to the multiple sub-target scale ranges to obtain the neural network.
Optionally, the dividing module 601 includes: a first dividing unit 6011 configured to divide the target scale range into a plurality of sub-ranges uniformly; an extracting unit 6012, configured to extract, for each sub-range, a plurality of face samples falling within the current sub-range in the training data set; a first determining unit 6013, configured to determine the multiple sub-target scale ranges according to the feature values of the face samples in each sub-range.
Optionally, the first determining unit 6013 is specifically configured to: for each face sample in each sub-range, generate a color histogram of the face sample based on its pixel values; compute the chi-square distance between the color histograms of each pair of face samples in the sub-range to obtain a face sample distance matrix for the sub-range; generate a feature value describing the appearance variation of the sub-range from the face sample distance matrix; and determine the plurality of sub-target scale ranges according to the appearance-variation feature value of each sub-range.
Optionally, the first determining unit 6013 is further configured to: before generating the color histogram of the face sample based on the pixel values of the face sample, scaling the size of the face sample according to the lower bound value of the current sub-range for each face sample in each sub-range.
Optionally, the dividing module 601 further includes: a second dividing unit 6014, configured to divide the target scale range based on the feature points of the human face to obtain the multiple sub-target scale ranges.
Optionally, the determining module 602 includes: a second determining unit 6021, configured to determine, for each sub-target scale range, a pooling down-sampling step corresponding to the sub-target scale range; and determining a sub-neural network corresponding to the sub-target scale range based on the pooling down-sampling step length.
Optionally, the fusion module 603 includes: and a fusion unit 6031, configured to fuse, based on the shared network parameter, the sub-neural networks respectively corresponding to the multiple sub-target scale ranges to obtain the neural network.
It should be noted that, specific details related to the neural network generation apparatus provided in the embodiment of the present invention have been described in detail in the neural network generation method provided in the embodiment of the present invention, and are not described herein again.
Example seven
Based on the same technical concept, fig. 8 is a block diagram illustrating the structure of a face detection apparatus according to a seventh embodiment of the present invention. The method can be used to execute the flow of the face detection method as described in the third embodiment.
Referring to fig. 8, the face detection apparatus includes a scaling module 701 and a detection module 702.
The scaling module 701 is used for scaling the face image to be detected to obtain a scaled face image;
a detection module 702, configured to detect the scaled face image through a neural network to obtain a face detection result of the face image to be detected,
wherein the neural network is generated according to the apparatus of embodiment five or embodiment six.
Compared with prior-art face detection methods, the face detection apparatus of this embodiment can detect faces over an extremely large range of face sizes in a face image, improving both the accuracy and the efficiency of face detection.
Example eight
Based on the same technical concept, fig. 9 is a block diagram illustrating the structure of a face detection apparatus according to an eighth embodiment of the present invention. The apparatus can be used to perform the face detection method described in the fourth embodiment.
Referring to fig. 9, the face detection apparatus includes a scaling module 801 and a detection module 802. The scaling module 801 is configured to scale a face image to be detected to obtain a scaled face image; the detection module 802 is configured to detect the scaled face image through a neural network to obtain a face detection result for the face image to be detected, where the neural network is generated by the apparatus of the fifth embodiment or the sixth embodiment.
Optionally, the scaling module 801 includes: the scaling unit 8011 is configured to scale the face image to be detected according to an upper bound of a target scale range of the neural network.
Optionally, the face image to be detected is a still image, or a video image in a video frame sequence.
Optionally, the sequence of video frames is a sequence of video frames in a live broadcast.
It should be noted that, specific details related to the face detection device provided in the embodiment of the present invention have been described in detail in the face detection method provided in the embodiment of the present invention, and are not described herein again.
Example nine
Embodiments of the present invention also provide an electronic device, such as a mobile terminal, a personal computer (PC), a tablet, or a server. Referring now to fig. 10, there is shown a schematic structural diagram of an electronic device 900 suitable for implementing a terminal device or server of an embodiment of the present invention. As shown in fig. 10, the electronic device 900 includes one or more first processors, for example one or more central processing units (CPUs) 901 and/or one or more graphics processors (GPUs) 913, which can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 902 or loaded from a storage section 908 into a random access memory (RAM) 903. The first read-only memory 902 and the random access memory 903 are collectively referred to as the first memory. The first communication element includes a communication component 912 and/or a communication interface 909, where the communication component 912 may include, but is not limited to, a network card such as an InfiniBand (IB) card, and the communication interface 909 includes the communication interface of a network interface card such as a LAN card or a modem and performs communication processing via a network such as the Internet.
The first processor may communicate with the read only memory 902 and/or the random access memory 903 to execute executable instructions, connect with the communication component 912 through the first communication bus 904, and communicate with other target devices through the communication component 912, thereby completing operations corresponding to any one of the neural network generation methods provided by the embodiments of the present invention, for example, dividing a target scale range to obtain a plurality of sub-target scale ranges, where the target scale range is used for face detection of the neural network; determining sub-neural networks respectively corresponding to the plurality of sub-target scale ranges; and fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges to obtain the neural network.
In addition, the RAM 903 can also store various programs and data necessary for the operation of the device. The CPU 901 or GPU 913, the ROM 902 and the RAM 903 are connected to each other via the first communication bus 904. When the RAM 903 is present, the ROM 902 is an optional module. The RAM 903 stores executable instructions, or writes executable instructions into the ROM 902 at runtime, and the executable instructions cause the first processor to perform the operations corresponding to the method described above. An input/output (I/O) interface 905 is also connected to the first communication bus 904. The communication component 912 may be integrated, or may be configured with multiple sub-modules (e.g., IB cards) linked over the communication bus.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse and the like; an output section 907 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker and the like; a storage section 908 including a hard disk and the like; and a communication interface 909 including a network interface card such as a LAN card or a modem. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read out from it can be installed into the storage section 908 as needed.
It should be noted that the architecture shown in fig. 10 is only one optional implementation. In practice, the number and types of components in fig. 10 may be selected, removed, added or replaced according to actual needs. Different functional components may be set separately or integrated: for example, the GPU and the CPU may be set separately, or the GPU may be integrated on the CPU, and the communication element may likewise be set separately or integrated on the CPU or GPU. These alternative embodiments all fall within the scope of the present invention.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present invention includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium; the computer program comprises program code for executing the method shown in the flowchart, and the program code may include instructions corresponding to the steps of the method provided by the embodiments of the present invention, for example: dividing a target scale range into a plurality of sub-target scale ranges, where the target scale range is used by the neural network for face detection; determining the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges; and fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges to obtain the neural network. In such an embodiment, the computer program may be downloaded and installed from a network via the communication element, and/or installed from the removable medium 911. When executed by the first processor, the computer program performs the above functions defined in the method of the embodiment of the present invention.
Example ten
Embodiments of the present invention also provide another electronic device, such as a mobile terminal, a personal computer (PC), a tablet, or a server. Referring now to fig. 11, there is shown a schematic structural diagram of an electronic device 1000 suitable for implementing a terminal device or server of an embodiment of the present invention. As shown in fig. 11, the electronic device 1000 includes one or more second processors, for example one or more central processing units (CPUs) 1001 and/or one or more graphics processors (GPUs) 1013, which can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 1002 or loaded from a storage section 1008 into a random access memory (RAM) 1003. The second read-only memory 1002 and the random access memory 1003 are collectively referred to as the second memory. The second communication element includes a communication component 1012 and/or a communication interface 1009, where the communication component 1012 may include, but is not limited to, a network card such as an InfiniBand (IB) card, and the communication interface 1009 includes the communication interface of a network interface card such as a LAN card or a modem and performs communication processing via a network such as the Internet.
The second processor may communicate with the read-only memory 1002 and/or the random access memory 1003 to execute the executable instructions, connect with the communication component 1012 through the second communication bus 1004, and communicate with other target devices through the communication component 1012, thereby completing the operations corresponding to any one of the face detection methods provided by the embodiments of the present invention, for example, scaling the face image to be detected to obtain a scaled face image; and detecting the zoomed face image through a neural network to obtain a face detection result of the face image to be detected, wherein the neural network is generated according to the method of the first embodiment or the second embodiment.
In addition, the RAM 1003 can also store various programs and data necessary for the operation of the device. The CPU 1001 or GPU 1013, the ROM 1002 and the RAM 1003 are connected to each other via the second communication bus 1004. When the RAM 1003 is present, the ROM 1002 is an optional module. The RAM 1003 stores executable instructions, or writes executable instructions into the ROM 1002 at runtime, and the executable instructions cause the second processor to perform the operations corresponding to the method described above. An input/output (I/O) interface 1005 is also connected to the second communication bus 1004. The communication component 1012 may be integrated, or may be configured with multiple sub-modules (e.g., IB cards) linked over the communication bus.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse and the like; an output section 1007 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker and the like; a storage section 1008 including a hard disk and the like; and a communication interface 1009 including a network interface card such as a LAN card or a modem. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1010 as needed, so that a computer program read out from it can be installed into the storage section 1008 as needed.
It should be noted that the architecture shown in fig. 11 is only one optional implementation. In practice, the number and types of the components in fig. 11 may be selected, deleted, added, or replaced according to actual needs. Functional components may also be arranged separately or integrated: for example, the GPU and the CPU may be set separately, or the GPU may be integrated on the CPU; likewise, the communication element may be set separately, or may be integrated on the CPU or the GPU; and so on. These alternative embodiments all fall within the scope of the present invention.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present invention includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for executing the method illustrated in the flowchart. The program code may include instructions corresponding to the steps of the method provided by embodiments of the present invention: for example, scaling the face image to be detected to obtain a scaled face image, and detecting the scaled face image through a neural network to obtain a face detection result of the face image to be detected, wherein the neural network is generated according to the method of the first embodiment or the second embodiment. In such an embodiment, the computer program may be downloaded and installed from a network via the communication element, and/or installed from the removable medium 1011. The computer program, when executed by the second processor, performs the above-described functions defined in the method of an embodiment of the invention.
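As an illustration of the program flow just described, the following is a minimal Python sketch of the two claimed operations: scaling the input, then detecting on the scaled image. The `detector` object and its `forward` method are hypothetical stand-ins for the fused neural network; only the two steps themselves come from the text.

```python
import numpy as np
import cv2

def detect_faces(image, detector, scale):
    # Step 1 (claimed): scale the face image to be detected.
    scaled = cv2.resize(image, None, fx=scale, fy=scale)
    # Step 2 (claimed): run the fused neural network on the scaled image.
    boxes, scores = detector.forward(scaled)  # hypothetical interface
    # Map detections back to the coordinate frame of the original image.
    boxes = np.asarray(boxes, dtype=np.float32) / scale
    return boxes, scores
```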
The methods, apparatuses, and devices of the present invention may be implemented in many ways: for example, by software, hardware, firmware, or any combination thereof. The above order of the steps of the method is for illustration only, and the steps of the method of the embodiments of the present invention are not limited to the order specifically described above unless otherwise stated. Furthermore, in some embodiments, the present invention may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing methods according to embodiments of the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to an embodiment of the present invention.
The description of the present embodiments has been presented for purposes of illustration, and is not intended to be exhaustive or to limit the invention to the forms disclosed; many modifications and variations will be apparent to those skilled in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand that the invention admits various embodiments with various modifications suited to the particular use contemplated.

Claims (20)

1. A method of generating a neural network, the method comprising:
dividing a target scale range to obtain a plurality of sub-target scale ranges, wherein the target scale range is used for face detection of a neural network;
determining sub-neural networks respectively corresponding to the plurality of sub-target scale ranges;
fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges to obtain the neural network,
wherein, the dividing the target scale range to obtain a plurality of sub-target scale ranges comprises:
uniformly dividing the target scale range into a plurality of sub-ranges;
for each sub-range, extracting, from the training data set, a plurality of face samples falling within the current sub-range;
determining the multiple sub-target scale ranges according to the characteristic value of each face sample in each sub-range,
wherein, the determining the plurality of sub-target scale ranges according to the characteristic value of each face sample in each sub-range comprises:
for each face sample in each sub-range, generating a color histogram of the face sample based on pixel values of the face sample;
respectively calculating the chi-square distance between the color histograms of every two face samples in each sub-range to obtain a face sample distance matrix of the sub-range;
generating an appearance-change feature value for the sub-range according to the face sample distance matrix;
and determining the plurality of sub-target scale ranges according to the appearance-change feature value of each sub-range.
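To make the procedure in claim 1 concrete, here is a minimal numpy sketch of the per-sub-range analysis: a color histogram per face sample, pairwise chi-square distances forming the distance matrix, and a scalar appearance-change feature value. The histogram binning, the use of the matrix mean as the feature value, and all names are illustrative assumptions; the claim fixes the quantities, not these particular choices.

```python
import numpy as np

def color_histogram(face, bins=16):
    # Per-channel histogram over pixel values (claim 1), concatenated
    # across channels and normalized to sum to one.
    hist = np.concatenate(
        [np.histogram(face[..., c], bins=bins, range=(0, 255))[0]
         for c in range(face.shape[-1])]
    ).astype(np.float64)
    return hist / max(hist.sum(), 1e-12)

def chi_square(h1, h2):
    # Chi-square distance between two normalized histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-12))

def appearance_change(faces):
    # Pairwise distance matrix over the face samples of one sub-range,
    # summarized here by its mean as the appearance-change feature value.
    hists = [color_histogram(f) for f in faces]
    n = len(hists)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = chi_square(hists[i], hists[j])
    return dist.mean()
```

Adjacent uniform sub-ranges whose feature values are close could then be merged into a single sub-target scale range; the merge rule itself is left open by the claim.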
2. The method of claim 1, wherein before generating the color histogram of the face sample based on the pixel values of the face sample for each face sample in each sub-range, the method further comprises:
and scaling the size of each face sample in each sub-range according to the lower bound value of the current sub-range.
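Claim 2's preprocessing can be read as normalizing every sample in a sub-range to that sub-range's lower bound before the histogram step sketched above, so the chi-square comparison measures appearance change rather than size change. A minimal sketch, assuming square face crops:

```python
import cv2

def normalize_to_lower_bound(face, sub_range_lo):
    # Resize the (assumed square) face crop so its side length equals the
    # lower bound value of its sub-range (claim 2) before histogramming.
    return cv2.resize(face, (sub_range_lo, sub_range_lo))
```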
3. The method according to any one of claims 1-2, wherein the dividing the target scale range to obtain a plurality of sub-target scale ranges comprises:
and dividing the target scale range based on the characteristic points of the human face to obtain a plurality of sub-target scale ranges.
4. The method according to any one of claims 1-2, wherein the determining the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges comprises:
determining a pooling down-sampling step length corresponding to each sub-target scale range; and determining a sub-neural network corresponding to the sub-target scale range based on the pooling down-sampling step length.
5. The method according to any one of claims 1-2, wherein the fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges to obtain the neural network comprises:
and fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges based on the shared network parameters to obtain the neural network.
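Claims 4 and 5 together describe sub-networks whose pooling down-sampling steps match their sub-target scale ranges, fused over shared network parameters. The following PyTorch sketch shows one way such a structure could look; the layer widths, the stride choices (1, 2, 4), and the five-channel output head are illustrative assumptions, not the patent's architecture.

```python
import torch
import torch.nn as nn

class FusedDetector(nn.Module):
    def __init__(self, pool_strides=(1, 2, 4)):
        super().__init__()
        # Shared network parameters (claim 5): a single trunk feeds every branch.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        # One sub-network per sub-target scale range, each with its own
        # pooling down-sampling step (claim 4).
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.MaxPool2d(kernel_size=s, stride=s) if s > 1 else nn.Identity(),
                nn.Conv2d(64, 5, 1),  # per-location face score + 4 box offsets
            )
            for s in pool_strides
        )

    def forward(self, x):
        shared = self.trunk(x)
        # Each branch produces detections for one sub-target scale range.
        return [branch(shared) for branch in self.branches]
```

Because the branches share the trunk, covering several sub-target scale ranges costs little more than covering one, which keeps the fused network compact.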
6. A face detection method, comprising:
scaling the face image to be detected to obtain a scaled face image;
detecting the scaled face image through a neural network to obtain a face detection result of the face image to be detected,
wherein the neural network is generated according to the method of any one of claims 1-5.
7. The method of claim 6, wherein scaling the face image to be detected to obtain a scaled face image comprises:
and scaling the face image to be detected according to the upper bound of the target scale range of the neural network.
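As a worked example of claim 7, the scale factor can be chosen so that no face in the scaled image can exceed the upper bound of the network's target scale range. The bound on the largest possible face (the frame's shorter side) and the variable names are assumptions for illustration.

```python
def scale_factor(image_h, image_w, range_hi):
    # Assume the largest face a frame can contain is bounded by its shorter
    # side; scale so that bound maps onto the target range's upper bound.
    largest_possible_face = min(image_h, image_w)
    return min(1.0, float(range_hi) / largest_possible_face)

# e.g. a 1080x1920 frame with range_hi = 256 gives 256/1080 ~ 0.237, so no
# face in the scaled image can exceed the range the fused network covers.
```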
8. The method according to any one of claims 6 to 7, wherein the face image to be detected is a still image or a video image in a sequence of video frames.
9. The method of claim 8, wherein the sequence of video frames is a sequence of video frames in a live broadcast.
10. An apparatus for generating a neural network, the apparatus comprising:
the dividing module is used for dividing a target scale range to obtain a plurality of sub-target scale ranges, wherein the target scale range is used for face detection of a neural network;
the determining module is used for determining the sub-neural networks corresponding to the plurality of sub-target scale ranges respectively;
a fusion module for fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges to obtain the neural network,
wherein, the dividing module comprises:
the first dividing unit is used for uniformly dividing the target scale range into a plurality of sub-ranges;
an extraction unit for extracting, for each sub-range, a plurality of face samples falling within the current sub-range in the training data set;
a first determining unit, configured to determine the multiple sub-target scale ranges according to the feature values of the face samples in each sub-range;
the first determining unit is specifically configured to:
for each face sample in each sub-range, generating a color histogram of the face sample based on pixel values of the face sample;
respectively calculating the chi-square distance between the color histograms of every two face samples in each sub-range to obtain a face sample distance matrix of the sub-range;
generating an appearance-change feature value for the sub-range according to the face sample distance matrix;
and determining the plurality of sub-target scale ranges according to the appearance-change feature value of each sub-range.
11. The apparatus of claim 10, wherein the first determining unit is further configured to:
scale, for each face sample in each sub-range, the size of the face sample according to the lower bound value of the current sub-range, before generating the color histogram of the face sample based on its pixel values.
12. The apparatus according to any one of claims 10-11, wherein the dividing module further comprises:
and the second dividing unit is used for dividing the target scale range based on the characteristic points of the human face to obtain the plurality of sub-target scale ranges.
13. The apparatus according to any one of claims 10-11, wherein the determining module comprises:
the second determining unit is used for determining the pooling down-sampling step length corresponding to each sub-target scale range; and determining a sub-neural network corresponding to the sub-target scale range based on the pooling down-sampling step length.
14. The apparatus according to any one of claims 10-11, wherein the fusion module comprises:
and the fusion unit is used for fusing the sub-neural networks respectively corresponding to the plurality of sub-target scale ranges based on the shared network parameters to obtain the neural network.
15. An apparatus for face detection, the apparatus comprising:
the scaling module is used for scaling the face image to be detected to obtain a scaled face image;
a detection module for detecting the scaled face image via the neural network to obtain the face detection result of the face image to be detected,
wherein the neural network is generated by the apparatus of any one of claims 10-14.
16. The apparatus of claim 15, wherein the scaling module comprises:
and the scaling unit is used for scaling the face image to be detected according to the upper bound of the target scale range of the neural network.
17. The apparatus according to any one of claims 15-16, wherein the face image to be detected is a still image or a video image in a sequence of video frames.
18. The apparatus of claim 17, wherein the sequence of video frames is a sequence of video frames in a live broadcast.
19. An electronic device, comprising: a first processor, a first memory, a first communication element, and a first communication bus, wherein the first processor, the first memory, and the first communication element communicate with each other through the first communication bus;
the first memory is used for storing at least one executable instruction, and the executable instruction causes the first processor to execute the operation corresponding to the generation method of the neural network as claimed in any one of claims 1 to 5.
20. An electronic device, comprising: a second processor, a second memory, a second communication element, and a second communication bus, wherein the second processor, the second memory, and the second communication element communicate with each other through the second communication bus;
the second memory is used for storing at least one executable instruction, and the executable instruction causes the second processor to execute the operation corresponding to the face detection method according to any one of claims 6 to 9.
CN201710277489.0A 2017-04-25 2017-04-25 Neural network generation method, face detection device and electronic equipment Active CN108229281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710277489.0A CN108229281B (en) 2017-04-25 2017-04-25 Neural network generation method, face detection device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710277489.0A CN108229281B (en) 2017-04-25 2017-04-25 Neural network generation method, face detection device and electronic equipment

Publications (2)

Publication Number Publication Date
CN108229281A CN108229281A (en) 2018-06-29
CN108229281B true CN108229281B (en) 2020-07-17

Family

ID=62657366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710277489.0A Active CN108229281B (en) 2017-04-25 2017-04-25 Neural network generation method, face detection device and electronic equipment

Country Status (1)

Country Link
CN (1) CN108229281B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569798A (en) * 2018-11-16 2021-10-29 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN111913400A (en) * 2020-07-28 2020-11-10 深圳Tcl新技术有限公司 Information fusion method and device and computer readable storage medium
CN112084886B (en) * 2020-08-18 2022-03-15 眸芯科技(上海)有限公司 Method and device for improving detection performance of neural network target detection
CN112686126A (en) * 2020-12-25 2021-04-20 杭州海康威视数字技术股份有限公司 Face modeling method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105510970A (en) * 2016-01-28 2016-04-20 中国石油集团川庆钻探工程有限公司地球物理勘探公司 Method for obtaining seismic facies optimal classification number
CN105975931A (en) * 2016-05-04 2016-09-28 浙江大学 Convolutional neural network face recognition method based on multi-scale pooling

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014186520A (en) * 2013-03-22 2014-10-02 Toshiba Corp Image processing apparatus, image processing method, and program
CN105981008B (en) * 2014-05-27 2019-05-28 北京旷视科技有限公司 Learn depth face representation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105510970A (en) * 2016-01-28 2016-04-20 中国石油集团川庆钻探工程有限公司地球物理勘探公司 Method for obtaining seismic facies optimal classification number
CN105975931A (en) * 2016-05-04 2016-09-28 浙江大学 Convolutional neural network face recognition method based on multi-scale pooling

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection; Zhaowei Cai et al.; Computer Vision and Pattern Recognition; 2016-07-25; pp. 1-16 *
Supervised Transformer Network for Efficient Face Detection; Dong Chen et al.; Computer Vision and Pattern Recognition; 2016-07-19; pp. 1-17 *
Research on a face recognition method using convolutional neural networks based on multi-scale pooling; Wu Si; China Excellent Master's Theses Full-text Database, Information Science and Technology; 2017-02-15; No. 02; pp. I138-3554 *
Research on multi-scale segmentation algorithms for texture images based on wavelet transform; Liu Xiaozhao; China Excellent Master's Theses Full-text Database, Information Science and Technology; 2011-03-15; No. 03; pp. I138-1215 *

Also Published As

Publication number Publication date
CN108229281A (en) 2018-06-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant