Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The face liveness detection method provided by the first embodiment of the present invention can be applied to an application environment as shown in fig. 1, in which a client (computer device) communicates with a server through a network. In the method, the server obtains a first face image acquired through a single visible light camera; the first face image is preprocessed to obtain a second face image; the second face image is classified through a pre-trained support vector machine classifier to judge whether the second face image is a color image; when the second face image is a color image, the second face image is processed according to a first preset rule to obtain first multichannel data of the second face image; the first multichannel data are processed by a pre-trained convolutional neural network to obtain a living body detection result of the first face image; and the detection result is sent to the client. The client (computer device) may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server can be implemented by an independent server or a server cluster composed of a plurality of servers.
In the first embodiment of the present invention, as shown in fig. 2, a face liveness detection method is provided. The method is described as applied to the server side in fig. 1 by way of example, and includes the following steps 11 to 15.
Step 11: a first face image is acquired through a single visible light camera.
Specifically, a face image shot by one camera may be used as the first face image. That is, in the present embodiment, the acquisition process of the first face image does not require a plurality of cameras.
Step 12: and preprocessing the first face image to obtain a second face image.
The second face image should be of a predetermined size and contain a face.
Further, as an implementation manner of this embodiment, as shown in fig. 3, the step 12 includes the following steps 121 to 123.
Step 121: and carrying out face detection on the first face image to obtain the face position in the first face image.
Specifically, face detection may be performed on the first face image by a neural network to obtain the face position in the first face image.
For example, with a VGG16 network architecture, RefineDet may be adopted as the face positioning network: features of each dimension in the first face image are extracted through the pre-trained face positioning network to obtain multi-scale features of the first face image, a face positioning frame is obtained by regression from the multi-scale features, and the framed position of the face positioning frame is taken as the face position. The pre-trained face positioning network may be obtained as follows: a plurality of face sample images are acquired; each face sample image is smoothed with a low-pass filter and then down-sampled to obtain a series of face sample images of reduced size; features of each dimension in the face sample images are extracted through the face positioning network to obtain multi-scale features of the face sample images, and a predicted face positioning frame is obtained by regression from the multi-scale features; whether the predicted face positioning frames and the actual face positioning frames of the plurality of face sample images meet a first training requirement is judged; if not, each feature weight in the face positioning network is adjusted and prediction is repeated through the face positioning network until the predicted face positioning frames and the actual face positioning frames meet the first training requirement, and the current face positioning network is taken as the pre-trained face positioning network.
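The smoothing and down-sampling used to prepare the multi-scale training images can be sketched as follows. The kernel choice (a small separable Gaussian as the low-pass filter), the grayscale input, and the function names are assumptions of this minimal sketch rather than details of the embodiment.

```python
import numpy as np

def gaussian_kernel1d(sigma=1.0, radius=2):
    """Small 1-D Gaussian used as the low-pass filter (assumed choice)."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def smooth_and_downsample(image, factor=2):
    """One pyramid level: low-pass filter, then keep every `factor`-th pixel."""
    k = gaussian_kernel1d()
    # Separable blur: filter each row, then each column.
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)
    return blurred[::factor, ::factor]

def image_pyramid(image, levels=3):
    """Series of face sample images of reduced size, as in the training step."""
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        pyramid.append(smooth_and_downsample(pyramid[-1]))
    return pyramid
```

Each level halves both spatial dimensions, so a 64 × 64 input yields 64, 32 and 16 pixel versions for multi-scale feature extraction.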
It should be noted that the above lists only one method capable of detecting the face position, and is not intended to limit the face position detection method of this embodiment.
Step 122: and intercepting a face position image containing the face from the first face image according to the face position.
Specifically, a face position image including a face and a face surrounding background is captured.
In some examples, the area outlined by the face location box in step 121 above should include the surrounding background of the face.
Step 123: and processing the face position image according to a second preset rule to obtain a second face image.
Specifically, the pixel size of the face position image is converted so that the obtained second face image meets a uniform standard, facilitating subsequent processing of the second face image. In some examples, the face position image may be converted into a 48 × 48 three-channel red, green and blue image; the size can be set according to different requirements.
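The crop-and-resize of steps 122 to 123 can be sketched as follows. Nearest-neighbour resampling and the function name are assumptions of this minimal, dependency-free illustration; a production pipeline would typically use a library resize function instead.

```python
import numpy as np

def preprocess_to_second_face_image(first_face_image, box, size=48):
    """Crop the face position and resize to a size×size three-channel image.

    `box` is an assumed (x, y, w, h) tuple from the face detection step.
    Nearest-neighbour index mapping keeps this sketch self-contained."""
    x, y, w, h = box
    crop = first_face_image[y:y + h, x:x + w]
    # Map each output row/column back to a source row/column.
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    return crop[rows][:, cols]
```

The output always has the same 48 × 48 × 3 shape regardless of the detected box, which is the "uniform standard" the step aims for.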
Through the implementation of the steps 121 to 123, the second face image can be obtained according to the first face image, so that the second face image can meet the same standard, the difficulty of subsequent processing of the second face image is reduced, and the processing efficiency is improved.
Step 13: and classifying the second face image through a pre-trained support vector machine classifier, and judging whether the second face image is a color image.
The pre-trained support vector machine classifier has two classification results, namely color image and gray image. Specifically, when the pre-trained support vector machine classifier judges that the second face image is a color image, step 14 is performed; when it judges that the second face image is a gray image, a detection result indicating that the first face image is a non-living body is returned directly.
Step 14: and when the second face image is a color image, processing the second face image according to a first preset rule to obtain first multi-channel data of the second face image.
The first multichannel data can be input directly into the convolutional neural network.
Further, as an implementation manner of this embodiment, as shown in fig. 4, the step 14 specifically includes the following steps 141 to 143.
Step 141: and when the second face image is colored, extracting red, green and blue three-channel data of each pixel in the second face image.
In some examples, before the step 141, the method may further include: and adjusting the second face image to a specified size, and adopting the same standard to facilitate the subsequent input into a pre-trained convolutional neural network. For example, the second face image is converted into a 112 × 112 three-channel image of red, green and blue.
Step 142: and acquiring color space three-channel data (YCbCr for short) according to the red, green and blue three-channel data of each pixel.
Specifically, YCbCr can be obtained by calculation according to the following formula (1):
Y = 0.299R + 0.587G + 0.114B
Cb = 0.564(B - Y)    (1)
Cr = 0.713(R - Y)
where R represents red channel data, G represents green channel data, B represents blue channel data, Y represents luminance component channel data, Cb represents blue chrominance component channel data, and Cr represents red chrominance component channel data.
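Formula (1) applies independently to every pixel and can be vectorized as sketched below; the function name and the use of NumPy are assumptions of this illustration.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Apply formula (1) channel-wise.

    `rgb` is an H×W×3 array ordered (R, G, B); the result is ordered
    (Y, Cb, Cr)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b    # luminance component
    cb = 0.564 * (b - y)                     # blue chrominance component
    cr = 0.713 * (r - y)                     # red chrominance component
    return np.stack([y, cb, cr], axis=-1)
```

For a pure red pixel (255, 0, 0), the formula gives Y = 76.245, Cb = -43.002 and Cr = 127.452.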
Step 143: and correlating the red, green and blue three-channel data and the color space three-channel data to obtain first multi-channel data.
Specifically, the red channel data, green channel data, blue channel data, luminance component channel data, blue chrominance component channel data and red chrominance component channel data of each pixel in the second face image are placed in one-to-one correspondence, thereby achieving the correlation. In this embodiment, the first multi-channel data is six-channel data.
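The per-pixel correlation of step 143 amounts to stacking the two triples along the channel axis, as in the following sketch (function name and NumPy usage are assumptions of the illustration):

```python
import numpy as np

def build_first_multichannel_data(second_face_image):
    """Correlate RGB with the YCbCr values of formula (1): each pixel ends
    up with six channels ordered R, G, B, Y, Cb, Cr."""
    rgb = second_face_image.astype(float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    ycbcr = np.stack([y, 0.564 * (b - y), 0.713 * (r - y)], axis=-1)
    # Concatenating along the last axis keeps the per-pixel correspondence.
    return np.concatenate([rgb, ycbcr], axis=-1)
```

A 48 × 48 × 3 second face image thus becomes a 48 × 48 × 6 array ready for the convolutional neural network.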
Through the implementation of the steps 141 to 143, the first multi-channel data can be obtained from the second face image. Inputting the first multi-channel data into the convolutional neural network is more favorable for improving the accuracy of living body detection than inputting the red, green and blue three-channel data alone.
Step 15: and processing the first multi-channel data according to a pre-trained convolutional neural network to obtain a living body detection result of the first face image.
Specifically, a pre-trained convolutional neural network extracts features from the first multi-channel data, and a living body detection result of the first face image is obtained according to the extracted features.
Through the implementation of the above steps 11 to 15, the first face image shot by the single visible light camera is preprocessed, whether the resulting image is a color image is judged, and whether the first face image contains a living body is then detected through the convolutional neural network, thereby solving the problem that existing face liveness detection methods have low detection accuracy.
Further, as an implementation manner of this embodiment, as shown in fig. 5, in order to obtain a pre-trained support vector machine classifier, the following steps 21 to 24 need to be performed.
Step 21: a plurality of first sample images are acquired.
The first sample images should include color sample images and gray sample images, and the ratio of color sample images to gray sample images can be set manually. The larger the number of first sample images, the better the subsequent classification effect of the trained support vector machine classifier.
It should be noted that each first sample image should be obtained through preprocessing; the preprocessing method is similar to the method of preprocessing the first face image to obtain the second face image in the above steps 121 to 122, and details are not repeated here.
Step 22: and respectively extracting red, green and blue pixel values of each pixel in the plurality of first sample images.
Specifically, the red, green and blue pixel values of each pixel in each first sample image are extracted.
Step 23: and processing the red, green and blue pixel values of the first sample image through a support vector machine classifier to obtain a predicted color black and white classification result of the first sample image.
Step 24: and comparing the predicted color black-white classification results of all the first sample images with the actual color black-white classification results, adjusting parameters in the support vector machine classifier when the comparison results do not meet the preset requirements, and predicting again until the comparison results of the predicted color black-white classification results and the actual color black-white classification results of all the first sample images meet the preset requirements so as to obtain the pre-trained support vector machine classifier.
When the predicted color black-white classification results of all the first sample images are completely the same as the actual color black-white classification results, the support vector machine classifier is well trained.
Through the implementation of the steps 21 to 24, a pre-trained support vector machine classifier can be obtained, and whether the second face image is a color image or not can be judged.
Further, as an implementation manner of this embodiment, as shown in fig. 6, in order to obtain a pre-trained convolutional neural network, the following steps 31 to 38 need to be performed:
step 31: acquiring a plurality of second sample images;
step 32: respectively preprocessing the plurality of second sample images to obtain a plurality of third face images;
step 33: classifying the third face image through a pre-trained support vector machine classifier, and judging whether the third face image is a color image;
step 34: when the third face image is colorful, carrying out data augmentation processing on the third face image to obtain a fourth face image;
step 35: processing the fourth face image according to a first preset rule to obtain second multichannel data of the fourth face image;
and step 36: respectively processing the second multi-channel data of the fourth face images through a convolutional neural network to obtain predicted living body detection results of the fourth face images;
step 37: calculating the predicted living body detection results of the plurality of fourth face images through a loss function to obtain an evaluation coefficient of the convolutional neural network;
step 38: and when the evaluation coefficient does not reach the preset threshold value, adjusting the feature weight of each feature in the convolutional neural network, and repeatedly processing the second multi-channel data through the convolutional neural network until the evaluation coefficient reaches the preset threshold value so as to obtain the pre-trained convolutional neural network.
The method for obtaining the third face image according to the second sample image and classifying the third face image in the steps 31 to 33 is similar to the method for obtaining the second face image according to the first face image and classifying the second face image in the steps 11 to 13, and is not repeated here.
In step 34, data augmentation is performed on the third face image so that each parameter of the third face image is varied randomly to adapt to different environments, thereby obtaining the fourth face image.
In step 37, the evaluation coefficient of the convolutional neural network may be obtained by calculation according to the following formula (2):

L_lmc = -(1/N) Σ_i log( e^(s(cos(θ_yi, i) - m)) / ( e^(s(cos(θ_yi, i) - m)) + Σ_(j≠yi) e^(s·cos(θ_j, i)) ) )    (2)

where L_lmc represents the evaluation coefficient, N represents the number of training samples in the current training batch, cos(θ_yi, i) represents the cosine distance between sample i and its own class, cos(θ_j, i) represents the cosine distance between sample i and a different class j, m represents the margin (interval), and s represents a hyperparameter. In some examples, the value of m is 0.3 and the value of s is 64, which can effectively shorten the distance between samples of the same class; the values of m and s are preset by the user through experiments.
Through the implementation of the steps 31 to 38, a pre-trained convolutional neural network can be obtained to process the first multi-channel data and obtain a living body detection result.
It should be noted that, in the above step 31, the second sample images may include various types, for example, a shot live-person video, an attack video obtained by shooting the shot live-person video with a shooting device, color image data obtained by copying the shot live-person video, and grayscale image data obtained by copying the shot live-person video. The number of types of the second sample images should be the same as the number of types of the predicted living body detection results output by the convolutional neural network; that is, when the trained convolutional neural network classifies an input as the type of attack video obtained by shooting a shot live-person video with a shooting device, the input is identified as an attack video.
In this embodiment, although the second sample images have multiple types and the trained convolutional neural network can classify an input into its corresponding actual type, in actual living body detection the trained convolutional neural network should output a binary classification result of living body or non-living body: when the first face image is recognized as an attack video obtained by shooting a shot live-person video with a shooting device, as color image data obtained by copying the shot live-person video, or as grayscale image data obtained by copying the shot live-person video, the first face image is output as a non-living body. Specifically, the probability of a living body and the probability of a non-living body are compared to obtain the classification result.
Further, as an implementation manner of this embodiment, the step 34 specifically includes the following step: when the third face image is a color image, performing data expansion, compression, mirroring and rotation processing on the third face image respectively to obtain the fourth face image.
Specifically, the third face image is mirror-flipped with a probability of 50%; the third face image is compressed using OpenCV (Open Source Computer Vision Library) under Python; the contrast of the third face image is changed randomly; the third face image is deflected by a set amount and rotated randomly; and new data are formed from third face images of different types through data enhancement (mixup) to obtain the fourth face image.
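Part of this augmentation pipeline can be sketched as follows. To stay dependency-free, the sketch covers only the 50% mirror flip, random contrast and mixup; the compression and arbitrary-angle rotation steps are omitted, and the contrast range [0.8, 1.2] and function names are assumptions of the illustration.

```python
import numpy as np

def augment_third_face_image(image, rng):
    """Step 34 sketch: mirror with probability 0.5, then random contrast."""
    out = image.astype(float)
    if rng.random() < 0.5:
        out = out[:, ::-1]                       # horizontal mirror flip
    out = np.clip(out * rng.uniform(0.8, 1.2), 0, 255)  # random contrast
    return out

def mixup(image_a, image_b, rng, alpha=0.2):
    """Blend two samples of different types into new training data."""
    lam = rng.beta(alpha, alpha)                 # mixing coefficient in [0, 1]
    return lam * image_a + (1 - lam) * image_b
```

Because the mixing coefficient lies in [0, 1], every mixup output stays within the value range spanned by its two inputs.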
By respectively performing data expansion, compression, mirroring and rotation on the third face image, fourth face images under different environments can be obtained. This is similar to simulating the third face image as captured by different camera devices at different angles and under different illumination environments, which is equivalent to increasing the sample size during training, so the training effect is more pronounced and robustness is greatly improved.
Further, as an implementation manner of the embodiment, the face liveness detection method further includes: replacing the global depth convolutional layer in the convolutional neural network architecture with a pooling layer. In this embodiment, a convolutional neural network architecture based on a modified version of MobileNet is specifically adopted.
It should be noted that, in general, the more global depth convolutional layers there are, the better the living body detection effect; however, beyond a certain number of layers, additional layers change the living body detection effect very little. By using a pooling layer instead of the global depth convolutional layer in the convolutional neural network architecture, the same living body detection effect as with the global depth convolutional layer can be achieved while the data processing magnitude remains small.
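The replacement can be illustrated with the following sketch, which contrasts the two spatial reductions; function names are assumptions of the illustration. Both collapse each H×W channel plane to a single value, but the pooling layer needs no learned weights.

```python
import numpy as np

def global_average_pool(feature_map):
    """Pooling-layer replacement: collapse each H×W channel plane to its
    mean, with zero learned parameters."""
    return feature_map.mean(axis=(0, 1))

def global_depthwise_conv(feature_map, kernels):
    """Reference operation being replaced: one learned H×W kernel per
    channel performs the same H×W -> 1 reduction."""
    return (feature_map * kernels).sum(axis=(0, 1))
```

When every kernel weight equals 1/(H·W), the two operations coincide, illustrating how the pooling layer preserves the spatial reduction while dropping the per-channel learned weights.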
In this embodiment, replacing the global depth convolutional layer in the convolutional neural network architecture with a pooling layer can effectively improve the data processing speed; meanwhile, face living body detection can be completed at the front end of an intelligent terminal without performing a large amount of calculation on a server, making the face living body detection method more convenient to apply.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
A second embodiment of the present invention provides a face liveness detection device, which corresponds to the face liveness detection method provided in the first embodiment one to one.
Further, as shown in fig. 7, the face liveness detection device comprises a first face image acquisition module 41, a second face image acquisition module 42, a color classification module 43, a first multi-channel data acquisition module 44 and a living body detection result acquisition module 45. The functional modules are explained in detail as follows:
a first face image obtaining module 41, configured to obtain a first face image through a single visible light camera;
the second face image acquiring module 42 is configured to pre-process the first face image to obtain a second face image;
the color classification module 43 is configured to classify the second face image through a pre-trained support vector machine classifier, and determine whether the second face image is a color image;
the first multichannel data acquisition module 44 is configured to, when the second face image is colored, process the second face image according to a first preset rule to obtain first multichannel data of the second face image;
and the living body detection result acquisition module 45 is configured to process the first multichannel data according to a pre-trained convolutional neural network to obtain a living body detection result of the first face image.
Further, as an implementation manner of the present embodiment, the second face image acquiring module 42 includes a face position acquiring unit, a face position image acquiring unit, and a second face image acquiring unit. The functional units are detailed as follows:
the face position acquisition unit is used for carrying out face detection on the first face image to obtain a face position in the first face image;
the face position image acquisition unit is used for intercepting a face position image containing a face from the first face image according to the face position;
and the second face image acquisition unit is used for processing the face position image according to a second preset rule to obtain a second face image.
Further, as an implementation manner of this embodiment, the first multi-channel data acquiring module 44 includes a red, green, and blue three-channel data acquiring unit, a color space three-channel acquiring unit, and a first multi-channel data acquiring unit. The functional units are detailed as follows:
the red, green and blue three-channel data acquisition unit is used for extracting red, green and blue three-channel data of each pixel in the second face image when the second face image is colored;
the color space three-channel acquisition unit is used for acquiring three color space channels according to the red, green and blue three-channel data of each pixel;
the first multi-channel data acquisition unit is used for correlating the red, green and blue three-channel data and the color space three-channel data to acquire first multi-channel data.
Further, as an implementation manner of this embodiment, the face liveness detection device further includes a first sample image acquisition module, a red, green, and blue pixel value acquisition module, a predicted color black and white classification result acquisition module, and a support vector machine acquisition module. The functional modules are explained in detail as follows:
the first sample image acquisition module is used for acquiring a plurality of first sample images;
the red, green and blue pixel value acquisition module is used for respectively extracting the red, green and blue pixel values of each pixel in the plurality of first sample images;
the predicted color black-white classification result acquisition module is used for processing the red, green and blue pixel values of the first sample image through the support vector machine classifier to obtain a predicted color black-white classification result of the first sample image;
and the support vector machine acquisition module is used for comparing the predicted color black-white classification results of all the first sample images with the actual color black-white classification results, adjusting parameters in the support vector machine classifier when the comparison results do not meet the preset requirements, and predicting again until the comparison results of the predicted color black-white classification results of all the first sample images and the actual color black-white classification results meet the preset requirements so as to obtain the pre-trained support vector machine classifier.
Further, as an implementation manner of this embodiment, the face living body detection apparatus further includes a second sample image acquisition module, a third face image acquisition module, a color image judgment module, a fourth face image acquisition module, a second multi-channel data acquisition module, a predicted living body detection result acquisition module, an evaluation coefficient acquisition module, and a convolutional neural network acquisition module. The functional modules are explained in detail as follows:
the second sample image acquisition module is used for acquiring a plurality of second sample images;
the third face image acquisition module is used for respectively preprocessing the second sample images to obtain third face images;
the color image judging module is used for classifying the third face image through a pre-trained support vector machine classifier and judging whether the third face image is a color image;
the fourth face image acquisition module is used for performing data augmentation processing on the third face image to obtain a fourth face image when the third face image is colored;
the second multichannel data acquisition module is used for processing the fourth face image according to a first preset rule to obtain second multichannel data of the fourth face image;
the predicted living body detection result acquisition module is used for respectively processing second multi-channel data of the fourth face images through a convolutional neural network to obtain predicted living body detection results of the fourth face images;
the evaluation coefficient acquisition module is used for calculating the in-vivo detection results of the plurality of fourth face images through the loss function to obtain the evaluation coefficient of the convolutional neural network;
and the convolutional neural network acquisition module is used for adjusting the characteristic weight of each characteristic in the convolutional neural network when the evaluation coefficient does not reach the preset threshold value, and repeatedly processing the second multi-channel data through the convolutional neural network until the evaluation coefficient reaches the preset threshold value so as to obtain the pre-trained convolutional neural network.
Further, as an implementation manner of the present embodiment, the fourth face image acquisition module includes a data augmentation unit. The detailed functions of the data amplification unit are as follows:
and the data amplification unit is used for respectively carrying out data expansion, compression, mirror image and rotation processing on the third face image to obtain a fourth face image when the third face image is colored.
Further, as an implementation manner of this embodiment, the living human face detection apparatus further includes a network architecture module. The detailed functions of the network architecture module are as follows:
and the network architecture module is used for replacing the global depth convolutional layer in the convolutional neural network architecture with the pooling layer.
For specific limitations of the face living body detection device, reference may be made to the above limitations of the face living body detection method, and details are not repeated here. All or part of the modules in the human face living body detection device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
A third embodiment of the present invention provides a computer device, which may be a server, and the internal structure diagram of which may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data involved in the face in-vivo detection method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement the face liveness detection method provided by the first embodiment of the present invention.
A fourth embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the face live detection method provided by the first embodiment of the present invention, such as steps 11 to 15 shown in fig. 2, steps 121 to 123 shown in fig. 3, steps 141 to 143 shown in fig. 4, steps 21 to 24 shown in fig. 5, and steps 31 to 38 shown in fig. 6. Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units of the face liveness detection method provided by the first embodiment described above. To avoid repetition, further description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein.