CN115147937A - Face living body detection method and device, electronic equipment and storage medium - Google Patents

Face living body detection method and device, electronic equipment and storage medium

Info

Publication number
CN115147937A
CN115147937A (application CN202210800553.XA)
Authority
CN
China
Prior art keywords
visible light
infrared
feature
sample
living body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210800553.XA
Other languages
Chinese (zh)
Inventor
王珂尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210800553.XA priority Critical patent/CN115147937A/en
Publication of CN115147937A publication Critical patent/CN115147937A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure provides a face living body detection method, an apparatus, an electronic device and a storage medium, which relate to the technical field of artificial intelligence, specifically to the technical fields of deep learning, image processing and computer vision, and can be applied to scenarios such as living body detection. The method comprises the following steps: extracting a first visible light feature from a first visible light image; extracting a first near-infrared feature from a first near-infrared image; fusing a second visible light feature and a second near-infrared feature to obtain a fused feature; determining a fused living body detection result of the face object according to the fused feature; determining a visible light living body detection result and a near-infrared living body detection result according to a third visible light feature and a third near-infrared feature, respectively; and determining the face living body detection result according to the fused living body detection result, the visible light living body detection result and the near-infrared living body detection result, thereby improving the accuracy of face living body detection.

Description

Face living body detection method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of deep learning, image processing and computer vision, and can be applied to scenes such as living body detection.
Background
In the prior art, when a traditional face liveness algorithm is applied to a face living body detection model, the recognition effect in real scenes is not ideal when the face pose is too large or the illumination differs greatly, and the accuracy is poor when the attack modes are diverse.
Disclosure of Invention
The disclosure provides a method, a device, an electronic device and a storage medium for human face living body detection.
According to a first aspect of the present disclosure, there is provided a face live detection method, including:
extracting a first visible light feature from the first visible light image;
extracting a first near-infrared feature from a first near-infrared image, wherein the first visible light image and the first near-infrared image contain the same human face object;
fusing a second visible light feature and a second near-infrared feature to obtain a fused feature, wherein the first visible light feature comprises the second visible light feature, and the first near-infrared feature comprises the second near-infrared feature;
determining a fusion living body detection result of the face object according to the fusion characteristics;
respectively determining a visible light living body detection result and a near-infrared living body detection result according to a third visible light characteristic and a third near-infrared characteristic, wherein the first visible light characteristic comprises the third visible light characteristic, and the first near-infrared characteristic comprises the third near-infrared characteristic;
and determining a human face living body detection result according to the fusion living body detection result, the visible light living body detection result and the near-infrared living body detection result.
According to a second aspect of the present disclosure, there is provided a face live detection model training method, including:
acquiring a visible light sample image and a near-infrared sample image containing the same sample face object, and acquiring a label of the sample face object, wherein the label represents whether the sample face object is a living body face;
performing feature extraction on the visible light sample image and the near-infrared sample image by using a visible light feature extraction network and a near-infrared feature extraction network of a human face living body detection model to respectively obtain a first sample visible light feature and a first sample near-infrared feature;
fusing a second sample visible light characteristic and a second sample near-infrared characteristic to obtain a sample fusion characteristic, wherein the first sample visible light characteristic comprises the second sample visible light characteristic, and the first sample near-infrared characteristic comprises the second sample near-infrared characteristic;
predicting the sample fusion characteristics, the third sample visible light characteristics and the third sample near-infrared characteristics by utilizing each classification network of the face living body detection model, and respectively obtaining a fusion living body prediction result, a visible light living body prediction result and a near-infrared living body prediction result of the sample face object, wherein the first sample visible light characteristics comprise the third sample visible light characteristics, and the first sample near-infrared characteristics comprise the third sample near-infrared characteristics;
determining a human face living body prediction result of the sample human face object according to the fusion living body prediction result, the visible light living body prediction result and the near-infrared living body prediction result;
and adjusting parameters of a human face living body detection model according to the human face living body prediction result of the sample human face object and the label.
According to a third aspect of the present disclosure, there is provided a face liveness detection apparatus comprising:
the first characteristic extraction module is used for extracting first visible light characteristics from the first visible light image;
the second feature extraction module is used for extracting a first near-infrared feature from a first near-infrared image, wherein the first visible light image and the first near-infrared image contain the same human face object;
the feature fusion module is configured to fuse a second visible light feature and a second near-infrared feature to obtain a fused feature, where the first visible light feature includes the second visible light feature, and the first near-infrared feature includes the second near-infrared feature;
the first determining module is used for determining a fusion living body detection result of the face object according to the fusion characteristics;
a second determining module, configured to determine a visible living body detection result and a near-infrared living body detection result according to a third visible light feature and a third near-infrared feature, respectively, where the first visible light feature includes the third visible light feature, and the first near-infrared feature includes the third near-infrared feature;
and the third determining module is used for determining the human face in-vivo detection result according to the fusion in-vivo detection result, the visible light in-vivo detection result and the near-infrared in-vivo detection result.
According to a fourth aspect of the present disclosure, there is provided a face in-vivo detection model training device, comprising:
the sample image and label acquisition module is used for acquiring a visible light sample image and a near-infrared sample image which contain the same sample face object, and for acquiring a label of the sample face object, wherein the label represents whether the sample face object is a living body face;
the sample feature extraction module is used for extracting features of the visible light sample image and the near-infrared sample image by utilizing a visible light feature extraction network and a near-infrared feature extraction network of the human face living body detection model to respectively obtain a first sample visible light feature and a first sample near-infrared feature;
the sample characteristic fusion module is used for fusing a second sample visible light characteristic and a second sample near-infrared characteristic to obtain a sample fusion characteristic, wherein the first sample visible light characteristic comprises the second sample visible light characteristic, and the first sample near-infrared characteristic comprises the second sample near-infrared characteristic;
the prediction result acquisition module is used for predicting the sample fusion characteristics, the third sample visible light characteristics and the third sample near-infrared characteristics by utilizing each classification network of the face living body detection model to respectively obtain a fusion living body prediction result, a visible light living body prediction result and a near-infrared living body prediction result of the sample face object, wherein the first sample visible light characteristics comprise the third sample visible light characteristics, and the first sample near-infrared characteristics comprise the third sample near-infrared characteristics;
a determining module, configured to determine a living human face prediction result of the sample human face object according to the fusion living body prediction result, the visible light living body prediction result, and the near-infrared living body prediction result;
and the parameter adjusting module is used for adjusting parameters of the human face living body detection model according to the human face living body prediction result of the sample human face object and the label.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any of the present disclosure.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of the present disclosure.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of any one of the present disclosure.
The embodiment of the disclosure has the following beneficial effects:
the face in-vivo detection method, the face in-vivo detection device, the electronic equipment and the storage medium provided by the embodiment of the disclosure improve the accuracy of face in-vivo detection.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
fig. 4a is a schematic flow chart of a training method for a human face living body detection model according to an embodiment of the present disclosure;
fig. 4b is a schematic structural diagram of a living human face detection network based on a feature pyramid network according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a living human face detection apparatus provided in an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a second living human face detection device provided in the embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a third living human face detection device provided in the embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a human face in-vivo detection model training device according to an embodiment of the present disclosure;
fig. 9 is a block diagram of an electronic device for implementing a face in-vivo detection method and a face in-vivo detection model training method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In order to improve the accuracy of face in-vivo detection, the embodiment of the disclosure provides a face in-vivo detection method, a face in-vivo detection device, an electronic device and a storage medium.
Firstly, a method for detecting a living human face provided by an embodiment of the present disclosure is described in detail, with reference to fig. 1, including the following steps:
step S101, extracting a first visible light characteristic from a first visible light image;
step S102, extracting a first near-infrared feature from a first near-infrared image, wherein the first visible light image and the first near-infrared image contain the same human face object;
each visible light image has a corresponding near-infrared image. Under poor illumination conditions and the like, much of the information in the visible light image may be lost or invisible, whereas the near-infrared image, having a longer wavelength, can capture more detailed information. The visible light image and the near-infrared image contain the same human face object.
Feature extraction is carried out on the first visible light image and the first near-infrared image respectively to obtain the first visible light feature and the first near-infrared feature. Image feature extraction means extracting useful data or information from an image to obtain a "non-image" representation or description of the image, such as values, vectors or symbols; the extracted "non-image" representation or description is the image feature.
Step S103, fusing a second visible light feature and a second near-infrared feature to obtain a fused feature, wherein the first visible light feature comprises the second visible light feature, and the first near-infrared feature comprises the second near-infrared feature;
after feature extraction is carried out on the first visible light image and the first near-infrared image respectively, the first visible light feature and the first near-infrared feature are obtained. The first visible light feature comprises the second visible light feature, the first near-infrared feature comprises the second near-infrared feature, and the second visible light feature and the second near-infrared feature are fused to obtain the fusion feature. Feature fusion combines the features extracted from the images. In deep learning tasks, feature fusion can be performed by element-wise addition based on a feature pyramid network; that is, the second visible light feature and the second near-infrared feature may be fused by addition to obtain the fusion feature.
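For illustration, the element-wise addition described above can be sketched as follows. This is a minimal sketch assuming PyTorch tensors; the function name and the requirement that both feature maps share the same shape are illustrative assumptions, not part of the disclosure.

```python
import torch

def fuse_by_addition(visible_feat: torch.Tensor, nir_feat: torch.Tensor) -> torch.Tensor:
    """Fuse a visible light feature map and a near-infrared feature map by
    element-wise addition; both tensors are assumed to have the same shape,
    e.g. (N, C, H, W)."""
    assert visible_feat.shape == nir_feat.shape, "feature maps must be spatially aligned"
    return visible_feat + nir_feat
```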
Step S104, determining a fusion living body detection result of the face object according to the fusion characteristics;
and respectively carrying out living body detection on the human face object on each fusion feature according to the obtained fusion features, and determining the fusion living body detection result of the human face object.
Step S105, respectively determining a visible living body detection result and a near-infrared living body detection result according to a third visible light characteristic and a third near-infrared characteristic, wherein the first visible light characteristic comprises the third visible light characteristic, and the first near-infrared characteristic comprises the third near-infrared characteristic;
as can be seen from the above analysis, the second visible light feature and the second near-infrared feature are fused to obtain a fusion feature, and a fusion living body detection result of the face object is determined based on the fusion feature. In order to improve the accuracy of the living body detection result of the face object, the visible light living body detection result and the near infrared living body detection result of the face object are respectively determined based on the third visible light characteristic and the third near infrared characteristic, and the data quantity of the support detection result is increased. Wherein the first visible light feature comprises a third visible light feature and the first near-infrared feature comprises a third near-infrared feature.
Specifically, the third visible light characteristic is input into a visible light classification network, the visible light classification network is a two-classification network, and whether the visible light living body detection result of the face object is real or not is judged through the two-classification network, so that the visible light living body detection result of the face object is obtained.
And inputting the third near-infrared features into a near-infrared classification network, wherein the near-infrared classification network is a two-classification network, and judging whether the near-infrared living body detection result of the face object is real or not through the two-classification network to obtain the near-infrared living body detection result of the face object.
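A minimal sketch of such a two-classification (real versus spoof) network head is given below, assuming PyTorch; the pooling layer, the channel count of 512 and the use of separate heads for the two modalities are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LivenessHead(nn.Module):
    """Two-classification head applied to one feature map: global average
    pooling followed by a linear layer that outputs logits for {spoof, live}."""
    def __init__(self, in_channels: int = 512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # (N, C, H, W) -> (N, C, 1, 1)
        self.fc = nn.Linear(in_channels, 2)   # two classes: spoof / live

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.fc(self.pool(feat).flatten(1))

# separate two-classification networks for the visible light and near-infrared branches
visible_light_head = LivenessHead(in_channels=512)
near_infrared_head = LivenessHead(in_channels=512)
```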
And S106, determining a human face living body detection result according to the fusion living body detection result, the visible light living body detection result and the near-infrared living body detection result.
In one example, when the visible light living body detection result and the near-infrared living body detection result are both real, and the number of real results among the fused living body detection results is greater than or equal to one half of their total number, it can be determined that the face living body detection result of the face object to be detected is real, that is, the face object to be detected is a living body.
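The example decision rule above can be expressed as a short sketch; the function below is illustrative, and treating the fused living body detection results as a list of per-level boolean outcomes is an assumption.

```python
def final_liveness_decision(fused_results, visible_is_live, nir_is_live):
    """Example rule from the text: the visible light and near-infrared results
    must both be real, and at least half of the fused living body detection
    results must be real, for the face object to be judged a living body."""
    if not (visible_is_live and nir_is_live):
        return False
    live_votes = sum(1 for result in fused_results if result)
    return live_votes >= len(fused_results) / 2
```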
In the above embodiment, the second visible light feature and the second near-infrared feature are fused to obtain a fused feature; determining a fusion living body detection result of the face object according to the fusion characteristics; respectively determining a visible light living body detection result and a near-infrared living body detection result according to the third visible light characteristic and the third near-infrared characteristic; and finally, determining a human face living body detection result according to the fusion living body detection result, the visible light living body detection result and the near-infrared living body detection result. Compared with the prior art in which the human face living body detection is performed only by using the visible light feature and the near infrared feature, the embodiment of the disclosure increases the element of the fusion feature, and further increases the result of the fusion living body detection in the process of the human face living body detection, thereby improving the accuracy of the human face living body detection.
Referring to fig. 2, a second flowchart of the face live detection method provided in the embodiment of the present disclosure is respectively detailed in step S101 and step S102 based on fig. 1, and includes the following steps:
step S201, inputting the first visible light image into a visible light feature extraction network, and obtaining at least two first visible light features according to the output of at least two preset convolution layer regions of the visible light feature extraction network;
the visible light feature extraction network can be a VGG11 network; the VGG11 network has 11 parameter layers, including 8 convolutional layers and 3 fully connected layers. The VGG11 network adopted in the embodiment of the present disclosure includes at least two preset convolution layer regions, each preset convolution layer region outputs a first visible light feature, and at least two first visible light features are extracted. In one example, the number of the preset convolution layer regions may be four.
Step S202, inputting the first near-infrared image into a near-infrared feature extraction network, and acquiring at least two first near-infrared features according to the output of at least two preset convolution layer regions of the near-infrared feature extraction network;
the near-infrared feature extraction network and the visible light feature extraction network have the same structure, namely the near-infrared feature extraction network also adopts a VGG11 network, and similarly, the near-infrared feature extraction network comprises at least two preset convolution layer areas, and the number of the preset convolution layer areas is the same as that of the preset convolution layer areas in the visible light feature extraction network. And each preset convolution layer region outputs a first near-infrared characteristic, and at least two first near-infrared characteristics are extracted. In one example, the number of the preset convolution layer regions may be four.
Wherein the human face living body detection model comprises: the visible light feature extraction network and the near infrared feature extraction network.
The visible light feature extraction network and the near-infrared feature extraction network adopt the same VGG11 structure and include the same number of preset convolution layer regions. These convolutional neural networks, namely the visible light feature extraction network and the near-infrared feature extraction network, form the main structure of the face living body detection model used for face living body detection.
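A minimal sketch of such a backbone is given below, assuming PyTorch. The grouping of the 8 convolutional layers of VGG11 into four "preset convolution layer regions" (with the last two VGG stages merged into one region) is an assumption; the text only states that a VGG11 network with at least two such regions is used.

```python
import torch
import torch.nn as nn

def conv_stage(in_ch: int, out_ch: int, num_convs: int) -> nn.Sequential:
    """One VGG-style stage: 3x3 convolutions with ReLU, then 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2))
    return nn.Sequential(*layers)

class VGG11Backbone(nn.Module):
    """VGG11-style feature extractor (8 convolutional layers) grouped into four
    regions; each region outputs one first visible light / near-infrared feature."""
    def __init__(self):
        super().__init__()
        self.region1 = conv_stage(3, 64, 1)
        self.region2 = conv_stage(64, 128, 1)
        self.region3 = conv_stage(128, 256, 2)
        self.region4 = nn.Sequential(conv_stage(256, 512, 2), conv_stage(512, 512, 2))

    def forward(self, x: torch.Tensor):
        f1 = self.region1(x)
        f2 = self.region2(f1)
        f3 = self.region3(f2)
        f4 = self.region4(f3)
        return [f1, f2, f3, f4]
```

Two such backbones would be instantiated, one for the visible light branch and one for the near-infrared branch, so that the two feature extraction networks share the same structure; whether they also share weights is not specified in the text.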
In the above embodiment, the at least two first visible light features are obtained by inputting the first visible light image into the visible light feature extraction network and according to the output of the at least two preset convolution layer regions of the visible light feature extraction network; the at least two first near-infrared features are obtained by inputting the first near-infrared image into the near-infrared feature extraction network and according to the output of at least two preset convolution layer regions of the near-infrared feature extraction network. The extraction of the first visible light characteristic and the first near-infrared characteristic is realized. The visible light feature extraction network and the near-infrared feature extraction network respectively comprise a plurality of preset convolution layer areas, each preset convolution layer area outputs a first visible light feature and a first near-infrared feature, and a plurality of first visible light features and first near-infrared features are extracted, so that feature diversity in feature extraction is realized.
In a possible embodiment, the third visible light feature is a first visible light feature output by a last preset convolution layer region of the visible light feature extraction network; the second visible light characteristic comprises the third visible light characteristic;
the third near-infrared feature is a first near-infrared feature output by the last preset convolution layer region of the near-infrared feature extraction network; the second near-infrared features comprise the third near-infrared features;
the fusion features comprise target fusion features, and the target fusion features are obtained by fusing third near-infrared features and third visible light features.
In one example, the number of the preset convolution layer regions may be four for the visible light feature extraction network. Each preset convolutional layer region outputs a first visible light feature, the first visible light feature comprises a second visible light feature, the second visible light feature comprises a third visible light feature, and the third visible light feature is the first visible light feature output by a fourth preset convolutional layer region of the visible light feature extraction network.
In one example, the number of the preset convolution layer regions may be four for the near infrared feature extraction network. Each preset convolution layer region outputs a first near-infrared feature, the first near-infrared features comprise second near-infrared features, the second near-infrared features comprise third near-infrared features, and the third near-infrared features are first near-infrared features output by a fourth preset convolution layer region of the near-infrared feature extraction network.
Fusing the second visible light characteristic and the second near-infrared characteristic to obtain a fused characteristic; the fusion features comprise target fusion features, and the target fusion features are obtained by fusing third near-infrared features and third visible light features. In one example, the target fusion feature may be a feature obtained by fusing a first visible light feature output by a fourth predetermined convolutional layer region of the visible light feature extraction network and a first near-infrared feature output by a fourth predetermined convolutional layer region of the near-infrared feature extraction network.
The first visible light features are the visible light features output by each preset convolution layer region in the visible light feature extraction network, and the second visible light features are the visible light features used in feature fusion. In one example, the first and second visible light features may partially coincide: for example, there may be four first visible light features, and the second visible light features may be three of them (excluding the third visible light feature). In another example, they may fully coincide: there may be four first visible light features, and the second visible light features may be all of them (including the third visible light feature).
The first near-infrared features are the near-infrared features output by each preset convolution layer region in the near-infrared feature extraction network, and the second near-infrared features are the near-infrared features used in feature fusion. In one example, the first and second near-infrared features may partially coincide: for example, there may be four first near-infrared features, and the second near-infrared features may be three of them (excluding the third near-infrared feature). In another example, they may fully coincide: there may be four first near-infrared features, and the second near-infrared features may be all of them (including the third near-infrared feature).
In the above embodiment, the third visible light feature is the first visible light feature output by the last preset convolution layer region of the visible light feature extraction network; the third near-infrared feature is a first near-infrared feature output by the last preset convolution layer region of the near-infrared feature extraction network; and fusing the third near-infrared characteristic and the third visible light characteristic to obtain a target fusion characteristic. And the acquisition of the third visible light characteristic and the third near-infrared characteristic and the acquisition of the target fusion characteristic are realized.
In one possible embodiment, the second visible light characteristic further includes: an ith visible light feature, which is a visible light feature output by an ith preset convolution layer region of the visible light feature extraction network, wherein i is an integer from 1 to n-1, n is the number of preset convolution layer regions in the visible light feature extraction network, and n is an integer greater than 1;
the second near-infrared feature further comprises: an ith near-infrared feature, which is a near-infrared feature output by an ith preset convolutional layer region of the near-infrared feature extraction network;
the fusion feature further comprises: the ith fusion feature is obtained by fusing the ith near-infrared feature, the ith visible light feature and the (i + 1) th fusion feature, and the nth fusion feature is the target fusion feature;
the fusion in vivo detection result comprises: the living body detection result of the ith fusion characteristic and the living body detection result of the target fusion characteristic.
Specifically, for any non-target fusion feature pair, obtaining a fusion feature corresponding to a next preset convolution layer region of the non-target fusion feature pair, and obtaining a feature to be fused of the non-target fusion feature pair; fusing the second visible light characteristic, the second near-infrared characteristic and the characteristic to be fused of the non-target fusion characteristic pair to obtain a fusion characteristic corresponding to the preset convolution layer area of the non-target fusion characteristic pair; the ith second visible light feature and the ith second near-infrared feature form an ith non-target fusion feature pair, the ith second visible light feature is a first visible light feature output by an ith preset convolution layer region of the visible light feature extraction network, the ith second near-infrared feature is a first near-infrared feature output by an ith preset convolution layer region of the near-infrared feature extraction network, and the ith non-target fusion feature pair corresponds to the ith preset convolution layer region.
In an example, when the value of n is 4, the visible light feature extraction network may include a first preset convolution layer region, a second preset convolution layer region, a third preset convolution layer region, and a fourth preset convolution layer region; the near-infrared feature extraction network may include a first predetermined convolution layer region, a second predetermined convolution layer region, a third predetermined convolution layer region, and a fourth predetermined convolution layer region.
The features output by the fourth preset convolution layer region of the visible light feature extraction network and by the fourth preset convolution layer region of the near-infrared feature extraction network are added to obtain the fourth fusion feature, namely the target fusion feature. The features output by the third preset convolution layer regions of the visible light and near-infrared feature extraction networks are added, and the fourth fusion feature is further added, to obtain the third fusion feature. The features output by the second preset convolution layer regions of the two networks are added, and the third fusion feature is further added, to obtain the second fusion feature. The features output by the first preset convolution layer regions of the two networks are added, and the second fusion feature is further added, to obtain the first fusion feature. For each preset convolution layer region, the visible light feature map and the corresponding near-infrared feature map have the same scale, while the feature maps corresponding to the first, second, third and fourth fusion features have different scales.
A feature pyramid network is adopted: the target fusion feature of the bottommost layer is up-sampled and fused with the features of the shallower layers as described above.
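The top-down fusion described above can be sketched as follows, assuming PyTorch. The 1x1 lateral convolutions that equalize channel counts and the bilinear up-sampling are implementation assumptions; the text only specifies element-wise addition of the per-region features with the deeper fusion feature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Feature-pyramid-style fusion of per-region visible light and near-infrared
    features: F4 = rgb4 + nir4 (target fusion feature), and each shallower level
    adds the up-sampled deeper fusion feature."""
    def __init__(self, channels=(64, 128, 256, 512), out_channels=128):
        super().__init__()
        self.laterals = nn.ModuleList([nn.Conv2d(c, out_channels, kernel_size=1) for c in channels])

    def forward(self, rgb_feats, nir_feats):
        fused = [None] * len(rgb_feats)
        top = len(rgb_feats) - 1
        fused[top] = self.laterals[top](rgb_feats[top] + nir_feats[top])
        for i in range(top - 1, -1, -1):
            level = self.laterals[i](rgb_feats[i] + nir_feats[i])
            deeper = F.interpolate(fused[i + 1], size=level.shape[-2:],
                                   mode="bilinear", align_corners=False)
            fused[i] = level + deeper
        return fused  # [F1, F2, F3, F4]; F4 is the target fusion feature
```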
And predicting fusion characteristics corresponding to any preset convolution layer region, wherein each fusion characteristic is connected with a two-classification network, and judging whether the fusion living body detection result of the face object is real or not through the two-classification network to obtain the fusion living body detection result of the face object corresponding to each preset convolution layer region. In one example, the fused live view results may include live view results of the first fused feature, the second fused feature, and the third fused feature, and live view results of the target fused feature.
In the above embodiment, the in-vivo detection result of the non-target fusion feature and the in-vivo detection result of the target fusion feature are obtained by acquiring the target fusion feature and the non-target fusion feature.
Referring to fig. 3, a schematic flow chart of a third method for detecting a living human face according to an embodiment of the present disclosure is shown, where the method further includes the following steps:
step S301, acquiring a second visible light image and a second near-infrared image containing the face object;
the second visible light image and the second near-infrared image are the first visible light image and the first near-infrared image before processing. Each visible light image has a corresponding near-infrared image. Under poor illumination conditions and the like, much of the information in the visible light image may be lost or invisible, whereas the near-infrared image, having a longer wavelength, can capture more detailed information. The visible light image and the near-infrared image contain the same human face object.
Step S302, carrying out face key point detection on the second visible light image and the second near-infrared image to respectively obtain a visible light face key point and a near-infrared face key point;
face detection is carried out on the second visible light image through a visible light face detection model to obtain the position region of the face in the second visible light image, and face detection is carried out on the second near-infrared image through a near-infrared face detection model to obtain the position region of the face in the second near-infrared image. The face detection models are existing face detection models capable of detecting the face position region. Based on the face position regions in the second visible light image and the second near-infrared image, face key points are detected in the two images through a face key point detection model, and the coordinate values of the face key points in the second visible light image and the second near-infrared image are obtained. The face key point detection model is an existing model: the existing model is called, the image in which the face has been detected is input, and a plurality of face key point coordinate values are obtained. In one example, there may be 72 face key point coordinate values, namely (x1, y1), ..., (x72, y72).
Step S303, aligning the face region in the second visible light image and the face region in the second near-infrared image based on the visible light face key point and the near-infrared face key point to respectively obtain a third visible light image and a third near-infrared image;
and aligning a face region in the second visible light image and a face region in the second near-infrared image according to the coordinate values of the key points of the face of the second visible light image and the coordinate values of the key points of the face of the near-infrared image, intercepting the region only containing the face through affine transformation, and adjusting the intercepted visible light image and the intercepted near-infrared image to the same scale. The face key point coordinates are also remapped to new coordinates according to the affine transformation matrix. In one example, the scale may be 224x224. If the scale is too small, the image information is lost too seriously, and if the scale is too large, the abstract level of the image information is not high enough, the calculation amount is larger, and the size of 7 × 7 is a good balance. The image is reduced from large to small resolution by a factor that is typically 2 to the power of the exponent, so the input to the image must be 7 x (2 to the power of the exponent). Most sorted datasets, the image has a length and width around 300 resolution. Therefore, one should find a 7 x (2 to the power of the exponent) and take the resolution around 300 as input, where 7 x2 4 Equal to 112,7 x2 5 Equal to 224,7 x2 6 Equal to 448 and the closest to 300 is 224.
Step S304, performing image normalization processing on the third visible light image and the third near-infrared image, respectively, to obtain the first visible light image and the first near-infrared image, respectively.
Image normalization processing is carried out on the face regions of the third visible light image and the third near-infrared image respectively. Specifically, each pixel in the third visible light image and the third near-infrared image is normalized in turn as follows: the pixel value of each pixel is reduced by 128 and then divided by 256, so that the pixel value of each pixel lies in [-0.5, 0.5].
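This normalization can be sketched in a single line, assuming NumPy arrays of 8-bit pixel values; the function name is illustrative.

```python
import numpy as np

def normalize_image(image: np.ndarray) -> np.ndarray:
    """Subtract 128 from every pixel and divide by 256, mapping pixel values
    from [0, 255] into approximately [-0.5, 0.5]."""
    return (image.astype(np.float32) - 128.0) / 256.0
```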
In the above embodiment, a second visible light image and a second near-infrared image including a human face object are acquired; detecting the face key points of the second visible light image and the second near-infrared image to respectively obtain visible light face key points and near-infrared face key points; aligning a face region in the second visible light image and a face region in the second near-infrared image based on the visible light face key points and the near-infrared face key points to respectively obtain a third visible light image and a third near-infrared image; and respectively carrying out image normalization processing on the third visible light image and the third near-infrared image to respectively obtain a first visible light image and a first near-infrared image. The acquisition of the first visible light image and the first near-infrared image containing the same human face object is realized.
Referring to fig. 4a, a schematic flow chart of the training method for a human face living body detection model provided by the embodiment of the present disclosure includes the following steps:
step S401, acquiring a visible light sample image and a near-infrared sample image containing the same sample face object, and acquiring a label of the sample face object, wherein the label represents whether the sample face object is a living body face;
step S402, extracting the characteristics of the visible light sample image and the near-infrared sample image by utilizing a visible light characteristic extraction network and a near-infrared characteristic extraction network of a human face living body detection model to respectively obtain a first sample visible light characteristic and a first sample near-infrared characteristic;
step S403, fusing a second sample visible light feature and a second sample near-infrared feature to obtain a sample fusion feature, wherein the first sample visible light feature comprises the second sample visible light feature, and the first sample near-infrared feature comprises the second sample near-infrared feature;
step S404, predicting the sample fusion characteristics, the third sample visible light characteristics and the third sample near-infrared characteristics by utilizing each classification network of the face living body detection model, and respectively obtaining a fusion living body prediction result, a visible light living body prediction result and a near-infrared living body prediction result of the sample face object, wherein the first sample visible light characteristics comprise the third sample visible light characteristics, and the first sample near-infrared characteristics comprise the third sample near-infrared characteristics;
step S405, determining a human face living body prediction result of the sample human face object according to the fusion living body prediction result, the visible light living body prediction result and the near-infrared living body prediction result;
and step S406, adjusting parameters of a human face living body detection model according to the human face living body prediction result of the sample human face object and the label.
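For illustration, steps S401 to S406 correspond to a standard supervised training step. The sketch below assumes PyTorch and an assumed model interface that returns the per-level fused logits together with the visible light and near-infrared logits; the way the individual losses are combined is simplified here (a weighted combination is given later in the description).

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, rgb_batch, nir_batch, labels):
    """One parameter-update step of the face living body detection model:
    forward pass on both modalities, cross-entropy losses for the fused,
    visible light and near-infrared predictions, backward pass, update."""
    fused_logits_list, rgb_logits, nir_logits = model(rgb_batch, nir_batch)
    loss = sum(F.cross_entropy(logits, labels) for logits in fused_logits_list)
    loss = loss + F.cross_entropy(rgb_logits, labels) + F.cross_entropy(nir_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```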
Referring to fig. 4b, a schematic structural diagram of a face live detection network based on a feature pyramid network according to an embodiment of the present disclosure is shown.
Acquiring a second visible light image (RGB) and a second near infrared image (NIR) containing the human face object;
the second visible light image and the second near-infrared image are the first visible light image and the first near-infrared image before processing. The specific analysis is the same as the analysis process of the human face living body detection method, and the detailed description is omitted here.
Performing face key point detection on the second visible light image and the second near-infrared image to respectively obtain a visible light face key point and a near-infrared face key point;
the specific analysis is the same as the analysis process of the human face living body detection method, and the detailed description is omitted here.
Aligning a face region in the second visible light image and a face region in the second near-infrared image based on the visible light face key points and the near-infrared face key points to respectively obtain a third visible light image and a third near-infrared image;
the specific analysis is the same as the analysis process of the human face living body detection method, and the detailed description is omitted here.
And respectively carrying out image normalization processing on the third visible light image and the third near-infrared image to respectively obtain the first visible light image and the first near-infrared image.
Specifically, each pixel in the third visible light image and the third near-infrared image is normalized in turn: the pixel value of each pixel is reduced by 128 and then divided by 256, so that the pixel value of each pixel lies in [-0.5, 0.5]. Random data enhancement processing is then performed on the normalized visible light image and near-infrared image, for example common data enhancement methods such as random flipping, random scaling and color disturbance.
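A minimal sketch of such paired augmentation is given below, assuming torchvision and PIL images. Applying the same geometric transform to both modalities, and applying the color disturbance only to the visible light image, are assumptions; the text only names random flipping, random scaling and color disturbance.

```python
import random
import torchvision.transforms.functional as TF

def paired_augment(rgb_img, nir_img):
    """Random horizontal flip and random scaling applied identically to the
    visible light and near-infrared images, plus brightness jitter on the
    visible light image as a simple form of color disturbance."""
    if random.random() < 0.5:
        rgb_img, nir_img = TF.hflip(rgb_img), TF.hflip(nir_img)
    scale = random.uniform(0.9, 1.1)
    new_size = [int(rgb_img.height * scale), int(rgb_img.width * scale)]
    rgb_img, nir_img = TF.resize(rgb_img, new_size), TF.resize(nir_img, new_size)
    rgb_img = TF.adjust_brightness(rgb_img, random.uniform(0.8, 1.2))
    return rgb_img, nir_img
```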
Extracting a first visible light feature from the first visible light image;
extracting a first near-infrared feature from a first near-infrared image, wherein the first visible light image and the first near-infrared image contain the same human face object;
the specific analysis is the same as the analysis process of the human face living body detection method, and the detailed description is omitted here.
Fusing a second visible light feature and a second near-infrared feature to obtain a fused feature, wherein the first visible light feature comprises the second visible light feature, and the first near-infrared feature comprises the second near-infrared feature;
the specific analysis is the same as the analysis process of the human face living body detection method, and the detailed description is omitted here.
Determining a fusion living body detection result of the face object according to the fusion characteristics;
the specific analysis is the same as the analysis process of the human face living body detection method, and the detailed description is omitted here.
Respectively determining a visible light living body detection result and a near-infrared living body detection result according to a third visible light characteristic and a third near-infrared characteristic, wherein the first visible light characteristic comprises the third visible light characteristic, and the first near-infrared characteristic comprises the third near-infrared characteristic;
the specific analysis is the same as the analysis process of the human face living body detection method, and the detailed description is omitted here.
And determining a human face living body detection result according to the fusion living body detection result, the visible light living body detection result and the near-infrared living body detection result.
The specific analysis is the same as the analysis process of the human face living body detection method, and the detailed description is omitted here.
Inputting the first visible light image into a visible light feature extraction network, and acquiring at least two first visible light features according to the output of at least two preset convolution layer regions of the visible light feature extraction network;
the specific analysis is the same as the analysis process of the human face living body detection method, and the detailed description is omitted here.
Inputting the first near-infrared image into a near-infrared feature extraction network, and acquiring at least two first near-infrared features according to the output of at least two preset convolution layer areas of the near-infrared feature extraction network;
the specific analysis is the same as the analysis process of the human face living body detection method, and the detailed description is omitted here.
Wherein the human face living body detection model comprises: the visible light feature extraction network and the near infrared feature extraction network.
The specific analysis is the same as the analysis process of the human face living body detection method, and the detailed description is omitted here.
The third visible light feature is a first visible light feature output by a last preset convolution layer region of the visible light feature extraction network; the second visible light characteristic comprises the third visible light characteristic;
the third near-infrared feature is a first near-infrared feature output by the last preset convolution layer region of the near-infrared feature extraction network; the second near-infrared feature comprises the third near-infrared feature;
the fusion features comprise target fusion features, and the target fusion features are obtained by fusing third near-infrared features and third visible light features.
The specific analysis is the same as the analysis process of the human face living body detection method, and the detailed description is omitted here.
The second visible light feature further comprises: an ith visible light feature, which is a visible light feature output by an ith preset convolution layer region of the visible light feature extraction network, wherein i is an integer from 1 to n-1, n is the number of preset convolution layer regions in the visible light feature extraction network, and n is an integer greater than 1;
the second near-infrared feature further comprises: the ith near-infrared feature is a near-infrared feature output by an ith preset convolutional layer region of the near-infrared feature extraction network;
the fusion feature further comprises: the ith fusion feature is obtained by fusing the ith near-infrared feature, the ith visible light feature and the (i + 1) th fusion feature, and the nth fusion feature is the target fusion feature;
the fusion in vivo detection result comprises: the living body detection result of the ith fusion characteristic and the living body detection result of the target fusion characteristic.
In an example, when the value of n is 4, the visible light feature extraction network may include a first preset convolutional layer region (Block 1), a second preset convolutional layer region (Block 2), a third preset convolutional layer region (Block 3), and a fourth preset convolutional layer region (Block 4); the near-infrared feature extraction network may include a first preset convolutional layer region (Block 1), a second preset convolutional layer region (Block 2), a third preset convolutional layer region (Block 3), and a fourth preset convolutional layer region (Block 4).
The features output by the fourth preset convolution layer region of the visible light feature extraction network and by the fourth preset convolution layer region of the near-infrared feature extraction network are added to obtain the fourth fusion feature, namely the target fusion feature. The features output by the third preset convolution layer regions of the visible light and near-infrared feature extraction networks are added, and the fourth fusion feature is further added, to obtain the third fusion feature. The features output by the second preset convolution layer regions of the two networks are added, and the third fusion feature is further added, to obtain the second fusion feature. The features output by the first preset convolution layer regions of the two networks are added, and the second fusion feature is further added, to obtain the first fusion feature. For each preset convolution layer region, the visible light feature map and the corresponding near-infrared feature map have the same scale, while the feature maps corresponding to the first, second, third and fourth fusion features have different scales.
A feature pyramid network is adopted: the deeper fusion features, starting from the bottommost target fusion feature, are up-sampled and fused with the features of the shallower layers.
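As an illustrative, non-limiting sketch of the block-wise additive fusion and up-sampling described above, the following PyTorch-style code fuses per-block visible light and near-infrared features from the bottom up. The function name, the bilinear interpolation and the assumption that corresponding blocks of the two networks output feature maps of matching shape are assumptions of this sketch, not details fixed by the disclosure.

```python
import torch.nn.functional as F

def fuse_block_features(vis_feats, nir_feats):
    """Bottom-up additive fusion of per-block visible light and near-infrared features.

    vis_feats / nir_feats: lists of tensors [B, C_i, H_i, W_i], the outputs of
    Block1..Block4 of the two feature extraction networks (assumed to have the
    same shape per block). Returns [f1, f2, f3, f4]; f4 is the target fusion feature.
    """
    n = len(vis_feats)                                  # n = 4 in the example above
    fused = [None] * n
    # Target fusion feature: element-wise sum of the deepest block outputs.
    fused[n - 1] = vis_feats[n - 1] + nir_feats[n - 1]
    # ith fusion feature = ith visible + ith near-infrared + up-sampled (i+1)th fusion feature.
    for i in range(n - 2, -1, -1):
        deeper = F.interpolate(fused[i + 1], size=vis_feats[i].shape[-2:],
                               mode="bilinear", align_corners=False)
        # Assumes matching channel counts across blocks; in practice a 1x1 convolution
        # (not specified by the disclosure) would reconcile differing channel widths.
        fused[i] = vis_feats[i] + nir_feats[i] + deeper
    return fused
```

In this sketch, fused[3] plays the role of the target fusion feature and fused[0] to fused[2] play the roles of the first to third fusion features.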
A prediction is made on the fusion feature corresponding to each preset convolution layer region: each fusion feature is connected to a binary classification network, which judges whether the face object is a real (living) face, so that a fusion living body detection result of the face object is obtained for each preset convolution layer region. In one example, the first fusion feature, the second fusion feature, the third fusion feature and the target fusion feature are connected to four branches, each branch is supervised by a binary classifier, and four corresponding cross-entropy losses L1, L2, L3 and L4 are output; the fusion living body detection result of the face object is obtained based on these cross-entropy losses. In one example, the fusion living body detection result may include the living body detection results of the first, second and third fusion features and the living body detection result of the target fusion feature.
The face living body detection result of the face object is determined by combining the fusion living body detection result, the visible light living body detection result and the near-infrared living body detection result.
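The concrete combination rule is not restated in this passage. Purely for illustration, the snippet below assumes a simple average of the three live-probability scores with a fixed threshold; any other combination rule could be substituted.

```python
def combine_liveness_scores(p_fusion, p_visible, p_nir, threshold=0.5):
    """Illustrative combination of the three live-probability scores by averaging.

    Averaging and the 0.5 threshold are assumptions made for this sketch only;
    the disclosure states that the three results are combined but does not fix
    the rule here.
    """
    score = (p_fusion + p_visible + p_nir) / 3.0
    return score >= threshold, score
```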
The overall training loss function is Loss = W·(L1 + L2 + L3 + L4) + L5 + L6, where W is a weight; W = 0.2 is used in the embodiment of the present disclosure. The face living body detection model is optimized with this loss until it converges.
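The per-branch supervision and the overall loss can be sketched as follows. The global-average-pooling head design is an assumption, and L5 and L6 are assumed here to be the cross-entropy losses of the visible light branch and the near-infrared branch; only the form Loss = W·(L1 + L2 + L3 + L4) + L5 + L6 with W = 0.2 is taken from the text above.

```python
import torch.nn as nn

class BinaryHead(nn.Module):
    """Two-class supervision head over one feature map (assumed design:
    global average pooling followed by a linear classifier)."""

    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, 2)

    def forward(self, feat):                  # feat: [B, C, H, W]
        pooled = feat.mean(dim=(2, 3))        # global average pooling -> [B, C]
        return self.fc(pooled)                # two-class logits

def total_loss(fusion_logits, vis_logits, nir_logits, labels, w=0.2):
    """Loss = W*(L1 + L2 + L3 + L4) + L5 + L6 with W = 0.2, as stated above.

    fusion_logits: list of four logit tensors from the four fusion-feature branches
    (giving L1..L4). vis_logits / nir_logits are the visible light and near-infrared
    branch logits, assumed here to correspond to L5 and L6.
    """
    ce = nn.CrossEntropyLoss()
    l_fusion = sum(ce(logits, labels) for logits in fusion_logits)
    return w * l_fusion + ce(vis_logits, labels) + ce(nir_logits, labels)
```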
In this embodiment, the second sample visible light feature and the second sample near-infrared feature are fused to obtain a sample fusion feature; the sample fusion feature, the third sample visible light feature and the third sample near-infrared feature are each predicted to obtain a fusion living body prediction result, a visible light living body prediction result and a near-infrared living body prediction result of the sample face object; the face living body prediction result of the sample face object is determined from these three prediction results; and finally, the parameters of the face living body detection model are adjusted according to the face living body prediction result and the label of the sample face object. Compared with the prior art, in which face living body detection is performed using only the visible light feature and the near-infrared feature, the embodiment of the present disclosure adds the fusion feature, and hence a fusion living body detection result, to the face living body detection process, thereby improving the accuracy of face living body detection.
Based on the same concept as the face living body detection method described above, an embodiment of the present disclosure further provides a face living body detection apparatus. Fig. 5 is a schematic structural diagram of a first face living body detection apparatus provided in an embodiment of the present disclosure, which includes:
a first feature extraction module 510, configured to extract a first visible light feature from the first visible light image;
a second feature extraction module 520, configured to extract a first near-infrared feature from a first near-infrared image, where the first visible light image and the first near-infrared image contain a same face object;
a feature fusion module 530, configured to fuse a second visible light feature and a second near-infrared feature to obtain a fused feature, where the first visible light feature includes the second visible light feature, and the first near-infrared feature includes the second near-infrared feature;
a first determining module 540, configured to determine a fused living body detection result of the face object according to the fused feature;
a second determining module 550, configured to determine a visible light living body detection result and a near-infrared living body detection result according to a third visible light feature and a third near-infrared feature, respectively, where the first visible light feature includes the third visible light feature, and the first near-infrared feature includes the third near-infrared feature;
and a third determining module 560, configured to determine a face in-vivo detection result according to the fusion in-vivo detection result, the visible light in-vivo detection result, and the near-infrared in-vivo detection result.
In the above embodiment, the second visible light feature and the second near-infrared feature are fused to obtain a fused feature; a fusion living body detection result of the face object is determined according to the fused feature; a visible light living body detection result and a near-infrared living body detection result are determined according to the third visible light feature and the third near-infrared feature, respectively; and finally, a face living body detection result is determined according to the fusion living body detection result, the visible light living body detection result and the near-infrared living body detection result. Compared with the prior art, in which face living body detection is performed using only the visible light feature and the near-infrared feature, the embodiment of the present disclosure adds the fusion feature, and hence a fusion living body detection result, to the face living body detection process, thereby improving the accuracy of face living body detection.
Fig. 6 is a schematic structural diagram of a second face living body detection apparatus provided in an embodiment of the present disclosure, in which the first feature extraction module 510 and the second feature extraction module 520 of fig. 5 are further refined. The first feature extraction module 510 includes:
a first feature extraction sub-module 610, configured to input the first visible light image into a visible light feature extraction network, and obtain at least two first visible light features according to outputs of at least two preset convolution layer regions of the visible light feature extraction network;
the second feature extraction module 520 includes:
a second feature extraction submodule 620, configured to input the first near-infrared image into a near-infrared feature extraction network, and obtain at least two first near-infrared features according to the outputs of at least two preset convolution layer regions of the near-infrared feature extraction network;
wherein the human face living body detection model comprises: the visible light feature extraction network and the near infrared feature extraction network.
In the above embodiment, the at least two first visible light features are obtained by inputting the first visible light image into the visible light feature extraction network and taking the outputs of at least two preset convolution layer regions of that network; the at least two first near-infrared features are obtained by inputting the first near-infrared image into the near-infrared feature extraction network and taking the outputs of at least two preset convolution layer regions of that network. This realizes the extraction of the first visible light features and the first near-infrared features. Since both feature extraction networks comprise a plurality of preset convolution layer regions, each region of the visible light network outputs one first visible light feature and each region of the near-infrared network outputs one first near-infrared feature, so that a plurality of features is extracted and feature diversity is achieved.
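A minimal sketch of a feature extraction network that exposes the output of every preset convolution layer region is shown below. The actual backbone (layer counts, channel widths, strides) is not fixed by this disclosure, so the toy blocks here are illustrative only; the same structure can serve as either the visible light or the near-infrared feature extraction network.

```python
import torch.nn as nn

class MultiBlockBackbone(nn.Module):
    """Toy feature extraction network with four 'preset convolution layer regions'.

    The real backbone is not fixed by this disclosure; this stub only illustrates
    returning the output of every block so that several first visible light (or
    first near-infrared) features are available for downstream fusion.
    """

    def __init__(self, in_channels=3, widths=(32, 64, 128, 256)):
        super().__init__()
        blocks, prev = [], in_channels
        for width in widths:
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, width, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(width),
                nn.ReLU(inplace=True),
            ))
            prev = width
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)      # one feature per preset convolution layer region
        return feats             # [Block1, Block2, Block3, Block4] outputs
```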
In a possible embodiment, the third visible light feature is a first visible light feature output by a last preset convolution layer region of the visible light feature extraction network; the second visible light characteristic comprises the third visible light characteristic;
the third near-infrared feature is a first near-infrared feature output by the last preset convolution layer region of the near-infrared feature extraction network; the second near-infrared features comprise the third near-infrared features;
the fusion features comprise target fusion features, and the target fusion features are obtained by fusing third near-infrared features and third visible light features.
In the above embodiment, the third visible light feature is the first visible light feature output by the last preset convolution layer region of the visible light feature extraction network; the third near-infrared feature is the first near-infrared feature output by the last preset convolution layer region of the near-infrared feature extraction network; and the target fusion feature is obtained by fusing the third near-infrared feature and the third visible light feature. This realizes the acquisition of the third visible light feature, the third near-infrared feature and the target fusion feature.
In one possible embodiment, the second visible light characteristic further includes: an ith visible light feature, which is a visible light feature output by an ith preset convolution layer region of the visible light feature extraction network, wherein i is an integer from 1 to n-1, n is the number of preset convolution layer regions in the visible light feature extraction network, and n is an integer greater than 1;
the second near-infrared feature further comprises: an ith near-infrared feature, which is a near-infrared feature output by an ith preset convolutional layer region of the near-infrared feature extraction network;
the fusion feature further comprises: the ith fusion feature is obtained by fusing the ith near-infrared feature, the ith visible light feature and the (i + 1) th fusion feature, and the nth fusion feature is the target fusion feature;
the fusion living body detection result comprises: the living body detection result of the ith fusion feature and the living body detection result of the target fusion feature.
In the above embodiment, the target fusion feature and the non-target fusion features are acquired, and the living body detection results of the non-target fusion features and of the target fusion feature are obtained accordingly.
Fig. 7 is a schematic structural diagram of a third example of a living human face detection apparatus provided in the embodiment of the present disclosure, where the apparatus further includes:
a first image obtaining module 710, configured to obtain a second visible light image and a second near-infrared image that include the face object;
a face key point obtaining module 720, configured to perform face key point detection on the second visible light image and the second near-infrared image, and obtain a visible light face key point and a near-infrared face key point respectively;
a second image obtaining module 730, configured to align a face region in the second visible light image and a face region in the second near-infrared image based on the visible light face key point and the near-infrared face key point, and obtain a third visible light image and a third near-infrared image respectively;
the third image obtaining module 740 is configured to perform image normalization processing on the third visible light image and the third near-infrared image, respectively, to obtain the first visible light image and the first near-infrared image, respectively.
In the above embodiment, a second visible light image and a second near-infrared image including a human face object are acquired; detecting the face key points of the second visible light image and the second near-infrared image to respectively obtain visible light face key points and near-infrared face key points; aligning a face region in the second visible light image and a face region in the second near-infrared image based on the visible light face key points and the near-infrared face key points to respectively obtain a third visible light image and a third near-infrared image; and respectively carrying out image normalization processing on the third visible light image and the third near-infrared image to respectively obtain a first visible light image and a first near-infrared image. The acquisition of the first visible light image and the first near-infrared image containing the same human face object is realized.
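As a non-authoritative sketch of the alignment and normalization steps, the OpenCV-based snippet below maps detected face key points onto a reference layout with a similarity transform and then normalizes the pixel values. The reference key-point layout, the 224x224 crop size and the [-1, 1] normalization are assumptions; the key-point detector itself is not shown and can be any face landmark detector. The same procedure would be applied to the visible light image and to the near-infrared image.

```python
import cv2
import numpy as np

def align_and_normalize(image, keypoints, ref_points, size=(224, 224)):
    """Align a face image to reference key-point positions and normalize pixel values.

    `keypoints` are the detected face key points of this image (the detector itself
    is not shown). `ref_points`, the 224x224 output size and the [-1, 1] pixel
    normalization are illustrative assumptions, not values fixed by the disclosure.
    """
    # Similarity transform mapping the detected key points onto the reference layout.
    matrix, _ = cv2.estimateAffinePartial2D(
        np.asarray(keypoints, dtype=np.float32),
        np.asarray(ref_points, dtype=np.float32))
    aligned = cv2.warpAffine(image, matrix, size)
    # Image normalization to zero-centred floating point values.
    return aligned.astype(np.float32) / 127.5 - 1.0
```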
Fig. 8 is a schematic structural diagram of a face in-vivo detection model training device provided in an embodiment of the present disclosure, where the device includes:
a sample image and label acquiring module 810, configured to acquire a visible light sample image and a near-infrared sample image that include the same sample face object, and acquire a label of the sample face object, where the label indicates whether the sample face object is a living body face;
a sample feature extraction module 820, configured to perform feature extraction on the visible light sample image and the near-infrared sample image by using a visible light feature extraction network and a near-infrared feature extraction network of a human face living body detection model, so as to obtain a first sample visible light feature and a first sample near-infrared feature respectively;
a sample feature fusion module 830, configured to fuse a second sample visible light feature and a second sample near-infrared feature to obtain a sample fusion feature, where the first sample visible light feature includes the second sample visible light feature, and the first sample near-infrared feature includes the second sample near-infrared feature;
a prediction result obtaining module 840, configured to predict, by using each classification network of the face living body detection model, the sample fusion feature, a third sample visible light feature, and a third sample near-infrared feature, and obtain a fusion living body prediction result, a visible light living body prediction result, and a near-infrared living body prediction result of the sample face object, respectively, where the first sample visible light feature includes the third sample visible light feature, and the first sample near-infrared feature includes the third sample near-infrared feature;
a determining module 850, configured to determine a living human face prediction result of the sample human face object according to the fusion living body prediction result, the visible light living body prediction result, and the near-infrared living body prediction result;
and a parameter adjusting module 860, configured to adjust parameters of the living human face detection model according to the living human face prediction result of the sample human face object and the label.
In this embodiment, the second sample visible light feature and the second sample near-infrared feature are fused to obtain a sample fusion feature; the sample fusion feature, the third sample visible light feature and the third sample near-infrared feature are each predicted to obtain a fusion living body prediction result, a visible light living body prediction result and a near-infrared living body prediction result of the sample face object; the face living body prediction result of the sample face object is determined from these three prediction results; and finally, the parameters of the face living body detection model are adjusted according to the face living body prediction result and the label of the sample face object. Compared with the prior art, in which face living body detection is performed using only the visible light feature and the near-infrared feature, the embodiment of the present disclosure adds the fusion feature, and hence a fusion living body detection result, to the face living body detection process, thereby improving the accuracy of face living body detection.
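Putting the pieces together, one parameter-adjustment step of the training loop could look like the sketch below. It reuses the illustrative total_loss function from the earlier sketch and assumes that the model returns the fusion-branch logits together with the visible light and near-infrared branch logits; the optimizer choice and batch format are not specified by the disclosure.

```python
def train_step(model, optimizer, vis_batch, nir_batch, labels):
    """One parameter-adjustment step of the face living body detection model.

    Assumes `model(vis_batch, nir_batch)` returns the fusion-branch logits together
    with the visible light and near-infrared branch logits, matching the illustrative
    `total_loss` sketch given earlier; the optimizer is any torch optimizer.
    """
    optimizer.zero_grad()
    fusion_logits, vis_logits, nir_logits = model(vis_batch, nir_batch)
    loss = total_loss(fusion_logits, vis_logits, nir_logits, labels)
    loss.backward()      # gradients of the prediction-vs-label loss
    optimizer.step()     # adjust the model parameters
    return loss.item()
```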
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of the users involved comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
It should be noted that the head model in this embodiment is not a head model for a specific user, and cannot reflect personal information of a specific user.
It should be noted that the two-dimensional face image in the present embodiment is from a public data set.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as the face living body detection method and the face living body detection model training method. For example, in some embodiments, the face living body detection method and the face living body detection model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above-described face living body detection method and face living body detection model training method may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (for example, by means of firmware) to perform the face living body detection method and the face living body detection model training method.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders; this is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A method for detecting a living human face comprises the following steps:
extracting a first visible light feature from the first visible light image;
extracting a first near-infrared feature from a first near-infrared image, wherein the first visible light image and the first near-infrared image contain the same human face object;
fusing a second visible light feature and a second near-infrared feature to obtain a fused feature, wherein the first visible light feature comprises the second visible light feature, and the first near-infrared feature comprises the second near-infrared feature;
determining a fusion living body detection result of the face object according to the fusion characteristics;
respectively determining a visible light living body detection result and a near-infrared living body detection result according to a third visible light characteristic and a third near-infrared characteristic, wherein the first visible light characteristic comprises the third visible light characteristic, and the first near-infrared characteristic comprises the third near-infrared characteristic;
and determining a human face living body detection result according to the fusion living body detection result, the visible light living body detection result and the near-infrared living body detection result.
2. The method of claim 1, wherein said extracting first visible light features from the first visible light image comprises:
inputting the first visible light image into a visible light feature extraction network, and acquiring at least two first visible light features according to the output of at least two preset convolution layer regions of the visible light feature extraction network;
the extracting of the first near-infrared feature from the first near-infrared image includes:
inputting the first near-infrared image into a near-infrared feature extraction network, and acquiring at least two first near-infrared features according to the output of at least two preset convolution layer areas of the near-infrared feature extraction network;
wherein the human face living body detection model comprises: the visible light feature extraction network and the near infrared feature extraction network.
3. The method of claim 2, wherein the third visible light feature is a first visible light feature output by a last preset convolutional layer region of the visible light feature extraction network; the second visible light characteristic comprises the third visible light characteristic;
the third near-infrared feature is a first near-infrared feature output by the last preset convolution layer region of the near-infrared feature extraction network; the second near-infrared feature comprises the third near-infrared feature;
the fusion features comprise target fusion features, and the target fusion features are obtained by fusing third near-infrared features and third visible light features.
4. The method of claim 3, wherein the second visible light characteristic further comprises: an ith visible light feature, which is a visible light feature output by an ith preset convolution layer region of the visible light feature extraction network, wherein i is an integer from 1 to n-1, n is the number of preset convolution layer regions in the visible light feature extraction network, and n is an integer greater than 1;
the second near-infrared feature further comprises: an ith near-infrared feature, which is a near-infrared feature output by an ith preset convolutional layer region of the near-infrared feature extraction network;
the fusion feature further comprises: the ith fusion feature is obtained by fusing the ith near-infrared feature, the ith visible light feature and the (i + 1) th fusion feature, and the nth fusion feature is the target fusion feature;
the fusion living body detection result comprises: the living body detection result of the ith fusion feature and the living body detection result of the target fusion feature.
5. The method of claim 1, wherein the method further comprises:
acquiring a second visible light image and a second near-infrared image containing the human face object;
performing face key point detection on the second visible light image and the second near-infrared image to respectively obtain a visible light face key point and a near-infrared face key point;
aligning a face region in the second visible light image and a face region in the second near-infrared image based on the visible light face key points and the near-infrared face key points to respectively obtain a third visible light image and a third near-infrared image;
and respectively carrying out image normalization processing on the third visible light image and the third near-infrared image to respectively obtain the first visible light image and the first near-infrared image.
6. A face in-vivo detection model training method comprises the following steps:
acquiring a visible light sample image and a near-infrared sample image containing the same sample face object, and acquiring a label of the sample face object, wherein the label represents whether the sample face object is a living body face;
performing feature extraction on the visible light sample image and the near-infrared sample image by using a visible light feature extraction network and a near-infrared feature extraction network of a human face living body detection model to respectively obtain a first sample visible light feature and a first sample near-infrared feature;
fusing a second sample visible light characteristic and a second sample near-infrared characteristic to obtain a sample fusion characteristic, wherein the first sample visible light characteristic comprises the second sample visible light characteristic, and the first sample near-infrared characteristic comprises the second sample near-infrared characteristic;
predicting the sample fusion characteristics, the third sample visible light characteristics and the third sample near-infrared characteristics by utilizing each classification network of the face living body detection model, and respectively obtaining a fusion living body prediction result, a visible light living body prediction result and a near-infrared living body prediction result of the sample face object, wherein the first sample visible light characteristics comprise the third sample visible light characteristics, and the first sample near-infrared characteristics comprise the third sample near-infrared characteristics;
determining a human face living body prediction result of the sample human face object according to the fusion living body prediction result, the visible light living body prediction result and the near-infrared living body prediction result;
and adjusting parameters of a human face living body detection model according to the human face living body prediction result of the sample human face object and the label.
7. An apparatus for in vivo detection of a human face, comprising:
the first characteristic extraction module is used for extracting first visible light characteristics from the first visible light image;
the second feature extraction module is used for extracting a first near-infrared feature from a first near-infrared image, wherein the first visible light image and the first near-infrared image contain the same human face object;
the feature fusion module is used for fusing a second visible light feature and a second near-infrared feature to obtain a fused feature, wherein the first visible light feature comprises the second visible light feature, and the first near-infrared feature comprises the second near-infrared feature;
the first determining module is used for determining a fusion living body detection result of the face object according to the fusion characteristics;
a second determining module, configured to determine a visible light living body detection result and a near-infrared living body detection result according to a third visible light feature and a third near-infrared feature, respectively, where the first visible light feature includes the third visible light feature, and the first near-infrared feature includes the third near-infrared feature;
and the third determining module is used for determining the human face living body detection result according to the fusion living body detection result, the visible light living body detection result and the near-infrared living body detection result.
8. The apparatus of claim 7, wherein the first feature extraction module comprises:
the first characteristic extraction submodule is used for inputting the first visible light image into a visible light characteristic extraction network and acquiring at least two first visible light characteristics according to the output of at least two preset convolution layer areas of the visible light characteristic extraction network;
the second feature extraction module includes:
the second characteristic extraction submodule is used for inputting the first near-infrared image into a near-infrared characteristic extraction network and acquiring at least two first near-infrared characteristics according to the output of at least two preset convolution layer areas of the near-infrared characteristic extraction network;
wherein the human face living body detection model comprises: the visible light feature extraction network and the near infrared feature extraction network.
9. The apparatus of claim 8, wherein the third visible light feature is a first visible light feature output by a last pre-set convolution layer region of the visible light feature extraction network; the second visible light characteristic comprises the third visible light characteristic;
the third near-infrared feature is a first near-infrared feature output by the last preset convolution layer region of the near-infrared feature extraction network; the second near-infrared features comprise the third near-infrared features;
the fusion features comprise target fusion features, and the target fusion features are obtained by fusing third near-infrared features and third visible light features.
10. The apparatus of claim 9, wherein the second visible light characteristic further comprises: an ith visible light feature, which is a visible light feature output by an ith preset convolution layer region of the visible light feature extraction network, wherein i is an integer from 1 to n-1, n is the number of preset convolution layer regions in the visible light feature extraction network, and n is an integer greater than 1;
the second near-infrared feature further comprises: the ith near-infrared feature is a near-infrared feature output by an ith preset convolutional layer region of the near-infrared feature extraction network;
the fusion feature further comprises: the ith fusion feature is obtained by fusing the ith near-infrared feature, the ith visible light feature and the (i + 1) th fusion feature, and the nth fusion feature is the target fusion feature;
the fusion living body detection result comprises: the living body detection result of the ith fusion feature and the living body detection result of the target fusion feature.
11. The apparatus of claim 7, wherein the apparatus further comprises:
the first image acquisition module is used for acquiring a second visible light image and a second near-infrared image which contain the human face object;
a face key point acquisition module, configured to perform face key point detection on the second visible light image and the second near-infrared image, and obtain a visible light face key point and a near-infrared face key point respectively;
a second image acquisition module, configured to align a face region in the second visible light image and a face region in the second near-infrared image based on the visible light face key point and the near-infrared face key point, and obtain a third visible light image and a third near-infrared image respectively;
and the third image acquisition module is used for respectively carrying out image normalization processing on the third visible light image and the third near-infrared image to respectively obtain the first visible light image and the first near-infrared image.
12. A face liveness detection model training device, the device comprising:
a sample image and label acquisition module, which is used for acquiring a visible light sample image and a near-infrared sample image which contain the same sample face object and acquiring a label of the sample face object, wherein the label represents whether the sample face object is a living body face;
the sample feature extraction module is used for extracting features of the visible light sample image and the near-infrared sample image by utilizing a visible light feature extraction network and a near-infrared feature extraction network of the human face living body detection model to respectively obtain a first sample visible light feature and a first sample near-infrared feature;
the sample characteristic fusion module is used for fusing a second sample visible light characteristic and a second sample near-infrared characteristic to obtain a sample fusion characteristic, wherein the first sample visible light characteristic comprises the second sample visible light characteristic, and the first sample near-infrared characteristic comprises the second sample near-infrared characteristic;
the prediction result acquisition module is used for predicting the sample fusion characteristics, the third sample visible light characteristics and the third sample near-infrared characteristics by utilizing each classification network of the face living body detection model to respectively obtain fusion living body prediction results, visible light living body prediction results and near-infrared living body prediction results of the sample face object, wherein the first sample visible light characteristics comprise the third sample visible light characteristics, and the first sample near-infrared characteristics comprise the third sample near-infrared characteristics;
the determining module is used for determining a human face living body prediction result of the sample human face object according to the fusion living body prediction result, the visible light living body prediction result and the near-infrared living body prediction result;
and the parameter adjusting module is used for adjusting parameters of the human face living body detection model according to the human face living body prediction result of the sample human face object and the label.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-6.
CN202210800553.XA 2022-07-08 2022-07-08 Face living body detection method and device, electronic equipment and storage medium Pending CN115147937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210800553.XA CN115147937A (en) 2022-07-08 2022-07-08 Face living body detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210800553.XA CN115147937A (en) 2022-07-08 2022-07-08 Face living body detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115147937A true CN115147937A (en) 2022-10-04

Family

ID=83411903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210800553.XA Pending CN115147937A (en) 2022-07-08 2022-07-08 Face living body detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115147937A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination