CN116311549A - Living body object identification method, apparatus, and computer-readable storage medium

Living body object identification method, apparatus, and computer-readable storage medium

Info

Publication number
CN116311549A
Authority
CN
China
Prior art keywords
living
image
living body
determining
detection value
Prior art date
Legal status
Pending
Application number
CN202310265731.8A
Other languages
Chinese (zh)
Inventor
徐静涛
冯昊
安耀祖
张超
单言虎
兪炳仁
韩在濬
崔昌圭
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd
Priority to CN202310265731.8A
Publication of CN116311549A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/12 Fingerprints or palmprints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20112 Image segmentation details
    • G06T2207/20132 Image cropping

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The present disclosure proposes a method, apparatus, and computer-readable storage medium for living body target detection. The method comprises the following steps: determining a first living body detection value of a target image based on a category of the target image; and determining whether a target in the target image is a living target according to the first living body detection value of the target image.

Description

Living body object identification method, apparatus, and computer-readable storage medium
Divisional Application Statement
The present application is a divisional application of the invention patent application with filing date March 27, 2017, application number 201710192070.5, and title "Living body object identification method, apparatus, and computer-readable storage medium".
Technical Field
The present disclosure relates generally to the field of image recognition, and more particularly to a method, apparatus, and computer-readable storage medium for recognizing living objects in images.
Background
With the progress of technology, face recognition systems have been widely used in fields such as access control, security, mobile device unlocking, mobile payment, and image processing (e.g., beautification). However, spoofing attacks of various types pose a serious threat to face recognition systems. Common spoofing attacks include, for example, paper print image attacks, photo image attacks, screen video attacks, and 3D print attacks. These attacks fool the face recognition system by presenting a non-living copy of the object (e.g., a printed photo, a mobile phone screen image, a computer screen image, etc.), thereby gaining privileges that the attacker should not have.
Therefore, how to obtain stable and effective features for judging whether a subject is a living body has long been a major and difficult problem in living body detection research. Living body detection methods can be broadly divided into two categories, depending on whether user cooperation is required: (1) intrusive living body detection methods; and (2) non-intrusive living body detection methods.
Intrusive living body detection methods typically require the user's active cooperation. The user must perform corresponding actions, such as blinking, shaking the head, or smiling, when prompted by the software; the detection system then recognizes these actions and uses them as the basis for living body detection. In practical applications, such methods suffer from long detection times, unfriendly interaction, poor user experience, and similar drawbacks.
Non-intrusive living body detection methods generally do not require any special action from the user; instead, they directly extract features from image or video information acquired by the device. These methods mainly draw on experience from the related computer vision and image processing research fields to design hand-crafted algorithms for extracting features from images or videos, and the same feature extraction is applied regardless of device or application scenario. Such methods rely heavily on the designer's ability and are often not robust to varied and complex real-world scenarios.
Disclosure of Invention
To at least partially solve or mitigate the above-described problems, methods, apparatuses, and computer-readable storage media for identifying a living object in an image according to embodiments of the present disclosure are provided.
According to a first aspect of the present disclosure, a method for identifying a living object in an image is provided. The method comprises: determining a candidate object in the image; determining, from one or more sub-images in the image that are related to the candidate object, one or more first living body detection values respectively corresponding to the one or more sub-images; and determining whether the candidate object is a living object based on the one or more first living body detection values.
In some embodiments, the one or more sub-images include at least one of: a sub-image including only the candidate object; a sub-image including only a part of the candidate object; and a sub-image including the candidate object and the background of the candidate object. In some embodiments, each first living body detection value is determined for a respective sub-image using a respective first convolutional neural network. In some embodiments, each first convolutional neural network comprises 6 convolutional layers, including 16 3x3x3 convolution kernels, 16 3x3x16 convolution kernels, 32 3x3x32 convolution kernels, 64 3x3x32 convolution kernels, and 64 3x3x64 convolution kernels, respectively. In some embodiments, in each first convolutional neural network, a batch normalization layer, a rectified linear unit layer, and a pooling layer are further included after at least one convolutional layer and before the next convolutional layer. In some embodiments, in each first convolutional neural network, no rectified linear unit layer and no pooling layer are included after the fifth convolutional layer and before the sixth convolutional layer. In some embodiments, the step of determining whether the candidate object is a living object based on the one or more first living body detection values comprises: performing a weighted average or an arithmetic average of the one or more first living body detection values to obtain an integrated first living body detection value; and comparing the integrated first living body detection value with a preset first living body detection threshold, and determining whether the candidate object is a living object according to the comparison result. In some embodiments, the step of performing a weighted or arithmetic average of the one or more first living body detection values to obtain the integrated first living body detection value comprises calculating the integrated first living body detection value according to the following formula:
Score_pre = 0.3 × Score_CNN1 + 0.4 × Score_CNN2 + 0.3 × Score_CNN3
where Score_pre is the integrated first living body detection value, Score_CNN1 is the first living body detection value corresponding to the first sub-image, Score_CNN2 is the first living body detection value corresponding to the second sub-image, and Score_CNN3 is the first living body detection value corresponding to the third sub-image. In some embodiments, the method further comprises: if it is determined that the candidate object is a living object, determining whether the candidate object is one of one or more objects pre-stored in an object database. In some embodiments, the step of determining whether the candidate object is one of one or more objects pre-stored in an object database comprises: using a second convolutional neural network to determine whether the candidate object is one of the one or more objects pre-stored in the object database. In some embodiments, if it is determined that the candidate object is one of the one or more objects pre-stored in the object database, the method further comprises: determining one or more second living body detection values for one or more feature maps of at least one convolutional layer in the second convolutional neural network, respectively, using one or more third convolutional neural networks; and determining whether the candidate object is a living object based on the one or more second living body detection values. In some embodiments, each third convolutional neural network comprises: a convolutional layer, an average pooling layer, a first fully connected layer, and a second fully connected layer. In some embodiments, the step of determining whether the candidate object is a living object based on the one or more second living body detection values comprises: performing a weighted average or an arithmetic average of the one or more second living body detection values to obtain an integrated second living body detection value; and comparing the integrated second living body detection value with a preset second living body detection threshold, and determining whether the candidate object is a living object according to the comparison result.
According to a second aspect of the present disclosure, there is provided an apparatus for identifying a living object in an image. The apparatus comprises: a candidate object determining unit configured to determine a candidate object in the image; a first living body detection value determining unit configured to determine, from one or more sub-images in the image that are related to the candidate object, one or more first living body detection values respectively corresponding to the one or more sub-images; and a living object determining unit configured to determine whether the candidate object is a living object based on the one or more first living body detection values.
In some embodiments, the one or more sub-images include at least one of: a sub-image including only the candidate object; a sub-image including only a part of the candidate object; and a sub-image including the candidate object and the background of the candidate object. In some embodiments, each first living body detection value is determined for a respective sub-image using a respective first convolutional neural network. In some embodiments, each first convolutional neural network comprises 6 convolutional layers, including 16 3x3x3 convolution kernels, 16 3x3x16 convolution kernels, 32 3x3x32 convolution kernels, 64 3x3x32 convolution kernels, and 64 3x3x64 convolution kernels, respectively. In some embodiments, in each first convolutional neural network, a batch normalization layer, a rectified linear unit layer, and a pooling layer are further included after at least one convolutional layer and before the next convolutional layer. In some embodiments, in each first convolutional neural network, no rectified linear unit layer and no pooling layer are included after the fifth convolutional layer and before the sixth convolutional layer. In some embodiments, the living object determining unit is further configured to: perform a weighted average or an arithmetic average of the one or more first living body detection values to obtain an integrated first living body detection value; and compare the integrated first living body detection value with a preset first living body detection threshold, and determine whether the candidate object is a living object according to the comparison result. In some embodiments, the living object determining unit is further configured to calculate the integrated first living body detection value according to the following formula:
Score_pre = 0.3 × Score_CNN1 + 0.4 × Score_CNN2 + 0.3 × Score_CNN3
where Score_pre is the integrated first living body detection value, Score_CNN1 is the first living body detection value corresponding to the first sub-image, Score_CNN2 is the first living body detection value corresponding to the second sub-image, and Score_CNN3 is the first living body detection value corresponding to the third sub-image. In some embodiments, the apparatus further comprises: an object comparison unit configured to determine, if it is determined that the candidate object is a living object, whether the candidate object is one of one or more objects pre-stored in an object database. In some embodiments, the object comparison unit is further configured to: use a second convolutional neural network to determine whether the candidate object is one of the one or more objects pre-stored in the object database. In some embodiments, the apparatus further comprises: a second living body detection value determining unit configured to determine one or more second living body detection values for one or more feature maps of at least one convolutional layer in the second convolutional neural network, respectively, using one or more third convolutional neural networks; and a living object final determination unit configured to determine whether the candidate object is a living object based on the one or more second living body detection values. In some embodiments, each third convolutional neural network comprises: a convolutional layer, an average pooling layer, a first fully connected layer, and a second fully connected layer. In some embodiments, the living object final determination unit is further configured to: perform a weighted average or an arithmetic average of the one or more second living body detection values to obtain an integrated second living body detection value; and compare the integrated second living body detection value with a preset second living body detection threshold, and determine whether the candidate object is a living object according to the comparison result.
According to a third aspect of the present disclosure, there is provided an apparatus for identifying a living object in an image. The apparatus comprises: a processor; and a memory storing instructions that, when executed by the processor, cause the processor to: determine a candidate object in the image; determine, from one or more sub-images in the image that are related to the candidate object, one or more first living body detection values respectively corresponding to the one or more sub-images; and determine whether the candidate object is a living object based on the one or more first living body detection values.
In some embodiments, the one or more sub-images include at least one of: a sub-image including only the candidate object; a sub-image including only a part of the candidate object; and a sub-image including the candidate object and the background of the candidate object. In some embodiments, each first living body detection value is determined for a respective sub-image using a respective first convolutional neural network. In some embodiments, each first convolutional neural network comprises 6 convolutional layers, including 16 3x3x3 convolution kernels, 16 3x3x16 convolution kernels, 32 3x3x32 convolution kernels, 64 3x3x32 convolution kernels, and 64 3x3x64 convolution kernels, respectively. In some embodiments, in each first convolutional neural network, a batch normalization layer, a rectified linear unit layer, and a pooling layer are further included after at least one convolutional layer and before the next convolutional layer. In some embodiments, in each first convolutional neural network, no rectified linear unit layer and no pooling layer are included after the fifth convolutional layer and before the sixth convolutional layer. In some embodiments, the instructions further cause the processor to: perform a weighted average or an arithmetic average of the one or more first living body detection values to obtain an integrated first living body detection value; and compare the integrated first living body detection value with a preset first living body detection threshold, and determine whether the candidate object is a living object according to the comparison result. In some embodiments, the instructions further cause the processor to calculate the integrated first living body detection value according to the following formula:
Score_pre = 0.3 × Score_CNN1 + 0.4 × Score_CNN2 + 0.3 × Score_CNN3
where Score_pre is the integrated first living body detection value, Score_CNN1 is the first living body detection value corresponding to the first sub-image, Score_CNN2 is the first living body detection value corresponding to the second sub-image, and Score_CNN3 is the first living body detection value corresponding to the third sub-image. In some embodiments, the instructions further cause the processor to: if it is determined that the candidate object is a living object, determine whether the candidate object is one of one or more objects pre-stored in an object database. In some embodiments, the instructions further cause the processor to: use a second convolutional neural network to determine whether the candidate object is one of the one or more objects pre-stored in the object database. In some embodiments, the instructions further cause the processor to: determine one or more second living body detection values for one or more feature maps of at least one convolutional layer in the second convolutional neural network, respectively, using one or more third convolutional neural networks; and determine whether the candidate object is a living object based on the one or more second living body detection values. In some embodiments, each third convolutional neural network comprises: a convolutional layer, an average pooling layer, a first fully connected layer, and a second fully connected layer. In some embodiments, the instructions further cause the processor to: perform a weighted average or an arithmetic average of the one or more second living body detection values to obtain an integrated second living body detection value; and compare the integrated second living body detection value with a preset second living body detection threshold, and determine whether the candidate object is a living object according to the comparison result.
According to a fourth aspect of the present disclosure, there is provided a living body target detection method. The method comprises the following steps: determining a first living body detection value of a target image based on a category of the target image; and determining whether a target in the target image is a living target according to a first living body detection value of the target image.
In some embodiments, before determining the first living body detection value of the target image, the method further comprises: performing target detection on the image to be detected to obtain the target image. In some embodiments, the category of the target image includes at least one of: a global image of the target, a local image of the target, and a target image containing background information. In some embodiments, the step of determining the first living body detection value of the target image based on the category of the target image comprises determining the first living body detection value of the target image based on at least one of: global information of the target determined from the global image of the target; local texture feature information determined from the local image of the target; and background information determined from the target image containing background information. In some embodiments, the step of determining the first living body detection value of the target image based on the category of the target image comprises: for each category of target image, determining a respective first living body detection value by extracting features using a corresponding first convolutional neural network. In some embodiments, the first convolutional neural network corresponding to each category of target image is trained using images of the corresponding category. In some embodiments, the first convolutional neural network comprises at least one of: a plurality of convolutional layers, a plurality of pooling layers, a plurality of activation layers, and a plurality of batch normalization layers, with no pooling layer or activation layer between the penultimate and last convolutional layers. In some embodiments, determining whether the target in the target image is a living target based on the first living body detection value of the target image comprises: determining a second living body detection value of the target image; and determining whether the target in the target image is a living target according to the second living body detection value of the target image, or determining whether the target in the target image is a living target based on both the first living body detection value and the second living body detection value of the target image. In some embodiments, determining the second living body detection value of the target image specifically comprises: when the first living body detection value of the target image satisfies a preset condition, performing target recognition and determining the second living body detection value of the target image. In some embodiments, the second living body detection value is determined by: determining one or more second living body detection values, respectively, using one or more third convolutional neural networks for one or more feature maps of at least one convolutional layer in the second convolutional neural network used in the target recognition. In some embodiments, the third convolutional neural network comprises at least one of: a convolutional layer, a pooling layer, and a plurality of fully connected layers. In some embodiments, the target comprises a body part of a living being. In some embodiments, the body part comprises at least one of: a face, a palmprint, a fingerprint, an iris, and a limb.
According to a fifth aspect of the present disclosure, there is provided a living body target detection apparatus. The apparatus includes: a living body detection value determining unit configured to determine a first living body detection value of a target image based on a category of the target image; and a living body target determining unit configured to determine whether a target in the target image is a living body target based on a first living body detection value of the target image.
According to a sixth aspect of the present disclosure, there is provided a living body target detection apparatus. The apparatus includes: a processor; a memory, wherein instructions are stored that, when executed by the processor, cause the processor to: determining a first living body detection value of a target image based on a category of the target image; and determining whether a target in the target image is a living target according to a first living body detection value of the target image.
According to a seventh aspect of the present disclosure, there is provided a computer readable storage medium storing instructions, wherein the instructions, when executed by a processor, cause the processor to perform the method according to the first or fourth aspect of the present disclosure.
Drawings
The foregoing and other objects, features and advantages of the present disclosure will be more apparent from the following description of the preferred embodiments of the present disclosure, taken in conjunction with the accompanying drawings in which:
Fig. 1 is a flowchart illustrating an example method for identifying a living object in an image according to an embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating an example first living body detection method according to an embodiment of the present disclosure.
Fig. 3 illustrates an example image according to an embodiment of the disclosure.
Fig. 4 is a schematic diagram illustrating an example first convolutional neural network used in the example first living detection method shown in fig. 2.
Fig. 5 is a schematic diagram illustrating an example second convolutional neural network used in the example method shown in fig. 1.
Fig. 6 is a flowchart illustrating an example second in-vivo detection method according to an embodiment of the present disclosure.
Fig. 7 is a schematic diagram illustrating an example third convolutional neural network used in the example second in-vivo detection method shown in fig. 6.
Fig. 8 is a schematic effect diagram showing the use of a scheme according to an embodiment of the present disclosure.
Fig. 9 is a flowchart illustrating an example method for identifying a living object in an image according to an embodiment of the present disclosure.
Fig. 10 is a functional block diagram illustrating an example device for performing the method illustrated in fig. 9, according to an embodiment of the present disclosure.
Fig. 11 is a schematic diagram showing a hardware arrangement of an example apparatus for identifying a living object in an image according to an embodiment of the present disclosure.
Detailed Description
In the following detailed description of preferred embodiments of the present disclosure, made with reference to the accompanying drawings, details and functions that are not essential to the present disclosure are omitted so as not to obscure its understanding. In this specification, the various embodiments described below for the purpose of explaining the principles of the present disclosure are illustrative only and should not be construed in any way as limiting the scope of the disclosure. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present disclosure as defined by the claims and their equivalents. The description includes numerous specific details to aid understanding, but these details should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Furthermore, the same reference numbers are used throughout the drawings to refer to the same or similar functions and operations. Moreover, all or part of the functions, features, elements, modules, etc. described in the various embodiments below may be combined, deleted, and/or modified to create new embodiments that still fall within the scope of the present disclosure. Furthermore, in the present disclosure, the terms "include" and "comprise," as well as derivatives thereof, are intended to be inclusive rather than limiting.
Hereinafter, the present disclosure will be described in detail using a face recognition scenario as an example. The present disclosure is not limited thereto, however, and may be applied to any other suitable object recognition, for example recognition of other parts of a person, recognition of the face or other parts of other living beings, or recognition of other objects.
As described above, conventional non-intrusive living body detection methods can achieve good results for particular devices and application scenarios, but the robustness of the algorithms cannot meet the practical needs of today's wide variety of intelligent devices. In addition, the expressive power of hand-designed features is limited by the design of the algorithm itself and fails in complex real-world scenes. For example, methods based on feature extraction such as the local binary pattern (LBP) consider only the local texture information of the image. In actual living body detection, however, such as under low light or backlight conditions, such features cannot effectively distinguish a genuine image from an attack image.
In addition, most feature-learning-based methods use only a single face image for feature extraction and living body detection, ignoring other useful information such as the background information and texture information of the image. Taking the background information of the image as an example, it often contains edge information of the attack medium, such as a mobile phone screen frame or a paper boundary; such features are very effective for detecting attack images and thus for determining whether the target is a living body. Moreover, the face recognition system that the living body detection serves can itself provide effective features for living body detection, which is also neglected in the prior art.
In view of the above, the present disclosure proposes a face living body detection scheme. The scheme can at least partially solve or alleviate the above technical problems; it can be combined, as an effective face image living body detection method, with most existing face authentication methods to improve their performance, has high computational efficiency, and has broad application prospects.
The living body detection scheme involves the use of convolutional neural networks. Studies by Hubel and Wiesel in the 1950s and 1960s showed that the visual cortex of cats and monkeys contains neurons that respond individually to small regions of the visual field. Provided the eyes do not move, the region of visual space within which stimuli affect a single neuron is called the receptive field of that neuron. Adjacent neurons have similar and overlapping receptive fields, and the size and location of the receptive fields vary systematically across the cortex to form a complete map of visual space. Inspired by this research, convolutional neural networks (CNN or ConvNet), a type of feed-forward artificial neural network, were proposed in the field of machine learning. Specifically, the pattern of connections between the neurons of such a network is inspired by the visual cortex of animals: individual neurons respond to stimuli in a limited region of space, the receptive field described above; the receptive fields of different neurons partially overlap so that together they cover the entire field of view; and the response of an individual neuron to a stimulus within its receptive field can be approximated mathematically by a convolution operation. Accordingly, convolutional neural networks are widely applied in image and video recognition, recommendation (e.g., product recommendation on shopping websites), and natural language processing.
In general, the present disclosure proposes to separately train a plurality of convolutional neural networks (CNNs) on different kinds of images, such as face images, face images including background, and partial face images, to extract features and obtain a first living body detection value. These images describe the liveness information from different perspectives, thereby increasing the robustness of the detection system. In addition, the present disclosure proposes to train a small number of new convolutional layers and fully connected layers on a series of intermediate-layer features of the face matching model to obtain a second living body detection value, and to perform face living body detection with it, further enhancing the robustness of the detection system. Since these features are already generated during face recognition, little additional computation time is required of the system.
Next, an example method for identifying a living object in an image according to an embodiment of the present disclosure will be described in detail first with reference to fig. 1.
Fig. 1 is a flowchart illustrating an example method 100 for identifying a living object in an image according to an embodiment of the disclosure. As shown in fig. 1, the method 100 starts at step S110, in which an image to be checked for a living object may first be acquired. For example, the input image including the candidate object may be acquired by a camera of a mobile device, a camera of an access control system, or, more generally, any image sensor.
Next, in step S120, first living object detection may be performed. For example, multi-feature living body detection may be carried out: different kinds of images are generated from the original image in which a face was detected, these images are then fed into CNNs with similar structures to obtain a plurality of living body detection scores, and the scores are fused to obtain an integrated first living body detection value used for the decision. This step S120 will be described in detail below in conjunction with fig. 2 to 4.
Then, if the integrated first living body detection value satisfies a condition, for example if it is greater than a preset first living body detection threshold, subsequent operations may proceed; otherwise the method 100 may determine that no living object is present in the image and end directly. In some embodiments, if it is determined that no living object is present in the image, this determination may be output to a user or other entity. If it is determined that a living object is present in the image, the method 100 may proceed to step S130.
In step S130, identity recognition may be performed for the living object whose presence in the image has been preliminarily determined. For example, the determined living object may be compared with the features of one or more objects previously stored or registered in an object database to determine whether it matches one of the one or more objects. For example, in an embodiment such as an access control system, an object determined in an image from a camera may be compared with one or more objects registered in advance to determine whether it is a person allowed to enter. This step S130 will be described in detail below in conjunction with fig. 5.
Next, in step S140, second living object detection may be performed. For example, multi-feature living body detection may again be carried out: after object matching, a plurality of different second living body detection scores may be obtained and fused into an integrated second living body detection value, a decision similar to that made on the aforementioned integrated first living body detection value is then performed, and the final living body detection result is obtained in step S150. Step S140 will be described in detail below in conjunction with fig. 6 to 7.
It should be noted, however, that the steps of method 100 are not limited to being performed in the order shown in fig. 1. In fact, because the first living object detection step S120 is relatively independent of steps S130 and S140, it may be performed at any point in the method 100 after step S110 and before step S150, for example in parallel with step S130 or step S140, or after step S130 or step S140. Furthermore, although step S140 depends on a partial intermediate result of step S130 as will be seen below, step S140 may also be performed at least partially in parallel with step S130; for example, once the partial intermediate result of step S130 required by step S140 has been obtained, step S130 and step S140 may be carried out in parallel. In addition, step S150 may make a comprehensive judgment over the outputs of step S120, step S130, and/or step S140. For example, only when the result of all three steps is "yes" is a conclusion indicating that a living object exists in the image output in step S150. As another example, when the results of step S120 and step S130 are "yes", a conclusion indicating that a living object exists in the image may be output in step S150 regardless of the result of step S140. In other embodiments, in step S150, the plurality of first and second living body detection values obtained in steps S120 and S140 may be weighted-averaged or arithmetically averaged to obtain an integrated (final) living body detection value. Any other suitable combination of the outputs of these steps is also possible.
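For illustration only, the overall control flow of method 100 can be sketched as follows. The callables passed in (detect_candidate, first_liveness, match_object, second_liveness) and the two thresholds are hypothetical placeholders, not names defined by this disclosure; the sketch only mirrors the order of steps S110 to S150 described above.

```python
# Illustrative sketch of the control flow of method 100 (steps S110-S150).
# The detector/matcher callables and both thresholds are hypothetical placeholders.

def identify_living_object(image, detect_candidate, first_liveness, match_object,
                           second_liveness, th1=0.5, th2=0.5):
    candidate = detect_candidate(image)            # S110/S120: locate a candidate object (e.g., a face)
    if candidate is None:
        return "no candidate object"

    if first_liveness(image, candidate) < th1:     # S120: integrated first living body detection value
        return "not a living object"

    identity = match_object(candidate)             # S130: compare against pre-registered objects
    if identity is None:
        return "object not registered"

    if second_liveness(candidate) < th2:           # S140/S150: second detection value and final decision
        return "not a living object"
    return identity
```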
Next, flowcharts of the first living body detection method according to the embodiment of the present disclosure will be described in detail with reference to fig. 2 to 4.
Fig. 2 is a flowchart illustrating an example (first) living body detection method 200 according to an embodiment of the disclosure. The method 200 may be used as step S120 in the method 100 shown in fig. 1; however, the present disclosure is not limited thereto. In fact, the method 200 may also be used as a stand-alone method for detecting living objects in an image rather than as a sub-step of another method; for example, the method 200 may be used on its own to determine whether a candidate object in an image is a living object.
The method 200 shown in fig. 2 starts at step S210, where an input image is acquired at step S210. Then in step S220, an object recognition (detection) process may be performed with respect to the image to determine whether or not a candidate object exists in the image. This step S220 may be implemented using any currently known or future developed technique, such as deep learning, convolutional neural networks, and the like. If no candidate is detected in step S220, the method 200 may end directly. When a candidate is detected in step S220, such as when a face is detected (e.g., the leftmost "original image" of fig. 3 or any one of the images of fig. 8), the method 200 may proceed to step S230.
In step S230, one or more sub-images may be determined from the original image according to the candidate object. For example, in the embodiment shown in fig. 3, three sub-images may be determined: a "partial face image", a "face image", and a "face image including background". These three sub-images correspond, respectively, to a sub-image including only a part of the candidate object, a sub-image including only the candidate object, and a sub-image including the candidate object and its background. By operating on these three images as described below, living body detection information reflecting features at different levels of the original image can be obtained. For example, from the "face image", global information about the face may be obtained; from the "partial face image", local texture feature information of the face may be obtained; and from the "face image including background", background environment information of the face image may be obtained. In other words, a more robust living body detection result can be obtained by attending to features at different levels. However, the embodiments of the present disclosure are not limited thereto. Indeed, in other embodiments, a different number of sub-images with different framings may be employed; for example, a sub-image of a specific facial part (e.g., eyes, nose, or mouth) or a sub-image of only the background may be included, or any one or more of the above three sub-images may be omitted.
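Purely as an illustration of how such sub-images could be produced, the sketch below crops three regions around a detected face bounding box with OpenCV. The margin factors, the central-region crop, and the 128-pixel output size are assumptions of the sketch, not values prescribed by this disclosure.

```python
import cv2  # OpenCV, assumed available for cropping and resizing

def make_sub_images(image, box, size=128):
    """Crop the three sub-images of step S230 from a detected face box (x, y, w, h).
    Margin factors and output size are illustrative assumptions only."""
    x, y, w, h = box
    H, W = image.shape[:2]

    def crop(cx, cy, cw, ch):
        x0, y0 = max(0, cx), max(0, cy)
        x1, y1 = min(W, cx + cw), min(H, cy + ch)
        return cv2.resize(image[y0:y1, x0:x1], (size, size))

    face_local   = crop(x + w // 4, y + h // 4, w // 2, h // 2)   # "partial face image" (central region)
    face_only    = crop(x, y, w, h)                               # "face image"
    face_context = crop(x - w // 2, y - h // 2, 2 * w, 2 * h)     # "face image including background"
    return face_local, face_only, face_context
```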
After determining one or more sub-images in step S230, the method 200 may proceed to step S240. In step S240, for each sub-image, a corresponding first living detection value may be determined by a corresponding Convolutional Neural Network (CNN) (hereinafter, also sometimes referred to as "first convolutional neural network"), respectively. In an embodiment such as that shown in fig. 3, three CNNs may be used for three sub-images. In some embodiments, each CNN may be an example first CNN 400 as shown in fig. 4.
In the first CNN 400 shown in fig. 4, the input may be, for example, an original image of 128x128x3 (i.e., a resolution of 128x128 with 3 RGB (red, green, blue) color channels). In the CNN 400 shown in fig. 4, the original image may first be cropped to obtain a 120x120x3 input image. CNN 400 may include 6 convolutional layers, which may include 16 3x3x3 convolution kernels, 16 3x3x16 convolution kernels, 32 3x3x32 convolution kernels, 64 3x3x32 convolution kernels, and 64 3x3x64 convolution kernels, respectively. It can be seen that the number of convolution kernels in the convolutional layers increases with depth, successively 16, 32, and 64, to enhance the expressive power of the model.
Furthermore, CNN 400 may also include a batch normalization (BN) layer, a rectified linear unit (ReLU) layer, and/or a pooling layer after at least one convolutional layer and before the next convolutional layer. For example, in the CNN 400 shown in fig. 4, batch normalization layers (BN1, BN2, BN3, BN4, BN5, BN6) are provided after the first to sixth convolutional layers, respectively, so that the feature map produced by each convolutional layer is normalized. In addition, in the CNN 400 shown in fig. 4, no rectified linear unit layer and no pooling layer may be included after the fifth convolutional layer and before the sixth convolutional layer. The pooling region size of each pooling layer may be 2x2. Furthermore, the pooling layer placed after the sixth convolutional layer may be an average pooling layer, whereas the pooling layers placed after the other convolutional layers (except the fifth) may be max pooling layers. Finally, in the CNN 400 shown in fig. 4, after the average pooling layer that follows the sixth convolutional layer, there are also a fully connected layer and a Softmax layer to obtain the final output of CNN 400, i.e., the first living body detection value.
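A minimal PyTorch sketch of a network of this kind is given below for illustration. The text lists only five kernel shapes for six convolutional layers and does not specify padding, so the 16-to-32 channel transition at the third layer and the use of padding=1 are assumptions of this sketch rather than features of CNN 400 itself.

```python
import torch
import torch.nn as nn

class FirstLivenessCNN(nn.Module):
    """Sketch of a first-CNN-style network: 6 conv layers, BN after every conv,
    ReLU + 2x2 max pooling after conv1-conv4, nothing between conv5 and conv6,
    2x2 average pooling after conv6, then a fully connected layer and softmax.
    The third layer's 16->32 transition and padding=1 are assumptions."""
    def __init__(self):
        super().__init__()
        chans = [3, 16, 16, 32, 32, 64, 64]   # per-layer channels (third transition assumed)
        blocks = []
        for i in range(6):
            blocks.append(nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, padding=1))
            blocks.append(nn.BatchNorm2d(chans[i + 1]))            # BN1..BN6
            if i < 4:                                               # ReLU + 2x2 max pooling after conv1..conv4
                blocks += [nn.ReLU(inplace=True), nn.MaxPool2d(2)]
            # no ReLU / pooling between the fifth and sixth convolutional layers
        self.features = nn.Sequential(*blocks)
        self.avgpool = nn.AvgPool2d(2)                              # 2x2 average pooling after conv6
        self.fc = nn.Linear(64 * 3 * 3, 2)                          # assumes a 120x120x3 input crop

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        return torch.softmax(self.fc(x), dim=1)[:, 1]               # probability that the input is "living"
```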
However, the embodiments of the present disclosure are not limited thereto; the construction of the first convolutional neural network 400 is not limited to the embodiment shown in fig. 4. It may employ other numbers of convolutional layers. Furthermore, other types of pooling layers, such as average pooling or L2-norm pooling, may follow each convolutional layer instead of max pooling. The ReLU layer may be replaced with another activation layer having a similar function, such as a Sigmoid layer or a hyperbolic tangent (tanh) layer. The Softmax layer may likewise be replaced with another loss layer having a similar function, such as a Sigmoid cross-entropy layer or a Euclidean loss layer. Finally, the number and size of the convolution kernels in each convolutional layer are also not limited to the example configuration of CNN 400 shown in fig. 4.
Returning to fig. 2, for each type of sub-image, a trained respective CNN (e.g., CNN 400) is employed to operate on it to determine a respective first in-vivo detection value for each sub-image in step S240. For example, in the embodiment shown in fig. 3, CNNs are trained for three (class) sub-images, respectively, and the first living detection values of the respective sub-images in the input image are determined using the respective trained CNNs in step S240.
Next, in step S250, it may be determined whether the candidate object determined in step S220 is a living object based on the one or more first living detection values determined in step S240 according to the fusion policy. For example, a weighted average or an arithmetic average may be made for the one or more first living being detection values to obtain an integrated first living being detection value. Then, the integrated first living body detection value may be compared with a preset first living body detection threshold value, and whether the candidate object is a living body object may be determined according to the comparison result.
For example, in the embodiment shown in fig. 3, the integrated first living body detection value may be calculated according to the following formula:
Score_pre = 0.3 × Score_CNN1 + 0.4 × Score_CNN2 + 0.3 × Score_CNN3
where Score_pre is the integrated first living body detection value, Score_CNN1 is the first living body detection value corresponding to the first sub-image (e.g., the "partial face image"), Score_CNN2 is the first living body detection value corresponding to the second sub-image (e.g., the "face image"), and Score_CNN3 is the first living body detection value corresponding to the third sub-image (e.g., the "face image including background"). The integrated first living body detection value lies in the interval [0, 1]. Using, for example, a first living body detection threshold of 0.5, when the integrated first living body detection value is 0.5 or more, it may be determined that the original image contains a living object, for example a real person, rather than a mobile phone screen image, a display image, a developed photograph, a printed photograph, or the like; and when the integrated first living body detection value is less than 0.5, it may be determined that the original image does not contain a living object. Note, however, that the present disclosure is not limited thereto; other weights and weighting schemes may be employed to determine the integrated first living body detection value.
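Purely for illustration, the weighted fusion and threshold comparison described above may be sketched as follows; the function name and the example scores are hypothetical, while the 0.3/0.4/0.3 weights and the 0.5 threshold are taken from the example above.

```python
def fuse_first_scores(score_cnn1, score_cnn2, score_cnn3, threshold=0.5):
    """Weighted fusion of the three first living body detection values (step S250)."""
    score_pre = 0.3 * score_cnn1 + 0.4 * score_cnn2 + 0.3 * score_cnn3
    return score_pre, score_pre >= threshold   # True -> candidate judged to be a living object

# Example: hypothetical scores from the partial-face, face, and face-plus-background networks
fused, is_live = fuse_first_scores(0.82, 0.91, 0.77)
```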
As previously described, the method 200 may be a stand-alone scheme, for example for determining whether a candidate object in an image is a living object, or it may serve as step S120 of the method 100 shown in fig. 1. When it serves as step S120 of the method 100 and it is determined that no living object is included in the original image, the method 100 may end directly as described above.
Step S130 in the example method 100 shown in fig. 1 will be described below in conjunction with fig. 5. In step S130, object matching may be performed using the (second) convolutional neural network 500. Fig. 5 is a schematic diagram illustrating an example second convolutional neural network 500 used in the example method 100 shown in fig. 1. In the embodiment shown in fig. 5, the second convolutional neural network 500 may include a plurality (e.g., 5) of convolutional layer groups, each of which may include, for example, 10 convolutional layers and 1 max pooling layer as shown in fig. 5. However, embodiments of the present disclosure are not so limited and may employ any number and configuration of convolutional groups, pooling groups, and the like. Although only convolutional layers and pooling layers are shown in fig. 5, the present disclosure is not limited thereto; the convolutional layer groups shown in fig. 5 may have additional connections. For example, the output of convolutional layer 1 may be superimposed (element-wise sum) onto the output of convolutional layer 3 and then input to convolutional layer 4. As a further example, the superposition of the outputs of convolutional layers 1 and 3 may in turn be superimposed (element-wise sum) onto the output of convolutional layer 5 and then input to convolutional layer 6, and so on. Here, superposition or element-wise sum means that the two output feature maps are identical in size and number, and the values at corresponding positions are added together.
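A minimal sketch of one such convolutional group with the element-wise-sum connections is given below. The channel counts, the ReLU activations, and the wiring of the tenth convolutional layer are assumptions of the sketch; only the overall pattern (ten convolutional layers, a running element-wise sum fed forward every two layers, and one max pooling layer) follows the description above.

```python
import torch.nn as nn

class ConvGroup(nn.Module):
    """One convolutional group of the second CNN 500: ten 3x3 conv layers with
    element-wise skip sums and a final 2x2 max pooling layer. Channel counts,
    activations, and the last layer's wiring are illustrative assumptions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.pairs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                          nn.ReLU(inplace=True),
                          nn.Conv2d(out_ch, out_ch, 3, padding=1))
            for _ in range(4)                     # conv layers 2-3, 4-5, 6-7, 8-9
        ])
        self.conv10 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        s = self.conv1(x)                         # output of conv layer 1
        for pair in self.pairs:
            s = s + pair(s)                       # running element-wise sum fed to the next pair
        return self.pool(self.conv10(s))
```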
In the embodiment shown in fig. 5, the second convolutional neural network 500 is used to determine whether the candidate object is one of one or more objects stored in advance in an object database. For example, in the access control system embodiment, face images of persons allowed to enter may be stored in advance in the database of the access control system; when a person is detected performing face authentication at the door, the detected candidate object (face) may be compared as described above with the face images stored in the database, thereby determining whether the person is allowed to enter. If not, the method 100 shown in fig. 1, for example, may end directly and an indication of denied entry may be returned. Otherwise, the method 100 may proceed to step S140 to perform the second living object detection.
Hereinafter, a second living body detection method according to an embodiment of the present disclosure will be described in detail with reference to fig. 6. Fig. 6 is a flowchart illustrating an example second living body detection method 600 according to an embodiment of the disclosure. As previously described, the method 600 may correspond to step S140 of the method 100 shown in fig. 1, and may thus make use of some of the intermediate results of the preceding step S130. After all or part of the object matching steps have been performed, living object detection may be performed again using certain intermediate-layer features of the second CNN 500.
Specifically, as shown in fig. 6, feature maps generated by, for example, one or more convolutional groups of the second CNN 500 may be extracted. A corresponding second living body detection value may then be obtained through a second detection network (e.g., the third convolutional neural network (CNN) 700 shown in fig. 7). In the embodiment shown in fig. 6, for example, N third CNNs 700 may be provided for all N convolutional groups and operated on one or more feature maps of the corresponding convolutional groups, respectively, to obtain corresponding second living body detection values 1 to N. However, the present disclosure is not limited thereto; in other embodiments, feature maps of a different number or configuration of convolutional layer groups may be employed. For example, only the feature maps of the first three convolutional groups may be used, or only those of the last two.
Next, similar to the determination of the integrated first living body detection value (e.g., step S250 shown in fig. 2), an integrated second living body detection value may be determined from the individual second living body detection values according to a fusion policy. The fusion policy may be, for example, an arithmetic average, a weighted average, or a cascading policy. Finally, similarly to the determination of the first detection result (e.g., step S260 shown in fig. 2), it may be finally determined whether a living object is included in the image.
As described previously, the second living body detection value may be determined using, for example, the third CNN 700 shown in fig. 7. The third CNN 700 shown in fig. 7 may include a convolutional layer, an average pooling layer, a first fully connected layer, and a second fully connected layer. However, embodiments of the present disclosure are not limited thereto; more convolutional layers, more or fewer fully connected layers, and/or other pooling layers (max pooling, L2-norm pooling, etc.) may also be included.
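As an illustration, a head of this kind could be sketched as follows. The channel widths, the pooling down to 1x1, and the two-way softmax output are assumptions; only the layer types (one convolutional layer, average pooling, and two fully connected layers) are taken from the description of CNN 700.

```python
import torch
import torch.nn as nn

class SecondLivenessHead(nn.Module):
    """Sketch of a third-CNN-style head applied to an intermediate feature map of
    the second CNN. Channel/width choices and pooling size are assumptions; only
    the layer types follow the description of CNN 700."""
    def __init__(self, in_ch, mid_ch=32, hidden=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)        # average pooling (output size assumed to be 1x1)
        self.fc1 = nn.Linear(mid_ch, hidden)
        self.fc2 = nn.Linear(hidden, 2)

    def forward(self, feature_map):
        x = self.pool(self.conv(feature_map))
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        return torch.softmax(self.fc2(x), dim=1)[:, 1]   # second living body detection value
```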
Using, for example, the method 100 shown in fig. 1 as described above, an offline experiment was performed on a face living body detection database. The test database includes 9137 images in total and involves three attack types: printed face image attacks, printed photo attacks, and screen image attacks. Of these, 4318 are images of real faces and 4819 are attack images, a ratio of about 1:1, collected from 100 individuals. The test results are shown in the following table.
(Table of test results omitted here: TPR and FPR at decision threshold TH for each of the single models and the multi-feature model.)
Here, the single model Resize128 is an algorithm using only the resized full face, the single model Random128 is an algorithm using only a local face region (specifically, a 128x128 region randomly cropped from the originally detected full face), the single model Context64 is an algorithm using only the face together with its background, and the multi-feature model is the living object detection algorithm according to the embodiments of the present disclosure. In addition, TPR denotes the correct detection rate, FPR the false alarm rate, and TH the decision threshold. As the table shows, the multi-feature scheme according to the embodiments of the present disclosure has clear advantages and can significantly improve the performance of the living body detection method.
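For reference, the two metrics can be computed from labelled scores as in the short sketch below (an illustration only, not code from the disclosure).

```python
def tpr_fpr(scores_live, scores_attack, th):
    """TPR over real-face images and FPR over attack images at decision
    threshold th (scores of th or above are judged 'living')."""
    tpr = sum(s >= th for s in scores_live) / len(scores_live)
    fpr = sum(s >= th for s in scores_attack) / len(scores_attack)
    return tpr, fpr
```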
In addition, as shown in fig. 8, the scheme according to the embodiments of the present disclosure can clearly distinguish a "real face" (e.g., the living object on the right side of fig. 8) from "fake faces" (e.g., the four objects on the left side of fig. 8: a mobile phone screen image, a computer screen image, a developed photograph, and a printed photograph). Moreover, since each convolutional neural network can be trained in advance, the speed of living object detection in actual use can be significantly improved.
Fig. 9 is a flowchart illustrating a method 900 for in-vivo target detection according to an embodiment of the present disclosure. As shown in fig. 9, method 900 may include steps S910 and S920. In accordance with the present disclosure, some of the steps of method 900 may be performed alone or in combination, and may be performed in parallel or sequentially, and are not limited to the specific order of operations shown in fig. 9. In some embodiments, method 900 may be performed by device 1000 shown in fig. 10 or device 1100 shown in fig. 11.
Fig. 10 is a functional block diagram illustrating an example device 1000 for performing the method 900 illustrated in fig. 9, according to an embodiment of the disclosure. As shown in fig. 10, the apparatus 1000 may include: a living body detection value determining unit 1010 and a living body target determining unit 1020.
The living body detection value determining unit 1010 may be configured to determine a first living body detection value of the target image based on the category of the target image. The living body detection value determining unit 1010 may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a microprocessor, a microcontroller, or the like of the apparatus 1000.
The living object determination unit 1020 may be configured to determine whether the object in the target image is a living object based on the first living detection value of the target image. The living body target determining unit 1020 may also be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a microprocessor, a microcontroller, or the like of the apparatus 1000.
Further, the apparatus 1000 may further include other units not shown in fig. 10, such as a candidate object determination unit, a first living body detection value determination unit, a living object determination unit, an object comparison unit, a second living body detection value determination unit, a living object final determination unit, and the like. In some embodiments, the candidate object determination unit may be configured to determine a candidate object in the image. In some embodiments, the first living body detection value determination unit may be configured to determine, from one or more sub-images related to the candidate object in the image, one or more first living body detection values corresponding to the one or more sub-images, respectively. In some embodiments, the living object determination unit may be configured to determine whether the candidate object is a living object based on the one or more first living body detection values. In some embodiments, the object comparison unit may be configured to determine, if the candidate object is determined to be a living object, whether the candidate object is one of one or more objects stored in advance in an object database. In some embodiments, the object comparison unit may be further configured to make this determination using a second convolutional neural network. In some embodiments, the second living body detection value determination unit may be configured to determine one or more second living body detection values, using one or more third convolutional neural networks, for one or more feature maps of at least one convolution layer in the second convolutional neural network, respectively. In some embodiments, the living object final determination unit may be configured to determine whether the candidate object is a living object based on the one or more second living body detection values.
The method 900 for living target detection performed by the apparatus 1000 according to an embodiment of the present disclosure, together with the apparatus 1000 itself, will be described in detail below with reference to fig. 9 and 10.
The method 900 starts at step S910, in which the living body detection value determining unit 1010 of the apparatus 1000 may determine a first living body detection value of a target image based on the category of the target image.
In step S920, the living body target determining unit 1020 of the apparatus 1000 may determine whether the target in the target image is a living target based on the first living body detection value of the target image.
In some embodiments, the method 900 may further include, prior to step S910: performing target detection on an image to be detected to obtain the target image. In some embodiments, the category of the target image may include at least one of: a target global image, a target local image, and a target image containing background. In some embodiments, step S910 may include determining the first living body detection value of the target image based on at least one of: target global information determined from the target global image; local texture feature information determined from the target local image; and context information determined from the target image containing background. In some embodiments, step S910 may include: for each category of target image, determining a respective first living body detection value by extracting features using a corresponding first convolutional neural network. In some embodiments, the first convolutional neural network corresponding to each category of target image may be trained using images of the corresponding category. In some embodiments, the first convolutional neural network may include at least one of: a plurality of convolution layers, a plurality of pooling layers, a plurality of activation layers, and a plurality of batch normalization layers, and no pooling layer or activation layer may be present between the penultimate convolution layer and the last convolution layer. In some embodiments, step S920 may include: determining a second living body detection value of the target image; and determining whether the target in the target image is a living target according to the second living body detection value of the target image, or determining whether the target in the target image is a living target according to both the first living body detection value and the second living body detection value of the target image. In some embodiments, determining the second living body detection value of the target image specifically includes: performing target recognition and determining the second living body detection value of the target image when the first living body detection value of the target image meets a preset condition. In some embodiments, the second living body detection value may be determined by: determining one or more second living body detection values, using one or more third convolutional neural networks, respectively for one or more feature maps of at least one convolution layer in the second convolutional neural network used in the target recognition. In some embodiments, the third convolutional neural network may include at least one of: a convolution layer, a pooling layer, and a plurality of fully-connected layers. In some embodiments, the target may comprise a body part of a living being. In some embodiments, the body part may include at least one of: a face, a palmprint, a fingerprint, an iris, and a limb.
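To make the above flow concrete, the following sketch strings these embodiments together; every function, dictionary key, and threshold here is a hypothetical placeholder introduced for illustration rather than an interface defined by the present disclosure.

```python
def detect_living_target(target_images, first_nets, recognizer, fuse, threshold):
    """Illustrative two-stage flow for method 900 (all names are assumptions).

    target_images: dict mapping a category ("global", "local", "context")
                   to the corresponding target image.
    first_nets:    dict mapping each category to its first CNN (stage 1).
    recognizer:    second CNN; returns (identity, second liveness values).
    fuse:          fusion policy, e.g. fuse_scores from the earlier sketch.
    """
    # Stage 1: one first living body detection value per image category.
    first_values = [first_nets[cat](img) for cat, img in target_images.items()]
    first_score = fuse(first_values)

    # Preset condition: only proceed to recognition if stage 1 looks live.
    if first_score < threshold:
        return False, None

    # Stage 2: target recognition, with second living body detection values
    # taken from feature maps of the recognition network (cf. fig. 6 and 7).
    identity, second_values = recognizer(target_images["global"])
    second_score = fuse(second_values)
    return second_score >= threshold, identity
```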
Fig. 11 is a block diagram illustrating an example hardware arrangement 1100 of an example device according to an embodiment of the disclosure. The arrangement 1100 includes a processor 1106. Processor 1106 can be a single processing unit or multiple processing units for performing the different actions of the flows described herein. Arrangement 1100 may also include an input unit 1102 for receiving signals from other entities, and an output unit 1104 for providing signals to other entities. The input unit 1102 and the output unit 1104 may be arranged as a single entity or as separate entities.
Furthermore, the arrangement 1100 may include at least one readable storage medium 1108 in the form of non-volatile or volatile memory, such as an electrically erasable programmable read-only memory (EEPROM), a flash memory, an optical disk, a Blu-ray disk, and/or a hard disk drive. The readable storage medium 1108 may comprise a computer program 1110, which computer program 1110 may comprise code/computer-readable instructions that, when executed by the processor 1106 in the arrangement 1100, enable the hardware arrangement 1100 and/or the device 1000 comprising the hardware arrangement 1100 to perform the flows described above in connection with fig. 1, 2, 4-7 and/or 9 and any variations thereof.
The computer program 1110 may be configured as computer program code having an architecture of, for example, computer program modules 1110A-1110C. Thus, in an example embodiment using the hardware arrangement 1100, the code in the computer program of the arrangement 1100 comprises: a module 1110A for determining a first living body detection value of a target image based on the category of the target image; and a module 1110B for determining whether a target in the target image is a living target according to the first living body detection value of the target image. However, other modules for performing the various steps of the various methods described herein may also be included in the computer program 1110.
The computer program modules may substantially perform the various actions in the flows illustrated in fig. 1, 2, 4-7, and/or 9 to simulate various devices. In other words, when different computer program modules are executed in the processor 1106, they may correspond to the various different elements of the various devices mentioned herein.
Although the code means in the embodiment disclosed above in connection with fig. 11 are implemented as computer program modules which, when executed in the processor 1106, cause the hardware arrangement 1100 to perform the actions described above in connection with fig. 1, 2, 4-7 and/or 9, in alternative embodiments at least one of the code means may be implemented at least partly as hardware circuitry.
The processor may be a single CPU (central processing unit), but may also comprise two or more processing units. For example, the processor may include a general-purpose microprocessor, an instruction set processor, and/or an associated chipset, and/or a special-purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)). The processor may also include on-board memory for caching purposes. The computer program may be carried by a computer program product connected to the processor. The computer program product may include a computer-readable medium having the computer program stored thereon. For example, the computer program product may be a flash memory, a Random Access Memory (RAM), a Read Only Memory (ROM), or an EEPROM, and in alternative embodiments the above-described computer program modules may be distributed among different computer program products in the form of memory within the UE.
The disclosure has been described with reference to the preferred embodiments. It should be understood that various other changes, substitutions, and alterations can be made by those skilled in the art without departing from the spirit and scope of the disclosure. Accordingly, the scope of the present disclosure is not limited to the specific embodiments described above, but should be defined by the appended claims.
Furthermore, functions described herein as being implemented by pure hardware, pure software, and/or firmware may also be implemented by means of dedicated hardware, a combination of general-purpose hardware and software, or the like. For example, functionality described as being implemented by dedicated hardware (e.g., Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), etc.) may be implemented as a combination of general-purpose hardware (e.g., Central Processing Units (CPUs), Digital Signal Processors (DSPs)) and software, or vice versa.

Claims (13)

1. A method, comprising:
determining a living score based on a recognition model for identifying a user included in an input image, wherein determining the living score includes:
determining respective living scores from feature vectors output from hidden layers within the recognition model, and
determining the living score based on the respective living scores; and
the user is authenticated based on the living score and the recognition result from the recognition model.
2. The method of claim 1, wherein determining the respective living scores comprises:
the individual vital signs are determined from the feature vectors using a vital test model.
3. The method of claim 1, wherein determining the living score based on the respective living scores comprises: applying a weight to at least one of the respective living scores; and determining the living score based on a result of applying the weight.
4. The method of claim 1, further comprising:
it is determined whether the object is a pre-registered object based on the recognition model.
5. The method of claim 1, further comprising:
an operation requested by the user is performed in response to the user being authenticated as a registered user.
6. The method of claim 5, wherein performing the operation comprises: unlocking the device, making a payment, and performing a user login.
7. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method of any of claims 1-5.
8. An electronic device, comprising:
one or more processors configured to determine a living score based on a recognition model for identifying an object included in an input image, and verify a user based on the living score and a recognition result from the recognition model,
wherein the one or more processors are further configured to determine respective living scores from feature vectors output from hidden layers within the recognition model, and determine the living score based on the respective living scores.
9. The electronic device of claim 8, wherein the one or more processors are further configured to determine the respective living scores from the feature vectors using a liveness test model.
10. The electronic device of claim 8, wherein the one or more processors are further configured to apply a weight to at least one of the respective living scores and determine the living score based on a result of applying the weight.
11. The electronic device of claim 8, further comprising a memory configured to store instructions to be executed by the one or more processors.
12. The electronic device of claim 8, wherein the one or more processors are further configured to perform an operation requested by the user in response to the user being authenticated as a registered user.
13. The electronic device of claim 12, wherein the operation requested by the user comprises at least one of unlocking the device, making a payment, and performing a user login.
CN202310265731.8A 2017-03-27 2017-03-27 Living body object identification method, apparatus, and computer-readable storage medium Pending CN116311549A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310265731.8A CN116311549A (en) 2017-03-27 2017-03-27 Living body object identification method, apparatus, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310265731.8A CN116311549A (en) 2017-03-27 2017-03-27 Living body object identification method, apparatus, and computer-readable storage medium
CN201710192070.5A CN108664843B (en) 2017-03-27 2017-03-27 Living object recognition method, living object recognition apparatus, and computer-readable storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201710192070.5A Division CN108664843B (en) 2017-03-27 2017-03-27 Living object recognition method, living object recognition apparatus, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN116311549A true CN116311549A (en) 2023-06-23

Family

ID=63786498

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201710192070.5A Active CN108664843B (en) 2017-03-27 2017-03-27 Living object recognition method, living object recognition apparatus, and computer-readable storage medium
CN202310265731.8A Pending CN116311549A (en) 2017-03-27 2017-03-27 Living body object identification method, apparatus, and computer-readable storage medium

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201710192070.5A Active CN108664843B (en) 2017-03-27 2017-03-27 Living object recognition method, living object recognition apparatus, and computer-readable storage medium

Country Status (2)

Country Link
KR (1) KR102352345B1 (en)
CN (2) CN108664843B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2579583B (en) * 2018-12-04 2021-01-06 Yoti Holding Ltd Anti-spoofing
CN111723626B (en) * 2019-03-22 2024-05-07 北京地平线机器人技术研发有限公司 Method, device and electronic equipment for living body detection
CN111860055B (en) * 2019-04-29 2023-10-24 北京眼神智能科技有限公司 Face silence living body detection method, device, readable storage medium and equipment
KR102275803B1 (en) * 2019-10-25 2021-07-09 건국대학교 산학협력단 Apparatus and method for detecting forgery or alteration of the face
CN111241989B (en) * 2020-01-08 2023-06-13 腾讯科技(深圳)有限公司 Image recognition method and device and electronic equipment
CN112800997B (en) * 2020-04-10 2024-01-05 支付宝(杭州)信息技术有限公司 Living body detection method, device and equipment
CN112613471B (en) * 2020-12-31 2023-08-01 中移(杭州)信息技术有限公司 Face living body detection method, device and computer readable storage medium
KR102545565B1 (en) * 2021-10-06 2023-06-21 주식회사 오픈잇 Method for verifying liveness, and server and program using the same
CN113920550A (en) * 2021-10-29 2022-01-11 敦泰电子(深圳)有限公司 Finger authenticity identification method, electronic device and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2495425T3 (en) * 2011-07-11 2014-09-17 Accenture Global Services Limited Life detection
US8542879B1 (en) * 2012-06-26 2013-09-24 Google Inc. Facial recognition
US20180034852A1 (en) * 2014-11-26 2018-02-01 Isityou Ltd. Anti-spoofing system and methods useful in conjunction therewith
CN204833300U (en) * 2015-08-20 2015-12-02 北京旷视科技有限公司 Live body detecting system
CN105389554B (en) * 2015-11-06 2019-05-17 北京汉王智远科技有限公司 Living body determination method and equipment based on recognition of face
CN105740780B (en) * 2016-01-25 2020-07-28 北京眼神智能科技有限公司 Method and device for detecting living human face
CN105956572A (en) * 2016-05-15 2016-09-21 北京工业大学 In vivo face detection method based on convolutional neural network

Also Published As

Publication number Publication date
CN108664843A (en) 2018-10-16
KR102352345B1 (en) 2022-01-18
KR20180109664A (en) 2018-10-08
CN108664843B (en) 2023-04-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination