CN114373218A - Method for generating convolution network for detecting living body object - Google Patents

Method for generating convolution network for detecting living body object

Info

Publication number
CN114373218A
Authority
CN
China
Prior art keywords
image
prediction result
feature
iris
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210274678.3A
Other languages
Chinese (zh)
Other versions
CN114373218B (en)
Inventor
王明魁
李茂林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Superred Technology Co Ltd
Original Assignee
Beijing Superred Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Superred Technology Co Ltd filed Critical Beijing Superred Technology Co Ltd
Priority to CN202210274678.3A priority Critical patent/CN114373218B/en
Publication of CN114373218A publication Critical patent/CN114373218A/en
Application granted granted Critical
Publication of CN114373218B publication Critical patent/CN114373218B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The present disclosure provides a method for generating a convolutional network for detecting a living object, comprising the steps of: acquiring an infrared image containing an eye region and three-dimensional coordinate information, and processing the infrared image and the three-dimensional coordinate information to generate a first image and a second image containing the iris and a label image indicating iris depth information; inputting the first image into an initial convolutional network for processing, and outputting a first prediction result, a second prediction result and a third prediction result; calculating a first loss value based on the first prediction result and the label image, and adjusting first network parameters of the convolutional network according to the first loss value until a predetermined condition is satisfied to obtain an intermediate convolutional network; inputting the second image into the intermediate convolutional network and outputting a third prediction result; and calculating a second loss value based on the third prediction results corresponding respectively to the first image and the second image, and adjusting second network parameters according to the second loss value until a predetermined condition is satisfied, finally generating the convolutional network for detecting the living object.

Description

Method for generating convolution network for detecting living body object
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method for generating a convolutional network for detecting a living object and a method for detecting a living object.
Background
With the development of video, image processing and pattern recognition technologies, biometric identification, particularly face recognition, has become a stable, accurate and efficient identity authentication technology that is widely used in finance, justice, public security, the military and people's daily life.
However, identity authentication based on biometric features also has certain drawbacks. For example, an illegitimate user may obtain images or videos of a legitimate user in various ways, or may forge biometric features from public photo albums, personal resumes, pinhole cameras and the like to spoof the identification system and obtain the privileges of a legitimate identity. Biometric identification therefore faces a critical requirement: the person being identified must be a living body rather than a photograph or a recorded video; that is, biometric identification needs to rely on a liveness detection technique.
Among all biometric features, the iris is an in-vivo feature that is not affected by abrasion, aging and the like, and iris recognition has the lowest error rate of the various biometric recognition methods, so liveness detection is generally performed by extracting iris features. However, how to detect an object to be detected that is not a living body, for example one forged with a photograph or video, or with a fake iris such as a cosmetic contact lens or a glass eyeball, remains a problem to be solved in iris recognition.
In view of the above, a new approach for detecting a living subject is needed.
Disclosure of Invention
To this end, the present disclosure provides a method of generating a convolutional network for detecting a living object in an attempt to solve or at least alleviate the problems presented above.
According to a first aspect of the present disclosure, there is provided a method of generating a convolutional network for detecting a living object, comprising the steps of: acquiring an infrared image and three-dimensional coordinate information of an eye region; processing the infrared image and the three-dimensional coordinate information to generate a first image and a second image containing the iris and a label image indicating the iris depth information; inputting the first image into an initial convolutional network for processing, wherein the convolutional network comprises a parameter sharing component, a depth estimation component and a blocking countermeasure component which are coupled, a first prediction result being output by the depth estimation component, and a second prediction result and a third prediction result being output by the blocking countermeasure component; calculating a first loss value based on the first prediction result and the label image, and adjusting first network parameters of the convolutional network according to the first loss value until a predetermined condition is satisfied to obtain an intermediate convolutional network; inputting the second image into the intermediate convolutional network, and outputting a third prediction result after processing by the parameter sharing component and the blocking countermeasure component; and calculating a second loss value based on the third prediction result corresponding to the first image and the third prediction result corresponding to the second image, and adjusting second network parameters according to the second loss value until a predetermined condition is satisfied, the corresponding convolutional network being the finally generated convolutional network for detecting the living object.
Optionally, the method according to the present disclosure further comprises the steps of: performing iris detection on the infrared image to determine position information of the iris; determining an iris region from the infrared image as a first image based on the position information of the iris; performing blocking processing on the first image to generate a second image; determining depth coordinates of the iris region from the three-dimensional coordinate information; the depth coordinates are normalized to generate a label image.
Optionally, the method according to the present disclosure further comprises the steps of: the method comprises the steps of carrying out blocking processing on a first image to obtain a plurality of image blocks; randomly ordering image blocks of each row in the row direction of the image to obtain an intermediate disordered image; and randomly ordering the image blocks in each column in the intermediate disordered image in the column direction of the image to obtain a second image.
Optionally, in the method according to the present disclosure, the depth estimation component and the blocking countermeasure component are respectively coupled with the parameter sharing component; the parameter sharing component is adapted to extract detail features of the input image and generate a feature map; the depth estimation component is adapted to perform feature encoding on the feature map and output a first prediction result predicting the depth information of the input image; and the blocking countermeasure component is adapted to extract high-level features of the feature map and output a second prediction result predicting the likelihood that the iris in the input image is not a prosthetic iris, and a third prediction result predicting whether the input image is a blocked image.
Optionally, in a method according to the present disclosure, the depth estimation component comprises: the first feature coding module is suitable for performing feature coding on the feature map and outputting a first feature map; the second characteristic coding module is suitable for carrying out characteristic coding on the first characteristic diagram and outputting a second characteristic diagram; the third feature coding module is suitable for performing feature coding on the second feature graph and outputting a third feature graph; and the feature aggregation module is suitable for respectively processing the first feature map, the second feature map and the third feature map to correspondingly obtain each processed feature, fusing each processed feature to generate a fused feature, and is also suitable for performing convolution on the fused feature to finally output a first prediction result.
Optionally, in a method according to the present disclosure, the first feature encoding module includes 3 feature encoding sub-blocks, a pooling layer, and a down-sampling layer coupled in sequence; the second feature coding module comprises 4 feature coding sub-blocks, a pooling layer and a down-sampling layer which are sequentially coupled; the third feature coding module comprises 6 feature coding sub-blocks, a pooling layer and a down-sampling layer which are sequentially coupled; the feature aggregation module comprises 3 attention subblocks, a feature fusion layer and a convolution layer, wherein the 3 attention subblocks are respectively and correspondingly coupled to the first feature encoding module, the second feature encoding module and the third feature encoding module so as to respectively process the first feature map, the second feature map and the third feature map, and the feature encoding subblocks comprise a plurality of convolution layers with different convolution kernel sizes.
Optionally, in a method according to the present disclosure, the partitioning countermeasure component comprises: a lightweight convolutional layer adapted to extract high-level features of the feature map; a pooling layer adapted to aggregate the extracted high-level features; a first convolution layer adapted to output a second prediction result predicting that an iris in the input image is not a prosthetic iris; and a second convolution layer adapted to output a third prediction result of predicting whether the input image is a block image.
Optionally, in a method according to the present disclosure, the first network parameter includes: network parameters of the parameter sharing component, network parameters of the depth estimation component and network parameters of the blocking countermeasure component; the second network parameters include: network parameters of the block countermeasure component.
According to a second aspect of the present disclosure, there is provided a method for detecting a living object, comprising the steps of: acquiring an infrared image containing an eye region of an object to be detected; preprocessing the infrared image to generate an image to be detected; inputting the image to be detected into a convolutional network for processing so as to output a first prediction result predicting the depth information of the image to be detected and a second prediction result predicting that the iris in the image to be detected is not a prosthetic iris; respectively processing the first prediction result and the second prediction result to obtain processed prediction values; and determining that the object to be detected is a living body when the processed prediction values satisfy a threshold condition, wherein the convolutional network is generated by performing the method described above.
Optionally, in the method according to the present disclosure, the step of processing the first prediction result and the second prediction result respectively to obtain processed prediction values includes: determining the mean value of the first prediction result as a first prediction value; inputting the second prediction result into a normalization index function for processing, and outputting a second prediction value; and taking the first predicted value and the second predicted value as processed predicted values.
Optionally, in the method according to the present disclosure, when both the first predicted value and the second predicted value satisfy respective threshold conditions, it is determined that the object to be detected is a living body.
According to a third aspect of the present disclosure, there is provided a system for detecting a living subject, comprising: the image acquisition unit is suitable for acquiring an infrared image of an eye region containing an object to be detected and preprocessing the infrared image to generate an image to be detected; the image processing unit is suitable for inputting the image to be detected into the convolutional network for processing so as to output a first prediction result for predicting the depth information of the image to be detected and a second prediction result for predicting that the iris in the image to be detected is not the false iris; the prediction result unit is suitable for respectively processing the first prediction result and the second prediction result to obtain processed prediction values, and determining that the object to be detected is a living body when the processed prediction values meet a threshold condition; and the convolutional network generating unit is suitable for training and generating a convolutional network and is also suitable for updating the convolutional network.
According to a fourth aspect of the present disclosure, there is provided a computing device comprising: at least one processor; and a memory storing program instructions that, when read and executed by the processor, cause the computing device to perform the above-described method.
According to a fifth aspect of the present disclosure, there is provided a readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the above method.
According to the technical solution of the present disclosure, the first image, the second image and the label image used for training the convolutional network are obtained by respectively preprocessing an infrared image containing the iris and three-dimensional coordinate information. The first image and the second image are then input into the convolutional network in turn, and the network parameters of the convolutional network are adjusted at least based on the output results and the label image until training is completed and the convolutional network is generated. The convolutional network can be viewed as comprising two parts, a depth estimation branch and a blocking countermeasure branch, where the two branches share part of their parameters (i.e., the parameter sharing component). The depth estimation branch learns the depth information of images and predicts the depth information of the input image, so as to recognize disguises such as prints, photographs and videos. The blocking countermeasure branch learns finer detail information about the iris in the image and predicts the likelihood that the iris in the input image is not a prosthesis, so as to recognize iris prostheses such as cosmetic contact lenses, artificial eyes and glass eyeballs.
The foregoing description is only an overview of the technical solutions of the present disclosure, and the embodiments of the present disclosure are described below in order to make the technical means of the present disclosure more clearly understood and to make the above and other objects, features, and advantages of the present disclosure more clearly understandable.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a system 100 for detecting a living subject according to one embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a computing device 200 according to one embodiment of the present disclosure;
FIG. 3 shows a flow diagram of a method 300 of generating a convolutional network for detecting a living object according to one embodiment of the present disclosure;
FIG. 4 illustrates a schematic structural diagram of a convolutional network 400 according to some embodiments of the present disclosure;
FIG. 5 shows a schematic structural diagram of a feature encoding sub-block 500 according to an embodiment of the present disclosure;
FIG. 6 illustrates a schematic structural diagram of a feature aggregation module 428 according to some embodiments of the present disclosure;
FIG. 7 shows a flow diagram of a method 700 of detecting a living object according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
To address the problems in the prior art, the present disclosure provides a solution for detecting a living subject. Fig. 1 shows a schematic diagram of a system 100 for detecting a living object according to one embodiment of the present disclosure. As shown in fig. 1, the system 100 includes an image acquisition unit 110, an image processing unit 120, a prediction result unit 130, and a convolution network generation unit 140.
It should be noted that the system 100 shown in fig. 1 is merely exemplary. In particular implementations, a different number of units (e.g., image acquisition units) may be included in the system 100, as the present disclosure is not limited in this respect.
The system 100 processes the acquired image to be detected to output prediction results. The prediction results include at least: a first prediction result predicting the depth information of the image to be detected, and a second prediction result predicting that the iris in the image to be detected is not a prosthetic iris. The first prediction result thus indicates whether the object to be detected is a disguise such as a print, a photograph or a video rather than a living object. The second prediction result indicates whether the iris of the object to be detected is an iris prosthesis such as a cosmetic contact lens, an artificial eye or a glass eyeball. The system 100 determines whether the object to be detected is a living body based on these prediction results. That is, the object to be detected is determined to be a living object only when it is not a disguise made from prints, photographs, videos or the like, and its detected iris is not an iris prosthesis such as a cosmetic contact lens, an artificial eye or a glass eyeball.
Based on the system 100, a living object can be accurately detected among objects to be detected, and the system can be deployed in iris recognition equipment, thereby effectively improving the security of iris recognition.
The image capturing unit 110 may be any type of image capturing device, and the present disclosure does not limit the type and hardware configuration. Preferably, the image capturing unit 110 may be a camera with a near infrared light source capable of capturing iris information for capturing an infrared image of an eye region containing an object to be detected. In other embodiments, the image capturing unit 110 may also be an infrared camera capable of capturing three-dimensional information.
According to the embodiment of the present disclosure, the image acquisition unit 110 may also preprocess the infrared image to generate the image to be detected. Preprocessing operations include, but are not limited to, cropping and scaling.
The image processing unit 120 is coupled to the image acquisition unit 110, and receives the image to be detected, and inputs the image to be detected into a convolutional network for processing, so as to output a first prediction result and a second prediction result.
The prediction result unit 130 is coupled to the image processing unit 120, and the prediction result unit 130 processes the first prediction result and the second prediction result respectively to obtain processed prediction values, and determines that the object to be detected is a living body when the processed prediction values satisfy a threshold condition.
In some embodiments, the system 100 further comprises a convolutional network generation unit 140 for training to generate a convolutional network. Of course, the convolutional network generating unit 140 may also periodically update the convolutional network according to the real-time data, which is not limited by this disclosure.
It should be noted that, regarding the specific execution flow of the system 100 and its parts, reference may be made to the following detailed description of the method 300 and the method 700, which is not specifically expanded herein.
System 100 may be implemented by one or more computing devices. FIG. 2 shows a schematic diagram of a computing device 200 according to one embodiment of the present disclosure. It should be noted that the computing device 200 shown in fig. 2 is only one example.
As shown in FIG. 2, in a basic configuration 202, a computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level one cache 210 and a level two cache 212, a processor core 214, and registers 216. Example processor cores 214 may include Arithmetic Logic Units (ALUs), Floating Point Units (FPUs), digital signal processing cores (DSP cores), or any combination thereof. An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The physical memory in the computing device is usually referred to as a volatile memory RAM, and data in the disk needs to be loaded into the physical memory to be read by the processor 204. System memory 206 may include an operating system 220, one or more applications 222, and program data 224. In some implementations, the application 222 can be arranged to execute instructions on the operating system with the program data 224 by the one or more processors 204. Operating system 220 may be, for example, Linux, Windows, or the like, which includes program instructions for handling basic system services and for performing hardware-dependent tasks. The application 222 includes program instructions for implementing various user-desired functions, and the application 222 may be, for example, but not limited to, a browser, instant messenger, a software development tool (e.g., an integrated development environment IDE, a compiler, etc.), and the like. When the application 222 is installed into the computing device 200, a driver module may be added to the operating system 220.
When the computing device 200 is started, the processor 204 reads program instructions of the operating system 220 from the memory 206 and executes them. Applications 222 run on top of operating system 220, utilizing the interface provided by operating system 220 and the underlying hardware to implement various user-desired functions. When the user starts the application 222, the application 222 is loaded into the memory 206, and the processor 204 reads the program instructions of the application 222 from the memory 206 and executes the program instructions.
Computing device 200 also includes storage 232, storage 232 including removable storage 236 and non-removable storage 238, each of removable storage 236 and non-removable storage 238 being connected to storage interface bus 234.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via the bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices, such as a display 253 or speakers, via one or more a/V ports 252. Example peripheral interfaces 244 can include a serial interface controller 254 and a parallel interface controller 256, which can be configured to facilitate communications with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 258. An example communication device 246 may include a network controller 260, which may be arranged to facilitate communications with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, and may include any information delivery media, such as carrier waves or other transport mechanisms, in a modulated data signal. A "modulated data signal" may be a signal that has one or more of its data set or its changes made in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private-wired network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
The computing device 200 also includes a storage interface bus 234 coupled to the bus/interface controller 230. The storage interface bus 234 is coupled to the storage device 232, and the storage device 232 is adapted for data storage. The example storage device 232 may include removable storage 236 (e.g., CD, DVD, U-disk, removable hard disk, etc.) and non-removable storage 238 (e.g., hard disk drive, HDD, etc.).
In the computing device 200 according to the present disclosure, the application 222 includes instructions for performing the method 300 for generating a convolutional network for detecting a living object of the present disclosure, and/or instructions for performing the method 700 for detecting a living object of the present disclosure, which may instruct the processor 204 to perform the method of the present disclosure to train generation of a convolutional network for detecting a living object and to detect a living object using the convolutional network.
Fig. 3 shows a flow diagram of a method 300 of generating a convolutional network for detecting a living object according to one embodiment of the present disclosure. The method 300 may be performed by the image acquisition unit 110 and the convolutional network generation unit 140. The method 300 aims to generate a convolutional network through training, and the convolutional network can be applied to an iris recognition scheme to detect a living object in an object to be recognized so as to improve the accuracy of iris recognition.
As shown in fig. 3, the method 300 begins at step S310. In step S310, the image acquisition unit 110 acquires an infrared image including an eye region and three-dimensional coordinate information.
According to the embodiment of the present disclosure, it is ensured that the infrared image and the three-dimensional coordinate information can be aligned. The acquired three-dimensional coordinates are labeled (x, y, z), where x, y represent two-dimensional plane coordinates and z represents the distance of the acquired object from the camera (e.g., image acquisition unit 110), i.e., the depth coordinate. In one embodiment, the three-dimensional coordinate information is saved in text form.
Thereafter, the image acquisition unit 110 transmits the acquired image to the convolution network generation unit 140 for subsequent processing.
First, in step S320, the infrared image and the three-dimensional coordinate information are processed to generate a first image and a second image including an iris, and a tag image indicating iris depth information.
According to the embodiment of the present disclosure, step S320 may be further divided into: processing the infrared image and processing the three-dimensional coordinate information.
1) Processing of infrared images
First, iris detection is performed on the infrared image to determine position information of the iris. Then, an iris region is determined from the infrared image based on the position information of the iris, and the determined iris region is preprocessed to generate a first image. For example, the iris region is cropped from the infrared image and scaled to a fixed size (e.g., 256 × 256, although not limited thereto) to serve as the first image, denoted I1.
It should be noted that the present disclosure is not limited to what manner of iris detection is performed, and any iris detection method can be combined with the embodiments of the present disclosure to implement the method 300 of the present disclosure.
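As an illustration only, the cropping and scaling described above can be sketched as follows; the detector output format (an (x, y, w, h) box), the function name and the use of OpenCV are assumptions rather than part of the embodiment.

```python
import cv2
import numpy as np

def make_first_image(ir_image: np.ndarray, iris_box, size: int = 256) -> np.ndarray:
    """Crop the detected iris region from the infrared image and scale it to a
    fixed size (256 x 256 here) to obtain the first image I1.
    `iris_box` is assumed to be an (x, y, w, h) box from an arbitrary iris detector."""
    x, y, w, h = iris_box
    iris_region = ir_image[y:y + h, x:x + w]
    return cv2.resize(iris_region, (size, size), interpolation=cv2.INTER_LINEAR)
```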
Next, the first image I1 is subjected to a blocking process to generate a second image. According to an embodiment of the present disclosure, the purpose of blocking the first image is to force a neural network to learn local detail features of the first image.
According to an embodiment, the second image is generated by the following steps.
Firstly, a first image is subjected to blocking processing to obtain a plurality of image blocks. In one embodiment, the first image is divided into m image blocks of size n x n.
Secondly, the m image blocks are shuffled by rows and by columns respectively according to a certain rule, and the processed image, namely the second image, is obtained.
In one embodiment, the "certain rule" is: randomly ordering image blocks of each row in the row direction of the image to obtain an intermediate disordered image; and randomly ordering the image blocks in each column in the intermediate disordered image in the column direction of the image to obtain a second image. Of course, the image blocks may be first sorted in the column direction to obtain an intermediate disordered image, and then sorted in the row direction to obtain a second image. The disclosed embodiments are not so limited.
One specific procedure is as follows (a simplified sketch is given after this paragraph). Suppose each row contains my image blocks of size n × n and each column contains mx image blocks of size n × n. Shuffling is first performed by rows: every t1 image blocks form a group (t1 is smaller than or equal to my), the stride is b (b is smaller than or equal to t1), and the positions of the image blocks within each group are randomly scrambled to obtain the corresponding shuffled image row. This processing continues row by row according to the same rule; after all image rows have been shuffled, processing by columns begins. The column rule is similar to the row rule: every t2 image blocks form a group (t2 is smaller than or equal to mx), the stride is also b, and each group is randomly scrambled. After all image columns have been shuffled, the mixed image, namely the second image, is obtained.
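The numpy sketch below illustrates only the simpler variant described earlier (randomly reordering the blocks within each block-row and then within each block-column); the grouped, strided scrambling with t1, t2 and b is omitted, and the function name and signature are illustrative assumptions.

```python
import numpy as np

def block_shuffle(img: np.ndarray, n: int, rng=None) -> np.ndarray:
    """Split a grayscale image (H, W) into n x n blocks, shuffle the blocks
    within each block-row, then within each block-column (simple variant of
    the row/column shuffling described above)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    assert h % n == 0 and w % n == 0, "image size must be divisible by n"
    rows, cols = h // n, w // n
    # view the image as a (rows, cols) grid of n x n blocks
    blocks = img.reshape(rows, n, cols, n).swapaxes(1, 2).copy()  # (rows, cols, n, n)
    for r in range(rows):                       # shuffle along the row direction
        blocks[r] = blocks[r][rng.permutation(cols)]
    for c in range(cols):                       # shuffle along the column direction
        blocks[:, c] = blocks[rng.permutation(rows), c]
    return blocks.swapaxes(1, 2).reshape(h, w)
```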
In one embodiment according to the present disclosure, in order to eliminate the negative impact of image blocking (which introduces image noise), the two images I1 and I2 are given labels indicating whether the corresponding image is a blocked image. For example, I1 is given the label 0 (indicating that it is not a blocked image) and I2 is given the label 1 (indicating that it is a blocked image). These labels are used for the countermeasure training of the convolutional network, which is expanded on below.
2) Processing three-dimensional coordinate information
From the three-dimensional coordinate information, the depth coordinate (z as described earlier) of the iris region is determined. The depth coordinates are normalized to generate a label image.
Specifically, the maximum and minimum values of z within the iris region are counted, and the value of z within the iris region is normalized to the interval [0, 1 ]. The normalized values are then saved to the new image at the corresponding coordinates and the new image is scaled to a size of N x N (N =32 in one embodiment) to yield a label image, designated I3.
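A minimal sketch of the label image generation is given below, assuming the depth coordinates z of the iris region are available as a dense per-pixel map; the function name and the use of OpenCV for resizing are assumptions.

```python
import cv2
import numpy as np

def make_label_image(z_map: np.ndarray, out_size: int = 32) -> np.ndarray:
    """Normalize the depth coordinates z of the iris region to [0, 1] and
    resize the result to out_size x out_size (N = 32 in the embodiment above)
    to obtain the label image I3."""
    z = z_map.astype(np.float32)
    z_min, z_max = z.min(), z.max()
    z_norm = (z - z_min) / max(float(z_max - z_min), 1e-6)  # avoid division by zero
    return cv2.resize(z_norm, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
```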
Subsequently, in step S330, the first image I1 is input to an initial convolution network for processing.
According to the present disclosure, the method 300 further comprises the steps of: constructing a convolutional network and setting initial network parameters to obtain the initial convolutional network. According to one embodiment, the convolutional network includes a parameter sharing component, a depth estimation component, and a blocking countermeasure component which are coupled. An image input into the convolutional network is first processed by the parameter sharing component to generate a feature map. The feature map is then passed into the depth estimation component and the blocking countermeasure component respectively; the depth estimation component outputs a first prediction result, and the blocking countermeasure component outputs a second prediction result and a third prediction result. The first prediction result predicts the depth information of the input image, the second prediction result predicts the likelihood that the iris in the input image is not a prosthesis (prosthetic irises include, for example, cosmetic contact lenses, artificial eyes and glass eyeballs), and the third prediction result predicts whether the input image is a blocked image.
The following illustrates a specific structure of the convolutional network 400 and components therein, according to some embodiments of the present disclosure. It should be understood that the following structures are shown as examples only, and any convolutional network constructed based on the description of the embodiments of the present disclosure is within the scope of the present disclosure.
Fig. 4 illustrates a schematic structural diagram of a convolutional network 400 according to some embodiments of the present disclosure. As shown in fig. 4, convolutional network 400 includes a parameter sharing component 410, a depth estimation component 420, and a block countermeasure component 430, and depth estimation component 420 and block countermeasure component 430 are coupled to parameter sharing component 410, respectively.
The following describes the convolution network 400 and its internal structure, and the processing flow of the first image I1 with reference to fig. 4 to 6.
The first image I1 is input to the convolutional network 400, and the detail features of the input image (i.e., I1) are first extracted by the parameter sharing component 410 and a feature map (denoted as F) is generated.
As shown in fig. 4, the parameter sharing component 410 includes one 3 × 3 convolutional layer (C), two max pooling layers (MP1 and MP2), and one 1 × 1 convolutional layer (C). The input image passes through the 3 × 3 convolutional layer to obtain a feature X; the inputs of the two max pooling layers are X and -X respectively; the pooled results are spliced together and finally fed into the 1 × 1 convolutional layer to obtain the feature map F. The purpose of the parameter sharing component 410 is to extract sufficiently rich local detail features.
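A minimal PyTorch sketch of this component follows; the channel counts and the pooling stride are assumptions, and only the 3 × 3 convolution, the max pooling of X and -X, the concatenation and the 1 × 1 convolution follow the description above.

```python
import torch
import torch.nn as nn

class ParameterSharing(nn.Module):
    """Sketch of the parameter sharing component: 3x3 conv -> max-pool of X and -X
    in parallel -> concatenation -> 1x1 conv (channel counts are assumptions)."""
    def __init__(self, in_ch: int = 1, mid_ch: int = 32, out_ch: int = 64):
        super().__init__()
        self.conv3 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv1 = nn.Conv2d(2 * mid_ch, out_ch, kernel_size=1)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.conv3(img)
        pooled = torch.cat([self.pool(x), self.pool(-x)], dim=1)  # MP1(X), MP2(-X)
        return self.conv1(pooled)                                 # feature map F
```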
Feature map F is then input to depth estimation component 420 and partitioning countermeasure component 430, respectively.
The depth estimation component 420 performs feature encoding on the feature map F and outputs a first prediction result of predicting depth information of the input image.
In one embodiment, depth estimation component 420 includes: a first feature encoding module 422, a second feature encoding module 424, a third feature encoding module 426, and a feature aggregation module 428. As shown in fig. 4, the second feature encoding module 424 is coupled to the first feature encoding module 422 and the third feature encoding module 426, respectively, and the feature aggregation module 428 is coupled to the first feature encoding module 422, the second feature encoding module 424, and the third feature encoding module 426, respectively.
The first feature encoding module 422 performs feature encoding on the feature map and outputs a first feature map. According to an embodiment of the present disclosure, the first feature encoding module 422 is used to extract shallow features of the feature map.
The second feature encoding module 424 performs feature encoding on the first feature map and outputs a second feature map. According to an embodiment of the present disclosure, the second feature encoding module 424 is used to extract middle-level features of the feature map.
The third feature encoding module 426 performs feature encoding on the second feature map and outputs a third feature map. According to an embodiment of the present disclosure, the third feature encoding module 426 is used to extract high-level features of the feature map.
According to one embodiment of the present disclosure, the shallow features represent lines and textures of the image, the middle-level features represent patterns, local structures and the like, and the high-level features represent object-level features in the image. Of course, the features are not limited thereto.
The feature aggregation module 428 processes the first feature map, the second feature map, and the third feature map respectively to obtain the processed features, and fuses the processed features to generate a fused feature. In addition, the feature aggregation module 428 convolves the fused features and finally outputs the first prediction result.
According to one embodiment of the present disclosure, the first eigen-coding module 422 includes 3 eigen-coding sub-blocks, a pooling layer, and a down-sampling layer coupled in sequence. The second eigen-coding module 424 comprises 4 eigen-coding sub-blocks, a pooling layer and a down-sampling layer coupled in sequence. The third feature encoding module 426 includes 6 feature encoding sub-blocks, a pooling layer, and a down-sampling layer coupled in sequence.
The above-mentioned pooling layer may be max pooling, but is not limited thereto. A pooling layer and a down-sampling layer are added in each feature encoding module so that the output first feature map, second feature map and third feature map keep the same size and number of channels. In one embodiment, the first, second and third feature maps are all 32 × 32 in size and have the same number of channels. Accordingly, the first prediction result is also a feature map of size 32 × 32.
According to one embodiment, the feature encoding sub-blocks include convolutional layers with different convolutional kernel sizes. Fig. 5 shows a schematic structural diagram of a feature encoding sub-block 500 according to an embodiment of the present disclosure. As shown in fig. 5, the feature encoding sub-block 500 includes a 5 × 5 convolutional layer, a 3 × 3 convolutional layer, and a 1 × 1 convolutional layer. The input of the feature encoding sub-block 500, denoted O, is processed by the 5 × 5 convolutional layer and the 3 × 3 convolutional layer respectively to obtain the corresponding feature maps F1 and F2; F1 and F2 are then fused (concatenated) with O, and the result is sent to the 1 × 1 convolutional layer for processing and output.
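A minimal PyTorch sketch of such a sub-block is shown below; the channel count and padding choices are assumptions, and only the parallel 5 × 5 / 3 × 3 convolutions, the fusion with the input O and the final 1 × 1 convolution follow the description.

```python
import torch
import torch.nn as nn

class FeatureEncodingSubBlock(nn.Module):
    """Sketch of the feature encoding sub-block 500: the input O goes through a
    5x5 and a 3x3 convolution in parallel (F1, F2), F1/F2/O are concatenated,
    and a 1x1 convolution produces the output (channel count is an assumption)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, o: torch.Tensor) -> torch.Tensor:
        f1 = self.conv5(o)
        f2 = self.conv3(o)
        return self.conv1(torch.cat([f1, f2, o], dim=1))
```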
Fig. 6 illustrates a structural schematic of a feature aggregation module 428 according to some embodiments of the present disclosure.
The feature aggregation module 428 includes 3 attention sub-blocks, a feature fusion layer, and a convolutional layer. As shown in fig. 6, the 3 attention sub-blocks are respectively and correspondingly coupled to the first feature encoding module 422, the second feature encoding module 424, and the third feature encoding module 426, so as to process the first feature map, the second feature map and the third feature map respectively; the processed results are input into the feature fusion layer (concatenation), which performs channel-wise splicing, and the output result is obtained after a 1 × 1 convolutional layer.
In one embodiment, the attention sub-block includes two max pooling layers (MP1 and MP2), one 1 × 1 convolutional layer and one Sigmoid layer. As shown in fig. 6, the two max pooling layers pool the input feature map along the channel dimension to obtain two feature maps; the two feature maps are spliced (Cat) and sent into the 1 × 1 convolutional layer; the convolved feature map is input into the Sigmoid layer, and the output of the Sigmoid layer is combined with the input feature map by a Hadamard product to obtain the output of the attention sub-block.
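A sketch of one possible attention sub-block is given below. How exactly the two channel-wise max poolings differ is not spelled out above, so, by analogy with the parameter sharing component, pooling over x and -x is assumed here; this is an illustrative choice rather than the embodiment itself.

```python
import torch
import torch.nn as nn

class AttentionSubBlock(nn.Module):
    """Sketch of an attention sub-block: two channel-wise max poolings (taken
    over x and -x here, by assumption), concatenation, 1x1 conv, Sigmoid, and
    a Hadamard product with the input feature map."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(2, 1, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mp1 = x.max(dim=1, keepdim=True).values      # channel-wise max of x
        mp2 = (-x).max(dim=1, keepdim=True).values   # channel-wise max of -x
        attn = self.sigmoid(self.conv1(torch.cat([mp1, mp2], dim=1)))
        return attn * x                              # Hadamard product
```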
The blocking countermeasure component 430 extracts the high-level features of the feature map F and outputs a second prediction result and a third prediction result.
As shown in fig. 4, the partitioning countermeasure component 430 includes: a lightweight convolutional layer 432, a pooling layer 434, a first convolutional layer 436, and a second convolutional layer 438.
As described above, the purpose of block shuffling an image is to force the convolutional network to learn more local detail features, and whether the iris of an input image is a prosthetic iris can be identified by inputting the original image (the first image) and the block-shuffled image (the second image) into the blocking countermeasure component 430. However, block shuffling an image may introduce some additional noise, so, in order to eliminate the influence of this noise, a countermeasure module is introduced in the blocking countermeasure component 430 to determine whether the input image is a block-shuffled image.
Specifically, the lightweight convolutional layer 432 extracts the high-level features of the feature map. The lightweight convolutional layer 432 may use currently common lightweight model series such as MobileNet or ShuffleNet, which the present disclosure does not limit. The output of the lightweight convolutional layer 432 is input to a pooling layer 434, which, in an embodiment according to the present disclosure, uses average pooling to aggregate the extracted high-level features. The pooled features then pass through two 1 × 1 convolutional layers: the first convolutional layer 436 outputs the second prediction result, which predicts that the iris in the input image is not a prosthetic iris, and the second convolutional layer 438 outputs the third prediction result, which predicts whether the input image is a blocked image.
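The structure can be sketched in PyTorch as follows; the backbone is a small stand-in for a MobileNet/ShuffleNet-style network, and the channel counts and two-class output heads are assumptions.

```python
import torch
import torch.nn as nn

class BlockingCountermeasure(nn.Module):
    """Sketch of the blocking countermeasure component: lightweight backbone,
    average pooling, and two 1x1 convolutions giving the second prediction
    (iris not a prosthesis) and the third prediction (blocked vs. original
    image). Backbone choice and channel counts are assumptions."""
    def __init__(self, in_ch: int = 64, feat_ch: int = 128):
        super().__init__()
        # small stand-in for a MobileNet/ShuffleNet-style lightweight backbone
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head_live = nn.Conv2d(feat_ch, 2, kernel_size=1)   # second prediction
        self.head_block = nn.Conv2d(feat_ch, 2, kernel_size=1)  # third prediction

    def forward(self, feat: torch.Tensor):
        h = self.pool(self.backbone(feat))
        return self.head_live(h).flatten(1), self.head_block(h).flatten(1)
```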
Then, in step S340, a first loss value is calculated based on the first prediction result and the label image, and a first network parameter of the convolutional network is adjusted according to the first loss value until a predetermined condition is satisfied, so as to obtain an intermediate convolutional network.
According to an embodiment of the present disclosure, the first prediction result and the label image are each of size 32 × 32, and the first loss value between the first prediction result and the label image is calculated using a contrast loss function. The first network parameters are adjusted according to the first loss value and the convolutional network is updated. The first image is then input into the updated convolutional network to output a new first prediction result, the first loss value between this prediction result and the label image is used to adjust the first network parameters again, and the convolutional network is updated; this iterative process is repeated until a predetermined condition is satisfied, and the corresponding convolutional network is the intermediate convolutional network.
It should be noted that the present disclosure does not impose much limitation on the predetermined condition; it may be, for example, that the difference between the loss values of two or more successive iterations is smaller than a preset small value, or that a predetermined maximum number of iterative updates has been reached.
According to an embodiment of the present disclosure, the first network parameter includes: network parameters of parameter sharing component 410, network parameters of depth estimation component 420, and network parameters of block countermeasure component 430. In other words, the first network parameter is a full network parameter.
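The following sketch illustrates this first training stage, in which the full set of network parameters is updated. The model interface (a module returning the three prediction results), the data loader, the optimizer choice and the use of an MSE loss as a stand-in for the contrast loss named above are all assumptions.

```python
import torch

def train_stage1(model, loader, max_epochs: int = 50, lr: float = 1e-4):
    """First training stage (step S340), a minimal sketch: adjust all network
    parameters from the loss between the first prediction result and the
    32 x 32 label image. `model(first_image)` is assumed to return the
    (first, second, third) prediction results."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()  # stand-in for the contrast loss
    for _ in range(max_epochs):
        for first_image, label_image in loader:
            depth_pred, _, _ = model(first_image)       # first prediction result
            loss1 = criterion(depth_pred, label_image)  # first loss value
            optimizer.zero_grad()
            loss1.backward()
            optimizer.step()
    return model
```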
Subsequently, in step S350, the second image is input into the intermediate convolutional network, processed by the parameter sharing component and the blocking countermeasure component, and then the third prediction result is output.
According to the embodiment of the present disclosure, the process of inputting the second image into the intermediate convolutional network for processing may refer to the foregoing description about the process flow of the first image by the convolutional network 400, and the process flows are substantially the same, and are not described herein again.
It should be noted that the depth estimation component 420 may be frozen at this point, and among the outputs of the blocking countermeasure component 430 only the third prediction result output by the countermeasure module needs attention.
Subsequently, in step S360, a second loss value is calculated based on the third prediction result (obtained in step S330) corresponding to the first image and the third prediction result (obtained in step S350) corresponding to the second image, and a second network parameter of the convolutional network in the middle at this time is adjusted according to the second loss value until a predetermined condition is satisfied, and the corresponding convolutional network is the finally generated convolutional network for detecting the living body object.
According to one embodiment of the present disclosure, the second loss value is calculated using a cross-entropy loss function based on the third prediction results corresponding to the first image and the second image and the labels corresponding to the first image and the second image. The second network parameters are adjusted according to the second loss value and the convolutional network continues to be updated. The second image is then input into the updated convolutional network to output a new third prediction result, the corresponding third prediction result and label are used to adjust the second network parameters of the convolutional network again, and the network is updated; this iterative process is repeated until a predetermined condition is satisfied, and the corresponding convolutional network is the finally generated convolutional network for detecting the living object.
As mentioned above, the present disclosure does not impose much limitation on the predetermined condition; it may be, for example, that the difference between the loss values of two or more successive iterations is smaller than a preset small value, or that a predetermined maximum number of iterative updates has been reached.
According to an embodiment of the present disclosure, the second network parameter is a network parameter of the partitioning countermeasure component.
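A sketch of this second, adversarial stage is given below. The sub-module names `depth_estimation` and `blocking_countermeasure`, the data loader pairing I1 with its block-shuffled I2, and the optimizer choice are assumptions; only the freezing of the depth branch, the 0/1 blocked-image labels and the cross-entropy loss follow the description above.

```python
import torch

def train_stage2(model, loader, max_epochs: int = 50, lr: float = 1e-4):
    """Second training stage (steps S350/S360), a minimal sketch: freeze the
    depth estimation branch and update only the blocking countermeasure
    parameters with a cross-entropy loss over the third prediction results of
    the original image I1 (label 0) and the block-shuffled image I2 (label 1)."""
    for p in model.depth_estimation.parameters():   # assumed sub-module name
        p.requires_grad = False
    optimizer = torch.optim.Adam(model.blocking_countermeasure.parameters(), lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        for first_image, second_image in loader:
            _, _, third_pred_1 = model(first_image)   # third prediction for I1
            _, _, third_pred_2 = model(second_image)  # third prediction for I2
            logits = torch.cat([third_pred_1, third_pred_2], dim=0)
            labels = torch.cat([
                torch.zeros(third_pred_1.size(0), dtype=torch.long),  # I1: not blocked
                torch.ones(third_pred_2.size(0), dtype=torch.long),   # I2: blocked
            ])
            loss2 = ce(logits, labels)
            optimizer.zero_grad()
            loss2.backward()
            optimizer.step()
    return model
```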
According to the method 300 of the present disclosure, the first image, the second image and the label image used for training the convolutional network are obtained by respectively preprocessing an infrared image containing the iris and three-dimensional coordinate information. The first image and the second image are then input into the convolutional network in turn, and the network parameters of the convolutional network are adjusted at least based on the output results and the label image until training is completed. The convolutional network generated by the method 300 can be regarded as comprising two parts, namely a depth estimation branch and a blocking countermeasure branch, where the two branches share part of their parameters (i.e., the parameter sharing component). The depth estimation branch learns the depth information of images and predicts the depth information of the input image, so as to recognize disguises such as prints, photographs and videos. The blocking countermeasure branch learns finer detail information about the iris in the image and predicts the likelihood that the iris in the input image is not a prosthesis, so as to recognize iris prostheses such as cosmetic contact lenses, artificial eyes and glass eyeballs.
Fig. 7 shows a flow diagram of a method 700 of detecting a living object according to one embodiment of the present disclosure. Method 700 is implemented with system 100. It should be understood that the contents of the method 700 and the method 300 are complementary and repeated, and are not described again here.
As shown in fig. 7, the method 700 begins at step S710. In step S710, an infrared image of an eye region including an object to be detected is acquired.
In one embodiment, an infrared image containing an eye region (in particular an iris region) of the object to be detected is acquired with the image acquisition unit 110.
Subsequently, in step S720, the infrared image is preprocessed to generate an image to be detected.
In one embodiment, iris detection is performed on the infrared image to determine the location information of the iris. And then, based on the position information of the iris, cutting out the iris area from the infrared image, and zooming the iris area to a fixed size to be used as an image to be detected.
It should be noted that the image to be detected may be generated in the same manner as the first image, and the detailed description may refer to the processing of the infrared image in step S320, which is not described herein again.
Subsequently, in step S730, the image to be detected is input into a convolution network to be processed to output a first prediction result predicting the depth information of the image to be detected and a second prediction result predicting that the iris in the image to be detected is not a prosthetic iris.
Where the convolutional network may be generated by the method 300 training. For the processing procedure of the input image (i.e., the image to be detected) by the convolutional network, reference may be made to the related description in the method 300, and details are not repeated here.
Subsequently, in step S740, the first prediction result and the second prediction result are processed respectively to obtain processed prediction values.
According to one embodiment of the present disclosure, the first prediction result and the second prediction result are processed by the following steps.
First, a mean value of the first prediction results is determined as a first prediction value. In other words, the first predicted value is obtained by adding and averaging all the values in the first predicted result.
As described above, in one embodiment the first prediction result is an image of size 32 × 32; its pixel values are first summed, and the sum is then divided by 1024 (i.e., 32 × 32). The quotient is the first predicted value, denoted s1.
Then, the second prediction result is input into a normalized exponential function (such as a softmax function) for processing, and a second predicted value is output. In one embodiment, the normalized exponential function expresses the binary classification result (whether the iris is a prosthetic iris) in the form of a probability, and this probability is the second predicted value, denoted s2.
And finally, taking the first predicted value and the second predicted value as processed predicted values.
Subsequently, in step S750, when the processed predicted value satisfies the threshold condition, it is determined that the object to be detected is a living body.
According to an embodiment of the present disclosure, the object to be detected is determined to be a living body when the first predicted value and the second predicted value both satisfy their respective threshold conditions. For example, two thresholds T1 and T2 are set; only when s1 > T1 and s2 > T2 is the object to be detected determined to be a living body, otherwise it is determined to be a prosthesis.
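As a final sketch, the decision rule could be written as below; the default thresholds are placeholders, since T1 and T2 would in practice be tuned on validation data:

```python
def is_live(s1: float, s2: float, t1: float = 0.5, t2: float = 0.5) -> bool:
    """Living body only if both processed predicted values exceed their thresholds."""
    return s1 > t1 and s2 > t2
```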
According to the method 700 for detecting a living object of the present disclosure, the image to be detected is input into the convolutional network; after processing by the convolutional network, the depth information of the image to be detected can be predicted so as to identify disguised objects such as prints, photographs and videos; meanwhile, the likelihood that the iris in the image to be detected is not a prosthesis can be predicted so as to identify iris prostheses such as cosmetic contact lenses, artificial eyes and glass eyeballs, and thereby determine whether the object to be detected is a living object. With this scheme, prosthetic irises can be detected and the security of iris recognition devices is improved.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the disclosure.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the methods of the present disclosure according to instructions in the program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with examples of the present disclosure. The required structure for constructing such a system will be apparent from the description above. Moreover, this disclosure is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the present disclosure as described herein, and the above description of any specific language is provided to disclose preferred embodiments of the present disclosure.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the disclosure and to form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purposes of this disclosure.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the disclosure as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present disclosure is intended to be illustrative, but not limiting, of the scope of the disclosure, which is set forth in the following claims.

Claims (10)

1. A method of generating a convolutional network for detecting a living object, comprising the steps of:
acquiring an infrared image and three-dimensional coordinate information of an eye region;
processing the infrared image and the three-dimensional coordinate information to generate a first image and a second image containing the iris and a label image indicating the iris depth information;
inputting the first image into an initial convolutional network for processing, wherein the convolutional network comprises a parameter sharing component, a depth estimation component and a blocking countermeasure component which are coupled, outputting a first prediction result through the depth estimation component, and outputting a second prediction result and a third prediction result through the blocking countermeasure component;
calculating a first loss value based on the first prediction result and the label image, and adjusting a first network parameter of the convolutional network according to the first loss value until a predetermined condition is met to obtain an intermediate convolutional network;
inputting the second image into the intermediate convolutional network, and outputting a third prediction result after the second image is processed by the parameter sharing component and the blocking countermeasure component;
calculating a second loss value based on a third prediction result corresponding to the first image and a third prediction result corresponding to the second image, and adjusting a second network parameter according to the second loss value until a predetermined condition is satisfied, wherein the corresponding convolutional network is a finally generated convolutional network for detecting the living body object,
wherein the first network parameter comprises: network parameters of the parameter sharing component, network parameters of the depth estimation component, and network parameters of the blocking countermeasure component; and the second network parameter comprises: network parameters of the blocking countermeasure component.
2. The method of claim 1, wherein the processing the infrared image and the three-dimensional coordinate information to generate a first image and a second image comprising an iris, and a label image indicative of iris depth information comprises:
performing iris detection on the infrared image to determine position information of the iris;
determining an iris region from the infrared image as a first image based on the position information of the iris;
performing blocking processing on the first image to generate a second image;
determining depth coordinates of the iris region from the three-dimensional coordinate information;
normalizing the depth coordinates to generate a label image.
3. The method of claim 1, wherein the step of performing blocking processing on the first image to generate the second image comprises:
carrying out blocking processing on the first image to obtain a plurality of image blocks;
randomly ordering image blocks of each row in the row direction of the image to obtain an intermediate disordered image;
and randomly sequencing the image blocks of each column in the intermediate disordered image in the column direction of the image to obtain a second image.
4. The method of claim 1, wherein the depth estimation component and the blocking countermeasure component are each coupled with the parameter sharing component,
the parameter sharing component is adapted to extract detail features of the input image and generate a feature map;
the depth estimation component is adapted to perform feature encoding on the feature map and output a first prediction result for predicting depth information of the input image;
the blocking countermeasure component is adapted to extract high-level features of the feature map, and output a second prediction result predicting that the iris in the input image is not a prosthetic iris and a third prediction result predicting whether the input image is a block image; and
the depth estimation component comprises:
a first feature encoding module adapted to perform feature encoding on the feature map and output a first feature map;
a second feature encoding module adapted to perform feature encoding on the first feature map and output a second feature map;
a third feature encoding module adapted to perform feature encoding on the second feature map and output a third feature map;
a feature aggregation module adapted to process the first feature map, the second feature map and the third feature map respectively to obtain processed features, fuse the processed features to generate a fused feature, perform convolution on the fused feature, and finally output the first prediction result,
wherein the first feature encoding module comprises 3 feature encoding sub-blocks, a pooling layer and a down-sampling layer which are sequentially coupled; the second feature encoding module comprises 4 feature encoding sub-blocks, a pooling layer and a down-sampling layer which are sequentially coupled; the third feature encoding module comprises 6 feature encoding sub-blocks, a pooling layer and a down-sampling layer which are sequentially coupled; the feature aggregation module comprises 3 attention sub-blocks, a feature fusion layer and a convolution layer, the 3 attention sub-blocks being correspondingly coupled to the first feature encoding module, the second feature encoding module and the third feature encoding module to respectively process the first feature map, the second feature map and the third feature map; and the feature encoding sub-blocks comprise convolution layers with different convolution kernel sizes.
5. The method of claim 4, wherein the blocking countermeasure component comprises:
a lightweight convolutional layer adapted to extract high-level features of the feature map;
a pooling layer adapted to aggregate the extracted high-level features;
a first convolution layer adapted to output a second prediction result predicting that an iris in the input image is not a prosthetic iris; and
a second convolution layer adapted to output a third prediction result predicting whether the input image is a block image.
6. A method for detecting a living object, comprising the steps of:
acquiring an infrared image of an eye region containing an object to be detected;
preprocessing the infrared image to generate an image to be detected;
inputting the image to be detected into a convolutional network for processing so as to output a first prediction result predicting the depth information of the image to be detected and a second prediction result predicting that the iris in the image to be detected is not a prosthetic iris;
respectively processing the first prediction result and the second prediction result to obtain processed predicted values;
when the processed predicted value meets the threshold condition, determining that the object to be detected is a living body,
wherein the convolutional network is generated by performing the method of any of claims 1-5.
7. The method of claim 6, wherein the step of respectively processing the first prediction result and the second prediction result to obtain processed predicted values comprises:
determining the mean value of the first prediction result as a first predicted value;
inputting the second prediction result into a normalized exponential function for processing, and outputting a second predicted value;
taking the first predicted value and the second predicted value as processed predicted values; and
when the processed predicted value meets the threshold condition, the step of determining that the object to be detected is a living body comprises the following steps:
and when the first predicted value and the second predicted value both meet respective threshold conditions, determining that the object to be detected is a living body.
8. A system for detecting a living object, comprising:
an image acquisition unit adapted to acquire an infrared image containing an eye region of an object to be detected and to preprocess the infrared image to generate an image to be detected;
an image processing unit adapted to input the image to be detected into a convolutional network for processing so as to output a first prediction result predicting the depth information of the image to be detected and a second prediction result predicting that the iris in the image to be detected is not a prosthetic iris;
a prediction result unit adapted to respectively process the first prediction result and the second prediction result to obtain processed predicted values, and to determine that the object to be detected is a living body when the processed predicted values meet a threshold condition; and
a convolutional network generating unit adapted to train and generate the convolutional network and to update the convolutional network.
9. A computing device, comprising:
at least one processor and a memory storing program instructions;
the program instructions, when read and executed by the processor, cause the computing device to perform the method of any of claims 1-5 and/or perform the method of claim 6 or 7.
10. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform the method of any of claims 1-5 and/or perform the method of claim 6 or 7.
CN202210274678.3A 2022-03-21 2022-03-21 Method for generating convolution network for detecting living body object Active CN114373218B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210274678.3A CN114373218B (en) 2022-03-21 2022-03-21 Method for generating convolution network for detecting living body object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210274678.3A CN114373218B (en) 2022-03-21 2022-03-21 Method for generating convolution network for detecting living body object

Publications (2)

Publication Number Publication Date
CN114373218A true CN114373218A (en) 2022-04-19
CN114373218B CN114373218B (en) 2022-06-14

Family

ID=81145900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210274678.3A Active CN114373218B (en) 2022-03-21 2022-03-21 Method for generating convolution network for detecting living body object

Country Status (1)

Country Link
CN (1) CN114373218B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709568A (en) * 2016-12-16 2017-05-24 北京工业大学 RGB-D image object detection and semantic segmentation method based on deep convolution network
US20180276488A1 (en) * 2017-03-27 2018-09-27 Samsung Electronics Co., Ltd. Liveness test method and apparatus
CN108664839A (en) * 2017-03-27 2018-10-16 北京三星通信技术研究有限公司 A kind of image processing method and equipment
CN108427871A (en) * 2018-01-30 2018-08-21 深圳奥比中光科技有限公司 3D faces rapid identity authentication method and device
CN109978077A (en) * 2019-04-08 2019-07-05 南京旷云科技有限公司 Visual identity methods, devices and systems and storage medium
US20200394289A1 (en) * 2019-06-14 2020-12-17 Microsoft Technology Licensing, Llc Biometric verification framework that utilizes a convolutional neural network for feature matching
CN110909634A (en) * 2019-11-07 2020-03-24 深圳市凯迈生物识别技术有限公司 Visible light and double infrared combined rapid in vivo detection method
CN112949353A (en) * 2019-12-10 2021-06-11 北京眼神智能科技有限公司 Iris silence living body detection method and device, readable storage medium and equipment
CN111428654A (en) * 2020-03-27 2020-07-17 北京万里红科技股份有限公司 Iris identification method, device and storage medium
CN111611851A (en) * 2020-04-10 2020-09-01 北京中科虹霸科技有限公司 Model generation method, iris detection method and device
CN112801067A (en) * 2021-04-13 2021-05-14 北京万里红科技股份有限公司 Method for detecting iris light spot and computing equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LINGXIAO HE et al.: "Multi-patch Convolution Neural Network for Iris Liveness Detection", 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS), 22 December 2016 (2016-12-22), pages 1-7 *
RANJANA KOSHY et al.: "Optimizing Deep CNN Architectures for Face Liveness Detection", Entropy, vol. 21, no. 4, 20 April 2019 (2019-04-20), pages 1-16 *
SONG Ping: "Research on Iris Liveness Detection Method Based on Light Field Imaging", China Masters' Theses Full-text Database, Information Science and Technology Series, 15 August 2019 (2019-08-15), pages 138-127 *
WANG Yahang: "Research on Face Liveness Detection Algorithms Based on Deep Learning", China Masters' Theses Full-text Database, Information Science and Technology Series, 15 January 2022 (2022-01-15), pages 138-2657 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100730A (en) * 2022-07-21 2022-09-23 北京万里红科技有限公司 Iris living body detection model training method, iris living body detection method and device
CN115100730B (en) * 2022-07-21 2023-08-08 北京万里红科技有限公司 Iris living body detection model training method, iris living body detection method and device

Also Published As

Publication number Publication date
CN114373218B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN109871781B (en) Dynamic gesture recognition method and system based on multi-mode 3D convolutional neural network
CN108664880B (en) Activity test method and apparatus
KR102359556B1 (en) User certification method using fingerprint image and generating method of coded model for user certification
CN106778525B (en) Identity authentication method and device
CN111444881A (en) Fake face video detection method and device
WO2020164278A1 (en) Image processing method and device, electronic equipment and readable storage medium
CN117121068A (en) Personalized biometric anti-fraud protection using machine learning and enrollment data
CN105654056A (en) Human face identifying method and device
US11315358B1 (en) Method and system for detection of altered fingerprints
CN112567398A (en) Techniques for matching different input data
CN114373218B (en) Method for generating convolution network for detecting living body object
CN114241459A (en) Driver identity verification method and device, computer equipment and storage medium
CN114387656B (en) Face changing method, device, equipment and storage medium based on artificial intelligence
CN112712468B (en) Iris image super-resolution reconstruction method and computing device
Simanjuntak et al. Fusion of cnn-and cosfire-based features with application to gender recognition from face images
CN117496582B (en) Face recognition model training method and device, electronic equipment and storage medium
CN117237757A (en) Face recognition model training method and device, electronic equipment and medium
CN111931148A (en) Image processing method and device and electronic equipment
Shreya et al. Gan-enable latent fingerprint enhancement model for human identification system
WO2021148844A1 (en) Biometric method and system for hand analysis
Ren et al. Face and facial expressions recognition and analysis
Thakur et al. Design of Semantic Segmentation Algorithm to Classify Forged Pixels
Roy et al. Improvement of iris recognition performance using region-based active contours, genetic algorithms and SVMs
CN113505716B (en) Training method of vein recognition model, and recognition method and device of vein image
Bala et al. Iris recognition using improved xor-sum code

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant