CN112287723A - In-vivo detection method and device based on deep learning and storage medium

Info

Publication number
CN112287723A
CN112287723A
Authority
CN
China
Prior art keywords
recognized
lip language
information
text information
language information
Prior art date
Legal status
Pending
Application number
CN201910668140.9A
Other languages
Chinese (zh)
Inventor
孔志飞
赵幸福
赵立军
Current Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Original Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zhongguancun Kejin Technology Co Ltd
Priority to CN201910668140.9A
Publication of CN112287723A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships

Abstract

The application discloses a living body detection method and device based on deep learning, and a storage medium. The method comprises the following steps: acquiring a video related to an object to be recognized, wherein the video contains a mouth region image of the object to be recognized and is obtained by shooting the object to be recognized while it reads first text information; generating lip language information corresponding to the mouth region image by using a recognition model trained on the basis of deep learning; and judging whether the object to be recognized is a living body according to the lip language information and the first text information. The whole lip language in-vivo detection method based on deep learning has the technical effects of strong generalization capability, a simple and reliable flow, and high recognition accuracy.

Description

In-vivo detection method and device based on deep learning and storage medium
Technical Field
The present application relates to the field of information identification, and in particular, to a method and an apparatus for in vivo detection based on deep learning, and a storage medium.
Background
Living body detection is a method for verifying the real physiological characteristics of an object in certain identity verification scenarios. In face recognition applications, living body detection can verify whether a user is operating as a real living person by combining actions such as blinking, opening the mouth, shaking the head and nodding with technologies such as face key point positioning and face tracking. It can effectively resist common attack means such as photos, face swapping, masks, occlusion and screen recapture, thereby helping users to screen out fraudulent behavior and safeguarding their interests. The current methods for performing living body detection using mouth information mainly include the following two:
the method comprises the steps of extracting a plurality of video frames from a face video to be detected, obtaining a plurality of key point positions of a mouth of each video frame extracted from the face video to be detected, obtaining the mouth length and the mouth width of the mouth of each extracted video frame through the plurality of key point positions of the mouth, obtaining the mouth numerical value of the corresponding video frame through calculating the ratio of the mouth length to the mouth width, and judging the mouth movement condition of the face video to be detected based on the mouth numerical value of each extracted video frame.
In the second method, video information of the user to be identified is collected while the user reads verification content, mouth feature information of the user is obtained from the video information, and it is judged whether this mouth feature information matches the reference lip language feature sequence corresponding to the verification content in a lip language library; if it matches, the user to be identified is determined to be a living body.
The first method has poor generalization capability: even a slightly turned, side-facing pose can affect the in-vivo detection result. The second method requires the user to first enter specific mouth information into a database, which makes its concrete implementation cumbersome and difficult.
Aiming at the technical problems that the existing in-vivo detection methods in the prior art have poor generalization capability and are cumbersome and difficult to implement, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the disclosure provide a living body detection method and device based on deep learning, and a storage medium, which at least solve the technical problems that the existing in-vivo detection methods in the prior art have poor generalization capability and are cumbersome and difficult to implement.
According to an aspect of an embodiment of the present disclosure, there is provided a living body detection method based on deep learning, including: acquiring a video related to an object to be recognized, wherein the video contains a mouth area image of the object to be recognized, and the video is obtained by shooting the object to be recognized in the process of reading first text information by the object to be recognized; generating lip language information corresponding to the mouth region image by using a recognition model based on deep learning training; and judging whether the object to be identified is a living body or not according to the lip language information and the first text information.
According to another aspect of the embodiments of the present disclosure, there is also provided a storage medium including a stored program, wherein the method described above is performed by a processor when the program is executed.
According to another aspect of the embodiments of the present disclosure, there is also provided a living body detecting apparatus based on deep learning, including: the device comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring a video related to an object to be recognized, the video comprises a mouth area image of the object to be recognized, and the video is obtained by shooting the object to be recognized in the process of reading first text information by the object to be recognized; the recognition module is used for generating lip language information corresponding to the mouth region image by using a recognition model based on deep learning training; and the judging module is used for judging whether the object to be identified is a living body according to the lip language information and the first text information.
According to another aspect of the embodiments of the present disclosure, there is also provided a lip language in-vivo detection device based on deep learning, including: a processor; and a memory coupled to the processor for providing instructions to the processor for processing the following processing steps: acquiring a video related to an object to be recognized, wherein the video contains a mouth area image of the object to be recognized, and the video is obtained by shooting the object to be recognized in the process of reading first text information by the object to be recognized; generating lip language information corresponding to the mouth region image by using a recognition model based on deep learning training; and judging whether the object to be identified is a living body or not according to the lip language information and the first text information.
In the embodiment of the disclosure, the lip language information of the object to be recognized when reading the text information can be quickly recognized. And whether the object to be recognized is a living body can be determined by matching the recognized lip language information with the text information read by the object to be recognized. Because the recognition model is based on deep learning training, the recognition model can well recognize longer sequence information and has high recognition rate. Thus, based on the recognition model, it is possible to quickly and accurately recognize the lip language information corresponding to the mouth region image and match the recognized lip language information with the text information. Thus, according to the technical solution described in the embodiments of the present disclosure, it is possible to quickly determine whether or not the object to be recognized is a living body. In addition, the in-vivo detection method based on deep learning has the technical effects of strong generalization capability, simple and reliable flow and high identification accuracy, and further solves the technical problems of poor generalization capability and complex and difficult specific implementation of the existing in-vivo detection method.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
fig. 1 is a hardware configuration block diagram of a [ computer terminal (or mobile device) ] for implementing the method according to embodiment 1 of the present disclosure;
fig. 2 is a schematic flowchart of a deep learning-based in-vivo detection method according to a first aspect of embodiment 1 of the present disclosure;
FIG. 3 is a schematic structural diagram of a Transformer model according to a first aspect of embodiment 1 of the present disclosure;
fig. 4 is a schematic diagram of a deep learning-based in-vivo detection device according to embodiment 2 of the present disclosure; and
fig. 5 is a schematic diagram of a deep learning-based in-vivo detection device according to embodiment 3 of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure. It is to be understood that the described embodiments are merely exemplary of some, and not all, of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the nouns or terms appearing in the description of the embodiments of the present disclosure are applicable to the following explanations:
converter model: the converter model described in this application is a Chinese translation of "transform model", a model introduced by Google for natural language processing. Hereinafter also referred to as "transformer model"
Example 1
According to the present embodiment, there is provided a method embodiment of a deep learning based liveness detection method. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions and, although a logical order is shown in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one here.
The method provided by the embodiment can be executed in a mobile terminal, a computer terminal or a similar operation device. Fig. 1 shows a hardware configuration block diagram of a computer terminal (or mobile device) for implementing a deep learning-based liveness detection method. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, processing devices such as GPUs, microprocessors (MCUs), or programmable logic devices (FPGAs)), memories 104 for storing data, and a transmission module 106 for communication functions. Besides, the computer terminal may also comprise: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the disclosed embodiments, the data processing circuit acts as a processor control (e.g., selection of a variable resistance termination path connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the deep learning based living body detection method in the embodiment of the disclosure, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, implements the deep learning based living body detection method of the application program. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
It should be noted here that in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that fig. 1 is only one specific example and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
In the operating environment described above, according to the first aspect of the present embodiment, a living body detection method based on deep learning is provided. Fig. 2 shows a flow diagram of the method, which, with reference to fig. 2, comprises:
s202: acquiring a video related to an object to be recognized, wherein the video contains a mouth area image of the object to be recognized, and the video is obtained by shooting the object to be recognized in the process of reading first text information by the object to be recognized;
s204: generating lip language information corresponding to the mouth region image by using a recognition model based on deep learning training; and
s206: and judging whether the object to be identified is a living body or not according to the lip language information and the first text information.
As described in the background art, the current methods for performing living body detection using mouth information mainly include two. In the first method, a plurality of video frames are extracted from the face video to be detected, a plurality of mouth key point positions are obtained for each extracted video frame, the mouth length and mouth width of each frame are derived from these key points, a mouth value is computed for each frame as the ratio of mouth length to mouth width, and the mouth movement in the face video to be detected is then judged from the mouth values of the extracted frames. In the second method, video information of the user to be identified is collected while the user reads verification content, mouth feature information of the user is obtained from the video information, and it is judged whether this mouth feature information matches the reference lip language feature sequence corresponding to the verification content in a lip language library; if it matches, the user to be identified is confirmed as a living body. The first method has poor generalization capability, since even a slightly turned, side-facing pose can affect the in-vivo detection result, and the second method requires the user to first enter specific mouth information into a database, which makes its concrete implementation cumbersome and difficult.
In view of the problems in the background art, the present embodiment acquires a video related to an object to be recognized, as shown in fig. 2. The video contains a mouth region image of the object to be recognized and is obtained by shooting the object to be recognized while it reads the first text information. The first text information may be, for example but not limited to, a sentence such as "when the youth are strong, the nation is strong". Then, the mouth region image is recognized by using a recognition model trained on the basis of deep learning, and lip language information corresponding to the mouth region image is generated. Finally, whether the object to be recognized is a living body is judged according to the lip language information and the first text information. If the generated lip language information matches the first text information, the object to be recognized is determined to be a living body; conversely, if the generated lip language information does not match the first text information, the object to be recognized is determined not to be a living body.
Thus, in this way, the lip language information corresponding to the mouth region image can be quickly recognized. And whether the object to be recognized is a living body can be determined by matching the recognized lip language information with the text information read by the object to be recognized. And because the recognition model is based on deep learning training, the recognition model can well recognize longer sequence information and has high recognition rate. Thus, based on the recognition model, the lip language information corresponding to the mouth region image can be recognized quickly and accurately, so that whether the object to be recognized is a living body or not can be determined quickly from the recognized lip language and the first text information. The whole lip language in-vivo detection method based on deep learning has the technical effects of strong generalization capability, simple and reliable flow and high identification accuracy. Further solves the technical problems of poor generalization capability and complex and difficult specific implementation of the existing in-vivo detection method.
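Illustratively, the overall flow of steps S202 to S206 may be sketched in Python as follows. This is a minimal sketch only: the helper callables (extract_mouth_frames, recognition_model, encode_text) and the cosine-similarity threshold of 0.9 are assumptions introduced for the example and are not prescribed by this embodiment.

    import torch

    def liveness_check(video_path, first_text, extract_mouth_frames,
                       recognition_model, encode_text, threshold=0.9):
        # S202: obtain the video and the mouth region images of the object to be recognized
        mouth_frames = extract_mouth_frames(video_path)
        # S204: generate lip language information with the deep-learning-trained recognition model
        lip_info = recognition_model(mouth_frames)      # e.g. a sequence of character vectors
        # S206: match the lip language information against the first text information
        text_info = encode_text(first_text)             # text encoding information (vectors)
        score = torch.nn.functional.cosine_similarity(lip_info, text_info, dim=-1).mean()
        return score.item() >= threshold                # True -> judged to be a living body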
Optionally, the recognition model includes a feature extraction model based on image feature extraction, a residual network model, and a conversion model based on natural language processing. The operation of generating lip language information corresponding to the mouth region image using the recognition model then includes: generating a first feature sequence corresponding to the mouth region image by using the feature extraction model; performing feature extraction on the first feature sequence by using the residual network model to generate a second feature sequence; and generating the lip language information from the second feature sequence by using the conversion model.
Specifically, when the recognition model includes a feature extraction model, a residual network model and a conversion model, a first feature sequence corresponding to the mouth region image may first be generated using the feature extraction model based on image feature extraction. The feature extraction model may be, for example but not limited to, a 3D convolutional neural network model or another type of feature extraction model, and the first feature sequence is a feature sequence in time and space corresponding to the mouth region image. Then, feature extraction is performed on the first feature sequence by using the residual network model to generate a second feature sequence. In this way higher-level information is extracted: adding a residual network model (Resnet model) allows more useful information to be extracted and reduces information loss.
Further, the second feature sequence is converted into lip language information by using the conversion model. The conversion model may be, for example but not limited to, a Transformer model. For example: converting the second feature sequence with the conversion model yields lip language information that is, in order, the vectors corresponding to the characters "less", "year", "strong", "then", "country" and "strong". The lip language information generated at this time is a sequence of vectors corresponding to "when the youth are strong, the nation is strong".
Therefore, in this way, global information can be acquired in one step, and longer sequence information can be recognized well. The residual network model (Resnet model) makes it possible to extract more useful information and reduce information loss. Moreover, training the conversion model uses fewer resources and yields high recognition accuracy, and the conversion model can be computed in parallel, which shortens the training time.
In addition, a method of cascading two network structures, namely a residual network model (Resnet model) and a Transformer model, is used to decode the feature sequence corresponding to the lip language video, so that the advantages of both network structures can be brought into full play and the lip language recognition capability becomes stronger and more accurate.
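A condensed PyTorch sketch of this cascade is given below, with a single 3D convolution standing in for the feature extraction model, a small residual block standing in for the Resnet trunk, and the standard nn.Transformer as the conversion model; all layer sizes and the vocabulary size are illustrative assumptions rather than values fixed by this embodiment.

    import torch
    import torch.nn as nn

    class LipRecognitionModel(nn.Module):
        def __init__(self, vocab_size=3000, d_model=64):
            super().__init__()
            # Feature extraction model: 3D convolution over (T, H, W) of the mouth clip
            self.frontend = nn.Sequential(
                nn.Conv3d(3, d_model, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
                nn.BatchNorm3d(d_model), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d((None, 1, 1)),   # keep the time axis, pool H and W to 1 x 1
            )
            # Residual block standing in for the Resnet trunk (applied per frame feature)
            self.res_block = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(inplace=True),
                                           nn.Linear(d_model, d_model))
            # Conversion model: standard Transformer encoder-decoder over the feature sequence
            self.transformer = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2,
                                              num_decoder_layers=2, batch_first=True)
            self.tgt_embed = nn.Embedding(vocab_size, d_model)
            self.classifier = nn.Linear(d_model, vocab_size)

        def forward(self, clip, tgt_tokens):
            # clip: (batch, 3, 16, 112, 112); tgt_tokens: (batch, L) previously decoded characters
            feats = self.frontend(clip)               # (batch, 64, 16, 1, 1)
            feats = feats.flatten(2).transpose(1, 2)  # first feature sequence: (batch, 16, 64)
            feats = feats + self.res_block(feats)     # second feature sequence (residual connection)
            out = self.transformer(feats, self.tgt_embed(tgt_tokens))
            return self.classifier(out)               # per-position character logits

    # clip = torch.randn(1, 3, 16, 112, 112); tokens = torch.zeros(1, 8, dtype=torch.long)
    # logits = LipRecognitionModel()(clip, tokens)    # shape (1, 8, vocab_size)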
Optionally, the operation of determining whether the object to be recognized is a living body according to the lip language information and the first text information includes: matching the lip language information with the first text information; and judging whether the object to be identified is a living body according to the matching result.
Specifically, according to the technical solution of the present embodiment, the lip language information is first matched with the first text information, and whether the object to be recognized is a living body is then judged according to the matching result. This may be decided by checking whether the matching result is larger than a preset threshold: if it is larger than the preset threshold, the object to be recognized is determined to be a living body; otherwise it is not. For example: if the preset threshold of the matching similarity is 90% and the matching result is a similarity of 95% between the lip language information and the first text information, the object to be recognized is determined to be a living body because the matching result is larger than the preset threshold. In this way, it is possible to quickly and accurately determine whether the object to be recognized is a living body.
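As a small illustration of this decision rule (character-level sequence similarity from Python's difflib is used here as a stand-in for the matching score; it is an assumption of the example, not part of this embodiment):

    from difflib import SequenceMatcher

    def is_living_body(lip_text: str, first_text: str, threshold: float = 0.90) -> bool:
        # Matching result: similarity between the recognized lip language text and the first text
        similarity = SequenceMatcher(None, lip_text, first_text).ratio()
        return similarity > threshold   # above the preset threshold -> living body

    # A matching result of 0.95 against a preset threshold of 0.90 yields True, as in the example above.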
Optionally, the operation of matching the lip language information with the first text information includes: acquiring text coding information corresponding to the first text information; judging whether the lip language information is matched with the text coding information; and determining that the lip language information is matched with the first text information under the condition that the lip language information is matched with the text coding information.
In general, lip language information generated by a conversion model based on natural language processing is coded information (for example, a vector) corresponding to actual text information. Therefore, when matching the lip language information with the first text information, the matching operation cannot be directly performed, and the first text information also needs to be converted into corresponding text encoding information (such as a vector, etc.), and then the matching operation is performed. So that in case the lip language information matches the text encoding information (e.g. the similarity is above a predetermined threshold), it can be determined that the lip language information matches the first text information. Conversely, in the case where the lip language information does not match the text encoding information (e.g., the similarity is below a predetermined threshold), it is determined that the lip language information does not match the first text information.
In addition, in reverse, the operation of matching the lip language information with the first text information may include: converting the lip language information into corresponding second text information; and matching the second text information with the first text information. So that in case the first text information matches the second text information (e.g. the similarity is above a predetermined threshold), it can be determined that the lip language information matches the first text information. Conversely, in a case where the first text information does not match the second text information (e.g., the similarity is below a predetermined threshold), it is determined that the lip language information does not match the first text information.
Optionally, before performing an operation of recognizing the mouth region image by using the recognition model based on the deep learning training, the method further includes: a mouth region image is extracted from the video.
Specifically, when the object to be recognized reads the first text information, the reading process is recorded, thereby obtaining the video. Then, before the operation of recognizing the mouth region image in the video, the video needs to be preprocessed to extract the valid data frames in the video, that is, to extract the mouth region image. In this way, invalid data frames are filtered out, the workload of the recognition model is reduced, and the recognition accuracy of the model is further guaranteed.
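One possible implementation of this preprocessing step is sketched below; OpenCV's Haar face detector and the crop of the lower third of the face box are assumptions made for illustration, since this embodiment does not prescribe a particular detector.

    import cv2

    def extract_mouth_frames(video_path, size=(112, 112)):
        detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        cap, frames = cv2.VideoCapture(video_path), []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            if len(faces) == 0:
                continue                                      # drop invalid data frames with no face
            x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
            mouth = frame[y + 2 * h // 3 : y + h, x : x + w]  # lower third of the face box
            frames.append(cv2.resize(mouth, size))
        cap.release()
        return frames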
Optionally, the operation of generating a first feature sequence corresponding to the mouth region image by using the feature extraction model includes: a first feature sequence corresponding to the mouth region image is generated using a 3D convolutional neural network model.
Specifically, the mouth region image is used as the input of the 3D convolutional neural network model, and a first feature sequence of the mouth region image in space and time is obtained as the output. For example: a mouth region image sequence of size 16 × 112 × 112 × 3 is taken as the input of the 3D convolutional neural network model, where "16" indicates that the number of frames of the mouth region image is 16, "112 × 112" represents that the width and height of each image are 112 × 112, and "3" represents that each frame has 3 channels (for example, three RGB channels). Through the 3D convolutional network, the feature sequences respectively corresponding to the 16 frames can be extracted from the mouth region image, for example a feature sequence of size 16 × 1 × 1 × 64. Here "16" represents that the feature sequence contains 16 features, and "1 × 1 × 64" indicates that each feature includes 64 channels, each channel being a 1 × 1 matrix (i.e., each channel contains one element, so that the 64 channels constitute a 64-dimensional feature vector). Thus, in this way, a feature sequence consisting of 16 64-dimensional feature vectors can be generated. For example, by providing an average pooling layer in the 3D convolutional neural network model, the width and height dimensions of the mouth region image can be reduced to 1 × 1.
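The shape bookkeeping described here can be checked with a few lines of PyTorch; only the 16 × 112 × 112 × 3 input and 16 × 1 × 1 × 64 output sizes come from the description above, while the particular layer stack is an assumption of the sketch.

    import torch
    import torch.nn as nn

    clip = torch.randn(16, 112, 112, 3)             # 16 frames, 112 x 112 pixels, 3 RGB channels
    x = clip.permute(3, 0, 1, 2).unsqueeze(0)       # -> (1, 3, 16, 112, 112) for Conv3d

    frontend = nn.Sequential(
        nn.Conv3d(3, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool3d((16, 1, 1)),           # average pooling collapses width and height to 1 x 1
    )
    features = frontend(x)                          # (1, 64, 16, 1, 1)
    sequence = features.flatten(2).transpose(1, 2)  # (1, 16, 64): 16 feature vectors of 64 dimensions
    print(sequence.shape)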
Optionally, the operation of generating the lip language information from the second feature sequence by using the conversion model includes: generating the lip language information from the second feature sequence using a Transformer model.
FIG. 3 shows a schematic structural diagram of the Transformer model. Specifically, referring to fig. 3, the Transformer model used in the present embodiment is divided into two parts, an encoder and a decoder, in which multiple multi-head self-attention mechanisms and fully-connected feed-forward networks are stacked. The encoder is a stack of self-attention layers in which the input tensor simultaneously serves as the queries, keys and values. The inputs to the second multi-head attention mechanism of the decoder are the keys and values from the encoder output and the queries from the output of the preceding decoder sub-layer. Natural language processing has in the past relied heavily on RNN structures and encoder-decoder structures, but RNNs and their derivative networks are slow and, because each step depends on the preceding hidden state, cannot be parallelized. The present embodiment adopts a Transformer model, which completely abandons the recursive structure and relies on the attention mechanism to mine the relationship between input and output.
Further, the second feature sequence is used as the input of the Transformer model. For each element of the input second feature sequence, the Transformer model produces the probability values of the candidate characters, and the character with the highest probability value is determined as the character corresponding to that element. Finally, the characters output by the Transformer model are mapped into the corresponding lip language according to a preset mapping relation in the database. Illustratively, following the above, the 3D convolutional neural network model outputs a feature sequence of size 16 × 1 × 1 × 64. This feature sequence is taken as the input of the Transformer model; after encoding by the encoder and decoding by the decoder, an N-dimensional vector is output for each position, where N is the number of words in the word store, and a Softmax classifier is used to calculate the probability value of each candidate for each element of the feature sequence. The character with the highest probability value is then determined as the character corresponding to that element. Finally, the characters output by the Transformer model are mapped into the corresponding lip language information according to the preset mapping relation in the database. The preset mapping relation may be, for example: a mapping between the character index "10" and the lip language "less", and a mapping between the character index "15" and the lip language "strong". That is, when the Transformer model outputs "10", it can be converted into "less" according to the mapping relation set in the database.
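A decoding sketch consistent with this description is shown below; the index-to-character mapping is the illustrative one from the example above, not a real database.

    import torch

    def decode(decoder_logits, index_to_char):
        # decoder_logits: (seq_len, N), N being the number of words in the word store
        probs = torch.softmax(decoder_logits, dim=-1)   # Softmax probability for each candidate
        indices = probs.argmax(dim=-1)                  # character with the highest probability value
        return [index_to_char.get(int(i), "?") for i in indices]

    # decode(torch.randn(6, 3000), {10: "less", 15: "strong"})  # maps each position to a character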
Therefore, in this way, the attention mechanism in the Transformer model is used to acquire the global information of the lip region in one step, and longer sequence information can be recognized well. In addition, training the Transformer model uses fewer resources and yields high recognition accuracy, and the Transformer model can be computed in parallel to reduce training time. Using the Transformer network structure to decode the lip language video feature sequence keeps the whole framework simple and reliable, and its strong performance means fewer model parameters, shorter training time and high recognition accuracy.
Alternatively, the operation of generating the lip language information from the second feature sequence by using the conversion model is not limited to using a Transformer model; the second feature sequence may also be converted into the lip language information by using a Seq2Seq model.
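For completeness, a compact GRU-based Seq2Seq model could stand in for the conversion model as follows; the layer sizes shown are illustrative assumptions.

    import torch
    import torch.nn as nn

    class Seq2SeqConverter(nn.Module):
        def __init__(self, feat_dim=64, hidden=128, vocab_size=3000):
            super().__init__()
            self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
            self.tgt_embed = nn.Embedding(vocab_size, hidden)
            self.decoder = nn.GRU(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, feature_seq, tgt_tokens):
            _, state = self.encoder(feature_seq)            # encode the second feature sequence
            dec_out, _ = self.decoder(self.tgt_embed(tgt_tokens), state)
            return self.out(dec_out)                        # per-position character logits

    # logits = Seq2SeqConverter()(torch.randn(1, 16, 64), torch.zeros(1, 8, dtype=torch.long))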
Further, referring to fig. 1, according to a second aspect of the present embodiment, a storage medium 104 is provided. The storage medium 104 comprises a stored program, wherein the method of any of the above is performed by a processor when the program is run.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
Fig. 4 shows a deep learning based liveness detection device 400 according to the present embodiment, which device 400 corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 4, the apparatus 400 includes: an obtaining module 410, configured to obtain a video related to an object to be recognized, where the video includes a mouth area image of the object to be recognized, and the video is obtained by shooting the object to be recognized in a process of reading the first text information by the object to be recognized; the recognition module 420 is used for recognizing the mouth region image by using a recognition model based on deep learning training and generating lip language information; and a determination module 430, configured to determine whether the object to be identified is a living body according to the lip language information and the first text information.
Optionally, the recognition model includes a feature extraction model based on image feature extraction, a residual network model, and a conversion model based on natural language processing, and the recognition module 420 includes: the first generation submodule is used for generating a first feature sequence corresponding to the mouth region image by using the feature extraction model; the second generation submodule is used for performing feature extraction on the first feature sequence by using the residual network model to generate a second feature sequence; and the third generation submodule is used for generating the lip language information from the second feature sequence by using the conversion model.
Optionally, the decision module 430 comprises: the matching submodule is used for matching the lip language information with the first text information; and the judging submodule is used for judging whether the object to be identified is a living body or not according to the matching result.
Optionally, the matching sub-module comprises: an acquisition unit configured to acquire text code information corresponding to the first text information; and the first matching unit is used for matching the lip language information with the text coding information.
Optionally, the matching sub-module comprises: the conversion unit is used for converting the lip language information into corresponding second text information; and a second matching unit for matching the second text information with the first text information.
Optionally, the first generation submodule includes: and a first generation unit configured to generate a first feature sequence corresponding to the mouth region image using the 3D convolutional neural network model.
Optionally, the conversion submodule includes: and the second generating unit is used for generating the lip language information according to the second characteristic sequence by utilizing the converter model.
Optionally, the conversion submodule includes: and a third generating unit, configured to generate the lip language information according to the second feature sequence by using a Seq2Seq model.
Thus, according to the present embodiment, lip language corresponding to the mouth region image can be quickly recognized. And because the recognition model is based on deep learning training, the recognition model can well recognize longer sequence information and has high recognition rate. Thus, based on the recognition model, it is possible to quickly and accurately recognize the lip language corresponding to the mouth region image, thereby quickly determining whether the object to be recognized is a living body from the recognized lip language and the first text information. The whole lip language in-vivo detection method based on deep learning has the technical effects of strong generalization capability, simple and reliable flow and high identification accuracy. Further solves the technical problems of poor generalization capability and complex and difficult specific implementation of the existing in-vivo detection method.
Example 3
Fig. 5 shows a deep learning based liveness detection device 500 according to the present embodiment, which device 500 corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 5, the apparatus 500 includes: a processor 510; and a memory 520 coupled to processor 510 for providing processor 510 with instructions to process the following process steps: acquiring a video related to an object to be recognized, wherein the video contains a mouth area image of the object to be recognized, and the video is obtained by shooting the object to be recognized in the process of reading first text information by the object to be recognized; generating lip language information corresponding to the mouth region image by using a recognition model based on deep learning training; and judging whether the object to be identified is a living body or not according to the lip language information and the first text information.
Optionally, the recognizing model includes a feature extraction model based on image feature extraction, a residual network model and a conversion model based on natural language processing, and the operation of generating lip language information corresponding to the mouth region image using the recognizing model includes: generating a first feature sequence corresponding to the mouth region image by using the feature extraction model; extracting the features of the first feature sequence by using the residual error network model to generate a second feature sequence; and generating the lip language information according to the second characteristic sequence by utilizing the conversion model.
Optionally, the operation of determining whether the object to be recognized is a living body according to the lip language information and the first text information includes: matching the lip language information with the first text information; and judging whether the object to be identified is a living body according to the matching result.
Optionally, the operation of matching the lip language information with the first text information includes: acquiring text coding information corresponding to the first text information; and matching the lip language information with the text coding information.
Optionally, the operation of matching the lip language information with the first text information includes: converting the lip language information into corresponding second text information; and matching the second text information with the first text information.
Optionally, the operation of generating a first feature sequence corresponding to the mouth region image by using the feature extraction model includes: a first feature sequence corresponding to the mouth region image is generated using a 3D convolutional neural network model.
Optionally, the operation of generating the lip language information according to the second feature sequence by using the conversion model includes: and generating the lip language information according to the second characteristic sequence by using a converter model.
Optionally, the operation of converting the second feature sequence into lip language by using a conversion model includes: and generating the lip language information according to the second characteristic sequence by utilizing a Seq2Seq model.
Thus, according to the present embodiment, lip language corresponding to the mouth region image can be quickly recognized. And because the recognition model is based on deep learning training, the recognition model can well recognize longer sequence information and has high recognition rate. Thus, based on the recognition model, it is possible to quickly and accurately recognize the lip language corresponding to the mouth region image, thereby quickly determining whether the object to be recognized is a living body from the recognized lip language and the first text information. The whole lip language in-vivo detection method based on deep learning has the technical effects of strong generalization capability, simple and reliable flow and high identification accuracy. Further solves the technical problems of poor generalization capability and complex and difficult specific implementation of the existing in-vivo detection method.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A living body detection method based on deep learning is characterized by comprising the following steps:
acquiring a video related to an object to be recognized, wherein the video contains a mouth area image of the object to be recognized, and the video is obtained by shooting the object to be recognized in the process of reading first text information by the object to be recognized;
generating lip language information corresponding to the mouth region image by using a recognition model based on deep learning training; and
and judging whether the object to be identified is a living body or not according to the lip language information and the first text information.
2. The method according to claim 1, wherein the recognition model includes a feature extraction model based on image feature extraction, a residual network model, and a conversion model based on natural language processing, and the operation of generating the lip language information corresponding to the mouth region image using the recognition model includes:
generating a first feature sequence corresponding to the mouth region image by using the feature extraction model;
extracting the features of the first feature sequence by using the residual error network model to generate a second feature sequence; and
and generating the lip language information according to the second characteristic sequence by utilizing the conversion model.
3. The method according to claim 1, wherein the operation of determining whether the object to be recognized is a living body based on the lip language information and the first text information includes:
matching the lip language information with the first text information; and
and judging whether the object to be identified is a living body or not according to the matching result.
4. The method of claim 3, wherein matching the lip language information with the first text information comprises:
acquiring text coding information corresponding to the first text information; and
matching the lip language information with the text coding information.
5. The method of claim 3, wherein matching the lip language information with the first text information comprises:
converting the lip language information into corresponding second text information; and
and matching the second text information with the first text information.
6. The method according to claim 2, wherein the operation of generating a first sequence of features corresponding to the mouth region image using the feature extraction model comprises: and generating a first feature sequence corresponding to the mouth region image by using a 3D convolutional neural network model.
7. The method of claim 2, wherein the operation of generating the lip language information according to the second feature sequence using the transformation model comprises: and generating the lip language information according to the second characteristic sequence by using a converter model.
8. A storage medium comprising a stored program, wherein the method of any one of claims 1 to 7 is performed by a processor when the program is run.
9. A living body detecting device based on deep learning, characterized by comprising:
the device comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring a video related to an object to be recognized, the video comprises a mouth area image of the object to be recognized, and the video is obtained by shooting the object to be recognized in the process of reading first text information by the object to be recognized;
the recognition module is used for recognizing the mouth region image by using a recognition model based on deep learning training and generating lip language information; and
and the judging module is used for judging whether the object to be identified is a living body according to the lip language information and the first text information.
10. A living body detecting device based on deep learning, characterized by comprising:
a processor; and
a memory coupled to the processor for providing instructions to the processor for processing the following processing steps:
acquiring a video related to an object to be recognized, wherein the video contains a mouth area image of the object to be recognized, and the video is obtained by shooting the object to be recognized in the process of reading first text information by the object to be recognized;
recognizing the mouth region image by using a recognition model based on deep learning training, and generating lip language information; and
and judging whether the object to be identified is a living body or not according to the lip language information and the first text information.
CN201910668140.9A 2019-07-23 2019-07-23 In-vivo detection method and device based on deep learning and storage medium Pending CN112287723A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910668140.9A CN112287723A (en) 2019-07-23 2019-07-23 In-vivo detection method and device based on deep learning and storage medium

Publications (1)

Publication Number Publication Date
CN112287723A true CN112287723A (en) 2021-01-29

Family

ID=74418756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910668140.9A Pending CN112287723A (en) 2019-07-23 2019-07-23 In-vivo detection method and device based on deep learning and storage medium

Country Status (1)

Country Link
CN (1) CN112287723A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104966086A (en) * 2014-11-14 2015-10-07 深圳市腾讯计算机系统有限公司 Living body identification method and apparatus
CN106529379A (en) * 2015-09-15 2017-03-22 阿里巴巴集团控股有限公司 Method and device for recognizing living body
CN106778496A (en) * 2016-11-22 2017-05-31 重庆中科云丛科技有限公司 Biopsy method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Triantafyllos Afouras et al.: "Deep Lip Reading: a comparison of models and an online application", arXiv.org, pages 1-8 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313056A (en) * 2021-06-16 2021-08-27 中国科学技术大学 Compact 3D convolution-based lip language identification method, system, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination