WO2021218695A1 - Monocular camera-based liveness detection method, device, and readable storage medium - Google Patents


Info

Publication number
WO2021218695A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
living body
body detection
feature
neural network
Prior art date
Application number
PCT/CN2021/088272
Other languages
French (fr)
Chinese (zh)
Inventor
Guo Hongwei (郭宏伟)
Li Hui (李辉)
Ma Jieyan (马杰延)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021218695A1 publication Critical patent/WO2021218695A1/en

Classifications

    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (G Physics; G06 Computing; G06F Electric digital data processing; G06F18/00 Pattern recognition; G06F18/20 Analysing; G06F18/24 Classification techniques)
    • G06N3/04 Architecture, e.g. interconnection topology (G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks)
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the field of data processing, and in particular to a living body detection method, device and readable storage medium based on a single-camera RGB image.
  • the embodiment of the present application provides a living body detection method based on a monocular camera, which can ensure the accuracy of living body detection without adding additional costs.
  • In a first aspect, a living body detection method is provided. The method is applied to an electronic device and may include: acquiring a first image, where the first image is an RGB image and includes a face image of a target object; obtaining a first depth image according to the first image and a depth image generation network; determining a living body detection result according to the first image, the first depth image, and a detection network, where the living body detection result is used to indicate whether the target object is a living body; and performing an action according to the living body detection result.
  • The technical solution provided by the above first aspect generates a depth image from an RGB image, and then performs living body detection based on the RGB image and the depth image.
  • This method can effectively defend against attacks in living body detection and improves the accuracy of living body detection, without adding extra equipment to obtain the depth image, thereby effectively reducing cost.
  • The depth image generation network includes a first neural network and a second neural network. Obtaining the first depth image according to the first image and the depth image generation network specifically includes: extracting the coarse-grained feature of the first image through the first neural network; extracting the fine-grained feature of the first image through the second neural network; and generating the first depth image according to the coarse-grained feature and the fine-grained feature.
  • Generating the first depth image according to the coarse-grained feature and the fine-grained feature includes: acquiring a fused feature, where the fused feature is obtained by fusing the coarse-grained feature and the fine-grained feature through a fusion algorithm; and generating the first depth image according to the fused feature.
  • the first neural network and the second neural network are lightweight convolutional neural networks.
  • The detection network includes a third neural network and a fourth neural network. Determining the living body detection result according to the first image, the first depth image, and the detection network specifically includes: extracting the feature of the first image through the third neural network; extracting the feature of the first depth image through the fourth neural network; obtaining a feature map according to the feature of the first image and the feature of the first depth image; and determining the living body detection result according to the feature map.
  • Determining the living body detection result according to the feature map specifically includes: performing global pooling on the feature map to obtain a global feature; and determining the living body detection result according to the global feature.
  • the method further includes: acquiring a second image; acquiring the face image in the second image according to a face detection algorithm; and determining the first image according to the face image.
  • the first image is a face image that has been aligned and preprocessed.
  • performing an action according to the result of the living body detection comprises: when the result of the living body detection indicates that the target object is a living body, performing portrait tracking on the target object.
  • Performing an action based on the result of the living body detection includes: when the result of the living body detection indicates that the target object is a living body, determining whether the target object is a child; and when the target object is a child, switching to a child mode.
  • The depth image generation network and the detection network belong to a living body detection neural network, and the living body detection neural network is obtained by joint training based on local features and global features.
  • An electronic device is provided, which has the functions of implementing the method described in any one of the possible implementations of the first aspect above.
  • the function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above-mentioned functions.
  • An electronic device is provided, including one or more processors, a memory, a camera, and one or more computer programs, where the one or more computer programs are stored in the memory and include instructions which, when executed by the electronic device, cause the electronic device to execute any one of the possible implementations of the first aspect above.
  • a computer-readable storage medium including computer instructions, which when executed on an electronic device, cause the electronic device to execute any of the possible implementation manners of the first aspect described above.
  • A chip is provided, which is coupled with a memory in an electronic device, so that when running, the chip invokes program instructions stored in the memory to cause the electronic device to execute any one of the possible implementations of the first aspect above.
  • FIG. 1a is an application scenario of a living body detection method provided by an embodiment of this application.
  • FIG. 1b is an application scenario of another living body detection method provided by an embodiment of the application.
  • FIG. 1c is an application scenario of another living body detection method provided by an embodiment of the application.
  • FIG. 1d is a schematic structural diagram of an electronic device provided by an embodiment of this application.
  • FIG. 1e is a schematic diagram of a convolutional neural network provided by an embodiment of this application.
  • FIG. 2 is a schematic flowchart of a method for training a living body detection model provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of generating a depth image based on coarse-to-fine according to an embodiment of the application.
  • FIG. 4 is a schematic diagram of a local feature training provided by an embodiment of this application.
  • FIG. 5 is a flowchart of applying a living body detection model provided by an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of this application.
  • FIG. 7 is a schematic structural diagram of another electronic device provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of another electronic device provided by an embodiment of the application.
  • Words such as "exemplary" or "for example" are used as examples, instances, or illustrations. Any embodiment or design solution described as "exemplary" or "for example" in the embodiments of the present application should not be construed as more preferable or advantageous than other embodiments or design solutions. Rather, words such as "exemplary" or "for example" are used to present related concepts in a specific manner.
  • Each pixel of an RGB image has 3 values representing color; that is, a wide variety of colors can be obtained by varying and superimposing the three components red, green, and blue.
  • A depth image, also called a range image, refers to an image in which the distance from an image collector, such as a camera, to each point in the scene is used as the pixel value of that pixel.
  • the depth image directly reflects the geometry of the visible surface of the subject.
  • Monocular camera generally refers to a camera.
  • the monocular camera can only take one type of image at the same time.
  • Binocular cameras generally refer to two cameras, which can acquire two different types of images at the same time.
  • a binocular camera can simultaneously acquire RGB images and depth images.
  • the binocular camera may be a 3D camera, including a color camera and a depth sensor.
  • Deep neural network is a framework of deep learning that can provide modeling for complex nonlinear systems. In other words, deep neural networks can systematically classify data.
  • A convolutional neural network (CNN) is composed of one or more convolutional layers and a fully connected layer at the top, and also includes associated weights and pooling layers. A convolutional neural network is a bottom-up structure that uses a multi-layer network and abstracts layer by layer: on the basis of the layer below, each layer abstracts higher-level feature representations that handle various invariances.
  • a convolutional neural network may include a convolutional layer, a pooling layer, and a fully connected layer. In some cases, the convolutional neural network can also be connected to a loss layer.
  • The convolutional layer is a set of parallel feature maps, which are composed by sliding different convolution kernels over the input image and running certain operations. At each sliding position, an element-wise product-and-sum operation is run between the convolution kernel and the input image to project the information onto an element of the feature map. For example, for an RGB image, the convolutional layer can convert the image into a feature map.
  • the pooling layer is a non-linear form of downsampling, which is used to pool the feature map. Pooling can have a variety of different forms of non-linear pooling functions, such as max pooling and average pooling. Maximum pooling is to divide the input image into several rectangular areas, and output the maximum value for each sub-area. For example, the pooling layer can pool the feature map to reduce the number of features in the feature map.
  • The fully connected layer is used for high-level reasoning in the neural network. Consider, for example, a 32*32 image of a handwritten "2". The human eye immediately recognizes the handwritten "2" as the number 2, but an electronic device must feed all pixels of the picture into the neural network for processing before it can be recognized. If all pixels were input directly into fully connected layers for processing, the amount of data would be extremely large; for the 32*32 image above, 1.6 billion parameters may be obtained. In this case, the image can be preprocessed first, and only then input to the fully connected layer for recognition. For example, the fully connected layer can map the features processed by the convolutional layer and the pooling layer into a one-dimensional feature vector.
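  • As an illustration of this flow, below is a minimal sketch (assumed layer sizes, not the network of this application) of how a convolutional layer and a pooling layer shrink a 32*32 image before the fully connected layer maps the flattened features to class scores:

```python
import torch
import torch.nn as nn

# Minimal illustrative CNN: all layer sizes are assumptions for a 32x32
# grayscale digit image, not the network described in this application.
class TinyDigitNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # 1x32x32 -> 8x32x32
        self.pool = nn.MaxPool2d(2)                            # 8x32x32 -> 8x16x16
        self.fc = nn.Linear(8 * 16 * 16, 10)                   # flatten -> 10 digit classes

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        x = x.flatten(1)        # map features into a one-dimensional feature vector
        return self.fc(x)

logits = TinyDigitNet()(torch.randn(1, 1, 32, 32))  # shape (1, 10)
```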
  • the fully connected layer can also be connected to the loss function layer.
  • the loss function layer can be used to determine the difference between the predicted result and the real result during the neural network training process.
  • Various loss functions are suitable for different types of tasks.
  • the Softmax function can map the output of multiple neurons in the neural network to the (0,1) interval for classification and calculation.
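  • For example, a generic softmax implementation (a sketch, not code from this application) looks as follows:

```python
import numpy as np

# Softmax maps raw neuron outputs to the (0, 1) interval so that they can
# be read as class probabilities summing to 1.
def softmax(z):
    z = z - np.max(z)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 0.5, 1.0])))  # approx. [0.63, 0.14, 0.23]
```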
  • the process of training a neural network is a process of continuously reducing the loss by adjusting the parameters in the neural network.
  • Liveness detection here refers to determining, from a static picture, whether the subject is a real person or a photo, without requiring the user to perform actions such as shaking the head or blinking for recognition.
  • Figure 1a shows a scene of living body detection.
  • When the electronic device 102 wants to verify the user identity, it needs to determine whether the current operator is the real user 104 or a face photo 103 of the user 104, so as to prevent others from holding the photo 103 of the user 104 to obtain the authority of the user 104 and thereby harm the interests of the user 104. For example, when another person obtains the electronic device 102 of the user 104 together with a photo of the user 104, it should be impossible to unlock the screen of the electronic device 102 or complete functions such as payment by using the photo of the user 104.
  • the existing living body detection is mainly divided into two schemes: single mode and multi-mode.
  • single mode refers to the use of images acquired by the same imaging device for living body detection.
  • The single-modal solution inputs an RGB image into a neural network to extract features for classification, compares the features with previously saved user facial features, and finally determines whether the subject is a living body. Its main characteristics are simplicity, speed, and low training and deployment costs.
  • Multi-modality refers to the use of images acquired by different imaging devices for face matching.
  • a multi-modal solution fuses RGB images and corresponding multi-modal data, such as infrared images, depth images, etc., and uses neural networks to extract depth features for live detection. The depth image can be obtained through a binocular camera or other specific equipment.
  • the advantage of the multi-modal scheme is that it has high accuracy and is not easy to be attacked.
  • Both of these living body detection approaches have certain shortcomings. Due to the lack of other types of data, a single-modal living body detection scheme can easily recognize photos, masks, etc. as the user himself, so its recognition accuracy is not high. Other single-modal living body detection solutions improve accuracy by, in addition to using static RGB images, verifying liveness through user actions such as blinking and turning the head; however, this requires the user to make a specified action, and the user experience is poor. Although the multi-modal living body detection scheme has high detection accuracy, multi-modal data is not easy to obtain and multiple types of cameras are required, resulting in high cost. At the same time, training neural networks on multi-modal data is more complicated.
  • this application provides a living body detection method based on a single-camera RGB image. Specifically, after the electronic device obtains the RGB image, it can determine whether the portrait in the RGB image is a living body through the living body detection neural network. Among them, the living body detection neural network can generate a depth image based on the RGB image, and then perform feature fusion between the RGB image and the depth image, and determine whether the portrait in the RGB image is alive according to the fused features.
  • the embodiments of the present application provide a living body detection method, which does not require additional equipment to obtain a multi-modal image, and at the same time, can ensure the accuracy of living body detection.
  • Fig. 1b shows an application scenario of an embodiment of the present application, which is mainly applied to portrait tracking of video calls.
  • the electronic device 101 has a camera 105.
  • the camera 105 can capture RGB images.
  • During a video call, the electronic device 101 can determine, through the single-camera RGB-image liveness detection technology provided in this embodiment of the application, whether the user is actually in front of the camera 105, and adjust the captured picture so that the user's image is centered on the screen.
  • Figure 1c shows another application scenario of an embodiment of the present application, which is mainly applied in a large-screen child mode.
  • the electronic device 101 has a camera 105.
  • the electronic device 101 has a child mode, and the camera 105 can determine whether the person currently watching the screen is a child, so as to switch to the child mode and display the program watched by the child.
  • an image 106 of a child may be hung on the wall opposite to the electronic device 101.
  • the electronic device 101 may misrecognize the image 106 as the child himself and switch to the child mode, which seriously affects the user's experience.
  • the living body detection technology provided by the embodiments of the present application can effectively detect that the image 106 is not a real person through a camera, thereby avoiding false triggering of the child mode.
  • Fig. 1b and Fig. 1c are only exemplary; the embodiments of the present application provide a living body detection method using a single camera, and any scenario in which living body detection can be used, such as payment, access control, etc., can apply the technical solutions provided in this application.
  • The electronic device in the embodiments of the present application may be a portable electronic device that also contains other functions such as personal digital assistant and/or music player functions, such as a mobile phone, a tablet computer, or a wearable electronic device with a wireless communication function (such as a smart watch).
  • Portable electronic devices include, but are not limited to, portable electronic devices running various operating systems, smart screens with cameras, TVs with cameras, etc.
  • the aforementioned portable electronic device may also be other portable electronic devices, such as a laptop computer with a touch-sensitive surface (such as a touch panel). It should also be understood that in some other embodiments of the present application, the above-mentioned electronic device may not be a portable electronic device, but a desktop computer with a touch-sensitive surface (such as a touch panel).
  • Figure 1d exemplarily shows a schematic structural diagram of an electronic device.
  • the electronic device is a mobile phone for illustration.
  • The illustrated electronic device is only an example; the electronic device may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration.
  • the various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application specific integrated circuits.
  • The mobile phone may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, etc.
  • The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the processor 110 may include one or more processing units.
  • The processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller can be the nerve center and command center of the mobile phone. The controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching instructions and executing instructions.
  • a memory may also be provided in the processor 110 to store instructions and data.
  • the memory in the processor 110 is a cache memory.
  • the memory can store instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to use the instruction or data again, it can be called directly from the memory, thereby avoiding repeated access, reducing the waiting time of the processor 110, and improving the efficiency of the system.
  • the processor 110 may be configured to execute the solution for authenticating user information in the embodiment of the present application.
  • The processor 110 can also execute the processing schemes executed by the server mentioned in the following content, such as determining the authentication security value corresponding to the operating device, for example, calculating the total authentication security value based on the M authentication security values, and so on.
  • When the processor 110 integrates different devices, such as an integrated CPU and GPU, the CPU and the GPU can cooperate to execute the method provided in the embodiment of the present application; for example, part of the algorithm is executed by the CPU and the other part by the GPU, to obtain faster processing efficiency.
  • the processor 110 may include one or more interfaces.
  • The interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the wireless communication function of the mobile phone can be realized by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, and the baseband processor.
  • In the embodiments of the present application, the wireless communication function of the mobile phone enables communication between electronic devices, as well as between the electronic device and a server.
  • the antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in the mobile phone can be used to cover one or more communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization.
  • Antenna 1 can be multiplexed as a diversity antenna of a wireless local area network.
  • the antenna can be used in combination with a tuning switch.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G, etc., which are applied to mobile phones.
  • the mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like.
  • The mobile communication module 150 can receive electromagnetic waves via the antenna 1, perform processing such as filtering and amplifying on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor, and convert it into electromagnetic waves for radiation via the antenna 1.
  • at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110.
  • at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be provided in the same device.
  • The wireless communication module 160 can provide wireless communication solutions applied to the mobile phone, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), etc.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110.
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, perform frequency modulation, amplify it, and convert it into electromagnetic waves through the antenna 2 and radiate it out.
  • the antenna 1 of the mobile phone is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the mobile phone can communicate with the network and other devices through wireless communication technology.
  • The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc.
  • The GNSS may include the global positioning system (GPS), among others.
  • the mobile phone realizes the display function through GPU, display screen 194, and application processor.
  • the GPU is an image processing microprocessor, which is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, and the like.
  • the display screen 194 includes a display panel.
  • The display panel may use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the mobile phone can realize the shooting function through ISP, camera 193, video codec, GPU, display 194 and application processor.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and is projected to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the mobile phone may include one or more cameras 193.
  • The camera 193 may be used to collect the facial information of the user.
  • the external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, so as to expand the storage capacity of the mobile phone.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, save music, video and other files in an external memory card.
  • the internal memory 121 may be used to store computer executable program code, and the executable program code includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store the operating system, at least one application program (such as sound playback function, image playback function, etc.) required by at least one function.
  • the data storage area can store data (such as audio data, phone book, etc.) created during the use of the mobile phone.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the mobile phone by running instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • the audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 170 can also be used to encode and decode audio signals.
  • the audio module 170 may be provided in the processor 110, or part of the functional modules of the audio module 170 may be provided in the processor 110.
  • The microphone 170C, also called a "mike" or "mic", is used to convert sound signals into electrical signals.
  • The user can make a sound with the mouth close to the microphone 170C, inputting the sound signal into the microphone 170C.
  • the mobile phone can be equipped with at least one microphone 170C.
  • the mobile phone may be equipped with two microphones 170C, which can realize noise reduction function in addition to collecting sound signals.
  • the mobile phone can also be equipped with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions.
  • The microphone 170C may be used to collect the user's voiceprint information.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • The mobile phone can use the collected fingerprint characteristics to implement fingerprint unlocking, access the application lock, take photos with the fingerprint, answer calls with the fingerprint, and so on.
  • a fingerprint sensor can be arranged on the front of the mobile phone (below the display 194), or on the back of the mobile phone (below the rear camera).
  • the fingerprint recognition function can also be realized by configuring the fingerprint sensor in the touch screen, that is, the fingerprint sensor can be integrated with the touch screen to realize the fingerprint recognition function of the mobile phone.
  • the fingerprint sensor may be configured in the touch screen, may be a part of the touch screen, or may be configured in the touch screen in other ways.
  • the fingerprint sensor can also be implemented as a full panel fingerprint sensor.
  • the touch screen can be regarded as a panel that can collect fingerprints at any position.
  • The fingerprint sensor may process the collected fingerprint (for example, verify whether the fingerprint matches) and send the result to the processor 110, and the processor 110 performs corresponding processing according to the fingerprint processing result.
  • the fingerprint sensor may also send the collected fingerprint to the processor 110, so that the processor 110 can process the fingerprint (for example, fingerprint verification, etc.).
  • The fingerprint sensor 180H may be used to collect the fingerprint information of the user.
  • the mobile phone may also include a Bluetooth device, a positioning device, a flashlight, a miniature projection device, a near field communication (NFC) device, etc., which will not be repeated here.
  • The living body detection neural network in the embodiment of the present application may be trained in advance. It is understandable that the living body detection neural network may include one or more convolutional neural networks, and different convolutional neural networks can implement different functions. Exemplarily, the living body detection neural network may include a depth image generation network and a detection network, and each network can include one or more types of convolutional neural networks.
  • the living body detection method provided by the embodiments of the present application performs living body detection by generating images of other modalities from RGB images. For ease of description, the embodiment of the present application takes a depth image as an example for description. It is understandable that the embodiment of the present application may also generate other types of images, such as infrared images, for living body detection. The embodiment of the application does not limit this.
  • FIG. 2 exemplarily shows a schematic flow chart of a training method for a living body detection neural network provided by an embodiment of the present application. As shown in Figure 2, the method includes:
  • S202 Acquire a first image and a first depth image corresponding to the first image.
  • When training the living body detection model, the training data needs to be obtained first.
  • the acquired training data may be the first image and the first depth image corresponding to the first image.
  • the first image is an RGB image.
  • the training data may include multiple first images and first depth images corresponding to the multiple first images.
  • the type of the first image may not be limited to an RGB image, and the image corresponding to the first image may also be another type of image, such as an infrared image.
  • As long as the first image and the image corresponding to the first image are images of different types, the requirements for the training data in the embodiment of the present application are met.
  • the following takes the first image as an RGB image and the image corresponding to the first image as the first depth image as an example.
  • the first image and the first depth image corresponding to the first image may be pictures taken of the same object.
  • a binocular camera is used to photograph user A, and the first image and the first depth image corresponding to the first image are generated at the same time. It is understandable that the first image and the first depth image can be regarded as being taken of the user A at the same angle.
  • The first depth image may also be generated using the principle of binocular stereo vision. For example, two cameras can obtain two images of the same scene from different angles at the same time, or a single camera can obtain two images of the scene from different angles at different times; then, based on the principle of parallax, the three-dimensional geometric information of the object can be restored, thereby obtaining a depth image.
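  • As a hedged illustration of the parallax principle (a standard stereo relation, not a formula stated in this application): with focal length f in pixels and baseline B between the two viewpoints, depth follows from disparity d as Z = f × B / d. The numbers below are assumptions:

```python
import numpy as np

# Sketch of depth-from-parallax: Z = f * B / d. Focal length and baseline
# values are illustrative assumptions, not parameters from this application.
def disparity_to_depth(disparity, focal_px=1000.0, baseline_m=0.06):
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0               # zero disparity: unmatched / infinitely far
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth                        # depth in meters

d = np.array([[40.0, 0.0], [20.0, 10.0]])
print(disparity_to_depth(d))            # 40 px disparity -> 1.5 m, 20 px -> 3 m
```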
  • The first image and the first depth image corresponding to the first image may come from a portrait database or be collected.
  • the embodiment of the present application does not limit the acquisition method of the first image and the first depth image.
  • the first image and the first depth image are both shots of the same object, for example, the same human face, and the difference is only the type of the image.
  • S204 Generate a second depth image according to the first image. The second depth image can be generated by the living body detection neural network. It is understandable that the first depth image is the original image taken by the camera, while the second depth image is generated algorithmically by extracting the features of the first image.
  • the embodiment of the present application may use the depth image generation network in the living body detection neural network to generate the second depth image.
  • the depth image generation network may include two independent convolutional neural networks, and the second depth image is generated through a coarse-to-fine (CTF) method.
  • CTF coarse-to-fine
  • one convolutional neural network can be used to extract the coarse-grained features of the first image
  • the other convolutional neural network can be used to extract the fine-grained features of the first image.
  • The coarse-grained features and fine-grained features of the first image are then fused through a fusion algorithm to generate the second depth image.
  • these two independent convolutional neural networks can be lightweight convolutional neural networks.
  • the advantage of lightweight convolutional neural networks is that they can be used on mobile devices while reducing network parameters without losing network performance.
  • The lightweight convolutional neural network may be, for example, a FeatherNet; this kind of network can guarantee both operation speed and accuracy.
  • Fig. 3 shows a method for generating a depth image based on coarse-to-fine provided by an embodiment of the present application. As shown in Figure 3, the method includes:
  • S302 Acquire the coarse-grained feature of the first image through the first neural network.
  • The first neural network may be a lightweight convolutional neural network, through which the coarse-grained feature of the first image can be obtained.
  • S304 Acquire the fine-grained feature of the first image through the second neural network.
  • The second neural network may be a lightweight convolutional neural network, through which the fine-grained feature of the first image can be obtained.
  • Coarse granularity and fine granularity are relative concepts. For example, the contour feature of the face can be defined as a coarse-grained feature, while local features of the face, such as eyebrow features, can be defined as fine-grained features.
  • the first neural network may be a deep neural network or a convolutional neural network.
  • the second neural network can be a deep neural network or a convolutional neural network.
  • step S302 may be executed after step S304, may also be executed before step S304, or may be executed simultaneously with step S304.
  • the two features can be merged to generate a second depth image.
  • the embodiment of the present application does not limit the specific algorithm of fusion.
  • the second depth image is generated by fusing the coarse granularity feature and the fine granularity feature, which helps to improve the robustness.
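  • A minimal sketch of this coarse-to-fine structure (assumed layer choices; not the actual lightweight networks of this application) could look as follows, with one branch for coarse-grained features, one for fine-grained features, and a fusion step that regresses a one-channel depth map:

```python
import torch
import torch.nn as nn

# Hedged sketch of the coarse-to-fine idea in FIG. 3; all layers are assumptions.
class CoarseToFineDepth(nn.Module):
    def __init__(self):
        super().__init__()
        # coarse branch: large receptive field via stride, then upsample back
        self.coarse = nn.Sequential(
            nn.Conv2d(3, 16, 7, stride=4, padding=3), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )
        # fine branch: small kernels at full resolution for local detail
        self.fine = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, 1, 3, padding=1)   # fused features -> depth map

    def forward(self, rgb):
        fused = torch.cat([self.coarse(rgb), self.fine(rgb)], dim=1)  # feature fusion
        return self.head(fused)

depth = CoarseToFineDepth()(torch.randn(1, 3, 112, 112))  # (1, 1, 112, 112)
```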
  • S206 Acquire a first loss value according to the first depth image and the second depth image.
  • the difference between the second depth image and the first depth image can be compared through an algorithm to determine the first loss value.
  • a scale invariant algorithm can be used to compare the difference between the second depth image and the first depth image to determine the scale invariant loss.
  • the first loss value is used to indicate the difference between the second depth image and the first depth image.
  • The smaller the first loss value, the smaller the difference between the second depth image and the first depth image.
  • As training proceeds, the living body detection neural network generates the second depth image more and more accurately, so that the second depth image gets closer and closer to the actually captured first depth image.
  • the first loss value can be made smaller and smaller, so that the second depth map generated by the living body detection neural network is getting closer and closer to the first depth map that is actually shot. It is understandable that this step is performed when training the living body detection neural network.
  • When the electronic device uses the living body detection neural network to detect a living body, there is no need to perform loss calculation on the generated depth image.
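  • One plausible formulation of such a scale-invariant comparison (in the style of Eigen et al.; the application does not spell out the exact formula, so this is an assumption) is the scale-invariant log loss:

```python
import torch

# Scale-invariant log loss between the generated (second) and captured
# (first) depth images; lam is an assumed weighting term.
def scale_invariant_loss(pred, target, lam=0.5, eps=1e-6):
    d = torch.log(pred + eps) - torch.log(target + eps)  # per-pixel log difference
    n = d.numel()
    return (d ** 2).sum() / n - lam * (d.sum() ** 2) / (n ** 2)

pred = torch.rand(1, 1, 112, 112) + 0.1     # stand-in second depth image
target = torch.rand(1, 1, 112, 112) + 0.1   # stand-in first depth image
print(scale_invariant_loss(pred, target))   # first loss value
```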
  • S208 Acquire a feature map according to the first image and the second depth image. The features of the first image and the second depth image can be extracted separately through the living body detection neural network, and then the features extracted from the two images are fused to form a feature map.
  • the features of the first image and the second depth image may be extracted by two independent feature extraction networks in the detection network.
  • the two feature extraction networks may be two identical and independent backbone networks.
  • the backbone network is a model of deep learning, which is used to extract the features of the image and give the representation of different sizes and different abstract levels of the image.
  • these two feature extraction networks can also be lightweight convolutional neural networks.
  • the feature map not only includes feature values, but also includes relative position information. For example, for a face image, the eyes, nose, and mouth are all arranged from top to bottom, and the corresponding feature values extracted are also arranged in this order.
  • the first image and the second depth image can be separately input into at least one convolutional layer, and the features are extracted and then fused, and finally a feature map is formed.
  • A tensor is a data container used to store data. For example, an RGB image can be processed into a 3D tensor, in which each pixel position holds three elements representing the red, green, and blue values of that pixel.
  • Feature fusion can be achieved through a variety of fusion algorithms.
  • the feature fusion algorithm may include an algorithm based on Bayesian decision theory, an algorithm based on sparse representation theory, or an algorithm based on deep learning theory.
  • the embodiment of the present application does not limit the specific algorithm implemented by feature fusion.
  • S210 Acquire a second loss value based on the global feature.
  • Global feature refers to the overall attributes of an image.
  • the global features can be color features, texture features, shape features, and so on. Global features are easily disturbed by the external environment.
  • After the fused feature map is obtained, it can be input into the detection network for processing. For example, global pooling is performed on the feature map through the detection network to obtain the global feature, and the global feature is then further input to the fully connected layer for living body detection.
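  • A hedged sketch of this detection stage (two independent backbones, channel-concatenation fusion, global pooling, and a living/non-living classifier; layer sizes are assumptions):

```python
import torch
import torch.nn as nn

# Sketch of S208-S210: extract features from the RGB and depth images with
# two independent backbones, fuse them into a feature map, then globally
# pool and classify. Not the exact architecture of this application.
class DetectionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_backbone = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.depth_backbone = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU())
        self.gap = nn.AdaptiveAvgPool2d(1)   # global pooling -> global feature
        self.fc = nn.Linear(64, 2)           # logits for [living, non-living]

    def forward(self, rgb, depth):
        fmap = torch.cat([self.rgb_backbone(rgb), self.depth_backbone(depth)], dim=1)
        global_feat = self.gap(fmap).flatten(1)
        return self.fc(global_feat), fmap    # logits plus feature map for local training

logits, fmap = DetectionNet()(torch.randn(1, 3, 112, 112), torch.randn(1, 1, 112, 112))
```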
  • the loss function layer can be connected to the detection network, and the probability vector can be obtained through the loss function layer.
  • the second loss value is determined according to the probability vector and the label of the first image.
  • The loss function may be a softmax function. Since the input first image and first depth image can be labeled in advance for classification, the difference between the probability output by the softmax function and the label can be compared to obtain the loss.
  • the process of living body detection model training is to find the optimal model parameters so as to minimize the loss.
  • For example, for a three-class task, the label vectors can be cat [1,0,0], duck [0,1,0], and chicken [0,0,1].
  • Image A is one of the training images, and its label vector is cat[1,0,0].
  • the probability vector output by the softmax function is [0.65, 0.05, 0.3].
  • By comparing the probability vector with the label vector, a loss value can be obtained.
  • the loss can be -log(0.65).
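  • The numbers check out directly: with a one-hot label, cross-entropy reduces to -log of the probability assigned to the true class.

```python
import numpy as np

label = np.array([1, 0, 0])            # cat
prob = np.array([0.65, 0.05, 0.3])     # softmax output
loss = -np.sum(label * np.log(prob))
print(loss, -np.log(0.65))             # both approx. 0.431
```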
  • the labels of the images used to train the living body detection neural network are living body and non-living body.
  • the label vector of the living body is [1,0]
  • the label vector of the non-living body is [0,1].
  • the probability vector output by the loss function layer is [0.7, 0.3].
  • the second loss value can be determined by comparing the difference between the probability vector and the label vector through an algorithm. For example, the second loss value may be 0.3.
  • the second loss value may also be called a global learning loss (global learning loss), which is used to compare the difference between the global feature and the real value.
  • global learning loss global learning loss
  • the detection network in the living body detection neural network can remove the loss function layer after the training is completed, and only needs to output the judgment of whether it is a living body.
  • S212 Acquire a third loss value based on local features. A local feature refers to a feature extracted from a local area of the image.
  • the correlation between local features is small.
  • the local features can be eye, nose, and mouth features.
  • the local features can reflect the nuances of the image, and it is not easy to be disturbed by the external environment.
  • the processed local feature can be compared with the real value or the real label to determine the third loss value.
  • The fused features generated in step S208 are subjected to block-wise reinforced learning; that is, after local features are extracted, local feature training is performed to obtain better living body detection performance.
  • FIG. 4 shows a schematic diagram of a local feature training provided by an embodiment of the present application.
  • the feature map can be partially divided.
  • the feature map can be divided into a first part 401, a second part 402, and a third part 403.
  • the first part 401, the second part 402 and the third part 403 are pooled, convolved, and then input to the fully connected layer, and finally the third loss is determined by the loss function.
  • the embodiment of the present application does not limit the specific determination methods of pooling, convolution, full connection, and loss.
  • The convolution process may adopt a 1×1 convolution.
  • the output probability vector is used to characterize the probability of whether the feature is a living body. Then, the probability vector is compared with the label vector to obtain the third loss value.
  • the first part 401 may be an eye feature
  • the second part 402 may be a nose feature
  • the third part 403 may be a mouth feature.
  • These three parts are input into the loss function for processing, and the probabilities of the living body are output respectively.
  • The labels are living body and non-living body.
  • the probability vector output by the eye feature through the loss function layer is [0.5, 0.5]
  • the probability vector related to the nose feature is [0.6, 0.4]
  • the probability vector related to the mouth feature is [0.7, 0.3]. Since there are only living and non-living labels, if the label of the first image is a living label [1,0], it can be considered that the label vector corresponding to the eye, nose, and mouth features is also [1,0].
  • According to the probability vectors and the label vector, the third loss values obtained are 0.5, 0.4, and 0.3.
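  • A hedged sketch of this block-wise local training (splitting the feature map top-to-bottom into three parts 401/402/403 and giving each its own pooling, 1×1 convolution, fully connected layer, and loss; all sizes and the choice of cross-entropy are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of FIG. 4: per-part heads over a fused feature map. The worked
# example above reports per-part losses (e.g. 0.5, 0.4, 0.3); here they
# are summed into a single third loss value for training.
class LocalHeads(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),           # pool each part
                nn.Conv2d(channels, channels, 1),  # 1x1 convolution
                nn.Flatten(),
                nn.Linear(channels, 2),            # [living, non-living] per part
            )
            for _ in range(3)
        )

    def forward(self, fmap, label):
        parts = torch.chunk(fmap, 3, dim=2)        # split along height: eyes/nose/mouth
        return sum(F.cross_entropy(head(p), label)
                   for head, p in zip(self.heads, parts))

fmap = torch.randn(1, 64, 48, 48)
label = torch.tensor([0])                          # 0 = living body
print(LocalHeads()(fmap, label))                   # third loss value (summed)
```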
  • The purpose of local feature learning is to train the living body detection neural network, including the depth image generation network and the detection network.
  • some optimization algorithms can be used to train the living body detection model, that is, iterative learning, to minimize the loss value as much as possible.
  • the Stochastic Gradient Descent (SGD) method can be used to iteratively adjust the parameters in the living body detection model, so that the loss value calculated each time becomes less and less. Specifically, by adjusting the weight of the convolution kernel in the convolution layer, the loss value becomes smaller and smaller.
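  • Putting the pieces together, a hedged sketch of the joint SGD training loop (reusing CoarseToFineDepth, DetectionNet, LocalHeads, and scale_invariant_loss from the sketches above; equal loss weights and the synthetic batch are assumptions):

```python
import torch
import torch.nn.functional as F

gen_net, det_net, local_heads = CoarseToFineDepth(), DetectionNet(), LocalHeads()
params = (list(gen_net.parameters()) + list(det_net.parameters())
          + list(local_heads.parameters()))
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9)  # stochastic gradient descent

rgb = torch.randn(4, 3, 112, 112)              # stand-in first images
true_depth = torch.rand(4, 1, 112, 112) + 0.1  # stand-in first depth images
label = torch.randint(0, 2, (4,))              # 0 = living, 1 = non-living

for step in range(10):                         # iterative learning
    pred_depth = F.softplus(gen_net(rgb))      # keep generated depth positive
    logits, fmap = det_net(rgb, pred_depth)
    loss = (scale_invariant_loss(pred_depth, true_depth)  # first loss value
            + F.cross_entropy(logits, label)              # second loss value
            + local_heads(fmap, label))                   # third loss value
    optimizer.zero_grad()
    loss.backward()                            # adjusts convolution kernel weights
    optimizer.step()                           # loss shrinks over iterations
```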
  • the living body detection model can be deployed on an electronic device for living body detection.
  • The living body detection neural network training method provided in this application combines local features and global features for joint learning, which can improve the robustness of the living body detection neural network.
  • the living body detection neural network after the living body detection neural network is trained, it can be applied to electronic equipment for living body detection.
  • the electronic device 101 shown in FIG. 1b may use the living body detection neural network provided by the embodiment of the present application.
  • the electronic device 101 can turn on the camera 105, and determine whether the human face in the image captured by the camera is a living body through the living body detection neural network.
  • the electronic device can determine whether the user making the gesture is a living body through the living body detection neural network.
  • When the electronic device determines whether to switch to the child mode, it can determine, from the image taken by the camera, whether the human face in the image is a child's live face or a child's photo on the wall.
  • the living body detection neural network can be pre-set in the electronic device, or it can be downloaded to the electronic device through the server.
  • the embodiment of the present application does not limit the specific deployment mode of the living body detection model.
  • Fig. 5 shows a flow chart of applying a living body detection model provided by an embodiment of the present application. This method can be applied to the electronic equipment introduced above, and can also be applied to the electronic equipment not introduced above.
  • the electronic device 101 shown in FIG. 1b is taken as an example for description.
  • S502 The electronic device 101 acquires a current image.
  • the current image may be an image captured by the electronic device 101 through the camera 105, or an image captured by the electronic device 101 through another electronic device.
  • the electronic device 101 when the electronic device 101 needs to activate the portrait tracking or face tracking function, it may start the step of acquiring the current image.
  • the electronic device 101 may start to acquire the current image through the camera 105.
  • the current image may be an RGB image.
  • the current image can be an image taken by the camera alone, or it can be the current frame or a certain frame of image in the video that the camera continuously takes.
  • the electronic device can obtain the current image directly from the camera, or obtain the current image through the camera of another device.
  • the electronic device is physically or wirelessly connected to an independent camera, and the current image is obtained from the independent camera.
  • the electronic device can also obtain the current image through the camera of another electronic device.
  • the electronic device may also obtain the current image through the cloud server. For example, after the camera of an entrance guard captures the current image, it is transmitted to the cloud server, and then the cloud server sends the current image to the electronic device for identification.
  • the embodiment of the present application does not limit whether the acquired image is the currently captured image.
  • S504 Perform face detection on the current image, and determine at least one face.
  • Face detection is a technology that finds the position and size of a human face in any image. It can detect facial features and ignore other things such as buildings, trees, and bodies.
  • the electronic device 101 can also determine how many faces the current image contains according to the face detection algorithm. For example, the electronic device 101 may determine that three human faces are included in the current image according to a face recognition algorithm, and determine the specific positions of the three human faces in the image. In other words, the electronic device 101 can obtain data of at least one face in the current image.
  • If it is determined that the current image does not contain a human face, the process can return to step S502 to continue acquiring images.
  • If at least one human face is detected, step S506 can be continued.
  • Face detection can be implemented in many ways. For example, human faces can be recognized based on geometric features, templates, or models. The embodiment of the present application does not limit the specific algorithm of face detection.
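  • For instance, one widely available implementation (OpenCV's stock Haar-cascade detector; the application does not fix a particular algorithm, so this is only an example) finds the position and size of each face:

```python
import cv2

# Sketch of step S504 with OpenCV's bundled Haar cascade; the input path
# is an assumption.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("current_image.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Each entry gives the position and size of one detected face.
for (x, y, w, h) in faces:
    face_crop = image[y:y + h, x:x + w]
# If faces is empty, return to step S502 and keep acquiring images.
```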
  • S506 Align at least one human face.
  • Facial alignment refers to locating key facial feature points, such as eyes, nose tip, etc., according to the input face image.
  • Through face feature detection, the landmarks of a human face can be detected. Landmarks mark key positions of the face, such as the sides, corners, and contours of the face, and are used to describe the shape of the human face.
  • A series of landmark points may be obtained after landmark detection is performed on the at least one detected face. The detected landmarks and the template's landmarks are used to calculate an affine matrix H, and H is then used to directly compute the aligned face.
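  • A hedged sketch of this alignment step (the 5-point template coordinates for a 112×112 crop are a common convention and an assumption here, as is the similarity-transform estimator):

```python
import cv2
import numpy as np

# Template landmarks: eyes, nose tip, mouth corners for a 112x112 face crop.
template = np.array([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                     [41.5, 92.4], [70.7, 92.2]], dtype=np.float32)

def align_face(image, landmarks):
    """landmarks: detected 5-point float32 array of shape (5, 2)."""
    H, _ = cv2.estimateAffinePartial2D(landmarks, template)  # affine matrix H
    return cv2.warpAffine(image, H, (112, 112))              # aligned face
```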
  • the electronic device can directly input the image of the human face into the living body detection model for living body recognition.
  • S508 Perform preprocessing on at least one face.
  • In other embodiments, the living body detection model often requires input images of a uniform size. In this case, at least one face image needs to be preprocessed according to the requirements the living body detection model places on its input.
  • The preprocessing may include denoising the face image, cropping, resizing, posture rotation, and so on.
  • This step is optional. It can be understood that the embodiment of the present application does not limit the specific manner of preprocessing.
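  • As a purely illustrative sketch, assuming the detection model expects a fixed 112x112 RGB input normalized to [0, 1] (neither the size nor the normalization is specified by this embodiment), the preprocessing might look like:

```python
import cv2
import numpy as np

def preprocess_face(face_bgr, size=(112, 112)):
    """Denoise, resize, and normalize a face crop for the detection model."""
    face = cv2.resize(face_bgr, size)          # unify the input size
    face = cv2.GaussianBlur(face, (3, 3), 0)   # light denoising
    face = cv2.cvtColor(face, cv2.COLOR_BGR2RGB)
    return face.astype(np.float32) / 255.0     # scale pixel values to [0, 1]
```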
  • S510 The preprocessed face image(s) can be input into the living body detection neural network for living body detection, and the living body detection result is then determined.
  • The living body detection result is used to indicate whether the face in the input face image is a living body.
  • For example, if three faces are recognized in the current image, the electronic device inputs the three face images into the living body detection neural network to determine whether the faces in the three images are living bodies.
  • If the living body detection result indicates that a living face exists, step S512 is executed.
  • If no face in the current image is a living face, the method returns to step S502 to continue acquiring the current image.
  • S512 Perform an action according to the result of the living body detection.
  • This step is optional. After the electronic device obtains the living body detection result, it can perform corresponding actions according to the living body detection result.
  • For example, a user uses the electronic device 101 to make a video call with another user.
  • The electronic device 101 obtains the current image through the camera 105. If the living body detection result shows that a face in the current image is a living body, the electronic device can start a face tracking algorithm to track the living face and adjust the portrait to the center of the video screen. For example, the face 106 in FIG. 1b can be placed in the middle of the picture. If the result shows no living body, the electronic device can continue the living body detection without adjusting the screen, or it can adjust the angle of the camera and reacquire images for living body detection.
  • The electronic device 101 may continue to perform living body detection during the video call to filter out non-living bodies. For example, during the call, the user raises the mobile phone to show the other party a photo of a third person. At this time, the electronic device 101 can determine through living body detection that the face in the photo is not a living body, and will not adjust the photo to the middle of the screen.
  • Alternatively, after the electronic device 101 determines a living face, it can use the face tracking algorithm to track that face and perform no further living body detection; when the living face disappears, living body detection is restarted.
  • In the child mode scenario, the electronic device can further determine whether the living face belongs to a child. If the electronic device determines that the living face belongs to a child, the child mode is activated. If the living body detection result shows no living face, the electronic device can continue to display the current interface, or enter the standby state to reduce power consumption.
  • the embodiments of the present application do not limit the actions performed according to the results of the living body detection, and the electronic device can perform any actions according to the results of the living body detection.
  • Tracking the face of a living body is only one of the application scenarios of the living body detection neural network provided in the embodiment of the present application. Any scenario that requires living body detection may apply the solution provided in the embodiment of the present application, for example, making payments with an electronic device, the child mode, gesture control, and so on.
  • the embodiments of the present application do not limit the application scenarios of the living body detection neural network.
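  • Putting steps S502-S512 together, a schematic pipeline could look as follows. This is a sketch only: `detect_faces`, `align_face`, and `preprocess_face` are the illustrative helpers above, while `detect_landmarks` and `liveness_net` are placeholders for any landmark detector and for the trained living body detection neural network, not real APIs of this application:

```python
def liveness_pipeline(frame_bgr, liveness_net, template):
    """Run the S502-S512 flow on one acquired frame; return per-face results."""
    results = []
    for (x, y, w, h) in detect_faces(frame_bgr):                      # S504
        crop = frame_bgr[y:y + h, x:x + w]
        aligned = align_face(crop, detect_landmarks(crop), template)  # S506
        face = preprocess_face(aligned)                               # S508
        is_live = liveness_net(face)                                  # S510
        results.append(((x, y, w, h), is_live))
    return results  # the caller acts on the results in S512
```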
  • Fig. 6 shows a flowchart of another living body detection method provided by an embodiment of the present application.
  • The living body detection method can be applied to the electronic equipment introduced above or to other electronic equipment, or it can run in a cloud server.
  • This application does not limit the specific types of electronic devices, as long as they have computing capabilities.
  • S602 Acquire a second image of the target object.
  • The target object refers to the face on which living body detection needs to be performed.
  • the second image may be an RGB image.
  • the second image can be obtained directly from the camera, or it can be a processed image.
  • the second image can be obtained in the manner of steps S502-S508 shown in FIG. 5.
  • the embodiment of the present application does not limit the specific acquisition method of the second image.
  • S604 Generate a third depth image according to the second image.
  • the second image is input into the living body detection neural network.
  • the electronic device may generate a third depth image corresponding to the second image according to the living body detection neural network.
  • the third depth image corresponding to the second image can be generated according to the depth image generation network in the living body detection neural network.
  • For the specific generation method, refer to step S204 shown in FIG. 2 and the coarse-to-fine method shown in FIG. 3.
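  • As a hedged sketch of the coarse-to-fine idea (the layer sizes, the concatenation-based fusion, and the input-size assumption are all illustrative; the embodiment only requires a coarse branch, a fine branch, and a fusion of their features):

```python
import torch
import torch.nn as nn

class DepthGenerator(nn.Module):
    """Toy coarse-to-fine depth generator: two light CNN branches, fused."""
    def __init__(self):
        super().__init__()
        # Coarse branch: aggressive downsampling captures the global layout.
        self.coarse = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False))
        # Fine branch: keeps full resolution to capture local detail.
        self.fine = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        # Fuse the concatenated features into a one-channel depth map.
        self.head = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, rgb):  # rgb: (N, 3, H, W) with H and W divisible by 8
        fused = torch.cat([self.coarse(rgb), self.fine(rgb)], dim=1)
        return self.head(fused)
```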
  • S606 Acquire a global feature according to the second image and the third depth image.
  • the features of the second image and the third depth image can be extracted according to the detection network in the living body detection neural network, and then the features of the two images are fused to form a fused feature map.
  • the detection network may include two independent backbone networks to extract the features of the second image and the third depth image respectively.
  • The global feature can then be output by means such as global pooling.
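  • A minimal sketch of such a detection network, assuming two small convolutional backbones, concatenation as the fusion operation, and global average pooling (the embodiment does not fix the backbone architecture, the fusion algorithm, or the pooling type):

```python
import torch
import torch.nn as nn

class DetectionNet(nn.Module):
    """Independent backbones for the RGB and depth inputs, fused for detection."""
    def __init__(self):
        super().__init__()
        def backbone(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.rgb_backbone = backbone(3)      # features of the second image
        self.depth_backbone = backbone(1)    # features of the third depth image
        self.pool = nn.AdaptiveAvgPool2d(1)  # global pooling -> global feature
        self.fc = nn.Linear(128, 2)          # living body vs. non-living body

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_backbone(rgb),
                           self.depth_backbone(depth)], dim=1)  # fused feature map
        global_feat = self.pool(fused).flatten(1)               # global feature
        return self.fc(global_feat)                             # detection output
```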
  • S608 Determine a living body detection result based on the global feature.
  • the living body detection result can be determined according to the detection network.
  • the live detection result is used to indicate whether the target object is alive.
  • At inference time, the detection network directly outputs the judgment of whether the target object is a living body, and does not need to pass through the loss function layer to output a probability vector.
  • the detection network in the living body detection neural network can determine whether the target object in the second image is a living body according to the second image and the third depth image corresponding to the second image.
  • The living body detection method provided by the embodiments of the present application does not require additional hardware to obtain multi-modal data; it can directly generate a depth image from an RGB image, fuse the depth image and the RGB image, and perform living body detection according to the fused features. This detection method guarantees the accuracy of living body detection without incurring additional costs.
  • the living body detection method provided by the embodiments of the present application can use a lightweight convolutional neural network in the process of generating a depth image, which greatly reduces the amount of calculation.
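  • Lightweight convolutional networks typically cut computation by replacing standard convolutions with depthwise separable ones; the MobileNet-style block below is only an illustration of that idea, since the embodiment does not name a specific lightweight architecture:

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch):
    """A depthwise 3x3 plus pointwise 1x1 pair: far fewer multiply-adds
    than a standard 3x3 convolution with the same channel counts."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),  # depthwise 3x3
        nn.ReLU(),
        nn.Conv2d(in_ch, out_ch, 1),                          # pointwise 1x1
        nn.ReLU())
```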
  • FIG. 7 is a schematic diagram of the structure of the server provided by the embodiment of the application.
  • The server 1301 may also be a chip or circuit of an electronic device, for example, a chip or circuit that can be set in a cloud computing platform.
  • the server 1301 may further include a bus system, wherein the processor 1302, the memory 1304, and the communication interface 1303 may be connected through the bus system.
  • the aforementioned processor 1302 may be a chip.
  • The processor 1302 may be a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), or a microcontroller unit (MCU), and may also be a programmable logic device (PLD) or another integrated chip.
  • the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 1302 or instructions in the form of software.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor 1302.
  • The software module can be located in a mature storage medium in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 1304, and the processor 1302 reads the information in the memory 1304, and completes the steps of the foregoing method in combination with its hardware.
  • the processor 1302 in the embodiment of the present application may be an integrated circuit chip with signal processing capability.
  • the steps of the foregoing method embodiments may be completed by hardware integrated logic circuits in the processor or instructions in the form of software.
  • The above-mentioned processor may be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • The software module can be located in a mature storage medium in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the memory 1304 in the embodiment of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory can be read-only memory (ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), and electrically available Erase programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache.
  • By way of example but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM).
  • the server 1301 may include a processor 1302, a communication interface 1303, and a memory 1304.
  • The memory 1304 is used to store instructions, and the processor 1302 is used to execute the instructions stored in the memory 1304, so as to implement the related solution of the server in the method corresponding to any one of the above embodiments.
  • FIG. 8 is a schematic diagram of a server provided by an embodiment of the application.
  • The server 1501 may be a server, or a chip or circuit, such as a chip or circuit that can be installed in a server.
  • The division of the above server units is only a division of logical functions; in actual implementation, the units may be fully or partially integrated into one physical entity, or may be physically separated.
  • the transceiver unit 1503 may be implemented by the communication interface 1303 in FIG. 7 described above, and the processing unit 1502 may be implemented by the processor 1302 in FIG. 7 described above.
  • the present application also provides a computer program product.
  • the computer program product includes: computer program code.
  • When the computer program code runs on a computer, the computer executes the method of any one of the embodiments shown in FIGS. 2 to 6.
  • The embodiment of the present application also provides a computer-readable storage medium. The computer-readable medium stores program code, and when the program code runs on a computer, the computer executes the method of any one of the embodiments shown in FIGS. 2 to 6.
  • the embodiment of the present application also provides an electronic device, which includes the aforementioned server.
  • the embodiment of the present application also provides a system, which includes the aforementioned operating device, one or more authentication devices, and the aforementioned server.
  • All or part of the above embodiments may be implemented by software, hardware, firmware, or any combination thereof.
  • When implemented by software, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • Computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or in a wireless manner (such as infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • The usable medium can be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a high-density digital video disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)), or the like.
  • The server in the foregoing device embodiments corresponds to the server in the method embodiments, and the corresponding module or unit executes the corresponding steps.
  • For example, the communication unit (transceiver) executes the receiving or sending steps in the method embodiments, and steps other than sending and receiving can be executed by the processing unit (processor).
  • For the functions of specific units, refer to the corresponding method embodiments. There may be one or more processors.
  • Terms such as "component" used in this specification denote computer-related entities: hardware, firmware, a combination of hardware and software, software, or software in execution.
  • For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer.
  • Both an application running on a computing device and the computing device itself can be components.
  • One or more components may reside in processes and/or threads of execution, and components may be located on one computer and/or distributed between two or more computers.
  • these components can be executed from various computer readable media having various data structures stored thereon.
  • These components can communicate through local and/or remote processes, for example, based on a signal having one or more data packets (such as data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet that interacts with other systems through the signal).
  • the disclosed system, device, and method can be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division of units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the function is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, or the like) execute all or part of the steps of the methods in the various embodiments of the present application.
  • The aforementioned storage media include various media that can store program codes, such as a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the field of data processing and provides a monocular camera-based liveness detection method. An RGB image is used to determine whether the face in the image is live. The method comprises: obtaining a first image, the first image being an RGB image, and the first image comprising a face image of a target object; obtaining a first depth image according to the first image and a depth image generation network; determining a liveness detection result according to the first image, the first depth image, and a detection network, the liveness detection result being used for indicating whether the target object is live; and executing an action on the basis of the liveness detection result.

Description

Living body detection method, device, and readable storage medium based on a monocular camera
This application claims priority to the Chinese patent application No. 202010338191.8, filed with the China National Intellectual Property Administration on April 26, 2020 and entitled "Monocular camera-based liveness detection method, device, and readable storage medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of data processing, and in particular to a living body detection method, device, and readable storage medium based on a single-camera RGB image.
Background
The popularization of smart terminals and various devices with cameras has provided a broad foundation for biometric recognition based on face detection and recognition. For face detection and recognition, living body detection is usually indispensable in many scenarios, such as payment, access control, and other security-related occasions, as well as in Huawei terminal products such as the child mode on large screens and portrait tracking in video calls.
Generally, in order to improve the accuracy of living body detection, multi-modal images are introduced. In addition to common RGB images, infrared images, depth images, and the like are introduced to make up for the shortcomings of conventional visible light images. Depth images can be generated by binocular cameras and specific equipment, but this method is relatively expensive and has relatively low popularity.
Summary of the invention
The embodiment of the present application provides a living body detection method based on a monocular camera, which can ensure the accuracy of living body detection without adding extra cost.
In order to achieve the foregoing objectives, the following technical solutions are adopted in the embodiments of this application:
In a first aspect, a living body detection method is provided. The method is applied to an electronic device and may include: acquiring a first image, where the first image is an RGB image and includes a face image of a target object; obtaining a first depth image according to the first image and a depth image generation network; determining a living body detection result according to the first image, the first depth image, and a detection network, where the living body detection result is used to indicate whether the target object is a living body; and executing an action according to the living body detection result.
The technical solution provided by the first aspect generates a depth image from an RGB image and then performs living body detection based on the RGB image and the depth image. This method can effectively prevent attacks in living body detection and improve the accuracy of living body detection, without adding extra equipment to obtain the depth image, effectively reducing the cost.
In a possible implementation manner, the depth image generation network includes a first neural network and a second neural network, and obtaining the first depth image according to the first image and the depth image generation network specifically includes: extracting a coarse-grained feature of the first image through the first neural network; extracting a fine-grained feature of the first image through the second neural network; and generating the first depth image according to the coarse-grained feature and the fine-grained feature.
In a possible implementation manner, generating the first depth image according to the coarse-grained feature and the fine-grained feature includes: acquiring a fusion feature, where the fusion feature is obtained by fusing the coarse-grained feature and the fine-grained feature through a fusion algorithm; and generating the first depth image according to the fusion feature.
In a possible implementation manner, the first neural network and the second neural network are lightweight convolutional neural networks.
In a possible implementation manner, the detection network includes a third neural network and a fourth neural network, and determining the living body detection result according to the first image, the first depth image, and the detection network specifically includes: extracting features of the first image through the third neural network; extracting features of the first depth image through the fourth neural network; obtaining a feature map according to the features of the first image and the features of the first depth image; and determining the living body detection result according to the feature map.
In a possible implementation manner, determining the living body detection result according to the feature map specifically includes: performing global pooling on the feature map to obtain a global feature; and determining the living body detection result according to the global feature.
In a possible implementation manner, the method further includes: acquiring a second image; acquiring the face image in the second image according to a face detection algorithm; and determining the first image according to the face image.
In a possible implementation manner, the first image is a face image that has been aligned and preprocessed.
In a possible implementation manner, performing an action according to the living body detection result includes: when the living body detection result indicates that the target object is a living body, performing portrait tracking on the target object.
In a possible implementation manner, performing an action according to the living body detection result includes: when the living body detection result indicates that the target object is a living body, determining whether the target object is a child; and when the target object is a child, switching to a child mode.
In a possible implementation manner, the depth image generation network and the detection network belong to a living body detection neural network, and the living body detection neural network is obtained by joint training based on a local feature and the global feature.
In a second aspect, an electronic device is provided, which has the functions for implementing the method in any one of the possible implementations of the first aspect. The functions can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-mentioned functions.
In a third aspect, an electronic device is provided, including one or more processors, a memory, a camera, and one or more computer programs, where the one or more computer programs are stored in the memory and include instructions; when the instructions are executed by the electronic device, the electronic device executes any one of the possible implementations of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, including computer instructions, which, when run on an electronic device, cause the electronic device to execute any one of the possible implementations of the first aspect.
In a fifth aspect, a chip is provided, which is coupled with a memory in an electronic device, so that the chip invokes, when running, the program instructions stored in the memory, causing the electronic device to execute any one of the possible implementations of the first aspect.
Description of the drawings
FIG. 1a is an application scenario of a living body detection method provided by an embodiment of this application;
FIG. 1b is an application scenario of another living body detection method provided by an embodiment of this application;
FIG. 1c is an application scenario of another living body detection method provided by an embodiment of this application;
FIG. 1d is a schematic structural diagram of an electronic device provided by an embodiment of this application;
FIG. 1e is a schematic diagram of a convolutional neural network provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of a method for training a living body detection model provided by an embodiment of this application;
FIG. 3 is a schematic diagram of generating a depth image based on coarse-to-fine according to an embodiment of this application;
FIG. 4 is a schematic diagram of local feature training provided by an embodiment of this application;
FIG. 5 is a flowchart of applying a living body detection model provided by an embodiment of this application;
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of another electronic device provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of another electronic device provided by an embodiment of this application.
Detailed description
The terms "first", "second", and "third" in the specification, claims, and drawings of this application are used to distinguish different objects, rather than to define a specific order.
In the embodiments of the present application, words such as "exemplary" or "for example" are used as examples, illustrations, or explanations. Any embodiment or design solution described as "exemplary" or "for example" in the embodiments of the present application should not be construed as being preferable or advantageous over other embodiments or design solutions. To be precise, words such as "exemplary" or "for example" are intended to present related concepts in a specific manner.
In order to make the description of the following embodiments clear and concise, a brief introduction of related technologies is given first:
Each pixel of an RGB image has three values representing color; that is, a wide variety of colors are obtained through variations of the three colors red, green, and blue and their superposition.
A depth image, also called a range image, refers to an image in which the distance from an image collector, such as a camera, to each point in the scene is used as the pixel value of the corresponding pixel. The depth image directly reflects the geometry of the visible surface of the subject.
A monocular camera generally refers to a single camera. A monocular camera can only capture one type of image at a time.
A binocular camera generally refers to two cameras, which can acquire two different types of images at the same time. For example, a binocular camera can simultaneously acquire RGB images and depth images. Exemplarily, the binocular camera may be a 3D camera including a color camera and a depth sensor.
A deep neural network (DNN) is a deep learning framework that can model complex nonlinear systems. In other words, deep neural networks can systematically classify data.
A convolutional neural network (CNN) is composed of one or more convolutional layers and a fully connected layer at the top, and also includes associated weights and pooling layers. A convolutional neural network is a bottom-up network structure that uses multiple layers and abstracts layer by layer; each layer abstracts, on the basis of the layer below it, higher-level feature representations that are robust to various invariances. Referring to FIG. 1e, a convolutional neural network may include a convolutional layer, a pooling layer, and a fully connected layer. In some cases, the convolutional neural network can also be connected to a loss function layer (loss layer).
The convolutional layer is a set of parallel feature maps, which are formed by sliding different convolution kernels over the input image and running certain operations. In addition, at each sliding position, an element-wise multiply-and-sum operation is run between the convolution kernel and the input image to project the information onto an element of the feature map. For example, for an RGB image, the convolutional layer can convert the image into feature maps.
The pooling layer is a non-linear form of downsampling used to pool the feature maps. Pooling can take many different non-linear forms, such as max pooling and average pooling. Max pooling divides the input image into several rectangular regions and outputs the maximum value of each sub-region. For example, the pooling layer can pool a feature map to reduce the number of features in the feature map.
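A small numeric illustration of 2x2 max pooling (the kernel size and values are arbitrary examples):

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [0., 1., 2., 1.],
                    [1., 0., 1., 0.]]]])
pool = nn.MaxPool2d(kernel_size=2)  # keep the maximum of each 2x2 region
print(pool(x))  # tensor([[[[4., 8.], [1., 2.]]]])
```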
The fully connected layer is used for high-level reasoning in a neural network. Consider, for example, a 32*32 image of a handwritten "2". By looking at the image as a whole, the human eye can immediately recognize that this handwritten "2" is the digit 2. For an electronic device, however, all pixels of the image need to be fed into the neural network for processing before the digit can be recognized. If all pixels were input directly into a fully connected layer, the amount of data would be extremely large; for the above 32*32 image, 1.6 billion parameters might be obtained. In this case, the image can be preprocessed first and only then input into the fully connected layer for recognition. For example, the fully connected layer can map the features processed by the convolutional layers and pooling layers into a one-dimensional feature vector.
In order to train the neural network, the fully connected layer can also be connected to a loss function layer. The loss function layer can be used to determine the difference between the predicted result and the real result during the training process. Different loss functions are suitable for different types of tasks. For example, the Softmax function can map the outputs of multiple neurons in the neural network into the (0,1) interval for classification and calculation.
In other words, the process of training a neural network is a process of continuously reducing the loss by adjusting the parameters of the network.
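As a numerical sketch of that mapping and of the loss that training reduces (a two-class toy example, not the network of this application):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5])  # raw network outputs for (live, non-live)
probs = softmax(logits)        # about [0.82, 0.18]; both fall in (0, 1)
loss = -np.log(probs[0])       # cross-entropy loss if the true label is "live"
print(probs, loss)             # training adjusts parameters to shrink this loss
```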
As facial information recognition technology is applied more and more widely, security increasingly becomes the focus. The value of liveness detection technology lies in judging the identity of users and preventing attacks such as photos and videos. Specifically, in some scenarios, liveness detection means judging from a static picture whether the camera is facing the real user or a photo, without requiring the user to shake the head, blink, or perform other actions. Figure 1a shows one liveness detection scenario. Referring to Figure 1a, when the electronic device 102 wants to verify a user's identity, it needs to determine whether the current operator is the real user 104 or a face photo 103 of the user 104, so as to prevent others from using the photo 103 of the user 104 to obtain the user's authority and thereby harm the interests of the user 104. For example, when other people obtain the electronic device 102 of the user 104 and a photo of the user 104, they cannot use the photo of the user 104 to unlock the screen of the electronic device 102 or complete functions such as face-scan payment.
Existing living body detection is mainly divided into single-modal and multi-modal schemes. Single-modal means using images acquired by the same imaging device for living body detection. Exemplarily, a single-modal solution feeds an RGB image into a neural network to extract features for classification, then compares the features with previously saved facial features of the user, and finally determines whether there is a living body. Its main characteristics are simplicity, higher speed, and lower training and deployment costs. Multi-modal means using images acquired by different imaging devices for face matching. Exemplarily, a multi-modal solution fuses RGB images with corresponding multi-modal data, such as infrared images and depth images, and extracts deep features through a neural network for living body detection, where the depth image can be obtained through a binocular camera or other specific equipment. The advantage of the multi-modal scheme is that it has high accuracy and is not easy to attack.
However, both of these living body detection methods have certain shortcomings. For example, due to the lack of other types of data, a single-modal living body detection scheme easily recognizes photos, masks, and the like as the user himself, so its recognition accuracy is not high. In other single-modal living body detection solutions, in order to improve accuracy, besides using static RGB images, user actions such as blinking or turning the head can also be used to judge whether there is a living body; but this method requires the user to make specified actions, and the user experience is poor. Although the multi-modal living body detection scheme has high detection accuracy, multi-modal data is not easy to obtain and requires multiple types of cameras, resulting in high cost. At the same time, training a neural network with multi-modal data is more complicated.
In order to overcome the above technical problems, this application provides a living body detection method based on a single-camera RGB image. Specifically, after the electronic device obtains an RGB image, it can determine whether the portrait in the RGB image is a living body through a living body detection neural network. The living body detection neural network can generate a depth image based on the RGB image, fuse the features of the RGB image and the depth image, and determine, according to the fused features, whether the portrait in the RGB image is a living body. The living body detection method provided by the embodiments of the present application does not require additional equipment to obtain multi-modal images, while still guaranteeing the accuracy of living body detection.
Figure 1b shows an application scenario of an embodiment of the present application, mainly applied to portrait tracking in video calls. As shown in Figure 1b, the electronic device 101 has a camera 105. Exemplarily, the camera 105 can capture RGB images. When a user uses the electronic device 101 to make a video call with another user, the camera 105 can, through the single-camera RGB-image liveness detection technology provided in this embodiment of the application, judge whether the user is actually in front of the camera, adjust the captured picture, and place the user's image at the center of the picture.
Figure 1c shows another application scenario of an embodiment of the present application, mainly applied to the child mode of a large screen. As shown in Figure 1c, the electronic device 101 has a camera 105. Exemplarily, the electronic device 101 has a child mode and can judge through the camera 105 whether the person currently watching the screen is a child, so as to switch to the child mode and display programs for children. However, in some scenarios, an image 106 of a child may hang on the wall opposite the electronic device 101. As mentioned above, in the prior art it is difficult to judge with only one camera whether the image 106 is a real child. In this case, when an adult user is watching a program through the electronic device 101, the electronic device 101 may misrecognize the image 106 as a real child and switch to the child mode, seriously affecting the user's experience. The living body detection technology provided by the embodiments of the present application can effectively detect with a single camera that the image 106 is not a real person, thereby avoiding false triggering of the child mode.
It is understandable that the application scenarios shown in Figures 1b and 1c are merely exemplary. The embodiments of the present application provide a living body detection method using a single camera; any scenario in which living body detection can be used, such as payment or access control, can apply the technical solution provided in this application.
The electronic device in the embodiments of the present application may be a portable electronic device that also contains other functions such as a personal digital assistant and/or a music player function, such as a mobile phone, a tablet computer, or a wearable electronic device with a wireless communication function (such as a smart watch). Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running the operating systems shown in Figure PCTCN2021088272-appb-000001 or other operating systems, smart screens with cameras, TVs with cameras, and so on. The aforementioned portable electronic device may also be another portable electronic device, such as a laptop computer with a touch-sensitive surface (for example, a touch panel). It should also be understood that, in some other embodiments of the present application, the electronic device may not be a portable electronic device but a desktop computer with a touch-sensitive surface (for example, a touch panel).
The embodiments of this application do not impose special restrictions on the specific form of the electronic device.
Figure 1d exemplarily shows a schematic structural diagram of an electronic device. In Figure 1d, a mobile phone is used for illustration.
It should be understood that the illustrated electronic device is only an example; an electronic device may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application-specific integrated circuits.
As shown in Figure 1d, the mobile phone may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and so on.
The following describes each component of the electronic device in detail with reference to Figure 1d:
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and so on. Different processing units may be independent devices or integrated in one or more processors. The controller can be the nerve center and command center of the mobile phone; it can generate operation control signals according to instruction operation codes and timing signals to control instruction fetching and execution.
A memory may also be provided in the processor 110 to store instructions and data, for example, the correspondence between authentication devices, authentication methods, and authentication security values in this application, and the correspondence between operations and security values. In some embodiments, the memory in the processor 110 is a cache, which can store instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory, avoiding repeated accesses, reducing the waiting time of the processor 110, and thus improving the efficiency of the system.
The processor 110 may be configured to execute the solution for authenticating user information in the embodiments of the present application. When the server is integrated in the electronic device, the processor 110 can also execute the processing solutions performed by the server mentioned below, such as determining the authentication security value corresponding to an operating device, or calculating a total authentication security value from M authentication security values. When the processor 110 integrates different devices, such as a CPU and a GPU, the CPU and the GPU can cooperate to execute the method provided in the embodiments of the present application; for example, part of the algorithm of the method is executed by the CPU and another part by the GPU, to obtain faster processing efficiency.
In some embodiments, the processor 110 may include one or more interfaces. For example, the interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
The wireless communication function of the mobile phone may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like. The wireless communication function enables the communication between electronic devices, and between an electronic device and a server, in the embodiments of the present application.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the mobile phone may cover one or more communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization. For example, the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In some other embodiments, an antenna may be used in combination with a tuning switch.
The mobile communication module 150 may provide wireless communication solutions applied to the mobile phone, including 2G/3G/4G/5G. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 may receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 may also amplify a signal modulated by the modem processor and convert it into electromagnetic waves radiated through the antenna 1. In some embodiments, at least some functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some functional modules of the mobile communication module 150 and at least some modules of the processor 110 may be disposed in the same component.
The wireless communication module 160 may provide wireless communication solutions applied to the mobile phone, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite systems (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR). The wireless communication module 160 may be one or more components integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 may also receive signals to be sent from the processor 110, perform frequency modulation and amplification on them, and convert them into electromagnetic waves radiated through the antenna 2.
In some embodiments, the antenna 1 of the mobile phone is coupled to the mobile communication module 150 and the antenna 2 is coupled to the wireless communication module 160, so that the mobile phone can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite based augmentation systems (SBAS).
The mobile phone implements the display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor. The GPU performs the mathematical and geometric calculations used for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, quantum dot light-emitting diodes (QLED), or the like.
The mobile phone can implement the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like.
The camera 193 is used to capture still images or videos. An object generates an optical image through the lens, and the optical image is projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and then passes the electrical signal to the ISP, which converts it into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the mobile phone may include one or more cameras 193. In the embodiments of the present application, the camera 193 may be used to collect the facial information of the user.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the mobile phone. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example, saving files such as music and videos in the external memory card.
The internal memory 121 may be used to store computer-executable program code, and the executable program code includes instructions. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function). The data storage area may store data created during use of the mobile phone (such as audio data and a phone book). In addition, the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS). The processor 110 executes the various functional applications and data processing of the mobile phone by running the instructions stored in the internal memory 121 and/or the instructions stored in the memory disposed in the processor.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The microphone 170C, also called a "mic" or "mike", is used to convert a sound signal into an electrical signal. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The mobile phone may be provided with at least one microphone 170C. In some other embodiments, the mobile phone may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the mobile phone may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement a directional recording function, and so on. In the embodiments of the present application, the microphone 170C may be used to collect the voiceprint information of the user.
The fingerprint sensor 180H is used to collect fingerprints. The mobile phone can use the collected fingerprint characteristics to implement fingerprint unlocking, application-lock access, fingerprint photographing, fingerprint call answering, and the like. For example, a fingerprint sensor may be arranged on the front of the mobile phone (below the display screen 194) or on the back of the mobile phone (below the rear camera). The fingerprint recognition function can also be implemented by configuring the fingerprint sensor in the touch screen, that is, the fingerprint sensor may be integrated with the touch screen to implement the fingerprint recognition function of the mobile phone. In this case, the fingerprint sensor may be configured in the touch screen, may be a part of the touch screen, or may be configured in the touch screen in another way. In addition, the fingerprint sensor may also be implemented as a full-panel fingerprint sensor, in which case the touch screen can be regarded as a panel on which fingerprints can be collected at any position. In some embodiments, the fingerprint sensor may process the collected fingerprint (for example, determine whether the fingerprint passes verification) and send the result to the processor 110, and the processor 110 performs corresponding processing according to the fingerprint processing result. In other embodiments, the fingerprint sensor may send the collected fingerprint to the processor 110 so that the processor 110 processes the fingerprint (for example, performs fingerprint verification). In the embodiments of the present application, the fingerprint sensor 180H may be used to collect the fingerprint information of the user.
Although not shown in FIG. 1d, the mobile phone may further include a Bluetooth apparatus, a positioning apparatus, a flashlight, a miniature projection apparatus, a near field communication (NFC) apparatus, and the like, which are not described in detail here.
To improve the accuracy of living body detection, the living body detection neural network in the embodiments of the present application may be trained in advance. It can be understood that the living body detection neural network may include one or more convolutional neural networks, and different convolutional neural networks may implement different functions. Exemplarily, the living body detection neural network may include a depth image generation network and a detection network, and each network may include one or more types of convolutional neural networks. The living body detection method provided in the embodiments of the present application performs living body detection by generating an image of another modality from an RGB image. For ease of description, the embodiments of the present application take a depth image as an example. It can be understood that the embodiments of the present application may also generate other types of images, such as infrared images, to perform living body detection; the embodiments of the present application do not limit this.
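For orientation only, the following minimal sketch shows one way the two sub-networks described above could be composed; the module names (TinyDepthGen, TinyDetector), layer choices, and input size are illustrative assumptions, not a reference implementation of this application.

```python
import torch
import torch.nn as nn

class TinyDepthGen(nn.Module):
    # stand-in for the depth image generation network: RGB -> 1-channel depth
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )
    def forward(self, rgb):
        return self.net(rgb)

class TinyDetector(nn.Module):
    # stand-in for the detection network: fuses RGB + generated depth
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(16, 2)  # living body vs. non-living body
    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)  # multi-modal input
        return self.fc(self.backbone(x).flatten(1))

rgb = torch.randn(1, 3, 112, 112)
depth_net, det_net = TinyDepthGen(), TinyDetector()
logits = det_net(rgb, depth_net(rgb))  # shape (1, 2): liveness logits
```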
Based on the foregoing content, FIG. 2 exemplarily shows a schematic flowchart of the living body detection neural network training method provided by an embodiment of the present application. As shown in FIG. 2, the method includes:
S202: Acquire a first image and a first depth image corresponding to the first image.
When training the living body detection model, training data needs to be acquired first. Exemplarily, in the embodiments of the present application, the acquired training data may be first images and the first depth images corresponding to the first images, where a first image is an RGB image. It can be understood that the training data may include multiple first images and the first depth images corresponding to them. Meanwhile, the type of the first image is not limited to an RGB image, and the image corresponding to the first image may also be another type of image, such as an infrared image. As long as the first image and the image corresponding to the first image are images of different types, the requirements on the training data in the embodiments of the present application are met. For ease of description, the following takes the case where the first image is an RGB image and the image corresponding to the first image is a first depth image as an example.
Exemplarily, the first image and the first depth image corresponding to the first image may be pictures taken of the same object. For example, a binocular camera is used to photograph user A, and the first image and the corresponding first depth image are generated at the same time. It can be understood that the first image and the first depth image can be regarded as being captured of user A from the same angle.
In some other embodiments, the first depth image may also be generated using the principle of binocular stereo vision. For example, two cameras may simultaneously capture two images of the surrounding scene from different angles, or a single camera may capture two images of the surrounding scene from different angles at different times; the three-dimensional geometric information of the objects can then be recovered based on the parallax principle, thereby obtaining the depth image.
In some embodiments, the first image and the corresponding first depth image may come from a portrait database or be collected directly. The embodiments of the present application do not limit the manner of acquiring the first image and the first depth image.
It can be understood that, in the embodiments of the present application, the first image and the first depth image are both shots of the same object, for example, the same human face; the only difference is the type of the image.
S204: Generate a second depth image.
After the first image is obtained, the second depth image can be generated by the living body detection neural network. It can be understood that the first depth image is the original image captured by the camera, whereas the second depth image is generated algorithmically by extracting the features of the first image.
The embodiments of the present application may use the depth image generation network in the living body detection neural network to generate the second depth image.
Exemplarily, the depth image generation network may include two independent convolutional neural networks and generate the second depth image through a coarse-to-fine (CTF) method. One convolutional neural network may be used to extract the coarse-grained features of the first image, and the other convolutional neural network may be used to extract the fine-grained features of the first image. The coarse-grained features and fine-grained features of the first image are then combined by a fusion algorithm to generate the second depth image.
Further, the two independent convolutional neural networks may be lightweight convolutional neural networks. The advantage of a lightweight convolutional neural network is that the number of network parameters is reduced without losing network performance, so it can be used on mobile devices. For example, the lightweight convolutional neural network may be a FeatherNet, which guarantees both operation speed and accuracy.
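As an illustrative sketch of the coarse-to-fine idea described above, assuming two small convolutional branches and channel concatenation as the fusion step (the specific fusion algorithm is not limited here); the branch depths and channel counts are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineDepth(nn.Module):
    """Sketch of a CTF depth generator: a coarse branch at reduced
    resolution plus a fine branch at full resolution, fused into a
    single one-channel depth map."""
    def __init__(self):
        super().__init__()
        # coarse branch: runs on a downsampled input, captures global layout
        self.coarse = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        # fine branch: full resolution, captures local detail
        self.fine = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(32, 1, 3, padding=1)  # fusion -> depth map

    def forward(self, rgb):
        h, w = rgb.shape[-2:]
        coarse = self.coarse(F.interpolate(rgb, scale_factor=0.5))
        coarse = F.interpolate(coarse, size=(h, w))  # back to full resolution
        fine = self.fine(rgb)
        return self.fuse(torch.cat([coarse, fine], dim=1))

depth = CoarseToFineDepth()(torch.randn(1, 3, 112, 112))  # (1, 1, 112, 112)
```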
FIG. 3 shows a coarse-to-fine depth image generation method provided by an embodiment of the present application. As shown in FIG. 3, the method includes:
S302: Acquire the coarse-grained features of the first image through the first neural network.
Specifically, the first neural network may be a lightweight convolutional neural network. By inputting the first image into the first neural network, the coarse-grained features of the first image can be obtained.
S304: Acquire the fine-grained features of the first image through the second neural network.
Specifically, the second neural network may be a lightweight convolutional neural network. By inputting the first image into the second neural network, the fine-grained features of the first image can be obtained. It should be noted that coarse granularity and fine granularity are relative concepts. For example, for a face image, the contour features of the face can be defined as coarse-grained features, while the local features of the face, such as eyebrow features, can be defined as fine-grained features.
It can be understood that the first neural network may be a deep neural network or a convolutional neural network, and the second neural network may likewise be a deep neural network or a convolutional neural network.
In the embodiments of the present application, the execution order of steps S302 and S304 is not limited; step S302 may be executed after step S304, before step S304, or simultaneously with step S304.
S306: Generate the second depth image.
After the coarse-grained features and fine-grained features of the first image are obtained, the two kinds of features can be fused to generate the second depth image. The embodiments of the present application do not limit the specific fusion algorithm.
In the embodiments of the present application, the second depth image is generated by fusing the coarse-grained features and the fine-grained features, which helps to improve robustness.
S206: Acquire a first loss value according to the first depth image and the second depth image.
After the second depth image is acquired, the difference between the second depth image and the first depth image can be compared through an algorithm to determine the first loss value.
For example, a scale-invariant algorithm can be used to compare the difference between the second depth image and the first depth image to determine a scale-invariant loss.
The first loss value indicates the magnitude of the difference between the second depth image and the first depth image: the smaller the first loss value, the smaller the difference. Through continuous learning, the living body detection neural network generates increasingly accurate second depth images, so that the second depth image becomes closer and closer to the first depth image that was actually captured.
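One common form of scale-invariant depth loss (following the formulation of Eigen et al.; this application does not fix a particular variant) can be sketched as follows, where d is the difference of log-depths:

```python
import torch

def scale_invariant_loss(pred, target, lam=0.5, eps=1e-8):
    """Scale-invariant log-depth loss (Eigen et al. style).

    pred, target: depth maps of shape (B, 1, H, W) with positive values.
    loss = mean(d^2) - lam * mean(d)^2, with d = log(pred) - log(target);
    it penalizes depth differences while tolerating a global scale shift.
    """
    d = torch.log(pred.clamp(min=eps)) - torch.log(target.clamp(min=eps))
    d = d.flatten(1)  # (B, H*W)
    return (d.pow(2).mean(dim=1) - lam * d.mean(dim=1).pow(2)).mean()

# example: compare a generated depth map against the captured one
pred = torch.rand(2, 1, 112, 112) + 0.1
target = torch.rand(2, 1, 112, 112) + 0.1
print(scale_invariant_loss(pred, target))
```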
In the embodiments of the present application, through repeated training on multiple samples, the first loss value can be made smaller and smaller, so that the second depth image generated by the living body detection neural network becomes closer and closer to the actually captured first depth image. It can be understood that this step is performed when training the living body detection neural network; when the electronic device uses the living body detection neural network to detect a living body, there is no need to perform loss calculation on the generated depth image.
S208: Fuse the features of the first image and the second depth image.
After the second depth image is obtained, the features of the first image and of the second depth image can be extracted separately through the living body detection neural network, and the features extracted from the two images are then fused to form a feature map.
Exemplarily, the features of the first image and the second depth image may be extracted by two independent feature extraction networks in the detection network. Exemplarily, the two feature extraction networks may be two identical and independent backbone networks. A backbone network is a deep learning model used to extract image features and produce representations of the image at different sizes and different levels of abstraction. As another example, the two feature extraction networks may also be lightweight convolutional neural networks.
Since the first image and the second depth image are data of different modalities, this step may also be called multi-modal data fusion. The feature map includes, in addition to feature values, relative position information. For example, in a face image the eyes, nose, and mouth are arranged from top to bottom, and the extracted feature values are arranged in the same order.
In other words, the first image and the second depth image can each be input into at least one convolutional layer; the features are extracted and then fused, finally forming the feature map.
The fused features can be stored in a tensor for further learning. A tensor is a data container used to store data. For example, an RGB image can be processed into a 3D tensor in which each two-dimensional position has three elements representing the red, green, and blue values of a pixel.
Feature fusion can be implemented by a variety of fusion algorithms. For example, the feature fusion algorithm may be based on Bayesian decision theory, on sparse representation theory, or on deep learning theory. The embodiments of the present application do not limit the specific algorithm used for feature fusion.
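As an illustrative sketch of this multi-modal fusion step, assuming two identical, independent lightweight backbones and channel-wise concatenation as the fusion rule (one of several possibilities the text allows):

```python
import torch
import torch.nn as nn

def make_backbone(in_ch):
    # stand-in for an identical, independent lightweight backbone
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    )

rgb_backbone = make_backbone(3)    # extracts features of the first image
depth_backbone = make_backbone(1)  # extracts features of the generated depth image

rgb = torch.randn(1, 3, 112, 112)
depth = torch.randn(1, 1, 112, 112)
fused = torch.cat([rgb_backbone(rgb), depth_backbone(depth)], dim=1)
print(fused.shape)  # fused multi-modal feature map, e.g. (1, 128, 56, 56)
```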
S210: Acquire a second loss value based on the global features.
Global features refer to the overall attributes of an image, for example color features, texture features, and shape features. Global features are easily disturbed by the external environment.
Specifically, after the fused feature map is obtained, it can be input into the detection network for processing. For example, the detection network performs global pooling on the feature map to obtain the global features, which are then further input into a fully connected layer for living body detection.
To train the living body detection neural network, a loss function layer can be attached to the detection network, and a probability vector is obtained through the loss function layer. The second loss value is determined according to the probability vector and the label of the first image.
In some embodiments, referring to FIG. 1e, after the global features are processed by the fully connected layer, they can be further input into the loss function layer for classification, finally yielding the probability of belonging to each class. Exemplarily, the loss function may be a softmax function. Since the input first image and first depth image can be labeled in advance for classification, the difference between the probability output by the softmax function and the label can be compared to obtain the loss. The process of training the living body detection model is to find the optimal model parameters that minimize the loss.
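A minimal sketch of this global branch, assuming the fused feature map from the previous sketch and a two-class (living / non-living) output; the pooling choice and layer dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalHead(nn.Module):
    """Global pooling + fully connected classification head."""
    def __init__(self, in_ch=128, num_classes=2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # global pooling
        self.fc = nn.Linear(in_ch, num_classes)  # fully connected layer

    def forward(self, fused_map):
        g = self.pool(fused_map).flatten(1)  # global feature vector
        return self.fc(g)                    # logits for living / non-living

fused_map = torch.randn(1, 128, 56, 56)   # from the fusion step above
logits = GlobalHead()(fused_map)
probs = F.softmax(logits, dim=1)          # probability vector, e.g. [0.7, 0.3]
label = torch.tensor([0])                 # index 0 = living body (assumed)
second_loss = F.cross_entropy(logits, label)  # softmax loss vs. the label
```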
Exemplarily, the following uses a specific example to explain how the loss is obtained from the probability and label output by softmax.
Suppose there is a set of training images with three kinds of labels: cat, duck, and chicken. The label vectors may then be cat [1,0,0], duck [0,1,0], and chicken [0,0,1]. Image A is one of the training images, and its label vector is cat [1,0,0]. After image A is fed into the living body detection neural network, the probability vector output by the softmax function is [0.65, 0.05, 0.3]. A loss value can then be obtained by comparing the probability vector with the label vector; for example, the loss may be -log(0.65).
In some embodiments, the labels of the images used to train the living body detection neural network are living body and non-living body. The label vector of a living body is [1,0], and the label vector of a non-living body is [0,1]. Suppose the label of the first image is living body [1,0]; after the first image is input into the living body detection neural network, the probability vector output by the loss function layer is [0.7, 0.3]. The second loss value can be determined by comparing the difference between the probability vector and the label vector through an algorithm; for example, the second loss value may be 0.3.
It can be understood that this example is merely illustrative; there may be many kinds of first image labels, which is not limited in the embodiments of the present application.
Since the second loss value is determined according to the global features, the second loss value may also be called the global learning loss, which is used to measure the difference between the global features and the ground truth.
It can be understood that, after training is completed, the loss function layer of the detection network in the living body detection neural network can be removed, since only the judgment of whether the object is a living body needs to be output.
S212: Acquire a third loss value based on the local features.
Local features refer to features extracted from local regions of an image; the correlation between local features is small. For example, for a face image, the local features may be the eye, nose, and mouth features respectively. Local features can reflect subtle local differences in an image and are not easily disturbed by the external environment. After the local features are determined from the feature map, they can be processed and compared with the ground truth or the ground-truth label to determine the third loss value.
When training the living body detection neural network, the fused features generated in step S208 are partitioned into blocks for reinforced learning, that is, after the local features are extracted, local feature training is performed to obtain better living body detection performance.
FIG. 4 shows a schematic diagram of local feature training provided by an embodiment of the present application.
As shown in FIG. 4, after the feature map is obtained, it can be partitioned into local blocks. Exemplarily, the feature map can be divided into a first part 401, a second part 402, and a third part 403.
The first part 401, the second part 402, and the third part 403 are each pooled and convolved, then input into the fully connected layer, and finally the third loss is determined through the loss function. The embodiments of the present application do not limit the specific implementations of the pooling, convolution, fully connected layer, and loss. Exemplarily, the convolution may use a 1*1 convolution mode. Exemplarily, after a local feature is processed by the loss function, the output probability vector characterizes the probability that the feature belongs to a living body; the probability vector is then compared with the label vector to obtain the third loss value.
Exemplarily, the first part 401 may be the eye features, the second part 402 may be the nose features, and the third part 403 may be the mouth features. These three parts are input into the loss function for processing, and each outputs a living-body probability. Suppose the labels are living body and non-living body, the probability vector output for the eye features through the loss function layer is [0.5, 0.5], the probability vector for the nose features is [0.6, 0.4], and the probability vector for the mouth features is [0.7, 0.3]. Since the labels are only living body and non-living body, if the label of the first image is the living body label [1,0], the label vectors corresponding to the eye, nose, and mouth features can also be considered to be [1,0]. The third loss values determined from the probability vectors and the label vectors then include 0.5, 0.4, and 0.3. A sketch of this block-wise branch is given below.
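A minimal sketch of the block-wise local branch, assuming the feature map is split into three horizontal strips corresponding to the three parts, each with its own pooling, 1*1 convolution, and fully connected classifier; the split rule and dimensions are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalBranch(nn.Module):
    """One local block: pooling -> 1x1 convolution -> fully connected."""
    def __init__(self, in_ch=128, num_classes=2):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, 32, kernel_size=1)
        self.fc = nn.Linear(32, num_classes)

    def forward(self, block):
        x = F.adaptive_avg_pool2d(block, 1)  # pool the local block
        x = self.conv1x1(x).flatten(1)       # 1*1 convolution
        return self.fc(x)                    # local living/non-living logits

fused_map = torch.randn(1, 128, 56, 56)
parts = torch.chunk(fused_map, 3, dim=2)   # three horizontal strips
branches = nn.ModuleList(LocalBranch() for _ in parts)
label = torch.tensor([0])                  # living body label (assumed index)
third_losses = [F.cross_entropy(b(p), label) for b, p in zip(branches, parts)]
```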
In the embodiments of the present application, the execution order of steps S210 and S212 is not limited; step S210 may be executed after step S212, before step S212, or simultaneously with step S212.
It can be understood that, after the living body detection neural network is trained, the parts related to local feature learning can be removed. In other words, local feature learning serves to train the living body detection neural network, which includes the depth image generation network and the detection network.
After the first loss value, the second loss value, and the third loss value are obtained, the living body detection model can be trained through optimization algorithms, that is, iterative learning, to make the loss values as small as possible. For example, stochastic gradient descent (SGD) with momentum can be used to iteratively adjust the parameters in the living body detection model so that the loss computed in each iteration becomes smaller and smaller. Specifically, the weights of the convolution kernels in the convolutional layers are adjusted so that the loss value decreases.
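A minimal sketch of one momentum-SGD training step is given below, reusing the scale_invariant_loss function sketched earlier; the toy model, the equal loss weights, and the learning rate are assumptions, and the third (local) loss is omitted here for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy stand-in: predicts a depth map and living/non-living logits from RGB
class ToyLivenessModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.depth = nn.Conv2d(3, 1, 3, padding=1)
        self.cls = nn.Linear(4 * 112 * 112, 2)
    def forward(self, rgb):
        d = torch.sigmoid(self.depth(rgb)) + 0.1         # positive depths
        g = self.cls(torch.cat([rgb, d], 1).flatten(1))  # global logits
        return d, g

model = ToyLivenessModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

rgb = torch.randn(8, 3, 112, 112)              # a toy labeled batch
true_depth = torch.rand(8, 1, 112, 112) + 0.1
label = torch.randint(0, 2, (8,))

pred_depth, g_logits = model(rgb)
loss = (scale_invariant_loss(pred_depth, true_depth)  # first loss
        + F.cross_entropy(g_logits, label))           # second loss
optimizer.zero_grad()
loss.backward()
optimizer.step()  # one momentum-SGD update of the convolution kernel weights
```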
When the first loss value, the second loss value, and the third loss value satisfy certain thresholds, the living body detection model can be deployed on an electronic device for living body detection.
The living body detection neural network training method provided in this application learns from local features and global features jointly, which can improve the robustness of the living body detection neural network.
In the embodiments of the present application, after the living body detection neural network is trained, it can be applied in an electronic device for living body detection. For example, the electronic device 101 shown in FIG. 1b may use the living body detection neural network provided in the embodiments of the present application. Specifically, when a user starts a video call with other users through the electronic device 101, the electronic device 101 can turn on the camera 105 and determine, through the living body detection neural network, whether the human face in the image captured by the camera is a living body. As another example, in a gesture recognition process, the electronic device can use the living body detection neural network to judge whether the user making the gesture is a living body. As yet another example, when deciding whether to switch to child mode, the electronic device can use the image captured by the camera to judge whether the face in the image is the live face of a child or a photo of a child on the wall.
The living body detection neural network may be preset in the electronic device, or may be downloaded to the electronic device through a server. The embodiments of the present application do not limit the specific deployment manner of the living body detection model.
FIG. 5 shows a flowchart of applying the living body detection model provided by an embodiment of the present application. The method can be applied to the electronic devices introduced above as well as to electronic devices not introduced above. For ease of description, the electronic device 101 shown in FIG. 1b is taken as an example below.
As shown in FIG. 5, the specific steps of the method include:
S502: Acquire the current image.
Referring to FIG. 1b, the electronic device 101 acquires the current image. The current image may be an image captured by the electronic device 101 through the camera 105, or an image acquired by the electronic device 101 from another electronic device.
Exemplarily, when the electronic device 101 needs to start a portrait tracking or face tracking function, the step of acquiring the current image can begin. For example, when the user starts a video call with other users through the electronic device 101, or starts using gesture operations, the electronic device 101 can start acquiring the current image through the camera 105. Exemplarily, the current image may be an RGB image.
The current image may be a single image taken by the camera, or the current frame or a certain frame of a video continuously captured by the camera.
The electronic device can obtain the current image directly from its camera, or through the camera of another device. For example, the electronic device may be connected to an independent camera physically or wirelessly and obtain the current image from that camera. Alternatively, the electronic device may obtain the current image through the camera of another electronic device, or through a cloud server. For example, after the camera of an access control system captures the current image, it is transmitted to the cloud server, and the cloud server then sends the current image to the electronic device for recognition.
It can be understood that the embodiments of the present application do not limit whether the acquired image was captured at the current moment.
S504: Perform face detection on the current image and determine at least one face.
After the electronic device 101 obtains the current image, it can judge, according to a face detection algorithm, whether the current image contains a human face. Face detection is a technology for finding the position and size of human faces in an arbitrary image; it detects facial features while ignoring other things such as buildings, trees, and bodies.
At the same time, the electronic device 101 can also determine, according to the face detection algorithm, how many faces the current image contains. For example, the electronic device 101 may determine that the current image includes three faces and determine the specific positions of these three faces in the image. In other words, the electronic device 101 can obtain the data of at least one face in the current image.
If it is determined that the current image does not contain a face, step S502 is returned to and images continue to be acquired.
If it is determined that the current image contains a face, step S506 can be performed.
Face detection can be implemented in many ways; for example, faces can be recognized based on geometric features, templates, or models. The embodiments of the present application do not limit the specific face detection algorithm.
S506: Align the at least one face.
After a face is detected, it can be aligned. Facial alignment locates the key facial feature points, such as the eyes and the tip of the nose, according to the input face image. For example, face feature detection can be used to identify the positions of different facial features. Specifically, after a face is detected, the landmarks of the face can be detected. Landmarks mark the key positions of the face, such as its edges, corners, and contours, and are used to describe the shape of the face.
In some embodiments, a series of landmark points can be obtained after landmark detection is performed on the at least one detected face. Using the detected landmarks and the landmarks of a template, an affine matrix H is computed, and the aligned face is then obtained directly using H.
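A minimal sketch of this alignment step using OpenCV; the five landmark coordinates, the 112*112 template, and the file name below are illustrative values, not a template specified by this application:

```python
import cv2
import numpy as np

# five detected landmarks (eyes, nose tip, mouth corners) -- illustrative values
detected = np.float32([[38, 52], [74, 50], [56, 72], [42, 92], [70, 90]])
# corresponding template landmarks for a 112x112 aligned face -- assumed template
template = np.float32([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                       [41.5, 92.4], [70.7, 92.2]])

image = cv2.imread("face.jpg")  # hypothetical input containing the detected face
# estimate the (partial) affine matrix H mapping detected -> template landmarks
H, _ = cv2.estimateAffinePartial2D(detected, template)
aligned = cv2.warpAffine(image, H, (112, 112))  # aligned face crop
```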
This step is optional. After detecting a face, the electronic device can directly input the image containing the face into the living body detection model for living body recognition.
S508: Preprocess the at least one face.
The face images obtained through face detection and face alignment may be inconsistent in size and angle, while the living body detection model often requires input images of a uniform size. Therefore the at least one face needs to be preprocessed according to the input requirements of the living body detection model.
Exemplarily, the preprocessing may include denoising the face image, cropping, resizing, rotating the pose, and so on.
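A minimal sketch of such a preprocessing step using OpenCV; the 112*112 target size and the normalization scheme are assumptions for illustration:

```python
import cv2
import numpy as np

def preprocess_face(face_bgr, size=(112, 112)):
    """Denoise, resize, and normalize one face crop for the detection model."""
    face = cv2.GaussianBlur(face_bgr, (3, 3), 0)  # light denoising
    face = cv2.resize(face, size)                 # uniform input size
    face = cv2.cvtColor(face, cv2.COLOR_BGR2RGB)  # model expects RGB
    face = face.astype(np.float32) / 255.0        # scale to [0, 1]
    return np.transpose(face, (2, 0, 1))          # HWC -> CHW

crop = (np.random.rand(140, 120, 3) * 255).astype(np.uint8)  # stand-in crop
batch_ready = preprocess_face(crop)[None]  # shape (1, 3, 112, 112)
```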
This step is optional. It can be understood that the embodiments of the present application do not limit the specific manner of preprocessing.
S510: Determine the living body detection result according to the living body detection neural network.
After the at least one face is preprocessed, the preprocessed face image(s) can be input into the living body detection neural network for living body detection, and the living body detection result is then determined. The living body detection result indicates whether the face in the input face image is a living body.
For example, if there are three faces in the current image, the electronic device inputs, after recognition, the three face images into the living body detection neural network separately to determine whether the faces in them are living bodies.
If a face image is recognized as a living body, step S512 is performed.
If none of the faces in the current image is a live face, step S502 is returned to and the current image continues to be acquired.
S512: Perform an action according to the living body detection result.
This step is optional. After obtaining the living body detection result, the electronic device can perform a corresponding action according to the result.
For example, referring to FIG. 1b, the user makes a video call with another user through the electronic device 101. The electronic device 101 obtains the current image through the camera 105. If the living body detection result shows that a face in the current image is a living body, the electronic device can start a face tracking algorithm to track that live face and adjust the portrait to the center of the video frame; for example, the face 106 in FIG. 1b can be placed in the middle of the frame. If no living body detection result shows that a living body is present, the electronic device can continue the living body detection without adjusting the frame, or adjust the angle of the camera and reacquire images for living body detection.
In some embodiments, the electronic device 101 can continuously perform living body detection during a video call to filter out non-living bodies. For example, if during the call the user holds up the phone to show the other party a photo of a third person, the electronic device 101 can determine through living body detection that the face in the photo is not a living body, and will not adjust the photo to the middle of the frame.
In some other embodiments, after the electronic device 101 identifies a live face, it can track that face with the face tracking algorithm and stop performing living body detection; when the live face disappears, living body detection is restarted.
In still other embodiments, if the living body detection result shows that a live face is present, the electronic device can further judge whether the live face belongs to a child. If the electronic device determines that the live face belongs to a child, the child mode is activated. If the living body detection result shows that no live face is present, the electronic device can continue to display the current interface, or enter the standby state as soon as possible to reduce power consumption.
The embodiments of the present application do not limit the actions performed according to the living body detection result; the electronic device may perform any action according to the result.
It should be noted that tracking a live face is only one application of the living body detection neural network provided in the embodiments of the present application. Any scenario requiring living body detection may apply the solution provided in the embodiments of the present application, for example payment through an electronic device, child mode, and gesture control. The embodiments of the present application do not limit the application scenarios of the living body detection neural network.
FIG. 6 shows a flowchart of another living body detection method provided by an embodiment of the present application. The living body detection method can be applied to the electronic devices introduced above, to electronic devices not introduced above, or run in a cloud server. This application does not limit the specific type of the electronic device, as long as it has computing capability.
S602: Acquire a second image of the target object.
A second image of the target object is acquired, where the target object is the face on which living body detection needs to be performed. The second image may be an RGB image.
The second image can be obtained directly from the camera, or it can be a processed image; for example, the second image can be obtained in the manner of steps S502-S508 shown in FIG. 5.
The embodiments of the present application do not limit the specific manner of acquiring the second image.
S604: Generate a third depth image according to the second image.
The second image is input into the living body detection neural network. The electronic device can generate the third depth image corresponding to the second image through the living body detection neural network, specifically through the depth image generation network in the living body detection neural network.
For the specific generation manner, refer to step S204 shown in FIG. 2 and the coarse-to-fine method shown in FIG. 3.
S606: Acquire the global features according to the second image and the third depth image.
After the third depth image is obtained, the features of the second image and the third depth image can be extracted through the detection network in the living body detection neural network, and the features of the two images are then fused to form a fused feature map. The detection network may include two independent backbone networks that extract the features of the second image and the third depth image respectively.
For the specific process, refer to step S208 shown in FIG. 2.
After the fused feature map is obtained, the global features can be further output through pooling or similar operations.
S608: Determine the living body detection result based on the global features.
After the global features are obtained, the living body detection result can be determined through the detection network. The living body detection result indicates whether the target object is a living body. For the specific determination manner, refer to step S210 shown in FIG. 2; the difference is that at this point the detection network directly outputs the judgment of whether the target object is a living body, without passing through a loss function layer to output a probability vector.
In other words, the detection network in the living body detection neural network can determine whether the target object in the second image is a living body according to the second image and the third depth image corresponding to the second image.
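Putting the inference path together, a minimal sketch reusing the illustrative modules sketched earlier (CoarseToFineDepth, the two backbones, and GlobalHead); at inference, an argmax over the logits replaces the loss function layer, and the class index for living body is an assumption:

```python
import torch

# reusing the illustrative modules sketched earlier in this description
depth_net = CoarseToFineDepth()
head = GlobalHead(in_ch=128)

second_image = torch.randn(1, 3, 112, 112)  # preprocessed RGB face
with torch.no_grad():
    third_depth = depth_net(second_image)   # S604: generate the depth image
    fused = torch.cat([rgb_backbone(second_image),
                       depth_backbone(third_depth)], dim=1)  # S606: fuse
    logits = head(fused)                    # S608: classify global features
    is_live = logits.argmax(dim=1).item() == 0  # 0 = living body (assumed)
```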
The living body detection method provided in the embodiments of the present application does not require additional hardware to obtain multi-modal data: the depth image can be generated directly from the RGB image, the depth image and the RGB image are fused, and living body detection is performed according to the fused features. This detection approach guarantees the accuracy of living body detection without requiring additional cost. Meanwhile, the living body detection method provided in the embodiments of the present application can use lightweight convolutional neural networks in the process of generating the depth image, which greatly reduces the amount of computation.
根据前述方法,图7为本申请实施例提供的服务器的结构示意图,如图7所示,该服务器可以集成在电子设备上,也可以放置在云端,也可以为芯片或电路,比如可设置于电子设备的芯片或电路,再比如可设置于云端计算平台内的芯片或电路。According to the foregoing method, FIG. 7 is a schematic diagram of the structure of the server provided by the embodiment of the application. As shown in FIG. The chip or circuit of an electronic device, for example, a chip or circuit that can be set in a cloud computing platform.
进一步的,该服务器1301还可以进一步包括总线系统,其中,处理器1302、存储器1304、通信接口1303可以通过总线系统相连。Further, the server 1301 may further include a bus system, wherein the processor 1302, the memory 1304, and the communication interface 1303 may be connected through the bus system.
应理解,上述处理器1302可以是一个芯片。例如,该处理器1302可以是现场可编程门阵列(field programmable gate array,FPGA),可以是专用集成芯片(application specific integrated circuit,ASIC),还可以是系统芯片(system on chip,SoC),还可以是中央处理器(central processor unit,CPU),还可以是网络处理器(network processor,NP),还可以是数字信号处理电路(digital signal processor,DSP),还可以是微控制器(micro controller unit,MCU),还可以是可编程控制器(programmable logic device,PLD)或其他集成芯片。It should be understood that the aforementioned processor 1302 may be a chip. For example, the processor 1302 may be a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or a system on chip (SoC). It can be a central processor unit (CPU), a network processor (NP), a digital signal processing circuit (digital signal processor, DSP), or a microcontroller (microcontroller). unit, MCU), and may also be a programmable logic device (PLD) or other integrated chips.
In an implementation process, the steps of the foregoing method may be completed by an integrated logic circuit of hardware in the processor 1302 or by instructions in the form of software. The steps of the method disclosed with reference to the embodiments of this application may be directly performed by a hardware processor, or performed by a combination of hardware and software modules in the processor 1302. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1304, and the processor 1302 reads information from the memory 1304 and completes the steps of the foregoing method in combination with its hardware.
It should be noted that the processor 1302 in this embodiment of this application may be an integrated circuit chip with a signal processing capability. In an implementation process, the steps of the foregoing method embodiments may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The foregoing processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed with reference to the embodiments of this application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads information from the memory and completes the steps of the foregoing method in combination with its hardware.
It can be understood that the memory 1304 in this embodiment of this application may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (read-only memory, ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (random access memory, RAM), which is used as an external cache. By way of example rather than limitation, many forms of RAM are available, such as a static random access memory (static RAM, SRAM), a dynamic random access memory (dynamic RAM, DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), a synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and a direct rambus random access memory (direct rambus RAM, DR RAM). It should be noted that the memories of the systems and methods described in this specification are intended to include, but are not limited to, these and any other suitable types of memories.
When the server 1301 corresponds to the server in the foregoing method, the server 1301 may include a processor 1302, a communication interface 1303, and a memory 1304. The memory 1304 is configured to store instructions, and the processor 1302 is configured to execute the instructions stored in the memory 1304, to implement the solution related to the server in the method corresponding to any one of the foregoing embodiments.
For concepts, explanations, detailed descriptions, and other steps of the server that are related to the technical solutions provided in the embodiments of this application, refer to the descriptions of this content in the foregoing method or other embodiments. Details are not repeated here.
Based on the foregoing embodiments and the same concept, FIG. 8 is a schematic diagram of a server according to an embodiment of this application. As shown in FIG. 8, the server 1501 may be a server, or may be a chip or circuit, for example, a chip or circuit that can be disposed in a server.
For concepts, explanations, detailed descriptions, and other steps of the server that are related to the technical solutions provided in the embodiments of this application, refer to the descriptions of this content in the foregoing method or other embodiments. Details are not repeated here.
It can be understood that for functions of the units in the server 1501, refer to the implementation of the corresponding method embodiments. Details are not repeated here.
It should be understood that the foregoing division of the units of the server is merely a division of logical functions, and in actual implementation, all or some of the units may be integrated into one physical entity or may be physically separate. In this embodiment of this application, the transceiver unit 1503 may be implemented by the communication interface 1303 in FIG. 7, and the processing unit 1502 may be implemented by the processor 1302 in FIG. 7.
According to the methods provided in the embodiments of this application, this application further provides a computer program product. The computer program product includes computer program code, and when the computer program code is run on a computer, the computer is caused to perform the method of any one of the embodiments shown in FIG. 2 to FIG. 6.
According to the methods provided in the embodiments of this application, the embodiments of this application further provide a computer-readable storage medium. The computer-readable medium stores program code, and when the program code is run on a computer, the computer is caused to perform the method of any one of the embodiments shown in FIG. 2 to FIG. 6.
According to the methods provided in the embodiments of this application, the embodiments of this application further provide an electronic device, including the foregoing server.
According to the methods provided in the embodiments of this application, the embodiments of this application further provide a system, including the foregoing operating device, one or more authentication devices, and the foregoing server.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, all or some of the procedures or functions according to the embodiments of this application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (digital subscriber line, DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a high-density digital video disc (digital video disc, DVD)), a semiconductor medium (for example, a solid state disk (solid state disc, SSD)), or the like.
The server in each of the foregoing apparatus embodiments corresponds to the server in the method embodiments, and the corresponding steps are performed by corresponding modules or units. For example, the communication unit (transceiver) performs the receiving or sending steps in the method embodiments, and steps other than sending and receiving may be performed by the processing unit (processor). For functions of specific units, refer to the corresponding method embodiments. There may be one or more processors.
The terms "component", "module", "system", and the like used in this specification indicate computer-related entities, hardware, firmware, combinations of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, an execution thread, a program, and/or a computer. As illustrated, both an application running on a computing device and the computing device itself may be components. One or more components may reside within a process and/or an execution thread, and a component may be located on one computer and/or distributed between two or more computers. In addition, these components may be executed from various computer-readable media having various data structures stored thereon. The components may communicate by using local and/or remote processes based on, for example, a signal having one or more data packets (for example, data from two components interacting with another component in a local system or a distributed system, and/or data exchanged with other systems across a network such as the Internet by means of a signal).
A person of ordinary skill in the art may be aware that the various illustrative logical blocks and steps described with reference to the embodiments disclosed in this specification can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraints of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered as going beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for a specific working process of the foregoing system, apparatus, and unit, refer to the corresponding process in the foregoing method embodiments. Details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiments are merely examples. For example, the division of units is merely a division of logical functions, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings, direct couplings, or communication connections may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each of the units may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or a part of the technical solutions may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for instructing a computer device (which may be a personal computer, a server, or the like) to perform all or some of the steps of the methods in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (14)

  1. A living body detection method, applied to an electronic device, characterized by comprising:
    acquiring a first image, wherein the first image is an RGB image, and the first image comprises a face image of a target object;
    obtaining a first depth image according to the first image and a depth image generation network;
    determining a living body detection result according to the first image, the first depth image, and a detection network, wherein the living body detection result is used to indicate whether the target object is a living body; and
    performing an action according to the living body detection result.
  2. The method according to claim 1, wherein the depth image generation network comprises a first neural network and a second neural network; and
    the obtaining a first depth image according to the first image and the depth image generation network specifically comprises:
    extracting a coarse-grained feature of the first image through the first neural network;
    extracting a fine-grained feature of the first image through the second neural network; and
    generating the first depth image according to the coarse-grained feature and the fine-grained feature.
  3. The method according to claim 2, wherein the generating the first depth image according to the coarse-grained feature and the fine-grained feature comprises:
    acquiring a fusion feature, wherein the fusion feature is obtained by fusing the coarse-grained feature and the fine-grained feature through a fusion algorithm; and
    generating the first depth image according to the fusion feature.
  4. The method according to claim 2, wherein the first neural network and the second neural network are lightweight convolutional neural networks.
  5. The method according to claim 1, wherein the detection network comprises a third neural network and a fourth neural network; and
    the determining a living body detection result according to the first image, the first depth image, and the detection network specifically comprises:
    extracting a feature of the first image through the third neural network;
    extracting a feature of the first depth image through the fourth neural network;
    acquiring a feature map according to the feature of the first image and the feature of the first depth image; and
    determining the living body detection result according to the feature map.
  6. The method according to claim 5, wherein the determining the living body detection result according to the feature map specifically comprises:
    performing global pooling on the feature map to acquire a global feature; and
    determining the living body detection result according to the global feature.
  7. The method according to claim 1, wherein the method further comprises:
    acquiring a second image;
    acquiring the face image in the second image according to a face detection algorithm; and
    determining the first image according to the face image.
  8. The method according to claim 7, wherein the first image is a face image that has been aligned and preprocessed.
  9. The method according to claim 1, wherein the performing an action according to the living body detection result comprises:
    when the living body detection result indicates that the target object is a living body, performing portrait tracking on the target object.
  10. The method according to claim 1, wherein the performing an action according to the living body detection result comprises:
    when the living body detection result indicates that the target object is a living body, determining whether the target object is a child; and
    when the target object is a child, switching to a child mode.
  11. The method according to any one of claims 1 to 10, wherein the depth image generation network and the detection network belong to a living body detection neural network; and
    the living body detection neural network is obtained through joint training based on a local feature and the global feature.
  12. An electronic device, characterized by comprising:
    one or more processors;
    a memory;
    a camera; and
    one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprise instructions, and when the instructions are executed by the electronic device, the electronic device is caused to perform the living body detection method according to any one of claims 1 to 11.
  13. A computer-readable storage medium, characterized by comprising computer instructions, wherein when the computer instructions are run on an electronic device, the electronic device is caused to perform the living body detection method according to any one of claims 1 to 11.
  14. A chip, wherein the chip is coupled to a memory in an electronic device, so that the chip invokes, during running, program instructions stored in the memory, to cause the electronic device to perform the living body detection method according to any one of claims 1 to 11.
PCT/CN2021/088272 2020-04-26 2021-04-20 Monocular camera-based liveness detection method, device, and readable storage medium WO2021218695A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010338191.8A CN113553887A (en) 2020-04-26 2020-04-26 Monocular camera-based in-vivo detection method and device and readable storage medium
CN202010338191.8 2020-04-26

Publications (1)

Publication Number Publication Date
WO2021218695A1 true WO2021218695A1 (en) 2021-11-04

Family

ID=78129851

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088272 WO2021218695A1 (en) 2020-04-26 2021-04-20 Monocular camera-based liveness detection method, device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN113553887A (en)
WO (1) WO2021218695A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421591B (en) * 2022-08-15 2024-03-15 珠海视熙科技有限公司 Gesture control device and image pickup apparatus

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635770A (en) * 2018-12-20 2019-04-16 上海瑾盛通信科技有限公司 Biopsy method, device, storage medium and electronic equipment
CN110674759A (en) * 2019-09-26 2020-01-10 深圳市捷顺科技实业股份有限公司 Monocular face in-vivo detection method, device and equipment based on depth map

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202807A (en) * 2021-11-30 2022-03-18 北京百度网讯科技有限公司 Living body detection method, living body detection device, electronic apparatus, and storage medium
CN115174138A (en) * 2022-05-25 2022-10-11 北京旷视科技有限公司 Camera attack detection method, system, device, storage medium and program product
CN115174138B (en) * 2022-05-25 2024-06-07 北京旷视科技有限公司 Camera attack detection method, system, device, storage medium and program product

Also Published As

Publication number Publication date
CN113553887A (en) 2021-10-26

Similar Documents

Publication Publication Date Title
US10956714B2 (en) Method and apparatus for detecting living body, electronic device, and storage medium
WO2021218695A1 (en) Monocular camera-based liveness detection method, device, and readable storage medium
WO2017181769A1 (en) Facial recognition method, apparatus and system, device, and storage medium
WO2021031609A1 (en) Living body detection method and device, electronic apparatus and storage medium
WO2021078001A1 (en) Image enhancement method and apparatus
WO2020243967A1 (en) Face recognition method and apparatus, and electronic device
CN112037162B (en) Facial acne detection method and equipment
CN105654039B (en) The method and apparatus of image procossing
EP3776296B1 (en) Apparatus and method for recognizing an object in electronic device
WO2021008551A1 (en) Fingerprint anti-counterfeiting method, and electronic device
WO2021036853A1 (en) Image processing method and electronic apparatus
WO2022179376A1 (en) Gesture control method and apparatus, and electronic device and storage medium
Vazquez-Fernandez et al. Built-in face recognition for smart photo sharing in mobile devices
CN109937434B (en) Image processing method, device, terminal and storage medium
CN114140365B (en) Event frame-based feature point matching method and electronic equipment
CN112052830B (en) Method, device and computer storage medium for face detection
CN112036331A (en) Training method, device and equipment of living body detection model and storage medium
WO2020103732A1 (en) Wrinkle detection method and terminal device
US20220130019A1 (en) Electronic device and method for processing image by same
CN107977636B (en) Face detection method and device, terminal and storage medium
CN115150542B (en) Video anti-shake method and related equipment
CN110348272B (en) Dynamic face recognition method, device, system and medium
CN114827442A (en) Method and electronic device for generating image
CN115049819A (en) Watching region identification method and device
CN113591526A (en) Face living body detection method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21797442

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21797442

Country of ref document: EP

Kind code of ref document: A1