CN113642466B - Living body detection and model training method, apparatus and medium - Google Patents

Living body detection and model training method, apparatus and medium

Info

Publication number
CN113642466B
CN113642466B CN202110932017.0A
Authority
CN
China
Prior art keywords
image
convolution
depth
deformable
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110932017.0A
Other languages
Chinese (zh)
Other versions
CN113642466A (en)
Inventor
付华
赵立军
蒋宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd filed Critical Mashang Xiaofei Finance Co Ltd
Priority to CN202110932017.0A priority Critical patent/CN113642466B/en
Publication of CN113642466A publication Critical patent/CN113642466A/en
Application granted granted Critical
Publication of CN113642466B publication Critical patent/CN113642466B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a living body detection, image classification and model training method, apparatus, device and medium, relating to the technical field of image processing, and aims to improve the speed of living body detection. The method comprises the following steps: acquiring a target face image group, wherein the target face image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image; fusing the RGB image and the depth image in a first fusion mode to obtain a first fused image; and inputting the first fused image into a first model to obtain a first living body detection result. The first model is a SqueezeNet, the size of the face region in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement. The embodiment of the invention can improve the speed of living body detection.

Description

Living body detection and model training method, apparatus and medium
This application is a divisional application of the invention application filed on November 27, 2019, with application number 201911186208.6, entitled "Living body detection, image classification and model training methods, apparatus, devices and media".
Technical Field
The invention relates to the technical field of image processing, in particular to a living body detection and model training method, equipment and medium.
Background
With the wide application of technologies such as face recognition and face unlocking in daily scenarios such as finance, access control and mobile devices, face anti-spoofing (living body detection, Face Anti-Spoofing) technology has received increasing attention in recent years. Based on deeper and more complex deep neural network models, living body detection models running on the server side can reach 99% accuracy. As application scenarios multiply, living body detection models that run in real time on mobile terminals are required.
Currently, mobile terminals mostly perform living body detection in an interactive manner. However, this approach requires the detected subject to cooperate by performing actions, which is time-consuming and therefore slows down detection.
Disclosure of Invention
The embodiment of the invention provides a living body detection and model training method, equipment and a medium.
In a first aspect, an embodiment of the present invention provides a method for detecting a living body, including:
acquiring a target face image group, wherein the target face image group comprises a frame of RGB (Red, Green, Blue) image and a frame of depth image corresponding to the RGB image;
fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
inputting the first fusion image into a first model to obtain a first living body detection result;
wherein the first model is SqueezeNet; the size of the face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement.
In a second aspect, an embodiment of the present invention further provides a model training method, including:
obtaining a model training sample set, wherein the model training sample set comprises a plurality of fusion images, and each fusion image is obtained by fusing a frame of red, green and blue RGB image and a frame of depth image corresponding to the RGB image;
inputting the training sample set into a machine learning network model, and training to obtain a first model;
wherein, the first model is a compressed network SqueezeNet; the size of a face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement.
In a third aspect, an embodiment of the present invention further provides an image classification method, including:
acquiring a target image group, wherein the target image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image;
fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
inputting the first fusion image into a first model to obtain an image classification result;
wherein the first model is SqueezeNet; the size of the target area in the RGB map meets a first preset requirement and the depth of the depth map meets a second preset requirement.
In a fourth aspect, an embodiment of the present invention further provides a living body detection apparatus, including:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a target face image group, and the target face image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image;
the first fusion module is used for fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
the first processing module is used for inputting the first fusion image into a first model to obtain a first living body detection result;
wherein the first model is SqueezeNet; the size of the face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement.
In a fifth aspect, an embodiment of the present invention further provides a model training apparatus, including:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a model training sample set, the model training sample set comprises a plurality of fusion images, and each fusion image is obtained by fusing a frame of RGB image and a frame of depth image corresponding to the RGB image;
the training module is used for inputting the training sample set into a machine learning network model and training to obtain a first model;
wherein the first model is a SqueezeNet; the size of the face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement.
In a sixth aspect, an embodiment of the present invention further provides an image classification apparatus, including:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a target image group, and the target image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image;
the first fusion module is used for fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
the first processing module is used for inputting the first fusion image into a first model to obtain an image classification result;
wherein the first model is a SqueezeNet; the size of the target area in the RGB map meets a first preset requirement and the depth of the depth map meets a second preset requirement.
In a seventh aspect, an embodiment of the present invention further provides an electronic device, including: a transceiver, a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the steps in the method according to the first aspect or the second aspect or the third aspect as described above when executing the program.
In an eighth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the method according to the first aspect, the second aspect, or the third aspect.
In the embodiment of the invention, the single frame of RGB image in the acquired target face image group and the corresponding depth image are fused, and the fused result is used as the input of the model to obtain the living body detection result. Therefore, with the apparatus provided by the embodiment of the invention, the detected subject does not need to cooperate by performing actions, so the detection speed is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a flowchart of a method for detecting a living body according to an embodiment of the present invention;
FIG. 2 is a flowchart of selecting a target face image group according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an image fusion process provided by an embodiment of the invention;
FIG. 4 is a block diagram of a Fire Module according to an embodiment of the present invention;
FIG. 5 is a second flowchart of a method for detecting a living body according to an embodiment of the present invention;
FIG. 6 is a flow chart of a model training method provided by an embodiment of the invention;
FIG. 7 is a flowchart of an image classification method provided by an embodiment of the invention;
FIG. 8 is a structural view of a living body detecting apparatus provided in an embodiment of the present invention;
FIG. 9 is a block diagram of a model training apparatus according to an embodiment of the present invention;
FIG. 10 is a structural diagram of an image classification apparatus provided in an embodiment of the present invention;
FIG. 11 is a block diagram of an electronic device according to an embodiment of the present invention;
FIG. 12 is a second block diagram of an electronic device according to an embodiment of the invention;
FIG. 13 is a third structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a living body detection method according to an embodiment of the present invention, which is applied to an electronic device, such as a mobile terminal. As shown in fig. 1, the method comprises the following steps:
step 101, obtaining a target face image group, wherein the target face image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image.
In the embodiment of the invention, the target face image group can be acquired through the camera provided by the electronic equipment. In practical application, a plurality of face image groups can be acquired through a camera provided by electronic equipment. In the embodiment of the invention, in order to improve the accuracy of judgment, the size of a face area in an RGB image in a target face image group is required to meet a first preset requirement, and the depth of a depth image is required to meet a second preset requirement. The first preset requirement and the second preset requirement can be set according to needs.
For example, the first preset requirement may be that the size of the face region is greater than a certain preset value, and the second preset requirement may be that the depth is greater than a certain preset value.
Thus, prior to step 101, the method may further comprise: the method comprises the steps of obtaining a face image group to be detected, wherein the face image group to be detected comprises a frame of RGB image and a frame of depth image corresponding to the RGB image, and then selecting a target face image group from the face image group to be detected.
Referring to fig. 2, a process of selecting the target face image group is shown. For a frame of RGB image in the acquired face image group to be detected and the corresponding frame of depth image, it is first judged whether a face region exists in the RGB image. If so, subsequent processing continues; otherwise, the face image group can be re-acquired. When a face region exists, the face region in the RGB image is determined, and it is judged whether the size of the face region meets the requirement. If it does, subsequent processing continues; otherwise, the face image group is re-acquired. When the size of the face region meets the preset requirement, the face region is cropped from the RGB image, and within the cropped face region the pixel positions of the RGB image and the depth image correspond one to one. It is then judged whether the depth of the cropped face region meets the requirement. If it does, subsequent processing continues; otherwise, the face image group can be re-acquired. Meanwhile, it is judged whether the face in the cropped face region is occluded. If not, subsequent processing continues; otherwise, the face image group can be re-acquired. If the face is not occluded and the depth of the cropped face region meets the preset requirement, the image pair can be taken as the target face image group for subsequent processing.
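For illustration only, the selection flow of FIG. 2 can be sketched in Python as follows. Here detect_face() and is_occluded() are hypothetical helpers, and MIN_FACE_SIZE and MIN_VALID_DEPTH_RATIO merely stand in for the first and second preset requirements, which can be set as needed.

```python
import numpy as np

MIN_FACE_SIZE = 96             # assumed stand-in for the first preset requirement (pixels)
MIN_VALID_DEPTH_RATIO = 0.5    # assumed stand-in for the second preset requirement

def select_target_group(rgb, depth):
    """Return (rgb_face, depth_face) if the image pair qualifies, else None."""
    box = detect_face(rgb)                       # hypothetical face detector
    if box is None:
        return None                              # no face region: re-acquire the image group
    x, y, w, h = box
    if min(w, h) < MIN_FACE_SIZE:
        return None                              # face region too small
    rgb_face = rgb[y:y + h, x:x + w]
    depth_face = depth[y:y + h, x:x + w]         # pixel positions correspond one to one
    valid_ratio = np.count_nonzero(depth_face) / depth_face.size
    if valid_ratio < MIN_VALID_DEPTH_RATIO:
        return None                              # depth does not meet the requirement
    if is_occluded(rgb_face):                    # hypothetical occlusion check
        return None
    return rgb_face, depth_face
```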
And step 102, fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image.
Referring to fig. 3, in an embodiment of the present invention, the fusion manner may include the following:
(1) Only the depth map is preserved, yielding a single-channel map (denoted as A, Depth(1));
(2) The depth map is mapped into a color map (denoted as B), which is superimposed on the RGB map (e.g., with different weights), yielding a three-channel map (Depth(3) + Color(3));
(3) Only the depth map is preserved, yielding a single-channel map; the single-channel map is added as an Alpha channel of the RGB map, yielding a four-channel map (Color(3) + Depth(A));
(4) The depth map is mapped into a color map (denoted as B) (Depth(3));
(5) The RGB map is converted into a single-channel grayscale map, and the depth map is mapped into a color map; the single-channel grayscale map is added as an Alpha channel of the color map, yielding a four-channel map (Depth(3) + Color(A)).
Accordingly, in this step, the first fusion mode may be any one of the above fusion modes. Specifically, the RGB map and the depth map are fused in the first fusion mode in any one of the following manners to obtain the first fused image (a code sketch of these modes follows the list):
only the depth map is reserved to obtain a first single-channel map; or
Mapping the depth map into a first color map, and superposing the first color map and the RGB map to obtain a three-channel map; or
Only the depth map is reserved to obtain a second single-channel map; adding the second single-channel image to an Alpha channel of the RGB image to obtain a four-channel image; or alternatively
Mapping the depth map into a second color map; or
Converting the RGB map to a single channel grayscale map, mapping the depth map to a second color map; and adding the single-channel gray image to an Alpha channel of the second color image to obtain a four-channel image.
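A minimal sketch of these fusion modes, assuming 8-bit inputs (rgb as an H×W×3 array and depth as an H×W array already normalized to the 0–255 range); OpenCV's color mapping stands in for the unspecified depth-to-color mapping, and the overlay weights are illustrative.

```python
import cv2
import numpy as np

def fuse(rgb, depth, mode):
    if mode == 1:                                       # Depth(1): keep only the depth map
        return depth
    color = cv2.applyColorMap(depth, cv2.COLORMAP_JET)  # depth mapped into a color map
    if mode == 2:                                       # Depth(3) + Color(3): weighted overlay
        return cv2.addWeighted(rgb, 0.6, color, 0.4, 0)
    if mode == 3:                                       # Color(3) + Depth(A): depth as Alpha channel
        return np.dstack([rgb, depth])
    if mode == 4:                                       # Depth(3): color-mapped depth only
        return color
    if mode == 5:                                       # Depth(3) + Gray(A): grayscale RGB as Alpha
        gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
        return np.dstack([color, gray])
    raise ValueError("unknown fusion mode: %d" % mode)
```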
And 103, inputting the first fusion image into a first model to obtain a first living body detection result.
In the embodiment of the present invention, the first model may be, for example, SqueezeNet or the like. Since the existing SqueezeNet is improved here, the SqueezeNet in the embodiment of the present invention may be referred to as an improved SqueezeNet. The SqueezeNet includes a Fire Module and a Streaming Module.
FIG. 4 is a structural diagram of the Fire Module according to an embodiment of the present invention. The Fire Module comprises a Squeeze layer, an Expand layer and a BatchNorm layer. The functions of the Squeeze layer and the Expand layer are the same as in the prior art, except that they perform convolution operations using a 1×1 convolution kernel and a 3×3 deformable convolution (Deformable Convolution, DConv) kernel. The BatchNorm layer is used to help the model converge; faster convergence speeds up obtaining an accurate model.
Wherein the 3×3 deformable convolution includes: Deformable Convolution V2 or Deformable Convolution V1;
or, the deformable convolution is a dilated (hole) convolution;
or, the 3×3 deformable convolution is replaced with a combination of a 1×3 deformable convolution and a 3×1 deformable convolution;
the Streaming Module is a k×k deformable depthwise convolution (Deformable Depthwise Convolution) used to perform weighted calculation on each region of the image, where k is a positive integer greater than 1.
Wherein the k×k deformable depthwise convolution (DDWConv) is obtained by combining a k×k depthwise convolution with any one of the following convolution modes:
Deformable Convolution V2, Deformable Convolution V1, or dilated (hole) convolution;
alternatively, the Streaming Module comprises a combination of a 1×k deformable depthwise convolution and a k×1 deformable depthwise convolution.
Specifically, the Streaming Module is a 7×7 deformable depthwise convolution configured to perform weighted calculation on each region of the image;
wherein the 7×7 deformable depthwise convolution is obtained by combining a 7×7 depthwise convolution with any one of the following convolution modes:
Deformable Convolution V2, Deformable Convolution V1, or dilated (hole) convolution;
alternatively, the 7×7 deformable depthwise convolution can be replaced by a combination of a 1×7 deformable depthwise convolution and a 7×1 deformable depthwise convolution.
Using the Streaming Module to perform weighted calculation on each region of the image can improve the accuracy of the model.
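For illustration, the modified Fire Module and Streaming Module can be sketched in PyTorch roughly as follows, using torchvision's DeformConv2d with sampling offsets predicted by an auxiliary convolution. The channel sizes, ReLU activations and offset-prediction scheme are assumptions for the sketch, not configurations stated in this disclosure.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformConvBlock(nn.Module):
    """3x3 deformable convolution (DConv) whose sampling offsets are learned
    by an auxiliary plain convolution (an assumed offset-prediction scheme)."""
    def __init__(self, cin, cout, k=3):
        super().__init__()
        pad = k // 2
        self.offset = nn.Conv2d(cin, 2 * k * k, k, padding=pad)
        self.dconv = DeformConv2d(cin, cout, k, padding=pad)

    def forward(self, x):
        return self.dconv(x, self.offset(x))

class FireModule(nn.Module):
    """Squeeze (1x1) -> Expand (1x1 and 3x3 deformable) -> BatchNorm."""
    def __init__(self, cin, squeeze, expand):
        super().__init__()
        self.squeeze = nn.Conv2d(cin, squeeze, 1)
        self.expand1x1 = nn.Conv2d(squeeze, expand, 1)
        self.expand3x3 = DeformConvBlock(squeeze, expand)
        self.bn = nn.BatchNorm2d(2 * expand)   # BatchNorm helps the model converge
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.act(self.squeeze(x))
        e = torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)
        return self.act(self.bn(e))

class StreamingModule(nn.Module):
    """k x k deformable depthwise convolution (DDWConv) that weights each
    region of the feature map, in place of global average pooling."""
    def __init__(self, channels, k=7):
        super().__init__()
        pad = k // 2
        self.offset = nn.Conv2d(channels, 2 * k * k, k, padding=pad)
        self.ddwconv = DeformConv2d(channels, channels, k, padding=pad,
                                    groups=channels)  # depthwise: groups = channels

    def forward(self, x):
        return self.ddwconv(x, self.offset(x))
```

As the alternatives above note, a 1×k plus k×1 factorization or a dilated kernel could be substituted for the k×k deformable depthwise convolution in this sketch.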
In practical applications, the first model can also be implemented using MobileNet, ShuffleNet, EfficientNet, and the like instead of SqueezeNet.
In the embodiment of the invention, the single frame of RGB image in the acquired target face image group and the corresponding depth image are fused, and the fused result is used as the input of the model to obtain the living body detection result. Therefore, with the scheme provided by the embodiment of the invention, the detected subject does not need to cooperate by performing actions, so the detection speed is improved.
Referring to fig. 5, fig. 5 is a flowchart of a method for detecting a living body according to an embodiment of the present invention, which is applied to an electronic device, such as a mobile terminal. As shown in fig. 5, the method comprises the following steps:
step 501, training a first model.
Wherein the first model may comprise SqueezeNet, and the like.
Taking the SqueezeNet as an example, in the embodiment of the present invention, the SqueezeNet is improved to obtain an improved SqueezeNet. In an embodiment of the present invention, the SqueezeNet includes a Fire Module and a Streaming Module. For the description of the SqueezeNet, reference is made to the description of the preceding embodiments.
In this step, a model training sample set is obtained, where the model training sample set includes a plurality of fusion images, where each fusion image is obtained by performing fusion processing on a frame of RGB image and a frame of depth image corresponding to the RGB image, and then the training sample set is input into a machine learning network model to train to obtain the first model.
Step 502, obtaining a face image group to be detected, wherein the face image group to be detected comprises a frame of RGB image and a frame of depth image corresponding to the RGB image.
Step 503, selecting a target face image group from the face image group to be detected. The target face image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image; the size of a face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement.
And step 504, fusing the RGB image and the depth image of the target face image group in a first fusion mode to obtain a first fusion image.
And 505, inputting the first fusion image into a first model to obtain a first living body detection result.
In the embodiment of the present invention, the first living body detection result may be a numerical value. By comparing this value with a preset threshold, it can be determined whether a real face image is included. In addition, if the value of the first living body detection result meets a preset condition, for example, falls within a certain value range, subsequent cascade judgment can be performed to improve the accuracy of the detection result.
Step 506, fusing the RGB image and the depth image in a second fusion mode to obtain a second fusion image; the second fusion mode is different from the first fusion mode.
The specific contents of the first fusion mode and the second fusion mode can refer to the description of the foregoing embodiments.
And 507, inputting the second fusion image into the first model or the second model to obtain a second living body detection result.
Wherein the first model and the second model are different models. The second model can be, for example, FeatherNet, MobileNet, ShuffleNet, EfficientNet, and the like. In practical applications, the second model may also be trained in advance.
And step 508, obtaining a final in vivo detection result according to the first in vivo detection result and the second in vivo detection result.
In the embodiment of the present invention, the second living body detection result may also be a numerical value. An operation is then performed on the first and second living body detection results, and the operation result is taken as the final living body detection result.
The operation comprises any one of the following: calculating a product of the first in-vivo detection result and a first weighting value, calculating a product of the second in-vivo detection result and a second weighting value, and summing the obtained products; calculating an average value of the first and second in-vivo detection results. Of course, there may be other calculation manners in practical application, and the embodiments of the present invention are not limited thereto.
And comparing the obtained operation value with a certain preset value so as to determine whether the real face image is included.
After the first living body detection result is obtained, the second living body detection result is obtained, and the first and second living body detection results are combined to obtain the final living body detection result. Through this cascade detection, the accuracy of the detection result can be improved.
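As a small illustrative helper (the weights and threshold are assumptions, not values given by this disclosure), the cascade decision could look like:

```python
def final_liveness(score1, score2, w1=0.5, w2=0.5, threshold=0.5, use_average=False):
    # weighted sum of the two cascade scores, or their simple average
    fused = (score1 + score2) / 2 if use_average else w1 * score1 + w2 * score2
    return fused, fused >= threshold   # True -> judged to be a real (live) face
```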
In the embodiment of the invention, the single frame of RGB image in the acquired target face image group and the corresponding depth image are fused, and the fused result is used as the input of the model to obtain the living body detection result. Therefore, with the scheme provided by the embodiment of the invention, the detected subject does not need to cooperate by performing actions, so the detection speed is improved. In addition, the embodiment of the invention adopts SqueezeNet, which is a very small model, so the scheme is suitable for deployment on mobile terminals.
Referring to fig. 6, fig. 6 is a flowchart of a model training method according to an embodiment of the present invention. As shown in fig. 6, the method comprises the following steps:
step 601, obtaining a model training sample set, wherein the model training sample set comprises a plurality of fusion images, and each fusion image is obtained by fusing a frame of RGB image and a frame of depth image corresponding to the RGB image.
In this step, an image to be processed may be acquired, and then a label is added to the image to be processed. The image to be processed comprises a frame of RGB image and a frame of depth image corresponding to the RGB image. When labeling, both the RGB map and the depth map may be labeled, or only the RGB map or the depth map may be labeled. The annotation is used for indicating whether a real face image exists in the image. And then, fusing the RGB image and the depth image to obtain a fused image. The fusion mode can refer to the description of the previous embodiment.
The classification model is trained using an α-Balanced Focal Loss (a balanced cross-entropy loss) as the loss function, together with the labels added to the images to be processed. This can effectively alleviate the imbalance in the class distribution and the difficulty distribution of the training samples, and improve the generalization ability and accuracy of the model.
The balanced cross-entropy loss function is calculated as follows:
FL(p_t) = -α_t (1 - p_t)^γ log(p_t)
FL is a cross-entropy loss function whose scale can be dynamically adjusted. It has two parameters, α_t and γ: α_t mainly addresses the imbalance between positive and negative samples, and γ mainly addresses the imbalance between easy and hard samples.
In addition, an α-Balanced Cross Entropy Loss can be used as the loss function, combined with OHEM (Online Hard Example Mining) to address the imbalance problem.
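A minimal PyTorch sketch of the α-balanced focal loss for binary liveness labels, following the formula above; the α and γ values shown are illustrative defaults.

```python
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0, eps=1e-8):
    """Alpha-balanced focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)            # probability of the true class
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    return (-alpha_t * (1 - p_t).pow(gamma) * torch.log(p_t + eps)).mean()
```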
Step 602, inputting the training sample set into a machine learning network model, and training to obtain a first model.
In the embodiment of the invention, the first model is SqueezeNet. The SqueezeNet comprises a Fire Module and a Streaming Module.
Wherein the Fire Module comprises a Squeeze layer, an Expand layer and a BatchNorm layer. The functions of the Squeeze layer and the Expand layer are the same as in the prior art, except that they perform convolution operations using a 1×1 convolution kernel and a 3×3 deformable convolution (Deformable Convolution) kernel. The BatchNorm layer is used to help the model converge; faster convergence speeds up obtaining an accurate model.
A Streaming Module replaces the GAP (Global Average Pooling) layer and is used to perform weighted calculation on each region of the image, thereby improving the accuracy of the model;
wherein the deformable convolution comprises: Deformable Convolution V2 or Deformable Convolution V1;
or, the deformable convolution is a dilated (hole) convolution;
or, the 3×3 deformable convolution is replaced with a combination of a 1×3 deformable convolution and a 3×1 deformable convolution;
the Streaming Module is a k×k deformable depthwise convolution (Deformable Depthwise Convolution) used to perform weighted calculation on each region of the image, where k is a positive integer greater than 1.
Wherein the k×k deformable depthwise convolution is obtained by combining a k×k depthwise convolution with any one of the following convolution modes:
Deformable Convolution V2, Deformable Convolution V1, or dilated (hole) convolution;
alternatively, the Streaming Module comprises a combination of a 1×k deformable depthwise convolution and a k×1 deformable depthwise convolution.
Specifically, the Streaming Module is a 7×7 deformable depthwise convolution configured to perform weighted calculation on each region of the image;
wherein the 7×7 deformable depthwise convolution is obtained by combining a 7×7 depthwise convolution with any one of the following convolution modes:
Deformable Convolution V2, Deformable Convolution V1, or dilated (hole) convolution;
alternatively, the 7×7 deformable depthwise convolution can be replaced by a combination of a 1×7 deformable depthwise convolution and a 7×1 deformable depthwise convolution.
In addition, in the embodiment of the invention, deformable convolutions are used to replace the 3×3 convolution kernels in SqueezeNet, so that the convolution kernels concentrate on more informative receptive regions, which enhances the model's feature extraction and improves its accuracy.
The SqueezeNet model is small and therefore well suited to deployment on a mobile terminal. On the basis of the above embodiment, the trained model can be pruned and retrained to reduce its size further (an illustrative pruning sketch follows).
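As an illustration only (the 30% sparsity is an assumed value), pruning could be carried out with PyTorch's built-in utilities and followed by fine-tuning:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_convs(model, amount=0.3):
    """L1 magnitude pruning of every Conv2d weight; fine-tune afterwards, then
    call prune.remove(module, "weight") to make the pruning permanent."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=amount)
    return model
```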
As can be seen from the above description, in the embodiment of the present invention, a single frame of RGB image and the corresponding depth image are fused in multiple ways, which increases the processing speed, and the accuracy of the detection result is improved through cascade judgment. The SqueezeNet model in the embodiment of the invention is small, so it is suitable for running on a mobile terminal. Meanwhile, the α-Balanced Focal Loss is used when training the SqueezeNet model, which effectively alleviates the imbalance in the class and difficulty distribution of the training samples and improves the generalization ability and accuracy of the model. In addition, deformable convolutions replace the 3×3 convolution kernels of the original SqueezeNet, so that the kernels concentrate on more informative receptive regions, which enhances feature extraction and improves model accuracy.
Referring to fig. 7, fig. 7 is a flowchart of an image classification method according to an embodiment of the present invention. As shown in fig. 7, the method comprises the following steps:
step 701, acquiring a target image group, wherein the target image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image.
The target image group may be an image including any content, such as a human face, a landscape, and the like.
And 702, fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image.
The fusion mode can be seen from the description of the previous embodiment.
And 703, inputting the first fusion image into a first model to obtain an image classification result.
Wherein, the first model is a compressed network SqueezeNet. The size of the target area in the RGB map meets a first preset requirement and the depth of the depth map meets a second preset requirement. The target area may be, for example, a human face area. The meaning of the first preset requirement and the second preset requirement may refer to the description of the foregoing method embodiments.
The specific structure of the compressed network SqueezeNet is described above, and the training method of the compressed network SqueezeNet is described above. Depending on the classification target, there may be different image classification results. For example, the image classification result may be an image including a human face and an image not including a human face, an image including a landscape and an image not including a landscape, and the like.
In the embodiment of the invention, the acquired single-frame RGB image and the corresponding depth image are fused, and the fused result is used as the input of the model, so that the image classification result is obtained. Therefore, the speed of image classification is improved by using the device provided by the embodiment of the invention.
The embodiment of the invention also provides a living body detection apparatus. Referring to fig. 8, fig. 8 is a structural diagram of a living body detection apparatus according to an embodiment of the present invention. Since the principle by which the living body detection apparatus solves the problem is similar to that of the living body detection method in the embodiment of the present invention, the implementation of the apparatus can refer to the implementation of the method, and repeated details are not described again.
As shown in fig. 8, the living body detecting apparatus includes: a first obtaining module 801, configured to obtain a target face image group, where the target face image group includes a frame of RGB image and a frame of depth image corresponding to the RGB image; a first fusion module 802, configured to fuse the RGB map and the depth map in a first fusion manner to obtain a first fusion image; a first processing module 803, configured to input the first fused image into a first model, so as to obtain a first living body detection result; the first model is an SqueezeNet, the size of a face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement.
Optionally, the first fusion module 802 fuses the RGB map and the depth map in a first fusion manner according to any one of the following manners to obtain a first fusion image:
only the depth map is reserved to obtain a first single channel map; or
Mapping the depth map into a first color map, and superposing the first color map and the RGB map to obtain a three-channel map; or
Only the depth map is reserved to obtain a second single channel map; adding the second single-channel image to an Alpha channel of the RGB image to obtain a four-channel image; or alternatively
Mapping the depth map into a second color map; or alternatively
Converting the RGB map to a single channel grayscale map, mapping the depth map to a second color map; and adding the single-channel gray-scale image to an Alpha channel of the second color image to obtain a four-channel image.
Optionally, the apparatus may further include:
the second fusion module is used for fusing the RGB image and the depth image in a second fusion mode to obtain a second fusion image; the second fusion mode is different from the first fusion mode;
the second processing module is used for inputting the second fusion image into the first model or the second model to obtain a second living body detection result; wherein the first model and the second model are different models;
and the third processing module is used for obtaining a final in-vivo detection result according to the first in-vivo detection result and the second in-vivo detection result.
Optionally, the third processing module is configured to perform an operation on the first in-vivo detection result and the second in-vivo detection result, and use an operation result as the final in-vivo detection result;
the operation comprises any one of the following:
calculating a product of the first in-vivo detection result and a first weighting value, calculating a product of the second in-vivo detection result and a second weighting value, and summing the obtained products; or alternatively
Calculating an average value of the first and second in-vivo detection results.
Optionally, the apparatus may further include:
and the training module is used for training the first model by using the model training method provided by the embodiment of the invention. The description of the first model may refer to the description of the previous embodiments, among others.
The apparatus provided in the embodiment of the present invention may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
The embodiment of the invention also provides a model training device. Referring to fig. 9, fig. 9 is a structural diagram of a model training apparatus according to an embodiment of the present invention. Because the principle of solving the problem of the model training device is similar to the model training method in the embodiment of the invention, the implementation of the model training device can refer to the implementation of the method, and repeated details are not repeated.
As shown in fig. 9, the model training apparatus includes: a first obtaining module 901, configured to obtain a model training sample set, where the model training sample set includes a plurality of fusion images, and each fusion image is obtained by performing fusion processing on a frame of RGB image and a frame of depth image corresponding to the RGB image; a training module 902, configured to input the training sample set into a machine learning network model, and train to obtain a first model; wherein, the first model is SqueezeNet. The size of the face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement.
Optionally, the SqueezeNet includes a Fire Module and a Streaming Module;
wherein the Fire Module comprises a Squeeze layer, an Expand layer and a BatchNorm layer;
the Squeeze layer and the Expand layer perform convolution operations using a 1×1 convolution kernel and a 3×3 deformable convolution kernel; the BatchNorm layer is used to help the model converge;
the Streaming Module is used to perform weighted calculation on each region of the image;
wherein the deformable convolution comprises: Deformable Convolution V2 or Deformable Convolution V1;
or, the deformable convolution is a dilated (hole) convolution;
or, the 3×3 deformable convolution is replaced with a combination of a 1×3 deformable convolution and a 3×1 deformable convolution;
the Streaming Module is a k×k deformable depthwise convolution used to perform weighted calculation on each region of the image;
wherein the k×k deformable depthwise convolution is obtained by combining a k×k depthwise convolution with any one of the following convolution modes:
Deformable Convolution V2, Deformable Convolution V1, or dilated (hole) convolution;
alternatively, the Streaming Module comprises a combination of a 1×k deformable depthwise convolution and a k×1 deformable depthwise convolution.
Specifically, the Streaming Module is a 7×7 deformable depthwise convolution configured to perform weighted calculation on each region of the image;
wherein the 7×7 deformable depthwise convolution is obtained by combining a 7×7 depthwise convolution with any one of the following convolution modes:
Deformable Convolution V2, Deformable Convolution V1, or dilated (hole) convolution;
alternatively, the 7×7 deformable depthwise convolution can be replaced by a combination of a 1×7 deformable depthwise convolution and a 7×1 deformable depthwise convolution.
The apparatus provided in the embodiment of the present invention may implement the method embodiments, and the implementation principle and technical effects are similar, which are not described herein again.
The embodiment of the invention also provides an image classification device. Referring to fig. 10, fig. 10 is a structural diagram of an image classification apparatus according to an embodiment of the present invention. Because the principle of the image classification device for solving the problem is similar to the image classification method in the embodiment of the present invention, the implementation of the image classification device can refer to the implementation of the method, and the repeated parts are not described again.
As shown in fig. 10, the image classification apparatus includes: a first obtaining module 1001, configured to obtain a target image group, where the target image group includes a frame of RGB image and a frame of depth image corresponding to the frame of RGB image; a first fusion module 1002, configured to fuse the RGB map and the depth map in a first fusion manner to obtain a first fusion image; a first processing module 1003, configured to input the first fused image into a first model to obtain an image classification result;
the first model is a compressed network SqueezeNet, the size of a target area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement.
The apparatus provided in the embodiment of the present invention may implement the method embodiments, and the implementation principle and technical effects are similar, which are not described herein again.
As shown in fig. 11, the electronic device according to the embodiment of the present invention includes: the processor 1100, which reads the program in the memory 1120, executes the following processes:
acquiring a target face image group, wherein the target face image group comprises a frame of red, green and blue (RGB) image and a frame of depth image corresponding to the RGB image;
fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
inputting the first fusion image into a first model to obtain a first living body detection result; the first model is an SqueezeNet, the size of a face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement.
A transceiver 1111 for receiving and transmitting data under the control of the processor 1100.
Where in fig. 11, the bus architecture may include any number of interconnected buses and bridges, with one or more processors, represented by processor 1100, and various circuits, represented by memory 1120, being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface. The transceiver 1111 may be a plurality of elements including a transmitter and a transceiver providing a means for communicating with various other apparatus over a transmission medium. The processor 1100 is responsible for managing the bus architecture and general processing, and the memory 1120 may store data used by the processor 1100 in performing operations.
The processor 1100 is also adapted to read the program and execute the following steps:
according to any one of the following modes, fusing the RGB map and the depth map in a first fusion mode to obtain a first fused image:
only the depth map is reserved to obtain a first single channel map; or alternatively
Mapping the depth map into a first color map, and superposing the first color map and the RGB map to obtain a three-channel map; or alternatively
Only the depth map is reserved to obtain a second single channel map; adding the second single-channel image to an Alpha channel of the RGB image to obtain a four-channel image; or alternatively
Mapping the depth map into a second color map; or alternatively
Converting the RGB map to a single channel grayscale map, mapping the depth map to a second color map; and adding the single-channel gray-scale image to an Alpha channel of the second color image to obtain a four-channel image.
The processor 1100 is also adapted to read the program and execute the following steps:
fusing the RGB image and the depth image in a second fusion mode to obtain a second fusion image; the second fusion mode is different from the first fusion mode;
inputting the second fusion image into the first model or the second model to obtain a second living body detection result; wherein the first model and the second model are different models;
and obtaining a final in-vivo detection result according to the first in-vivo detection result and the second in-vivo detection result.
The processor 1100 is also adapted to read the program and execute the following steps:
calculating the first in-vivo detection result and the second in-vivo detection result, and taking the calculation result as the final in-vivo detection result;
the operation comprises any one of the following:
calculating a product of the first in-vivo detection result and a first weighting value, calculating a product of the second in-vivo detection result and a second weighting value, and summing the obtained products; or alternatively
Calculating an average value of the first and second in-vivo detection results.
The processor 1100 is also adapted to read the program and execute the following steps:
the first model is trained by using the model training method of the embodiment of the invention.
The meaning of the first model can be referred to the description of the previous embodiments.
As shown in fig. 12, the electronic device according to the embodiment of the present invention includes: a processor 1200, for reading the program in the memory 1220, and executing the following processes:
obtaining a model training sample set, wherein the model training sample set comprises a plurality of fusion images, and each fusion image is obtained by fusing a frame of red, green and blue RGB image and a frame of depth image corresponding to the RGB image;
inputting the training sample set into a machine learning network model, and training to obtain a first model;
the first model is a compressed network SqueezeNet, the size of a face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement.
A transceiver 1210 for receiving and transmitting data under the control of the processor 1200.
Where in fig. 12, the bus architecture may include any number of interconnected buses and bridges, with various circuits of one or more processors represented by processor 1200 and memory represented by memory 1220 being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface. The transceiver 1210 may be a plurality of elements, including a transmitter and a transceiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 1200 is responsible for managing the bus architecture and general processing, and the memory 1220 may store data used by the processor 1200 in performing operations.
Wherein, the SqueezeNet comprises a Fire Module and a Streaming Module;
wherein the Fire Module comprises a Squeeze layer, an Expand layer and a BatchNorm layer;
the Squeeze layer and the Expand layer perform convolution operations using a 1×1 convolution kernel and a 3×3 deformable convolution kernel; the BatchNorm layer is used to help the model converge;
the Streaming Module is used to perform weighted calculation on each region of the image;
wherein the deformable convolution comprises: Deformable Convolution V2 or Deformable Convolution V1;
or, the deformable convolution is a dilated (hole) convolution;
or, the 3×3 deformable convolution is replaced with a combination of a 1×3 deformable convolution and a 3×1 deformable convolution;
the Streaming Module is a k×k deformable depthwise convolution used to perform weighted calculation on each region of the image;
wherein the k×k deformable depthwise convolution is obtained by combining a k×k depthwise convolution with any one of the following convolution modes:
Deformable Convolution V2, Deformable Convolution V1, or dilated (hole) convolution;
alternatively, the Streaming Module comprises a combination of a 1×k deformable depthwise convolution and a k×1 deformable depthwise convolution.
Specifically, the Streaming Module is a 7×7 deformable depthwise convolution used to perform weighted calculation on each region of the image;
wherein the 7×7 deformable depthwise convolution is obtained by combining a 7×7 depthwise convolution with any one of the following convolution modes:
Deformable Convolution V2, Deformable Convolution V1, or dilated (hole) convolution;
alternatively, the 7×7 deformable depthwise convolution can be replaced by a combination of a 1×7 deformable depthwise convolution and a 7×1 deformable depthwise convolution.
As shown in fig. 13, the electronic device according to the embodiment of the present invention includes: a processor 1300, for reading the program in the memory 1320, for executing the following processes:
acquiring a target image group, wherein the target image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image;
fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
inputting the first fusion image into a first model to obtain an image classification result;
the first model is SqueezeNet, the size of the target area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement.
A transceiver 1310 for receiving and transmitting data under the control of the processor 1300.
In fig. 13, among other things, the bus architecture may include any number of interconnected buses and bridges, with one or more processors, represented by the processor 1300, and various circuits, represented by the memory 1320, being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface. The transceiver 1310 can be a number of elements, including a transmitter and a transceiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 1300 is responsible for managing the bus architecture and general processing, and the memory 1320 may store data used by the processor 1300 in performing operations.
The meaning of the first model can be referred to the description of the previous embodiments.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above-mentioned embodiment of the living body detection method, the model training method, or the image classification method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, method, article, or apparatus that comprises that element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. With such an understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A living body detection method, comprising:
acquiring a target face image group, wherein the target face image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image;
fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
inputting the first fusion image into a first model to obtain a first living body detection result;
the first model is an improved SqueezeNet, the size of a face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement;
the improved SqueezeNet comprises a Fire Module and a flow module;
wherein the Fire Module comprises a Squeeze layer and an Expand layer;
the Squeeze layer and the Expand layer carry out convolution operations by using convolution kernels and deformable convolution kernels;
fusing the RGB image and the depth image in a second fusion mode to obtain a second fusion image; the second fusion mode is different from the first fusion mode;
inputting the second fusion image into the first model or the second model to obtain a second living body detection result; wherein the first model and the second model are different models;
and obtaining a final living body detection result according to the first living body detection result and the second living body detection result.
2. The method according to claim 1, wherein the fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image comprises any one of the following modes:
retaining only the depth image to obtain a first single-channel image; or
mapping the depth image into a first color image, and superposing the first color image and the RGB image to obtain a three-channel image; or
retaining only the depth image to obtain a second single-channel image, and adding the second single-channel image to an Alpha channel of the RGB image to obtain a four-channel image; or
mapping the depth image into a second color image; or
converting the RGB image into a single-channel grayscale image, mapping the depth image into a second color image, and adding the single-channel grayscale image to an Alpha channel of the second color image to obtain a four-channel image.
3. The method of claim 1, wherein obtaining a final living body detection result according to the first living body detection result and the second living body detection result comprises:
performing an operation on the first living body detection result and the second living body detection result, and taking a result of the operation as the final living body detection result;
the operation comprises any one of the following:
calculating a product of the first living body detection result and a first weighting value, calculating a product of the second living body detection result and a second weighting value, and summing the obtained products; or
calculating an average value of the first living body detection result and the second living body detection result.
4. The method of claim 1, wherein the deformable convolution comprises: a deformable convolution V2 or a deformable convolution V1;
alternatively, the deformable convolution comprises a hole convolution;
or, the deformable convolution comprises a combination of two deformable convolutions;
the flow module is a k × k deformable depth convolution and is used for performing weighted calculation on each region of the image, and k is a positive integer greater than 1;
the k × k deformable depth convolution is obtained by combining the k × k depth convolution with any one of the following convolution modes:
a deformable convolution V2 or a deformable convolution V1 or a hole convolution;
alternatively, the flow module comprises: a combination of a 1 × k deformable depth convolution and a k × 1 deformable depth convolution.
5. A method of model training, comprising:
obtaining a model training sample set, wherein the model training sample set comprises a plurality of fusion images, and each fusion image is obtained by fusing a frame of red, green and blue RGB image and a frame of depth image corresponding to the RGB image; the model training sample set comprises fusion images of at least two different fusion modes;
inputting the model training sample set into a machine learning network model, and training to obtain a first model;
inputting the model training sample set into a machine learning network model, and training to obtain a second model, wherein the first model and the second model are different models;
the first model is a compressed network SqueezeNet, the size of a face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement;
the SqueezeNet comprises a Fire Module and a flow module;
wherein the Fire Module comprises a Squeeze layer and an Expand layer;
the Squeeze layer and the Expand layer carry out convolution operations by using convolution kernels and deformable convolution kernels;
when the living body detection is carried out, inputting a fusion image obtained by a first fusion mode into the first model to obtain a first detection result; inputting a fusion image obtained by a second fusion mode into the first model or the second model to obtain a second detection result; obtaining a final living body detection result according to the first detection result and the second detection result; the second fusion mode is different from the first fusion mode.
6. The method of claim 5,
wherein the deformable convolution comprises: a deformable convolution V2 or a deformable convolution V1;
alternatively, the deformable convolution is a hole convolution;
or, the deformable convolution comprises a combination of two deformable convolutions;
the flow module is a k × k deformable depth convolution and is used for performing weighted calculation on each region of the image, and k is a positive integer greater than 1;
the k × k deformable depth convolution is obtained by combining the k × k depth convolution with any one of the following convolution modes:
a deformable convolution V2 or a deformable convolution V1 or a hole convolution;
alternatively, the flow module comprises: a combination of a 1 × k deformable depth convolution and a k × 1 deformable depth convolution.
7. The method of claim 5, wherein the fusion image is obtained in any one of the following modes:
retaining only the depth image to obtain a first single-channel image; or
mapping the depth image into a first color image, and superposing the first color image and the RGB image to obtain a three-channel image; or
retaining only the depth image to obtain a second single-channel image, and adding the second single-channel image to an Alpha channel of the RGB image to obtain a four-channel image; or
mapping the depth image into a second color image; or
converting the RGB image into a single-channel grayscale image, mapping the depth image into a second color image, and adding the single-channel grayscale image to an Alpha channel of the second color image to obtain a four-channel image.
8. An electronic device, comprising: a transceiver, a memory, a processor, and a program stored on the memory and executable on the processor; characterized in that:
the processor is configured to read the program in the memory to implement the steps in the method according to any one of claims 1 to 4, or to implement the steps in the method according to any one of claims 5 to 7.
9. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the steps in the method according to any one of claims 1 to 4, or implements the steps in the method according to any one of claims 5 to 7.
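
The fusion modes recited in claims 2 and 7 can be illustrated with the following minimal sketch, assuming 8-bit BGR images and single-channel depth maps handled with OpenCV and NumPy; the JET color map and the equal-weight superposition are illustrative assumptions, not requirements of the claims.

import cv2
import numpy as np

def depth_to_colormap(depth):
    # Normalize the depth map to 0-255 and map it to a three-channel color map.
    depth_u8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.applyColorMap(depth_u8, cv2.COLORMAP_JET)

def fuse_depth_only(depth):
    # Mode 1: keep only the depth map as a single-channel image.
    return depth if depth.ndim == 2 else depth[..., 0]

def fuse_colormap_plus_rgb(rgb, depth):
    # Mode 2: map the depth map to a color map and superpose it with the RGB map (three channels).
    return cv2.addWeighted(rgb, 0.5, depth_to_colormap(depth), 0.5, 0)

def fuse_depth_as_alpha(rgb, depth):
    # Mode 3: add the single-channel depth map as an Alpha channel of the RGB map (four channels).
    depth_u8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return np.dstack([rgb, depth_u8])

def fuse_depth_colormap_only(depth):
    # Mode 4: keep only the depth map mapped into a color map (three channels).
    return depth_to_colormap(depth)

def fuse_gray_as_alpha_of_colormap(rgb, depth):
    # Mode 5: gray-scale the RGB map and add it as an Alpha channel of the depth color map (four channels).
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    return np.dstack([depth_to_colormap(depth), gray])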
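
The improved SqueezeNet structure of claims 1, 4 and 6 can be sketched as follows, assuming PyTorch with torchvision.ops.DeformConv2d standing in for deformable convolution V1 and interpreting the k × k deformable depth convolution of the flow module as a depthwise convolution; the channel widths, the split of ordinary and deformable kernels between the Squeeze and Expand layers, and the zero-initialized offsets are assumptions for illustration only.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConv2d(nn.Module):
    # Deformable convolution whose sampling offsets are predicted by an ordinary convolution.
    def __init__(self, in_ch, out_ch, kernel_size, padding=0, groups=1):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * kernel_size * kernel_size, kernel_size, padding=padding)
        nn.init.zeros_(self.offset.weight)  # start with zero offsets, i.e. an ordinary convolution
        nn.init.zeros_(self.offset.bias)
        self.conv = DeformConv2d(in_ch, out_ch, kernel_size, padding=padding, groups=groups)

    def forward(self, x):
        return self.conv(x, self.offset(x))

class Fire(nn.Module):
    # Fire Module: a Squeeze layer followed by an Expand layer mixing ordinary and deformable kernels.
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(in_ch, squeeze_ch, 1), nn.ReLU(inplace=True))
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, 1)                      # ordinary kernel
        self.expand3x3 = DeformableConv2d(squeeze_ch, expand_ch, 3, padding=1)    # deformable kernel
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.squeeze(x)
        return self.relu(torch.cat([self.expand1x1(x), self.expand3x3(x)], dim=1))

class FlowModule(nn.Module):
    # Flow module: a k × k deformable depthwise convolution that re-weights each region of the image.
    def __init__(self, channels, k=3):
        super().__init__()
        self.depthwise = DeformableConv2d(channels, channels, k, padding=k // 2, groups=channels)

    def forward(self, x):
        return self.depthwise(x)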
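
The two-branch detection flow of claims 1 and 3 reduces to fusing the image pair in two different modes, scoring each fusion image with a model, and combining the two scores; the weighting values and the decision threshold below are hypothetical placeholders.

def combine_results(first_result, second_result, w1=0.6, w2=0.4, mode="weighted"):
    # Claim 3: either a weighted sum of the two results or their average.
    if mode == "weighted":
        return w1 * first_result + w2 * second_result
    return (first_result + second_result) / 2.0

def detect_living_body(rgb, depth, first_model, second_model, first_fusion, second_fusion, threshold=0.5):
    # Claim 1: fuse in a first and a second mode, score each fusion image, then combine the scores.
    first_result = float(first_model(first_fusion(rgb, depth)))
    second_result = float(second_model(second_fusion(rgb, depth)))
    final_result = combine_results(first_result, second_result)
    return final_result >= threshold, final_result

Here first_model and second_model are assumed to be callables that map a fusion image to a liveness score in [0, 1]; per claim 1, the second branch may reuse the first model instead of a distinct second model.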
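
The training procedure of claim 5 can be sketched as a standard supervised loop over fusion images and liveness labels; the optimizer, batch size, learning rate, and epoch count are illustrative assumptions.

import torch
from torch.utils.data import DataLoader

def train_model(model, dataset, epochs=10, lr=1e-3, device="cpu"):
    # Train a binary liveness classifier on a dataset of (fusion image tensor, label) pairs.
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model

The first model and the second model of claim 5 would each be obtained by running such a loop on the model training sample set, which contains fusion images of at least two different fusion modes.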
CN202110932017.0A 2019-11-27 2019-11-27 Living body detection and model training method, apparatus and medium Active CN113642466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110932017.0A CN113642466B (en) 2019-11-27 2019-11-27 Living body detection and model training method, apparatus and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110932017.0A CN113642466B (en) 2019-11-27 2019-11-27 Living body detection and model training method, apparatus and medium
CN201911186208.6A CN111881706B (en) 2019-11-27 2019-11-27 Living body detection, image classification and model training method, device, equipment and medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201911186208.6A Division CN111881706B (en) 2019-11-27 2019-11-27 Living body detection, image classification and model training method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113642466A CN113642466A (en) 2021-11-12
CN113642466B true CN113642466B (en) 2022-11-01

Family

ID=73154236

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110932017.0A Active CN113642466B (en) 2019-11-27 2019-11-27 Living body detection and model training method, apparatus and medium
CN201911186208.6A Active CN111881706B (en) 2019-11-27 2019-11-27 Living body detection, image classification and model training method, device, equipment and medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201911186208.6A Active CN111881706B (en) 2019-11-27 2019-11-27 Living body detection, image classification and model training method, device, equipment and medium

Country Status (1)

Country Link
CN (2) CN113642466B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926497A (en) * 2021-03-20 2021-06-08 杭州知存智能科技有限公司 Face recognition living body detection method and device based on multi-channel data feature fusion
CN114022871A (en) * 2021-11-10 2022-02-08 中国民用航空飞行学院 Unmanned aerial vehicle driver fatigue detection method and system based on depth perception technology
CN114419741B (en) * 2022-03-15 2022-07-19 深圳市一心视觉科技有限公司 Living body detection method, living body detection device, electronic apparatus, and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543697A (en) * 2018-11-16 2019-03-29 西北工业大学 A kind of RGBD images steganalysis method based on deep learning
CN109635770A (en) * 2018-12-20 2019-04-16 上海瑾盛通信科技有限公司 Biopsy method, device, storage medium and electronic equipment

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9767358B2 (en) * 2014-10-22 2017-09-19 Veridium Ip Limited Systems and methods for performing iris identification and verification using mobile devices
CN106778506A (en) * 2016-11-24 2017-05-31 重庆邮电大学 A kind of expression recognition method for merging depth image and multi-channel feature
CN106909905B (en) * 2017-03-02 2020-02-14 中科视拓(北京)科技有限公司 Multi-mode face recognition method based on deep learning
CN107358157B (en) * 2017-06-07 2020-10-02 创新先进技术有限公司 Face living body detection method and device and electronic equipment
KR101919090B1 (en) * 2017-06-08 2018-11-20 (주)이더블유비엠 Apparatus and method of face recognition verifying liveness based on 3d depth information and ir information
CN107832677A (en) * 2017-10-19 2018-03-23 深圳奥比中光科技有限公司 Face identification method and system based on In vivo detection
CN108171776B (en) * 2017-12-26 2021-06-08 浙江工业大学 Method for realizing image editing propagation based on improved convolutional neural network
CN108121978A (en) * 2018-01-10 2018-06-05 马上消费金融股份有限公司 A kind of face image processing process, system and equipment and storage medium
CN108171212A (en) * 2018-01-19 2018-06-15 百度在线网络技术(北京)有限公司 For detecting the method and apparatus of target
CN110458749B (en) * 2018-05-08 2021-12-28 华为技术有限公司 Image processing method and device and terminal equipment
US10930010B2 (en) * 2018-05-10 2021-02-23 Beijing Sensetime Technology Development Co., Ltd Method and apparatus for detecting living body, system, electronic device, and storage medium
CN108549886A (en) * 2018-06-29 2018-09-18 汉王科技股份有限公司 A kind of human face in-vivo detection method and device
CN109034102B (en) * 2018-08-14 2023-06-16 腾讯科技(深圳)有限公司 Face living body detection method, device, equipment and storage medium
CN109376667B (en) * 2018-10-29 2021-10-01 北京旷视科技有限公司 Target detection method and device and electronic equipment
CN109684925B (en) * 2018-11-21 2023-10-27 奥比中光科技集团股份有限公司 Depth image-based human face living body detection method and device
CN109948467A (en) * 2019-02-28 2019-06-28 中国科学院深圳先进技术研究院 Method, apparatus, computer equipment and the storage medium of recognition of face
CN109949438B (en) * 2019-03-20 2021-07-13 锦图计算技术(深圳)有限公司 Abnormal driving monitoring model establishing method and device and storage medium
CN109934195A (en) * 2019-03-21 2019-06-25 东北大学 A kind of anti-spoofing three-dimensional face identification method based on information fusion
CN110084134A (en) * 2019-04-03 2019-08-02 东华大学 A kind of face attendance checking system based on cascade neural network and Fusion Features
CN110399882A (en) * 2019-05-29 2019-11-01 广东工业大学 A kind of character detecting method based on deformable convolutional neural networks


Also Published As

Publication number Publication date
CN111881706A (en) 2020-11-03
CN111881706B (en) 2021-09-03
CN113642466A (en) 2021-11-12

Similar Documents

Publication Publication Date Title
EP3961484A1 (en) Medical image segmentation method and device, electronic device and storage medium
CN113642466B (en) Living body detection and model training method, apparatus and medium
EP3989104A1 (en) Facial feature extraction model training method and apparatus, facial feature extraction method and apparatus, device, and storage medium
CN109583449A (en) Character identifying method and Related product
CN111144215B (en) Image processing method, device, electronic equipment and storage medium
CN110781976B (en) Extension method of training image, training method and related device
CN109472193A (en) Method for detecting human face and device
CN108647696B (en) Picture color value determining method and device, electronic equipment and storage medium
CN107066980A (en) A kind of anamorphose detection method and device
CN112668638A (en) Image aesthetic quality evaluation and semantic recognition combined classification method and system
CN111967399A (en) Improved fast RCNN behavior identification method
CN114841974A (en) Nondestructive testing method and system for internal structure of fruit, electronic equipment and medium
CN112668675B (en) Image processing method and device, computer equipment and storage medium
CN114419087A (en) Focus image generation method and device, electronic equipment and storage medium
CN113870254A (en) Target object detection method and device, electronic equipment and storage medium
CN114091551A (en) Pornographic image identification method and device, electronic equipment and storage medium
CN112818774A (en) Living body detection method and device
CN112861586B (en) Living body detection, image classification and model training method, device, equipment and medium
CN116168438A (en) Key point detection method and device and electronic equipment
CN113255766B (en) Image classification method, device, equipment and storage medium
CN108805190B (en) Image processing method and device
CN113902044A (en) Image target extraction method based on lightweight YOLOV3
CN114913513A (en) Method and device for calculating similarity of official seal images, electronic equipment and medium
CN113298753A (en) Sensitive muscle detection method, image processing method, device and equipment
CN111651626A (en) Image classification method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant