CN112861586B - Living body detection, image classification and model training method, device, equipment and medium - Google Patents

Living body detection, image classification and model training method, device, equipment and medium

Info

Publication number
CN112861586B
CN112861586B (application CN201911186211.8A)
Authority
CN
China
Prior art keywords
deformable
image
depth
convolution
map
Prior art date
Legal status
Active
Application number
CN201911186211.8A
Other languages
Chinese (zh)
Other versions
CN112861586A (en)
Inventor
付华
赵立军
高砚
Current Assignee
Mashang Consumer Finance Co Ltd
Original Assignee
Mashang Consumer Finance Co Ltd
Priority date
Filing date
Publication date
Application filed by Mashang Consumer Finance Co Ltd filed Critical Mashang Consumer Finance Co Ltd
Priority to CN201911186211.8A
Publication of CN112861586A
Application granted
Publication of CN112861586B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a living body detection, image classification and model training method, device, equipment and medium, relates to the technical field of data processing, and aims to improve the speed of living body detection. The method comprises the following steps: acquiring a target face image group, wherein the target face image group comprises a frame of RGB (red, green, blue) image and a frame of depth image corresponding to the RGB image; fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image; and inputting the first fusion image into a first model to obtain a first living body detection result. The embodiment of the invention can improve the speed of living body detection.

Description

Living body detection, image classification and model training method, device, equipment and medium
Technical Field
The invention relates to the technical field of image processing, in particular to a method, a device, equipment and a medium for living body detection, image classification and model training.
Background
With the wide application of technologies such as face recognition and face unlocking in daily-life scenarios such as finance, access control and mobile devices, face anti-spoofing/living body detection (Face Anti-Spoofing) technology has gained more and more attention in recent years. Based on deeper and more complex deep neural network models, living body detection models running on the server side can reach an accuracy of 99%. With the increase of application scenarios, a living body detection model that runs in real time on a mobile terminal is needed.
Currently, mobile terminals mostly adopt an interactive mode to perform living body detection. However, this method requires the detected object to perform cooperative actions, which is time-consuming and thus affects the detection speed.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a medium for living body detection, image classification and model training, which are used for improving the speed of living body detection.
In a first aspect, an embodiment of the present invention provides a method for detecting a living body, including:
acquiring a target face image group, wherein the target face image group comprises a frame of RGB (Red, green, blue) image and a frame of depth image corresponding to the RGB image;
fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
and inputting the first fusion image into a first model to obtain a first living body detection result.
In a second aspect, an embodiment of the present invention further provides a model training method, including:
obtaining a model training sample set, wherein the model training sample set comprises a plurality of fusion images, and each fusion image is obtained by fusing a frame of RGB image and a frame of depth image corresponding to the RGB image;
and inputting the training sample set into a machine learning network model, and training to obtain a first model.
In a third aspect, an embodiment of the present invention further provides an image classification method, including:
acquiring a target image group, wherein the target image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image;
fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
and inputting the first fusion image into a first model to obtain an image classification result.
In a fourth aspect, an embodiment of the present invention further provides a living body detection apparatus, including:
the first acquisition module is configured to acquire a target face image group, wherein the target face image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image; the size of a face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement;
the first fusion module is used for fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
and the first processing module is used for inputting the first fusion image into a first model to obtain a first living body detection result.
In a fifth aspect, an embodiment of the present invention further provides a model training apparatus, including:
the first acquisition module is configured to acquire a model training sample set, wherein the model training sample set comprises a plurality of fusion images, and each fusion image is obtained by fusing a frame of RGB image and a frame of depth image corresponding to the RGB image;
and the training module is used for inputting the training sample set into a machine learning network model and training to obtain a first model.
In a sixth aspect, an embodiment of the present invention further provides an image classification apparatus, including:
the first acquisition module is configured to acquire a target image group, wherein the target image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image;
the first fusion module is used for fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
and the first processing module is used for inputting the first fusion image into a first model to obtain an image classification result.
In a seventh aspect, an embodiment of the present invention further provides an electronic device, including: a transceiver, a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the steps in the method according to the first aspect or the second aspect or the third aspect as described above when executing the program.
In an eighth aspect, the embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the method according to the first aspect, the second aspect, or the third aspect described above.
In the embodiment of the invention, a single-frame RGB image in the acquired target face image group and the corresponding depth image are fused, and the fused result is used as the input of the model to obtain the living body detection result. Therefore, with the method and device provided by the embodiment of the invention, the detected object does not need to perform cooperative actions, so that the detection speed is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of a method for detecting a living body according to an embodiment of the present invention;
FIG. 2 is a flowchart of selecting a target face image group according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an image fusion process provided by an embodiment of the invention;
FIG. 4 is a second flowchart of a method for detecting a living body according to an embodiment of the present invention;
FIG. 5 is a block diagram of three modules stacked into CNN Stem according to an embodiment of the present invention;
FIG. 6 is a third flowchart of a method for detecting a living body according to an embodiment of the present invention;
FIG. 7 is a block diagram of a Fire Module provided in accordance with an embodiment of the present invention;
FIG. 8 is a flow chart of a model training method provided by an embodiment of the present invention;
FIG. 9 is a flowchart of an image classification method provided by an embodiment of the invention;
FIG. 10 is a structural diagram of a living body detection apparatus provided in an embodiment of the present invention;
FIG. 11 is a block diagram of a model training apparatus according to an embodiment of the present invention;
FIG. 12 is a block diagram of an image classification apparatus provided in an embodiment of the present invention;
FIG. 13 is a block diagram of an electronic device according to an embodiment of the present invention;
FIG. 14 is a second block diagram of an electronic device according to an embodiment of the invention;
FIG. 15 is a third structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a living body detection method according to an embodiment of the present invention, which is applied to an electronic device, such as a mobile terminal. As shown in fig. 1, the method comprises the following steps:
step 101, obtaining a target face image group, wherein the target face image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image.
In the embodiment of the invention, the target face image group can be acquired through the camera provided by the electronic equipment. In practical application, a plurality of face image groups can be acquired through a camera provided by electronic equipment. In the embodiment of the invention, in order to improve the accuracy of judgment, the size of a face area in an RGB image in a target face image group is required to meet a first preset requirement, and the depth of a depth map is required to meet a second preset requirement. The first preset requirement and the second preset requirement can be set according to needs.
For example, the first preset requirement may be that the size of the face region is greater than a certain preset value, and the second preset requirement may be that the depth is greater than a certain preset value.
Thus, prior to step 101, the method may further comprise: the method comprises the steps of obtaining a face image group to be detected, wherein the face image group to be detected comprises a frame of RGB image and a frame of depth image corresponding to the RGB image, and then selecting a target face image group from the face image group to be detected.
Referring to fig. 2, a process of selecting the target face image group is shown. For one frame of RGB image in the obtained face image group to be detected and the corresponding frame of depth image, it is first judged whether a face area exists in the RGB image. If so, the subsequent processing continues; otherwise, the face image group can be acquired again. When a face area exists, the face area in the RGB image is determined, and it is judged whether the size of the face area meets the requirement. If it does, the subsequent processing continues; otherwise, the face image group is acquired again. When the size of the face area meets the preset requirement, the face area is cropped out of the RGB image. Within the cropped face area, the pixel positions of the RGB image and the depth image correspond to each other one by one. It is then judged whether the depth of the cropped face area meets the requirement. If it does, the subsequent processing continues; otherwise, the face image group can be acquired again. At the same time, it is judged whether the face in the cropped face area is occluded. If not, the subsequent processing continues; otherwise, the face image group can be acquired again. If the face is not occluded and the depth of the cropped face area meets the preset requirement, the cropped images can be used as the target face image group for subsequent processing.
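A minimal Python sketch of this selection flow is given below. The helper routines capture_image_group, detect_face and face_is_occluded are hypothetical placeholders, and the two threshold values merely stand in for the first and second preset requirements; none of these names or numbers come from the patent text.

```python
# Minimal sketch of the target-face-group selection flow described above.
import numpy as np

MIN_FACE_SIZE = 96       # assumed "first preset requirement" (pixels)
MIN_MEAN_DEPTH = 200.0   # assumed "second preset requirement" (depth units)

def select_target_group(capture_image_group, detect_face, face_is_occluded):
    while True:
        rgb, depth = capture_image_group()      # one RGB frame + aligned depth frame
        box = detect_face(rgb)                  # returns (x, y, w, h) or None
        if box is None:
            continue                            # no face: re-acquire the image group
        x, y, w, h = box
        if min(w, h) < MIN_FACE_SIZE:
            continue                            # face region too small
        face_rgb = rgb[y:y + h, x:x + w]
        face_depth = depth[y:y + h, x:x + w]    # pixel positions correspond one by one
        valid = face_depth[face_depth > 0]
        if valid.size == 0 or valid.mean() < MIN_MEAN_DEPTH:
            continue                            # depth does not meet the requirement
        if face_is_occluded(face_rgb):
            continue                            # face is occluded: re-acquire
        return face_rgb, face_depth             # the target face image group
```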
And step 102, fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image.
Referring to fig. 3, in an embodiment of the present invention, the fusion manner may include the following:
(1) Only the Depth map is reserved, obtaining a single-channel map (denoted as A, Depth(1));
(2) The Depth map is mapped into a Color map (denoted as B), and the Color map and the RGB map are superposed (for example, superposed with different weights) to obtain a three-channel map (Depth(3) + Color(3));
(3) Only the depth map is reserved to obtain a single-channel map; the single-channel map is added as an Alpha channel to the RGB map to obtain a four-channel map (Color(3) + Depth(A));
(4) The Depth map is mapped into a color map (denoted as B) (Depth(3));
(5) The RGB map is converted into a single-channel gray-scale map, and the depth map is mapped into a color map; the single-channel gray-scale map is added as an Alpha channel to the Color map, resulting in a four-channel map (Depth(3) + Color(A)).
Accordingly, in this step, the first fusion mode may be any one of the above fusion modes. Specifically, the RGB map and the depth map are fused in the first fusion mode according to any one of the following manners to obtain the first fused image:
only the depth map is reserved to obtain a first single-channel map; or
Mapping the depth map into a first color map, and superposing the first color map and the RGB map to obtain a three-channel map; or
Only the depth map is reserved to obtain a second single channel map; adding the second single-channel image to an Alpha channel of the RGB image to obtain a four-channel image; or
Mapping the depth map into a second color map; or
Converting the RGB map to a single channel grayscale map, mapping the depth map to a second color map; and adding the single-channel gray image to an Alpha channel of the second color image to obtain a four-channel image.
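For illustration only, the five fusion modes can be sketched with OpenCV and NumPy as follows. The inputs are assumed to be an 8-bit H×W×3 RGB map and an 8-bit single-channel depth map that are already aligned and scaled to 0-255; the JET color map and the 0.5/0.5 superposition weights are arbitrary choices, not values given in the patent.

```python
# Illustrative sketch of the five RGB/depth fusion modes described above.
import cv2
import numpy as np

def fuse(rgb, depth, mode, w_rgb=0.5, w_depth=0.5):
    if mode == 1:                                   # (1) depth only -> 1 channel
        return depth
    if mode == 2:                                   # (2) colorized depth + RGB -> 3 channels
        depth_color = cv2.applyColorMap(depth, cv2.COLORMAP_JET)
        return cv2.addWeighted(rgb, w_rgb, depth_color, w_depth, 0)
    if mode == 3:                                   # (3) depth as Alpha channel of RGB -> 4 channels
        return np.dstack([rgb, depth])
    if mode == 4:                                   # (4) colorized depth alone -> 3 channels
        return cv2.applyColorMap(depth, cv2.COLORMAP_JET)
    if mode == 5:                                   # (5) colorized depth + gray RGB as Alpha -> 4 channels
        gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
        depth_color = cv2.applyColorMap(depth, cv2.COLORMAP_JET)
        return np.dstack([depth_color, gray])
    raise ValueError("unknown fusion mode")
```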
And 103, inputting the first fusion image into a first model to obtain a first living body detection result.
In the embodiment of the present invention, the first model may be, for example, one of FeatherNetA (Feather Network A), FeatherNetB, MobileNet, ShuffleNet, EfficientNet, and SqueezeNet.
In the embodiment of the present invention, FeatherNet is taken as an example and is modified to serve as the first model. Thus, in the embodiments of the present invention, FeatherNet may refer to the modified FeatherNet.
The CNN Stem (convolutional neural network backbone) of the FeatherNet of the embodiment of the invention comprises a Deformable Depthwise Convolution (DDWConv); a Deformable Depthwise Convolution is also included in the Streaming Module of FeatherNet.
The Deformable Depthwise Convolution is obtained by combining Depthwise Convolution (DWConv) with Deformable Convolution (for example, Deformable Convolution V2, the second version of Deformable Convolution).
Alternatively, in practical application, the CNN Stem of FeatherNet is a 3 × 3 Deformable Depthwise Convolution, which is obtained by combining the 3 × 3 Depthwise Convolution with any one of the following Convolution modes: Deformable Convolution, or Dilated Convolution.
Alternatively, the CNN Stem of FeatherNet is a combination of a 1 × 3 DDWConv and a 3 × 1 DDWConv.
The Streaming Module of FeatherNet is a k × k Deformable Depthwise Convolution, which is obtained by combining the k × k Depthwise Convolution with any one of the following Convolution modes: Deformable Convolution, or Dilated Convolution.
Specifically, the k × k Deformable Depthwise Convolution may be a 7 × 7 Deformable Depthwise Convolution, and correspondingly the k × k Depthwise Convolution is a 7 × 7 Depthwise Convolution.
Alternatively, the Streaming Module of FeatherNet is a combination of a 1 × k Deformable Depthwise Convolution and a k × 1 Deformable Depthwise Convolution. Specifically, the 1 × k Deformable Depthwise Convolution may be a 1 × 7 Deformable Depthwise Convolution, and the k × 1 Deformable Depthwise Convolution may be a 7 × 1 Deformable Depthwise Convolution.
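The following is a minimal, illustrative PyTorch sketch of such a Deformable Depthwise Convolution: a depthwise convolution (groups equal to the number of channels) whose sampling offsets and modulation weights are predicted by a small side branch, in the spirit of Deformable Convolution V2. It is a reconstruction for illustration, not the patented network definition, and it assumes a torchvision version whose DeformConv2d forward() accepts a mask argument.

```python
# Illustrative sketch of a 3x3 Deformable Depthwise Convolution (DDWConv).
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DDWConv(nn.Module):
    def __init__(self, channels, kernel_size=3, stride=1):
        super().__init__()
        k, pad = kernel_size, kernel_size // 2
        # 2*k*k offset channels (x and y per sampling point) + k*k modulation weights
        self.offset_mask = nn.Conv2d(channels, 3 * k * k, k, stride, pad)
        self.dconv = DeformConv2d(channels, channels, k, stride, pad, groups=channels)
        self.split = 2 * k * k

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = om[:, :self.split], om[:, self.split:]
        return self.dconv(x, offset, torch.sigmoid(mask))

# e.g. DDWConv(32)(torch.randn(1, 32, 56, 56)) keeps the (1, 32, 56, 56) shape
```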
In the embodiment of the invention, a single-frame RGB image in the acquired target face image group and the corresponding depth image are fused, and the fused result is used as the input of the model to obtain the living body detection result. Therefore, with the method and device provided by the embodiment of the invention, the detected object does not need to perform cooperative actions, so that the detection speed is improved.
Referring to fig. 4, fig. 4 is a flowchart of a living body detection method according to an embodiment of the present invention, which is applied to an electronic device, such as a mobile terminal. As shown in fig. 4, the method comprises the following steps:
step 401, train the first model.
Wherein the first model may comprise FeatherNet.
Taking FeatherNet as an example, in the embodiment of the present invention, FeatherNet is modified to obtain the modified FeatherNet. In the embodiment of the invention, FeatherNet is mainly formed by connecting a CNN Stem network and a Streaming Module. According to the difference in the CNN Stem, it can be divided into two FeatherNet types: FeatherNetA and FeatherNetB.
The CNN Stem of FeatherNet comprises a 3 × 3 Deformable Depthwise Convolution; a 7 × 7 Deformable Depthwise Convolution is included in the Streaming Module of FeatherNet. The Deformable Depthwise Convolution is obtained by combining a 3 × 3 Depthwise Convolution (DWConv) with Deformable Convolution V2. Alternatively, in practical application, the 3 × 3 Deformable Depthwise Convolution is obtained by combining the 3 × 3 Depthwise Convolution with any one of the following Convolution modes: Deformable Convolution, or Dilated Convolution (hole Convolution); or the CNN Stem of FeatherNet is a combination of a 1 × 3 Deformable Depthwise Convolution and a 3 × 1 Deformable Depthwise Convolution;
the Streaming Module of FeatherNet is a 7 × 7 Deformable Depthwise Convolution, which is obtained by combining the 7 × 7 Depthwise Convolution with any one of the following Convolution modes: Deformable Convolution, or Dilated Convolution (hole Convolution); or the Streaming Module of FeatherNet is a combination of a 1 × k Deformable Depthwise Convolution and a k × 1 Deformable Depthwise Convolution, for example, the 1 × k Deformable Depthwise Convolution is a 1 × 7 Deformable Depthwise Convolution and the k × 1 Deformable Depthwise Convolution is a 7 × 1 Deformable Depthwise Convolution, where k is a positive integer greater than 1.
FIG. 5 is a block diagram of three blocks, namely Block A, Block B and Block C, stacked into the CNN Stem. Specifically, referring to FIG. 5, in the embodiment of the present invention, in the three modules, the 3 × 3 DWConv in the original CNN Stem is combined with Deformable Convolution V2 to obtain the Deformable Depthwise Convolution (DDWConv): that is, 3 more dimensions are added to the 3 × 3 DWConv for learning the offsets (in the x and y directions) and the weight term of each sampling position, respectively. Meanwhile, the 3 × 3 DWConv is replaced with the 3 × 3 DDWConv.
Meanwhile, the 7 × 7 DWConv in the original Streaming Module is replaced by a 7 × 7 DDWConv, which preserves the design intention of the Streaming Module and allows an Effective Receptive Field to be learned.
Deformable Convolution V2 adds a weight term on the basis of the V1 version and achieves a better effect.
In this step, a model training sample set is obtained, where the model training sample set includes a plurality of fusion images, where each fusion image is obtained by performing fusion processing on a frame of RGB image and a frame of depth image corresponding to the RGB image, and then the training sample set is input into a machine learning network model to train to obtain the first model.
Step 402, obtaining a face image group to be detected, wherein the face image group to be detected comprises a frame of RGB image and a frame of depth image corresponding to the RGB image.
And 403, selecting a target face image group from the face image group to be detected. The target face image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image.
And step 404, fusing the RGB image and the depth image of the target face image group in a first fusion mode to obtain a first fusion image.
Step 405, inputting the first fusion image into the first model to obtain a first living body detection result.
In the embodiment of the present invention, the first living body detection result may be a numerical value. By comparing the value with a preset threshold value, whether a real face image is included can be determined. In addition, if the value of the first living body detection result meets a preset requirement, for example, the value is within a certain value range, subsequent cascade judgment can be performed in order to improve the accuracy of the detection result.
Step 406, fusing the RGB image and the depth image in a second fusion mode to obtain a second fusion image; the second fusion mode is different from the first fusion mode.
The specific contents of the first fusion mode and the second fusion mode can refer to the description of the foregoing embodiments.
And 407, inputting the second fusion image into the first model or the second model to obtain a second living body detection result.
Wherein the first model and the second model are different models. The second model may be, for example, a SqueezeNet. In practical applications, the second model may also be trained in advance.
And step 408, obtaining a final living body detection result according to the first living body detection result and the second living body detection result.
In the embodiment of the present invention, the second living body detection result may also be a numerical value. The first and second living body detection results are then combined by an operation, and the result of the operation is taken as the final living body detection result.
The operation comprises any one of the following: calculating the product of the first living body detection result and a first weighting value, calculating the product of the second living body detection result and a second weighting value, and summing the obtained products; or calculating the average value of the first and second living body detection results. Of course, there may be other calculation manners in practical application, and the embodiments of the present invention are not limited thereto.
The obtained operation value is compared with a certain preset value to determine whether a real face image is included.
After the first living body detection result is obtained, the second living body detection result is obtained, and the first and second living body detection results are integrated to obtain the final living body detection result. Through this cascade detection, the accuracy of the detection result can be improved.
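A minimal sketch of this score combination is shown below, assuming both detection results are scalar scores in [0, 1]; the weights and the decision threshold are illustrative values, not values given in the patent.

```python
# Minimal sketch of combining the two liveness scores into a final result.
def final_liveness(score1, score2, w1=0.6, w2=0.4, threshold=0.5, use_average=False):
    fused = (score1 + score2) / 2 if use_average else w1 * score1 + w2 * score2
    return fused, fused >= threshold   # (final score, judged to be a real face)
```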
In the embodiment of the invention, a single-frame RGB image in the acquired target face image group and the corresponding depth image are fused, and the fused result is used as the input of the model to obtain the living body detection result. Therefore, with the solution provided by the embodiment of the invention, the detected object does not need to perform cooperative actions, so that the detection speed is improved. In addition, since FeatherNet is adopted in the scheme of the embodiment of the invention and the model is very small, the scheme is suitable for being deployed on mobile terminals.
Referring to fig. 6, fig. 6 is a flowchart of a method for detecting a living body according to an embodiment of the present invention, which is applied to an electronic device, such as a mobile terminal. As shown in fig. 6, the method comprises the following steps:
step 601, training a first model.
Wherein the first model may comprise SqueezeNet, and the like.
Taking SqueezeNet as an example, in the embodiment of the invention, SqueezeNet is improved to obtain the improved SqueezeNet. In the embodiment of the present invention, the SqueezeNet includes a Fire Module and a Streaming Module.
FIG. 7 is a block diagram of the Fire Module in an embodiment of the present invention. The Fire Module comprises a Squeeze layer, an Expand layer and a BatchNorm layer. The function of the Squeeze layer and the Expand layer is the same as in the prior art; the difference is that the Squeeze layer and the Expand layer perform the Convolution operation by using a 1 × 1 Convolution kernel and a Deformable Depthwise Convolution (DDWConv). The BatchNorm layer is used to make the model converge; by converging the model, the speed of obtaining an accurate model can be increased. The Deformable Depthwise Convolution is obtained by combining a 3 × 3 Depthwise Convolution (DWConv) with Deformable Convolution V2. Alternatively, in practical application, the Deformable Depthwise Convolution is obtained by combining the 3 × 3 DWConv with any one of the following Convolution kernels: Deformable Convolution V1, Dilated Convolution, or a combination of 1 × 3 and 3 × 1 Convolution kernels.
The Streaming Module is used for carrying out weighting calculation on each region of the image to be processed, so that the accuracy of the model can be improved.
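A minimal sketch of such a modified Fire Module is given below. DDWConv refers to the deformable depthwise convolution class sketched earlier; the channel sizes and the extra 1 × 1 projection after the depthwise branch are assumptions made so that the two expand branches produce the same number of channels, and are not taken from the patent.

```python
# Illustrative sketch of the modified Fire Module: 1x1 squeeze, an expand stage
# mixing a 1x1 convolution with a deformable depthwise convolution, and a
# BatchNorm layer to help the model converge.
import torch
import torch.nn as nn

class FireModule(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(in_ch, squeeze_ch, 1), nn.ReLU(inplace=True))
        self.expand_1x1 = nn.Conv2d(squeeze_ch, expand_ch, 1)
        self.expand_ddw = nn.Sequential(
            DDWConv(squeeze_ch),                  # deformable depthwise branch (class sketched earlier)
            nn.Conv2d(squeeze_ch, expand_ch, 1),  # pointwise projection (assumed)
        )
        self.bn = nn.BatchNorm2d(2 * expand_ch)

    def forward(self, x):
        s = self.squeeze(x)
        out = torch.cat([self.expand_1x1(s), self.expand_ddw(s)], dim=1)
        return torch.relu(self.bn(out))
```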
In this step, a model training sample set is obtained, where the model training sample set includes multiple fusion images, where each fusion image is obtained by fusing a frame of RGB image and a frame of depth image corresponding to the RGB image, and then the training sample set is input into a machine learning network model to train and obtain the first model.
Step 602, a face image group to be detected is obtained, wherein the face image group to be detected comprises a frame of RGB image and a frame of depth image corresponding to the RGB image.
Step 603, selecting a target face image group from the face image group to be detected. The target face image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image; the size of the face area in the RGB image meets a first preset requirement, and the depth of the depth image meets a second preset requirement.
And step 604, fusing the RGB image and the depth image of the target face image group in a first fusion mode to obtain a first fusion image.
Step 605, inputting the first fusion image into the first model to obtain a first living body detection result.
In the embodiment of the present invention, the first living body detection result may be a numerical value. By comparing the value with a preset threshold value, whether a real face image is included can be determined. In addition, if the value of the first living body detection result meets a preset requirement, for example, the value is within a certain value range, subsequent cascade judgment can be performed in order to improve the accuracy of the detection result.
Step 606, fusing the RGB image and the depth image in a second fusion mode to obtain a second fusion image; the second fusion mode is different from the first fusion mode.
The specific contents of the first fusion mode and the second fusion mode can refer to the description of the foregoing embodiments.
And 607, inputting the second fusion image into the first model or the second model to obtain a second living body detection result.
Wherein the first model and the second model are different models. The second model can be, for example, FeatherNet, MobileNet, ShuffleNet, EfficientNet, and the like. In practical applications, the second model may also be trained in advance.
And step 608, obtaining a final living body detection result according to the first living body detection result and the second living body detection result.
In the embodiment of the present invention, the second living body detection result may also be a numerical value. The first and second living body detection results are then combined by an operation, and the result of the operation is taken as the final living body detection result.
The operation comprises any one of the following: calculating the product of the first living body detection result and a first weighting value, calculating the product of the second living body detection result and a second weighting value, and summing the obtained products; or calculating the average value of the first and second living body detection results. Of course, there may be other calculation manners in practical application, and the embodiments of the present invention are not limited thereto.
The obtained operation value is compared with a certain preset value to determine whether a real face image is included.
After the first living body detection result is obtained, the second living body detection result is obtained, and the first and second living body detection results are integrated to obtain the final living body detection result. Through this cascade detection, the accuracy of the detection result can be improved.
In the embodiment of the invention, a single-frame RGB image in the acquired target face image group and the corresponding depth image are fused, and the fused result is used as the input of the model to obtain the living body detection result. Therefore, with the solution provided by the embodiment of the invention, the detected object does not need to perform cooperative actions, so that the detection speed is improved. In addition, since SqueezeNet is adopted in the scheme of the embodiment of the invention and the model is very small, the scheme is suitable for being deployed on mobile terminals.
Referring to fig. 8, fig. 8 is a flowchart of a model training method according to an embodiment of the present invention. As shown in fig. 8, the method comprises the following steps:
step 801, obtaining a model training sample set, wherein the model training sample set comprises a plurality of fusion images, and each fusion image is obtained by fusing a frame of RGB image and a frame of depth image corresponding to the RGB image.
In this step, an image to be processed may be acquired, and then a label is added to the image to be processed. The image to be processed comprises a frame of RGB image and a frame of depth image corresponding to the RGB image. When labeling, both the RGB map and the depth map may be labeled, or only the RGB map or the depth map may be labeled. The annotation is used for indicating whether a real face image exists in the image. And then, fusing the RGB image and the depth image to obtain a fused image. The fusion mode can refer to the description of the foregoing embodiments.
In the method, an α-balanced focal loss (α-Balanced Focal Loss) is used as the loss function for training the classification model, and labels are added to the image to be processed, so that the problems of unbalanced class distribution and of unevenly distributed easy and hard training samples can be effectively alleviated, and the generalization capability and accuracy of the model are improved.
The balanced cross-entropy (focal) loss function is calculated as follows:
FL(p_t) = -α_t · (1 - p_t)^γ · log(p_t)
where FL is a cross-entropy loss function with a dynamically adjustable scale and contains two parameters, α_t and γ: α_t mainly addresses the imbalance between positive and negative samples, and γ mainly addresses the imbalance between easy and hard samples.
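A minimal PyTorch sketch of this α-balanced focal loss for the binary live/spoof case is given below; α = 0.25 and γ = 2.0 are common defaults, not values specified in the patent.

```python
# Minimal sketch of the alpha-balanced focal loss for binary classification.
# Assumes raw logits and 0/1 targets of the same shape.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    targets = targets.float()
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)                  # probability of the true class
    alpha_t = torch.where(targets == 1, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()          # FL = -a_t (1 - p_t)^gamma log(p_t)
```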
And step 802, inputting the training sample set into a machine learning network model, and training to obtain a first model.
In an embodiment of the present invention, the first model includes one of FeatherNet, MobileNet, ShuffleNet, EfficientNet, and SqueezeNet.
Taking FeatherNet as an example, calculation is performed by using the Deformable Depthwise Convolution in the convolutional neural network backbone (CNN Stem) of FeatherNet, and by using the Deformable Depthwise Convolution in the Streaming Module of FeatherNet; the Deformable Depthwise Convolution is obtained by combining the Depthwise Convolution with the Deformable Convolution. Alternatively, the CNN Stem of FeatherNet is a 3 × 3 Deformable Depthwise Convolution, which is obtained by combining the 3 × 3 Depthwise Convolution with any one of the following Convolution modes: Deformable Convolution, or Dilated Convolution (hole Convolution); or the CNN Stem of FeatherNet is a combination of a 1 × 3 Deformable Depthwise Convolution and a 3 × 1 Deformable Depthwise Convolution.
The Streaming Module of FeatherNet is a 7 × 7 Deformable Depthwise Convolution, which is obtained by combining the 7 × 7 Depthwise Convolution with any one of the following Convolution modes: Deformable Convolution, or Dilated Convolution (hole Convolution).
Alternatively, the Streaming Module of FeatherNet is a combination of a 1 × k Deformable Depthwise Convolution and a k × 1 Deformable Depthwise Convolution. Specifically, the 1 × k Deformable Depthwise Convolution may be a 1 × 7 Deformable Depthwise Convolution, and the k × 1 Deformable Depthwise Convolution may be a 7 × 1 Deformable Depthwise Convolution, where k is a positive integer greater than 1.
By combining DWConv with Deformable Convolution V2 in FeatherNet, the Convolution kernel is made to concentrate on a more effective receptive area, which strengthens the feature extraction of the model and improves the model accuracy; moreover, DWConv reduces the model size, making the model more suitable for use on mobile terminals.
On the basis of the above embodiment, pruning and retraining can be performed on the trained model, so that the model size is further reduced.
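A minimal sketch of such pruning with torch.nn.utils.prune is shown below; the 30% sparsity and the choice of L1 unstructured pruning are assumptions, and the fine-tuning (retraining) loop itself is omitted.

```python
# Minimal sketch of magnitude pruning followed by retraining.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_model(model, amount=0.3):
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=amount)  # zero the smallest weights
    # ... fine-tune (retrain) the pruned model here, then make pruning permanent:
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.remove(module, "weight")
    return model
```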
As can be seen from the above description, in the embodiment of the present invention, the single-frame RGB image and the depth image are fused in multiple ways, so that the processing speed is increased, and the accuracy of the detection result is improved through the cascade judgment. The FeatherNet model in the embodiment of the invention is small and is therefore suitable for running on a mobile terminal. Meanwhile, in the process of obtaining the FeatherNet model, DWConv and Deformable Convolution V2 are combined, so that the Convolution kernel concentrates on a more effective receptive area to enhance the feature extraction of the model, the accuracy of the model is improved, and the model size can be reduced.
Referring to fig. 9, fig. 9 is a flowchart of an image classification method according to an embodiment of the present invention. As shown in fig. 9, the method comprises the following steps:
step 901, obtaining a target image group, wherein the target image group includes a frame of RGB image and a frame of depth image corresponding to the RGB image.
The target image group may be an image including any content, such as a human face, a landscape, and the like.
And 902, fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image.
The fusion mode can be referred to the description of the previous embodiment.
And 903, inputting the first fusion image into a first model to obtain an image classification result.
Different image classification results can be obtained for different classification targets. For example, the classification result may distinguish images containing a face from images not containing a face, or images containing a landscape from images not containing a landscape, and so on; the image classification method can be applied to the field of living body detection as well as to other fields. For the first model, reference may be made to the structure of the FeatherNet model and the corresponding training process in the previous embodiments.
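A minimal inference sketch is shown below, assuming the trained first model outputs class logits and the fused image is an H×W×C array; the preprocessing here is simplified to a type conversion and is not prescribed by the patent.

```python
# Minimal sketch: classify one fused image with the trained first model.
import torch

def classify(model, fused_image, device="cpu"):
    x = torch.from_numpy(fused_image).float().permute(2, 0, 1).unsqueeze(0).to(device)
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)
    return probs.argmax(dim=1).item(), probs.max().item()   # (class index, confidence)
```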
In the embodiment of the invention, the acquired single-frame RGB image and the corresponding depth image are fused, and the fused result is used as the input of the model, so that the image classification result is obtained. Therefore, the speed of image classification is improved by using the device provided by the embodiment of the invention.
The embodiment of the invention also provides a living body detection apparatus. Referring to fig. 10, fig. 10 is a structural diagram of a living body detection apparatus according to an embodiment of the present invention. Since the principle by which the living body detection apparatus solves the problem is similar to that of the living body detection method in the embodiment of the present invention, the implementation of the apparatus can refer to the implementation of the method, and the repeated description is omitted.
As shown in fig. 10, the living body detecting apparatus includes: a first obtaining module 1001, configured to obtain a target face image group, where the target face image group includes a frame of RGB image and a frame of depth image corresponding to the RGB image; a first fusion module 1002, configured to fuse the RGB map and the depth map in a first fusion manner to obtain a first fusion image; the first processing module 1003 is configured to input the first fused image into a first model, so as to obtain a first living body detection result.
Optionally, the first fusion module 1002 fuses the RGB map and the depth map in a first fusion manner according to any one of the following manners to obtain a first fusion image:
only the depth map is reserved to obtain a first single-channel map; or,
mapping the depth map into a first color map, and superposing the first color map and the RGB map to obtain a three-channel map; or,
only the depth map is reserved to obtain a second single-channel map; adding the second single-channel image to an Alpha channel of the RGB image to obtain a four-channel image; or,
mapping the depth map into a second color map; or,
converting the RGB map to a single channel grayscale map, mapping the depth map to a second color map; and adding the single-channel gray-scale image to an Alpha channel of the second color image to obtain a four-channel image.
Optionally, the apparatus may further include:
the second fusion module is used for fusing the RGB image and the depth image in a second fusion mode to obtain a second fusion image; the second fusion mode is different from the first fusion mode;
the second processing module is used for inputting the second fusion image into the first model or the second model to obtain a second living body detection result; wherein the first model and the second model are different models;
and the third processing module is used for obtaining a final living body detection result according to the first living body detection result and the second living body detection result.
Optionally, the third processing module is configured to perform an operation on the first living body detection result and the second living body detection result, and use the operation result as the final living body detection result;
the operation comprises any one of the following:
calculating the product of the first living body detection result and a first weighting value, calculating the product of the second living body detection result and a second weighting value, and summing the obtained products; or
calculating the average value of the first and second living body detection results.
Optionally, the apparatus may further include:
the second acquisition module is used for acquiring a face image group to be detected, wherein the face image group to be detected comprises a frame of RGB image and a frame of depth image corresponding to the RGB image;
and the selection module is used for selecting the target face image group from the face image group to be detected.
The meaning of the first model can be referred to the description of the method embodiments described above.
The apparatus provided in the embodiment of the present invention may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
The embodiment of the invention also provides a model training device. Referring to fig. 11, fig. 11 is a structural diagram of a model training apparatus according to an embodiment of the present invention. Because the principle of solving the problem of the model training device is similar to the model training method in the embodiment of the invention, the implementation of the model training device can refer to the implementation of the method, and repeated details are not repeated.
As shown in fig. 11, the model training apparatus includes: a first obtaining module 1101, configured to obtain a model training sample set, where the model training sample set includes multiple fusion images, and each fusion image is obtained by fusing a frame of RGB image and a frame of depth image corresponding to the RGB image; the training module 1102 is configured to input the training sample set into a machine learning network model, and train to obtain a first model.
Optionally, calculation is performed by using the Deformable Depthwise Convolution in the convolutional neural network backbone (CNN Stem) of FeatherNet;
and by using the Deformable Depthwise Convolution in the Streaming Module of FeatherNet;
wherein the CNN Stem of FeatherNet is a 3 × 3 Deformable Depthwise Convolution, which is obtained by combining the 3 × 3 Depthwise Convolution with any one of the following Convolution modes: Deformable Convolution, or Dilated Convolution (hole Convolution); or the CNN Stem of FeatherNet is a combination of a 1 × 3 Deformable Depthwise Convolution and a 3 × 1 Deformable Depthwise Convolution.
The Streaming Module of FeatherNet is a k × k Deformable Depthwise Convolution, which is obtained by combining the k × k Depthwise Convolution with any one of the following Convolution modes: Deformable Convolution, or Dilated Convolution (hole Convolution); or the Streaming Module of FeatherNet is a combination of a 1 × k Deformable Depthwise Convolution and a k × 1 Deformable Depthwise Convolution, where k is a positive integer greater than 1.
Specifically, the k × k Deformable Depthwise Convolution may be a 7 × 7 Deformable Depthwise Convolution; correspondingly, the 1 × k Deformable Depthwise Convolution may be a 1 × 7 Deformable Depthwise Convolution, and the k × 1 Deformable Depthwise Convolution may be a 7 × 1 Deformable Depthwise Convolution.
Optionally, the fused image is obtained by any one of the following methods:
only the depth map is reserved to obtain a first single-channel map; or,
mapping the depth map into a first color map, and superposing the first color map and the RGB map to obtain a three-channel map; or,
only the depth map is reserved to obtain a second single channel map; adding the second single-channel image to an Alpha channel of the RGB image to obtain a four-channel image; or,
mapping the depth map into a second color map; or,
converting the RGB map to a single channel grayscale map, mapping the depth map to a second color map; and adding the single-channel gray-scale image to an Alpha channel of the second color image to obtain a four-channel image.
The apparatus provided in the embodiment of the present invention may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
The embodiment of the invention also provides an image classification device. Referring to fig. 12, fig. 12 is a structural diagram of an image classification apparatus according to an embodiment of the present invention. Because the principle of the image classification device for solving the problems is similar to the image classification method in the embodiment of the invention, the implementation of the image classification device can be referred to the implementation of the method, and repeated details are not repeated.
As shown in fig. 12, the image classification apparatus includes: a first obtaining module 1201, configured to obtain a target image group, where the target image group includes a frame of RGB image and a frame of depth image corresponding to the frame of RGB image; a first fusion module 1202, configured to fuse the RGB map and the depth map in a first fusion manner to obtain a first fusion image; the first processing module 1203 is configured to input the first fused image into a first model, so as to obtain an image classification result.
The meaning of the first model can be referred to the description of the method embodiments described above.
The apparatus provided in the embodiment of the present invention may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
As shown in fig. 13, the electronic device according to the embodiment of the present invention includes: a processor 1300, for reading the program in the memory 1320, executes the following processes:
acquiring a target face image group, wherein the target face image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image;
fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
and inputting the first fusion image into a first model to obtain a first living body detection result.
A transceiver 1310 for receiving and transmitting data under the control of the processor 1300.
In fig. 13, among other things, the bus architecture may include any number of interconnected buses and bridges with various circuits being linked together, particularly one or more processors represented by processor 1300 and memory represented by memory 1320. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art and therefore will not be described further herein. The bus interface provides an interface. The transceiver 1310 can be a number of elements, including a transmitter and a receiver, that provide a means for communicating with various other apparatus over a transmission medium. The processor 1300 is responsible for managing the bus architecture and general processing, and the memory 1320 may store data used by the processor 1300 in performing operations.
The processor 1300 is further configured to read the program and execute the following steps:
according to any one of the following modes, fusing the RGB image and the depth image in a first fusion mode to obtain a first fused image:
only the depth map is reserved to obtain a first single-channel map; or,
mapping the depth map into a first color map, and superposing the first color map and the RGB map to obtain a three-channel map; or,
only the depth map is reserved to obtain a second single channel map; adding the second single-channel image to an Alpha channel of the RGB image to obtain a four-channel image; or,
mapping the depth map into a second color map; or,
converting the RGB map to a single channel grayscale map, mapping the depth map to a second color map; and adding the single-channel gray-scale image to an Alpha channel of the second color image to obtain a four-channel image.
The processor 1300 is further configured to read the program and execute the following steps:
fusing the RGB image and the depth image in a second fusion mode to obtain a second fusion image; the second fusion mode is different from the first fusion mode;
inputting the second fusion image into the first model or the second model to obtain a second living body detection result; wherein the first model and the second model are different models;
and obtaining a final living body detection result according to the first living body detection result and the second living body detection result.
The processor 1300 is further configured to read the program and execute the following steps:
calculating the first living body detection result and the second living body detection result, and taking the calculation result as the final living body detection result;
the operation comprises any one of the following:
calculating the product of the first living body detection result and a first weighting value, calculating the product of the second living body detection result and a second weighting value, and summing the obtained products; or,
calculating the average value of the first and second living body detection results.
Wherein the meaning of the first model can refer to the description of the previous embodiments.
As shown in fig. 14, the electronic device according to the embodiment of the present invention includes: the processor 1400 is used for reading the program in the memory 1420 and executing the following processes:
obtaining a model training sample set, wherein the model training sample set comprises a plurality of fusion images, and each fusion image is obtained by fusing a frame of RGB image and a frame of depth image corresponding to the RGB image;
and inputting the training sample set into a machine learning network model, and training to obtain a first model.
A transceiver 1410 for receiving and transmitting data under the control of the processor 1400.
In fig. 14, the bus architecture may include any number of interconnected buses and bridges, particularly one or more processors represented by processor 1400 and various circuits of memory represented by memory 1420, linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art and therefore will not be described further herein. The bus interface provides an interface. The transceiver 1410 may be a number of elements, including a transmitter and a receiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 1400 is responsible for managing the bus architecture and general processing, and the memory 1420 may store data used by the processor 1400 in performing operations.
Wherein the first model comprises one of FeatherNet, MobileNet, ShuffleNet, EfficientNet and SqueezeNet.
Wherein the first model is FeatherNet; calculation is performed by using the Deformable Depthwise Convolution in the convolutional neural network backbone (CNN Stem) of FeatherNet, and by using the Deformable Depthwise Convolution in the Streaming Module of FeatherNet;
wherein the Deformable Depthwise Convolution is obtained by combining the Depthwise Convolution (DWConv) with Deformable Convolution V2.
Wherein the first model is FeatherNet; a Deformable Depthwise Convolution is used for computation in the CNN Stem of FeatherNet;
a Deformable Depthwise Convolution is used for computation in the Streaming Module of FeatherNet;
wherein the Deformable Depthwise Convolution is obtained by combining DWConv with any one of the following convolution types:
the first version of the Deformable Convolution (Deformable Convolution V1), or a Dilated Convolution (hole Convolution); or, the CNN Stem of the FeatherNet is a combination of a 1 × 3 Deformable Depthwise Convolution and a 3 × 1 Deformable Depthwise Convolution;
the Streaming Module of the FeatherNet is a k × k Deformable Depthwise Convolution, wherein k is a positive integer greater than 1; further, the k × k Deformable Depthwise Convolution is obtained by combining a k × k Depthwise Convolution with any one of the following convolution modes: a Deformable Convolution or a Dilated Convolution;
or, the Streaming Module of the FeatherNet is a combination of a 1 × k Deformable Depthwise Convolution and a k × 1 Deformable Depthwise Convolution.
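For illustration only, two of the alternatives listed above are sketched below in plain PyTorch: a depthwise convolution combined with dilation (the Dilated Convolution option, with the deformable offsets omitted for brevity) and the 1 × k / k × 1 factorized depthwise pair; the channel counts and k values are assumptions of this sketch:

# Sketch of (a) a dilated depthwise convolution and (b) a 1 x k followed by
# k x 1 depthwise pair, as in the factorized CNN Stem / Streaming Module
# variants described above. Deformable offsets are not modeled here.
import torch
import torch.nn as nn

def dilated_depthwise(channels: int, k: int = 3, dilation: int = 2) -> nn.Conv2d:
    # "same" padding for odd k: dilation * (k - 1) // 2
    return nn.Conv2d(channels, channels, k, padding=dilation * (k - 1) // 2,
                     dilation=dilation, groups=channels)

def factorized_depthwise(channels: int, k: int = 7) -> nn.Sequential:
    # 1 x k followed by k x 1, both depthwise (groups == channels).
    return nn.Sequential(
        nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
        nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels))

x = torch.randn(1, 64, 56, 56)
print(dilated_depthwise(64)(x).shape)     # torch.Size([1, 64, 56, 56])
print(factorized_depthwise(64)(x).shape)  # torch.Size([1, 64, 56, 56])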
Wherein the fused image is obtained by any one of the following methods:
only the depth map is retained to obtain a first single-channel map; or,
mapping the depth map into a first color map, and superimposing the first color map and the RGB map to obtain a three-channel map; or,
only the depth map is retained to obtain a second single-channel map; adding the second single-channel map as an Alpha channel of the RGB map to obtain a four-channel image; or,
mapping the depth map into a second color map; or,
converting the RGB map into a single-channel grayscale map, and mapping the depth map into a second color map; and adding the single-channel grayscale map as an Alpha channel of the second color map to obtain a four-channel image.
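For illustration only, two of the fusion modes above are sketched below using OpenCV and NumPy; the array shapes, the JET color map, and the min-max depth normalization are assumptions of this sketch rather than parameters prescribed by this embodiment:

# Sketch of two fusion modes: (a) depth mapped to a color map and blended
# with the RGB frame (three-channel), and (b) depth kept as a single channel
# and appended as an Alpha channel of the RGB frame (four-channel).
import cv2
import numpy as np

def fuse_three_channel(rgb: np.ndarray, depth: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    d8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    depth_color = cv2.applyColorMap(d8, cv2.COLORMAP_JET)       # H x W x 3
    return cv2.addWeighted(rgb, alpha, depth_color, 1.0 - alpha, 0)

def fuse_four_channel(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    d8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return np.dstack([rgb, d8])                                 # H x W x 4

rgb = np.random.randint(0, 256, (112, 112, 3), dtype=np.uint8)
depth = np.random.randint(0, 4000, (112, 112), dtype=np.uint16)
print(fuse_three_channel(rgb, depth).shape)  # (112, 112, 3)
print(fuse_four_channel(rgb, depth).shape)   # (112, 112, 4)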
As shown in fig. 15, the electronic device according to the embodiment of the present invention includes: the processor 1500 is used for reading the program in the memory 1520 and executing the following processes:
acquiring a target image group, wherein the target image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image;
fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
and inputting the first fusion image into a first model to obtain an image classification result.
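For illustration only, a minimal inference sketch of this classification flow follows; the tensor conversion and the model are assumed placeholders (any of the fusion modes described earlier, and any of the candidate first models, could be substituted):

# Sketch of the classification flow: take one fused RGB+depth image,
# convert it to a tensor, and run the first model to get a class label.
import numpy as np
import torch

@torch.no_grad()
def classify(model: torch.nn.Module, fused: np.ndarray) -> int:
    """fused: H x W x C first fused image; returns the predicted class id."""
    model.eval()
    x = torch.from_numpy(fused).float().permute(2, 0, 1).unsqueeze(0) / 255.0
    logits = model(x)                      # shape (1, num_classes)
    return int(logits.argmax(dim=1).item())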
The transceiver 1510 is used to receive and transmit data under the control of the processor 1500.
In fig. 15, the bus architecture may include any number of interconnected buses and bridges, with one or more processors, represented by processor 1500, and various circuits of memory, represented by memory 1520, linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore will not be described further herein. The bus interface provides an interface. The transceiver 1510 may be a plurality of elements, including a transmitter and a receiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 1500 is responsible for managing the bus architecture and general processing, and the memory 1520 may store data used by the processor 1500 in performing operations.
For the meaning of the first model, reference may be made to the description of the method embodiments above.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above-mentioned embodiment of the living body detection method, the model training method, or the image classification method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another like element in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. With such an understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (18)

1. A method of model training, comprising:
obtaining a model training sample set, wherein the model training sample set comprises a plurality of fusion images, each fusion image is obtained by fusing a frame of red, green and blue (RGB) image and a frame of depth image corresponding to the RGB image, and the model training sample set comprises fusion images of at least two different fusion modes;
inputting the model training sample set into a machine learning network model, and training to obtain a first model;
inputting the model training sample set into a machine learning network model, and training to obtain a second model, wherein the first model and the second model are different models;
inputting a fusion image obtained by a first fusion mode into the first model in the process of living body detection or image classification to obtain a first living body detection result or a first image classification result; inputting a fusion image obtained by a second fusion mode into the first model or the second model to obtain a second living body detection result or a second image classification result; calculating the first living body detection result and the second living body detection result to obtain a living body detection result; or, calculating the first image classification result and the second image classification result to obtain an image classification result; the second fusion mode is different from the first fusion mode.
2. The method of claim 1, wherein the first model comprises one of FeatherNet, MobileNet, ShuffleNet, EfficientNet, and SqueezeNet.
3. The method of claim 2, wherein the first model is a FeatherNet;
a convolutional neural network backbone CNN Stem of the FeatherNet is a 3 × 3 Deformable Depthwise Convolution;
a Streaming Module of the FeatherNet is a k × k Deformable Depthwise Convolution, wherein k is a positive integer greater than 1;
the Deformable Depthwise Convolution is obtained by combining a Depthwise Convolution and a Deformable Convolution.
4. The method of claim 2, wherein the first model is a FeatherNet;
the CNN Stem of the FeatherNet is a 3 × 3 Deformable Depthwise Convolution; wherein the 3 × 3 Deformable Depthwise Convolution is obtained by combining a 3 × 3 Depthwise Convolution with any one of the following convolution modes: a Deformable Convolution or a Dilated Convolution;
or, the CNN Stem of the FeatherNet is a combination of a 1 × 3 Deformable Depthwise Convolution and a 3 × 1 Deformable Depthwise Convolution;
the Streaming Module of the FeatherNet is a k × k Deformable Depthwise Convolution; the k × k Deformable Depthwise Convolution is obtained by combining a k × k Depthwise Convolution with any one of the following convolution modes: a Deformable Convolution or a Dilated Convolution;
or, the Streaming Module of the FeatherNet is a combination of a 1 × k Deformable Depthwise Convolution and a k × 1 Deformable Depthwise Convolution;
and k is a positive integer greater than 1.
5. The method of claim 1, wherein the fused image is obtained by any one of:
only the depth map is retained to obtain a first single-channel map; or,
mapping the depth map into a first color map, and superimposing the first color map and the RGB map to obtain a three-channel map; or,
only the depth map is retained to obtain a second single-channel map; adding the second single-channel map as an Alpha channel of the RGB map to obtain a four-channel image; or,
mapping the depth map into a second color map; or,
converting the RGB map into a single-channel grayscale map, and mapping the depth map into a second color map; and adding the single-channel grayscale map as an Alpha channel of the second color map to obtain a four-channel image.
6. A living body detection method, comprising:
acquiring a target face image group, wherein the target face image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image;
fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
inputting the first fusion image into a first model to obtain a first living body detection result;
fusing the RGB image and the depth image in a second fusion mode to obtain a second fusion image; the second fusion mode is different from the first fusion mode;
inputting the second fusion image into the first model or the second model to obtain a second living body detection result; wherein the first model and the second model are different models;
and calculating the first living body detection result and the second living body detection result to obtain a living body detection result.
7. The method according to claim 6, wherein the fusing the RGB map and the depth map in a first fusion manner to obtain a first fused image comprises any one of the following manners:
only the depth map is retained to obtain a first single-channel map; or,
mapping the depth map into a first color map, and superimposing the first color map and the RGB map to obtain a three-channel map; or,
only the depth map is retained to obtain a second single-channel map; adding the second single-channel map as an Alpha channel of the RGB map to obtain a four-channel image; or,
mapping the depth map into a second color map; or,
converting the RGB map into a single-channel grayscale map, and mapping the depth map into a second color map; and adding the single-channel grayscale map as an Alpha channel of the second color map to obtain a four-channel image.
8. The method of claim 6, wherein the calculating the first living body detection result and the second living body detection result to obtain a final living body detection result comprises:
performing an operation on the first living body detection result and the second living body detection result, and taking the operation result as the final living body detection result;
the operation comprises any one of the following:
calculating the product of the first living body detection result and a first weighting value, calculating the product of the second living body detection result and a second weighting value, and summing the obtained products; or,
calculating the average value of the first living body detection result and the second living body detection result.
9. The method of claim 6, wherein the first model comprises one of FeatherNet, MobileNet, ShuffleNet, EfficientNet, and SqueezeNet.
10. The method of claim 9, wherein the first model is a FeatherNet;
a convolutional neural network backbone CNN Stem of the FeatherNet is a 3 × 3 Deformable Depthwise Convolution;
a Streaming Module of the FeatherNet is a k × k Deformable Depthwise Convolution, wherein k is a positive integer greater than 1;
the Deformable Depthwise Convolution is obtained by combining a Depthwise Convolution and a Deformable Convolution.
11. The method of claim 9, wherein the first model is a FeatherNet;
the CNN Stem of the FeatherNet is a 3 × 3 Deformable Depthwise Convolution; wherein the 3 × 3 Deformable Depthwise Convolution is obtained by combining a 3 × 3 Depthwise Convolution with any one of the following convolution modes: a Deformable Convolution or a Dilated Convolution;
or, the CNN Stem of the FeatherNet is a combination of a 1 × 3 Deformable Depthwise Convolution and a 3 × 1 Deformable Depthwise Convolution;
the Streaming Module of the FeatherNet is a k × k Deformable Depthwise Convolution; the k × k Deformable Depthwise Convolution is obtained by combining a k × k Depthwise Convolution with any one of the following convolution modes: a Deformable Convolution or a Dilated Convolution;
or, the Streaming Module of the FeatherNet is a combination of a 1 × k Deformable Depthwise Convolution and a k × 1 Deformable Depthwise Convolution;
and k is a positive integer greater than 1.
12. An image classification method, comprising:
acquiring a target image group, wherein the target image group comprises a frame of RGB image and a frame of depth image corresponding to the RGB image;
fusing the RGB image and the depth image in a first fusion mode to obtain a first fusion image;
inputting the first fusion image into a first model to obtain a first image classification result;
fusing the RGB image and the depth image in a second fusion mode to obtain a second fusion image; the second fusion mode is different from the first fusion mode;
inputting the second fusion image into the first model or the second model to obtain a second image classification result; wherein the first model and the second model are different models;
and calculating the first image classification result and the second image classification result to obtain an image classification result.
13. The method of claim 12, wherein the first model comprises one of FeatherNet, MobileNet, ShuffleNet, EfficientNet, and SqueezeNet.
14. The method of claim 12, wherein the first model is FeatherNet;
a convolutional neural network backbone CNN Stem of the FeatherNet is a 3 × 3 Deformable Depthwise Convolution;
a Streaming Module of the FeatherNet is a k × k Deformable Depthwise Convolution, wherein k is a positive integer greater than 1;
the Deformable Depthwise Convolution is obtained by combining a Depthwise Convolution and a Deformable Convolution.
15. The method of claim 12, wherein the first model is a FeatherNet;
the CNN Stem of the FeatherNet is a 3 × 3 Deformable Depthwise Convolution; wherein the 3 × 3 Deformable Depthwise Convolution is obtained by combining a 3 × 3 Depthwise Convolution with any one of the following convolution modes: a Deformable Convolution or a Dilated Convolution;
or, the CNN Stem of the FeatherNet is a combination of a 1 × 3 Deformable Depthwise Convolution and a 3 × 1 Deformable Depthwise Convolution;
the Streaming Module of the FeatherNet is a k × k Deformable Depthwise Convolution; the k × k Deformable Depthwise Convolution is obtained by combining a k × k Depthwise Convolution with any one of the following convolution modes: a Deformable Convolution or a Dilated Convolution;
or, the Streaming Module of the FeatherNet is a combination of a 1 × k Deformable Depthwise Convolution and a k × 1 Deformable Depthwise Convolution;
and k is a positive integer greater than 1.
16. The method according to claim 12, wherein the fusing the RGB map and the depth map in a first fusion manner to obtain a first fused image comprises any one of the following manners:
only the depth map is retained to obtain a first single-channel map; or,
mapping the depth map into a first color map, and superimposing the first color map and the RGB map to obtain a three-channel map; or,
only the depth map is retained to obtain a second single-channel map; adding the second single-channel map as an Alpha channel of the RGB map to obtain a four-channel image; or,
mapping the depth map into a second color map; or,
converting the RGB map into a single-channel grayscale map, and mapping the depth map into a second color map; and adding the single-channel grayscale map as an Alpha channel of the second color map to obtain a four-channel image.
17. An electronic device, comprising: a transceiver, a memory, a processor, and a program stored on the memory and executable on the processor; characterized in that,
the processor is configured to read the program in the memory to implement the steps in the method of any one of claims 1 to 5; or to implement the steps in the method of any one of claims 6 to 11; or to implement the steps in the method of any one of claims 12 to 16.
18. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the steps in the method of any one of claims 1 to 5; or implements the steps in the method of any one of claims 6 to 11; or implements the steps in the method of any one of claims 12 to 16.
CN201911186211.8A 2019-11-27 2019-11-27 Living body detection, image classification and model training method, device, equipment and medium Active CN112861586B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911186211.8A CN112861586B (en) 2019-11-27 2019-11-27 Living body detection, image classification and model training method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911186211.8A CN112861586B (en) 2019-11-27 2019-11-27 Living body detection, image classification and model training method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN112861586A CN112861586A (en) 2021-05-28
CN112861586B true CN112861586B (en) 2022-12-13

Family

ID=75985103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911186211.8A Active CN112861586B (en) 2019-11-27 2019-11-27 Living body detection, image classification and model training method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112861586B (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3156942A1 (en) * 2015-10-16 2017-04-19 Thomson Licensing Scene labeling of rgb-d data with interactive option
CN109034102B (en) * 2018-08-14 2023-06-16 腾讯科技(深圳)有限公司 Face living body detection method, device, equipment and storage medium
CN109376608B (en) * 2018-09-26 2021-04-27 中国计量大学 Human face living body detection method
CN109711243B (en) * 2018-11-01 2021-02-09 长沙小钴科技有限公司 Static three-dimensional face in-vivo detection method based on deep learning
CN109460733A (en) * 2018-11-08 2019-03-12 北京智慧眼科技股份有限公司 Recognition of face in-vivo detection method and system based on single camera, storage medium
CN109635770A (en) * 2018-12-20 2019-04-16 上海瑾盛通信科技有限公司 Biopsy method, device, storage medium and electronic equipment
CN109840475A (en) * 2018-12-28 2019-06-04 深圳奥比中光科技有限公司 Face identification method and electronic equipment
CN109977757B (en) * 2019-01-28 2020-11-17 电子科技大学 Multi-modal head posture estimation method based on mixed depth regression network
CN109934195A (en) * 2019-03-21 2019-06-25 东北大学 A kind of anti-spoofing three-dimensional face identification method based on information fusion
CN110175566B (en) * 2019-05-27 2022-12-23 大连理工大学 Hand posture estimation system and method based on RGBD fusion network

Also Published As

Publication number Publication date
CN112861586A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
US11887311B2 (en) Method and apparatus for segmenting a medical image, and storage medium
CN108171701B (en) Significance detection method based on U network and counterstudy
EP3989104A1 (en) Facial feature extraction model training method and apparatus, facial feature extraction method and apparatus, device, and storage medium
CN111881706B (en) Living body detection, image classification and model training method, device, equipment and medium
CN109871845B (en) Certificate image extraction method and terminal equipment
CN110781976B (en) Extension method of training image, training method and related device
US20190340473A1 (en) Pattern recognition method of autoantibody immunofluorescence image
CN113822951A (en) Image processing method, image processing device, electronic equipment and storage medium
CN111582459B (en) Method for executing operation, electronic equipment, device and storage medium
CN107066980A (en) A kind of anamorphose detection method and device
CN117710921A (en) Training method, detection method and related device of target detection model
CN112861586B (en) Living body detection, image classification and model training method, device, equipment and medium
CN114764942B (en) Difficult positive and negative sample online mining method and face recognition method
CN115063795A (en) Urinary sediment classification detection method and device, electronic equipment and storage medium
CN113902044A (en) Image target extraction method based on lightweight YOLOV3
CN111651626B (en) Image classification method, device and readable storage medium
CN114913513A (en) Method and device for calculating similarity of official seal images, electronic equipment and medium
CN116563898A (en) Palm vein image recognition method, device, equipment and medium based on GhostNet network
CN108805190B (en) Image processing method and device
CN113033305A (en) Living body detection method, living body detection device, terminal equipment and storage medium
US11055512B2 (en) Method, apparatus and server for determining mental state of human
CN106446902B (en) non-character image recognition method and device
CN115731588B (en) Model processing method and device
RU2747214C1 (en) Hardware-software complex designed for training and (or) re-training of processing algorithms for aerial photographs in visible and far infrared band for detection, localization and classification of buildings outside of localities
CN116109823B (en) Data processing method, apparatus, electronic device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant