CN113221842A - Model training method, image recognition method, device, equipment and medium - Google Patents

Model training method, image recognition method, device, equipment and medium

Info

Publication number
CN113221842A
CN113221842A (application CN202110623053.9A)
Authority
CN
China
Prior art keywords
model
training
image recognition
recognition model
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110623053.9A
Other languages
Chinese (zh)
Other versions
CN113221842B (en)
Inventor
刘有亮
刘闯
叶雨桐
胡峻毅
陈诗昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Glasssix Technology Beijing Co ltd
Original Assignee
Glasssix Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Glasssix Technology Beijing Co ltd filed Critical Glasssix Technology Beijing Co ltd
Priority to CN202110623053.9A priority Critical patent/CN113221842B/en
Publication of CN113221842A publication Critical patent/CN113221842A/en
Application granted granted Critical
Publication of CN113221842B publication Critical patent/CN113221842B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/56Extraction of image or video features relating to colour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40Spoof detection, e.g. liveness detection
    • G06V40/45Detection of the body part being alive
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20048Transform domain processing
    • G06T2207/20052Discrete cosine transform [DCT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/467Encoded features or binary features, e.g. local binary patterns [LBP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

In the model training method, image recognition method, apparatus, device and medium, an auxiliary model is introduced during training of the image recognition model to assist it in feature extraction. Compared with an unassisted deep learning model that extracts image features entirely on its own, the auxiliary model biases the image recognition model toward extracting texture features, so that the trained recognition model can exploit the texture information in an image and thereby improve recognition accuracy when judging whether the face in the image is a living body face.

Description

Model training method, image recognition method, device, equipment and medium
Technical Field
The present application relates to the field of image recognition, and in particular, to a model training method, an image recognition method, an apparatus, a device, and a medium.
Background
Human face living body detection technology plays an important role in face recognition systems. As a key link in a face recognition system, face living body detection needs to effectively prevent attacks by non-living faces, such as printed 2D face photos, electronic screen replays and 3D face masks, so as to guarantee the security and reliability of the whole system. Existing living body detection technology mainly comprises traditional living body detection methods and deep-learning-based living body detection methods.
The inventors have found that existing deep-learning-based living body detection methods do not make full use of the feature information contained in non-living body faces, so their recognition accuracy still leaves room for improvement.
Disclosure of Invention
In order to overcome at least one of the deficiencies in the prior art, in a first aspect, the present application provides a model training method applied to a training device configured with an auxiliary model and an image recognition model to be trained, the method comprising:
acquiring a human face sample image;
inputting the face sample image into the image recognition model for processing;
iteratively adjusting model parameters of the image recognition model according to training losses to obtain a recognition model, wherein the training losses include a first training loss and a second training loss, the first training loss is obtained through texture features provided by the auxiliary model and shallow texture features output by the image recognition model, and the second training loss is obtained through the image recognition model.
In a second aspect, the present application provides an image recognition method applied to an image recognition apparatus configured with a recognition model obtained by a model training method, the method including:
acquiring a face image to be recognized;
and obtaining a living body face recognition result of the face image to be recognized through the recognition model.
In a third aspect, the present application provides a model training apparatus, which is applied to a training device configured with an auxiliary model and an image recognition model to be trained, and includes:
the data acquisition module is used for acquiring a human face sample image;
the data processing module is used for inputting the human face sample image into the image recognition model for processing;
and the model training module is used for iteratively adjusting the model parameters of the image recognition model according to training losses to obtain a recognition model, wherein the training losses comprise a first training loss and a second training loss, the first training loss is obtained through the texture features provided by the auxiliary model and the shallow texture features output by the image recognition model, and the second training loss is obtained through the image recognition model.
In a fourth aspect, the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the computer program, when executed by the processor, implements the model training method or the image recognition method.
In a fifth aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the model training method or the image recognition method.
Compared with the prior art, the method has the following beneficial effects:
In the model training method, image recognition method, apparatus, device and medium provided by the embodiments of the present application, an auxiliary model is introduced during training of the image recognition model to assist it in feature extraction. Compared with an unassisted deep learning model that extracts image features entirely on its own, the auxiliary model biases the image recognition model toward extracting texture features, so that the trained recognition model can exploit the texture information in an image and thereby improve recognition accuracy when judging whether the face in the image is a living body face.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be regarded as limiting the scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a schematic diagram illustrating steps of a model training method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a model provided in an embodiment of the present application;
FIG. 3 is a second schematic diagram of a model provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating steps of an image recognition method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 120 - memory; 130 - processor; 201 - data acquisition module; 202 - data processing module; 203 - model training module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present application, it should be noted that the terms "first", "second", "third" and the like are used merely to distinguish between descriptions and are not intended to indicate or imply relative importance. The terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to, or remove one or more operations from, the flowchart.
As a key link in a face recognition system, face living body detection needs to effectively prevent attacks by non-living faces. In the related art, deep-learning-based living body detection methods are applied in various production scenarios to perform living body detection on face images. However, the inventors have found that these methods do not make full use of the feature information in non-living body faces and therefore exhibit several problems in practical use.
For example, existing deep-learning-based living body detection methods are usually trained on a large number of samples. Cameras in real usage scenarios come in different specifications, so the images they collect differ from one another and images outside the sample space inevitably occur; as a result, differences in camera specifications affect recognition accuracy to a certain extent.
Alternatively, in order to improve model accuracy, consecutive frames are used as the input of the deep learning model. However, this approach may require a large amount of computation, may not adapt well to the hardware capability of the terminal device (e.g., a camera), and still suffers from limited recognition accuracy.
In view of this, the embodiments of the present application provide a model training method applied to a training device. By introducing auxiliary supervision information during training of the image recognition model to be trained, the training method enables the trained recognition model to specifically detect non-living body face features in the face image to be recognized, thereby improving the accuracy of living body face detection.
The inventors have found that a reproduced face photo or video differs from a real face in local texture. For example, in a face video replayed on a mobile phone, the phone screen reflects light differently from human skin, so moire fringes may appear in the camera's image of the phone screen.
Based on this finding, the present embodiment introduces an auxiliary model during model training. Each step of the model training method provided in this embodiment is described in detail below with reference to the step diagram shown in fig. 1. As shown in fig. 1, the model training method includes:
in step S101A, a face sample image is acquired.
Step S102A, the face sample image is input into the image recognition model for processing.
Step S103A, iteratively adjusting the model parameters of the image recognition model according to the training loss to obtain the recognition model.
The training loss comprises a first training loss and a second training loss, the first training loss is obtained through texture features provided by the auxiliary model and shallow texture features output by the image recognition model, and the second training loss is obtained through the image recognition model.
Optionally, iteratively adjusting the model parameters of the image recognition model using the texture features provided by the auxiliary model to obtain the recognition model may include:
and step S103A-1, obtaining the shallow texture feature output by the image recognition model and the texture feature provided by the auxiliary model.
The shallow texture features may be the features output by a first network layer in the image recognition model. Those skilled in the art may select a network layer from the image recognition model as the first network layer according to the needs of the scene; this embodiment is not particularly limited in this respect.
The model structure diagram shown in fig. 2 includes an image recognition model and an auxiliary model. For convenience of description, a model providing texture features in the auxiliary model will be referred to as a texture auxiliary model hereinafter.
As shown in fig. 2, an image in the HSV color space is more amenable to extracting texture information than the face sample image in the original RGB color space. The training device therefore first converts the face sample image from the RGB color space to the HSV color space through the texture auxiliary model to obtain a color-converted image; the color-converted image is then processed with a Local Binary Pattern (LBP) operator to obtain the texture features of the face sample image.
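By way of illustration, such a texture auxiliary branch could be sketched as follows in Python, assuming OpenCV and scikit-image are available; the LBP parameters and the per-channel stacking are illustrative choices rather than details prescribed by this embodiment.

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def texture_features(face_bgr: np.ndarray, points: int = 8, radius: int = 1) -> np.ndarray:
    """Per-channel LBP map of the face image after conversion to the HSV color space."""
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)          # RGB/BGR -> HSV color conversion
    lbp_channels = [
        local_binary_pattern(hsv[:, :, c], points, radius, method="uniform")
        for c in range(3)                                      # LBP on each HSV channel
    ]
    return np.stack(lbp_channels, axis=-1).astype(np.float32)
```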
Referring to fig. 2 again, in the present embodiment the second network layer of the image recognition model serves as the first network layer described above, and its output features serve as the shallow texture features.
Step S103A-2, the difference between the texture feature and the shallow texture feature is taken as the first training loss.
Taking fig. 2 as an example again, in order to further reduce the introduction of interference information, before comparing the shallow texture features with the texture features, the image recognition model further performs feature extraction on the shallow texture features output by the first network layer by using a convolution kernel of 3 × 3.
In addition, in order to compare the shallow texture features with the texture features, the training device further scales the size of the texture features through the texture assistant model before the comparison, so that the scaled texture features and the shallow texture features have the same size.
Finally, the training device computes the difference between the texture features and the shallow texture features and uses this difference as the first training loss.
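A possible form of this first training loss is sketched below in PyTorch; the use of an L1 distance and bilinear resizing are assumptions, since the embodiment only speaks of a difference value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextureLoss(nn.Module):
    """First training loss: compare shallow features with the auxiliary texture features."""
    def __init__(self, shallow_channels: int, texture_channels: int):
        super().__init__()
        # 3 x 3 convolution applied to the shallow features before the comparison
        self.refine = nn.Conv2d(shallow_channels, texture_channels, kernel_size=3, padding=1)

    def forward(self, shallow_feat: torch.Tensor, texture_feat: torch.Tensor) -> torch.Tensor:
        refined = self.refine(shallow_feat)
        # Scale the texture features so they match the shallow feature map size
        texture_resized = F.interpolate(
            texture_feat, size=refined.shape[-2:], mode="bilinear", align_corners=False
        )
        return F.l1_loss(refined, texture_resized)
```

In this sketch the texture features from the auxiliary branch act as a fixed target, consistent with the description that only the image recognition model's parameters are adjusted.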
Step S103A-3, a second training loss from the image recognition model recognizing the sample image is obtained.
Taking fig. 2 as an example again, since the image recognition model does not reach the convergence condition during training, there is a second training loss when the image recognition model recognizes the sample image.
And step S103A-4, when the image recognition model meets the preset convergence condition, obtaining the recognition model.
And step S103A-5, when the image recognition model does not meet the preset convergence condition, adjusting the model parameters of the image recognition model according to the first training loss and the second training loss.
Optionally, the first training loss and the second training loss may be combined by means of a weighted sum. For example, assuming that the weight of the first training loss is 0.5 and the weight of the second training loss is also 0.5, the training device performs a weighted summation according to the respective weights of the first training loss and the second training loss to obtain a weighted loss; finally, the model parameters of the image recognition model are adjusted through a gradient back-propagation algorithm according to the weighted loss.
It should be noted that the above weight is only an example provided in this embodiment, and a person skilled in the art may perform adaptive adjustment according to needs, and this embodiment is not limited in particular.
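As a sketch of this update step in PyTorch, the weighted sum and back-propagation could look as follows; the helper name update_step and the weights are illustrative, not taken from the embodiment.

```python
import torch

def update_step(optimizer: torch.optim.Optimizer,
                first_loss: torch.Tensor,
                second_loss: torch.Tensor,
                w_first: float = 0.5,
                w_second: float = 0.5) -> float:
    """One parameter adjustment using the weighted sum of the two training losses."""
    weighted_loss = w_first * first_loss + w_second * second_loss
    optimizer.zero_grad()
    weighted_loss.backward()   # gradient back-propagation through the image recognition model
    optimizer.step()           # adjust the model parameters
    return weighted_loss.item()
```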
Step S103A-6, the procedure returns to the step of obtaining the shallow texture features output by the image recognition model and the texture features provided by the auxiliary model, until the image recognition model meets the preset convergence condition and the recognition model is obtained.
Thus, compared with an unassisted deep learning model that extracts image features on its own, the method introduces an auxiliary model during training to assist the image recognition model in feature extraction. The auxiliary model biases the image recognition model toward extracting texture features, so that the trained recognition model can exploit texture information in the image and thereby improve recognition accuracy when judging whether the face in the image is a living body face.
Further, the inventors have also found that a living body face image contains more high-frequency information than a non-living body face image.
Thus, as another possible implementation, the training loss includes a third training loss in addition to the first training loss and the second training loss. The third training loss is obtained through the high-frequency features provided by the auxiliary model and the deep high-frequency features output by the image recognition model.
Therefore, before adjusting the model parameters of the image recognition model, the training device obtains the shallow texture features and the deep high-frequency features output by the image recognition model, and the texture features and the high-frequency features provided by the auxiliary model.
Illustratively, the training device may process the face sample image by means of a Discrete Cosine Transform (DCT) to obtain the high-frequency features.
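As an illustration, the high-frequency features could be derived from the DCT coefficients as sketched below; converting to grayscale and zeroing an 8 x 8 low-frequency block are assumptions, since the embodiment does not fix how the high-frequency part is separated.

```python
import cv2
import numpy as np
from scipy.fft import dctn, idctn

def high_frequency_features(face_bgr: np.ndarray, cutoff: int = 8) -> np.ndarray:
    """High-frequency map of the face image obtained by suppressing low DCT coefficients."""
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    coeffs = dctn(gray, norm="ortho")          # 2-D discrete cosine transform
    coeffs[:cutoff, :cutoff] = 0.0             # zero out the low-frequency block
    return idctn(coeffs, norm="ortho").astype(np.float32)
```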
The deep high-frequency features may be the features output by a second network layer in the image recognition model. Those skilled in the art may select a network layer from the image recognition model as the second network layer according to the needs of the scene; this embodiment is not particularly limited in this respect.
For example, a network layer preceding the fully connected layer may be selected as the second network layer.
Then, the training device takes the difference between the texture features and the shallow texture features as the first training loss, obtains the second training loss from the image recognition model recognizing the sample image, and takes the difference between the high-frequency features and the deep high-frequency features as the third training loss.
And when the image recognition model meets the preset convergence condition, the training equipment obtains the trained recognition model.
And when the image recognition model does not meet the preset convergence condition, the training equipment adjusts the model parameters of the image recognition model according to the first training loss, the second training loss and the third training loss.
Exemplarily, the training equipment obtains a weighted training loss among the first training loss, the second training loss and the third training loss according to a preset weight; and adjusting the model parameters of the image recognition model to be trained according to the weighted training loss.
The weighted training loss satisfies the following relationship with the first training loss, the second training loss and the third training loss:
L_final = α·L_FL + β·L_DCT + γ·L_LBP
where L_LBP is the first (texture) training loss, L_FL is the second (recognition) training loss, L_DCT is the third (high-frequency) training loss, α + β + γ = 1, α, β, γ > 0, α > β and α > γ.
For example, the weight of the first training loss may be 0.3, the weight of the second training loss may be 0.3, and the weight of the third training loss may be 0.4. Of course, those skilled in the art can adjust these weights as needed.
The procedure then returns to the step of obtaining the shallow texture features and deep high-frequency features output by the image recognition model and the texture features and high-frequency features provided by the auxiliary model, until the image recognition model meets the preset convergence condition and the recognition model is obtained.
It should be understood that the preset convergence condition may be, but is not limited to: stopping iteration when the model loss value no longer decreases, stopping iteration when the number of iterations reaches a set number, or stopping iteration when the model loss value falls below a set threshold.
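A simple check covering these three alternatives might look like the following sketch; the patience, iteration cap and threshold values are placeholders, not values given by the embodiment.

```python
def converged(losses: list, max_iters: int = 10000,
              threshold: float = 1e-3, patience: int = 5) -> bool:
    """Return True if any of the three stopping criteria above is met."""
    if len(losses) >= max_iters:                       # iteration count reached a set number
        return True
    if losses and losses[-1] < threshold:              # loss value below a set threshold
        return True
    if len(losses) > patience and min(losses[-patience:]) >= min(losses[:-patience]):
        return True                                    # loss value no longer decreasing
    return False
```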
For ease of understanding, the following description is made with reference to the model structure diagram shown in fig. 3, which includes an image recognition model and an auxiliary model. For convenience of description, the model in the auxiliary model that provides texture features is hereinafter referred to as the texture auxiliary model, and the model that provides high-frequency features is referred to as the high-frequency auxiliary model.
The texture auxiliary model is the same as that in fig. 2, so its description is not repeated here. In order to further reduce the introduction of interference information, before the deep high-frequency features are compared with the high-frequency features, the image recognition model performs further feature extraction on the deep high-frequency features output by the second network layer using a 3 x 3 convolution kernel.
In addition, in order to compare the deep high-frequency features with the high-frequency features, before the comparison, the training device further scales the sizes of the high-frequency features through the high-frequency auxiliary model, so that the scaled high-frequency features and the deep high-frequency features have the same sizes.
Therefore, with the assistance of the texture auxiliary model and the high-frequency auxiliary model, the trained recognition model extracts high-frequency features in addition to texture features, so that it can judge whether the face in an image is a living body face by combining texture features and high-frequency features.
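To make the structure in fig. 3 concrete, the following PyTorch skeleton shows a backbone that exposes a shallow feature map (compared against the LBP texture features) and a deep feature map (compared against the DCT high-frequency features) in addition to its classification output; the layer sizes and depths are placeholders and not the network actually used in the embodiment.

```python
import torch
import torch.nn as nn

class RecognitionBackbone(nn.Module):
    """Toy backbone exposing the two feature taps used for auxiliary supervision."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())            # "first network layer"
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())  # "second network layer"
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor):
        shallow = self.stage1(x)   # tap compared with the LBP texture features
        mid = self.stage2(shallow)
        deep = self.stage3(mid)    # tap compared with the DCT high-frequency features
        logits = self.fc(self.pool(deep).flatten(1))
        return logits, shallow, deep
```

At inference time only the classification output is needed, which matches the description that the auxiliary model is removed after training.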
The embodiment further provides a model training apparatus, which is applied to a training device, wherein the training device is configured with an auxiliary model and an image recognition model to be trained, and the model training apparatus includes at least one functional module which can be stored in the memory 120 in a software form. As shown in fig. 4, functionally divided, the model training apparatus may include:
and the data acquisition module 201 is used for acquiring a face sample image.
In this embodiment, the data obtaining module 201 is configured to implement step S101A in fig. 1, and for a detailed description of the data obtaining module 201, refer to a detailed description of step S101A.
The data processing module 202 is used for inputting the face sample image into the image recognition model for processing.
In this embodiment, the data processing module 202 is configured to implement step S102A in fig. 1, and for a detailed description of the data processing module 202, refer to a detailed description of step S102A.
And the model training module 203 is configured to iteratively adjust model parameters of the image recognition model according to training losses to obtain the recognition model, where the training losses include a first training loss and a second training loss, the first training loss is obtained through texture features provided by the auxiliary model and shallow texture features output by the image recognition model, and the second training loss is obtained through the image recognition model.
In this embodiment, the model training module 203 is configured to implement step S103A in fig. 1, and for a detailed description of the model training module 203, refer to a detailed description of step S103A.
It should be noted that the model training apparatus may further include other software functional modules, which are used to implement other steps or sub-steps of the above model training method; similarly, the data obtaining module 201, the data processing module 202, and the model training module 203 may also be used to implement other steps or sub-steps of the model training method, which is not specifically limited in this embodiment, and those skilled in the art may perform adaptive adjustment according to different module division standards.
This embodiment also provides an image recognition method, applied to an image recognition device configured with a recognition model obtained by the model training method described above. That is, in this embodiment, the image recognition model to be trained is trained with the aid of the auxiliary model; after the image recognition model satisfies the preset convergence condition, the auxiliary model is removed, and the remaining model is referred to as the recognition model.
As shown in fig. 5, the image recognition method includes:
step S101B, a face image to be recognized is acquired.
Step S102B, a living body face recognition result of the face image to be recognized is obtained through the recognition model.
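At inference time, with the auxiliary model removed, using the trained recognition model might look like the following sketch; the input size, normalization, decision threshold and class ordering are assumptions.

```python
import cv2
import numpy as np
import torch

def is_live_face(model: torch.nn.Module, face_bgr: np.ndarray, size: int = 112) -> bool:
    """Classify a face crop as living body (True) or attack (False) with the trained model."""
    model.eval()
    face = cv2.resize(face_bgr, (size, size)).astype(np.float32) / 255.0
    tensor = torch.from_numpy(face).permute(2, 0, 1).unsqueeze(0)   # NCHW batch of one
    with torch.no_grad():
        out = model(tensor)
        logits = out[0] if isinstance(out, tuple) else out          # drop auxiliary taps if present
    return logits.softmax(dim=1)[0, 1].item() > 0.5                 # class 1 assumed to be "living body"
```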
An embodiment of the present application further provides an electronic device, as shown in fig. 6, the electronic device includes a memory and a processor, and the memory stores a computer program. When the electronic device is a training device, the computer program is executed by the processor to implement the model training method.
When the electronic device is an image recognition device, the computer program is executed by the processor to implement the image recognition method described above.
The memory 120 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
An embodiment of the present application further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the model training method or the image recognition method is implemented.
The processor 130 may be an integrated circuit chip having signal processing capabilities. The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In summary, in the model training method, image recognition method, apparatus, device and medium provided by the embodiments of the present application, an auxiliary model is introduced during training of the image recognition model to assist it in feature extraction, in contrast to an unassisted deep learning model that extracts image features on its own. The auxiliary model biases the image recognition model toward extracting texture features, so that the trained recognition model can exploit texture information in the image and thereby improve recognition accuracy when judging whether the face in the image is a living body face.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions are included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A model training method is applied to a training device, the training device is provided with an auxiliary model and an image recognition model to be trained, and the method comprises the following steps:
acquiring a human face sample image;
inputting the face sample image into the image recognition model for processing;
iteratively adjusting model parameters of the image recognition model according to training losses to obtain a recognition model, wherein the training losses include a first training loss and a second training loss, the first training loss is obtained through texture features provided by the auxiliary model and shallow texture features output by the image recognition model, and the second training loss is obtained through the image recognition model.
2. The model training method of claim 1, wherein iteratively adjusting the model parameters of the image recognition model to obtain the recognition model according to the training loss comprises:
obtaining shallow texture features output by the image recognition model and texture features provided by the auxiliary model;
taking a difference between the textural features and the shallow textural features as a first training loss;
obtaining a second training loss of the image recognition model for recognizing the sample image;
when the image recognition model meets a preset convergence condition, obtaining the recognition model;
when the image recognition model does not meet the preset convergence condition, adjusting model parameters of the image recognition model according to the first training loss and the second training loss;
and returning to the step of obtaining the shallow texture feature output by the image recognition model and the texture feature provided by the auxiliary model until the image recognition model meets the preset convergence condition, and obtaining the recognition model.
3. The model training method of claim 1, wherein the training loss further comprises a third training loss obtained through the high-frequency features provided by the auxiliary model and the deep high-frequency features output by the image recognition model.
4. The model training method of claim 3, wherein iteratively adjusting the model parameters of the image recognition model to obtain the recognition model according to the training loss comprises:
obtaining a shallow texture feature and a deep high-frequency feature output by the image recognition model, and the texture feature and the high-frequency feature provided by the auxiliary model;
taking a difference between the textural features and the shallow textural features as a first training loss;
obtaining a second training loss of the image recognition model for recognizing the sample image;
taking the difference between the high frequency features and the deep high frequency features as a third training loss;
when the image recognition model meets a preset convergence condition, obtaining the recognition model;
when the image recognition model does not meet the preset convergence condition, adjusting model parameters of the image recognition model according to the first training loss, the second training loss and the third training loss;
and returning to the step of obtaining the shallow texture features output by the image recognition model and the texture features provided by the auxiliary model, until the image recognition model meets the preset convergence condition, and obtaining the recognition model.
5. The model training method of claim 4, wherein the adjusting the model parameters of the image recognition model according to the first training loss, the second training loss, and the third training loss comprises:
obtaining a weighted training loss among the first training loss, the second training loss and the third training loss according to a preset weight;
and adjusting the model parameters of the image recognition model to be trained according to the weighted training loss.
6. The model training method according to claim 4, wherein the obtaining high-frequency features provided by the auxiliary model comprises:
and processing the face sample image in a discrete cosine transform mode to obtain the high-frequency characteristics.
7. An image recognition method applied to an image recognition device configured with a recognition model obtained by the model training method according to any one of claims 1 to 6, the method comprising:
acquiring a face image to be recognized;
and obtaining a living body face recognition result of the face image to be recognized through the recognition model.
8. A model training apparatus applied to a training device provided with an auxiliary model and an image recognition model to be trained, the model training apparatus comprising:
the data acquisition module is used for acquiring a human face sample image;
the data processing module is used for inputting the human face sample image into the image recognition model for processing;
and the model training module is used for iteratively adjusting the model parameters of the image recognition model according to training losses to obtain a recognition model, wherein the training losses comprise a first training loss and a second training loss, the first training loss is obtained through the texture features provided by the auxiliary model and the shallow texture features output by the image recognition model, and the second training loss is obtained through the image recognition model.
9. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the model training method of any one of claims 1-6 or the image recognition method of claim 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the model training method of any one of claims 1-6 or the image recognition method of claim 7.
CN202110623053.9A 2021-06-04 2021-06-04 Model training method, image recognition method, device, equipment and medium Active CN113221842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110623053.9A CN113221842B (en) 2021-06-04 2021-06-04 Model training method, image recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110623053.9A CN113221842B (en) 2021-06-04 2021-06-04 Model training method, image recognition method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113221842A true CN113221842A (en) 2021-08-06
CN113221842B CN113221842B (en) 2023-12-29

Family

ID=77082798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110623053.9A Active CN113221842B (en) 2021-06-04 2021-06-04 Model training method, image recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113221842B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003196674A (en) * 2001-12-25 2003-07-11 Mega Chips Corp Image processing method, image processing device and storage medium readable by computer
CN106778525A (en) * 2016-11-25 2017-05-31 北京旷视科技有限公司 Identity identifying method and device
CN106951869A (en) * 2017-03-22 2017-07-14 腾讯科技(深圳)有限公司 A kind of live body verification method and equipment
CN107992842A (en) * 2017-12-13 2018-05-04 深圳云天励飞技术有限公司 Biopsy method, computer installation and computer-readable recording medium
CN108764091A (en) * 2018-05-18 2018-11-06 北京市商汤科技开发有限公司 Biopsy method and device, electronic equipment and storage medium
CN109446980A (en) * 2018-10-25 2019-03-08 华中师范大学 Expression recognition method and device
US20190147642A1 (en) * 2017-11-15 2019-05-16 Google Llc Learning to reconstruct 3d shapes by rendering many 3d views
CN109961396A (en) * 2017-12-25 2019-07-02 中国科学院沈阳自动化研究所 A kind of image super-resolution rebuilding method based on convolutional neural networks
CN110765880A (en) * 2019-09-24 2020-02-07 中国矿业大学 Light-weight video pedestrian heavy identification method
WO2020221990A1 (en) * 2019-04-30 2020-11-05 Huawei Technologies Co., Ltd. Facial localisation in images
KR20200132682A (en) * 2019-05-16 2020-11-25 삼성전자주식회사 Image optimization method, apparatus, device and storage medium
CN112215050A (en) * 2019-06-24 2021-01-12 北京眼神智能科技有限公司 Nonlinear 3DMM face reconstruction and posture normalization method, device, medium and equipment
CN112257665A (en) * 2020-11-12 2021-01-22 腾讯科技(深圳)有限公司 Image content recognition method, image recognition model training method, and medium
CN112329745A (en) * 2021-01-04 2021-02-05 北京沃东天骏信息技术有限公司 Training method, face anti-counterfeiting detection method, related equipment and storage medium
US20210082136A1 (en) * 2018-12-04 2021-03-18 Yoti Holding Limited Extracting information from images
CN112597885A (en) * 2020-12-22 2021-04-02 北京华捷艾米科技有限公司 Face living body detection method and device, electronic equipment and computer storage medium
CN112883941A (en) * 2021-04-16 2021-06-01 哈尔滨理工大学 Facial expression recognition method based on parallel neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘芾; 李茂军; 胡建文; 肖雨荷; 齐战: "Expression Recognition Based on Low-Pixel Face Images", Laser & Optoelectronics Progress, no. 10 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114724220A (en) * 2022-04-12 2022-07-08 广州广电卓识智能科技有限公司 Living body detection method, living body detection device, and readable medium
CN115631388A (en) * 2022-12-21 2023-01-20 第六镜科技(成都)有限公司 Image classification method and device, electronic equipment and storage medium
CN115631388B (en) * 2022-12-21 2023-03-17 第六镜科技(成都)有限公司 Image classification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113221842B (en) 2023-12-29

Similar Documents

Publication Publication Date Title
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN109325954B (en) Image segmentation method and device and electronic equipment
CN110084135B (en) Face recognition method, device, computer equipment and storage medium
WO2019192121A1 (en) Dual-channel neural network model training and human face comparison method, and terminal and medium
CN110569731B (en) Face recognition method and device and electronic equipment
CN112052831B (en) Method, device and computer storage medium for face detection
CN111160313B (en) Face representation attack detection method based on LBP-VAE anomaly detection model
JP2002342756A (en) Method for detecting position of eye and mouth in digital image
CN108416291B (en) Face detection and recognition method, device and system
CN110826418B (en) Facial feature extraction method and device
CN110532746B (en) Face checking method, device, server and readable storage medium
CN113221842A (en) Model training method, image recognition method, device, equipment and medium
CN112836625A (en) Face living body detection method and device and electronic equipment
JP2019153092A (en) Position identifying device, position identifying method, and computer program
CN116798041A (en) Image recognition method and device and electronic equipment
CN116229528A (en) Living body palm vein detection method, device, equipment and storage medium
CN111723626A (en) Method, device and electronic equipment for living body detection
CN112733670B (en) Fingerprint feature extraction method and device, electronic equipment and storage medium
CN111986176B (en) Crack image identification method, system, terminal and readable storage medium
CN113807237A (en) Training of in vivo detection model, in vivo detection method, computer device, and medium
WO2020113563A1 (en) Facial image quality evaluation method, apparatus and device, and storage medium
CN112329606B (en) Living body detection method, living body detection device, electronic equipment and readable storage medium
CN116645525B (en) Game image recognition method and processing system
KR102488858B1 (en) Method, apparatus and program for digital restoration of damaged object
CN111144357B (en) Face recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 202-60, building 6, courtyard 1, gaolizhang Road, Haidian District, Beijing 100082

Applicant after: Sixth mirror technology (Beijing) Group Co.,Ltd.

Address before: Room 202-60, building 6, courtyard 1, gaolizhang Road, Haidian District, Beijing 100082

Applicant before: GLASSSIX TECHNOLOGY (BEIJING) Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant