CN118038522A - High-quality binocular face recognition model training method, system, equipment and storage medium

Info

Publication number: CN118038522A
Authority: CN (China)
Application number: CN202410170666.5A
Other languages: Chinese (zh)
Inventors: 段兴, 兰兴增, 陈晨, 林威宇, 汪博, 朱力, 吕方璐
Assignees: Chongqing Guangjian Aoshen Technology Co ltd; Shenzhen Guangjian Technology Co Ltd
Application filed by Chongqing Guangjian Aoshen Technology Co ltd and Shenzhen Guangjian Technology Co Ltd
Priority to: CN202410170666.5A
Legal status: Pending

Abstract

The invention provides a high-quality binocular face recognition model training method, system, equipment and storage medium. The method comprises the following steps: Step M1: acquiring a training sample set; Step M2: inputting the sub-partitions into different sub-networks respectively, inputting the data of the plurality of sub-networks into a sub-network fusion layer, and training the model over multiple rounds until convergence to obtain a first model; Step M3: changing the way the training sample set is divided into sub-partitions to obtain a second training sample set; Step M4: repeating steps M2-M3 until the first model converges. The invention greatly improves the model's perception of face details, combines face feature recognition with living-body authentication, and can better recognize attack scenarios in which a real face is paired with a prosthesis.

Description

High-quality binocular face recognition model training method, system, equipment and storage medium
Technical Field
The invention relates to the technical field of face recognition model training, in particular to a high-quality binocular face recognition model training method, a system, equipment and a storage medium.
Background
Samples play a critical role in deep learning models. First, training a deep learning model is a process of learning and fitting sample data: as samples are fed in, the model adjusts its internal parameters to minimize the difference between predicted and true values. The quality and quantity of the samples therefore directly affect the performance of the model. If the training samples are too few or of poor quality, the model may under-fit, resulting in poor predictions.
Second, samples have an important impact on the generalization ability of the model. A complex model may overfit the training data, meaning that it performs well on the training data but poorly on unseen test data. To avoid this, a large number of samples is required so that the model can learn the underlying structure and distribution of the data, thereby improving its generalization ability.
Furthermore, samples can be used to optimize the model. By analyzing the model's predictions on the samples, poorly performing parts of the model can be identified and then optimized, for example by increasing the weight of the corresponding samples or by retraining on those samples, to improve overall performance.
A deep learning sample typically consists of two parts: features and labels. Features are the data describing the sample, such as the pixel values of an image or the word vectors of a text. Labels are the expected outputs corresponding to the samples; in an image classification task, for example, the class of each image is its label.
Samples are typically divided into three parts: a training set, a validation set and a test set. The training set is used to train the model, the validation set to tune model parameters and select the best model structure, and the test set to evaluate the model's generalization ability. The three parts should be kept as independent as possible while sharing the same data distribution and characteristics.
In addition, the quality and quantity of the sample data are critical to training a deep learning model. If the training samples are too few or of poor quality, the model may under-fit, resulting in poor predictions. Obtaining a sufficient number of high-quality samples is therefore a key step in training a deep learning model.
However, although many data augmentation techniques exist in the prior art, it is difficult for them to produce high-quality samples, especially fake samples that carry more real-person characteristics. In actual tests, a person wearing attack glasses or similar props can pass both liveness detection and identity authentication, so existing authentication schemes are easy to crack.
In the prior art, face recognition models generally treat face recognition and liveness recognition as two independent parts that authenticate separately, so half-real-person, half-prosthesis attacks can defeat each part on its own, rendering face recognition ineffective.
Binocular or multi-view imaging is a mainstream measurement approach. However, the prior art lacks efficient, high-quality face recognition model training techniques for binocular or multi-view images.
The foregoing background is provided only to aid understanding of the inventive concepts and technical solutions of the present application. It does not necessarily constitute prior art to the present application and, absent clear evidence that the above content was disclosed before the filing date of the present application, shall not be used to assess the novelty and inventiveness of the present application.
Disclosure of Invention
Therefore, the present invention divides the training samples into different partitions so that the model can be trained in finer detail, and changes the sub-partitioning scheme to train the model repeatedly. This greatly improves the model's perception of face details and, by combining face feature recognition with living-body authentication, allows attack scenarios in which a real face is paired with a prosthesis to be better recognized.
In a first aspect, the present invention provides a training method for a high-quality binocular face recognition model, which is characterized by comprising:
Step M1: acquiring a training sample set, wherein the training sample set comprises positive sample pairs and negative sample pairs; a positive sample pair is two face images captured by a binocular camera, and a negative sample pair is two face images at angles similar to those of a positive sample pair; each face image comprises a plurality of sub-partitions, which are respectively assigned different labels;
Step M2: inputting the sub-partitions into different sub-networks respectively, inputting the data of the plurality of sub-networks into a sub-network fusion layer, and training the model over multiple rounds until convergence to obtain a first model;
Step M3: changing the way the training sample set is divided into sub-partitions to obtain a second training sample set;
Step M4: repeating steps M2-M3 until the first model converges.
Optionally, in the high-quality binocular face recognition model training method, when the sub-partitioning of a negative sample pair is changed in step M3, different labels are reassigned according to the proportion occupied by the real face.
Optionally, in the high-quality binocular face recognition model training method, the different face images in a positive sample pair or a negative sample pair are partitioned in the same way.
Optionally, in the high-quality binocular face recognition model training method, the step of generating a negative sample pair comprises:
Step S1: acquiring a plurality of real face image pairs and a plurality of prosthesis images; a real face image pair comprises two images of the same target object taken from different angles at the same moment; a prosthesis image is an image comprising at least one of a 2D paper face, a 3D mask and a head model;
Step S2: normalizing the real face image pairs and the prosthesis images according to face size so that the face sizes are the same;
Step S3: determining a first region on a prosthesis image, determining the regions corresponding to the first region in the different images of a real face image pair, and replacing those regions with the corresponding content of the prosthesis image to obtain a first generated image pair;
Step S4: determining a second region on a real face image pair, and replacing the second region with the corresponding content of another real face image to obtain a second generated image pair;
Step S5: dividing the first generated image pair and the second generated image pair into a plurality of sub-regions according to a preset rule, and assigning different labels according to the proportion of real face content in each sub-region.
Optionally, in the high-quality binocular face recognition model training method, the normalization in step S2 comprises the following steps:
step S21: identifying a left eye center, a right eye center and a mouth corner center;
step S22: calculating the area of a triangle formed by the left eye center, the right eye center and the mouth corner center;
step S23: and adjusting the sizes of the real face image and the prosthesis image so that the area of the triangle is a first preset value.
Optionally, in the high-quality binocular face recognition model training method, step S3 comprises:
Step S31: obtaining a first region of random area and random location on the prosthesis image;
Step S32: determining a first positional relationship between the first region and the left eye center, right eye center or mouth corner center on the prosthesis image;
Step S33: on the real face image pair, determining the corresponding region on each real face image according to the left eye center, right eye center, mouth corner center and the first positional relationship;
Step S34: replacing the content of the corresponding regions with the content of the first region to obtain a first generated image pair.
Optionally, in the high-quality binocular face recognition model training method, step S5 comprises:
Step S51: dividing the first generated image pair and the second generated image pair into a plurality of sub-regions according to the face key points, so that each sub-region contains at most one face key point;
Step S52: if a sub-region contains a face key point and the area of the face key part where that key point is located occupies less than a preset proportion of the sub-region, dividing the sub-region further;
Step S53: assigning different labels according to the proportion of real face content in each sub-region.
In a second aspect, the present invention provides a high-quality binocular face recognition model training system for implementing any one of the foregoing high-quality binocular face recognition model training methods, characterized by comprising:
an acquisition module for acquiring a training sample set, wherein the training sample set comprises positive sample pairs and negative sample pairs; a positive sample pair is two face images captured by a binocular camera, and a negative sample pair is two face images at angles similar to those of a positive sample pair; each face image comprises a plurality of sub-partitions, which are respectively assigned different labels;
a training module for inputting the sub-partitions into different sub-networks respectively, inputting the data of the plurality of sub-networks into a sub-network fusion layer, and training the model over multiple rounds until convergence to obtain a first model;
a sub-partition module for changing the way the training sample set is divided into sub-partitions to obtain a second training sample set; and
a circulation module for repeatedly executing the training module and the sub-partition module until the first model converges.
In a third aspect, the present invention provides a high-quality binocular face recognition model training apparatus, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
Wherein the processor is configured to perform the steps of the high quality binocular face recognition model training method of any one of the preceding claims via execution of the executable instructions.
In a fourth aspect, the present invention provides a computer readable storage medium storing a program, wherein the program when executed implements the steps of the high quality binocular face recognition model training method of any one of the foregoing.
Compared with the prior art, the invention has the following beneficial effects:
By dividing the training samples into different partitions, the invention trains the model in finer detail, improving its sensitivity to local face features and to the relationships between them, so that attack scenarios in which a real face is paired with a prosthesis can be well recognized.
By changing the sub-partitioning scheme and training repeatedly, the invention greatly improves the model's ability to discriminate partitions at different positions and of different sizes, and improves its perception of the details of different face parts; combining the strengths of face feature recognition and living-body authentication allows attack scenarios pairing a real face with a prosthesis to be better recognized.
By keeping the angles and other conditions of the positive and negative sample pairs consistent, the invention makes the deep learning model's learning of sample details more effective, filters out interference from regions outside the face, and improves learning efficiency and sample detection accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention and the technical solutions in the prior art, the drawings required by the embodiments are briefly described below. The drawings in the following description show only embodiments of the present invention; other drawings can be derived from them by a person skilled in the art without inventive effort. Other features, objects and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments given with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of steps of a training method for a high-quality binocular face recognition model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a sub-region according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a negative sample pair generation step according to an embodiment of the present invention;
FIG. 4 is a schematic view of a first region according to an embodiment of the present invention;
FIG. 5 is a schematic view of a second region according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating steps for normalization according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating steps for obtaining a first generated image pair according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating another exemplary step of obtaining a first generated image pair according to an embodiment of the present invention;
FIG. 9 is a flowchart illustrating steps for obtaining a second generated image pair according to an embodiment of the present invention;
FIG. 10 is a flowchart illustrating another set of steps for obtaining a second generated image pair according to an embodiment of the present invention;
FIG. 11 is a flowchart illustrating a step of assigning labels to different sub-regions in accordance with an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a training system for a high-quality binocular face recognition model according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a high-quality binocular face recognition model training device according to an embodiment of the present invention; and
FIG. 14 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments of the invention provide a high-quality binocular face recognition model training method for training a deep model, aiming to solve the problems existing in the prior art.
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
According to the invention, dividing the training samples into different partitions allows the model to be trained in finer detail, and changing the sub-partitioning scheme and training repeatedly greatly improves the model's perception of face details; combining face feature recognition with living-body authentication allows attack scenarios in which a real face is paired with a prosthesis to be better recognized.
Fig. 1 is a flowchart of steps of a training method for a high-quality binocular face recognition model according to an embodiment of the present invention. As shown in fig. 1, the method for training a high-quality binocular face recognition model in the embodiment of the invention includes the following steps:
step M1: a training sample set is obtained.
In this step, the training sample set comprises positive sample pairs and negative sample pairs; a positive sample pair is two face images captured by a binocular camera, and a negative sample pair is two face images at angles similar to those of a positive sample pair; each face image comprises a plurality of sub-partitions, which are respectively assigned different labels. Both positive and negative sample pairs are two-dimensional images, such as RGB images or infrared images. A positive sample pair is likewise divided into a plurality of sub-partitions, except that all of its labels are positive, whereas the labels of the sub-partitions in a negative sample pair are not all identical. In some embodiments, the label of each sub-partition of a negative sample is simply true or false; in other embodiments, it records the proportion of positive versus negative content in the sub-partition. The areas of different sub-partitions may be the same or different. Positive and negative sample pairs are size-normalized images, allowing the model to learn details better. A sample is a negative pair as soon as any one of its sub-partitions is false. FIG. 2 illustrates one way of dividing the sub-partitions; it is merely exemplary, and those skilled in the art will appreciate that many other divisions are possible. The label of a sub-partition includes whether the sample is true or false, and the number of the sub-partition. The total area occupied by the sub-partitions may be 100% of the face region, or less than 100%.
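As an illustration, the following minimal Python sketch shows one way such a labelled sample pair might be represented in code; the field names, the per-partition real_ratio field and the is_positive helper are assumptions for illustration, not a data format prescribed by this disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class SubPartition:
    index: int                       # number of the sub-partition within the face image
    bbox: Tuple[int, int, int, int]  # (x, y, w, h) inside the normalized face image
    real_ratio: float                # proportion of real-face content; 1.0 throughout a positive pair

@dataclass
class SamplePair:
    image_a: np.ndarray              # image from one camera of the binocular rig
    image_b: np.ndarray              # image from the other camera; same partition layout
    partitions: List[SubPartition]   # one layout, applied to both images of the pair

def is_positive(pair: SamplePair) -> bool:
    # A sample is a negative pair as soon as any sub-partition contains prosthetic content.
    return all(p.real_ratio == 1.0 for p in pair.partitions)
```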
Step M2: and respectively inputting the sub-partitions into different sub-networks, inputting the data of a plurality of sub-networks into a sub-network fusion layer, and performing multiple rounds of training on the model until convergence to obtain a first model.
In this step, the model has a plurality of sub-networks, and the number of sub-networks is not less than the number of sub-partitions. Each sub-partition is input into one sub-network, and the outputs of the sub-networks are input into a sub-network fusion layer. For example, if sample A has five sub-partitions a, b, c, d and e, five sub-networks receive the image information of these five sub-partitions together with their respective labels. When the total area of the sub-partitions is smaller than the face region, the total area of the images reaching the sub-network fusion layer is smaller than the total face area in the face image; the images in the sub-network fusion layer must still contain all key-point regions of the face, including the left eye, right eye, nose, mouth and so on, i.e. the sub-partitions include the face key-point regions. The model is iteratively trained on these training samples until its parameters converge, yielding the first model.
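As a concrete illustration, the following is a minimal PyTorch sketch of such a multi-branch architecture; the branch depth, feature dimensions and concatenation-based fusion are illustrative assumptions, since the description above only requires one sub-network per sub-partition feeding a fusion layer.

```python
import torch
import torch.nn as nn

class SubNet(nn.Module):
    """One branch; receives the image crop for a single sub-partition."""
    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, out_dim),
        )

    def forward(self, x):
        return self.body(x)

class FusionModel(nn.Module):
    """n_parts sub-networks whose outputs feed one sub-network fusion layer."""
    def __init__(self, n_parts: int, feat_dim: int = 64):
        super().__init__()
        self.branches = nn.ModuleList([SubNet(feat_dim) for _ in range(n_parts)])
        self.fusion = nn.Sequential(
            nn.Linear(n_parts * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),  # e.g. a real-vs-prosthesis score for the whole sample
        )

    def forward(self, crops):  # crops: list of n_parts tensors, each (B, 3, H, W)
        feats = [branch(c) for branch, c in zip(self.branches, crops)]
        return self.fusion(torch.cat(feats, dim=1))
```

For the example above, sample A with five sub-partitions a, b, c, d, e would use n_parts = 5, with crops holding the five cropped sub-partition tensors.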
Step M3: and changing the partition mode of the sub-partition in the training sample set to obtain a second training sample set.
In this step, the sub-partitioning is changed so that the sub-partitions cover different areas. For a positive sample pair, only the extent of each sub-partition needs to change; the labels remain the same. For a negative sample pair, when the sub-partitions change, their labels may change accordingly. With binary true/false labels, if sub-partition a is true and sub-partition b is false, then a new sub-partition formed by merging a and b is labelled false. With labels that record the proportion of positive or negative content, the proportion for the new sub-partition must be recomputed by weighting each constituent region by its area. The partitioning can be changed in various ways: for example, a false sub-partition can be merged with part of a true sub-partition into a new sub-partition, or a true sub-partition can be split into different sub-partitions. Any re-partitioning scheme is possible as long as the changed sub-partitions can still be labelled.
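For the proportion-style labels, the label of a merged sub-partition can be recomputed by area weighting, as in the following sketch (the data layout is an assumption for illustration):

```python
def merged_real_ratio(parts):
    """parts: iterable of (area, real_ratio) tuples for the sub-partitions being merged.
    Returns the area-weighted proportion of real-face content in the new sub-partition."""
    total_area = sum(area for area, _ in parts)
    return sum(area * ratio for area, ratio in parts) / total_area

# For binary true/false labels the rule degenerates to: the merged sub-partition
# is false if any merged member is false.
print(merged_real_ratio([(100, 0.0), (200, 1.0)]))  # fake region + real region twice its size -> 0.666...
```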
Step M4: repeating steps M2-M3 until the first model converges.
In this step, the model is retrained on the second training sample set obtained after changing the partitioning. This process is repeated until, after retraining, the change in the model parameters falls within a preset range, at which point the first model is judged to have converged. Note that the convergence criterion in this step differs from that in step M2. In step M2, convergence means that the model parameters change little over multiple training iterations on the same training sample set. In this step, the first model converges when its final parameters change little across training on different training sample sets. In general, the convergence range in this step is larger than that in step M2; that is, the threshold on parameter variation in this step is larger than the threshold in step M2.
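The two-level loop and its two convergence thresholds might look like the following sketch; the L2 parameter-distance criterion and the specific threshold values are assumptions, one plausible reading of "parameter change within a preset range".

```python
import copy
import torch

def param_distance(model_a, model_b):
    # Total L2 distance between two parameter snapshots of the same architecture.
    return sum((pa - pb).norm().item()
               for pa, pb in zip(model_a.parameters(), model_b.parameters()))

def train_with_repartitioning(model, make_sample_set, train_until_converged,
                              inner_eps=1e-4, outer_eps=1e-2, max_rounds=10):
    samples = make_sample_set()                            # step M1
    for _ in range(max_rounds):
        snapshot = copy.deepcopy(model)
        train_until_converged(model, samples, inner_eps)   # step M2: tight threshold
        samples = make_sample_set()                        # step M3: new sub-partitioning
        if param_distance(model, snapshot) < outer_eps:    # step M4: looser threshold
            break
    return model
```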
In some embodiments, when the sub-partitioning of a negative sample pair is changed in step M3, different labels are reassigned according to the proportion occupied by the real face. Identifying the proportion of the real face within each sub-partition allows prostheses to be recognized more accurately and is friendlier to the deep learning model. Re-dividing the sub-partitions is also more flexible, and the training sample set can be enriched by increasing the frequency of partition changes.
In some embodiments, the sub-partitions of a negative sample pair are changed by, for example, merging the sub-partitions corresponding to different face key parts. Operating on the sub-partitions corresponding to different face key parts further improves the model's holistic recognition of the face key parts; cross-training this with the finer-grained sub-partition models gives the model better recognition of features at different levels.
In some embodiments, the different face images in a positive sample pair or a negative sample pair are partitioned in the same way. Within a given positive or negative sample pair, the two face images share the same partitioning, so the model can learn from the correspondence between the binocular images. This improves learning from the binocular images and yields more accurate results than a monocular image.
FIG. 3 is a flowchart illustrating a negative pair of samples generating step according to an embodiment of the present invention. As shown in fig. 3, the step of generating a negative sample pair in the embodiment of the present invention includes:
step S1: a plurality of pairs of real face images and a plurality of prosthetic images are acquired.
In this step, a real face image pair comprises two images of the same target object taken from different angles at the same moment. A prosthesis image is an image comprising at least one of a 2D paper face, a 3D mask and a head model. A real face image pair is a set of binocular images of the same target object. These images may be obtained from a public face database or by shooting. The prosthesis image may show a 2D paper face, a 3D mask, a head model and the like, simulating different attack modes. The real face image pairs form a set of real faces, and the prosthesis images form a set of prosthetic faces. The images obtained in this step are two-dimensional images such as RGB images or infrared images. A real face image pair can be two RGB images, two infrared images, or one RGB image and one infrared image.
Step S2: normalizing the real face image pair and the prosthesis image according to the face size so as to enable the face size to be the same.
This step ensures that the generated fake samples have the same size as the real samples, which facilitates subsequent processing. Since different images may differ in size, angle and so on, the sizes of the faces in them may also differ. When processing a real face image pair, every image in the pair is processed. This step normalizes the face sizes in the real face image pairs and the prosthesis images so that they remain the same size. Normalizing face sizes keeps the faces in different images at the same size, which makes the generated fake samples more plausible, better models attack scenarios combining a real person with a prosthesis, and makes the face recognition model easier to deceive. Fake samples generated in the prior art suffer from inconsistent sizes, are easy to distinguish, and differ too much from actual prosthesis attacks. By normalizing the real face images and the prosthesis images, this step makes the generated fake samples more realistic and of higher quality.
Step S3: and determining a first area on the prosthesis image, respectively determining areas corresponding to the first area of different images in the real face image pair, and replacing the areas with the contents corresponding to the prosthesis image to obtain a first generated image pair.
In this step, the first region may be a region containing the eyes, nose, mouth or the like; replacing it with the content of the prosthesis image can simulate different attack modes. As shown in fig. 4, the dotted portion is the first region. Fig. 4 is merely exemplary, and those skilled in the art will appreciate that the first region may take any other shape. Because the real face images and the prosthesis image were normalized in step S2, unified coordinates can be established on them and the correspondence between different images can be determined, so that pixel replacement can be performed. In determining the first region, a region is randomly chosen within the face of the prosthesis image; its size and extent are not limited. Preferably, the first region contains at least one face key point. For example, a coordinate system is established on the prosthesis image, with the midpoint of the line connecting the left and right eyes as the origin, the line connecting the two eyes as the x-axis, and the line from the origin to the mouth corner center as the y-axis; the outline M of the first region on the prosthesis image can then be determined. A coordinate system with the same convention is established on each real face image, the replacement content is located using the outline M, and the content of the first region on the real face image is replaced, yielding a first generated image pair. Since the two or more images in the first generated image pair differ in viewing angle, their visual effects also differ after replacement with the content of the first region of the prosthesis image.
Although the synthesized first generated image pair exhibits some discontinuity at the replacement, no secondary resizing of the replacement content is required. This is both because the size in the prosthesis image better matches actual use, and because resizing the replacement content would distort it, making recognition meaningless. A model trained with the first generated images obtained in this step recognizes real-person-plus-prosthesis attacks well.
Step S4: and determining a second area on the real face image pair, and replacing the second area on the real face image pair with contents corresponding to other real face images to obtain a second generated image pair.
In this step, the second region may be a region containing the eyes, nose, mouth or the like; replacing it with content from another real face image can simulate different attack modes. As shown in fig. 5, the dotted portion is the second region. Fig. 5 is merely exemplary, and those skilled in the art will appreciate that the second region may take any other shape. The first region and the second region may be the same or different; both are randomly selected. Unlike step S3, this step uses another real face image to replace content in the current real face image, so that every part of the second generated image is a real face, yet the image as a whole is a fake sample. A model trained with the second generated images recognizes well attacks in which a real person wears a highly imitative local prosthesis. When replacing, the corresponding real face image is used: for a binocular camera with two cameras A and B, the image from camera A is replaced with content from another camera-A image, and the image from camera B with content from another camera-B image.
Step S5: dividing the first generated image pair and the second generated image pair into a plurality of subareas according to a preset rule, and giving different labels according to the real face image proportion in the subareas.
In this step, different labels are assigned to the sub-regions. Unlike the prior art, where a sample carries only one label, this step assigns two levels of labels to the first and second generated image pairs. The first-level label marks whether the image as a whole is a true or a false sample; the second-level labels mark whether each sub-region is a true or a false sample. Labels can be assigned in two ways. The first sets two separate labels: the first label identifies whether the image is a positive or negative sample, and each image has exactly one such label; the second label identifies whether a sub-region is a real face or a prosthetic face. The second way sets a group of two-dimensional labels, one per sub-region, such as (x, y) labels, where x indicates whether the image is a positive or negative sample and y indicates whether the current sub-region is a real face or a prosthetic face.
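The two-dimensional scheme might be encoded as in this small sketch (the 0/1 encoding is an assumption for illustration):

```python
# One (x, y) label per sub-region: x flags the whole image as a positive (1)
# or negative (0) sample; y flags the sub-region as real face (1) or prosthesis (0).
subregion_labels = [
    (0, 1),  # negative sample overall, but this sub-region is still a real face
    (0, 0),  # negative sample overall, and this sub-region is prosthetic content
]
```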
Unlike the first generated image pair, the second generated image pair contains no prosthetic-face image, so it must be determined which part of the face counts as the real face when dividing the sub-regions. To let the model identify samples accurately during training, the ratio a of the second region's area to the real face area is calculated in the second generated image pair. When a is smaller than a preset threshold, the original real face image is treated as the real face; otherwise, the other real face image is treated as the real face.
In the prior art, deep learning face recognition models learn face characteristics by extracting features from face images. Because an attack pairing a real person with a prosthesis exhibits both real-person characteristics and simulated-face characteristics, it can easily defeat both deep learning liveness detection and face recognition models. This embodiment generates a large number of samples of real-face-plus-prosthesis attacks, so that during model training the labels of all sub-regions can be exploited and the model can learn more characteristics of each sub-region and of the relationships between sub-regions, improving the effective recognition of half-real-person, half-prosthesis attacks.
In some embodiments, the method for generating fake sample pairs based on real faces further includes:
Step S6: adjusting the brightness, hue and the like of the first region in the first generated image pair to strengthen or weaken the contrast between the first region and the other parts.
Step S7: adjusting the brightness, hue and the like of the second region in the second generated image pair to strengthen or weaken the contrast between the second region and the other parts.
By adjusting the brightness, hue and the like of the first and second regions, this embodiment changes their relationship to the other regions, generating multiple first and second generated image pairs with different contrast relationships. When the model learns from this series of images, it can identify the first and second regions better, improving recognition of half-real-person, half-prosthesis attacks.
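A minimal OpenCV sketch of such an adjustment on a replaced region follows; the adjustment amounts and the BGR/HSV pipeline are illustrative assumptions.

```python
import cv2
import numpy as np

def jitter_region(image, bbox, brightness=20, hue_shift=5):
    """Adjust brightness and hue inside bbox=(x, y, w, h) to strengthen or
    weaken the contrast between the replaced region and the rest of the image."""
    x, y, w, h = bbox
    patch = image[y:y + h, x:x + w]
    hsv = cv2.cvtColor(patch, cv2.COLOR_BGR2HSV).astype(np.int16)
    hsv[..., 0] = (hsv[..., 0] + hue_shift) % 180             # hue channel wraps at 180
    hsv[..., 2] = np.clip(hsv[..., 2] + brightness, 0, 255)   # value channel ~ brightness
    image[y:y + h, x:x + w] = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
    return image
```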
FIG. 6 is a flowchart illustrating steps for normalization according to an embodiment of the present invention. As shown in fig. 6, the steps for normalization in the embodiment of the present invention include:
step S21: the left eye center, the right eye center, and the mouth corner center are identified.
In this step, computer vision techniques are required to identify the left eye center, right eye center, and mouth angle center in the real face image and the prosthetic image. This can be achieved by the following steps:
s211: face detection algorithms are used to detect face regions in both the real face image and the prosthetic image. There are a variety of face detection algorithms, such as Haar cascade classifiers or deep learning models.
S212: the left eye center, right eye center, and mouth corner center are identified using a keypoint detection algorithm of the eyes and mouth. The keypoint detection algorithm may employ any of a variety of keypoint detectors, such as 68 in a Dlib library. These key points typically include the inner and outer corners of the eye, the corner points of the upper and lower eyelids, the upper and lower lip corner points of the mouth, and the like.
S213: and saving the position information of the key points of the human face so as to be used in the subsequent calculation of the triangle area.
Step S22: and calculating the area of a triangle formed by the left eye center, the right eye center and the mouth corner center.
In this step, the areas of triangles formed by the left eye center, the right eye center, and the mouth angle center identified in the previous step are calculated based on the coordinate information of these centers. This can be achieved by the following steps:
Step S221: the distance between the center of the left eye and the center of the right eye, and the distance between the center of the mouth angle and the center of the left eye or the center of the right eye are calculated using a distance formula between the two points.
Step S222: the area of the triangle is calculated using the halen formula. The equation for sea is as follows: a=sqrt [ s (s-a) (s-b) (s-c) ]
Wherein A represents the area of the triangle, a, b and c represent the lengths of three sides of the triangle respectively, and s represents the half perimeter, namely (a+b+c)/2.
Step S223: and judging whether the sizes of the real face image and the prosthesis image are proper or not according to the calculated triangle area. If the triangle area is smaller than a first preset value, the sizes of the real face image and the prosthesis image are not proper, and scaling is needed; if the triangle area is larger than the first preset value, the sizes of the real face image and the prosthesis image are proper, and scaling is not needed.
Step S23: and adjusting the sizes of the real face image and the prosthesis image so that the area of the triangle is a first preset value.
In this step, the sizes of the real face image and the prosthetic image are adjusted according to the triangle area calculated in the previous step, so that the triangle area reaches a first preset value. This can be achieved by the following steps:
Step S231: the scale factor that needs to be scaled is calculated. This may be obtained by dividing the triangle area by a first preset value. For example, if the triangle area is 100 square units and the first preset value is 50 square units, the scaling factor required for scaling is 2.
Step S232: the true face image and the prosthetic image are scaled using the resize function in an image processing library (e.g., openCV). When invoking the restore function, we need to pass in the scaling factor as a parameter. For example, if it is desired to reduce the size of both the real face image and the prosthetic image by half, the scaling factor may be set to 0.5.
Normalizing images by the area of the triangle formed by the left eye center, right eye center and mouth corner center allows faces of different shapes, sizes and angles to be normalized to images of similar size, so that no serious deviation arises during the subsequent direct replacement, improving the quality of the fake samples.
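Putting steps S21-S23 together, a sketch of the normalization is shown below; Dlib's 68-point landmark model, the landmark index ranges, and the target area value are assumptions consistent with the description, not mandated by it.

```python
import math
import cv2
import dlib

# Assumed landmark model; any predictor with Dlib's 68-point indexing works here.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def midpoint(pts, idxs):
    xs = [pts.part(i).x for i in idxs]
    ys = [pts.part(i).y for i in idxs]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def normalize_face(image, target_area=5000.0):
    """Scale the image so the eye-eye-mouth triangle has the first preset
    area (target_area). Assumes one detectable face in the image."""
    face = detector(image, 1)[0]
    pts = predictor(image, face)
    left_eye = midpoint(pts, range(36, 42))   # left-eye landmarks (68-point scheme)
    right_eye = midpoint(pts, range(42, 48))  # right-eye landmarks
    mouth = midpoint(pts, (48, 54))           # the two mouth corners
    a = dist(left_eye, right_eye)
    b = dist(right_eye, mouth)
    c = dist(mouth, left_eye)
    s = (a + b + c) / 2
    area = math.sqrt(s * (s - a) * (s - b) * (s - c))  # Heron's formula (step S22)
    k = math.sqrt(target_area / area)  # area scales with the square of lengths
    return cv2.resize(image, None, fx=k, fy=k)         # step S23
```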
FIG. 7 is a flowchart illustrating steps for obtaining a first generated image pair according to an embodiment of the present invention. As shown in fig. 7, the step of obtaining a first generated image pair in an embodiment of the present invention includes:
step S31: a first region of random area, random location is obtained on the prosthetic image.
In this step, a region of random area and random position is selected on the prosthesis image. In some embodiments, the first region includes at least one face key point. Face key points include the left eye, right eye, nose, mouth and so on, and may follow any landmark convention, such as the 68-point or 96-point schemes. Randomly selecting the first region increases the richness of the samples, so that the generated fake samples are more broadly representative.
Step S32: and determining a first position relation between the first area and the center of the left eye, the center of the right eye or the center of the mouth angle on the prosthesis image.
In this step, it is necessary to determine the positional relationship of the first region with the center of the left eye, the center of the right eye, or the center of the mouth angle on the prosthetic image. This may be achieved by calculating the distance between the boundary of the first region and these keypoints, or by establishing a coordinate system with the keypoints, thereby determining the boundary of the first region. The position relationship can be determined by only adopting one of the left eye center, the right eye center or the mouth corner center.
Step S33: and on the real face image pair, respectively determining corresponding areas on each real face image according to the left eye center, the right eye center, the mouth corner center and the first position relation.
In this step, the corresponding region on each real face image is determined from the positional relationship, calculated in step S32, between the first region and the key point on the prosthesis image. In other words, a region must be found on the real face image pair that has the same positional relationship as the first region; this ensures that, in the subsequent step, the content of the prosthesis image is fused to the correct position. Using the positional relationship between the first region and the key points, rather than between the first region and the whole face image, locates the first region more accurately and better matches how face recognition models operate on key points, improving model friendliness. Only one of the left eye center, right eye center or mouth corner center, together with the first positional relationship, is needed for the calculation: for example, if the first positional relationship was computed relative to the left eye center, this step needs only the left eye center and that relationship. Because the positions of the face key points differ between the real face images of a pair, the corresponding regions obtained on the different images also differ.
Step S34: and replacing the content of the corresponding region with the content of the first region to obtain a first generated image pair.
In this step, the content of the previously determined corresponding regions on the real face image pair is replaced with the content of the first region of the prosthesis image. This yields a new generated image pair composed of real face images and prosthesis-image content.
In this embodiment, a randomly obtained first region is located via its first positional relationship to the key points, the regions of the real face image pair corresponding to the prosthesis image are determined, and an accurate face-position replacement is performed, generating random yet accurate half-real, half-prosthesis photos.
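A sketch of steps S32-S34, expressing the first region as an offset from a single anchor key point (using one of the three centers, as noted above); images are assumed to be OpenCV/NumPy arrays already normalized so that the region stays within bounds.

```python
def transfer_region(prosthesis_img, real_imgs, bbox, anchor_p, anchors_r):
    """bbox: (x, y, w, h) of the first region on the prosthesis image.
    anchor_p: (x, y) of e.g. the left-eye center on the prosthesis image.
    anchors_r: the same key point on each real face image of the pair.
    Returns copies of the real images with the corresponding regions replaced."""
    x, y, w, h = bbox
    dx, dy = x - anchor_p[0], y - anchor_p[1]   # the first positional relationship
    patch = prosthesis_img[y:y + h, x:x + w]
    generated = []
    for img, (ax, ay) in zip(real_imgs, anchors_r):
        rx, ry = int(ax + dx), int(ay + dy)     # corresponding region on this image
        out = img.copy()
        out[ry:ry + h, rx:rx + w] = patch       # replace; no secondary resizing
        generated.append(out)
    return generated
```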
FIG. 8 is a flowchart illustrating another exemplary process for obtaining a first generated image pair according to an embodiment of the present invention. As shown in fig. 8, another process of obtaining a first generated image pair according to an embodiment of the present invention includes:
Step S35: acquiring the face key points of the prosthesis image, and randomly expanding outward from at least one face key point to obtain the first region.
In this step, expanding outward from the face key points makes the resulting first region friendlier to the face recognition model.
Step S36: and determining a second position relation between the first region and the face key point.
In this step, the second positional relationship is obtained using the face key points in step S35 as a reference.
Step S37: and acquiring the face key points of the real face image pairs, and respectively determining corresponding areas on each real face image according to the second position relation.
In this step, the face key points on the real face image pair and the second positional relationship are used to obtain the corresponding region on each real face image, i.e. the part corresponding to the first region.
Step S38: and replacing the content of the corresponding region with the content of the first region to obtain a first generated image pair.
In this embodiment, the first region is obtained by expanding outward from the face key points; the second positional relationship between the first region and the key points is then determined, the key points on the real face images are identified, the corresponding regions on the real face image pair are calculated from the second positional relationship, and their content is replaced with that of the first region, yielding random yet accurate half-real, half-prosthesis photos.
Fig. 9 is a flowchart illustrating steps for obtaining a second generated image pair according to an embodiment of the present invention. As shown in fig. 9, in an embodiment of the present invention, the step of obtaining the second generated image pair includes:
step S41: and obtaining a second area with random area and random position on the real face image pair.
In this step, the processing of the pair of real face images is to process two or more real face images thereof respectively. The process flow is the same as step S31, and will not be described here again.
Step S42: and determining a third position relation between the second region and the key region of the face on the pair of real face images.
In this step, a face key region is the smallest rectangular region surrounding a face key part, i.e. the smallest rectangle containing all the key points of the corresponding part. For example, if the left eye has 6 key points, its face key region is the smallest rectangle enclosing those 6 key points. If the second region covers at least two key parts, any one of them may be taken. Because the face key region is a rectangle with four corner points, the second region is divided into four parts, each corresponding to its nearest corner point, and the third positional relationship is determined accordingly.
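As an illustration, the smallest rectangle around a key part's landmarks can be computed as follows (a sketch; the 68-point index ranges are assumptions):

```python
import cv2
import numpy as np

def face_key_region(landmarks, idxs):
    """landmarks: (68, 2) array of facial landmark coordinates.
    idxs: indices of one key part, e.g. range(36, 42) for the left eye.
    Returns (x, y, w, h): the smallest rectangle enclosing those landmarks."""
    pts = np.asarray([landmarks[i] for i in idxs], dtype=np.int32)
    return cv2.boundingRect(pts)
```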
Step S43: and acquiring face key areas of other real face image pairs, and determining corresponding areas on the other real face images according to the third position relation.
In the step, the face key areas of other real face image pairs are identified, and then the corresponding areas on the real face images are obtained according to the third position relation.
Step S44: and replacing the content of the corresponding region with the content of the second region to obtain a second generated image pair.
In this step, when the areas of the second region and the corresponding region are not exactly equal, the partial region of the real face is replaced using the corresponding region as the reference, keeping the positions of the face key regions unchanged. Because the angles and other conditions of a real face image and the other real face images differ, the positions and areas of the face key regions also differ between them. When replacing, the corresponding real face image is used: for a binocular camera with two cameras A and B, the image from camera A is replaced with content from another camera-A image, and the image from camera B with content from another camera-B image.
This embodiment determines the third positional relationship using the face key regions, taking into account the shape of the face key parts, so that the choice of corresponding region remains stable and accurate even when differences in angle, face shape and so on between the real face image and the other real face images introduce subtle discrepancies.
FIG. 10 is a flowchart illustrating another step of obtaining a second image pair according to an embodiment of the present invention. As shown in fig. 10, another step of obtaining a second generated image pair according to an embodiment of the present invention includes:
Step S45: and taking a face key region of the real face image pair, and randomly expanding the face key region to the periphery by at least one face key region to obtain a second region.
In this step, the face key region refers to a minimum rectangular region surrounding the face key portion. Compared with the previous embodiment, the second region in this embodiment includes at least one complete face key region.
Step S46: determining a fourth position relation between the second region and a key region of the face on the pair of real face images;
Step S47: acquiring face key areas of other real face image pairs, and determining corresponding areas on the other real face image pairs according to the fourth position relation;
step S48: and replacing the content of the corresponding region with the content of the second region to obtain a second generated image pair.
In this embodiment, the second region is determined by expanding the face key region outward, so that the face key part and its surrounding area are better included, the integrity of the face key part is preserved, and the second generated image pair is more complete.
FIG. 11 is a flowchart of a step of assigning labels to different sub-regions in an embodiment of the present invention. As shown in fig. 11, the step of assigning labels to different sub-areas in the embodiment of the present invention includes:
Step S51: dividing the first generated image pair and the second generated image pair into a plurality of subareas according to the face key points, so that each subarea has at most one face key point.
In this step, the generated images are divided into a plurality of sub-regions so that they can be labelled more finely. A face detection algorithm or predefined face key-point positions may be used to divide an image into sub-regions, each containing at most one face key point. Together, the sub-regions make up the image region.
Step S52: and if the face key points exist in the subareas and the proportion of the areas of the face key parts where the face key points are located to the subareas is smaller than a preset value, the subareas are further divided.
In the step, for the subareas containing the face key points, whether the proportion of the area of the face key parts where the face key points are located to the whole subareas is smaller than a preset value or not is checked. The key parts of the human face are left eye, right eye, nose and mouth. If the ratio is smaller than the preset value, this sub-area is indicated to be too large, and further partitioning is required to obtain a more accurate result. For example, in a subarea where the left eye is located, the left eye area is 34% and the preset left eye area is 50%, the subarea needs to be further subdivided so that the left eye area is not less than 50%. The step can ensure the area occupation ratio of key parts of the face and improve the precision. Because the diversity of the face image, such as long hair, shielding, wearing glasses, hats, etc., can affect the face area, the division in step S51 can only be performed roughly, but the step is further refined, and the division accuracy of the subareas is improved.
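A sketch of this refinement test follows; shrinking the sub-region toward the key part by halving the margins is one simple reading of "further divided", not the only possible scheme.

```python
def refine_subregion(sub_bbox, part_bbox, min_ratio=0.5):
    """sub_bbox: (x, y, w, h) of the sub-region; part_bbox: (x, y, w, h) of the
    face key part inside it. While the key part occupies less than min_ratio
    of the sub-region, shrink the sub-region toward the key part by halving
    the margins on each side."""
    sx, sy, sw, sh = sub_bbox
    px, py, pw, ph = part_bbox
    while pw * ph < min_ratio * sw * sh:
        right = (sx + sw + px + pw) // 2   # new right edge: midpoint of the two right edges
        bottom = (sy + sh + py + ph) // 2  # new bottom edge, likewise
        sx, sy = (sx + px) // 2, (sy + py) // 2
        sw, sh = right - sx, bottom - sy
    return (sx, sy, sw, sh)

# Example: a 100x100 sub-region whose key part is a 20x20 left-eye box.
print(refine_subregion((0, 0, 100, 100), (40, 40, 20, 20)))  # -> (37, 37, 25, 25)
```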
Step S53: and different labels are given according to the real face image duty ratio in the subarea.
In this step, a different label is assigned to each sub-region according to the proportion of real face image content in it. This can be done by calculating the ratio of real-face pixels to the total number of pixels in each sub-region. For example, the labels of all sub-regions may start at 0, with the label value increased gradually with the proportion of real face content: sub-regions with a higher real-face proportion receive higher label values, and sub-regions with a lower proportion receive lower label values.
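A sketch of deriving such a label from the pixel proportion (the boolean provenance mask and the quantization into discrete label values are illustrative assumptions):

```python
import numpy as np

def subregion_label(real_mask, bbox, levels=4):
    """real_mask: boolean array, True where a pixel comes from a real face.
    bbox: (x, y, w, h) of the sub-region. Returns an integer label that grows
    with the proportion of real-face pixels (0 = fully prosthetic content)."""
    x, y, w, h = bbox
    ratio = real_mask[y:y + h, x:x + w].mean()
    return int(round(ratio * (levels - 1)))
```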
By dividing the first and second generated image pairs into a plurality of sub-regions and assigning each sub-region its own label, this embodiment gives the generated images separate scores for different face key points, so that the model learns more accurately and learns the relationships between the face key parts more effectively.
Fig. 12 is a schematic structural diagram of a training system for a high-quality binocular face recognition model according to an embodiment of the present invention.
As shown in fig. 12, a high-quality binocular face recognition model training system according to an embodiment of the present invention includes:
The acquisition module is used for acquiring a training sample set, wherein the training sample set comprises positive sample pairs and negative sample pairs; a positive sample pair is two face images captured by a binocular camera, and a negative sample pair is two face images whose angles are similar to those of a positive sample pair; each face image comprises a plurality of sub-partitions, each assigned a different label.
The training module is used for inputting the sub-partitions into different sub-networks respectively, feeding the outputs of the plurality of sub-networks into a sub-network fusion layer, and training the model over multiple rounds until convergence to obtain a first model.
The sub-partition module is used for changing the partitioning mode of the sub-partitions in the training sample set to obtain a second training sample set.
The circulation module is used for running the training module and the sub-partition module repeatedly until the first model converges.
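The following PyTorch-style sketch illustrates how such a system could be assembled: one small convolutional branch per sub-partition, a fusion layer over the concatenated branch features, and an outer loop that re-partitions the samples and retrains. The branch and fusion architectures, the two-class head (genuine pair vs. attack pair), the make_loader helper and the list of partitioning schemes are all illustrative assumptions; the patent does not fix the network structure, the sketch assumes the number of sub-partitions stays constant across schemes, and for brevity the per-sub-region labels described above are folded into a single pair-level label.

```python
import torch
import torch.nn as nn


class BranchNet(nn.Module):
    """Sub-network applied to one sub-partition (cropped and resized to a fixed size)."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, x):
        return self.net(x)


class FusedModel(nn.Module):
    """One branch per sub-partition plus a sub-network fusion layer."""
    def __init__(self, num_partitions: int, feat_dim: int = 64):
        super().__init__()
        self.branches = nn.ModuleList(BranchNet(feat_dim) for _ in range(num_partitions))
        self.fusion = nn.Sequential(
            nn.Linear(num_partitions * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 2),  # genuine pair vs. attack pair (an assumption)
        )

    def forward(self, partitions):  # list of (B, 3, H, W) tensors, one per sub-partition
        feats = [branch(p) for branch, p in zip(self.branches, partitions)]
        return self.fusion(torch.cat(feats, dim=1))


def train_round(model, loader, epochs: int = 5, lr: float = 1e-3):
    """One multi-epoch training round (step M2)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for partitions, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(partitions), labels)
            loss.backward()
            opt.step()
    return model


def train_full(model, schemes, make_loader):
    """Steps M2-M4: retrain after each re-partitioning of the sample set."""
    for scheme in schemes:  # each scheme describes one sub-partition layout
        model = train_round(model, make_loader(scheme))
    return model
```

In a real system the circulation module's convergence test would replace the fixed list of schemes, for example by monitoring validation loss between re-partitioning rounds.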
In this embodiment, the training samples are divided into different partitions so that the model can be trained at a finer granularity, and the sub-partitioning mode is changed between repeated training rounds. This greatly improves the model's perception of face details and, by combining face feature recognition with living body authentication, allows attack scenarios pairing a real face with a prosthesis to be recognized better.
The embodiment of the invention also provides high-quality binocular face recognition model training equipment, comprising a processor and a memory in which executable instructions of the processor are stored, wherein the processor is configured to perform the steps of the high-quality binocular face recognition model training method via execution of the executable instructions.
As described above, the training samples are divided into different partitions so that the model can be trained at a finer granularity, and the sub-partitioning mode is changed between repeated training rounds, which greatly improves the model's perception of face details, combines face feature recognition with living body authentication, and allows attack scenarios pairing a real face with a prosthesis to be recognized better.
Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "platform."
Fig. 13 is a schematic structural diagram of a high-quality binocular face recognition model training apparatus according to an embodiment of the present invention. An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 13. The electronic device 600 shown in fig. 13 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 13, the electronic device 600 is in the form of a general purpose computing device. Components of electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one memory unit 620, a bus 630 connecting the different platform components (including memory unit 620 and processing unit 610), a display unit 640, etc.
The storage unit stores program code executable by the processing unit 610, such that the processing unit 610 performs the steps of the high-quality binocular face recognition model training method according to the various exemplary embodiments of the present invention described above in this specification. For example, the processing unit 610 may perform the steps shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 6201 and/or cache memory unit 6202, and may further include Read Only Memory (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 630 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit bus, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 600, and/or any device (e.g., router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 650. Also, electronic device 600 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown in fig. 13, other hardware and/or software modules may be used in connection with electronic device 600, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage platforms, and the like.
The embodiment of the invention also provides a computer-readable storage medium for storing a program, wherein the steps of the high-quality binocular face recognition model training method are implemented when the program is executed. In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code which, when the program product is run on a terminal device, causes the terminal device to carry out the steps according to the various exemplary embodiments of the invention described above in the section on the high-quality binocular face recognition model training method.
As shown above, the training samples are divided into different partitions so that the model can be trained at a finer granularity, and the sub-partitioning mode is changed between repeated training rounds, which greatly improves the model's perception of face details, combines face feature recognition with living body authentication, and allows attack scenarios pairing a real face with a prosthesis to be recognized better.
Fig. 14 is a schematic structural view of a computer-readable storage medium in an embodiment of the present invention. Referring to fig. 14, a program product 800 for implementing the above-described method according to an embodiment of the present invention is described; it may employ a portable compact disc read-only memory (CD-ROM), includes program code, and may be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic or optical signals, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium other than a readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The training samples are divided into different partitions so that the model can be trained at a finer granularity, and the sub-partitioning mode is changed between repeated training rounds, which greatly improves the model's perception of face details, combines face feature recognition with living body authentication, and allows attack scenarios pairing a real face with a prosthesis to be recognized better.
In the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; for identical or similar parts, the embodiments may be referred to one another. The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the claims without affecting the spirit of the invention.

Claims (10)

1. A high-quality binocular face recognition model training method, characterized by comprising the following steps:
step M1: acquiring a training sample set; wherein the training sample set comprises positive sample pairs and negative sample pairs; a positive sample pair is two face images captured by a binocular camera, and a negative sample pair is two face images whose angles are similar to those of a positive sample pair; each face image comprises a plurality of sub-partitions, each assigned a different label;
step M2: inputting the sub-partitions into different sub-networks respectively, feeding the outputs of the plurality of sub-networks into a sub-network fusion layer, and training the model over multiple rounds until convergence to obtain a first model;
step M3: changing the partitioning mode of the sub-partitions in the training sample set to obtain a second training sample set;
step M4: repeating steps M2-M3 until the first model converges.
2. The high-quality binocular face recognition model training method according to claim 1, wherein, when the sub-partitioning mode of a negative sample pair is changed in step M3, different labels are assigned anew according to the proportion of the real face.
3. The high-quality binocular face recognition model training method according to claim 2, wherein the two face images in a positive sample pair or a negative sample pair are partitioned in the same way.
4. The high-quality binocular face recognition model training method according to claim 1, wherein generating a negative sample pair comprises:
step S1: acquiring a plurality of real face image pairs and a plurality of prosthesis images; a real face image pair comprises two images of the same target object captured at the same moment from different angles; a prosthesis image is an image comprising at least one of a 2D paper face, a 3D mask and a head model;
step S2: normalizing the real face image pairs and the prosthesis images according to face size so that the face sizes are the same;
step S3: determining a first region on the prosthesis image, determining the region corresponding to the first region in each image of the real face image pair, and replacing those regions with the corresponding content of the prosthesis image to obtain a first generated image pair;
step S4: determining a second region on the real face image pair, and replacing the second region on the real face image pair with the corresponding content of another real face image to obtain a second generated image pair;
step S5: dividing the first generated image pair and the second generated image pair into a plurality of sub-regions according to a preset rule, and assigning different labels according to the proportion of real face image content in each sub-region.
5. The high-quality binocular face recognition model training method according to claim 4, wherein the normalization in step S2 comprises the following steps:
step S21: identifying the left eye center, the right eye center and the mouth corner center;
step S22: calculating the area of the triangle formed by the left eye center, the right eye center and the mouth corner center;
step S23: adjusting the sizes of the real face image and the prosthesis image so that the area of the triangle equals a first preset value.
6. The high-quality binocular face recognition model training method according to claim 4, wherein step S3 comprises:
step S31: obtaining a first region of random area and random location on the prosthesis image;
step S32: determining a first positional relationship between the first region and the left eye center, the right eye center or the mouth corner center on the prosthesis image;
step S33: on the real face image pair, determining the corresponding region on each real face image according to the left eye center, the right eye center, the mouth corner center and the first positional relationship;
step S34: replacing the content of the corresponding regions with the content of the first region to obtain a first generated image pair.
7. The high-quality binocular face recognition model training method according to claim 4, wherein step S5 comprises:
step S51: dividing the first generated image pair and the second generated image pair into a plurality of sub-regions according to the face key points, so that each sub-region contains at most one face key point;
step S52: if a face key point exists in a sub-region and the area of the face key part containing that key point accounts for less than a preset proportion of the sub-region, dividing the sub-region further;
step S53: assigning different labels according to the proportion of real face image content in each sub-region.
8. A high-quality binocular face recognition model training system for implementing the high-quality binocular face recognition model training method of any one of claims 1 to 7, comprising:
an acquisition module for acquiring a training sample set, wherein the training sample set comprises positive sample pairs and negative sample pairs; a positive sample pair is two face images captured by a binocular camera, and a negative sample pair is two face images whose angles are similar to those of a positive sample pair; each face image comprises a plurality of sub-partitions, each assigned a different label;
a training module for inputting the sub-partitions into different sub-networks respectively, feeding the outputs of the plurality of sub-networks into a sub-network fusion layer, and training the model over multiple rounds until convergence to obtain a first model;
a sub-partition module for changing the partitioning mode of the sub-partitions in the training sample set to obtain a second training sample set;
and a circulation module for running the training module and the sub-partition module repeatedly until the first model converges.
9. A high-quality binocular face recognition model training apparatus, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the high-quality binocular face recognition model training method of any one of claims 1 to 7 via execution of the executable instructions.
10. A computer-readable storage medium storing a program, wherein the program, when executed, implements the steps of the high-quality binocular face recognition model training method of any one of claims 1 to 7.