CN112733946A - Training sample generation method and device, electronic equipment and storage medium


Info

Publication number
CN112733946A
CN112733946A CN202110050175.3A CN202110050175A
Authority
CN
China
Prior art keywords
face image
source
target
training data
data set
Prior art date
Legal status
Granted
Application number
CN202110050175.3A
Other languages
Chinese (zh)
Other versions
CN112733946B
Inventor
杨博文
尹榛菲
邵婧
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202110050175.3A priority Critical patent/CN112733946B/en
Publication of CN112733946A publication Critical patent/CN112733946A/en
Application granted granted Critical
Publication of CN112733946B publication Critical patent/CN112733946B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F18/214 Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Pattern recognition; Matching criteria, e.g. proximity measures
    • G06F18/253 Pattern recognition; Fusion techniques of extracted features
    • G06N3/08 Neural networks; Learning methods
    • G06V40/168 Human faces; Feature extraction; Face representation
    • G06V40/172 Human faces; Classification, e.g. identification
    • G06V40/45 Spoof detection, e.g. liveness detection; Detection of the body part being alive

Abstract

The disclosure provides a training sample generation method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a source training data set and a target face image generated in a target scene; generating a synthesized face image corresponding to the target face image based on the target face image and a source face image in the source training data set that is paired with the target face image; and expanding the source training data set based on the synthesized face image to obtain an expanded training data set. The synthesized face images obtained by fusion expand the target face images in the sample space, so the expanded training data set covers, to a certain extent, more training data from the target scene, and a neural network trained on it achieves higher detection accuracy in the target scene (for example, a brand-new environment).

Description

Training sample generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to a method and an apparatus for generating a training sample, an electronic device, and a storage medium.
Background
As an important research direction in computer vision, face recognition is widely used in application scenarios such as mobile phone unlocking and access control. However, because face images are easy to obtain, face recognition systems are vulnerable to spoofing attacks such as printed photos and video replay, which creates security risks. Living body detection, which distinguishes real faces from fake ones, is therefore an indispensable link in a face recognition system.
Current living body detection methods use a living body detection neural network to automatically identify whether an input face picture shows a real person or a fake. Compared with other face tasks (such as face detection), living body detection is easily affected by the variety of attack means and attack materials, so a trained living body detection model may not adapt well to a new attack environment.
Moreover, the training data generated in a new attack environment are usually scarce; if these new data are simply added to the existing training data set, the detection accuracy of the neural network in the new attack environment cannot be effectively improved.
Disclosure of Invention
The embodiment of the disclosure at least provides a training sample generation method and device, electronic equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a method for generating training data, including:
acquiring a source training data set and a target face image generated by a target scene;
generating a synthetic face image corresponding to the target face image based on the target face image and a source face image corresponding to the target face image in the source training data set;
and expanding the source training data set based on the synthesized face image to obtain an expanded training data set.
With the training data generation method above, the target face image can be fused with a source face image from the source training data set, so that the resulting synthesized face image carries both the image features of the source face image and those of the target face image. When the target scene yields few target face images, the synthesized face images obtained by fusion expand the target face images in the sample space, and the expanded training data set covers, to a certain extent, more training data from the target scene, so that the trained neural network achieves higher detection accuracy in the target scene (for example, a brand-new environment).
In one possible embodiment, the generating a synthesized face image corresponding to the target face image based on the target face image and a source face image corresponding to the target face image in the source training data set includes:
extracting the source facial image paired with the target facial image from the source training data set;
inputting the target face image and a source face image matched with the target face image into a trained image generation neural network for feature fusion processing to obtain a synthesized face image, wherein the synthesized face image is fused with the content features of the source face image and the style features of the target face image.
In the embodiment of the disclosure, the fusion processing between the paired target face image and the source face image can be performed by using the image generation neural network, so that the operation is simple, and the time and the efficiency are saved.
In one possible embodiment, the image-generating neural network is trained as follows:
acquiring a source face image sample and a target face image sample which are matched;
respectively extracting the characteristics of the paired source face image sample and target face image sample to obtain the content characteristic information of the source face image sample and the style characteristic information of the target face image sample;
and performing at least one round of training on the image generation neural network to be trained based on the content characteristic information of the source face image sample and the style characteristic information of the target face image sample.
To enhance the significance of the target-scene features in subsequent applications such as living body detection, style feature information, which strongly influences living body detection, is extracted from the target face image sample, while content feature information is extracted from the source face image sample, so that the trained image generation neural network can be better adapted to the subsequent detection network.
In a possible implementation, the performing at least one round of training on an image generation neural network to be trained based on the content feature information of the source facial image sample and the style feature information of the target facial image sample includes:
aiming at the current round of training, taking the content characteristic information of the source face image sample and the style characteristic information of the target face image sample as the input characteristics of the image generation neural network to be trained, and determining the fusion characteristic information output by the image generation neural network to be trained;
under the condition that the first similarity between the fusion characteristic information and the content characteristic information is smaller than a first threshold value and the second similarity between the fusion characteristic information and the style characteristic information is smaller than a second threshold value, adjusting network parameters of the image generation neural network, and performing next round of training;
until the training is cut off when a first similarity between the fusion feature information and the content feature information is greater than or equal to a first threshold and a second similarity between the fusion feature information and the style feature information is greater than or equal to a second threshold.
Here, the fused image is intended to highlight the style features while preserving the content features; constraining training with the first similarity between the fusion feature information and the content feature information and the second similarity between the fusion feature information and the style feature information therefore allows the trained network to meet the requirements of the scene.
In a possible implementation, the extracting, from the source training data set, a source face image paired with the target face image includes:
determining the living body label category to which the target face image belongs;
and extracting a source face image with the living body label type same as that of the living body label to which the target face image belongs from the source training data set, and taking the source face image as a source face image matched with the target face image.
In one possible embodiment, after the obtaining the extended training data set, the method further comprises:
and performing at least one round of training on the living body detection neural network to be trained based on the extended training data set to obtain the trained living body detection neural network.
Here, the target scene can be well considered by expanding the training data set, so that the trained living body detection neural network has better compatibility with the target scene and higher detection accuracy.
In a possible embodiment, the performing at least one round of training on the living body detection neural network to be trained based on the extended training data set to obtain a trained living body detection neural network includes:
respectively obtaining the characteristic information of each face image in the extended training data set by using a living body detection neural network to be trained;
determining a target loss function value corresponding to the living body detection neural network to be trained based on the obtained characteristic information;
and under the condition that the target loss function value does not meet the preset condition, carrying out next round of training on the living body detection neural network to be trained until the target loss function value meets the preset condition.
In one possible embodiment, the feature information of each face image in the extended training data set includes:
the first characteristic information of the source face image, the second characteristic information of the target face image and the third characteristic information of the synthesized face image in the extended training data set.
In a possible implementation manner, the determining, based on the obtained feature information, an objective loss function value corresponding to the living body detection neural network to be trained includes:
determining a first target loss function value for measuring the difference of training data in the same living body label category and a second target loss function value for measuring the feature distribution condition of face images from different sources based on the first feature information, the second feature information and the third feature information;
and determining a target loss function value corresponding to the living body detection neural network to be trained based on the first target loss function value and the second target loss function value.
The living body detection neural network is synchronously adjusted based on the first target loss function value for measuring the difference of training data in the same living body label category and the second target loss function value for measuring the feature distribution of face images from different sources, so that the trained network can enable samples in the same category to have more similar expression in feature space, and the accuracy of classification results can be improved.
In a possible implementation, the determining, based on the first feature information, the second feature information, and the third feature information, a first objective loss function value for measuring a difference of training data in a same live label category includes:
selecting two first face images of the same living body label type and two second face images of different living body label types from the training data set; the two first face images are different in source, and the two second face images are different in source;
determining a first image similarity between the two first face images based on the feature information of the two first face images; determining second image similarity between the two second face images based on the feature information of the two second face images;
and summing the first image similarity and the second image similarity to obtain the first target loss function value.
In this case, the same category can be drawn closer and different categories can be pushed away by calculating the image similarity between samples, so that the trained network can be better classified.
In a possible implementation manner, determining a second objective loss function value for measuring feature distribution of face images of different sources based on the first feature information, the second feature information, and the third feature information includes:
respectively determining, based on the first feature information, the second feature information and the third feature information, first distribution feature information representing the feature distribution of the source face images in the extended training data set, second distribution feature information representing the feature distribution of the target face images in the extended training data set, and third distribution feature information representing the feature distribution of the synthesized face images in the extended training data set;
determining a first feature distribution similarity between the target face images and the synthesized face images based on the similarity between the second distribution feature information and the third distribution feature information; splicing the second distribution feature information and the third distribution feature information to obtain spliced distribution feature information;
determining a second feature distribution similarity between the source face images and the stitched target and synthesized face images based on the similarity between the first distribution feature information and the spliced distribution feature information;
and summing the first feature distribution similarity and the second feature distribution similarity to obtain the second target loss function value.
Here, the three kinds of distribution feature information may be processed based on the feature distribution level, so that the entire extended training data set is compared in the feature distribution, and thus living body classification may be performed better.
In a possible implementation manner, the determining, based on the obtained feature information, an objective loss function value corresponding to the living body detection neural network to be trained includes:
determining a target loss function value corresponding to the living body detection neural network to be trained by using the first characteristic information of each source face image in the extended training data set determined by the living body detection neural network to be trained and the first source characteristic information extracted from the source face image by using the trained source living body detection neural network;
the source living body detection neural network is obtained by training each source face image sample and the living body label type labeled on each source face image sample.
Here, in order that the trained living body detection neural network maintains high accuracy in the target scene without reducing the detection accuracy in the scenes corresponding to the existing training data set, the trained source living body detection neural network can be used to extract first source feature information from the source face images, and the network parameters can be adjusted based on the similarity between this source feature information and the first feature information.
In a second aspect, an embodiment of the present disclosure further provides an apparatus for generating training data, including:
the acquisition module is used for acquiring a source training data set and a target face image generated by a target scene;
a generating module, configured to generate a synthesized face image corresponding to the target face image based on the target face image and a source face image corresponding to the target face image in the source training data set;
and the expansion module is used for expanding the source training data set based on the synthesized face image to obtain an expanded training data set.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the method for generating training data according to the first aspect and any of its various embodiments.
In a fourth aspect, the disclosed embodiments also provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the training data generation method according to the first aspect and any one of the various embodiments.
For the description of the effects of the training data generation apparatus, the electronic device, and the computer-readable storage medium, reference is made to the description of the training data generation method, which is not repeated herein.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings incorporated in and forming a part of the specification illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. The following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive additional related drawings from them without inventive effort.
Fig. 1 shows a flowchart of a training data generation method provided by an embodiment of the present disclosure;
fig. 2 is a schematic application diagram illustrating a method for generating training data according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating a training data generation apparatus provided in an embodiment of the present disclosure;
fig. 4 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Research shows that current living body detection methods use a living body detection neural network to automatically identify whether an input face picture shows a real person or a fake. Compared with other face tasks (such as face detection), living body detection is easily affected by the variety of attack means and attack materials, so a trained living body detection model may not adapt well to a new attack environment.
Moreover, the training data generated in a new attack environment are usually scarce; if these new data are simply added to the existing training data set, the detection accuracy of the neural network in the new attack environment cannot be effectively improved.
Based on the above research, the present disclosure provides a method and an apparatus for generating a training sample, an electronic device, and a storage medium, which extend a training data set through image feature fusion, so that a neural network trained in this way can better adapt to various scenes.
To facilitate understanding of the present embodiment, first, a method for generating training data disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the method for generating training data provided in the embodiments of the present disclosure is generally a computer device with certain computing capability, and the computer device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle mounted device, a wearable device, or a server or other processing device. In some possible implementations, the method for generating the training data may be implemented by a processor calling computer readable instructions stored in a memory.
Referring to fig. 1, which is a flowchart of a method for generating training data according to an embodiment of the present disclosure, the method includes steps S101 to S103, where:
S101: acquiring a source training data set and a target face image generated in a target scene;
S102: generating a synthesized face image corresponding to the target face image based on the target face image and a source face image corresponding to the target face image in the source training data set;
S103: expanding the source training data set based on the synthesized face image to obtain an expanded training data set.
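For illustration only, the overall flow of steps S101 to S103 can be sketched as follows. The function names (pair_fn, generate_fn) and data structures are assumptions introduced here for readability and do not appear in the disclosure; the sketch only mirrors the acquire-generate-expand flow described above.

```python
# Hypothetical sketch of steps S101-S103 (all names are illustrative only).

def expand_training_set(source_set, target_images, pair_fn, generate_fn):
    """source_set: list of (source_face_image, live_label) pairs.
    target_images: list of (target_face_image, live_label) pairs from the target scene (S101).
    pair_fn: selects a source image paired with a given target image (e.g. same live label).
    generate_fn: fuses a (source, target) pair into a synthesized face image (S102).
    """
    synthetic = []
    for target_img, label in target_images:
        source_img = pair_fn(source_set, label)      # pick a source image with the same live label
        fused = generate_fn(source_img, target_img)   # S102: synthesize a face image
        synthetic.append((fused, label))
    # S103: the expanded set keeps the source data and adds the synthesized samples.
    return source_set + synthetic
```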
Here, to facilitate understanding of the training data generation method provided by the embodiments of the present disclosure, its application scenario is first described. The method is mainly applied when preparing the training data set before training a living body detection neural network. Living body detection is easily affected by the variety of attack means and attack materials, and a trained living body detection model may not adapt well to a new attack environment, mainly because the training data generated in the new attack environment are usually scarce; directly adding these new data to the existing training data set cannot effectively improve the detection accuracy of the trained neural network in the new attack environment.
In order to solve the above problem, the embodiments of the present disclosure provide a scheme for implementing the training data set expansion through image synthesis, so that a neural network trained based on the expanded training data set can better adapt to the detection requirement in a new attack environment.
The source training data set may be collected in advance, for example, may be a collection of face pictures acquired in a widely used face live attack mode, or may be a collection of pictures acquired in other live attack modes. The target scene may correspond to the new attack environment, and the number of generated target face images is usually small in the new attack environment.
Note that in the embodiments of the present disclosure, the target scene corresponding to the target face image differs from the scene corresponding to the source training data set; the difference mainly lies in the acquisition environments of the face images. For example, if the acquisition environments of the source training data set include a paper attack mode, a mobile phone screen attack mode, and the like, a face image acquired in a new attack environment such as a computer screen attack mode may be used as a target face image of the target scene.
In this case, the method for generating training data according to the embodiment of the present disclosure may generate a synthesized face image based on the target face image and the source face image corresponding to the target face image in the source training data set, where the synthesized face image fuses the image features of the target face image and the image features of the source face image, so that after the source training data set is expanded by the synthesized face image, the obtained expanded training data set will include more face images that conform to the target scene, and a neural network trained according to the expanded training data set may better adapt to the requirements in the target scene.
There may be multiple target face images, and image synthesis may be performed for all of them. For example, each source face image in the source training data set may be traversed and synthesized with each target face image, so that multiple synthesized face images are produced for each target face image. Alternatively, for each target face image the synthesizing operation may be performed with only a part of the source face images in the source training data set.
It should be noted that, for living body detection training, the training data set generally contains face images with real-person labels as well as face images with prosthesis labels. To avoid interference between different label categories, the living body label category of the target face image is determined first; for example, if the category corresponds to a real-person label, a source face image with the same real-person label is selected from the source training data set and synthesized with the target face image, as in the pairing sketch below.
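A minimal sketch of this label-matched pairing, assuming each sample carries a binary liveness label; whether the whole source set or only part of it is used per target image is a design choice, as noted above, and the function name and the max_pairs parameter are assumptions.

```python
import random

def pair_source_images(source_set, target_label, max_pairs=None):
    """Return source face images whose live label matches target_label.

    source_set: list of (source_face_image, live_label) pairs.
    max_pairs: optionally synthesize against only part of the source set.
    """
    candidates = [img for img, label in source_set if label == target_label]
    if max_pairs is not None and len(candidates) > max_pairs:
        candidates = random.sample(candidates, max_pairs)
    return candidates
```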
The training data generation method provided by the embodiment of the disclosure can utilize the trained image generation neural network to realize the relevant operation of image synthesis.
Here, first, a source face image paired with a target face image may be extracted from a source training data set, and then the target face image and the source face image paired with the target face image may be input to a trained image generation neural network for feature fusion processing, so as to obtain a synthesized face image in which the content features of the source face image and the style features of the target face image are fused.
The source face image paired with the target face image refers to a face image whose living body label category is the same as that of the target face image. The two face images with the same living body label category (i.e., the paired target face image and source face image) can be input directly to the image generation neural network to obtain a synthesized face image that fuses the content features of the source face image and the style features of the target face image.
The content features of the source face image and the style features of the target face image are chosen for fusion mainly because, in subsequent living body detection applications, the style features, such as the living body attack type (e.g., mobile phone screen attack, paper attack), have more influence on the detection result, while content features of the face itself, such as the size of the facial features or the distance between the eyebrows, can be weakened. Therefore, to better adapt to living body detection in the new target scene, the style features are extracted from the target face image and the content features from the source face image.
The image generation neural network can be trained by the matching relationship between two input face images and one output synthetic face image, and the image generation neural network can be trained according to the following steps:
acquiring a source face image sample and a target face image sample which are paired;
respectively extracting the characteristics of the paired source face image sample and target face image sample to obtain the content characteristic information of the source face image sample and the style characteristic information of the target face image sample;
and thirdly, performing at least one round of training on the image generation neural network to be trained based on the content characteristic information of the source face image sample and the style characteristic information of the target face image sample.
Here, the source face image sample and the target face image sample may be taken as a paired set of face image samples. In order to achieve the technical purpose of migrating the style characteristics of the target face image sample to the source face image, the embodiment of the disclosure can determine the content characteristic information of the source face image sample and the style characteristic information of the target face image sample under the condition of respectively extracting the characteristics of the paired source face image sample and target face image sample, and can perform one or more rounds of network training based on the content characteristic information of the source face image sample and the style characteristic information of the target face image sample.
The content feature information characterizes the face in the source face image sample and is a higher-level feature; in practice it may be face-related information such as the face contour or the distance between the eyes. The style feature information is a lower-level feature close to the image texture, for example the material presented by the target face image sample.
Here, in order to ensure that the image generation neural network can generate a synthesized face image sample fusing the content feature information of the active face image sample and the style feature information of the target face image sample, the training may be performed in each training cycle according to the following steps:
aiming at the current round of training, taking content characteristic information of a source face image sample and style characteristic information of a target face image sample as input characteristics of an image generation neural network to be trained, and determining fusion characteristic information output by the image generation neural network to be trained;
secondly, under the condition that the first similarity between the fusion characteristic information and the content characteristic information is smaller than a first threshold value and the second similarity between the fusion characteristic information and the style characteristic information is smaller than a second threshold value, adjusting network parameters of an image generation neural network, and carrying out next round of training;
and step three, training is cut off until the first similarity between the fusion characteristic information and the content characteristic information is greater than or equal to a first threshold value and the second similarity between the fusion characteristic information and the style characteristic information is greater than or equal to a second threshold value.
During each round of training, the similarity between the fusion feature information output by the image generation neural network and the content feature information and style feature information fed into the network is evaluated. If the first similarity between the fusion feature information and the content feature information is not large enough, the synthesized face image sample corresponding to the fusion feature information does not contain enough content features, and the proportion of content features in the fusion features can be raised by adjusting the network parameters. Similarly, if the second similarity between the fusion feature information and the style feature information is not large enough, the synthesized sample does not contain enough style features, and the proportion of style features can be raised by adjusting the network parameters. After several iterations, a trained image generation neural network is obtained.
The first threshold and the second threshold can be chosen for different application scenarios. Taking the second threshold as an example, it should be neither too large nor too small: if it is too large, the synthesized face image samples will ignore the influence of the content features on applications such as living body detection; if it is too small, the synthesized samples will not carry enough style features to serve those applications well. For example, a style ratio of 0.6 can be used to determine the second threshold.
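The per-round procedure above can be sketched as follows. Cosine similarity, the Adam optimizer, the generator interface, and the loss form (pushing both similarities up) are illustrative assumptions; the disclosure only requires that parameters be adjusted while the similarities are below the first and second thresholds and that training stop once both thresholds are reached, so the handling of mixed cases here is also an assumption.

```python
import torch
import torch.nn.functional as F

def train_image_generator(generator, pairs, first_threshold=0.6, second_threshold=0.6,
                          lr=1e-4, max_rounds=100):
    """Sketch of the per-round training of the image generation neural network.

    pairs: iterable of (content_feat, style_feat) tensors extracted from paired
           source / target face image samples.
    """
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(max_rounds):
        all_above_thresholds = True
        for content_feat, style_feat in pairs:
            fused_feat = generator(content_feat, style_feat)  # fusion feature information
            sim_content = F.cosine_similarity(fused_feat, content_feat, dim=-1).mean()
            sim_style = F.cosine_similarity(fused_feat, style_feat, dim=-1).mean()
            if sim_content.item() < first_threshold or sim_style.item() < second_threshold:
                all_above_thresholds = False
                loss = -(sim_content + sim_style)  # raise both similarities (assumed loss form)
                opt.zero_grad()
                loss.backward()
                opt.step()
        if all_above_thresholds:  # both similarities reached their thresholds: stop training
            break
    return generator
```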
The image generation neural network in the embodiments of the present disclosure essentially performs a style migration operation. In a specific application, this can be implemented with a style transfer network based on whitening and coloring transforms (WCT2). WCT2 reconstructs the synthesized face image sample from features at different levels: the lower the feature level used (the shallower the layer from which features are extracted), the more content detail of the synthesized sample is preserved but the weaker the stylization; conversely, the higher the feature level, the stronger the stylization. The degree of stylization can also be controlled by a stylization-degree parameter, for example set to 0.6.
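As a concrete illustration of this style migration step, the sketch below assumes a hypothetical wct2_stylize wrapper around a WCT2-style transfer model; the actual WCT2 interface is not specified in the disclosure, so the function signature and the alpha=0.6 stylization-degree parameter are assumptions.

```python
def synthesize_face(source_face, target_face, wct2_stylize, alpha=0.6):
    """Fuse the content of a source face image with the style of a target face image.

    wct2_stylize: hypothetical callable (content_img, style_img, alpha) -> stylized_img
                  wrapping a WCT2-style transfer network; alpha controls the degree
                  of stylization (0 keeps content only, 1 applies full stylization).
    """
    # Content comes from the source face image, style from the target-scene face image.
    return wct2_stylize(content_img=source_face, style_img=target_face, alpha=alpha)
```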
The embodiment of the present disclosure can perform training of the living body detection neural network by using the extended training data, that is, perform at least one round of training on the living body detection neural network to be trained based on the extended training data set, so as to obtain the trained living body detection neural network.
The extended training data set may include not only a source face image included in the source training data set, but also a target face image generated in a target scene, and a synthesized face image obtained by synthesizing image features of the source face image and the target face image.
In addition, the living body detection neural network in the embodiments of the present disclosure mainly performs binary classification on any input face image, deciding whether it shows a real person or a prosthesis. In practice it may also perform multi-class recognition, determining the attack mode of a face image identified as a prosthesis; this is not specifically limited here. For ease of illustration, the following description mostly uses the binary classification case.
In the embodiment of the present disclosure, the living body detection neural network may be trained according to the following steps:
step one, utilizing a living body detection neural network to be trained to respectively obtain the characteristic information of each face image in an extended training data set;
secondly, determining a target loss function value corresponding to the living body detection neural network to be trained based on the obtained characteristic information;
and step three, under the condition that the target loss function value does not meet the preset condition, performing next round of training on the living body detection neural network to be trained until the target loss function value meets the preset condition.
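The three steps above can be put together as in the following sketch. The feature-extraction interface, the SGD optimizer, the loss-threshold stopping rule (one possible "preset condition"), and the compute_target_loss callable are assumptions; the loss terms themselves are detailed in the later sketches.

```python
import torch

def train_liveness_network(net, expanded_set, compute_target_loss,
                           lr=1e-4, max_rounds=100, loss_threshold=1e-3):
    """Outer training loop for the living body detection neural network.

    expanded_set: dict with 'source', 'target' and 'synthetic' batches of face images.
    compute_target_loss: maps the three groups of feature information to the target loss.
    """
    opt = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    for _ in range(max_rounds):
        feat_src = net(expanded_set['source'])     # first feature information
        feat_tgt = net(expanded_set['target'])     # second feature information
        feat_syn = net(expanded_set['synthetic'])  # third feature information
        loss = compute_target_loss(feat_src, feat_tgt, feat_syn)
        if loss.item() < loss_threshold:           # preset condition met: stop training
            break
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net
```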
Here, the living body detection neural network to be trained is used to determine the first feature information of each source face image in the extended training data set, the second feature information of each target face image, and the third feature information of each synthesized face image.
The first feature information, the second feature information and the third feature information may be feature information related to living body detection and identification, and the feature information may be changed along with adjustment of network parameters for a living body detection neural network, so that the trained feature information can be better used for living body detection.
The extended training data set may contain face images with real-person labels and face images with prosthesis labels among the source face images (the source face image subset), among the target face images (the target face image subset), and among the synthesized face images (the synthesized face image subset).
When the first feature information, the second feature information, and the third feature information are extracted, feature distributions of the three subsets, i.e., the source face image subset, the target face image subset, and the synthesized face image subset, are relatively independent. In order to achieve the purpose of living body classification for the three subsets, the embodiment of the disclosure can establish corresponding target loss functions at a face image level and a feature distribution level, so as to achieve a good effect of living body classification for the three subsets. Next, the following two aspects can be explained.
In a first aspect: for the face image layer, a first objective loss function value for measuring the difference of training data in the same living body label category can be determined based on the first feature information, the second feature information and the third feature information. This target loss function value may specifically be determined as follows:
selecting two first face images of the same living body label type from a training data set, and selecting two second face images of different living body label types; the two first face images are different in source, and the two second face images are different in source;
determining a first image similarity between the two first face images based on the characteristic information of the two first face images; determining second image similarity between the two second face images based on the feature information of the two second face images;
and step three, summing the first image similarity and the second image similarity to obtain a first target loss function value.
The two first face images and the two second face images herein each include two of a source face image, a target face image, and a synthesized face image. In the embodiment of the present disclosure, two corresponding first face images and two corresponding second face images may be selected based on the living body tag category, and in practical applications, one of the selected first face images and one of the selected second face images may be the same image or different images, which is not limited herein.
For two first face images belonging to the same living body label category selected from the training dataset, for example, two first face images with real person labels, a first image similarity between the two first face images can be determined; similarly, for two first face images with a prosthesis label, corresponding first image similarities may also be determined. For two second facial images selected from the training dataset belonging to different live body label categories, for example, comprising one second facial image with a live body label and one second facial image with a false body label, a second image similarity between the two second facial images may be determined.
To realize living body classification, the first objective loss function determined here should increase the first image similarity as much as possible and reduce the second image similarity. The first and second image similarities are illustrated with reference to the example shown in fig. 2.
As shown in fig. 2, when the first feature information, the second feature information, and the third feature information are extracted, the feature distributions (respectively, corresponding to the labels as distribution 1, distribution 2, and distribution 3) of the three subsets, i.e., the source face image subset, the target face image subset, and the synthesized face image subset, are relatively independent.
For distribution 1, corresponding 11 and 12 in the distribution 1 correspond to a source face image having a real person label and a false body label, respectively; for distribution 2, corresponding 21 and 22 in the distribution 2 correspond to the target face image with the real person label and the prosthesis label, respectively; for distribution 3, corresponding 31 and 32 in the distribution 3 correspond to a synthetic face image having a real person label and a false person label, respectively.
The process of performing the above-described first image similarity calculation, that is, the process of approximating the image similarity between the face images belonging to the same live body label in the distribution 1, the distribution 2, and the distribution 3, may be, for example, determining the first image similarity for the face image denoted by 11 in the distribution 1 and for the face image denoted by 21 in the distribution 2, where it is intended to enable the trained live body detection neural network to approximate the distance between the source face image and the target face image belonging to the live body label.
The process of performing the above-described second image similarity calculation, i.e., the process of extrapolating the image similarities between the face images belonging to different live body labels in the distribution 1, the distribution 2, and the distribution 3, may be, for example, determining the second image similarity for the face image denoted by 11 in the distribution 1 and for the face image denoted by 22 in the distribution 2, where it is intended to enable the trained live body detection neural network to extrapolate the distance between the source face image belonging to the live body label and the target face image belonging to the prosthetic label.
It should be noted that, in the process of generating a synthetic face image by using an image generation neural network, in order to improve the image generation effect, a certain difference is generally required between the generated synthetic face image and the input source face image, and therefore, in a specific application, the above-mentioned related limitation on the image similarity in the distribution 1 and the distribution 3 may not be performed.
Based on the above principle, the embodiments of the present disclosure may determine a first objective loss function value for measuring the difference of training data within the same live label category.
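A minimal sketch of this first target loss term follows. Cosine similarity is an assumed instantiation of the image similarity, and the pair selection (two same-label images from different subsets, two different-label images from different subsets) is assumed to happen outside the function; the sign conventions used when minimizing the overall objective are not fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

def first_target_loss(feat_a_same, feat_b_same, feat_a_diff, feat_b_diff):
    """Image-level term (first target loss value), illustrative instantiation.

    feat_a_same / feat_b_same: features of two face images from different subsets
                               sharing the same living body label.
    feat_a_diff / feat_b_diff: features of two face images from different subsets
                               with different living body labels.
    """
    first_similarity = F.cosine_similarity(feat_a_same, feat_b_same, dim=-1).mean()
    second_similarity = F.cosine_similarity(feat_a_diff, feat_b_diff, dim=-1).mean()
    # Training is intended to raise the same-label similarity and lower the
    # cross-label similarity (draw the same class closer, push classes apart).
    # The disclosure obtains the value by summing the two similarities.
    return first_similarity + second_similarity
```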
In a second aspect: for the aspect of feature distribution, a second objective loss function value for measuring feature distribution of face images from different sources may be determined based on the first feature information, the second feature information, and the third feature information. This target loss function value may specifically be determined as follows:
step one, respectively determining, based on the first feature information, the second feature information and the third feature information, first distribution feature information representing the feature distribution of the source face images in the extended training data set, second distribution feature information representing the feature distribution of the target face images, and third distribution feature information representing the feature distribution of the synthesized face images;
step two, determining a first feature distribution similarity between the target face images and the synthesized face images based on the similarity between the second distribution feature information and the third distribution feature information, and splicing the second distribution feature information and the third distribution feature information to obtain spliced distribution feature information;
step three, determining a second feature distribution similarity between the source face images and the stitched target and synthesized face images based on the similarity between the first distribution feature information and the spliced distribution feature information;
and step four, summing the first feature distribution similarity and the second feature distribution similarity to obtain the second target loss function value.
Here, in the case where the first feature information for the source face image, the second feature information for the target face image, and the third feature information for the synthesized face image are determined, the corresponding first distribution feature information may be determined for the source face image subset, the corresponding second distribution feature information may be determined for the target face image subset, and the corresponding third distribution feature information may be determined for the synthesized face image subset, respectively.
Here, to draw the three face image subsets closer at the feature distribution level, the similarity between the second distribution feature information and the third distribution feature information is determined first. The larger this first feature distribution similarity is, the closer the feature distributions of the target face image subset and the synthesized face image subset become; as shown in fig. 2, after the first drawing-in operation, distribution 2 and distribution 3 lie closer together than in the original distributions.
After the second distribution feature information and the third distribution feature information are stitched, the similarity between the first distribution feature information and the stitched distribution feature information is determined. The larger this second feature distribution similarity is, the closer the feature distribution of the source face image subset is to that of the two stitched subsets; as shown in fig. 2, after the second drawing-in operation, distribution 1 lies closer to distribution 2 and distribution 3 than in the original distributions.
It should be noted that, in the process of specifically performing the zoom-in operation, one distribution may be kept still, and the other distribution may be closed based on the distribution similarity, or a specific closing position may be selected, and both distributions are closed to this position.
Based on the principle, the embodiment of the present disclosure may determine a second objective loss function value for measuring feature distribution of face images from different sources.
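A sketch of one possible instantiation of this second target loss term is given below. Using the mean feature vector of each subset as its "distribution feature information", stitching the target and synthesized features along the sample dimension, and using cosine similarity as the distribution similarity are all assumptions; the disclosure does not fix these choices.

```python
import torch
import torch.nn.functional as F

def second_target_loss(feat_src, feat_tgt, feat_syn):
    """Distribution-level term (second target loss value), illustrative instantiation.

    feat_src / feat_tgt / feat_syn: per-image features of the source, target and
    synthesized face image subsets, each of shape [N_i, D].
    """
    dist_src = feat_src.mean(dim=0)  # first distribution feature information
    dist_tgt = feat_tgt.mean(dim=0)  # second distribution feature information
    dist_syn = feat_syn.mean(dim=0)  # third distribution feature information

    # First feature distribution similarity: target subset vs. synthesized subset.
    first_dist_sim = F.cosine_similarity(dist_tgt, dist_syn, dim=0)

    # Stitch the target and synthesized subsets (sample-level concatenation is an
    # assumption) and compare the source distribution with the stitched distribution.
    stitched = torch.cat([feat_tgt, feat_syn], dim=0).mean(dim=0)
    second_dist_sim = F.cosine_similarity(dist_src, stitched, dim=0)

    # The disclosure obtains the value by summing the two distribution similarities.
    return first_dist_sim + second_dist_sim
```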
After the feature distributions are drawn together in this way, a hyperplane can easily be found that separates the face images with real-person labels from those with prosthesis labels in the extended training data set, which improves the recognition accuracy of the living body detection neural network.
In the embodiment of the present disclosure, in the case where the first target loss function value and the second target loss function value are determined, the target loss function value corresponding to the living body detection neural network to be trained may be determined. Once it is determined that the target loss function value of one round of training does not meet the preset condition, the next round of training can be performed according to the method until the target loss function value meets the preset condition, and the training is stopped.
It should be noted that the preset condition may be determined for the first target loss function value, the second target loss function value, and the whole target loss function value, or may be determined for any combination of the above three function values, and the embodiment of the present disclosure is not limited to this specific condition.
According to the training data generation method provided by the embodiment of the disclosure, the human face image characteristics in the target scene can be well excavated by using the network training method, so that the requirements on living body detection application in the target scene can be better adapted. In order to avoid performance interference of the trained living body detection neural network on the related living body detection application in the existing scene, the living body detection neural network can be interfered by a source field anti-forgetting constraint strategy.
Here, the target loss function value corresponding to the living body detection neural network to be trained may be determined by using the first feature information of each source face image in the extended training data set, as determined by the living body detection neural network to be trained, together with the first source feature information extracted from the same source face image by the trained source living body detection neural network.
The source living body detection neural network is obtained by training each source face image sample and the living body label type labeled on each source face image sample.
In a specific application, the target loss function value may be determined by the difference between the first feature information and the first source feature information. The smaller the difference is, the smaller the deviation between the living body detection neural network currently being trained and the trained source living body detection neural network, which helps ensure that the performance of the trained living body detection neural network in the source field is not degraded.
In the embodiment of the present disclosure, when the target loss function value is not small enough, the deviation can be penalized by adjusting the parameters of the living body detection neural network.
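One simple way to realize this penalty, assuming the deviation is measured directly on the extracted features, is a distillation-style term between the network being trained and the frozen source network; the function below is a minimal sketch of that idea rather than the exact constraint of this disclosure.

    import torch
    import torch.nn.functional as F

    def anti_forgetting_loss(trainee_network, source_network, src_imgs):
        # Source-field anti-forgetting constraint, sketched as feature
        # distillation: the trainee's first feature information for the source
        # face images should stay close to the first source feature information
        # produced by the frozen, already-trained source network.
        with torch.no_grad():
            source_feats = source_network(src_imgs)    # first source feature information
        trainee_feats = trainee_network(src_imgs)       # first feature information
        return F.mse_loss(trainee_feats, source_feats)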
It should be noted that the process of training the living body detection neural network in the embodiment of the present disclosure may be the combined result of the target loss function set for the source-field anti-forgetting constraint strategy, the first target loss function for measuring the difference of training data within the same living body label category, and the second target loss function for measuring the feature distribution of face images from different sources. Under the constraint of these loss functions, the recognition performance of the trained living body detection neural network in the new target scene is improved while its recognition performance in the existing scene is not degraded.
It will be understood by those skilled in the art that, in the methods of the above embodiments, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation process; the specific order of execution of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, an apparatus for generating training data corresponding to the training data generation method is also provided in the embodiments of the present disclosure. Since the principle by which the apparatus solves the problem is similar to that of the training data generation method in the embodiments of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 3, a schematic diagram of an apparatus for generating training data according to an embodiment of the present disclosure is shown, where the apparatus includes: an acquisition module 301, a generation module 302 and an expansion module 303; wherein:
an obtaining module 301, configured to obtain a source training data set and a target face image generated in a target scene;
a generating module 302, configured to generate a synthesized face image corresponding to the target face image based on the target face image and a source face image corresponding to the target face image in the source training data set;
and the expansion module 303 is configured to expand the source training data set based on the synthesized face image to obtain an expanded training data set.
With the above apparatus, the target face image can be fused on the basis of the source face image in the source training data set, so that the synthesized face image obtained by fusion contains both the image features of the source face image and the image features of the target face image. When few target face images are generated in the target scene, the synthesized face images obtained by fusion can be used to expand the target face images in the sample space, and the expanded training data set can cover more training data in the target scene to a certain extent, so that the trained neural network has higher detection accuracy in the target scene (for example, a brand-new environment).
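By way of illustration only, the three modules above might be arranged as in the following sketch; the class name, method names and call pattern are assumptions introduced for the example.

    class TrainingDataGenerator:
        # Illustrative arrangement of the acquisition, generation and expansion
        # modules; internal details are assumptions for the sake of the example.
        def __init__(self, image_generation_network):
            self.image_generation_network = image_generation_network

        def acquire(self, source_dataset, target_scene_images):
            # acquisition module 301: obtain the source training data set and the
            # target face images generated in the target scene
            return source_dataset, target_scene_images

        def generate(self, source_image, target_image):
            # generation module 302: feature fusion of a paired source/target image
            return self.image_generation_network(source_image, target_image)

        def expand(self, source_dataset, synthesized_images):
            # expansion module 303: add synthesized images to the source set
            return list(source_dataset) + list(synthesized_images)

Used together, the acquire, generate and expand calls reproduce the acquisition, generation and expansion flow of the method embodiments.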
In one possible implementation, the generating module 302 is configured to generate a synthesized face image corresponding to the target face image based on the target face image and a source face image corresponding to the target face image in the source training data set according to the following steps:
extracting a source face image matched with the target face image from the source training data set;
inputting the target face image and the source face image matched with the target face image into a trained image generation neural network for feature fusion processing to obtain a synthesized face image, wherein the synthesized face image fuses the content features of the source face image and the style features of the target face image.
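As one illustration of what such content-style feature fusion can look like, the sketch below uses adaptive instance normalization (AdaIN), a common technique for imposing the style statistics of one feature map onto the content of another; whether the image generation neural network of this disclosure uses AdaIN is not stated here, so the function and its parameters should be read as assumptions.

    def adain_fusion(content_feat, style_feat, eps=1e-5):
        # content_feat / style_feat: PyTorch feature maps of shape [B, C, H, W].
        # Re-normalize the content feature map with the channel-wise mean and
        # standard deviation of the style feature map.
        c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
        c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
        s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
        s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
        normalized = (content_feat - c_mean) / c_std
        return normalized * s_std + s_mean

In a full generator, a fusion step of this kind would typically sit between an encoder that extracts the two feature maps and a decoder that renders the synthesized face image.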
In one possible implementation, the generating module 302 is configured to train an image generation neural network according to the following steps:
acquiring a source face image sample and a target face image sample which are matched;
respectively extracting the characteristics of the matched source face image sample and the matched target face image sample to obtain the content characteristic information of the source face image sample and the style characteristic information of the target face image sample;
and performing at least one round of training on the image generation neural network to be trained based on the content characteristic information of the source face image sample and the style characteristic information of the target face image sample.
In a possible implementation manner, the generating module 302 is configured to perform at least one round of training on the image generation neural network to be trained based on the content feature information of the source face image sample and the style feature information of the target face image sample according to the following steps:
aiming at the current round of training, taking the content characteristic information of a source face image sample and the style characteristic information of a target face image sample as the input characteristics of an image generation neural network to be trained, and determining the fusion characteristic information output by the image generation neural network to be trained;
under the condition that the first similarity between the fusion characteristic information and the content characteristic information is smaller than a first threshold value and the second similarity between the fusion characteristic information and the style characteristic information is smaller than a second threshold value, adjusting network parameters of an image generation neural network, and carrying out next round of training;
and stopping the training when the first similarity between the fusion characteristic information and the content characteristic information is greater than or equal to the first threshold and the second similarity between the fusion characteristic information and the style characteristic information is greater than or equal to the second threshold.
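The threshold rule in the three steps above can be sketched as follows; the similarity function, threshold values, loss form and round limit are illustrative assumptions rather than fixed choices of this disclosure.

    def train_image_generator(generator, optimizer, content_info, style_info,
                              similarity_fn, t1=0.8, t2=0.8, max_rounds=100):
        # Keep adjusting the generator while either similarity is below its
        # threshold; stop once both similarities reach their thresholds.
        for _ in range(max_rounds):
            fused = generator(content_info, style_info)        # fusion feature information
            sim_content = similarity_fn(fused, content_info)   # first similarity
            sim_style = similarity_fn(fused, style_info)       # second similarity
            if sim_content >= t1 and sim_style >= t2:
                break                                          # both thresholds met
            loss = (1.0 - sim_content) + (1.0 - sim_style)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return generator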
In one possible implementation, the generating module 302 is configured to extract a source facial image paired with a target facial image from a source training data set according to the following steps:
determining the living body label category to which the target face image belongs;
and extracting a source face image with the living body label type same as that of the living body label to which the target face image belongs from the source training data set, and taking the source face image as a source face image matched with the target face image.
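A minimal sketch of this matching step, assuming the source training data set is available as (image, live label) pairs; the function name and data layout are assumptions.

    import random

    def extract_paired_source_image(source_dataset, target_label):
        # source_dataset: iterable of (source_face_image, live_label) pairs.
        # Pick a source face image whose live label category matches the live
        # label category to which the target face image belongs.
        candidates = [img for img, label in source_dataset if label == target_label]
        return random.choice(candidates) if candidates else None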
In a possible embodiment, the above apparatus further comprises:
and the training module 304 is configured to perform at least one round of training on the living body detection neural network to be trained based on the extended training data set after the extended training data set is obtained, so as to obtain a trained living body detection neural network.
In one possible implementation, the training module 304 is configured to perform at least one round of training on the living body detecting neural network to be trained based on the extended training data set according to the following steps to obtain a trained living body detecting neural network:
respectively obtaining the characteristic information of each face image in the extended training data set by using a living body detection neural network to be trained;
determining a target loss function value corresponding to a living body detection neural network to be trained based on the obtained characteristic information;
and under the condition that the target loss function value does not meet the preset condition, performing next round of training on the living body detection neural network to be trained until the target loss function value meets the preset condition.
In one possible implementation, the feature information of each face image in the extended training data set includes:
first feature information of the source face image, second feature information of the target face image, and third feature information of the synthesized face image in the extended training data set.
In one possible implementation, the training module 304 is configured to determine the objective loss function value corresponding to the living body detection neural network to be trained based on the feature information of each face image according to the following steps:
determining a first target loss function value for measuring the difference of training data in the same living body label category and a second target loss function value for measuring the feature distribution condition of face images from different sources based on the first feature information, the second feature information and the third feature information;
and determining a target loss function value corresponding to the living body detection neural network to be trained based on the first target loss function value and the second target loss function value.
In one possible implementation, the training module 304 is configured to determine a first objective loss function value for measuring the difference of training data in the same live label category based on the first feature information, the second feature information, and the third feature information according to the following steps:
selecting two first face images of the same living body label type and two second face images of different living body label types from a training data set; the two first face images are different in source, and the two second face images are different in source;
determining a first image similarity between the two first face images based on the feature information of the two first face images; determining second image similarity between the two second face images based on the feature information of the two second face images;
and summing the first image similarity and the second image similarity to obtain a first target loss function value.
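A sketch of how the first target loss function value might be computed, assuming cosine similarity is used as the image similarity measure; the text does not fix the similarity measure, so the function below should be read as one possible instantiation.

    import torch.nn.functional as F

    def first_target_loss_value(feat_first_a, feat_first_b, feat_second_a, feat_second_b):
        # feat_first_a / feat_first_b: 1-D feature vectors of two first face images
        # with the same live label category but different sources.
        # feat_second_a / feat_second_b: 1-D feature vectors of two second face
        # images with different live label categories and different sources.
        first_image_similarity = F.cosine_similarity(feat_first_a, feat_first_b, dim=0)
        second_image_similarity = F.cosine_similarity(feat_second_a, feat_second_b, dim=0)
        # Following the text, the first target loss function value is obtained by
        # summing the two image similarities; how this value enters the overall
        # optimization is left to the training scheme.
        return first_image_similarity + second_image_similarity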
In a possible implementation manner, the training module 304 is configured to determine a second objective loss function value for measuring feature distribution of face images from different sources based on the first feature information, the second feature information, and the third feature information according to the following steps:
respectively determining, based on the first feature information, the second feature information and the third feature information, first distribution feature information used for representing the feature distribution of each source face image in the extended training data set, second distribution feature information used for representing the feature distribution of each target face image in the extended training data set, and third distribution feature information used for representing the feature distribution of each synthesized face image in the extended training data set;
determining the first feature distribution similarity between each target face image and each synthesized face image based on the similarity between the second distribution feature information and the third distribution feature information; splicing the second distribution characteristic information and the third distribution characteristic information to obtain spliced distribution characteristic information;
determining a second feature distribution similarity between each source face image and each synthesized face image based on the similarity between the first distribution feature information and the spliced distribution feature information;
and summing the first feature distribution similarity and the second feature distribution similarity to obtain the second target loss function value.
In a possible implementation manner, the training module 304 is configured to determine, based on the obtained feature information, an objective loss function value corresponding to the living body detection neural network to be trained according to the following steps:
determining a target loss function value corresponding to the living body detection neural network to be trained by utilizing the first characteristic information of each source face image in the extended training data set determined by the living body detection neural network to be trained and the first source characteristic information extracted from the source face image by utilizing the trained source living body detection neural network;
the source living body detection neural network is obtained by training each source face image sample and the living body label type labeled on each source face image sample.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
An embodiment of the present disclosure further provides an electronic device, as shown in fig. 4, which is a schematic structural diagram of the electronic device provided in the embodiment of the present disclosure, and the electronic device includes: a processor 401, a memory 402, and a bus 403. The memory 402 stores machine-readable instructions executable by the processor 401 (for example, execution instructions corresponding to the obtaining module 301, the generating module 302, and the extending module 303 in the apparatus in fig. 3, and the like), when the electronic device is operated, the processor 401 and the memory 402 communicate through the bus 403, and when the machine-readable instructions are executed by the processor 401, the following processes are performed:
acquiring a source training data set and a target face image generated by a target scene;
generating a synthetic face image corresponding to the target face image based on the target face image and a source face image corresponding to the target face image in the source training data set;
and expanding the source training data set based on the synthesized face image to obtain an expanded training data set.
The embodiments of the present disclosure also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the training data generation method in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product, where the computer program product carries program code, and the instructions included in the program code may be used to execute the steps of the training data generation method in the foregoing method embodiments; for details, reference may be made to the foregoing method embodiments, which are not described herein again.
The computer program product may be implemented by hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative; for example, the division of the units is only one logical division, and there may be other divisions in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting the same, and the scope of the present disclosure is not limited thereto, and although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive of the technical solutions described in the foregoing embodiments or equivalent technical features thereof within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included therein. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (15)

1. A method for generating training data, comprising:
acquiring a source training data set and a target face image generated by a target scene;
generating a synthetic face image corresponding to the target face image based on the target face image and a source face image corresponding to the target face image in the source training data set;
and expanding the source training data set based on the synthesized face image to obtain an expanded training data set.
2. The method of claim 1, wherein generating a synthetic face image corresponding to the target face image based on the target face image and a source face image in the source training dataset corresponding to the target face image comprises:
extracting the source facial image paired with the target facial image from the source training data set;
inputting the target face image and a source face image matched with the target face image into a trained image generation neural network for feature fusion processing to obtain a synthesized face image, wherein the synthesized face image is fused with the content features of the source face image and the style features of the target face image.
3. The method of claim 2, wherein the image-generating neural network is trained by:
acquiring a source face image sample and a target face image sample which are matched;
respectively extracting the characteristics of the paired source face image sample and target face image sample to obtain the content characteristic information of the source face image sample and the style characteristic information of the target face image sample;
and performing at least one round of training on the image generation neural network to be trained based on the content characteristic information of the source face image sample and the style characteristic information of the target face image sample.
4. The method of claim 3, wherein the performing at least one round of training on the image generation neural network to be trained based on the content feature information of the source facial image sample and the style feature information of the target facial image sample comprises:
aiming at the current round of training, taking the content characteristic information of the source face image sample and the style characteristic information of the target face image sample as the input characteristics of the image generation neural network to be trained, and determining the fusion characteristic information output by the image generation neural network to be trained;
under the condition that the first similarity between the fusion characteristic information and the content characteristic information is smaller than a first threshold value and the second similarity between the fusion characteristic information and the style characteristic information is smaller than a second threshold value, adjusting network parameters of the image generation neural network, and performing next round of training;
stopping the training when a first similarity between the fusion feature information and the content feature information is greater than or equal to the first threshold and a second similarity between the fusion feature information and the style feature information is greater than or equal to the second threshold.
5. The method according to any one of claims 2-4, wherein said extracting the source facial image paired with the target facial image from the source training data set comprises:
determining the living body label category to which the target face image belongs;
and extracting a source face image with the living body label type same as that of the living body label to which the target face image belongs from the source training data set, and taking the source face image as a source face image matched with the target face image.
6. The method of any of claims 1-5, wherein after said obtaining the augmented training data set, the method further comprises:
and performing at least one round of training on the living body detection neural network to be trained based on the extended training data set to obtain the trained living body detection neural network.
7. The method of claim 6, wherein the performing at least one round of training on the living body detecting neural network to be trained based on the extended training data set to obtain a trained living body detecting neural network comprises:
respectively obtaining the characteristic information of each face image in the extended training data set by using a living body detection neural network to be trained;
determining a target loss function value corresponding to the living body detection neural network to be trained based on the obtained characteristic information;
and under the condition that the target loss function value does not meet the preset condition, carrying out next round of training on the living body detection neural network to be trained until the target loss function value meets the preset condition.
8. The method of claim 7, wherein the feature information of each face image in the extended training data set comprises:
first feature information of the source face image, second feature information of the target face image, and third feature information of the synthesized face image in the extended training data set.
9. The method of claim 8, wherein the determining the target loss function value corresponding to the biopsy neural network to be trained based on the obtained feature information comprises:
determining a first target loss function value for measuring the difference of training data in the same living body label category and a second target loss function value for measuring the feature distribution condition of face images from different sources based on the first feature information, the second feature information and the third feature information;
and determining a target loss function value corresponding to the living body detection neural network to be trained based on the first target loss function value and the second target loss function value.
10. The method of claim 9, wherein determining a first objective loss function value for measuring the variability of training data within a same live label class based on the first feature information, the second feature information, and the third feature information comprises:
selecting two first face images of the same living body label type and two second face images of different living body label types from the training data set; the two first face images are different in source, and the two second face images are different in source;
determining a first image similarity between the two first face images based on the feature information of the two first face images; determining second image similarity between the two second face images based on the feature information of the two second face images;
and summing the first image similarity and the second image similarity to obtain the first target loss function value.
11. The method according to claim 9 or 10, wherein determining a second objective loss function value for measuring feature distribution of face images from different sources based on the first feature information, the second feature information and the third feature information comprises:
respectively determining, based on the first feature information, the second feature information and the third feature information, first distribution feature information used for representing the feature distribution of each source face image in the extended training data set, second distribution feature information used for representing the feature distribution of each target face image in the extended training data set, and third distribution feature information used for representing the feature distribution of each synthesized face image in the extended training data set;
determining a first feature distribution similarity between each target face image and each synthesized face image based on the similarity between the second distribution feature information and the third distribution feature information; splicing the second distribution feature information and the third distribution feature information to obtain spliced distribution feature information;
determining a second feature distribution similarity between each source face image and each synthesized face image based on the similarity between the first distribution feature information and the spliced distribution feature information;
and summing the first feature distribution similarity and the second feature distribution similarity to obtain the second target loss function value.
12. The method of claim 8, wherein the determining the target loss function value corresponding to the biopsy neural network to be trained based on the obtained feature information comprises:
determining a target loss function value corresponding to the living body detection neural network to be trained by using the first characteristic information of each source face image in the extended training data set determined by the living body detection neural network to be trained and the first source characteristic information extracted from the source face image by using the trained source living body detection neural network;
the source living body detection neural network is obtained by training each source face image sample and the living body label type labeled on each source face image sample.
13. An apparatus for generating training data, comprising:
the acquisition module is used for acquiring a source training data set and a target face image generated by a target scene;
a generating module, configured to generate a synthesized face image corresponding to the target face image based on the target face image and a source face image corresponding to the target face image in the source training data set;
and the expansion module is used for expanding the source training data set based on the synthesized face image to obtain an expanded training data set.
14. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the method of generating training data according to any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for generating training data according to any one of claims 1 to 12.
CN202110050175.3A 2021-01-14 2021-01-14 Training sample generation method and device, electronic equipment and storage medium Active CN112733946B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110050175.3A CN112733946B (en) 2021-01-14 2021-01-14 Training sample generation method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112733946A true CN112733946A (en) 2021-04-30
CN112733946B CN112733946B (en) 2023-09-19

Family

ID=75593155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110050175.3A Active CN112733946B (en) 2021-01-14 2021-01-14 Training sample generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112733946B (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107767328A (en) * 2017-10-13 2018-03-06 上海交通大学 The moving method and system of any style and content based on the generation of a small amount of sample
CN108133238A (en) * 2017-12-29 2018-06-08 国信优易数据有限公司 A kind of human face recognition model training method and device and face identification method and device
CN108961358A (en) * 2017-05-22 2018-12-07 阿里巴巴集团控股有限公司 A kind of method, apparatus and electronic equipment obtaining samples pictures
CN109308681A (en) * 2018-09-29 2019-02-05 北京字节跳动网络技术有限公司 Image processing method and device
CN109344716A (en) * 2018-08-31 2019-02-15 深圳前海达闼云端智能科技有限公司 Training method, detection method, device, medium and equipment of living body detection model
US20200126205A1 (en) * 2018-10-18 2020-04-23 Boe Technology Group Co., Ltd. Image processing method, image processing apparatus, computing device and computer-readable storage medium
CN111292384A (en) * 2020-01-16 2020-06-16 西安交通大学 Cross-domain diversity image generation method and system based on generation type countermeasure network
US20200285896A1 (en) * 2019-03-09 2020-09-10 Tongji University Method for person re-identification based on deep model with multi-loss fusion training strategy
CN112052759A (en) * 2020-08-25 2020-12-08 腾讯科技(深圳)有限公司 Living body detection method and device
WO2020258902A1 (en) * 2019-06-24 2020-12-30 商汤集团有限公司 Image generating and neural network training method, apparatus, device, and medium


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990378A (en) * 2021-05-08 2021-06-18 腾讯科技(深圳)有限公司 Scene recognition method and device based on artificial intelligence and electronic equipment
WO2023030427A1 (en) * 2021-09-02 2023-03-09 北京字节跳动网络技术有限公司 Training method for generative model, polyp identification method and apparatus, medium, and device
CN116051926A (en) * 2023-01-12 2023-05-02 北京百度网讯科技有限公司 Training method of image recognition model, image recognition method and device
CN116051926B (en) * 2023-01-12 2024-04-16 北京百度网讯科技有限公司 Training method of image recognition model, image recognition method and device

Also Published As

Publication number Publication date
CN112733946B (en) 2023-09-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant