CN112241716A - Training sample generation method and device - Google Patents

Training sample generation method and device

Info

Publication number
CN112241716A
Authority
CN
China
Prior art keywords
face image, visible light, face, infrared, image
Legal status
Granted
Application number
CN202011146608.7A
Other languages
Chinese (zh)
Other versions
CN112241716B (en)
Inventor
田飞 (Tian Fei)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011146608.7A
Publication of CN112241716A
Application granted
Publication of CN112241716B
Status: Active

Classifications

    • G06V40/168: Human faces, e.g. facial parts, sketches or expressions; Feature extraction; Face representation
    • G06F18/22: Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
    • G06N3/045: Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N3/084: Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
    • G06V10/50: Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis

Abstract

The application discloses a training sample generation method and device, relating to the technical fields of image processing and deep learning. A specific implementation comprises the following steps: acquiring a visible light face image and a near-infrared face image; determining face key points of the visible light face image and extracting HOG features of the face key points; searching the near-infrared face image for the pixel points whose HOG features are most similar to those of the face key points, as target key points; aligning the visible light face image and the near-infrared face image based on the face key points and the target key points; and then generating the input and the target output in a training sample of a cross-modal data generation model. The method and the device can compensate for the deviation in key point positions between the visible light face image and the near-infrared face image. Moreover, using the deviation-compensated visible light face image and near-infrared face image as the input and the target output of a training sample can improve the accuracy of the training sample.

Description

Training sample generation method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to the technical fields of image processing and deep learning, and more particularly to a training sample generation method and device.
Background
With the development of machine learning techniques, more and more models can implement various processing operations such as image processing, natural language processing, and the like.
In the related art, a large number of training samples are often required to train an accurate model. If the data in the model's training samples are inaccurate, the prediction results of the trained model will deviate accordingly.
Disclosure of Invention
Provided are a training sample generation method, a training sample generation device, an electronic device and a storage medium.
According to a first aspect, there is provided a method for generating training samples, comprising: acquiring a visible light face image and a near-infrared face image, wherein the visible light face image and the near-infrared face image are shot for the same face and the difference between their shooting angles is smaller than a specified angle difference; determining face key points of the visible light face image, and extracting histogram of oriented gradients (HOG) features of the face key points; determining the pixel point corresponding to the coordinates of each face key point in the near-infrared face image, and searching, in an area centered on that pixel point, for the pixel point with the highest HOG feature similarity to the face key point, as a target key point; and aligning the visible light face image and the near-infrared face image based on the face key points and the target key points, and generating the input and the target output in a training sample of a cross-modal data generation model based on the aligned visible light face image and the aligned near-infrared face image.
According to a second aspect, there is provided an apparatus for generating training samples, comprising: an acquisition unit configured to acquire a visible light face image and a near-infrared face image, wherein the visible light face image and the near-infrared face image are shot for the same face and the difference between their shooting angles is smaller than a specified angle difference; a determining unit configured to determine face key points of the visible light face image and extract histogram of oriented gradients (HOG) features of the face key points; a searching unit configured to determine the pixel point corresponding to the coordinates of each face key point in the near-infrared face image and to search, in an area centered on that pixel point, for the pixel point with the highest HOG feature similarity to the face key point, as a target key point; and a sample generation unit configured to align the visible light face image and the near-infrared face image based on the face key points and the target key points, and to generate the input and the target output in a training sample of a cross-modal data generation model based on the aligned visible light face image and the aligned near-infrared face image.
According to a third aspect, there is provided an electronic device comprising: one or more processors; a storage device for storing one or more programs that, when executed by one or more processors, cause the one or more processors to implement a method as in any embodiment of a method for generating training samples.
According to a fourth aspect, there is provided a computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the method according to any embodiment of the training sample generation method.
According to the scheme of the application, the deviation in key point positions between the visible light face image and the near-infrared face image, caused by inconsistent shooting parameters of the visible light camera and the near-infrared camera, can be compensated. Using the deviation-compensated visible light face image and near-infrared face image as the input and the target output of a training sample can improve the accuracy of the training sample. Furthermore, training the cross-modal data generation model with such training samples enables the model to make more accurate predictions.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram to which some embodiments of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method of generating training samples according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a training sample generation method according to the present application;
FIG. 4 is a flow diagram of yet another embodiment of a method of generating training samples according to the present application;
FIG. 5 is a schematic block diagram of one embodiment of an apparatus for generating training samples according to the present application;
FIG. 6 is a block diagram of an electronic device for implementing a training sample generation method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, made with reference to the accompanying drawings, includes various details of the embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the training sample generation method or training sample generation apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as video applications, live applications, instant messaging tools, mailbox clients, social platform software, and the like, may be installed on the terminal devices 101, 102, and 103.
Here, the terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server providing various services, such as a background server providing support for the terminal devices 101, 102, 103. The background server can analyze and process the received data such as the visible light face image and the near-infrared face image, and feed back a processing result (for example, a training sample of a cross-modal data generation model) to the terminal device.
It should be noted that the method for generating training samples provided in the embodiment of the present application may be executed by the server 105 or the terminal devices 101, 102, and 103, and accordingly, the means for generating training samples may be disposed in the server 105 or the terminal devices 101, 102, and 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of generating training samples according to the present application is shown. The generation method of the training sample comprises the following steps:
step 201, a visible light face image and a near infrared face image are obtained, wherein the visible light face image and the near infrared face image are shot for the same person, and the difference value of the shooting angles is smaller than the specified angle difference value.
In this embodiment, an execution subject (for example, a server or a terminal device shown in fig. 1) on which the generation method of the training sample operates may acquire a visible light (Red Green Blue, RGB) face image and a Near Infrared (NIR) face image.
In practice, the two images are taken of the same person's face, that is, of the same face. Moreover, when the two images are shot, the difference between the shooting angles adopted for the face is small, that is, smaller than a specified angle difference, such as less than 5 degrees or 10 degrees. For example, the visible light face image of a given person (say, Zhang San) may be shot from the frontal angle, while the shooting angle of the near-infrared face image of the same person deviates from the frontal angle by 5 degrees.
Step 202, determining face key points of the visible light face image, and extracting HOG (histogram of oriented gradients) features of the face key points.
In this embodiment, the execution subject may determine face key points of the visible light face image, and extract HOG features of the determined face key points.
In practice, the execution subject may determine the face key points in various ways. For example, the execution subject may directly obtain the face key points of the visible light face image from a local device or another electronic device, or the execution subject may perform key point detection on the visible light face image and use the detection result as the determined face key points.
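As an illustration only (the patent does not prescribe a specific detector), the following sketch shows one possible way to obtain the face key points of the visible light image, assuming dlib and a pre-trained 68-point landmark model are available; the model file path is hypothetical.

```python
# Sketch: one possible way to obtain face key points of the visible light image.
# dlib and the 68-point landmark model are assumptions, not requirements of the patent.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# hypothetical model path; any landmark detector could be substituted
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_face_keypoints(bgr_image):
    """Return a list of (x, y) face key points for the first detected face."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return []
    shape = predictor(gray, faces[0])
    return [(p.x, p.y) for p in shape.parts()]
```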
Step 203, determining a pixel point corresponding to the coordinates of the face key point in the near-infrared face image, and searching a pixel point with the highest HOG feature similarity between the pixel point and the face key point in an area taking the pixel point as the center to serve as a target key point.
In this embodiment, the execution subject may determine, for each determined face key point, the pixel point corresponding to the coordinates of the face key point in the near-infrared face image, and determine an area centered on that pixel point. For example, the area may be an area of a preset size. The pixel point corresponding to the coordinates of the face key point may be the pixel point with the same coordinates as the face key point, or a pixel point whose coordinates are adjacent to the coordinates of the face key point.
Then, the execution subject may search for a pixel point in the region, where the similarity between the HOG feature of the pixel point and the HOG feature of the key point of the face is the largest, that is, the similarity between the HOG feature of the pixel point and the HOG feature of the key point of the face is greater than (or equal to) the similarity between the HOG features of other pixel points in the region and the HOG feature of the key point of the face.
In practice, the execution subject may determine the similarity between the HOG features of the pixel points and the HOG features of the key points of the face by determining cosine values between the HOG features of the pixel points and the HOG features of the key points of the face.
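As an illustrative sketch of steps 202 and 203 (not part of the patent text), the following code computes a HOG descriptor on a small patch around a key point and scores candidate pixels in the near-infrared image with cosine similarity. The patch size, search radius and HOG parameters are assumptions, and border handling is omitted for brevity.

```python
# Sketch of steps 202-203: HOG descriptors at key points and a cosine-similarity
# search for the target key point. PATCH, RADIUS and the HOG parameters are
# illustrative assumptions; the images are assumed to be grayscale numpy arrays.
import numpy as np
from skimage.feature import hog

PATCH = 32    # side length of the patch a HOG descriptor is computed on (assumption)
RADIUS = 16   # half side length of the square search region (assumption)

def hog_at(gray, x, y, patch=PATCH):
    """HOG descriptor of the patch centered at (x, y)."""
    half = patch // 2
    crop = gray[y - half:y + half, x - half:x + half]
    return hog(crop, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def cosine(a, b):
    """Cosine similarity between two HOG descriptors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def find_target_keypoint(vis_gray, nir_gray, kp, radius=RADIUS):
    """Search around the same coordinates in the NIR image for the pixel whose
    HOG descriptor is most similar to that of the visible-light key point."""
    ref = hog_at(vis_gray, kp[0], kp[1])
    best, best_sim = kp, -1.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = kp[0] + dx, kp[1] + dy
            sim = cosine(ref, hog_at(nir_gray, x, y))
            if sim > best_sim:
                best, best_sim = (x, y), sim
    return best
```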
Step 204, aligning the visible light face image and the near-infrared face image based on the face key points and the target key points, and generating the input and the target output in a training sample of the cross-modal data generation model based on the aligned visible light face image and the aligned near-infrared face image.
In this embodiment, the execution subject may align the visible light face image and the near-infrared face image based on the face key point and the target key point. Then, the execution subject may generate a training sample for training the cross-modal data generation model based on the aligned visible light face image and the aligned near-infrared face image. Specifically, the execution subject may use the aligned visible light face image and the aligned near-infrared face image as an input and a target output in the training sample. The target output here refers to data for determining a loss value in cooperation with a prediction result corresponding to the input. Specifically, because the shooting parameters of the visible light camera and the near-infrared camera are not consistent, there may be some deviations in the key points between the visible light face image and the near-infrared face image, that is, coordinates are not consistent.
The input and output of the cross-modal data generation model may be of different modalities; for example, the input may be the visible light modality and the output the near-infrared modality, or the input may be the near-infrared modality and, accordingly, the output the visible light modality. In practice, the cross-modal data generation model may be any of various deep neural networks, such as a residual neural network.
Specifically, the execution subject may generate the input and the target output in a training sample of the cross-modal data generation model based on the aligned visible light face image and the aligned near-infrared face image in various ways. For example, the execution subject may use the aligned visible light face image as the input in the training sample and the aligned near-infrared face image as the target output in the training sample. Alternatively, the execution subject may use the aligned visible light face image as the target output and the aligned near-infrared face image as the input. In addition, the execution subject may apply preset processing to the aligned visible light face image and/or the aligned near-infrared face image, for example by feeding them into a preset processing model, and use the preset-processed visible light face image and near-infrared face image as the input and the target output, respectively, in the training sample of the cross-modal data generation model.
In practice, the execution subject may align the visible light face image and the near-infrared face image based on the face key point and the target key point in various ways. For example, the execution main body may align the face key point of the visible light face image with the face key point template, and align the target key point of the near-infrared face image with the face key point template, thereby implementing alignment between the visible light face image and the near-infrared face image.
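As a sketch of the template-based alignment just described (the face key point template, the use of OpenCV and the choice of a similarity transform are assumptions; the patent only requires that both images be aligned to the template):

```python
# Sketch: align an image to a face key point template with a similarity transform.
# The template itself and the OpenCV calls are assumptions for illustration.
import cv2
import numpy as np

def align_to_template(image, points, template_points, out_size=(256, 256)):
    """Warp `image` so that `points` land on `template_points`."""
    src = np.asarray(points, dtype=np.float32)
    dst = np.asarray(template_points, dtype=np.float32)
    matrix, _ = cv2.estimateAffinePartial2D(src, dst)  # 2x3 similarity transform
    return cv2.warpAffine(image, matrix, out_size)

# aligned_vis = align_to_template(vis_image, face_keypoints, template_points)
# aligned_nir = align_to_template(nir_image, target_keypoints, template_points)
```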
The method provided by the embodiment of the application can make up for the deviation of the key point position between the visible light face image and the near-infrared face image caused by the inconsistency of the shooting parameters of the visible light camera and the near-infrared camera. The accuracy of the training sample can be improved by using the visible light face image and the near infrared face image which compensate the deviation as the input and the target output of the training sample. Furthermore, the training sample is adopted to train the cross-modal data generation model, so that the model can be predicted more accurately.
In some optional implementations of this embodiment, the searching for the pixel point with the highest HOG feature similarity to the face key point in the region centered on the pixel point may include: dividing the region into a plurality of sub-regions, wherein the number of sub-regions is a preset value; determining, among the sub-regions, the sub-region whose HOG feature is most similar to the HOG feature of the face key point as the target sub-region; and taking the central pixel point of the target sub-region as the found pixel point with the highest HOG feature similarity to the face key point.
In these optional implementations, the execution subject may segment the region into a plurality of sub-regions and determine, among them, the sub-region whose HOG feature has the greatest similarity to the HOG feature of the face key point. That is, the similarity between the HOG feature of the determined sub-region and the HOG feature of the face key point is greater than (or equal to) the similarity between the HOG features of the other sub-regions and the HOG feature of the face key point. The execution subject may then use the central pixel point of the target sub-region as the found pixel point.
For example, the execution subject may divide the region into 64 sub-regions in an 8 × 8 grid, that is, the division may be into equal parts.
Through this segmentation, these implementations avoid traversing the region and computing the HOG feature of every pixel point, thereby reducing the amount of computation and speeding up this step.
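A sketch of this sub-region variant follows; it reuses the hog_at and cosine helpers and the RADIUS constant from the earlier sketch, scores one descriptor per cell of an 8 × 8 grid, and returns the center of the best cell. The grid size and helpers are assumptions, not values fixed by the patent.

```python
# Sketch of the sub-region variant of step 203: score one HOG descriptor per
# sub-region instead of per pixel. Reuses hog_at, cosine and RADIUS from the
# earlier sketch; the 8 x 8 grid is the example value from the text.
def find_target_keypoint_by_subregions(vis_gray, nir_gray, kp, radius=RADIUS, grid=8):
    ref = hog_at(vis_gray, kp[0], kp[1])
    step = (2 * radius) // grid              # side length of one sub-region
    x0, y0 = kp[0] - radius, kp[1] - radius  # top-left corner of the search region
    best, best_sim = kp, -1.0
    for row in range(grid):
        for col in range(grid):
            cx = x0 + col * step + step // 2  # center pixel of this sub-region
            cy = y0 + row * step + step // 2
            sim = cosine(ref, hog_at(nir_gray, cx, cy))
            if sim > best_sim:
                best, best_sim = (cx, cy), sim
    return best
```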
In some optional implementation manners of this embodiment, the aligning the visible light face image and the near-infrared face image based on the face key point and the target key point in step 204 may include: triangulating the visible light face image and the near-infrared face image to obtain a visible light grid map taking face key points as vertexes and a near-infrared grid map taking target key points as vertexes; and carrying out face alignment on the visible light grid image and the near-infrared grid image to obtain an aligned visible light face image and an aligned near-infrared face image.
In these optional implementations, the execution subject may triangulate the visible light face image and the near-infrared face image respectively, so as to obtain a visible light grid map and a near-infrared grid map. In practice, the execution subject may triangulate in various ways, such as Delaunay triangulation.
These implementations obtain grid maps through triangulation, and performing face alignment on the grid maps improves the accuracy of the face alignment.
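As an illustration of the triangulation variant (the use of scipy/skimage, the piecewise-affine warp and the direction of the warp are assumptions; the patent only requires triangulation-based alignment):

```python
# Sketch: Delaunay-triangulate the key points and warp the NIR face onto the
# visible-light key point layout with a piecewise affine transform.
# scipy/skimage are assumptions; PiecewiseAffineTransform builds its own
# Delaunay mesh internally, so the explicit mesh is shown only for clarity.
import numpy as np
from scipy.spatial import Delaunay
from skimage.transform import PiecewiseAffineTransform, warp

def align_by_triangulation(nir_image, nir_points, vis_points):
    vis_pts = np.asarray(vis_points, dtype=float)   # (x, y) face key points
    nir_pts = np.asarray(nir_points, dtype=float)   # (x, y) target key points
    mesh = Delaunay(vis_pts)                        # shared triangle topology
    # visible-light grid: vis_pts[mesh.simplices]; NIR grid: nir_pts[mesh.simplices]
    transform = PiecewiseAffineTransform()
    # maps output coordinates (visible layout) to input coordinates (NIR image)
    transform.estimate(vis_pts, nir_pts)
    return warp(nir_image, transform), mesh
```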
In some optional implementations of this embodiment, the training step of the cross-modal data generation model may include: taking the aligned visible light face image as the input and the aligned near-infrared face image as the target output, and training the cross-modal data generation model to obtain a trained cross-modal data generation model, wherein the trained cross-modal data generation model is used for converting a visible light face image into a near-infrared face image, and the cross-modal data generation model is a generative adversarial network.
In these optional implementations, the execution subject or another electronic device may train the cross-modal data generation model. Taking the execution subject as an example, the execution subject may use the aligned visible light face image as the input in a training sample and feed it into the cross-modal data generation model, that is, perform forward propagation with the input to obtain the prediction result output by the cross-modal data generation model. The aligned near-infrared face image serves as the target output in the training sample; a loss value is determined from the target output and the prediction result through a preset loss function, and back propagation is performed with the loss value to train the cross-modal data generation model and obtain the trained cross-modal data generation model.
In practice, the cross-modal data generation model may be any of various deep neural networks, such as convolutional neural networks, generative adversarial networks, and so on.
These implementations can train the cross-modal data generation model with accurate training samples, thereby producing a cross-modal data generation model with high prediction accuracy.
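For illustration, the following is a minimal sketch of one training step of such a conditional generative adversarial network in PyTorch; the placeholder networks, the loss weights and the optimizer settings are assumptions, since the patent only specifies that the cross-modal data generation model is a generative adversarial network trained on the aligned image pairs.

```python
# Minimal sketch of one GAN training step: aligned visible-light batch as input,
# aligned NIR batch as target output. The tiny networks and the 100x L1 weight
# are placeholders / assumptions, not values from the patent.
import torch
import torch.nn as nn

generator = nn.Sequential(                 # placeholder; a real generator would be a U-Net or ResNet
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())
discriminator = nn.Sequential(             # placeholder discriminator on (input, output) pairs
    nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 4, stride=2, padding=1))

adv_loss, l1_loss = nn.BCEWithLogitsLoss(), nn.L1Loss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(vis, nir):
    """vis: aligned visible-light batch (input); nir: aligned NIR batch (target output)."""
    fake_nir = generator(vis)

    # discriminator: distinguish real pairs from generated pairs
    d_opt.zero_grad()
    real_score = discriminator(torch.cat([vis, nir], dim=1))
    fake_score = discriminator(torch.cat([vis, fake_nir.detach()], dim=1))
    d_loss = (adv_loss(real_score, torch.ones_like(real_score)) +
              adv_loss(fake_score, torch.zeros_like(fake_score)))
    d_loss.backward()
    d_opt.step()

    # generator: fool the discriminator and stay close to the target output
    g_opt.zero_grad()
    fake_score = discriminator(torch.cat([vis, fake_nir], dim=1))
    g_loss = adv_loss(fake_score, torch.ones_like(fake_score)) + 100.0 * l1_loss(fake_nir, nir)
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```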
In some optional implementations of this embodiment, the method may further include: performing graying processing on the visible light face image and the near-infrared face image respectively to obtain a grayed visible light image and a grayed near-infrared image; and the extracting of the histogram of oriented gradients (HOG) features of the face key points in step 202 may include: extracting the HOG features of the face key points in the grayed visible light image.
In these optional implementations, the execution subject may first convert the images to grayscale before extracting the HOG features. The execution subject may perform graying processing on the visible light face image and the near-infrared face image to obtain a grayed visible light image and a grayed near-infrared image respectively. In this way, the execution subject can extract the HOG feature of each face key point from the grayed visible light image.
These implementations can accurately extract the HOG features of the face key points by graying the images.
Optionally, in step 203, the determining of the pixel point corresponding to the coordinates of the face key point in the near-infrared face image and the searching, in an area centered on that pixel point, for the pixel point with the highest HOG feature similarity to the face key point may include: searching the grayed near-infrared image for the pixel point with the same coordinates as the face key point, and searching, in an area centered on that pixel point in the grayed near-infrared image, for the pixel point whose HOG feature is most similar to that of the face key point.
In these optional implementation manners, the execution subject may search for a pixel point having the same coordinate as the face key point in the grayed near-infrared image, and determine, by using the pixel point, a pixel point having the largest HOG feature similarity with the face key point in the grayed near-infrared image. Thus, the execution subject can take the pixel point as a target key point.
In practice, the target key point searched in the grayed near-infrared image can directly correspond to the pixel point with the same coordinate as the target key point in the non-grayed near-infrared human face image. That is, the pixel points with the same coordinates as the target key points searched in the grayed near-infrared image and in the non-grayed near-infrared face image can be used as the target key points searched in the near-infrared face image.
These implementations can, by means of the grayed images, accurately determine the target key points in the near-infrared face image that correspond to the face key points in the visible light face image.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the training sample generation method according to the present embodiment. In the application scenario of fig. 3, the execution subject 301 obtains a visible light face image 302 and a near-infrared face image 303, where the visible light face image 302 and the near-infrared face image 303 are shot for the same face and the difference between the shooting angles is smaller than a specified angle difference. The execution subject 301 determines a face key point 304 of the visible light face image 302, and extracts a histogram of oriented gradients HOG feature of the face key point 304. The execution main body 301 determines a pixel point corresponding to the coordinate of the face key point 304 in the near-infrared face image, and searches for a pixel point with the highest HOG feature similarity to the face key point in an area with the pixel point as the center, as a target key point 305. The execution subject 301 aligns the visible light face image and the near-infrared face image based on the face key point 304 and the target key point 305, and generates an input and a target output in a training sample 306 of the cross-modal data generation model based on the aligned visible light face image and the aligned near-infrared face image.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method of generating training samples is shown. The process 400 includes the following steps:
step 401, acquiring a visible light face image and a near-infrared face image, where the visible light face image and the near-infrared face image are shot for the same face and a difference value of shooting angles is smaller than a specified angle difference value.
In this embodiment, an execution subject (for example, a server or a terminal device shown in fig. 1) on which the generation method of the training sample operates may acquire a visible light (Red Green Blue, RGB) face image and a Near Infrared (NIR) face image.
Step 402, determining face key points of the visible light face image, and extracting HOG (histogram of oriented gradients) features of the face key points.
In this embodiment, the execution subject may determine face key points of the visible light face image, and extract HOG features of the determined face key points.
Step 403, determining pixel points corresponding to the coordinates of the key points of the human face in the near-infrared human face image, and obtaining the area of a detection frame of the human face in the visible light human face image.
In this embodiment, the execution subject may search for a pixel point corresponding to a coordinate of a key point of the face in the near-infrared face image, and obtain an area of a detection frame of the face in the visible light face image.
The execution main body may directly obtain the area from a local or other electronic device, or the execution main body may also obtain the detection frame of the human face in various ways before obtaining the area of the detection frame, for example, the execution main body may directly perform human face detection on the visible light human face image to obtain the detection frame for indicating the position of the human face in the visible light human face image. In addition, the execution subject can also directly acquire a detection result of the human face detection of the visible light human face image from local or other electronic equipment.
Step 404, with the corresponding pixel point as the center, determining the area where a rectangular frame is located, wherein the ratio of the area of the rectangular frame to the area of the detection frame is a preset ratio, and the preset ratio is smaller than 1.
In this embodiment, the execution subject may determine an area with the corresponding pixel point as the center of the area. The area is the area where a rectangular frame is located, and the area of the rectangular frame has a preset ratio to the area of the detection frame; for example, the area of the rectangular frame may be one quarter of the area of the detection frame. The execution subject may determine the rectangular frame in various ways; for example, the execution subject may determine a rectangular frame with a preset aspect ratio, such as a square. Further, the rectangular frame may have a predetermined shape, such as a shape corresponding to the detection frame.
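As a sketch of steps 403 and 404 (the square window and the one-quarter area ratio are illustrative assumptions consistent with the requirement that the preset ratio be smaller than 1):

```python
# Sketch of steps 403-404: derive the search window from the face detection frame
# so that the window scales with the face. The square shape and ratio=0.25 are assumptions.
import math

def search_window(center, detection_box, ratio=0.25):
    """Square window centered at `center` whose area is `ratio` times the area
    of `detection_box` = (x, y, w, h); returns (x0, y0, x1, y1)."""
    _, _, w, h = detection_box
    side = math.sqrt(ratio * w * h)   # side length of a square with the target area
    half = side / 2.0
    cx, cy = center
    return (cx - half, cy - half, cx + half, cy + half)
```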
Step 405, searching, in the region, for the pixel point with the highest HOG feature similarity to the face key point, as the target key point.
In this embodiment, the execution subject may search, in the region, for the pixel point whose HOG feature is most similar to the HOG feature of the face key point, and use it as the target key point.
Step 406, aligning the visible light face image and the near-infrared face image based on the face key points and the target key points, and generating the input and the target output in a training sample of the cross-modal data generation model based on the aligned visible light face image and the aligned near-infrared face image.
In this embodiment, the execution subject may align the visible light face image and the near-infrared face image based on the face key point and the target key point. Then, the execution subject may generate a training sample for training the cross-modal data generation model based on the aligned visible light face image and the aligned near-infrared face image. Specifically, the execution subject may use the aligned visible light face image and the aligned near-infrared face image as an input and a target output in the training sample.
In this embodiment, the area of the region used for generating the target key points can be determined appropriately from the area of the detection frame, avoiding inaccurate target key points caused by a region that is too large or too small.
In some optional implementations of this embodiment, the searching for the pixel point with the highest HOG feature similarity to the face key point in the region centered on the pixel point may include: dividing the region into a plurality of sub-regions, wherein the number of sub-regions is a preset value; determining, among the sub-regions, the sub-region whose HOG feature is most similar to the HOG feature of the face key point as the target sub-region; and taking the central pixel point of the target sub-region as the found pixel point with the highest HOG feature similarity to the face key point.
In some optional implementation manners of this embodiment, the aligning the visible light face image and the near-infrared face image based on the face key point and the target key point may include: triangulating the visible light face image and the near-infrared face image to obtain a visible light grid map taking face key points as vertexes and a near-infrared grid map taking target key points as vertexes; and carrying out face alignment on the visible light grid image and the near-infrared grid image to obtain an aligned visible light face image and an aligned near-infrared face image.
In some optional implementations of this embodiment, the training step for the cross-modal data generation model may include: taking the aligned visible light face image as the input and the aligned near-infrared face image as the target output, and training the cross-modal data generation model to obtain a trained cross-modal data generation model, wherein the trained cross-modal data generation model is used for converting a visible light face image into a near-infrared face image, and the cross-modal data generation model is a generative adversarial network.
In some optional implementations of this embodiment, the method may further include: carrying out graying processing on the visible light face image and the near infrared face image respectively to obtain a grayed visible light image and a grayed near infrared image; the extracting of the histogram of oriented gradient HOG features of the key points of the face may include: and extracting HOG characteristics of key points of the human face in the grayed visible light image.
Optionally, the determining, in the near-infrared face image, a pixel point corresponding to the coordinate of the face key point, and searching, in an area centered on the pixel point, a pixel point with the highest HOG feature similarity to the face key point may include: and searching pixel points with the same coordinates as the key points of the human face in the grayed near-infrared image, and searching the pixel points with the highest HOG characteristic similarity between the pixel points and the key points of the human face in an area which takes the pixel points as the center in the grayed near-infrared image.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for generating a training sample, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and besides the features described below, the embodiment of the apparatus may further include the same or corresponding features or effects as the embodiment of the method shown in fig. 2. The device can be applied to various electronic equipment.
As shown in fig. 5, the training sample generation apparatus 500 of the present embodiment includes: an acquisition unit 501, a determining unit 502, a searching unit 503, and a sample generation unit 504. The acquisition unit 501 is configured to acquire a visible light face image and a near-infrared face image, wherein the visible light face image and the near-infrared face image are shot for the same face and the difference between their shooting angles is smaller than a specified angle difference; the determining unit 502 is configured to determine face key points of the visible light face image and extract histogram of oriented gradients (HOG) features of the face key points; the searching unit 503 is configured to determine the pixel point corresponding to the coordinates of each face key point in the near-infrared face image and to search, in an area centered on that pixel point, for the pixel point with the highest HOG feature similarity to the face key point, as a target key point; and the sample generation unit 504 is configured to align the visible light face image and the near-infrared face image based on the face key points and the target key points, and generate the input and the target output in a training sample of the cross-modal data generation model based on the aligned visible light face image and the aligned near-infrared face image.
In this embodiment, specific processes of the obtaining unit 501, the determining unit 502, the searching unit 503, and the sample generating unit 504 of the training sample generating apparatus 500 and technical effects thereof may refer to related descriptions of step 201, step 202, step 203, and step 204 in the corresponding embodiment of fig. 2, which are not described herein again.
In some optional implementations of this embodiment, the apparatus further includes: the area acquisition unit is configured to acquire the area of a detection frame of the human face in the visible light human face image before searching a pixel point with the highest HOG feature similarity between the pixel point and the human face key point in an area taking the pixel point as the center; and the area determining unit is configured to determine an area of a rectangular frame with a preset area ratio to the detection frame by taking the corresponding pixel point as a center, wherein the preset ratio is smaller than 1.
In some optional implementations of this embodiment, the searching unit is further configured to search, in an area centered on the pixel point, for the pixel point with the highest HOG feature similarity to the face key point in the following manner: dividing the region into a plurality of sub-regions, wherein the number of sub-regions is a preset value; determining, among the sub-regions, the sub-region whose HOG feature is most similar to the HOG feature of the face key point as the target sub-region; and taking the central pixel point of the target sub-region as the found pixel point with the highest HOG feature similarity to the face key point.
In some optional implementations of this embodiment, the sample generating unit is further configured to perform aligning the visible light face image and the near-infrared face image based on the face key point and the target key point as follows: triangulating the visible light face image and the near-infrared face image to obtain a visible light grid map taking face key points as vertexes and a near-infrared grid map taking target key points as vertexes; and carrying out face alignment on the visible light grid image and the near-infrared grid image to obtain an aligned visible light face image and an aligned near-infrared face image.
In some optional implementations of this embodiment, the training step of the cross-modal data generation model includes: taking the aligned visible light face image as the input and the aligned near-infrared face image as the target output, and training the cross-modal data generation model to obtain a trained cross-modal data generation model, wherein the trained cross-modal data generation model is used for converting a visible light face image into a near-infrared face image, and the cross-modal data generation model is a generative adversarial network.
In some optional implementations of this embodiment, the apparatus further includes: the graying unit is configured to perform graying processing on the visible light face image and the near infrared face image respectively to obtain a grayed visible light image and a grayed near infrared image; and the determining unit is further configured to extract the HOG features of the histogram of oriented gradients of the key points of the human face according to the following modes: and extracting HOG characteristics of key points of the human face in the grayed visible light image.
In some optional implementations of this embodiment, the searching unit is further configured to determine the pixel point corresponding to the coordinates of the face key point in the near-infrared face image and to search, in an area centered on that pixel point, for the pixel point with the highest HOG feature similarity to the face key point in the following manner: searching the grayed near-infrared image for the pixel point with the same coordinates as the face key point, and searching, in an area centered on that pixel point in the grayed near-infrared image, for the pixel point whose HOG feature is most similar to that of the face key point.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors 601, a memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing a portion of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the training sample generation method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of generating training samples provided herein.
The memory 602 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the training sample generation method in the embodiment of the present application (for example, the acquisition unit 501, the determination unit 502, the search unit 503, and the sample generation unit 504 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 602, that is, implements the method for generating training samples in the above method embodiments.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the stored data area may store data created from use of the electronic device for generation of training samples, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, and these remote memories may be connected over a network to the training sample generating electronics. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method for generating a training sample may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic device generating the training sample, such as a touch screen, keypad, mouse, track pad, touch pad, pointer stick, one or more mouse buttons, track ball, joystick, or other input device. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a determination unit, a lookup unit, and a sample generation unit. The names of the units do not in some cases constitute a limitation to the units themselves, and for example, the acquisition unit may also be described as a "unit that acquires a visible light face image and a near-infrared face image".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquiring a visible light face image and a near infrared face image, wherein the visible light face image and the near infrared face image are shot aiming at the same face, and the difference value of the shooting angles is smaller than the specified angle difference value; determining face key points of the visible light face image, and extracting HOG (histogram of oriented gradients) features of the face key points; determining pixel points corresponding to coordinates of key points of the face in the near-infrared face image, and searching pixel points with the highest HOG characteristic similarity between the pixel points and the key points of the face in an area taking the pixel points as a center to serve as target key points; and aligning the visible light face image and the near-infrared face image based on the face key points and the target key points, and generating input and target output in a training sample of the cross-mode data generation model based on the aligned visible light face image and the aligned near-infrared face image.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (16)

1. A method of generating training samples, the method comprising:
acquiring a visible light face image and a near-infrared face image, wherein the visible light face image and the near-infrared face image are captured of the same face, and the difference between their shooting angles is smaller than a specified angle difference;
determining face key points of the visible light face image, and extracting histogram of oriented gradients (HOG) features of the face key points;
determining, in the near-infrared face image, a pixel point corresponding to the coordinates of each face key point, and searching, in a region centered on that pixel point, for the pixel point whose HOG features are most similar to those of the face key point, to serve as a target key point;
and aligning the visible light face image and the near-infrared face image based on the face key points and the target key points, and generating the input and the target output of a training sample for a cross-modal data generation model based on the aligned visible light face image and the aligned near-infrared face image.
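As a purely illustrative example of the key point step in claim 1, the sketch below uses dlib's publicly available 68-landmark predictor; the application does not prescribe any particular detector, and the model file is an external dependency of the sketch, not part of the claim.

```python
# Hedged sketch of face key point detection with dlib's 68-landmark model.
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # external model file

def detect_keypoints(gray):
    """Return the face key points [(x, y), ...] and the detection box (x, y, w, h)."""
    face = detector(gray, 1)[0]                       # assume exactly one face per image
    shape = predictor(gray, face)
    pts = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    box = (face.left(), face.top(), face.width(), face.height())
    return pts, box
```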
2. The method of claim 1, wherein before the searching for the pixel point with the highest HOG feature similarity to the face key point in the region centered on the pixel point, the method further comprises:
acquiring the area of a detection frame of the face in the visible light face image;
and determining, with the corresponding pixel point as the center, the region enclosed by a rectangular frame whose area is a preset proportion of the area of the detection frame, wherein the preset proportion is less than 1.
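One possible reading of claim 2 is sketched below: the search region is a square centered on the corresponding near-infrared pixel, with an area equal to a fixed fraction of the detection-frame area. The ratio of 0.05 is an arbitrary example; the claim only requires a proportion smaller than 1.

```python
# Sketch of deriving the search window from the face detection box.
import math

def search_window(det_box, center, ratio=0.05):
    """Square search region centred on `center` whose area is `ratio` x the detection-box area.

    det_box: (x, y, w, h) of the face detection frame in the visible light image.
    center:  (cx, cy), the NIR pixel at the key point's coordinates.
    """
    side = max(1, int(math.sqrt(det_box[2] * det_box[3] * ratio)))
    cx, cy = int(center[0]), int(center[1])
    half = side // 2
    return (cx - half, cy - half, side, side)          # (x, y, w, h)
```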
3. The method according to claim 1 or 2, wherein the searching for the pixel point with the highest HOG feature similarity to the face key point in the region centered on the pixel point comprises:
dividing the region into a plurality of sub-regions, wherein the number of sub-regions is a preset value;
determining, among the sub-regions, the sub-region whose HOG features are most similar to those of the face key point as a target sub-region;
and taking the central pixel point of the target sub-region as the searched pixel point with the highest HOG feature similarity to the face key point.
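The sub-region search of claim 3 could be implemented as in the sketch below, which reuses hog_at_point and hog_similarity from the earlier sketch; comparing each sub-region through the HOG descriptor at its center and using a 4 x 4 grid are simplifying assumptions, not requirements of the claim.

```python
# Sketch of the sub-region search: split the window into grid x grid cells and
# return the centre of the cell whose HOG descriptor best matches the reference.
import numpy as np

def best_hog_match(nir_gray, window, ref_feat, grid=4):
    """Centre pixel of the sub-region whose HOG descriptor best matches ref_feat."""
    x, y, w, h = window
    best_pt, best_sim = (x + w // 2, y + h // 2), -np.inf
    for i in range(grid):
        for j in range(grid):
            # Centre of sub-region (i, j) in a grid x grid partition of the window.
            cx = x + int((j + 0.5) * w / grid)
            cy = y + int((i + 0.5) * h / grid)
            sim = hog_similarity(hog_at_point(nir_gray, (cx, cy)), ref_feat)
            if sim > best_sim:
                best_pt, best_sim = (cx, cy), sim
    return best_pt
```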
4. The method of claim 1 or 2, wherein the aligning the visible light face image and the near-infrared face image based on the face keypoints and the target keypoints comprises:
triangulating the visible light face image and the near-infrared face image to obtain a visible light mesh with the face key points as vertices and a near-infrared mesh with the target key points as vertices;
and performing face alignment on the visible light mesh and the near-infrared mesh to obtain an aligned visible light face image and an aligned near-infrared face image.
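One way to realize the triangulation-based alignment of claim 4 is a piecewise-affine warp, sketched below with scikit-image (which builds a Delaunay triangulation internally); warping the near-infrared image into the visible light frame is a design choice of the sketch, not a requirement of the claim.

```python
# Sketch of keypoint-driven alignment via a piecewise-affine (triangulated) warp.
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def align_by_triangulation(vis_img, nir_img, vis_pts, nir_pts):
    """Warp the NIR image so its target key points land on the visible light key points."""
    tform = PiecewiseAffineTransform()
    # estimate(src, dst): maps coordinates in the output (visible) frame to the NIR frame,
    # which is exactly what skimage.transform.warp expects as an inverse map.
    tform.estimate(np.asarray(vis_pts, dtype=float), np.asarray(nir_pts, dtype=float))
    nir_aligned = warp(nir_img, tform, output_shape=vis_img.shape[:2])
    return vis_img, (nir_aligned * 255).astype(np.uint8)
```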
5. The method of claim 1 or 2, wherein the training of the cross-modal data generation model comprises:
training the cross-modal data generation model with the aligned visible light face image as the input and the aligned near-infrared face image as the target output, to obtain a trained cross-modal data generation model, wherein the trained cross-modal data generation model is used for converting a visible light face image into a near-infrared face image, and the cross-modal data generation model is a generative adversarial network.
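A minimal, hedged sketch of such a training step is shown below in PyTorch: the aligned visible light image is the generator input, the aligned near-infrared image is the target output, and the loss combines an adversarial term with an L1 term in the style of pix2pix. The tiny networks and the loss weight are placeholders, not the architecture of the present application.

```python
# Sketch of one conditional-GAN training step on aligned (visible, NIR) pairs.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(64, 1, 3, padding=1), nn.Tanh())            # visible (3ch) -> NIR (1ch)
D = nn.Sequential(nn.Conv2d(4, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(64, 1, 3, stride=2, padding=1))             # judges (visible, NIR) pairs

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
adv = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def train_step(vis, nir):
    """One optimization step on a batch of aligned (visible, NIR) image pairs."""
    fake = G(vis)

    # Discriminator: real aligned pairs vs generated pairs.
    d_real = D(torch.cat([vis, nir], dim=1))
    d_fake = D(torch.cat([vis, fake.detach()], dim=1))
    loss_d = adv(d_real, torch.ones_like(d_real)) + adv(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the discriminator while staying close to the NIR target.
    d_fake = D(torch.cat([vis, fake], dim=1))
    loss_g = adv(d_fake, torch.ones_like(d_fake)) + 100.0 * l1(fake, nir)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```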
6. The method according to claim 1 or 2, wherein the method further comprises:
performing graying processing on the visible light face image and the near-infrared face image respectively to obtain a grayed visible light image and a grayed near-infrared image; and
the extracting histogram of oriented gradients (HOG) features of the face key points comprises:
extracting, from the grayed visible light image, the HOG features of the face key points.
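The graying step can be as simple as the sketch below, assuming BGR OpenCV input for the visible light image; a near-infrared frame that is already single-channel is passed through unchanged.

```python
# Sketch of the graying step applied to both images before HOG extraction.
import cv2

def to_gray(img):
    """Return a single-channel version of an OpenCV image."""
    return img if img.ndim == 2 else cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
```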
7. The method of claim 6, wherein the determining a pixel point corresponding to the coordinates of the face key point in the near-infrared face image, and searching, in a region centered on the pixel point, for the pixel point with the highest HOG feature similarity to the face key point comprises:
searching, in the grayed near-infrared image, for the pixel point having the same coordinates as the face key point, and searching, in a region of the grayed near-infrared image centered on that pixel point, for the pixel point whose HOG features are most similar to those of the face key point.
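Putting claims 6 and 7 together, a driver that yields one target key point per visible light key point might look like the sketch below; it reuses the hypothetical helpers from the previous sketches (hog_at_point, search_window, best_hog_match).

```python
def find_target_keypoints(vis_gray, nir_gray, vis_pts, det_box):
    """For each visible light key point, search the grayed NIR image around the same coordinates."""
    targets = []
    for p in vis_pts:
        ref = hog_at_point(vis_gray, p)          # descriptor at the visible light key point
        win = search_window(det_box, p)          # window centred on the same coordinates in NIR
        targets.append(best_hog_match(nir_gray, win, ref))
    return targets
```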
8. An apparatus for generating training samples, the apparatus comprising:
an acquisition unit, configured to acquire a visible light face image and a near-infrared face image, wherein the visible light face image and the near-infrared face image are captured of the same face, and the difference between their shooting angles is smaller than a specified angle difference;
a determining unit, configured to determine face key points of the visible light face image, and to extract histogram of oriented gradients (HOG) features of the face key points;
a searching unit, configured to determine, in the near-infrared face image, a pixel point corresponding to the coordinates of each face key point, and to search, in a region centered on that pixel point, for the pixel point whose HOG features are most similar to those of the face key point, to serve as a target key point;
and a sample generation unit, configured to align the visible light face image and the near-infrared face image based on the face key points and the target key points, and to generate the input and the target output of a training sample for the cross-modal data generation model based on the aligned visible light face image and the aligned near-infrared face image.
9. The apparatus of claim 8, wherein the apparatus further comprises:
an area acquisition unit, configured to acquire the area of a detection frame of the face in the visible light face image before the search, in the region centered on the pixel point, for the pixel point with the highest HOG feature similarity to the face key point;
and an area determining unit, configured to determine, with the corresponding pixel point as the center, the region enclosed by a rectangular frame whose area is a preset proportion of the area of the detection frame, wherein the preset proportion is less than 1.
10. The apparatus according to claim 8 or 9, wherein the searching unit is further configured to perform the searching for the pixel point with the highest HOG feature similarity to the face key point in the region centered on the pixel point as follows:
dividing the region into a plurality of sub-regions, wherein the number of sub-regions is a preset value;
determining, among the sub-regions, the sub-region whose HOG features are most similar to those of the face key point as a target sub-region;
and taking the central pixel point of the target sub-region as the searched pixel point with the highest HOG feature similarity to the face key point.
11. The apparatus according to claim 8 or 9, wherein the sample generation unit is further configured to perform the aligning the visible light face image and the near-infrared face image based on the face keypoints and the target keypoints as follows:
triangulating the visible light face image and the near-infrared face image to obtain a visible light mesh with the face key points as vertices and a near-infrared mesh with the target key points as vertices;
and performing face alignment on the visible light mesh and the near-infrared mesh to obtain an aligned visible light face image and an aligned near-infrared face image.
12. The apparatus of claim 8 or 9, wherein the training of the cross-modal data generation model comprises:
training the cross-modal data generation model with the aligned visible light face image as the input and the aligned near-infrared face image as the target output, to obtain a trained cross-modal data generation model, wherein the trained cross-modal data generation model is used for converting a visible light face image into a near-infrared face image, and the cross-modal data generation model is a generative adversarial network.
13. The apparatus of claim 8 or 9, wherein the apparatus further comprises:
the graying unit is configured to perform graying processing on the visible light face image and the near infrared face image respectively to obtain a grayed visible light image and a grayed near infrared image; and
the determining unit is further configured to extract the histogram of oriented gradients (HOG) features of the face key points as follows:
extracting, from the grayed visible light image, the HOG features of the face key points.
14. The apparatus of claim 13, wherein the searching unit is further configured to determine a pixel point corresponding to the coordinates of the face key point in the near-infrared face image, and to search for a pixel point with the highest HOG feature similarity to the face key point in a region centered on the pixel point, as follows:
searching, in the grayed near-infrared image, for the pixel point having the same coordinates as the face key point, and searching, in a region of the grayed near-infrared image centered on that pixel point, for the pixel point whose HOG features are most similar to those of the face key point.
15. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
16. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-7.
CN202011146608.7A 2020-10-23 2020-10-23 Training sample generation method and device Active CN112241716B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011146608.7A CN112241716B (en) 2020-10-23 2020-10-23 Training sample generation method and device

Publications (2)

Publication Number Publication Date
CN112241716A true CN112241716A (en) 2021-01-19
CN112241716B CN112241716B (en) 2023-06-20

Family

ID=74169417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011146608.7A Active CN112241716B (en) 2020-10-23 2020-10-23 Training sample generation method and device

Country Status (1)

Country Link
CN (1) CN112241716B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019128508A1 (en) * 2017-12-28 2019-07-04 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method and apparatus for processing image, storage medium, and electronic device
CN109858445A (en) * 2019-01-31 2019-06-07 北京字节跳动网络技术有限公司 Method and apparatus for generating model
CN110532992A (en) * 2019-09-04 2019-12-03 深圳市捷顺科技实业股份有限公司 A kind of face identification method based on visible light and near-infrared

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU JIA; TIAN WEIJIAN; FAN YANGYU: "Simulation of face key point recognition and localization method based on deep learning", Computer Simulation (计算机仿真), no. 06 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950732A (en) * 2021-02-23 2021-06-11 北京三快在线科技有限公司 Image generation method and device, storage medium and electronic equipment
CN112950732B (en) * 2021-02-23 2022-04-01 北京三快在线科技有限公司 Image generation method and device, storage medium and electronic equipment
CN113674230A (en) * 2021-08-10 2021-11-19 深圳市捷顺科技实业股份有限公司 Method and device for detecting key points of indoor backlight face
CN113674230B (en) * 2021-08-10 2023-12-19 深圳市捷顺科技实业股份有限公司 Method and device for detecting key points of indoor backlight face
CN115205939A (en) * 2022-07-14 2022-10-18 北京百度网讯科技有限公司 Face living body detection model training method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112241716B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN112241716B (en) Training sample generation method and device
CN112270669B (en) Human body 3D key point detection method, model training method and related devices
CN111291885A (en) Near-infrared image generation method, network generation training method and device
CN111612852B (en) Method and apparatus for verifying camera parameters
CN110659600B (en) Object detection method, device and equipment
CN111722245A (en) Positioning method, positioning device and electronic equipment
CN111611903A (en) Training method, using method, device, equipment and medium of motion recognition model
CN111695519A (en) Key point positioning method, device, equipment and storage medium
KR20210146770A (en) Method for indoor localization and electronic device
CN111753964A (en) Neural network training method and device
CN111462179A (en) Three-dimensional object tracking method and device and electronic equipment
CN111767990A (en) Neural network processing method and device
CN111275827B (en) Edge-based augmented reality three-dimensional tracking registration method and device and electronic equipment
CN112561059A (en) Method and apparatus for model distillation
CN112529180A (en) Method and apparatus for model distillation
CN112488126A (en) Feature map processing method, device, equipment and storage medium
CN112116548A (en) Method and device for synthesizing face image
CN112270303A (en) Image recognition method and device and electronic equipment
CN112102417A (en) Method and device for determining world coordinates and external reference calibration method for vehicle-road cooperative roadside camera
CN111832611A (en) Training method, device and equipment of animal recognition model and storage medium
CN114674328B (en) Map generation method, map generation device, electronic device, storage medium, and vehicle
CN113128436B (en) Method and device for detecting key points
CN113033485A (en) Method and device for detecting key points
CN111339344B (en) Indoor image retrieval method and device and electronic equipment
CN112529181A (en) Method and apparatus for model distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant