CN112149634A - Training method, device and equipment of image generator and storage medium - Google Patents


Info

Publication number
CN112149634A
CN112149634A (application CN202011143074.2A)
Authority
CN
China
Prior art keywords
image
modality
model
generator
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011143074.2A
Other languages
Chinese (zh)
Inventor
田飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011143074.2A priority Critical patent/CN112149634A/en
Publication of CN112149634A publication Critical patent/CN112149634A/en
Pending legal-status Critical Current

Classifications

    • G06V40/161 — Image or video recognition: human faces, e.g. facial parts, sketches or expressions (detection; localisation; normalisation)
    • G06F18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22 — Pattern recognition: matching criteria, e.g. proximity measures
    • G06N3/045 — Neural networks: combinations of networks
    • G06N3/08 — Neural networks: learning methods

Abstract

The embodiment of the application discloses a training method, apparatus, device, and storage medium for an image generator, relating to artificial intelligence fields such as deep learning, big data processing, data enhancement, and computer vision. The method comprises the following steps: inputting the encoding of a first modality image and the encoding of a second modality image of the same object into an initial model of an image generator, respectively, to obtain a first image and a second image; inputting the first image and the second image into a target network model, respectively, to obtain a feature value of the first image and a feature value of the second image; and determining a loss function based on the feature value of the first image, the feature value of the second image, and the type of the target network model, then adjusting the parameters of the image generator to obtain the trained image generator. This provides an image generator capable of generating images of a specific modality (the first modality and/or the second modality), alleviates the overfitting caused by too few training samples for a recognition model of that specific modality, and improves the recognition accuracy of the model.

Description

Training method, device and equipment of image generator and storage medium
Technical Field
The present application relates to the field of computer technologies, specifically to artificial intelligence fields such as deep learning, big data processing, data enhancement, and computer vision, and in particular to a training method, apparatus, device, and storage medium for an image generator.
Background
The commonly used camera today is the RGB (red-green-blue three-channel color) camera, so face data in the RGB modality is easy to acquire, and training data for RGB-modality face recognition models is plentiful. For other modalities, however, such as Depth and NIR (near-infrared), face data is scarce because of acquisition cost or technical difficulty.
Disclosure of Invention
The embodiment of the application provides a training method, a training device, training equipment and a storage medium of an image generator.
In a first aspect, an embodiment of the present application provides a training method for an image generator, including: inputting a first modal image and a second modal image of the same object to a pre-trained encoder to obtain a code of the first modal image and a code of the second modal image; respectively inputting the code of the first modality image and the code of the second modality image into an initial model of an image generator to obtain a first image and a second image; inputting the first image and the second image into a pre-trained target network model respectively to obtain a characteristic value of the first image and a characteristic value of the second image; determining a loss function of an initial model of the image generator corresponding to the target network model based on the characteristic value of the first image, the characteristic value of the second image and the type of the target network model, and adjusting parameters of the initial model of the image generator according to the loss function of the initial model of the image generator to obtain the trained image generator.
In a second aspect, an embodiment of the present application provides a training apparatus for an image generator, including: the encoding module is configured to input a first modality image and a second modality image of the same object to a pre-trained encoder to obtain an encoding of the first modality image and an encoding of the second modality image; an image generation module configured to input the encoding of the first modality image and the encoding of the second modality image into an initial model of an image generator respectively, resulting in a first image and a second image; the image feature value generation module is configured to input the first image and the second image into a pre-trained target network model respectively to obtain a feature value of the first image and a feature value of the second image; and the image generator parameter adjusting module is configured to determine a loss function of an initial model of the image generator corresponding to the target network model based on the feature value of the first image, the feature value of the second image and the type of the target network model, and adjust parameters of the initial model of the image generator according to the loss function of the initial model of the image generator to obtain the trained image generator.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.
In a fourth aspect, embodiments of the present application propose a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described in any one of the implementations of the first aspect.
According to the training method, apparatus, device, and storage medium for an image generator, first, a first modality image and a second modality image of the same object are input to a pre-trained encoder to obtain the encoding of the first modality image and the encoding of the second modality image; then, the two encodings are input to the initial model of an image generator, respectively, to obtain a first image and a second image; next, the first image and the second image are input to a pre-trained target network model, respectively, to obtain a feature value of the first image and a feature value of the second image; finally, a loss function of the initial model of the image generator corresponding to the target network model is determined based on the feature value of the first image, the feature value of the second image, and the type of the target network model, and the parameters of the initial model are adjusted according to that loss function to obtain the trained image generator. This provides an image generator capable of generating images of a specific modality (the first modality and/or the second modality), alleviates the model overfitting caused by too few training samples for a recognition model of that specific modality, and improves the recognition accuracy of that model.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings. The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a schematic flow chart diagram of one embodiment of a training method for an image generator according to the present application;
FIG. 3 is a schematic diagram of an embodiment of a training apparatus of the image generator of the present application;
fig. 4 is a block diagram of an electronic device for implementing a training method of an image generator according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail below with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and not restrictive of it. It should be noted that, for convenience of description, only the portions relevant to the invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 of an embodiment of a training apparatus to which the training method of the image generator or the image generator of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
Terminal device 101 may interact with server 103 through network 102. The first modality image and the second modality image serving as training samples may be provided by the terminal device 101, which includes but is not limited to a database, a user terminal, and the like.
The server 103 may provide various services, for example, the server 103 may perform processing such as analysis on data such as the first-modality image and the second-modality image acquired from the terminal apparatus 101, and generate a processing result (for example, obtain a trained image generator).
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the training method of the image generator provided in the embodiment of the present application is generally executed by the server 103, and accordingly, the training device of the image generator is generally disposed in the server 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a training method of an image generator according to the present application is shown. The method comprises the following steps:
Step 201, inputting a first modality image and a second modality image of the same object to a pre-trained encoder to obtain the encoding of the first modality image and the encoding of the second modality image.
In this embodiment, the executing entity of the training method of the image generator (for example, the server 103 shown in fig. 1) may input the first modality image and the second modality image of the same object to the pre-trained encoder to obtain the encoding of the first modality image and the encoding of the second modality image.
The training process of the encoder is as follows: acquiring a first modal image sample and a second modal image sample of the same object; and taking the first modality image sample and the second modality image sample of the same object as the input of an initial model of the encoder, taking the coding of the first modality image sample and the coding of the second modality image sample as the output of the initial model of the encoder, and training the initial model of the encoder to obtain the trained encoder.
The first modality may be one for which image data is plentiful, and the second modality one for which image data is scarce.
For example, the first modality image may be an RGB, CT, MRI, or PET image, and the second modality image an NIR or Depth image. RGB is an industry color standard in which colors are obtained by varying the three color channels red (R), green (G), and blue (B) and superimposing them on one another.
An RGB image can be captured by an ordinary camera, whereas an NIR (near-infrared) image or a Depth image requires professional equipment (such as a Kinect or RealSense multi-modal camera), so such images are hard to acquire and their data volume is small. Because the data volume of the first modality is sufficient, an object recognition model trained on first-modality images achieves high recognition accuracy; the data volume of the second modality is small, so an object recognition model trained only on second-modality images is prone to overfitting, which hurts recognition accuracy.
In the present embodiment, the first modality image and the second modality image may both be of a data-rich modality, such as RGB; they may both be of a data-scarce modality, such as NIR; or the first modality may be data-scarce while the second is data-rich.
In this embodiment, the initial model of the encoder may be an untrained or incompletely trained encoder. Each layer of the initial model may be provided with initial parameters, which are adjusted continuously during training. The initial model of the encoder may be any type of untrained or incompletely trained artificial neural network, or a combination of several: for example, a convolutional neural network, a recurrent neural network, or a combination of a convolutional neural network, a recurrent neural network, and fully-connected layers.
Step 202, inputting the code of the first modality image and the code of the second modality image into the initial model of the image generator respectively to obtain the first image and the second image.
In this embodiment, the executing entity may input the encoding of the first modality image and the encoding of the second modality image to the initial model of the image generator, respectively, to obtain the first image and the second image. A network structure symmetrical to the encoder trained in step 201 may be used as part of the initial model of the image generator. For example, if the encoder's hidden-layer sizes are [h1, h2, h3], where h1 is the input and h2 and h3 are the hidden-layer outputs, the hidden-layer sizes of the initial model of the image generator may be [h3, h2, h1], where h3 and h2 are the inputs and h1 is the hidden-layer output. That is, the encoder and the image generator may form an encoder-decoder structure. Specifically, the image generator decodes the encoding of the first modality image and the encoding of the second modality image; the decoding may use N fully-connected layers followed by a Sigmoid activation function to obtain the first image and the second image, respectively.
In this embodiment, the initial model of the image generator may be an untrained or incompletely trained image generator. Each layer of the initial model may be provided with initial parameters, which are adjusted continuously during training.

The initial model of the image generator may be any type of untrained or incompletely trained artificial neural network, or a combination of several: for example, a convolutional neural network, a recurrent neural network, or a combination of a convolutional neural network, a recurrent neural network, and fully-connected layers.
The first image and the second image output by the initial model of the image generator with its initial parameters may each be of the first modality or of the second modality. After the training of the image generator is completed, inputting the encoding of a first modality image into the image generator outputs a first modality image, and inputting the encoding of a second modality image outputs a second modality image.
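To make the encoder-decoder structure above concrete, the following is a minimal PyTorch sketch. The mirrored hidden sizes [h1, h2, h3] / [h3, h2, h1] and the Sigmoid-activated fully-connected decoding follow the description in step 202; the concrete sizes (112×112 flattened inputs, a 128-dimensional code), the layer count, and the ReLU activations are illustrative assumptions, as the patent leaves the exact architecture open.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a flattened image to a code; hidden sizes [h1, h2, h3]."""
    def __init__(self, h1, h2, h3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, h3),
        )

    def forward(self, x):        # x: (batch, h1)
        return self.net(x)       # code: (batch, h3)

class Generator(nn.Module):
    """Mirror of the encoder: hidden sizes [h3, h2, h1], Sigmoid output."""
    def __init__(self, h1, h2, h3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(h3, h2), nn.ReLU(),
            nn.Linear(h2, h1),
            nn.Sigmoid(),        # pixel values in [0, 1]
        )

    def forward(self, code):
        return self.net(code)

# One shared encoder/generator pair serves both modality branches.
enc = Encoder(112 * 112, 512, 128)
gen = Generator(112 * 112, 512, 128)
code_rgb = enc(torch.rand(4, 112 * 112))   # encodings of first-modality images
code_nir = enc(torch.rand(4, 112 * 112))   # encodings of second-modality images
first_image = gen(code_rgb)
second_image = gen(code_nir)
```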
Step 203, inputting the first image and the second image to the pre-trained target network model respectively to obtain a characteristic value of the first image and a characteristic value of the second image.
In this embodiment, the executing entity may input the first image and the second image to the pre-trained target network model respectively to obtain a feature value of the first image and a feature value of the second image. The target network model may be any artificial neural network, such as a convolutional neural network or a recurrent neural network.
In this embodiment, the loss function of the initial model of the image generator may be determined according to the target network model. That is, the loss functions of the initial models of the image generators for different types of target network models are different.
For example, in order to make the first image and the second image generated by the image generator belong to the same face, a face recognition model may be selected as a target network model, and the first image and the second image are input to a pre-trained face recognition model to obtain a feature value of the first image and a feature value of the second image.
For another example, to make the first image generated by the image generator more closely resemble a first modality image, or the second image a second modality image, an image modality discrimination model may be selected as the target network model; the first image and the second image are input to the pre-trained image modality discrimination model to obtain the feature value of the first image and the feature value of the second image.
For example, to control the difference between the input of the encoder and the output of the image generator, a deep convolutional network model may be selected as the target network model; the first image and the second image are input to the pre-trained deep convolutional network model to obtain the feature value of the first image and the feature value of the second image.
Step 204, determining a loss function of the initial model of the image generator corresponding to the target network model based on the feature value of the first image, the feature value of the second image, and the type of the target network model, and adjusting the parameters of the initial model of the image generator according to that loss function to obtain the trained image generator.
In this embodiment, the executing entity may adjust parameters of the initial model of the image generator according to a loss function of the initial model of the image generator, so as to obtain a trained image generator.
The loss functions of the initial model of the image generator differ for different types of target network models. Specifically, if the face recognition model is selected as the target network model, the loss function of the initial model of the image generator may be determined based on the similarity between the feature value of the first image and the feature value of the second image. For example, the cosine distance between the two feature values may be used as the loss function: loss = arccos(dot(A, B) / (||A||·||B||)), where A is the feature value of the first image and B is the feature value of the second image. If the image modality discrimination model is selected as the target network model, the loss function of the initial model of the image generator is a loss function based on adversarial discrimination of image modality features. For example, D_loss = E(max(0, 1 − D(first image))) + E(max(0, 1 + D(second image))), where D denotes inference by the discrimination model, E denotes averaging over the batch samples, and max denotes taking the maximum. The image modality discrimination model is trained as follows: if the input of the model is a first modality image, the true value 1 is used as the expected output; if the input is a second modality image, the false value −1 is used as the expected output; the trained image modality discrimination model is thereby obtained. If the deep convolutional network model is selected as the target network model, the smooth-L1 loss between the feature value of the first image and the feature value of the second image may be used as the loss function.
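The three loss functions just described can be sketched as follows, assuming PyTorch tensors. A and B are the feature values of the first and second images, and D stands for the image modality discrimination model, as in the formulas above; the implementation details (clamping before arccos, batch averaging placement) are assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_loss(A, B):
    """loss = arccos(dot(A, B) / (||A||·||B||)), averaged over the batch."""
    cos = (A * B).sum(dim=1) / (A.norm(dim=1) * B.norm(dim=1))
    return torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7)).mean()

def adversarial_loss(D, first_image, second_image):
    """D_loss = E(max(0, 1 - D(first))) + E(max(0, 1 + D(second)))."""
    return (F.relu(1 - D(first_image)).mean()
            + F.relu(1 + D(second_image)).mean())

def feature_loss(A, B):
    """smooth-L1 loss between the two feature values."""
    return F.smooth_l1_loss(A, B)
```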
In this embodiment, after the loss function of the initial model of the image generator is determined, the initial model of the image generator may be trained, and parameters of the initial model of the image generator are adjusted according to the loss function until the loss function converges, so as to obtain the trained image generator.
According to the training method of the image generator provided by the embodiment of the application, the image generator capable of generating the specific modality image is obtained by constraining the image generator, so that the problem of model overfitting caused by too few training samples of the specific modality image recognition model is solved, and the recognition accuracy of the specific modality image recognition model is improved.
In some optional implementations of this embodiment, the training process of the encoder in step 201 includes: inputting a first modality image sample and a second modality image sample of the same object into an initial model of an encoder to obtain a code of the first modality image sample and a code of the second modality image sample; determining a loss function of an initial model of the encoder based on the cosine distance between the code of the first mode image sample and the code of the second mode image sample, and adjusting the parameters of the initial model of the encoder according to the loss function of the initial model of the encoder to obtain the trained encoder.
To ensure that the different modality images of the image to be expanded belong to the same object (for example, the same face), the loss function of the initial model of the encoder may be constrained so that the encodings of the different modality images of the image to be expanded are close in cosine distance; that is, a cosine-distance loss function is used to train the initial model of the encoder. The image to be expanded is the first modality image and/or the second modality image. For example, the cosine-distance loss function is loss = arccos(dot(A, B) / (||A||·||B||)), where A is the encoding of the first modality image, B is the encoding of the second modality image, dot(A, B) denotes the dot product of A and B, and ||A|| denotes the modulus of the vector A, i.e. the square root of the sum of the squares of its elements. Imposing this loss on the cosine distance between the encoding of the first modality image and the encoding of the second modality image drives the two input images to be characterized as the same object.
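As a usage sketch, the same cosine-distance loss can be applied directly to the encoder outputs during encoder training; `enc` and `cosine_loss` refer to the earlier sketches, and the paired batches are assumptions.

```python
# rgb_batch, nir_batch: paired modality images of the same objects, flattened.
code_a = enc(rgb_batch)          # encodings of the first modality samples
code_b = enc(nir_batch)          # encodings of the second modality samples
encoder_loss = cosine_loss(code_a, code_b)   # -> 0 as the paired codes align
encoder_loss.backward()          # gradients adjust the encoder parameters
```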
In some optional implementations of this embodiment, the training process of the encoder in step 201 includes: inputting a first modality image sample and a second modality image sample of the same object into an initial model of an encoder to obtain a code of the first modality image sample and a code of the second modality image sample; determining a loss function of an initial model of the encoder based on the mean and variance of the encoding of the first modality image sample or the encoding of the second modality image sample, and adjusting parameters of the initial model of the encoder according to the loss function of the initial model of the encoder to obtain the trained encoder.
To ensure that the encoding of a certain modality of the image to be expanded conforms to a certain type of distribution, the loss function of the initial model of the encoder may be constrained so that the encoding of the first modality image or the encoding of the second modality image conforms to the desired distribution.

Specifically, the loss function of the initial model of the encoder may be determined based on the mean of the encoding of the image to be expanded and the variance of that encoding, where the image to be expanded is the first modality image and/or the second modality image. Illustratively, the loss function of the initial model of the encoder is f(μ, σ²), where μ denotes the mean of the encoding of the image to be expanded and σ² denotes its variance. For example, to ensure that the encoding of the image to be expanded fits a Gaussian distribution with mean 0 and variance 1, the loss function can be constrained to:

loss = (1 / (2d)) · Σ (μ² + σ² − log σ² − 1),

where d denotes the number of images to be expanded, μ denotes the mean of the encoding of the image to be expanded, and σ² denotes the variance of that encoding. Alternatively, the initial model of the encoder may be trained on both the cosine-distance loss function and the loss function f(μ, σ²) to obtain the trained encoder. The encoder then both ensures that images of different modalities of the image to be expanded represent the same object and ensures that the encoding of the image to be expanded conforms to the specific type of distribution, which facilitates subsequent data enhancement of the image to be expanded according to that distribution.
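A sketch of this distribution constraint, assuming the encoder exposes a per-dimension mean and log-variance for each code — a common parameterization that the text does not mandate:

```python
import torch

def gaussian_prior_loss(mu, log_var):
    """(1 / (2d)) * sum (mu^2 + sigma^2 - log(sigma^2) - 1), with
    d = number of images to be expanded; zero when codes follow N(0, 1)."""
    d = mu.shape[0]                               # batch of images to be expanded
    kl = mu.pow(2) + log_var.exp() - log_var - 1
    return kl.sum() / (2 * d)
```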
In some optional implementations of the present embodiment, the target network model in step 204 includes at least one of a face recognition network model, an image modality discrimination model, and a deep convolution network model.
Specifically, if the face recognition network model and the image modality discrimination model are both selected as the target network model, the initial model of the image generator may be trained on both the cosine-distance loss function over the feature value of the first image and the feature value of the second image and the adversarial image-modality-feature discrimination loss function, so as to obtain the trained image generator. For example, when the initial model of the image generator is trained, the loss function loss = arccos(dot(A, B) / (||A||·||B||)) and the loss function D_loss = E(max(0, 1 − D(first image))) + E(max(0, 1 + D(second image))) are optimized at the same time.
Optionally, if the face recognition network model, the image mode discrimination model, and the deep convolution network model are simultaneously selected as the target network model, the initial model of the image generator may be trained based on the loss functions corresponding to the three models, so as to obtain a trained image generator.
For example, when the initial model of the image generator is trained, the loss function loss = arccos(dot(A, B) / (||A||·||B||)), the loss function D_loss = E(max(0, 1 − D(first image))) + E(max(0, 1 + D(second image))), and the smooth-L1 loss are optimized at the same time.
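As a sketch, simultaneous optimization sums the three terms into a single objective before back-propagating. The optimizer, learning rate, and equal weighting are illustrative assumptions, and `feat_first`/`feat_second` stand for the feature values produced by the target network models; the loss helpers are from the earlier sketch.

```python
import torch

# feat_first / feat_second: assumed feature values of the generated images
# produced by the target network models; D is the discrimination model.
optimizer = torch.optim.Adam(gen.parameters(), lr=1e-4)   # assumed settings

loss = (cosine_loss(feat_first, feat_second)              # same-object constraint
        + adversarial_loss(D, first_image, second_image)  # modality realism
        + feature_loss(feat_first, feat_second))          # feature closeness

optimizer.zero_grad()
loss.backward()
optimizer.step()
```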
An image generator trained with these three loss functions can generate a first image and a second image that belong to the same object, with the generated first image closer to a first modality image, the generated second image closer to a second modality image, and a smaller difference between the first modality image fed to the encoder and the first image produced by the image generator.
In some optional implementations of this embodiment, a Gaussian noise code may be obtained for the encoding of the image to be expanded and input to the trained image generator to obtain a simulated image corresponding to the image to be expanded. Specifically, a first modality image or a second modality image of any object is input to the pre-trained encoder to obtain the encoding of the first modality image or the encoding of the second modality image. A Gaussian noise code of that encoding is then obtained and input to the trained image generator, yielding a plurality of first modality or second modality simulated images of the object. Here the Gaussian noise code is a Gaussian code with mean 0 and variance 1.
In this embodiment, inputting to the image generator a Gaussian noise code similar to the encoding of the first modality image or of the second modality image yields first modality or second modality images belonging to the same object, providing a data enhancement method for images of a specific modality.
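A sketch of this data-enhancement step, reusing `enc` and `gen` from the earlier sketches. The text leaves open whether the noise code is drawn fresh from N(0, 1) or formed by perturbing a real encoding, so both variants are shown as assumptions.

```python
import torch

with torch.no_grad():
    # Variant 1: draw codes directly from N(0, 1); the distribution constraint
    # on the encoder makes such codes resemble real encodings.
    simulated = gen(torch.randn(10, 128))     # 10 simulated modality images

    # Variant 2: perturb the encoding of one scarce-modality image
    # (nir_batch is an assumed batch of flattened NIR images).
    code = enc(nir_batch[:1])
    simulated_like = gen(code + torch.randn_like(code))
```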
With further reference to fig. 3, as an implementation of the method shown in the above figures, the present application provides an embodiment of a training apparatus of an image generator. The apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus is particularly applicable to various electronic devices.
As shown in fig. 3, the training apparatus 300 of the image generator of the present embodiment may include: an encoding module 301, an image generation module 302, an image feature value generation module 303, and an image generator parameter adjustment module 304. The encoding module 301 is configured to input a first modality image and a second modality image of the same object to a pre-trained encoder, so as to obtain an encoding of the first modality image and an encoding of the second modality image; an image generation module 302 configured to input the encoding of the first modality image and the encoding of the second modality image to an initial model of an image generator, respectively, resulting in a first image and a second image; an image feature value generation module 303, configured to input the first image and the second image to a pre-trained target network model respectively, so as to obtain a feature value of the first image and a feature value of the second image; and the image generator parameter adjusting module 304 is configured to determine a loss function of an initial model of the image generator corresponding to the target network model based on the feature values of the first image, the feature values of the second image and the type of the target network model, and adjust parameters of the initial model of the image generator according to the loss function of the initial model of the image generator, so as to obtain a trained image generator.
In the present embodiment, in the training apparatus 300 of the image generator: for the specific processing of the encoding module 301, the image generation module 302, the image feature value generation module 303, and the image generator parameter adjustment module 304 and the technical effects thereof, reference may be made to the descriptions of steps 201 to 204 in the embodiment corresponding to fig. 2, which are not repeated here.
In some optional implementations of this embodiment, the encoding module further includes a training module, and the training module includes: an initial encoding module configured to input a first modality image sample and a second modality image sample of the same object to an initial model of an encoder, resulting in an encoding of the first modality image sample and an encoding of the second modality image sample; and the encoder parameter adjusting module is configured to determine a loss function of an initial model of an encoder based on the cosine distance between the encoding of the first modality image sample and the encoding of the second modality image sample, and adjust parameters of the initial model of the encoder according to the loss function of the initial model of the encoder to obtain the trained encoder.
In some optional implementations of this embodiment, the target network model includes at least one of a face recognition network model, an image modality discrimination model, and a deep convolution network model.
In some optional implementations of this embodiment, when the target network model is a face recognition network model, the loss function of the initial model of the image generator is determined by a similarity between the feature values of the first image and the feature values of the second image.
In some optional implementations of this embodiment, when the target network model is an image modality discrimination model, the loss function of the initial model of the image generator is a loss function based on an image modality feature discrimination countermeasure.
In some optional implementations of this embodiment, when the target network model is a deep convolutional network model, the loss function of the initial model of the image generator is smooth-L1 loss.
In some optional implementations of the embodiment, the loss function of the initial model of the encoder is determined based on a mean of an encoding of an image to be expanded and a variance of an encoding of the image to be expanded, wherein the image to be expanded is the first modality image and/or the second modality image.
In some optional implementations of this embodiment, the apparatus further includes: and the simulated image generation module is configured to acquire the Gaussian noise code of the image to be expanded, input the Gaussian noise code to the trained image generator, and obtain a simulated image corresponding to the image to be expanded.
In some optional implementations of this embodiment, the first modality image is an RGB image, and the second modality image includes at least one of an NIR image and a Depth image.
Fig. 4 is a block diagram of an electronic device for implementing the training method of an image generator according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in fig. 4, the electronic apparatus includes: one or more processors 401, a memory 402, and interfaces for connecting the various components, including high-speed and low-speed interfaces. The components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, as desired. Likewise, multiple electronic devices may be connected, each providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor 401 is taken as an example.
Memory 402 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the training method of the image generator provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the training method of the image generator provided by the present application.
The memory 402, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the training method of the image generator in the embodiment of the present application (for example, the encoding module 301, the image generation module 302, the image feature value generation module 303, and the image generator parameter adjustment module 304 shown in fig. 3). The processor 401 executes various functional applications of the server and data processing, i.e., implements the training method of the image generator in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 402.
The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device of the training method of the image generator, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 402 may optionally include a memory remotely located from the processor 401, and these remote memories may be connected to the electronic device of the training method of the image generator through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the training method of the image generator may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the training method of the image generator, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 404 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the application, a first modality image and a second modality image of the same object are input to a pre-trained encoder to obtain the encoding of the first modality image and the encoding of the second modality image; the two encodings are then input to the initial model of an image generator, respectively, to obtain a first image and a second image; the first image and the second image are then input to a pre-trained target network model, respectively, to obtain a feature value of the first image and a feature value of the second image; finally, a loss function of the initial model of the image generator corresponding to the target network model is determined based on the feature value of the first image, the feature value of the second image, and the type of the target network model, and the parameters of the initial model are adjusted according to that loss function to obtain the trained image generator. This provides an image generator capable of generating images of a specific modality (the first modality and/or the second modality), alleviates the model overfitting caused by too few training samples for a recognition model of that specific modality, and improves the recognition accuracy of that model.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (20)

1. A method of training an image generator, comprising:
inputting a first modal image and a second modal image of the same object to a pre-trained encoder to obtain a code of the first modal image and a code of the second modal image;
respectively inputting the code of the first modality image and the code of the second modality image into an initial model of an image generator to obtain a first image and a second image;
inputting the first image and the second image into a pre-trained target network model respectively to obtain a characteristic value of the first image and a characteristic value of the second image;
determining a loss function of an initial model of the image generator corresponding to the target network model based on the characteristic value of the first image, the characteristic value of the second image and the type of the target network model, and adjusting parameters of the initial model of the image generator according to the loss function of the initial model of the image generator to obtain the trained image generator.
2. The method of claim 1, the training process of the encoder comprising:
inputting a first modality image sample and a second modality image sample of the same object into an initial model of an encoder to obtain a code of the first modality image sample and a code of the second modality image sample;
determining a loss function of an initial model of the encoder based on the cosine distance between the code of the first modality image sample and the code of the second modality image sample, and adjusting parameters of the initial model of the encoder according to the loss function of the initial model of the encoder to obtain the trained encoder.
3. The method of claim 1, the target network model comprising at least one of a face recognition network model, an image modality discrimination model, a deep convolution network model.
4. The method of claim 1, wherein when the target network model is a face recognition network model, the loss function of the initial model of the image generator is determined by similarity of the feature values of the first image and the feature values of the second image.
5. The method of claim 4, wherein the loss function of the initial model of the image generator is smooth-L1 loss when the target network model is a deep convolutional network model.
6. The method of claim 4, wherein when the target network model is an image modality discrimination model, the loss function of the initial model of the image generator is a loss function of a discrimination countermeasure based on image modality features.
7. The method of claim 2, the loss function of the initial model of the encoder is determined based on a mean of an encoding of an image to be augmented and a variance of an encoding of the image to be augmented, wherein the image to be augmented is the first modality image and/or the second modality image.
8. The method of claim 7, further comprising:
and acquiring the Gaussian noise code of the image to be expanded, and inputting the Gaussian noise code to the trained image generator to obtain a simulated image corresponding to the image to be expanded.
9. The method of any of claims 1-8, the first modality image being an RGB image, the second modality image comprising at least one of an NIR image, a Depth image.
10. An apparatus for training an image generator, the apparatus comprising:
the encoding module is configured to input a first modality image and a second modality image of the same object to a pre-trained encoder to obtain an encoding of the first modality image and an encoding of the second modality image;
an image generation module configured to input the code of the first modality image and the code of the second modality image into an initial model of an image generator respectively, resulting in a first image and a second image;
the image characteristic value generation module is configured to input the first image and the second image into a pre-trained target network model respectively to obtain a characteristic value of the first image and a characteristic value of the second image;
and the image generator parameter adjusting module is configured to determine a loss function of an initial model of the image generator corresponding to the target network model based on the feature value of the first image, the feature value of the second image and the type of the target network model, and adjust parameters of the initial model of the image generator according to the loss function of the initial model of the image generator to obtain the trained image generator.
11. The apparatus of claim 10, wherein the encoding module further comprises a training module comprising:
an initial encoding module configured to input a first modality image sample and a second modality image sample of the same object to an initial model of an encoder, resulting in an encoding of the first modality image sample and an encoding of the second modality image sample;
and the encoder parameter adjusting module is configured to determine a loss function of an initial model of an encoder based on the cosine distance between the encoding of the first modality image sample and the encoding of the second modality image sample, and adjust parameters of the initial model of the encoder according to the loss function of the initial model of the encoder to obtain the trained encoder.
12. The apparatus of claim 10, wherein the target network model comprises at least one of a face recognition network model, an image modality discrimination model, a deep convolution network model.
13. The apparatus of claim 10, wherein when the target network model is a face recognition network model, the loss function of the initial model of the image generator is determined by a similarity between the feature values of the first image and the feature values of the second image.
14. The apparatus of claim 10, when the target network model is an image modality discrimination model, the loss function of the initial model of the image generator is a loss function of a discrimination countermeasure based on image modality features.
15. The apparatus of claim 10, wherein when the target network model is a deep convolutional network model, a loss function of the initial model of the image generator is smooth-L1 loss.
16. The apparatus of claim 11, the loss function of the initial model of the encoder is determined based on a mean of an encoding of an image to be augmented and a variance of an encoding of the image to be augmented, wherein the image to be augmented is the first modality image and/or the second modality image.
17. The apparatus of claim 16, further comprising:
and the simulated image generation module is configured to acquire the Gaussian noise code of the image to be expanded, input the Gaussian noise code to the trained image generator, and obtain a simulated image corresponding to the image to be expanded.
18. The apparatus of any of claims 10-17, the first modality image being an RGB image, the second modality image comprising at least one of an NIR image, a Depth image.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
CN202011143074.2A 2020-10-23 2020-10-23 Training method, device and equipment of image generator and storage medium Pending CN112149634A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011143074.2A CN112149634A (en) 2020-10-23 2020-10-23 Training method, device and equipment of image generator and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011143074.2A CN112149634A (en) 2020-10-23 2020-10-23 Training method, device and equipment of image generator and storage medium

Publications (1)

Publication Number Publication Date
CN112149634A (en) 2020-12-29

Family

ID=73954679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011143074.2A Pending CN112149634A (en) 2020-10-23 2020-10-23 Training method, device and equipment of image generator and storage medium

Country Status (1)

Country Link
CN (1) CN112149634A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140270447A1 (en) * 2013-03-13 2014-09-18 Emory University Systems, methods and computer readable storage media storing instructions for automatically segmenting images of a region of interest
US20200160113A1 (en) * 2018-11-19 2020-05-21 Google Llc Training image-to-image translation neural networks
WO2020186886A1 (en) * 2019-03-18 2020-09-24 中国科学院深圳先进技术研究院 Method and device for generating face recognition model
CN109902767A (en) * 2019-04-11 2019-06-18 网易(杭州)网络有限公司 Model training method, image processing method and device, equipment and medium
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN111047629A (en) * 2019-11-04 2020-04-21 中国科学院深圳先进技术研究院 Multi-modal image registration method and device, electronic equipment and storage medium
CN111639710A (en) * 2020-05-29 2020-09-08 北京百度网讯科技有限公司 Image recognition model training method, device, equipment and storage medium
CN111429460A (en) * 2020-06-12 2020-07-17 腾讯科技(深圳)有限公司 Image segmentation method, image segmentation model training method, device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU Shangzheng; LIU Bin: "Design of a cross-modal recognition system for image category labels based on generative adversarial networks", Modern Electronics Technique, no. 08 *
LI Tiancheng; HE Jia: "An image inpainting algorithm based on generative adversarial networks", Computer Applications and Software, no. 12 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651454A (en) * 2020-12-31 2021-04-13 国网湖北省电力有限公司技术培训中心 Infrared data acquisition system and spiral data processing method for power equipment
CN112651454B (en) * 2020-12-31 2022-11-29 国网湖北省电力有限公司技术培训中心 Infrared data acquisition system and spiral data processing method for power equipment
CN112990302A (en) * 2021-03-11 2021-06-18 北京邮电大学 Model training method and device based on text generated image and image generation method
CN113012204A (en) * 2021-04-09 2021-06-22 福建自贸试验区厦门片区Manteia数据科技有限公司 Multi-modal image registration method and device, storage medium and processor
CN113012204B (en) * 2021-04-09 2024-01-16 福建自贸试验区厦门片区Manteia数据科技有限公司 Registration method, registration device, storage medium and processor for multi-mode image
CN115578797A (en) * 2022-09-30 2023-01-06 北京百度网讯科技有限公司 Model training method, image recognition device and electronic equipment
CN115578797B (en) * 2022-09-30 2023-08-29 北京百度网讯科技有限公司 Model training method, image recognition device and electronic equipment

Similar Documents

Publication Title
US11587300B2 (en) Method and apparatus for generating three-dimensional virtual image, and storage medium
CN111639710B (en) Image recognition model training method, device, equipment and storage medium
US11854118B2 (en) Method for training generative network, method for generating near-infrared image and device
CN112149634A (en) Training method, device and equipment of image generator and storage medium
CN111931591B (en) Method, device, electronic equipment and readable storage medium for constructing key point learning model
US20170351935A1 (en) Method and System for Generating Multimodal Digital Images
CN113963087A (en) Image processing method, image processing model training device and storage medium
CN107679466B (en) Information output method and device
CN111626119A (en) Target recognition model training method, device, equipment and storage medium
CN111709470B (en) Image generation method, device, equipment and medium
CN112149635A (en) Cross-modal face recognition model training method, device, equipment and storage medium
CN111861955A (en) Method and device for constructing image editing model
CN111968203B (en) Animation driving method, device, electronic equipment and storage medium
CN111753761B (en) Model generation method, device, electronic equipment and storage medium
CN112241716B (en) Training sample generation method and device
CN112561056A (en) Neural network model training method and device, electronic equipment and storage medium
CN112016523B (en) Cross-modal face recognition method, device, equipment and storage medium
CN111710008B (en) Method and device for generating people stream density, electronic equipment and storage medium
CN111523467B (en) Face tracking method and device
CN112381927A (en) Image generation method, device, equipment and storage medium
CN112529180A (en) Method and apparatus for model distillation
CN112464009A (en) Method and device for generating pairing image, electronic equipment and storage medium
CN112116548A (en) Method and device for synthesizing face image
CN111507944B (en) Determination method and device for skin smoothness and electronic equipment
CN112085103B (en) Data enhancement method, device, equipment and storage medium based on historical behaviors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination