WO2022113398A1 - Image outputting apparatus and image outputting method

Info

Publication number: WO2022113398A1
Application number: PCT/JP2021/016224
Authority: WO
Inventors: 敏行 ▲鶴▼見
Original assignee: 株式会社ネフロック
Priority date: 2020-11-27
Filing date: 2021-04-21
Publication date: 2022-06-02
Also published as: JP2022085452A; JP6856965B1

Abstract

This image output device is provided with: a generation unit which, from a first real image that contains a target image in a first form, generates a first virtual image, which contains the target image in a second form; an identification unit which identifies whether or not the first virtual image is real on the basis of a second real image that contains the target image in the second form; a learning unit which generates a learning model, the training data being the first virtual images identified to be real by the identification unit and the first real images; and an output unit which, from a third real image (or a fourth real image) that contains the target image in the first form (or the second form), uses the learning model to output a third virtual image (or a fourth virtual image) that contains the target image in the second form (or the first form).

Description

Image output device and image output method

This disclosure relates to an image output device and an image output method.

Conventionally, a technique is known in which a learning model is generated by learning the correlation between images of a person wearing glasses and images of the person without glasses, and the generated learning model is used to output an image of the person without glasses (for example, a virtual image including the target image of the second aspect) from an image of the person wearing glasses (for example, a real image including the target image of the first aspect).

Further, in the above-mentioned technique, there is also known a problem that an image of a person wearing glasses and an image of a person without glasses need to be captured at the same angle at the stage of generating a learning model. In order to solve such a problem, a technique of detecting spectacles from an image of a person wearing spectacles and developing a mask covering the detected spectacles has also been proposed (for example, Patent Document 1).

Patent Document 1: Japanese Translation of PCT International Application Publication No. 2019-527410

The first feature is an image output device comprising: a generation unit that generates, from a first real image including a target image of a first aspect, a first virtual image including the target image of a second aspect in which at least a part of the first aspect is modified; an identification unit that identifies whether or not the first virtual image is genuine based on a second real image including the target image of the second aspect; a learning unit that generates a learning model by learning the correlation between the first real image and the first virtual image, using as teacher data the first virtual image identified as genuine by the identification unit and the first real image; and an output unit that uses the learning model to output a third virtual image including the target image of the second aspect from a third real image including the target image of the first aspect, or to output a fourth virtual image including the target image of the first aspect from a fourth real image including the target image of the second aspect. The gist is that the generation unit and the identification unit are trained based on the identification result of the identification unit.

The second feature is that, in the first feature, the teacher data is input to the learning unit after the cleansing process is applied.

The third feature is that, in the first feature or the second feature, the target image of the first aspect is an image of a person wearing an ornament, and the target image of the second aspect is an image of a person not wearing the ornament.

The fourth feature is that, in the first feature or the second feature, the target image of the first aspect is an image of a person not wearing an ornament, and the target image of the second aspect is an image of a person wearing the ornament.

The fifth feature is an image output method comprising: a step in which a generation unit generates, from a first real image including a target image of a first aspect, a first virtual image including the target image of a second aspect in which at least a part of the first aspect is modified; a step in which an identification unit identifies whether or not the first virtual image is genuine based on a second real image including the target image of the second aspect; a step of generating a learning model by learning the correlation between the first real image and the first virtual image, using as teacher data the first virtual image identified as genuine by the identification unit and the first real image; and a step of using the learning model to output a third virtual image including the target image of the second aspect from a third real image including the target image of the first aspect, or to output a fourth virtual image including the target image of the first aspect from a fourth real image including the target image of the second aspect. The gist is that the generation unit and the identification unit are trained based on the identification result of the identification unit.

FIG. 1 is a diagram showing an image output device 100 according to an embodiment. FIG. 2 is a diagram showing an application example according to an embodiment. FIG. 3 is a diagram showing an application example according to the embodiment. FIG. 4 is a diagram showing an application example according to the embodiment. FIG. 5 is a diagram showing an image output method according to an embodiment.

Hereinafter, embodiments will be described with reference to the drawings. In the description of the drawings below, the same or similar parts are designated by the same or similar reference numerals. However, the drawings are schematic.

[Embodiment]
(Image output device)
Hereinafter, the image output device according to the embodiment will be described. FIG. 1 is a diagram showing an image output device 100 according to an embodiment.

As shown in FIG. 1, the image output device 100 includes a teacher data generation unit 110, a learning unit 120, and an output unit 130.

The teacher data generation unit 110 outputs the teacher data to be input to the learning unit 120. The teacher data includes a real image including the target image of the first aspect (hereinafter, the first real image) and a virtual image including the target image of the second aspect, in which at least a part of the first aspect is modified (hereinafter, the first virtual image). In other words, the teacher data includes a pair of a first real image and a first virtual image.
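The pairing described above can be sketched as a small data structure. This is only an illustration and not the representation used by the device: the names `TeacherPair` and `make_teacher_data`, the field names, and the use of nested lists for pixel data are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical image type: grayscale pixels stored as nested lists.
Image = List[List[float]]


@dataclass(frozen=True)
class TeacherPair:
    """One item of teacher data: a real image and its generated counterpart."""
    first_real: Image     # target image in the first aspect (e.g. with glasses)
    first_virtual: Image  # same target in the second aspect (e.g. without glasses)


def make_teacher_data(reals: List[Image], virtuals: List[Image]) -> List[TeacherPair]:
    """Pair each first real image with the first virtual image generated from it."""
    if len(reals) != len(virtuals):
        raise ValueError("each real image needs exactly one virtual counterpart")
    return [TeacherPair(r, v) for r, v in zip(reals, virtuals)]
```

Keeping the two images in one record makes it harder to accidentally pair a virtual image with the wrong real image when the data is later filtered or shuffled.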

In the embodiment, the term "real image" is used to mean an image obtained by capturing an actual image. The term "virtual image" is used to mean an image that is not a "real image". Therefore, the "virtual image" may be read as a "generated image" or a "fake image".

Here, the target image may be an image of a person. The target image of the first aspect may be an image of a person wearing an ornament, and the target image of the second aspect may be an image of a person wearing no ornament. On the contrary, the target image of the first aspect may be an image of a person who does not wear the ornament, and the target image of the second aspect may be an image of a person who wears the ornament. The ornament may include one or more ornaments selected from eyeglasses, accessories, clothing, garments, wigs and the like.

Under such a premise, the teacher data generation unit 110 has a generation unit 111 and an identification unit 112. The teacher data generation unit 110 generates a first virtual image using GAN (Generative Adversarial Network). GAN may include DC (Deep Convolutional) GAN. The generation unit 111 may be read as a Generator in GAN, and the identification unit 112 may be read as a Discriminator in GAN. The generation unit 111 and the identification unit 112 are trained based on the identification result of the identification unit 112. In addition, "training" may be read as "learning".

The generation unit 111 generates the first virtual image from the first real image. Specifically, the generation unit 111 inputs the first real image to the predetermined generation model, and outputs the first virtual image from the predetermined generation model. As the predetermined generative model, a known generative model used by the Generator in GAN can be used. However, it may be different from the known generative model in that the first real image is used as the latent variable input to the predetermined generative model.

The identification unit 112 identifies whether or not the first virtual image generated by the generation unit 111 is genuine based on the second real image including the target image of the second aspect. Specifically, the identification unit 112 inputs the second real image and the first virtual image into the predetermined identification model, and compares the outputs of the predetermined identification model to identify whether or not the first virtual image is genuine. As the predetermined identification model, a known discriminative model used in the Discriminator in GAN can be used. The person relating to the second real image may be different from the person relating to the first virtual image.

Here, the generation unit 111 is trained, with the predetermined identification model fixed, based on whether or not the identification unit 112 identifies the first virtual image as genuine (that is, the identification result). The training of the generation unit 111 is synonymous with the training of the predetermined generation model. The identification result may be regarded as the error that is backpropagated in the GAN.

The identification unit 112 is trained, with the predetermined generation model fixed, based on whether or not its own identification result is correct. In the training of the identification unit 112, the fact that the first virtual image is not genuine is known to the identification unit 112. The training of the identification unit 112 is synonymous with the training of the predetermined identification model. The identification result may be regarded as the error that is backpropagated in the GAN.

The training of the generation unit 111 and the training of the identification unit 112 may be executed alternately. Although not particularly limited, the number of times the generation unit 111 and the identification unit 112 are trained (the number of epochs) may be such a number of times that overfitting of the predetermined generation model and the predetermined identification model does not occur. The number of times the generation unit 111 and the identification unit 112 are trained may be specified by the operator confirming the accuracy of the first virtual image.
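The alternating training can be illustrated with a deliberately tiny sketch. Several things here are assumptions made for illustration and are not taken from the disclosure: each "image" is reduced to a single number, the generation model is a learnable offset `b`, the identification model is a logistic unit with parameters `w` and `c`, and the losses are the standard binary cross-entropy GAN losses with hand-written gradients. What the sketch does preserve is that the generator takes the first real image as its input and that each unit is updated while the other is held fixed.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def generate(x, b):
    """Toy generation model: its input is the first real value x, not random noise."""
    return x + b


def discriminate(y, w, c):
    """Toy identification model: estimated probability that y is genuine."""
    return sigmoid(w * y + c)


def train_alternately(first_real, second_real, epochs=200, lr=0.05, seed=0):
    """Alternate identification-unit and generation-unit updates; returns (b, w, c)."""
    rng = random.Random(seed)
    b, w, c = 0.0, 0.1, 0.0
    for _ in range(epochs):
        x = rng.choice(first_real)    # first real image (first aspect)
        r = rng.choice(second_real)   # second real image (second aspect)

        # Identification-unit step: the generation model (b) is held fixed.
        fake = generate(x, b)
        d_real = discriminate(r, w, c)
        d_fake = discriminate(fake, w, c)
        # Gradient of -log(d_real) - log(1 - d_fake) with respect to w and c.
        grad_w = (d_real - 1.0) * r + d_fake * fake
        grad_c = (d_real - 1.0) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generation-unit step: the identification model (w, c) is held fixed.
        d_fake = discriminate(generate(x, b), w, c)
        # Gradient of -log(d_fake) with respect to b.
        grad_b = (d_fake - 1.0) * w
        b -= lr * grad_b
    return b, w, c
```

In practice the number of epochs would be chosen as described above, for example by an operator inspecting the generated images, rather than fixed at the default used here.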

The teacher data described above includes a first virtual image identified as genuine by the trained identification unit 112 and the first real image corresponding to that first virtual image (that is, the first real image input to the trained generation unit 111). In other words, the first virtual image included in the teacher data is one that the trained identification unit 112 has identified as genuine from among the first virtual images generated by the trained generation unit 111.
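The selection of teacher data can be sketched as a filter over generated images. The function name `select_teacher_data`, the scalar stand-ins for images, and the 0.5 decision threshold are assumptions for this illustration; the disclosure does not specify how the identification result is thresholded.

```python
from typing import Callable, List, Tuple


def select_teacher_data(
    first_reals: List[float],
    generator: Callable[[float], float],
    discriminator: Callable[[float], float],
    threshold: float = 0.5,  # hypothetical cut-off for "identified as genuine"
) -> List[Tuple[float, float]]:
    """Keep (first real, first virtual) pairs whose virtual image is judged genuine."""
    pairs = []
    for real in first_reals:
        virtual = generator(real)                 # trained generation unit
        if discriminator(virtual) >= threshold:   # trained identification unit
            pairs.append((real, virtual))
    return pairs
```

Each retained virtual image stays attached to the real image it was generated from, which is what makes the result usable as paired teacher data.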

The teacher data may be input to the learning unit 120 after the cleansing process is applied. The cleansing process may be a process of excluding inappropriate teacher data by the operator. If the cleansing process can be automated, the teacher data generation unit 110 may automatically execute the cleansing process.
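One way the cleansing process could be organized is as a filter that accepts either an operator's judgment or an automated rule. The disclosure does not define what makes teacher data inappropriate, so the rule `has_invalid_pixels` below (rejecting NaN or out-of-range pixel values) is purely a hypothetical example, as are the other names and the flattened-pixel representation.

```python
import math
from typing import Callable, List, Sequence, Tuple

# (first real image, first virtual image), each as flattened pixel values in [0, 1].
Pair = Tuple[Sequence[float], Sequence[float]]


def has_invalid_pixels(pair: Pair) -> bool:
    """Hypothetical automated rule: the virtual image has NaN or out-of-range pixels."""
    _, virtual = pair
    return any(math.isnan(p) or not 0.0 <= p <= 1.0 for p in virtual)


def cleanse(
    pairs: List[Pair],
    is_inappropriate: Callable[[Pair], bool] = has_invalid_pixels,
) -> List[Pair]:
    """Return the teacher data with inappropriate pairs excluded."""
    return [pair for pair in pairs if not is_inappropriate(pair)]
```

Passing a different `is_inappropriate` callable lets the same function wrap a manual review step, for instance a lookup into a set of pairs an operator has rejected.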

The learning unit 120 generates a learning model based on the teacher data input from the teacher data generation unit 110. That is, the learning unit 120 generates a learning model by learning the correlation between the first real image and the first virtual image, using as teacher data the first virtual image identified as genuine by the identification unit 112 and the first real image. The learning model after learning the correlation between the first real image and the first virtual image may be referred to as a trained model. Although not particularly limited, the learning model may be a model suitable for image processing. For example, the learning model may include a CNN (Convolutional Neural Network).

The output unit 130 outputs a third virtual image including the target image of the second aspect from the third real image including the target image of the first aspect by using the learning model (learned model). Alternatively, the output unit 130 outputs a fourth virtual image including the target image of the first aspect from the fourth real image including the target image of the second aspect by using the learning model (learned model).

Here, the person related to the third virtual image is the same as the person related to the third real image. Similarly, the person relating to the fourth virtual image is the same as the person relating to the fourth real image. On the other hand, the person related to the third real image and the third virtual image may be different from the person related to the first real image and the first virtual image, or may be different from the person related to the second real image. Similarly, the person relating to the fourth real image and the fourth virtual image may be different from the person relating to the first real image and the first virtual image, and may be different from the person relating to the second real image.

(Application example)
Hereinafter, an application example of the above-mentioned image output device 100 will be described. FIGS. 2 to 4 are diagrams showing an application example of the image output device 100 according to the embodiment.

In the application example, for clarity of description, a case where the target image of the first aspect is an image of a person wearing glasses and the target image of the second aspect is an image of a person without glasses will be mainly described.

First, the teacher data generation phase will be described with reference to FIG.

As shown in FIG. 2, the generation unit 111 generates a first virtual image including an image of a person without glasses based on a first real image including an image of a person with glasses. The identification unit 112 identifies whether or not the first virtual image generated by the generation unit 111 is genuine based on the second real image including the image of a person who does not wear glasses.

Here, the generation unit 111 constitutes a Generator in GAN, and the identification unit 112 constitutes a Discriminator in GAN. The generation unit 111 and the identification unit 112 are trained based on the identification result of the identification unit 112 (error back propagation).

In the following, the explanation will be continued assuming that the training of the generation unit 111 and the identification unit 112 has been completed. Under such a premise, the teacher data generation unit 110 outputs the first virtual image identified as genuine by the identification unit 112 and the first real image as teacher data. Specifically, the teacher data includes the first virtual image identified as genuine by the trained identification unit 112 and the first real image corresponding to that first virtual image (the first real image input to the trained generation unit 111).

Here, the first real image and the first virtual image are images related to the same person. That is, the teacher data generation unit 110 outputs a pair of images relating to the same person as teacher data. Of course, the teacher data generation unit 110 may output a pair of images relating to two or more people as teacher data.

Second, the learning phase will be described with reference to FIG.

As shown in FIG. 3, a neural network (learning model) is trained to generate, from a first real image including an image of a person with glasses, a first virtual image including an image of the person without glasses. In other words, the learning unit 120 learns the correlation between the first real image and the first virtual image based on the pair (teacher data) of the first real image and the first virtual image. Although not particularly limited, the learning of the neural network may include a process of adjusting the weighting coefficients between the neurons constituting the neural network.
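The adjustment of weighting coefficients can be illustrated with gradient descent on a single linear neuron. This is a stand-in chosen for brevity, not the network of the embodiment (which may include a CNN): each image is reduced to one number, and the function name `fit_pairs`, the mean-squared-error loss, and the learning rate are assumptions for this sketch.

```python
from typing import List, Tuple


def fit_pairs(
    pairs: List[Tuple[float, float]],  # (first real value, first virtual value)
    epochs: int = 500,
    lr: float = 0.1,
) -> Tuple[float, float]:
    """Fit virtual ~ w * real + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in pairs) / n
        # Adjust the weighting coefficients against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

For example, pairs that follow `virtual = real - 1` drive `w` toward 1 and `b` toward -1, after which the fitted neuron can map an unseen real value to its virtual counterpart.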

In the following, the explanation will be continued assuming that the learning of the neural network is completed. In other words, the learning unit 120 completes the learning of the neural network (learning model).

As described above, the teacher data may be input to the learning unit 120 after the cleansing process is applied. The cleansing process may be a process of excluding inappropriate teacher data by the operator. If the cleansing process can be automated, the teacher data generation unit 110 may automatically execute the cleansing process.

Third, the inference phase will be explained with reference to FIG.

As shown in FIG. 4, the trained neural network (trained model) generates a third virtual image including an image of a person without glasses from a third real image including an image of the person wearing glasses. In other words, the output unit 130 outputs the third virtual image from the third real image by using the trained neural network (trained model).

(Image output method)
Hereinafter, the image output method according to the embodiment will be described. FIG. 5 is a diagram showing an image output method according to an embodiment.

As shown in FIG. 5, in step S11, the image output device 100 generates a first virtual image including the target image of the second aspect from the first real image including the target image of the first aspect.

In step S12, the image output device 100 identifies whether or not the first virtual image generated in step S11 is genuine based on the second real image including the target image of the second aspect.

In step S13, the image output device 100 determines whether or not the training of the generation unit 111 (predetermined generation model) and the identification unit 112 (predetermined identification model) has been completed. The image output device 100 may determine that the training is completed when the number of trainings (number of epochs) reaches a predetermined threshold value. Completion of training may be determined by the operator.

In step S14, the cleansing process is executed. The cleansing process is a process of excluding inappropriate teacher data. The teacher data includes a first virtual image identified as genuine by the trained identification unit 112 and the first real image corresponding to that first virtual image (the first real image input to the trained generation unit 111).

In step S15, the image output device 100 generates a learning model by learning the correlation between the first real image and the first virtual image based on the teacher data (first real image and first virtual image).

In step S16, the image output device 100 outputs a third virtual image including the target image of the second aspect from the third real image including the target image of the first aspect by using the learning model (learned model). Alternatively, the image output device 100 uses the learning model (learned model) to output a fourth virtual image including the target image of the first aspect from the fourth real image including the target image of the second aspect.

FIG. 5 illustrates a case where the cleansing process is executed, but the cleansing process may be omitted.
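The flow of steps S11 to S16 can be summarized as a short control-flow sketch. Every callable passed in is a placeholder supplied by the caller, and the function name `run_method`, the scalar stand-ins for images, and the fixed epoch threshold are assumptions for illustration only. The cleansing step is optional, matching the note above.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[float, float]  # (first real image, first virtual image) as scalars


def run_method(
    first_reals: List[float],
    train_gan_epoch: Callable[[], None],   # one pass of S11 and S12 with training
    generate: Callable[[float], float],    # generation unit
    is_genuine: Callable[[float], bool],   # identification unit
    fit_model: Callable[[List[Pair]], Callable[[float], float]],
    third_real: float,
    max_epochs: int = 10,                  # hypothetical threshold used in S13
    cleanse: Optional[Callable[[List[Pair]], List[Pair]]] = None,
) -> float:
    """Run S11-S16 and return the third virtual image."""
    # S11-S13: repeat generation, identification and training until the
    # number of epochs reaches the threshold.
    for _ in range(max_epochs):
        train_gan_epoch()

    # Collect teacher data: virtual images judged genuine, paired with their inputs.
    pairs = []
    for real in first_reals:
        virtual = generate(real)
        if is_genuine(virtual):
            pairs.append((real, virtual))

    # S14: optional cleansing process.
    if cleanse is not None:
        pairs = cleanse(pairs)

    # S15: generate the learning model from the teacher data.
    model = fit_model(pairs)

    # S16: output the third virtual image from the third real image.
    return model(third_real)
```

Passing the units as callables keeps the sketch independent of any particular generation, identification, or learning model.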

(Action and effect)
In the embodiment, the image output device 100 generates a learning model by learning the correlation between the first real image and the first virtual image using the first real image and the first virtual image as teacher data. The image output device 100 outputs a third virtual image (or a fourth virtual image) from the third real image (or the fourth real image) using the learning model. Under such a premise, the teacher data includes a first virtual image identified as genuine by the trained identification unit 112 and the first real image corresponding to that first virtual image (the first real image input to the trained generation unit 111). In other words, the teacher data is generated using GAN.

According to such a configuration, since the learning model is generated using teacher data generated by GAN, the learning model can be appropriately generated without using a "real" image including the target image of the second aspect in which at least a part of the first aspect is modified. Since such a "real" image including the target image of the second aspect is not used, the target image of the first aspect and the target image of the second aspect do not need to be captured at the same angle. Further, it is possible to suppress the influence of noise (for example, shadows) and the like that may be contained in a "real" image including the target image of the second aspect. As described above, since the learning model can be appropriately generated, the image output device 100 can appropriately output, from a real image (the third real image or the fourth real image) including the target image of the first aspect (or the second aspect), a virtual image (the third virtual image or the fourth virtual image) including the target image of the second aspect (or the first aspect).

In the embodiment, the image output device 100 uses the first real image as a latent variable input to the generation unit 111 (predetermined generation model). Therefore, the generation accuracy of the first virtual image is improved. As a result, the number of trainings (number of epochs) can be suppressed.

[Modification 1]
Hereinafter, modification 1 of the embodiment will be described. In the following, the differences from the embodiments will be mainly described.

In the first modification, variations of the target image of the first aspect and the target image of the second aspect will be described.

As described above, the target image of the first aspect may be an image of a person wearing an ornament, and the target image of the second aspect may be an image of a person wearing no ornament. On the contrary, the target image of the first aspect may be an image of a person who does not wear the ornament, and the target image of the second aspect may be an image of a person who wears the ornament. The ornament may include one or more ornaments selected from eyeglasses, accessories, clothing, garments, wigs and the like.

However, the target image of the first aspect and the target image of the second aspect are not limited to this. The target image of the first aspect and the target image of the second aspect may include the variations shown below.

For example, the target image of the first aspect may be an image of a person before make-up, and the target image of the second aspect may be an image of a person after make-up. On the contrary, the target image of the first aspect may be an image of a person after make-up, and the target image of the second aspect may be an image of a person before make-up.

For example, the target image of the first aspect may be an image of a person before plastic surgery, and the target image of the second aspect may be an image of the person after plastic surgery. On the contrary, the target image of the first aspect may be an image of a person after plastic surgery, and the target image of the second aspect may be an image of the person before plastic surgery. Plastic surgery may include cosmetic surgery.

For example, the target image of the first aspect may be an image of a person having a hairstyle before a change (for example, long hair), and the target image of the second aspect may be an image of the person having a hairstyle after the change (for example, short hair). On the contrary, the target image of the first aspect may be an image of a person having a hairstyle after the change (for example, short hair), and the target image of the second aspect may be an image of the person having a hairstyle before the change (for example, long hair).

For example, the target image of the first aspect may be an image of a person before dieting, and the target image of the second aspect may be an image of the person after dieting. On the contrary, the target image of the first aspect may be an image of a person after dieting, and the target image of the second aspect may be an image of the person before dieting.

For example, the target image of the first aspect may be an image of a current person, and the target image of the second aspect may be an image of a future person. On the contrary, the target image of the first aspect may be an image of a future person, and the target image of the second aspect may be an image of a current person. Alternatively, the target image of the first aspect may be an image of the current person, and the target image of the second aspect may be an image of a past person. On the contrary, the target image of the first aspect may be an image of a past person, and the target image of the second aspect may be an image of a current person. Alternatively, the target image of the first aspect may be an image of a past person, and the target image of the second aspect may be an image of a future person. On the contrary, the target image of the first aspect may be an image of a future person, and the target image of the second aspect may be an image of a past person.

As described above, the target image may be an image of a person. However, the target image is not limited to this. The target image may be an image of an article such as a vehicle. In such a case, the target image of the first aspect may be an image of an article having a scratch, and the target image of the second aspect may be an image of an article after repairing the scratch. On the contrary, the target image of the first aspect may be an image of an article after repairing a scratch, and the target image of the second aspect may be an image of an article having a scratch.

Alternatively, the target image may be a landscape image or the like. In such a case, the target image of the first aspect may be a landscape image of the first season, and the target image of the second aspect may be a landscape image of the second season different from the first season.

Alternatively, the target image may include an image of an actual person and an image of an artificial person. In such a case, the target image of the first aspect may be an image of an actual person, and the target image of the second aspect may be an image of an artificial person. On the contrary, the target image of the first aspect may be an image of an artificial person, and the target image of the second aspect may be an image of an actual person. The image of the artificial person may be an illustration of the person, CG (Computer Graphics), an image of an avatar, or the like.

Alternatively, the target image may be an image of the article to be inspected. In such a case, the target image of the first aspect may be an image of a normal article, and the target image of the second aspect may be an image of an abnormal article. On the contrary, the target image of the first aspect may be an image of an abnormal article, and the target image of the second aspect may be an image of a normal article.

Alternatively, the type of the target image need not be restricted. In such a case, the target image of the first aspect may be an image of a first resolution, and the target image of the second aspect may be an image of a second resolution different from the first resolution.

[Other embodiments]
Although the invention has been described by embodiments described above, the statements and drawings that form part of this disclosure should not be understood to limit the invention. This disclosure will reveal to those skilled in the art various alternative embodiments, examples and operational techniques.

Although not specifically mentioned in the above disclosure, the image output device 100 may have a first real image input interface. The image output device 100 may have a second real image input interface. The input interface may include an image pickup device or may include a communication module that communicates with a network such as the Internet.

Although not specifically mentioned in the above disclosure, the first virtual image does not have to include the background image included in the first real image. The background image is a part other than the target image. Therefore, the third virtual image (or the fourth virtual image) does not have to include the background image.
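Excluding the background can be sketched as masking every pixel that does not belong to the target image. How the target region is found is not described in the disclosure, so the boolean `target_mask` is assumed to be given; the function name `remove_background`, the nested-list image type, and the fill value are likewise assumptions for this example.

```python
from typing import List

Image = List[List[float]]  # hypothetical grayscale image as nested lists
Mask = List[List[bool]]    # True where a pixel belongs to the target image


def remove_background(image: Image, target_mask: Mask, fill: float = 0.0) -> Image:
    """Keep pixels inside the target region and replace all others with fill."""
    return [
        [pixel if keep else fill for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, target_mask)
    ]
```

The function returns a new image and leaves the input untouched, so the original first real image remains available for pairing.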

In the above-mentioned disclosure, a case where each function of the image output device 100 is provided in one device is illustrated. However, the above disclosure is not limited to this. Each function of the image output device 100 may be provided in two or more devices, and a part of the functions of the image output device 100 may be provided by a cloud service. In such a case, the image output device may be read as an image output system.

Although not specifically mentioned in the above-mentioned disclosure, a program may be provided that causes a computer to execute each process performed by the image output device 100. The program may also be recorded on a computer-readable medium. A computer-readable medium can be used to install the program on a computer. Here, the computer-readable medium on which the program is recorded may be a non-transitory recording medium. The non-transitory recording medium is not particularly limited, but may be, for example, a recording medium such as a CD-ROM or a DVD-ROM.

Alternatively, a chip composed of a memory for storing a program for executing each process performed by the image output device 100 and a processor for executing the program stored in the memory may be provided.

Claims

1. An image output device comprising:
a generation unit that generates, from a first real image including a target image of a first aspect, a first virtual image including the target image of a second aspect in which at least a part of the first aspect is modified;
an identification unit that identifies whether or not the first virtual image is genuine based on a second real image including the target image of the second aspect;
a learning unit that generates a learning model by learning the correlation between the first real image and the first virtual image, using as teacher data the first virtual image identified as genuine by the identification unit and the first real image; and
an output unit that uses the learning model to output a third virtual image including the target image of the second aspect from a third real image including the target image of the first aspect, or uses the learning model to output a fourth virtual image including the target image of the first aspect from a fourth real image including the target image of the second aspect,
wherein the generation unit and the identification unit are trained based on the identification result of the identification unit.
2. The image output device according to claim 1, wherein the teacher data is input to the learning unit after a cleansing process is applied.
3. The image output device according to claim 1, wherein the target image of the first aspect is an image of a person wearing an ornament, and the target image of the second aspect is an image of a person not wearing the ornament.
4. The image output device according to claim 1, wherein the target image of the first aspect is an image of a person not wearing an ornament, and the target image of the second aspect is an image of a person wearing the ornament.
5. An image output method comprising:
a step in which a generation unit generates, from a first real image including a target image of a first aspect, a first virtual image including the target image of a second aspect in which at least a part of the first aspect is modified;
a step in which an identification unit identifies whether or not the first virtual image is genuine based on a second real image including the target image of the second aspect;
a step of generating a learning model by learning the correlation between the first real image and the first virtual image, using as teacher data the first virtual image identified as genuine by the identification unit and the first real image; and
a step of using the learning model to output a third virtual image including the target image of the second aspect from a third real image including the target image of the first aspect, or using the learning model to output a fourth virtual image including the target image of the first aspect from a fourth real image including the target image of the second aspect,
wherein the generation unit and the identification unit are trained based on the identification result of the identification unit.