WO2022085970A1

WO2022085970A1 - Method for generating image on basis of user data text, electronic device therefor, and method for generating image on basis of text

Info

Publication number: WO2022085970A1
Application number: PCT/KR2021/013271
Authority: WO
Inventors: 박철민
Original assignee: 주식회사 에이아이파크
Priority date: 2020-10-23
Filing date: 2021-09-28
Publication date: 2022-04-28
Also published as: KR20220053863A

Abstract

A server according to an embodiment of the present invention may comprise: a communication unit configured to communicate with a user device; a processor; and a memory, wherein the memory includes a database for generating an image on the basis of text, and the processor is configured to generate an image generation model on the basis of the database, receive first text from the user device through the communication unit, generate a first image corresponding to the first text on the basis of the image generation model, and transmit the first image to the user device through the communication unit. Various other embodiments are possible.

Description

A method for generating an image based on user data text, an electronic device therefor, and a method for generating an image based on text

The present invention relates to a method of generating an image based on user data text, an electronic device therefor, and a method of generating an image based on text.

As information transmission using contactless communication and multimedia increases explosively, information that was previously transmitted as text is increasingly transmitted through voice or video. For example, text-to-speech (TTS) technology, which outputs a voice corresponding to input text, has been widely used.

Existing text-to-speech (TTS) technology can output a voice corresponding to text, but because it provides no visual information by which the information sender can be recognized, it is difficult to deliver information vividly with voice alone, and it is difficult to hold the attention of the person receiving the information.

An object of the present invention to solve the above problems is to provide a method and an electronic device for generating an image based on user data text, and a method for generating an image based on text.

A server according to an embodiment of the present invention for achieving the above object includes a communication unit configured to communicate with a user device, a processor, and a memory, wherein the memory includes a database for generating an image based on text, and the processor may be configured to generate an image generation model based on the database, receive a first text from the user device through the communication unit, generate a first image corresponding to the first text based on the image generation model, and transmit the first image to the user device through the communication unit.

Here, the database may include a voice database including a plurality of pairs of text and voice corresponding to the text, and a video image database including a plurality of pairs of voice and video images corresponding to the voice. To generate the image generation model, the processor may be configured to generate, based on the voice database, a voice generation model that generates a voice based on text, and to generate, based on the video image database, a video image generation model that generates a video image based on the generated voice. To generate the first image corresponding to the first text, the processor may be configured to generate a first voice corresponding to the first text based on the voice generation model and the first text, generate a first video image corresponding to the first voice based on the video image generation model and the first voice, and generate the first image by synthesizing the first voice and the first video image.

Here, the database may include a plurality of person-specific databases corresponding to a plurality of persons, and the processor may be configured to generate a plurality of person-specific image generation models based on the plurality of person-specific databases.

Here, the processor may be configured to receive selection information about a first person among the plurality of persons from the user device through the communication unit, and to generate the first image based on a first person-specific image generation model corresponding to the first person among the plurality of person-specific image generation models.

According to an embodiment of the present invention for achieving the above object, a method performed in a server may include an operation of storing a database for generating an image based on text, an operation of generating an image generation model based on the database, an operation of receiving a first text from a user device, an operation of generating a first image corresponding to the first text based on the image generation model, and an operation of transmitting the first image to the user device.

Here, the database may include a voice database including a plurality of pairs of text and voice corresponding to the text, and a video image database including a plurality of pairs of voice and video images corresponding to the voice. The operation of generating the image generation model may include an operation of generating, based on the voice database, a voice generation model that generates a voice based on text, and an operation of generating, based on the video image database, a video image generation model that generates a video image based on a voice. The operation of generating the first image corresponding to the first text may include an operation of generating a first voice corresponding to the first text based on the voice generation model and the first text, an operation of generating a first video image corresponding to the first voice based on the video image generation model and the first voice, and an operation of generating the first image by synthesizing the first voice and the first video image.

Here, the database may include a plurality of person-specific databases corresponding to a plurality of persons, and the operation of generating an image generation model based on the database may include an operation of generating a plurality of person-specific image generation models based on the plurality of person-specific databases.

Here, the method may further include an operation of receiving selection information about a first person among the plurality of persons from the user device, and the operation of generating the first image corresponding to the first text may include an operation of generating the first image based on a first person-specific image generation model corresponding to the first person among the plurality of person-specific image generation models.

According to an embodiment of the present invention for achieving the above object, a non-transitory storage medium stores instructions, and the instructions, when executed by an electronic device, may cause the electronic device to receive an input of a first text, transmit the first text to a server including a database for generating an image based on text and an image generation model based on the database, receive a first image corresponding to the first text from the server, and output the first image.

Here, the instructions, when executed by the electronic device, may cause the electronic device to display a plurality of persons, receive a selection of a first person among the plurality of persons, and transmit the selection of the first person to the server, and the first image may be generated in the server based on a person-specific image generation model corresponding to the first person.

According to the present invention, it is possible to provide a method and an electronic device for generating an image based on user data text, and a method for generating an image based on text. Accordingly, by providing visual information by which the information sender can be recognized, information can be delivered more effectively by drawing the attention of the recipient of the information.

FIG. 1 is a block diagram of a user device and a server according to an embodiment of the present invention.

FIG. 2 is a flowchart illustrating operations performed by a server according to an embodiment of the present invention.

FIG. 3 is a flowchart illustrating operations performed by a user device according to an embodiment of the present invention.

Since the present invention can have various changes and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents and substitutes included in the spirit and scope of the present invention. In describing each figure, like reference numerals have been used for like elements.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by the terms. The above terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly, a second component may also be referred to as a first component. The term “and/or” includes a combination of a plurality of related listed items or any of a plurality of related listed items.

When an element is referred to as being “connected” or “coupled” to another element, it should be understood that it may be directly connected or coupled to the other element, but that other elements may exist in between. On the other hand, when an element is referred to as being “directly connected” or “directly coupled” to another element, it should be understood that no other element exists in between.

The terms used in the present application are only used to describe specific embodiments, and are not intended to limit the present invention. The singular expression includes the plural expression unless the context clearly dictates otherwise. In the present application, terms such as “comprise” or “have” are intended to designate that a feature, number, step, operation, component, part, or combination thereof described in the specification exists, and it should be understood that this does not preclude the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and should not be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the accompanying drawings. In describing the present invention, in order to facilitate the overall understanding, the same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

Hereinafter, in this specification, "user data" refers to data transmitted to a server through a user device, and such user data may include text, speech or voice, an image or video, a gesture, and the like provided by the user. However, for better understanding of the present invention and convenience of explanation, text is described as an example of the user data, but the present invention is not limited thereto.

FIG. 1 is a block diagram of a user device and a server according to an embodiment of the present invention. The user device 101 may include a communication unit 110, a processor 120, a memory 130, an input interface 140, and an output interface 150. The communication unit 110 may communicate with electronic devices other than the user device 101, including the server 106. The type of communication method or communication protocol that the communication unit 110 uses with other electronic devices is not limited.

According to various embodiments, the communication unit 110 of the user device 101 may transmit the first text 191 input by the user to the server 106, and may receive a first image 192 corresponding to the first text 191 from the server 106.

The processor 120 may control other components of the user device 101, such as the communication unit 110, the memory 130, the input interface 140, and the output interface 150, or may receive data from those components. Hereinafter, in this specification, a statement that the processor 120 performs an operation through another component of the user device 101, such as the communication unit 110, the memory 130, the input interface 140, or the output interface 150, may mean that the processor 120 controls that component to perform the operation. Also, the processor 120 may perform an operation on data received from other components of the user device 101.

According to various embodiments, the processor 120 of the user device 101 may transmit the first text 191 input by the user to the server 106 through the communication unit 110, and may receive a first image 192 corresponding to the first text 191 from the server 106.

The memory 130 may store a result of an operation performed by the processor 120. According to various embodiments, the memory 130 may store computer-executable instructions to perform operations performed by the user device 101 according to an embodiment of the present invention.

The input interface 140 may receive an input from a user of the user device 101. For example, the input interface 140 may include at least one of a touch pad, a digitizer, a stylus pen, a microphone, a camera, a mouse, and a keyboard.

According to various embodiments, the processor 120 of the user device 101 may identify input of user data such as the first text 191 from the user through the input interface 140.

The output interface 150 may provide an output to a user of the user device 101. For example, the output interface 150 may include at least one of a display device, such as a TV, digital signage, a monitor, or a touch screen display, and an audio output interface, such as a speaker.

According to various embodiments, the processor 120 of the user device 101 may output the first image 192 through the output interface 150.

The server 106 may include a memory 160, a processor 170, and a communication unit 180. The memory 160 may include a database 161, a voice generation model 162, and an image generation model 163.

According to various embodiments, the database 161 may include a voice database including a plurality of pairs of text and voice corresponding to the text, and a video image database including a plurality of pairs of voice and video images corresponding to the voice. In this case, the voice generation model 162 may include a voice generation model, generated based on the voice database, that generates a voice based on text, and the image generation model 163 may include a video image generation model, generated based on the video image database, that generates a video image based on the voice generated by the voice generation model.
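The two paired collections described above can be pictured with a small data structure. The following is only an illustrative sketch: the class name `Database`, the `add_sample` helper, and the use of `bytes` for voice and video image data are assumptions, since the specification does not define any encoding or storage layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical placeholder types: the specification does not define
# how voice or video image data is encoded.
Voice = bytes
VideoImage = bytes


@dataclass
class Database:
    """Illustrative stand-in for database 161."""
    # Voice database: pairs of (text, voice corresponding to the text).
    voice_pairs: List[Tuple[str, Voice]] = field(default_factory=list)
    # Video image database: pairs of (voice, video image for that voice).
    video_image_pairs: List[Tuple[Voice, VideoImage]] = field(default_factory=list)

    def add_sample(self, text: str, voice: Voice, video_image: VideoImage) -> None:
        """Store one recording as a text-voice pair and a voice-video pair."""
        self.voice_pairs.append((text, voice))
        self.video_image_pairs.append((voice, video_image))


db = Database()
db.add_sample("hello", b"voice-bytes", b"frame-bytes")
```

Keeping the two collections separate mirrors the two-stage arrangement in the text, where one model is generated from the text-voice pairs and another from the voice-video pairs.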

According to various embodiments, the database 161 may include an image database including a plurality of pairs of text and an image corresponding to the text, the image including a voice and a video image. In this case, the image generation model 163 may include an image generation model, generated based on the image database, that generates an image including a voice based on text and a video image based on the voice.

The processor 170 may control other components of the server 106, such as the communication unit 180, or may receive data from other components of the server 106. Hereinafter, in this specification, a statement that the processor 170 performs an operation through another component of the server 106, such as the communication unit 180, may mean that the processor 170 controls that component to perform the operation. In addition, the processor 170 may perform an operation on data received from other components of the server 106.

According to various embodiments, the processor 170 may generate the voice generation model 162 and the image generation model 163 based on the database 161. Also, according to various embodiments, the processor 170 may generate a first image corresponding to a first text received from the user device 101 through the communication unit 180, based on the image generation model 163.

The communication unit 180 may communicate with electronic devices other than the server 106, including the user device 101. The type of communication method or communication protocol that the communication unit 180 uses with other electronic devices is not limited.

According to various embodiments, the communication unit 180 may receive the first text 191 from the user device 101, and may transmit a first image 192 corresponding to the first text 191 to the user device 101.

In operation 205, the processor 170 of the server 106 may generate the voice generation model 162 based on the database 161.

In operation 210, the processor 170 of the server 106 may generate the image generation model 163 based on the database 161.

In operation 205 and operation 210, the voice generation model 162 and the image generation model 163 are each generated through deep learning.

According to various embodiments, the database 161 may include a voice database including a plurality of pairs of text and voice corresponding to the text, and a video image database including a plurality of pairs of voice and video images corresponding to the voice. In this case, the processor 170 may generate, based on the voice database, a voice generation model that generates a voice based on text, and may generate, based on the video image database, a video image generation model that generates a video image based on the voice.

According to various embodiments, the database 161 may include an image database including a plurality of pairs of text and an image corresponding to the text, the image including a voice and a video image. In this case, the processor 170 may generate, based on the image database, an image generation model that generates an image including a voice based on text and a video image based on the voice.

According to various embodiments, the voice data may include Mel Frequency Cepstral Coefficient (MFCC) features of the voice. According to various embodiments, the video image data may include a face image of a person to be displayed in the video. According to various embodiments, the video image data may include data regarding the coordinates of feature points of the lips displayed in the image.
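MFCC features are computed on the mel frequency scale. The specification does not say how its MFCC features are obtained, so the following is only a sketch of one widely used Hz-to-mel conversion (the 2595·log10 form); the function names `hz_to_mel` and `mel_to_hz` are hypothetical and chosen for illustration.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """Convert a frequency in Hz to mel using the common 2595*log10 form."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Round trip for a 1000 Hz tone, which maps to roughly 1000 mel.
mel_value = hz_to_mel(1000.0)
recovered_hz = mel_to_hz(mel_value)
```

A full MFCC computation would additionally frame the signal, apply a mel filter bank to the power spectrum, take logarithms, and apply a discrete cosine transform; those steps are omitted here.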

According to various embodiments, the database 161 may include a plurality of person-specific databases corresponding to a plurality of persons. The content of data that can be included in each person-specific database is the same as described with respect to the various embodiments above. In this case, the processor 170 may generate a plurality of person-specific image generation models based on the plurality of person-specific databases. According to various embodiments, the processor 170 may also generate a plurality of person-specific voice generation models based on the plurality of person-specific databases.
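Generating one model per person-specific database can be sketched as a simple mapping from a person identifier to a model. In the sketch below, `train_model` is a hypothetical stand-in that returns a tagged closure rather than a trained network, and the `Sample` and `Model` types and the identifiers `person_a` and `person_b` are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a "model" here is just a callable from text to an
# image placeholder string; real models would be trained networks.
Sample = Tuple[str, str]          # (text, image placeholder)
Model = Callable[[str], str]


def train_model(person: str, samples: List[Sample]) -> Model:
    """Stand-in for training: returns a closure tagged with the person."""
    sample_count = len(samples)

    def model(text: str) -> str:
        return f"image[{person}, trained on {sample_count} samples]: {text}"

    return model


def build_person_models(person_databases: Dict[str, List[Sample]]) -> Dict[str, Model]:
    """Generate one image generation model per person-specific database."""
    return {
        person: train_model(person, samples)
        for person, samples in person_databases.items()
    }


person_databases = {
    "person_a": [("hello", "img1"), ("bye", "img2")],
    "person_b": [("hello", "img3")],
}
models = build_person_models(person_databases)
```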

According to various embodiments, the database 161 may include a plurality of situation-specific databases corresponding to a plurality of situations. For example, the plurality of situations may include at least one of an utterance situation between close acquaintances, an utterance situation in a public setting, a situation uttered at an angle, an utterance situation in an urgent situation, and a situation uttered in an investigation. In addition to the situations exemplified above, various other situations may be set. The content of data that can be included in each situation-specific database is the same as described with respect to the various embodiments above. In this case, the processor 170 may generate a plurality of situation-specific image generation models based on the plurality of situation-specific databases. Meanwhile, the processor 170 may also generate a plurality of situation-specific voice generation models based on the plurality of situation-specific databases.

According to various embodiments, the process of generating the voice generation model 162 and the image generation model 163 based on the database 161 may be performed through deep learning.

In operation 220, the processor 170 may receive the first text 191 from the user device 101 through the communication unit 180.

According to various embodiments, when the database 161 includes a plurality of person-specific databases corresponding to a plurality of persons, in operation 220 the processor 170 may further receive, from the user device 101 through the communication unit 180, information on a first person selected by the user from among the plurality of persons. According to various embodiments, when the database 161 includes a plurality of situation-specific databases corresponding to a plurality of situations, in operation 220 the processor 170 may further receive, from the user device 101 through the communication unit 180, information on a first situation selected by the user from among the plurality of situations.
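The data received in operation 220 can be pictured as a small request payload carrying the first text and the optional selections. The specification does not define a message format, so the `GenerationRequest` class, its field names, and the choice of JSON are all assumptions made for this sketch.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GenerationRequest:
    """Hypothetical request payload received in operation 220."""
    first_text: str
    person: Optional[str] = None      # first person selected by the user, if any
    situation: Optional[str] = None   # first situation selected by the user, if any

    def to_json(self) -> str:
        """Serialize the request for transmission."""
        return json.dumps(asdict(self), ensure_ascii=False)

    @staticmethod
    def from_json(payload: str) -> "GenerationRequest":
        """Rebuild the request on the receiving side."""
        return GenerationRequest(**json.loads(payload))


request = GenerationRequest(first_text="Welcome", person="person_a")
restored = GenerationRequest.from_json(request.to_json())
```

Leaving `person` and `situation` optional reflects that the text describes both selections as additional information received only in some embodiments.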

In operation 230, the processor 170 may generate a first image corresponding to the first text based on the image generation model 163. According to various embodiments, when the database 161 includes a voice database and a video image database, and either the image generation model 163 includes a voice generation model and a video image generation model, or the voice generation model 162 includes a voice generation model and the image generation model 163 includes a video image generation model, the processor 170 may generate a first voice corresponding to the first text based on the voice generation model and the first text, generate a first video image corresponding to the first voice based on the video image generation model and the first voice, and generate the first image by synthesizing the first voice and the first video image.
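The two-stage generation in operation 230 (text to voice, voice to video image, then synthesis) can be sketched as a composition of two callables. The toy models below are stand-ins so that the example runs; they are not the deep-learning models of the specification, and the `Image` class and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder types; actual encodings are not given in the specification.
Voice = List[float]       # e.g. audio samples
VideoImage = List[str]    # e.g. one entry per video frame


@dataclass
class Image:
    """The 'first image': a voice combined with a video image."""
    voice: Voice
    video_image: VideoImage


def generate_first_image(
    first_text: str,
    voice_model: Callable[[str], Voice],
    video_image_model: Callable[[Voice], VideoImage],
) -> Image:
    first_voice = voice_model(first_text)                # text -> voice
    first_video_image = video_image_model(first_voice)   # voice -> video image
    return Image(first_voice, first_video_image)         # synthesize both


# Toy stand-ins so the sketch runs; they are not real generation models.
def toy_voice_model(text: str) -> Voice:
    return [float(ord(ch)) for ch in text]


def toy_video_image_model(voice: Voice) -> VideoImage:
    return [f"frame-{i}" for i, _ in enumerate(voice)]


first_image = generate_first_image("hi", toy_voice_model, toy_video_image_model)
```

Passing the models in as parameters keeps the pipeline itself independent of how each model was generated, which also makes it easy to substitute a person-specific or situation-specific model.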

According to various embodiments, when the database 161 includes an image database and the image generation model 163 includes an image generation model that generates an image including a voice based on text and a video image based on the voice, the processor 170 may generate a first image including a first voice corresponding to the first text and a first video image, based on the image generation model and the first text.

According to various embodiments of the present disclosure, when information on the first person selected by the user from among the plurality of persons is further received from the user device 101, the processor 170 may generate a first image corresponding to the first text based on the first text and the person-specific voice and/or image generation model corresponding to the first person selected by the user from among the plurality of persons.

According to various embodiments of the present disclosure, when information on a first situation selected by the user from among the plurality of situations is further received from the user device 101, the processor 170 may generate a first image corresponding to the first text based on the first text and the situation-specific voice and/or image generation model corresponding to the first situation selected by the user from among the plurality of situations.
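Choosing the model that matches the user's selection amounts to a keyed lookup. The sketch below makes two assumptions that the specification does not state: a person selection takes precedence over a situation selection, and a default model is used when nothing is selected. The function `select_model` and the model names are hypothetical.

```python
from typing import Callable, Dict, Optional

Model = Callable[[str], str]


def select_model(
    default_model: Model,
    person_models: Dict[str, Model],
    situation_models: Dict[str, Model],
    person: Optional[str] = None,
    situation: Optional[str] = None,
) -> Model:
    """Pick the model matching the user's selection.

    Assumption: person selection wins over situation selection, and the
    default model is used when neither is given.
    """
    if person is not None:
        return person_models[person]
    if situation is not None:
        return situation_models[situation]
    return default_model


def make_model(tag: str) -> Model:
    """Toy model factory; the returned callable only labels its input."""
    return lambda text: f"{tag}:{text}"


person_models = {"person_a": make_model("person_a")}
situation_models = {"urgent": make_model("urgent")}
default_model = make_model("default")

by_person = select_model(default_model, person_models, situation_models, person="person_a")
by_situation = select_model(default_model, person_models, situation_models, situation="urgent")
fallback = select_model(default_model, person_models, situation_models)
```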

In operation 240, the processor 170 may transmit the first image 192 to the user device 101 through the communication unit 180.
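Operations 205 through 240 can be read together as a single server-side flow. The following sketch is a rough illustration only: dictionary lookups replace the deep-learning models so that the code stays runnable, strings stand in for voice and video data, and the `Server` class and its method names are hypothetical.

```python
from typing import Dict, List, Tuple


class Server:
    """Illustrative stand-in for server 106."""

    def __init__(
        self,
        voice_pairs: List[Tuple[str, str]],
        video_image_pairs: List[Tuple[str, str]],
    ) -> None:
        self.voice_pairs = voice_pairs
        self.video_image_pairs = video_image_pairs
        self.voice_model: Dict[str, str] = {}
        self.video_image_model: Dict[str, str] = {}

    def generate_models(self) -> None:
        # Operations 205 and 210. A dictionary lookup replaces deep
        # learning here purely so the sketch stays runnable.
        self.voice_model = dict(self.voice_pairs)
        self.video_image_model = dict(self.video_image_pairs)

    def handle_request(self, first_text: str) -> Tuple[str, str]:
        # Operation 220: first_text has been received from the user device.
        # Operation 230: text -> voice -> video image, then combine both.
        first_voice = self.voice_model[first_text]
        first_video_image = self.video_image_model[first_voice]
        first_image = (first_voice, first_video_image)
        # Operation 240: the return value stands in for transmission.
        return first_image


server = Server(
    voice_pairs=[("hello", "voice:hello")],
    video_image_pairs=[("voice:hello", "frames:hello")],
)
server.generate_models()
first_image = server.handle_request("hello")
```

Unlike a trained model, the lookup only handles text that already appears in the stored pairs; it is meant to show the order of operations, not the generation technique.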

In operation 310, the processor 120 of the user device 101 may identify the input of the first text 191 through the input interface 140.

In operation 320, the processor 120 of the user device 101 may transmit the first text 191 to the server 106 through the communication unit 110.

In operation 330, the processor 120 of the user device 101 may receive the first image 192 corresponding to the first text 191 from the server 106 through the communication unit 110.

In operation 340, the processor 120 of the user device 101 may output the first image 192 through the output interface 150.
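The user-device side (operations 310 through 340) can be sketched as three injected callables for input, server communication, and output. The function `run_user_device` and its parameter names are hypothetical, and the lambda used as the server is a placeholder rather than a real network call.

```python
from typing import Callable, List


def run_user_device(
    read_input: Callable[[], str],
    send_to_server: Callable[[str], str],
    output: Callable[[str], None],
) -> None:
    first_text = read_input()                  # operation 310
    first_image = send_to_server(first_text)   # operations 320 and 330
    output(first_image)                        # operation 340


shown: List[str] = []
run_user_device(
    read_input=lambda: "hello",
    send_to_server=lambda text: f"image for '{text}'",
    output=shown.append,
)
```

Injecting the three callables corresponds loosely to the input interface 140, the communication unit 110, and the output interface 150, and lets the flow be exercised without any real hardware or network.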

The operation according to the embodiment of the present invention can be implemented as a computer-readable program or code on a computer-readable recording medium. The computer-readable recording medium includes all types of recording devices in which data that can be read by a computer system is stored. In addition, the computer-readable recording medium may be distributed in a network-connected computer system to store and execute computer-readable programs or codes in a distributed manner.

In addition, the computer-readable recording medium may include a hardware device specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The program instructions may include not only machine language codes such as those generated by a compiler, but also high-level language codes that can be executed by a computer using an interpreter or the like.

Although some aspects of the invention have been described in the context of an apparatus, it may also represent a description according to a corresponding method, wherein a block or apparatus corresponds to a method step or feature of a method step. Similarly, aspects described in the context of a method may also represent a corresponding block or item or a corresponding device feature. Some or all of the method steps may be performed by (or using) a hardware device such as, for example, a microprocessor, programmable computer or electronic circuit. In some embodiments, one or more of the most important method steps may be performed by such an apparatus.

In embodiments, a programmable logic device (e.g., a field-programmable gate array) may be used to perform some or all of the functions of the methods described herein. In embodiments, the field-programmable gate array may operate in conjunction with a microprocessor to perform one of the methods described herein. In general, the methods are preferably performed by some hardware device.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention can be variously modified and changed without departing from the spirit and scope of the present invention as set forth in the claims below.

Claims

1. A server comprising:

a communication unit configured to communicate with a user device;

a processor; and

a memory,

wherein the memory includes a database for generating an image based on text, and

wherein the processor is configured to:

generate an image generation model based on the database;

receive a first text from the user device through the communication unit;

generate a first image corresponding to the first text based on the image generation model; and

transmit the first image to the user device through the communication unit.

2. The server according to claim 1,

wherein the database includes a voice database including a plurality of pairs of text and voice corresponding to the text, and a video image database including a plurality of pairs of voice and video images corresponding to the voice,

wherein the processor is configured to, in order to generate the image generation model:

generate, based on the voice database, a voice generation model for generating a voice based on text; and

generate, based on the video image database, a video image generation model for generating a video image based on the generated voice, and

wherein the processor is configured to, in order to generate the first image corresponding to the first text:

generate a first voice corresponding to the first text based on the voice generation model and the first text;

generate a first video image corresponding to the first voice based on the video image generation model and the first voice; and

generate the first image by synthesizing the first voice and the first video image.

3. The server according to claim 1,

wherein the database includes a plurality of person-specific databases corresponding to a plurality of persons, and

wherein the processor is configured to generate a plurality of person-specific image generation models based on the plurality of person-specific databases.

4. The server according to claim 3,

wherein the processor is configured to:

receive selection information about a first person among the plurality of persons from the user device through the communication unit; and

generate the first image based on a first person-specific image generation model corresponding to the first person among the plurality of person-specific image generation models.

5. A method performed in a server, the method comprising:

an operation of storing a database for generating an image based on text;

an operation of generating an image generation model based on the database;

an operation of receiving a first text from a user device;

an operation of generating a first image corresponding to the first text based on the image generation model; and

an operation of transmitting the first image to the user device.

6. The method of claim 5,

wherein the database includes a voice database including a plurality of pairs of text and voice corresponding to the text, and a video image database including a plurality of pairs of voice and video images corresponding to the voice,

wherein the operation of generating the image generation model includes:

an operation of generating, based on the voice database, a voice generation model for generating a voice based on text; and

an operation of generating, based on the video image database, a video image generation model for generating a video image based on a voice, and

wherein the operation of generating the first image corresponding to the first text includes:

an operation of generating a first voice corresponding to the first text based on the voice generation model and the first text;

an operation of generating a first video image corresponding to the first voice based on the video image generation model and the first voice; and

an operation of generating the first image by synthesizing the first voice and the first video image.

7. The method of claim 5,

wherein the database includes a plurality of person-specific databases corresponding to a plurality of persons, and

wherein the operation of generating the image generation model based on the database includes an operation of generating a plurality of person-specific image generation models based on the plurality of person-specific databases.

8. The method of claim 7,

wherein the method further includes an operation of receiving selection information about a first person among the plurality of persons from the user device, and

wherein the operation of generating the first image corresponding to the first text includes an operation of generating the first image based on a first person-specific image generation model corresponding to the first person among the plurality of person-specific image generation models.

9. A non-transitory storage medium storing instructions, wherein the instructions, when executed by an electronic device, cause the electronic device to:

receive an input of a first text;

transmit the first text to a server including a database for generating an image based on text and an image generation model based on the database;

receive a first image corresponding to the first text from the server; and

output the first image.

10. The non-transitory storage medium of claim 9, wherein the instructions, when executed by the electronic device, cause the electronic device to:

display a plurality of persons;

receive a selection of a first person among the plurality of persons; and

transmit the selection of the first person to the server,

wherein the first image is generated in the server based on a person-specific image generation model corresponding to the first person.