WO2023058812A1

WO2023058812A1 - Method, apparatus, and computer program for switching image

Info

Publication number: WO2023058812A1
Application number: PCT/KR2021/016162
Authority: WO
Inventors: 조면철
Original assignee: 주식회사 마인즈랩
Priority date: 2021-10-06
Filing date: 2021-11-08
Publication date: 2023-04-13
Also published as: KR102514580B1

Abstract

A method for switching an image from a first image to a second image, according to one embodiment of the present invention, may comprise the steps of: calculating a degree of similarity between at least one frame constituting the first image and at least one frame constituting each of a plurality of second images; determining a first connection frame, which is a frame used at a time of switching from the first image to the second image, from among the at least one frame of the first image, by referring to the degree of similarity; and determining a second image to be used for switching from among the plurality of second images, and a second connection frame, which is a frame used at the time of switching, from among at least one frame of the second image to be used for switching, by referring to the degree of similarity.

Description

Image conversion method, device and computer program

The present invention relates to an image conversion method, apparatus, and computer program for providing natural image conversion in providing an image in which voice and lip shape are synchronized.

With the development of information and communication technology, artificial intelligence technology is being introduced into many applications. Conventionally, the only way to generate an image in which a specific person talks about a specific topic was to film that person actually talking about that topic using a camera or the like.

In addition, in some prior art, a synthesized image based on an image or video of a specific person has been created using video synthesis technology, but such an image still has problems in that the transition of the image is not smooth or the shape of the person's mouth is unnatural.

The present invention is intended to solve the above problems and to create a more natural image.

In particular, the present invention is intended to allow a natural transition between an avatar's standby video and an avatar's utterance video.

In addition, the present invention is intended to generate an avatar image with a natural mouth shape without filming a real person.

In addition, the present invention is intended to minimize the use of server resources and network resources used in image generation despite the use of an artificial neural network.

A method for switching an image from a first image to a second image according to an embodiment of the present invention may include: calculating a similarity between at least one frame constituting the first image and at least one frame constituting each of a plurality of second images; determining, by referring to the similarity, a first connection frame, which is a frame used at the time of switching from the first image to the second image, from among the at least one frame of the first image; and determining, by referring to the similarity, a second image to be used for switching from among the plurality of second images and a second connection frame, which is a frame used at the time of switching, from among at least one frame of the second image to be used for switching.
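The selection described above can be sketched as an exhaustive search over frame pairs. This is only a minimal illustration: the text here does not specify the similarity measure, so the metric below (the inverse of one plus the mean absolute pixel difference) is an assumption, and the names `similarity` and `pick_connection_frames` as well as the flattened-list frame representation are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Frame = Sequence[float]  # flattened grayscale pixels; a stand-in for real image data


def similarity(a: Frame, b: Frame) -> float:
    """Assumed metric: 1 / (1 + mean absolute pixel difference); 1.0 means identical."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    mad = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 / (1.0 + mad)


def pick_connection_frames(
    first_image: List[Frame],
    second_images: Dict[str, List[Frame]],
) -> Tuple[int, str, int, float]:
    """Return (first frame index, second image id, second frame index, score)
    for the most similar frame pair across all second images."""
    best = None
    for i, f1 in enumerate(first_image):
        for image_id, frames in second_images.items():
            for j, f2 in enumerate(frames):
                score = similarity(f1, f2)
                if best is None or score > best[3]:
                    best = (i, image_id, j, score)
    if best is None:
        raise ValueError("no frames to compare")
    return best


# Toy data: three standby frames and two candidate utterance clips.
idle = [[0, 0, 0, 0], [10, 10, 10, 10], [20, 20, 20, 20]]
speaking = {
    "talk_a": [[50, 50, 50, 50], [40, 40, 40, 40]],
    "talk_b": [[11, 10, 10, 10], [30, 30, 30, 30]],
}
first_idx, chosen_id, second_idx, score = pick_connection_frames(idle, speaking)
```

In this toy data the second standby frame and the first frame of `talk_b` differ by a single pixel, so that pair is selected as the first and second connection frames.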

The image switching method may further include, prior to the calculating of the similarity, determining, from among the at least one frame constituting the first image, at least one candidate frame whose similarity is to be compared with at least one frame constituting each of the plurality of second images, and the calculating of the similarity may include calculating a similarity between the at least one candidate frame and at least one frame constituting each of the plurality of second images.

The determining of the at least one candidate frame may include determining, as the at least one candidate frame, at least one frame of the first image corresponding to a predetermined time period from a first point in time at which a need to switch the image occurs.

The determining of the at least one candidate frame may include determining a first frame of the first image and a last frame of the first image as the at least one candidate frame.
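The two candidate-frame strategies above could look roughly as follows. This is a sketch under stated assumptions: frames are addressed by index at a constant frame rate, the window is clamped at the end of the first image rather than wrapping around, and the function names and parameters are hypothetical.

```python
import math
from typing import List


def candidates_in_window(num_frames: int, fps: float,
                         request_time_s: float, window_s: float) -> List[int]:
    """Indices of first-image frames whose timestamps fall within
    [request_time_s, request_time_s + window_s], clamped to the clip length."""
    start = max(0, math.ceil(request_time_s * fps))
    end = min(num_frames - 1, math.floor((request_time_s + window_s) * fps))
    return list(range(start, end + 1))


def candidates_first_and_last(num_frames: int) -> List[int]:
    """Only the first and last frame of the first image are candidates."""
    if num_frames <= 0:
        return []
    return sorted({0, num_frames - 1})


window = candidates_in_window(num_frames=30, fps=10.0,
                              request_time_s=1.0, window_s=0.5)
ends = candidates_first_and_last(30)
```

Restricting the comparison to a few candidates keeps the number of similarity calculations small, at the cost of possibly missing a better-matching frame outside the candidate set.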

The image switching method may further include generating a transition image by connecting a third image including at least some frames of the first image and a fourth image including at least some frames of the second image to be used for switching. In this case, the third image may be an image composed of at least some frames of the first image and having the first connection frame as its last frame, and the fourth image may be an image composed of at least some frames of the second image to be used for switching and having the second connection frame as its first frame.
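As an illustration, the transition image can be represented as a list of frame references: the third image ends on the first connection frame and the fourth image starts on the second connection frame. The starting point of the third image (`current_idx`, the frame being played when the switch is requested) is an assumption, as are the function and parameter names.

```python
from typing import List, Tuple

FrameRef = Tuple[str, int]  # (image id, frame index)


def build_transition(first_id: str, current_idx: int, first_conn_idx: int,
                     second_id: str, second_conn_idx: int,
                     second_len: int) -> List[FrameRef]:
    """Concatenate the third image (first image up to and including the first
    connection frame) and the fourth image (second image from the second
    connection frame to its end)."""
    if current_idx > first_conn_idx:
        raise ValueError("first connection frame lies before the current frame")
    if not 0 <= second_conn_idx < second_len:
        raise ValueError("second connection frame is out of range")
    third = [(first_id, i) for i in range(current_idx, first_conn_idx + 1)]
    fourth = [(second_id, j) for j in range(second_conn_idx, second_len)]
    return third + fourth


transition = build_transition("idle", 3, 5, "talk_b", 2, 5)
```

The two connection frames end up adjacent in the result, which is where the similarity between them matters for a smooth cut.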

The first image may be a captured image of the avatar's waiting state, and the second image may be a captured image of the avatar's utterance. The image switching method may further include generating a lip image provided together with the fourth image.

The generating of the lip image may include acquiring a target voice to be used as the voice of the avatar; generating a lip image corresponding to the voice for each frame of the fourth image using the learned first artificial neural network; and generating lip sync data including identification information of the fourth image, the lip image, and location information of the lip image in a frame of the fourth image.

The first artificial neural network may be a neural network trained to output a second lip image, which is an image in which the first lip image is transformed according to the voice, according to input of the voice and the first lip image.
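The lip sync data described above bundles three things per frame: identification information, the lip image, and its location in the frame. A minimal sketch of such a record follows. The field names, the `(top, left)` location format, and the per-frame voice chunking are assumptions, and `placeholder_model` is a hypothetical stand-in that is not a neural network; it only mimics the interface of taking a voice and a first lip image and returning a second lip image.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

LipImage = List[List[int]]        # small grayscale patch (rows of pixels)
VoiceChunk = Sequence[float]      # audio samples aligned with one video frame
LipModel = Callable[[VoiceChunk, LipImage], LipImage]


@dataclass
class LipSyncEntry:
    image_id: str                 # identification information of the fourth image
    frame_index: int              # frame of the fourth image this patch belongs to
    lip_image: LipImage           # second lip image produced by the model
    location: Tuple[int, int]     # (top, left) of the patch inside the frame


def make_lip_sync_data(image_id: str, first_frame_index: int,
                       voice_chunks: List[VoiceChunk],
                       first_lip_images: List[LipImage],
                       locations: List[Tuple[int, int]],
                       model: LipModel) -> List[LipSyncEntry]:
    """Run the model once per frame and package the results."""
    if not (len(voice_chunks) == len(first_lip_images) == len(locations)):
        raise ValueError("need one voice chunk, lip image, and location per frame")
    return [
        LipSyncEntry(image_id, first_frame_index + k, model(voice, lip), loc)
        for k, (voice, lip, loc) in enumerate(
            zip(voice_chunks, first_lip_images, locations))
    ]


def placeholder_model(voice: VoiceChunk, lip: LipImage) -> LipImage:
    """NOT a neural network: brightens the patch by the peak amplitude."""
    gain = max((abs(s) for s in voice), default=0.0)
    return [[min(255, int(p * (1.0 + gain))) for p in row] for row in lip]


entries = make_lip_sync_data(
    "talk_b", 2,
    voice_chunks=[[0.0, 0.0], [1.0, -0.5]],
    first_lip_images=[[[10, 20]], [[10, 20]]],
    locations=[(40, 30), (41, 30)],
    model=placeholder_model,
)
```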

The generating of the transition image may include generating configuration information of a transition image including identification information of frames included in each of the third and fourth images and transmitting the generated transition image configuration information to an external device.
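One way to picture the configuration information is as a small serialized document that lists frame identifiers only, without pixel data. The JSON layout and key names below are assumptions chosen for illustration, and the actual transmission to the external device is omitted.

```python
import json
from typing import List


def transition_config(third_id: str, third_frames: List[int],
                      fourth_id: str, fourth_frames: List[int]) -> str:
    """Serialize the identification information of the frames in the third
    and fourth images. Only identifiers are included, no image data."""
    config = {
        "segments": [
            {"image_id": third_id, "frame_indices": list(third_frames)},
            {"image_id": fourth_id, "frame_indices": list(fourth_frames)},
        ]
    }
    return json.dumps(config, sort_keys=True)


payload = transition_config("idle", [3, 4, 5], "talk_b", [2, 3, 4])
decoded = json.loads(payload)
```

Because the receiving device is described elsewhere in this document as already storing the second image, sending identifiers in place of frames is consistent with the stated aim of reducing network resource usage.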

An image switching device for switching an image from a first image to a second image according to an embodiment of the present invention may calculate a similarity between at least one frame constituting the first image and at least one frame constituting each of a plurality of second images; determine, by referring to the similarity, a first connection frame, which is a frame used at the time of switching from the first image to the second image, from among the at least one frame of the first image; and determine, by referring to the similarity, a second image to be used for switching from among the plurality of second images and a second connection frame, which is a frame used at the time of switching, from among at least one frame of the second image to be used for switching.

The image switching device may determine, from among the at least one frame constituting the first image, at least one candidate frame whose similarity is to be compared with at least one frame constituting each of the plurality of second images, and may calculate a similarity between the at least one candidate frame and at least one frame constituting each of the plurality of second images.

The image switching device may determine, as the at least one candidate frame, at least one frame of the first image corresponding to a predetermined time period from a first point in time when the need for image switching occurs.

The image conversion device may determine a first frame of the first image and a last frame of the first image as the at least one candidate frame.

The image switching device may generate a transition image by connecting a third image including at least some frames of the first image and a fourth image including at least some frames of the second image to be used for switching, wherein the third image may be an image composed of at least some frames of the first image and having the first connection frame as its last frame, and the fourth image may be an image composed of at least some frames of the second image to be used for switching and having the second connection frame as its first frame.

The first image may be a captured image of the avatar's waiting state, the second image may be a captured image of the avatar's utterance, and the image switching device may generate, from an input voice, a lip image provided together with the fourth image.

The image switching device may acquire a target voice to be used as the voice of the avatar, generate a lip image corresponding to the voice for each frame of the fourth image using the trained first artificial neural network, and generate lip sync data including identification information of the fourth image, the lip image, and location information of the lip image in the frame of the fourth image.

The image conversion device may generate configuration information of a conversion image including identification information of frames included in each of the third image and the fourth image and transmit it to an external device.

According to the present invention, a more natural person image can be created.

In particular, according to the present invention, switching between an avatar's standby video and an avatar's utterance video can be made naturally.

In addition, according to the present invention, an avatar image having a natural mouth shape can be generated without taking a picture of a real person.

In addition, according to the present invention, despite the use of an artificial neural network, it is possible to minimize the use of server resources and network resources used in image generation.

FIG. 1 is a diagram schematically showing the configuration of an image generating system according to an embodiment of the present invention.

FIG. 2 is a diagram schematically showing the configuration of the server 100 according to an embodiment of the present invention.

FIG. 3 is a diagram schematically illustrating the configuration of a service server 300 according to an embodiment of the present invention.

FIGS. 4 and 5 are diagrams for explaining an exemplary structure of an artificial neural network learned by the server 100 according to an embodiment of the present invention.

FIG. 6 is a diagram illustrating a plurality of first images IG1 and a plurality of second images IG2 stored in the memory 130 of the server 100 according to an embodiment of the present invention.

FIGS. 7 and 8 are diagrams for explaining a process in which the server 100 determines at least one candidate frame 411, 412, or 413 according to an embodiment of the present invention.

FIG. 9 is a diagram illustrating a similarity calculation result 431 when the similarity between the frame 414 constituting the first image 410 and the frame 421 of the second image is high.

FIG. 10 is a diagram illustrating a similarity calculation result 432 when the similarity between the frame 415 constituting the first image 410 and the frame 422 of the second image is low.

FIG. 11 is a diagram showing the configuration of a transition image 440 generated by the server 100 according to an embodiment of the present invention.

FIG. 12 is a diagram for explaining a method in which the server 100 learns the first artificial neural network 520 using a plurality of learning data 510 according to an embodiment of the present invention.

FIG. 13 is a diagram for explaining input data and output data of the first artificial neural network 520.

FIG. 14 is a diagram for explaining a method of generating an output image 740 composed of output frames by the user terminal 200 according to an embodiment of the present invention.

FIGS. 15, 16 and 17 are flowcharts for explaining an image conversion method performed by the server 100 according to an embodiment of the present invention.

Since the present invention can apply various transformations and have various embodiments, specific embodiments will be illustrated in the drawings and described in detail in the detailed description. Effects and features of the present invention, and methods for achieving them will become clear with reference to the embodiments described later in detail together with the drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various forms.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. When describing with reference to the drawings, the same or corresponding components are assigned the same reference numerals, and overlapping descriptions thereof will be omitted.

In the following embodiments, terms such as first and second are used for the purpose of distinguishing one component from another component without limiting meaning. In the following examples, expressions in the singular number include plural expressions unless the context clearly dictates otherwise. In the following embodiments, terms such as include or have mean that features or components described in the specification exist, and do not preclude the possibility that one or more other features or components may be added. In the drawings, the size of components may be exaggerated or reduced for convenience of description. For example, since the size and shape of each component shown in the drawings are arbitrarily shown for convenience of description, the present invention is not necessarily limited to those shown.

An image generating system according to an embodiment of the present invention may overlap and display, on an image receiving device (e.g., a user terminal), a lip image (generated by a server) and a second image including a face (stored in a memory of the image receiving device, and sometimes referred to in this specification as the 'second image to be used for switching').

At this time, the server of the image generating system may generate sequential lip images from the voice to be used as the voice of the target object, and the image receiving device may overlap and display the sequential lip images and the second image, thereby displaying an image in which the mouth shape and the voice match.

As described above, in generating a lip sync image, the present invention allows some operations to be performed by an image receiving device so that server resources can be used more efficiently and related resources can also be used more efficiently.
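The overlap step performed on the image receiving device can be sketched as pasting a small lip patch into a copy of a stored frame. This is an illustrative sketch only: the nested-list image format, the `(top, left)` placement, and the function name `overlay` are assumptions, and a real client would likely blend edges rather than overwrite pixels.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels


def overlay(frame: Image, lip: Image, top: int, left: int) -> Image:
    """Return a copy of `frame` with `lip` pasted at (top, left).
    The stored frame itself is left unchanged so it can be reused."""
    lip_w = len(lip[0]) if lip else 0
    frame_w = len(frame[0]) if frame else 0
    if top < 0 or left < 0 or top + len(lip) > len(frame) or left + lip_w > frame_w:
        raise ValueError("lip image does not fit inside the frame")
    out = [row[:] for row in frame]
    for r, lip_row in enumerate(lip):
        for c, pixel in enumerate(lip_row):
            out[top + r][left + c] = pixel
    return out


stored_frame = [[0] * 4 for _ in range(3)]
output_frame = overlay(stored_frame, [[9, 9]], top=1, left=1)
```

Since only the small patch changes per frame, the server needs to produce and send far less data than a full rendered frame would require.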

Meanwhile, in generating an image that switches from a first image (e.g., an image in which a waiting state of a target object is photographed) to a second image (e.g., an image in which a speaking state of the target object is photographed), the image generating system according to an embodiment of the present invention may generate a more natural image by selecting a second image having a high degree of similarity through a similarity comparison between frames constituting the two images.

In the present invention, 'artificial neural networks' such as the first artificial neural network and the second artificial neural network are neural networks trained with learning data according to their use, and may mean artificial neural networks trained by machine learning or deep learning techniques. The structure of such a neural network will be described later with reference to FIGS. 4 and 5.

In the present invention, the first image may be, for example, one of an image in which a waiting state of a target object (hereinafter referred to as 'avatar') is photographed and an image in which an utterance of the avatar is photographed, and the second image may correspond to the other. However, this is an example and the spirit of the present invention is not limited thereto, and any image including an image of an operating target object (or avatar) may correspond to the first image and the second image of the present invention. A detailed description of the first image and the second image will be given later with reference to FIG. 6.

As shown in FIG. 1 , an image generating system according to an embodiment of the present invention may include a server 100 , a user terminal 200 , a service server 300 and a communication network 400 .

The server 100 according to an embodiment of the present invention may perform operations necessary for generating and/or transmitting a transition image, which is an image that is converted from a first image to a second image. For example, the server 100 may provide identification information of one or more images included in the transition image and identification information of each frame of the one or more images to the user terminal 200 and/or the service server 300 .

Meanwhile, the server 100 according to an embodiment of the present invention may generate a lip image from the target voice using the trained first artificial neural network and provide the generated lip image to the user terminal 200 and/or the service server 300.

At this time, the server 100 may generate a lip image corresponding to the voice for each frame of the transition image, and may generate lip sync data including identification information of the frame, the generated lip image, and location information of the lip image within the frame. In addition, the server 100 may provide the generated lip sync data to the user terminal 200 and/or the service server 300. In the present invention, such a server 100 may sometimes be named and described as an 'image switching device'.

FIG. 2 is a diagram schematically showing the configuration of the server 100 according to an embodiment of the present invention. Referring to FIG. 2, a server 100 according to an embodiment of the present invention may include a communication unit 110, a first processor 120, a memory 130 and a second processor 140. Also, although not shown in the drawings, the server 100 according to an embodiment of the present invention may further include an input/output unit, a program storage unit, and the like.

The communication unit 110 may be a device that includes hardware and software necessary for the server 100 to transmit and receive signals, such as control signals or data signals, to and from other network devices such as the user terminal 200 and/or the service server 300 through wired or wireless connections.

The first processor 120 may be a device that controls a series of processes of generating output data from input data using learned artificial neural networks. For example, the first processor 120 may be a device that controls a process of generating a lip image corresponding to the acquired voice using the learned first artificial neural network.

In this case, the processor may mean, for example, a data processing device embedded in hardware having a physically structured circuit to perform functions expressed by codes or instructions included in a program. Examples of such a data processing device embedded in hardware include a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA), but the scope of the present invention is not limited thereto.

The memory 130 performs a function of temporarily or permanently storing data processed by the server 100 . The memory may include magnetic storage media or flash storage media, but the scope of the present invention is not limited thereto. For example, the memory 130 may temporarily and/or permanently store data constituting the learned artificial neural network (eg, coefficients). Of course, the memory 130 may store learning data for learning an artificial neural network or data received from the service server 300 . However, this is illustrative and the spirit of the present invention is not limited thereto.

The second processor 140 may refer to a device that performs calculations under the control of the first processor 120 described above. In this case, the second processor 140 may be a device having higher computational power than the above-described first processor 120 . For example, the second processor 140 may be configured as a graphics processing unit (GPU). However, this is an example and the spirit of the present invention is not limited thereto. In one embodiment of the present invention, the number of second processors 140 may be plural or singular.

In one embodiment of the present invention, the service server 300 may be a device that receives lip sync data including a lip image generated by the server 100 and data related to a transition image (e.g., identification information of a frame included in the transition image), generates an output frame using them, and provides it to another device (e.g., the user terminal 200).

In another embodiment of the present invention, the service server 300 may be a device that receives the artificial neural network trained by the server 100 and provides, according to a request from another device (e.g., the user terminal 200), lip sync data and data related to a transition image (e.g., identification information of a frame included in the transition image).

FIG. 3 is a diagram schematically illustrating the configuration of a service server 300 according to an embodiment of the present invention. Referring to FIG. 3, a service server 300 according to an embodiment of the present invention may include a communication unit 310, a third processor 320, a memory 330 and a fourth processor 340. Also, although not shown in the drawing, the service server 300 according to an embodiment of the present invention may further include an input/output unit, a program storage unit, and the like.

In one embodiment of the present invention, the third processor 320 may be a device that controls a process of receiving lip sync data including a lip image generated by the server 100 and data related to a transition image (e.g., identification information of a frame included in the transition image), generating an output frame, and providing it to another device (e.g., the user terminal 200).

Meanwhile, in another embodiment of the present invention, the third processor 320 may be a device that uses the trained artificial neural network (received from the server 100) to provide lip sync data and data related to a transition image (e.g., identification information of a frame included in the transition image) according to a request from another device (e.g., the user terminal 200).

The memory 330 performs a function of temporarily or permanently storing data processed by the service server 300 . The memory may include magnetic storage media or flash storage media, but the scope of the present invention is not limited thereto. For example, the memory 330 may temporarily and/or permanently store data constituting the learned artificial neural network (eg, coefficients). Of course, the memory 330 may store learning data for learning an artificial neural network or data received from the service server 300 . However, this is illustrative and the spirit of the present invention is not limited thereto.

The fourth processor 340 may mean a device that performs calculations under the control of the aforementioned third processor 320. In this case, the fourth processor 340 may be a device having higher computational power than the aforementioned third processor 320. For example, the fourth processor 340 may be configured as a graphics processing unit (GPU). However, this is an example and the spirit of the present invention is not limited thereto. In one embodiment of the present invention, the number of fourth processors 340 may be plural or singular.

The user terminal 200 according to an embodiment of the present invention may refer to various types of devices that mediate the user and the server 100 so that the user can use various services provided by the server 100 . In other words, the user terminal 200 according to an embodiment of the present invention may refer to various devices that transmit and receive data to and from the server 100 .

The user terminal 200 according to an embodiment of the present invention may receive lip sync data and transition image-related data (e.g., frame identification information included in the transition image) provided by the server 100, and use them to generate an output frame. As shown in FIG. 1, such a user terminal 200 may be a portable terminal 201, 202, or 203, or may be a computer 204.

The user terminal 200 according to an embodiment of the present invention may include a display means for displaying content and an input means for acquiring a user's input for such content in order to perform the above-described functions. At this time, the input means and the display means may be configured in various ways. For example, the input means may include, but is not limited to, a keyboard, a mouse, a trackball, a microphone, a button, a touch panel, and the like.

In the present invention, such a user terminal 200 may sometimes be named and described as an 'image display device'.

The communication network 400 according to an embodiment of the present invention may refer to a communication network that mediates data transmission and reception between components of the image generating system. For example, the communication network 400 may include wired networks such as LANs (Local Area Networks), WANs (Wide Area Networks), MANs (Metropolitan Area Networks), and ISDNs (Integrated Service Digital Networks), and wireless networks such as wireless LANs, CDMA, Bluetooth, and satellite communication. However, the scope of the present invention is not limited thereto.

FIGS. 4 and 5 are diagrams for explaining an exemplary structure of an artificial neural network learned by the server 100 according to an embodiment of the present invention. Hereinafter, for convenience of explanation, the first artificial neural network and the second artificial neural network will be collectively referred to as 'artificial neural networks'.

An artificial neural network according to an embodiment of the present invention may be an artificial neural network based on a Convolutional Neural Network (CNN) model as shown in FIG. 4 . In this case, the CNN model may be a hierarchical model used to finally extract features of input data by alternately performing a plurality of operation layers (Convolutional Layer, Pooling Layer).

The server 100 according to an embodiment of the present invention may build or train an artificial neural network model by processing learning data according to a supervised learning technique. A detailed description of how the server 100 trains the artificial neural network will be described later.

The server 100 according to an embodiment of the present invention may train the artificial neural network by repeatedly performing, using a plurality of learning data, a process of updating the weight of each layer and/or each node so that the output value generated by inputting any one input data to the artificial neural network approaches the value marked in the corresponding learning data.

At this time, the server 100 according to an embodiment of the present invention may update weights (or coefficients) of each layer and/or each node according to a back propagation algorithm.
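The update loop described above can be illustrated with the smallest possible case: a single linear node trained by gradient descent on squared error, which is the one-layer special case of back propagation. This is not the network of the present document; the model, learning rate, epoch count, and toy data are all assumptions made for illustration.

```python
from typing import List, Tuple


def train_linear(samples: List[Tuple[float, float]],
                 lr: float = 0.05, epochs: int = 2000) -> Tuple[float, float]:
    """Fit y = w * x + b by repeatedly nudging w and b so that the output
    approaches the labeled target of each training sample."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, target in samples:
            error = (w * x + b) - target      # output minus labeled value
            grad_w += 2.0 * error * x / n     # d(mean squared error)/dw
            grad_b += 2.0 * error / n         # d(mean squared error)/db
        w -= lr * grad_w                      # weight update step
        b -= lr * grad_b
    return w, b


# Labeled data generated from y = 2x + 1.
w, b = train_linear([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
```

In a multi-layer network the same idea applies, with the error gradient propagated backward through each layer by the chain rule.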

The server 100 according to an embodiment of the present invention may generate a convolution layer for extracting feature values of input data and a pooling layer that forms a feature map by combining the extracted feature values.

In addition, the server 100 according to an embodiment of the present invention may generate, by combining the generated feature maps, a fully connected layer that prepares to determine a probability that the input data corresponds to each of a plurality of items.

The server 100 according to an embodiment of the present invention may calculate an output layer including output corresponding to input data.

In the example shown in FIG. 4, it is shown that input data is divided into 5X7 blocks, 5X3 unit blocks are used to create a convolution layer, and 1X4 or 1X2 unit blocks are used to create a pooling layer. However, this is an example and the spirit of the present invention is not limited thereto. Accordingly, the type of input data and/or the size of each block may be configured in various ways.
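To make the block sizes concrete, the sketch below applies a 5X3 kernel to a 5X7 input and then pools the result with 1X2 and 1X4 windows. How FIG. 4 actually arranges these blocks is not recoverable from the text, so the choices here (no padding, stride 1 for the convolution, non-overlapping max pooling that drops any remainder) are assumptions, as are the function names.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image without padding (stride 1).
    As is usual in CNNs, the kernel is not flipped (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j]
             for i in range(kh) for j in range(kw))
         for c in range(out_w)]
        for r in range(out_h)
    ]


def max_pool(feature_map: Matrix, ph: int, pw: int) -> Matrix:
    """Non-overlapping max pooling; cells that do not fill a window are dropped."""
    return [
        [max(feature_map[r + i][c + j] for i in range(ph) for j in range(pw))
         for c in range(0, len(feature_map[0]) - pw + 1, pw)]
        for r in range(0, len(feature_map) - ph + 1, ph)
    ]


image = [[float(c) for c in range(7)] for _ in range(5)]   # 5X7 input
kernel = [[1.0] * 3 for _ in range(5)]                     # 5X3 unit block
features = conv2d_valid(image, kernel)                     # 1X5 feature row
pooled_1x2 = max_pool(features, 1, 2)
pooled_1x4 = max_pool(features, 1, 4)
```

With these assumptions a 5X7 input and a 5X3 kernel give a 1X5 feature row, which the 1X2 window reduces to two values and the 1X4 window to one.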

Meanwhile, such an artificial neural network may be stored in the memory 130 described above in the form of the type of model of the artificial neural network, the coefficients of at least one node constituting the artificial neural network, the weights of the nodes, and the coefficients of a function defining the relationship between a plurality of layers constituting the artificial neural network. Of course, the structure of the artificial neural network may also be stored in the memory 130 in the form of source codes and/or programs.

An artificial neural network according to an embodiment of the present invention may be an artificial neural network based on a Recurrent Neural Network (RNN) model as shown in FIG. 5 .

Referring to FIG. 5, an artificial neural network according to such a recurrent neural network (RNN) model may include an input layer L1 including at least one input node N1, a hidden layer L2 including a plurality of hidden nodes N2, and an output layer L3 including at least one output node N3.

As shown, the hidden layer L2 may include one or more fully connected layers. When the hidden layer L2 includes a plurality of layers, the artificial neural network may include a function (not shown) defining a relationship between each hidden layer.

At least one output node N3 of the output layer L3 may include an output value generated by the artificial neural network from an input value of the input layer L1 under the control of the server 100 .

Meanwhile, a value included in each node of each layer may be a vector. In addition, each node may include a weight corresponding to the importance of the corresponding node.

Meanwhile, the artificial neural network may include a first function F1 defining the relationship between the input layer L1 and the hidden layer L2 and a second function F2 defining the relationship between the hidden layer L2 and the output layer L3.

The first function F1 may define a connection relationship between the input node N1 included in the input layer L1 and the hidden node N2 included in the hidden layer L2. Similarly, the second function F2 may define a connection relationship between the hidden node N2 included in the hidden layer L2 and the output node N3 included in the output layer L3.

The first function F1, the second function F2, and the functions between hidden layers may include a recurrent neural network model that outputs a result based on an input of a previous node.
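The layer relationships above can be illustrated with a scalar recurrent cell. This is a loose sketch, not the structure of FIG. 5: mapping F1 to a single input weight `w_in` and F2 to a single output weight `w_out`, the recurrent weight `w_rec`, and the `tanh` activation are all assumptions.

```python
import math
from typing import List


def rnn_forward(inputs: List[float], w_in: float, w_rec: float,
                w_out: float) -> List[float]:
    """Scalar recurrent network: each hidden value depends on the current
    input (via w_in, standing in for F1) and on the previous hidden value
    (via w_rec); each output is derived from the hidden value (via w_out,
    standing in for F2)."""
    hidden = 0.0
    outputs = []
    for x in inputs:
        hidden = math.tanh(w_in * x + w_rec * hidden)
        outputs.append(w_out * hidden)
    return outputs


with_memory = rnn_forward([1.0, 0.0], w_in=1.0, w_rec=1.0, w_out=2.0)
without_memory = rnn_forward([1.0, 0.0], w_in=1.0, w_rec=0.0, w_out=2.0)
```

The second input is zero in both runs, yet only the run with a nonzero `w_rec` produces a nonzero second output, which shows the dependence on the previous step.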

In the course of learning the artificial neural network by the server 100, the first function F1 and the second function F2 may be learned based on a plurality of learning data. Of course, in the process of learning the artificial neural network, functions between the plurality of hidden layers may also be learned in addition to the first function F1 and the second function F2 described above.

An artificial neural network according to an embodiment of the present invention may be learned in a supervised learning method based on labeled learning data.

The server 100 according to an embodiment of the present invention may train the artificial neural network by repeatedly performing, using a plurality of learning data, a process of updating the aforementioned functions (F1, F2, functions between hidden layers, etc.) so that the output value generated by inputting any one input data to the artificial neural network approaches the value marked in the corresponding learning data.

At this time, the server 100 according to an embodiment of the present invention may update the aforementioned functions (F1, F2, functions between hidden layers, etc.) according to a back propagation algorithm. However, this is an example and the spirit of the present invention is not limited thereto.

The type and/or structure of the artificial neural network described in FIGS. 4 and 5 is exemplary, and the spirit of the present invention is not limited thereto. Accordingly, artificial neural networks of various types of models may correspond to 'artificial neural networks' described throughout the specification.

Hereinafter, a transition image generation process performed by the server 100, a process of providing a lip sync video, and a process of displaying a lip sync video performed by the user terminal 200 will be mainly described.

Transition image generation process

In the present invention, a 'transition image' is a background image displayed together with a lip image generated by the server 100, and may be an image containing a process or appearance of an avatar transitioning. For example, the transition image may be an image in which an image capturing the avatar's waiting state and an image capturing the avatar's speaking state are connected. Of course, the transition image may also be an image in which a first image capturing the avatar speaking and a second image capturing the avatar speaking are connected.

On the other hand, the lip image generated by the server 100 may be displayed overlapping an image portion in which the speech of the avatar is photographed among the transition images.

In general, the content of the avatar's speech and the avatar's speech gestures (for example, body movements) need not be related, so overlaying an arbitrary mouth shape on an image in which the avatar makes speech gestures causes little unnaturalness. Therefore, the server 100 according to an embodiment of the present invention may generate, at a time when the avatar's speech is required, a transition image that switches from an image capturing the avatar's standby state to an image capturing the avatar's speaking state, and may provide a natural and intended speech image by overlapping and displaying the lip image on the part in which the avatar's speaking appearance is captured.

FIG. 6 is a diagram illustrating a plurality of first images IG1 and a plurality of second images IG2 stored in the memory 130 of the server 100 according to an embodiment of the present invention. The server 100 according to an embodiment of the present invention may select any one of the plurality of first images IG1 and the plurality of second images IG2 according to a given situation and provide it to the user terminal 200 and/or the service server 300. For example, the server 100 may select and provide one of the plurality of first images IG1 to the user terminal 200 in a situation where there is no user manipulation of the user terminal 200. At this time, the user terminal 200 may provide a more natural idle screen by displaying the first image provided by the server 100.

In the present invention, a natural image such as a 'natural idle screen' is an image in which an avatar moves naturally over time. For example, when the avatar is a person, it may be an image in which a person blinks his eyes or moves his body minutely according to breathing. Accordingly, an image composed of a single frame or a single image may not correspond to a natural image described in the present invention.

On the other hand, when a situation in which the avatar needs to speak arises according to the user's manipulation of the user terminal 200, the server 100 may select, from among the plurality of second images IG2, a second image that can be naturally connected to the first image currently displayed on the user terminal 200 (that is, generate a transition image) and provide it to the user terminal 200. At this time, as described above, the server 100 may generate a lip image to be displayed on the second image and provide it to the user terminal 200, which will be described in detail later.

Hereinafter, for convenience of explanation, the description assumes that the first image 410 is currently displayed on the user terminal 200 and that the second image 420 to be used for switching is selected from among the plurality of second images IG2 according to a process described later.

The server 100 according to an embodiment of the present invention may determine, from among at least one frame constituting the first image 410, at least one candidate frame 411, 412, or 413 to be compared for similarity with at least one frame constituting each of the plurality of second images IG2.

The server 100 according to an embodiment of the present invention may determine, as the at least one candidate frame, at least one frame 411 of the first image 410 corresponding to a predetermined time period, for example the interval from a first time point t1, at which the need to switch the image displayed on the user terminal 200 arises, to a second time point t2 that is T seconds later. For example, when the user manipulates the user terminal 200 at the first time point t1 and a situation requiring the avatar to speak arises accordingly, the server 100 may determine at least one frame 411 between the first time point t1 and the second time point t2 as the at least one candidate frame. In this case, the server 100 may provide an image in which the avatar speaks at the second time point t2 at the latest.

Meanwhile, the server 100 according to another embodiment of the present invention may determine the first frame 412 of the first image 410 and the last frame 413 of the first image 410 as the at least one candidate frame. For example, if each of the plurality of first images IG1 is relatively short and the avatar's pose is static at the beginning and end of the image, it may be preferable to determine the first frame 412 and the last frame 413 as the at least one candidate frame.

Meanwhile, the candidate frame determination schemes shown in FIGS. 7 and 8 are exemplary, and the spirit of the present invention is not limited thereto.
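The two candidate frame determination schemes can be sketched as follows. This is only an illustration under stated assumptions: frames are assumed to be addressed by index at a constant frame rate `fps`, the first image is assumed to contain at least one frame, and the function names are hypothetical rather than taken from the specification.

```python
import math


def candidates_in_window(num_frames, fps, t1, T):
    """Indices of frames whose timestamps lie between t1 and t2 = t1 + T, inclusive.

    Assumes frame i is shown at time i / fps (an assumption of this sketch).
    """
    first = max(0, math.ceil(t1 * fps))
    last = min(num_frames - 1, math.floor((t1 + T) * fps))
    return list(range(first, last + 1))


def candidates_first_and_last(num_frames):
    """Indices of the first and last frames (one index for a single-frame image)."""
    return sorted({0, num_frames - 1})
```

The first function bounds the response delay by T seconds, while the second suits short first images whose pose is static at both ends.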

The server 100 according to an embodiment of the present invention may calculate a similarity between at least one frame constituting the first image 410 and at least one frame constituting each of the plurality of second images IG2. For example, the server 100 may calculate a similarity by comparing at least one candidate frame 411, 412, or 413 determined according to the above-described process with at least one frame constituting each of the plurality of second images IG2.

As shown in FIG. 9, the server 100 according to an embodiment of the present invention may calculate a difference value between pixels of the two frames 414 and 421 as a similarity calculation result 431. In the calculation result 431, a portion having a difference in pixel values between the two frames 414 and 421 may be displayed to correspond to the degree of difference.

In the same manner, the server 100 according to an embodiment of the present invention may calculate a difference between pixels of the two frames 415 and 422 as the similarity calculation result 432. In the calculation result 432, a portion having a difference in pixel values between the two frames 415 and 422 may be displayed to correspond to the degree of difference.

Comparing the calculation result 431 of FIG. 9 with the calculation result 432 of FIG. 10 , it can be seen that more pixels appear to have differences in the calculation result 432 of FIG. 10 .
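A pixel difference of this kind can be sketched as follows. The specification only states that a difference value between pixels is calculated; the reduction of that difference map to a single score in the range 0 to 1 is an assumption of this sketch, as are the grayscale nested-list frame format and the function names.

```python
def pixel_difference(frame_a, frame_b):
    """Per-pixel absolute difference between two equally sized grayscale frames."""
    return [
        [abs(a - b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


def similarity(frame_a, frame_b, max_value=255):
    """Assumed score: 1.0 for pixel-identical frames, lower as differences grow."""
    diff = pixel_difference(frame_a, frame_b)
    total = sum(sum(row) for row in diff)
    count = sum(len(row) for row in diff)
    return 1.0 - total / (count * max_value)
```

Under this scoring, a difference map with more differing pixels, as in the calculation result 432, yields a lower similarity than a map with fewer differing pixels, as in the calculation result 431.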

The server 100 according to an embodiment of the present invention may determine, by referring to the similarity calculated according to the above-described process, a first connection frame, which is a frame used at the time of switching from the first image 410 to the second image 420, from among at least one frame of the first image 410. Similarly, the server 100 according to an embodiment of the present invention may determine, by referring to the similarity, the second image 420 to be used for switching from among the plurality of second images IG2, and a second connection frame, which is a frame used at the time of switching, from among at least one frame of the second image 420 to be used for switching.

At this time, the server 100 may determine the first connection frame and the second connection frame according to various criteria.

For example, the server 100 may select the frame combination having the highest similarity, that is, the combination (or set) of a frame of the first image 410 and a frame of the second image 420 having the highest similarity, and determine the frame of the first image 410 in that combination as the first connection frame and the frame of the second image 420 as the second connection frame. According to this method, the most natural image switching is possible.

In addition, the server 100 may select, from among frame combinations whose similarity exceeds a threshold similarity, the combination that includes the frame of the first image 410 at the earliest time point, and determine the frame of the first image 410 in that combination as the first connection frame and the frame of the second image 420 as the second connection frame. According to this method, a fast response speed can be provided.

However, both of the above-described methods are illustrative, and the spirit of the present invention is not limited thereto; any method of determining a frame combination based on similarity may correspond to the present invention.
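The two selection criteria can be sketched as follows, assuming the similarities have already been calculated for every frame combination. The `Pair` record, the function names, and the tie-breaking rule (preferring the higher similarity among combinations that share the earliest frame) are assumptions made for this illustration.

```python
from typing import List, NamedTuple, Optional


class Pair(NamedTuple):
    first_frame: int    # frame index in the first image
    second_image: str   # identifier of one of the second images
    second_frame: int   # frame index in that second image
    similarity: float


def most_similar(pairs: List[Pair]) -> Pair:
    """Criterion 1: the combination with the highest similarity."""
    return max(pairs, key=lambda p: p.similarity)


def earliest_above(pairs: List[Pair], threshold: float) -> Optional[Pair]:
    """Criterion 2: the earliest first-image frame among combinations over the threshold."""
    eligible = [p for p in pairs if p.similarity > threshold]
    if not eligible:
        return None
    return min(eligible, key=lambda p: (p.first_frame, -p.similarity))
```

In either case, `first_frame` of the returned combination plays the role of the first connection frame, and `second_image` and `second_frame` identify the second image to be used for switching and the second connection frame.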

The server 100 according to an embodiment of the present invention may generate a transition image by connecting a third image including at least some frames of the first image 410 and a fourth image including at least some frames of the second image 420 to be used for switching.

As described above, the transition image 440 may be an image in which a third image 416 including at least some frames of the first image 410 and a fourth image 423 including at least some frames of the second image 420 to be used for switching are connected.

In this case, the third image 416 is an image composed of at least some frames of the first image 410, and may be an image having the first connection frame 417 determined according to the above process as its last frame. Also, the fourth image 423 is an image composed of at least some frames of the second image 420 to be used for switching, and may be an image having the second connection frame 424 as its first frame.

The server 100 according to an embodiment of the present invention may transmit configuration information of the transition image generated according to the above process to the user terminal 200 and/or the service server 300. For example, the server 100 may generate the configuration information of the transition image so that it includes identification information of the frames included in each of the third image 416 and the fourth image 423, and transmit it to the user terminal 200 and/or the service server 300. The user terminal 200 and/or the service server 300 may use the received configuration information of the transition image to create a background image for the lip image generated according to a process described later.
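The configuration information of a transition image can be sketched as follows. The specification does not define a concrete format, so the dictionary layout, the use of (image identifier, frame index) pairs as frame identification information, and the function name are all assumptions of this illustration.

```python
def build_transition_config(first_id, start, first_conn,
                            second_id, second_conn, second_len):
    """Describe a transition image as two lists of frame identifiers.

    The third image runs from the currently displayed frame `start` of the
    first image up to the first connection frame; the fourth image runs from
    the second connection frame to the end of the selected second image.
    """
    third = [(first_id, i) for i in range(start, first_conn + 1)]
    fourth = [(second_id, i) for i in range(second_conn, second_len)]
    return {"third_image": third, "fourth_image": fourth}
```

Because only identifiers are transmitted, a receiver that already stores the first and second images can rebuild the transition image without receiving the frames themselves.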

Lip sync image provision and display process

The server 100 according to an embodiment of the present invention may generate, from an input voice, a lip image provided together with the fourth image 423. To this end, the server 100 according to an embodiment of the present invention may train the first artificial neural network using training data.

The first artificial neural network 520 according to an embodiment of the present invention may mean a neural network that has learned a correlation between a first lip image, a voice, and a second lip image included in each of a plurality of training data 510.

Accordingly, the first artificial neural network 520 according to an embodiment of the present invention may refer to a neural network trained to output, upon input of the voice 531 and the first lip image 542 as shown in FIG. 13, the second lip image 543, which is an image in which the first lip image 542 is transformed according to the voice 531. In this case, the first lip image 542 is a sample image and may be an image including a lip shape that serves as the basis for generating a lip image according to the voice.

Each of the plurality of learning data 510 according to an embodiment of the present invention may include a first lip image, a voice, and a second lip image.

For example, the first training data 511 may include a first lip image 511B, a voice 511A, and a second lip image 511C. Similarly, the second training data 512 and the third training data 513 may each include a first lip image, a voice, and a second lip image.

Meanwhile, in one embodiment of the present invention, the number of second lip images included in each of the plurality of learning data 510 may be plural or singular. For example, in an example in which the server 100 divides voice according to a predetermined rule and generates a lip image from the divided voice section, the second lip image may be singular. At this time, the voice included in each of the plurality of learning data 510 may also be a partial section divided from the entire voice.

Meanwhile, in an example in which the server 100 generates a series of lip images from the entire voice, a plurality of second lip images may be provided as shown in FIG. 12 . However, this is illustrative and the spirit of the present invention is not limited thereto.
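The structure of one item of training data, and the case in which a voice is divided into sections with a single second lip image each, can be sketched as follows. The fixed-length division is only one possible "predetermined rule" and is an assumption of this sketch, as are the class and function names.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class TrainingSample:
    first_lip_image: Any           # sample lip image used as the basis
    voice: Sequence[float]         # whole voice or one divided section
    second_lip_images: List[Any]   # one image per section, or a series


def split_into_samples(first_lip_image, voice, second_lip_images, section_len):
    """Divide a voice into fixed-length sections, one second lip image per section."""
    samples = []
    for k, lip in enumerate(second_lip_images):
        section = voice[k * section_len:(k + 1) * section_len]
        if len(section) == 0:
            break  # no voice left for the remaining lip images
        samples.append(TrainingSample(first_lip_image, section, [lip]))
    return samples
```

Passing the whole voice with the full list of second lip images in a single `TrainingSample` corresponds to the other case, in which a series of lip images is generated from the entire voice.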

In an embodiment of the present invention, the plurality of learning data 510 may be acquired through a lip reading process. For example, the server 100 may collect a mouth shape image and audio data related to the mouth shape image through a lip reading process. In this case, the plurality of learning data 510 may further include text corresponding to voice and mouth shape.

In an optional embodiment of the present invention, the server 100 may use the model learned through the lip reading process as the first artificial neural network 520 . However, this is illustrative and the spirit of the present invention is not limited thereto.

Hereinafter, the description assumes that the first artificial neural network 520 has been trained according to the process described with reference to FIGS. 12 and 13.

The server 100 according to an embodiment of the present invention may obtain a target voice to be used as the voice of the avatar.

In the present invention, the 'target voice' is used as the sound signal of the transition image (or the fourth image 423), and may mean a voice corresponding to the avatar's lip shape in the transition image (or the fourth image 423).

In one embodiment of the present invention, the server 100 may generate a target voice from text using a trained second artificial neural network. In this case, the second artificial neural network may refer to a neural network trained to output, upon input of text, a target voice corresponding to a reading of the text.

Meanwhile, the 'text' may be generated by the server 100 according to a predetermined rule or method. For example, in an example in which the server 100 provides a response to a request received from the user terminal 200, the server 100 may use a third artificial neural network (not shown) to generate text that responds to the request received from the user terminal 200.

Meanwhile, in an example in which the server 100 provides a response (or image) according to a preset scenario, the server 100 may read text from memory. However, this is illustrative and the spirit of the present invention is not limited thereto.

The server 100 according to an embodiment of the present invention may transmit the target voice to the user terminal 200. At this time, the user terminal 200 according to an embodiment of the present invention may store the target voice. Meanwhile, the target voice transmitted to the user terminal 200 may subsequently be used to generate and/or output an output image (or output frame), which will be described in detail later.

The server 100 according to an embodiment of the present invention may generate a lip image corresponding to the target voice for each frame of the fourth image 423 using the learned first artificial neural network.

In the present invention, an expression such as 'for each frame of the fourth image' may mean generating a lip image for each individual frame of the fourth image 423. For example, the server 100 according to an embodiment of the present invention may generate a lip image corresponding to the first frame of the fourth image 423 using the trained first artificial neural network. At this time, as described with reference to FIG. 13, the first artificial neural network may mean a neural network trained to output, upon input of the voice 531 and the first lip image 542, the second lip image 543, which is an image in which the first lip image 542 is transformed according to the voice.

In an embodiment of the present invention, the server 100 may input the first lip image obtained from the first frame of the fourth image 423 and the target voice to the first artificial neural network, and generate a lip image for the first frame as the output result.

The server 100 according to an embodiment of the present invention may generate first lip sync data. At this time, the first lip sync data may include identification information of the first frame of the fourth image 423 used for the lip image, the generated lip image, and location information of the lip image in the first frame of the fourth image 423 used for the lip image. To generate such first lip sync data, the server 100 according to an embodiment of the present invention may identify the position of the lips in the first frame and generate the location information of the lip image based on the identified position.

The server 100 according to an embodiment of the present invention may transmit the generated first lip sync data to the user terminal 200. At this time, as described above, the first lip sync data may include identification information of the fourth image 423 used for the lip image, the generated lip image, and location information of the lip image in the frame (that is, the first frame) of the fourth image 423 used for the lip image.
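The generation of lip sync data for one frame can be sketched as follows. The lip detector and the first artificial neural network are passed in as callables because the specification does not fix their implementations; `LipSyncData`, `make_lip_sync_data`, `locate_lips`, and `first_network` are hypothetical names, and frames are assumed to be nested lists of pixel values.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LipSyncData:
    frame_id: Tuple[str, int]      # identification information of the frame
    lip_image: List[list]          # lip image generated for that frame
    lip_position: Tuple[int, int]  # (top, left) of the lip image in the frame


def make_lip_sync_data(frame_id, frame, target_voice,
                       locate_lips: Callable, first_network: Callable):
    """Build lip sync data for one frame of the fourth image.

    `locate_lips(frame)` is assumed to return (top, left, height, width);
    `first_network(voice, first_lip_image)` is assumed to return the
    transformed (second) lip image.
    """
    top, left, height, width = locate_lips(frame)
    # Crop the first lip image from the frame at the identified position.
    first_lip_image = [row[left:left + width] for row in frame[top:top + height]]
    lip_image = first_network(target_voice, first_lip_image)
    return LipSyncData(frame_id, lip_image, (top, left))
```

Only the small lip image and its position are carried per frame, which is why the receiving side needs the frame identification information to find the matching stored frame.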

Upon receiving such first lip sync data, the user terminal 200 may read a frame corresponding to the frame identification information from memory by referring to identification information of the first frame included in the first lip sync data. At this time, the user terminal 200 may retrieve and read a frame corresponding to the frame identification information from the plurality of first images IG1 and/or the plurality of second images IG2 stored in advance. For example, the user terminal 200 may search for an image including the fourth image 423 among the plurality of second images IG2 and read a specific frame from the corresponding image.

In addition, the user terminal 200 may generate an output frame by overlapping the lip image included in the first lip sync data on the read frame based on the location information of the lip image included in the first lip sync data, and may display the output frame.

The above-described process explains how the server 100 and the user terminal 200 process a single frame, namely the first frame.

The server 100 according to an embodiment of the present invention may generate the lip sync data frame by frame for a plurality of frames of the fourth image 423 . At this time, the user terminal 200 may receive lip sync data generated in units of frames and generate output frames for each lip sync data.

For example, the server 100 and the user terminal 200 may process the second frame in the same way as the first frame described above. In this case, the second frame may be a frame following the first frame in the fourth image 423 .

The user terminal 200 according to an embodiment of the present invention may display the output frames generated according to the above-described process and simultaneously reproduce the target voice, thereby providing the user with an output result in which the object appears to speak with the corresponding voice.

That is, the user terminal 200 may provide a natural lip sync image by presenting, as the avatar's image, an image in which the mouth shape of the pre-stored avatar image is replaced with the mouth shape received from the server 100, and by presenting, as the avatar's voice, the target voice received from the server 100.

The user terminal 200 according to an embodiment of the present invention may generate an output frame 711 by overlapping the lip image 544 generated by the server 100 on the specific frame 590 of the fourth image 423. At this time, the user terminal 200 may determine the overlap position of the lip image 544 on the specific frame 590 by using the location information 591 of the lip image received from the server 100. Of course, the user terminal 200 may generate output frames in the same way for the remaining frames of the fourth image 423.
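The overlap performed by the user terminal 200 can be sketched as follows. The sketch assumes that a frame and a lip image are nested lists of pixel values and that the location information is a (top, left) offset of the lip image within the frame; the function name is hypothetical.

```python
def overlay_lip_image(frame, lip_image, top, left):
    """Return a copy of `frame` with `lip_image` pasted at row `top`, column `left`."""
    output = [row[:] for row in frame]  # leave the stored frame unmodified
    for dy, lip_row in enumerate(lip_image):
        for dx, pixel in enumerate(lip_row):
            output[top + dy][left + dx] = pixel
    return output
```

Copying the frame before pasting keeps the pre-stored frame intact, so the same stored frame can be reused for a later utterance with a different lip image.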

FIGS. 15, 16 and 17 are flowcharts for explaining an image switching method performed by the server 100 according to an embodiment of the present invention. Hereinafter, a description will be made with reference to FIGS. 1 to 14, but descriptions of overlapping contents will be omitted.

The server 100 according to an embodiment of the present invention may determine, from among at least one frame constituting the first image 410, at least one candidate frame 411, 412, or 413 to be compared for similarity with at least one frame constituting each of the plurality of second images IG2. (S1210)

The server 100 according to an embodiment of the present invention may calculate a similarity between at least one frame constituting the first image 410 and at least one frame constituting each of the plurality of second images IG2. (S1220) For example, the server 100 may calculate a similarity by comparing at least one candidate frame 411, 412, or 413 determined according to the above process with at least one frame constituting each of the plurality of second images IG2.

The server 100 according to an embodiment of the present invention may determine, by referring to the similarity calculated according to the above-described process, a first connection frame, which is a frame used at the time of switching from the first image 410 to the second image 420, from among at least one frame of the first image 410. (S1230) Similarly, the server 100 according to an embodiment of the present invention may determine, by referring to the similarity, the second image 420 to be used for switching from among the plurality of second images IG2, and a second connection frame, which is a frame used at the time of switching, from among at least one frame of the second image 420 to be used for switching. (S1240)

The server 100 according to an embodiment of the present invention may generate a transition image by connecting a third image including at least some frames of the first image 410 and a fourth image including at least some frames of the second image 420 to be used for switching. (S1250)

The server 100 according to an embodiment of the present invention may generate, from an input voice, a lip image provided together with the fourth image 423. (S1260) To this end, the server 100 according to an embodiment of the present invention may train the first artificial neural network using training data.

The server 100 according to an embodiment of the present invention may determine the fourth image 423 through steps S1220 to S1250. (S1261) At this time, the fourth image 423 may mean an image corresponding to the part of the transition image that is the target of lip sync.

The server 100 according to an embodiment of the present invention may obtain a target voice to be used as the voice of the avatar (S1262).

The server 100 according to an embodiment of the present invention may transmit the target voice to the user terminal 200. (S1263) At this time, the user terminal 200 according to an embodiment of the present invention may store the target voice. (S1264) Meanwhile, the target voice transmitted to the user terminal 200 may subsequently be used to generate and/or output an output image (or output frame), which will be described in detail later.

The server 100 according to an embodiment of the present invention may generate a lip image corresponding to the target voice for each frame of the fourth image 423 using the learned first artificial neural network. (S1265)

The server 100 according to an embodiment of the present invention may generate first lip sync data. (S1266) At this time, the first lip sync data may include identification information of the first frame of the fourth image 423 used for the lip image, the generated lip image, and location information of the lip image in the first frame of the fourth image 423 used for the lip image. To generate such first lip sync data, the server 100 according to an embodiment of the present invention may identify the position of the lips in the first frame and generate the location information of the lip image based on the identified position.

The server 100 according to an embodiment of the present invention may transmit the generated first lip sync data to the user terminal 200. (S1267) At this time, as described above, the first lip sync data may include identification information of the fourth image 423 used for the lip image, the generated lip image, and location information of the lip image in the frame (that is, the first frame) of the fourth image 423 used for the lip image.

Upon receiving such first lip sync data, the user terminal 200 may read a frame corresponding to the frame identification information from memory by referring to identification information of the first frame included in the first lip sync data. (S1268) At this time, the user terminal 200 may search for and read a frame corresponding to the frame identification information from the plurality of first images IG1 and/or the plurality of second images IG2 stored in advance. For example, the user terminal 200 may search for an image including the fourth image 423 among the plurality of second images IG2 and read a specific frame from the corresponding image.

In addition, the user terminal 200 may generate an output frame by overlapping the lip image included in the first lip sync data on the read frame based on the location information of the lip image included in the first lip sync data (S1269), and may display the output frame. (S1270)

The server 100 according to an embodiment of the present invention may generate lip sync data frame by frame for a plurality of frames of the fourth image 423 . At this time, the user terminal 200 may receive lip sync data generated in units of frames and generate output frames for each lip sync data.

For example, the server 100 and the user terminal 200 may process the second frame, as shown in steps S1277 to S1282 (FR2) of FIG. 17, in the same manner as the above-described first frame (steps S1271 to S1276 (FR1)). In this case, the second frame may be a frame following the first frame in the fourth image 423.

Embodiments according to the present invention described above may be implemented in the form of a computer program that can be executed on a computer through various components, and such a computer program may be recorded on a computer-readable medium. In this case, the medium may store a program executable by a computer. Examples of the medium include magnetic media such as hard disks, floppy disks and magnetic tapes, optical recording media such as CD-ROM and DVD, magneto-optical media such as floptical disks, and ROM, RAM, flash memory, etc. configured to store program instructions.

Meanwhile, the computer program may be specially designed and configured for the present invention, or may be known and usable to those skilled in the art of computer software. An example of a computer program may include not only machine language code generated by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like.

Specific implementations described in the present invention are examples and do not limit the scope of the present invention in any way. For brevity of the specification, description of conventional electronic components, control systems, software, and other functional aspects of the systems may be omitted. In addition, the connections of lines or connecting members between the components shown in the drawings are examples of functional connections and/or physical or circuit connections, and in an actual device they may be replaced by, or supplemented with, various functional connections, physical connections, or circuit connections. In addition, unless a component is specifically described with terms such as "essential" or "important", it may not be a component necessary for the application of the present invention.

Therefore, the spirit of the present invention should not be limited to the above-described embodiments, and not only the claims described below but also all scopes equivalent to or equivalently changed from the claims fall within the scope of the spirit of the present invention.

Claims

A method for switching an image from a first image to a second image, the method comprising:

calculating a similarity between at least one frame constituting the first image and at least one frame constituting each of a plurality of second images;

determining a first connection frame, which is a frame used at a time of switching from the first image to the second image, from among at least one frame of the first image, by referring to the similarity; and

determining, by referring to the similarity, a second image to be used for switching from among the plurality of second images, and a second connection frame, which is a frame used at the time of switching, from among at least one frame of the second image to be used for switching.

The image switching method of claim 1, further comprising,

prior to the step of calculating the similarity,

determining, from among at least one frame constituting the first image, at least one candidate frame to be compared with at least one frame constituting each of the plurality of second images,

wherein the step of calculating the similarity comprises

calculating a similarity between the at least one candidate frame and at least one frame constituting each of the plurality of second images.

The image switching method of claim 2,

wherein the step of determining the at least one candidate frame comprises

determining, as the at least one candidate frame, at least one frame of the first image corresponding to a predetermined time period from a first time point at which a need for image switching arises.

The image switching method of claim 2,

wherein the step of determining the at least one candidate frame comprises

determining a first frame of the first image and a last frame of the first image as the at least one candidate frame.

The image switching method of claim 1, further comprising

generating a transition image by connecting a third image including at least some frames of the first image and a fourth image including at least some frames of the second image to be used for switching,

wherein the third image is an image composed of at least some frames of the first image and has the first connection frame as its last frame, and

the fourth image is an image composed of at least some frames of the second image to be used for switching and has the second connection frame as its first frame.

The image switching method of claim 5,

wherein the first image is an image capturing a waiting state of an avatar,

the second image is an image capturing the avatar speaking, and

the image switching method further comprises, after the step of generating the transition image,

generating, from an input voice, a lip image provided together with the fourth image.

The image switching method of claim 6,

wherein the step of generating the lip image comprises:

obtaining a target voice to be used as the voice of the avatar;

generating a lip image corresponding to the target voice for each frame of the fourth image using a trained first artificial neural network; and

generating lip sync data including identification information of the fourth image, the lip image, and location information of the lip image in a frame of the fourth image.

The method of claim 7,

wherein the first artificial neural network is

a neural network trained to output, upon input of a voice and a first lip image, a second lip image in which the first lip image is transformed according to the voice.

The image switching method of claim 5,

wherein the step of generating the transition image comprises

generating configuration information of the transition image including identification information of frames included in each of the third image and the fourth image, and transmitting the configuration information to an external device.

A computer program stored in a recording medium to execute, using a computer, the method of any one of claims 1 to 9.