WO2022255980A1 - Virtual agent synthesis method with audio to video conversion - Google Patents

Virtual agent synthesis method with audio to video conversion Download PDF

Info

Publication number
WO2022255980A1
Authority
WO
WIPO (PCT)
Prior art keywords
computer
voice
video
face
speaker
Prior art date
Application number
PCT/TR2022/050507
Other languages
French (fr)
Inventor
Duygu CAKIR YENIDOGAN
Original Assignee
Bahcesehir Universitesi
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from TR2021/009134 external-priority patent/TR2021009134A1/en
Application filed by Bahcesehir Universitesi filed Critical Bahcesehir Universitesi
Publication of WO2022255980A1 publication Critical patent/WO2022255980A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • G10L2021/105Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present invention relates to an artificial/virtual agent synthesis method that performs the extraction of features from the voice, the production of fake faces matching the voice, the production of the video required for the training set, and the production of a real-time artificial representation. Particularly, the present invention relates to a virtual agent synthesis method that enables the extraction of features from the voice of the speaker and the synthesis of a GAN-based fake face image suitable for these features, the recording of a video and its synthesis in accordance with the artificial face to feed into the next training step, and the synthesis of a new video with the speaker's voice in real time.

Description

VIRTUAL AGENT SYNTHESIS METHOD WITH AUDIO TO VIDEO CONVERSION
Technical Field of the Invention
The present invention relates to an artificial/virtual representative synthesis method that performs the extraction of features from the voice, the production of fake faces matching the voice, the production of the video required for the training set, and the production of a real-time artificial representation.
Particularly, the present invention relates to a virtual agent synthesis method that enables the extraction of features from the voice of the speaker and the synthesis of a fake face image based on the Generative Adversarial Network (GAN) suitable for these features, recording a video and synthesizing it in accordance with the artificial face to feed into the next training step, and synthesizing a new video with the speaker's voice in real time.
State of the Art
People have long been interested in creating their own avatar (a person's graphical representation in the online environment), showing themselves in an exaggerated version, or disguising themselves in virtual environments. However, there are times when the party on the other side of the conversation wants to deal with a real-looking person, namely a real video. With the increasing use of online services, especially online education, it has become more important than ever to include and retain the user in the system. In the current period in particular, it is very important to employ few people in the field and to move the work to online platforms.
Studies have shown that using an avatar can help the user interact with the system and not leave too early [1-3]. In addition to attracting users to the system, an avatar also has the effect of changing a user's behavior: if the user chooses an attractive avatar, he/she behaves in a friendlier way, whereas if he/she chooses a rebellious-looking avatar, he/she exhibits more rebellious behavior than usual [4].
There have been groundbreaking changes in the world of computer vision, particularly with the study of Generative Adversarial Networks (GANs) [5], as machine learning techniques have become easier to implement and datasets have become widely available. GANs have found endless creative uses, such as GAN-based entertainment applications like Snapchat [6] or Reflect [7], transforming photos into the styles of famous artists [11], changing the appearance of animals between species [12], and recreating colors from a black-and-white photograph [13]. However, it has also become easy to carry out forgery attacks that create fake news [8,9] or fake political/financial statements [10].
By means of the GAN method, a face can be synthesized or modified as follows: 1. changing a facial attribute such as wearing glasses, changing hair color, or getting older [14]; 2. changing a facial expression such as smiling [15]; 3. swapping faces between two people [16]; and 4. synthesizing an entire face, that is, creating a face that does not exist at all when a command such as "create a young bald man with a mustache, brown eyes" is entered [17].
In the state of the art, the abstract of the invention that is the subject of the application numbered “TR2019/15715” contains information such as: "The present invention relates to a virtual assistant system and a method thereof with an end-to-end dialog system that is designed to meet customer questions, opinions, and requests more quickly and effectively within the scope of customer service, and that has strong memory features designed with an artificial intelligence infrastructure that can detect the content of speech, make sense of it, and produce answers according to scope and context."
In the state of the art, the abstract of the invention that is the subject of the application numbered “TR2011/04387” contains information such as: "This invention relates to a virtual assistant interacting with people, which can be used for promotional or informational purposes. The aim of this invention is to realize a virtual assistant in a realistic size and shape as if getting information from a live person during promotional activities. The invention also aims to ensure that the assistant is displayed only on a screen in the size of the assistant, even though the assistant is constantly moving in order to maintain the realistic features of the virtual assistant. In addition, the virtual assistant also aims to ensure that the images taken from a database are displayed separately such that they can answer the questions asked."
In the state of the art, the abstract of the invention that is the subject of the application numbered “TR2011/13014” contains information such as: "The invention relates to projection devices that reflect virtual presentations in human appearance or different images, and virtual assistant structures that perform presentations, and it is characterized by comprising a promotional stand that creates the virtual assistant, at least one transparent surface on the promotional stand, onto which the image from the projector is transferred, at least one touch screen that enables the desired intervention and promotion forms to be made during the promotion, at least one web camera and/or printer connected to the touch screen, at least one communication part that provides data and data exchange, and/or at least one sound system where the necessary sounds for promotion are obtained, and at least one control unit where the whole system is controlled."
In the state of the art, the invention that is the subject of the application numbered “KR2096598B1” mentions a computer program stored in a computer readable storage medium, wherein said computer program, when executed on one or more processors of a computing device, performs operations for generating a face animation. Here, the computer program comprises the operations of: receiving voice data; calculating the voice data using a first network function including two or more convolutional layers to output a feature vector related to a face pose for generating a face animation; and generating the face animation by calculating the feature vector related to the face pose using a second network function including two or more deconvolutional layers. In this invention, a 3D face animation is generated by calculating the sound data.
In the state of the art, the invention that is the subject of the application numbered “CN111145282” describes a technique that performs emotion analysis on the voice and transfers it to an avatar image. Said invention manages a virtual avatar instead of a real face texture; a human face that does not exist but looks real is not synthesized and managed.
In the state of the art, the invention that is the subject of the application numbered “US20120130717A1” mentions techniques for providing real-time animation for a personalized cartoon avatar. In the invention, a process trains one or more animated models to provide a set of probabilistic motions of one or more upper body parts based on speech and motion data. The process links one or more predetermined phrases of emotional states to the one or more animated models. After the models are created, the process receives real-time speech input. The process then identifies an emotional state to be expressed based on one or more predetermined expressions that match the real-time speech input. The process then creates an animated sequence of motions of one or more upper body parts by applying the one or more animated models in response to the real-time speech input. In this invention, a face animation (avatar) is created by calculating the voice data.
In the state of the art, audio conversion applications developed so far have either been realized as cartoon/graphics animation speech, or the audio-to-video conversion has been performed over a photo or video of a real person. Until now, no integrated, end-to-end system has been developed in which the voice of the speaker (call center employee, distance education teacher, person receiving training, etc.) is analyzed, a suitable photo is produced, and the photo is converted into a real-time speaking artificial person video.
Consequently, the disadvantages disclosed above and the inadequacy of available solutions in this regard necessitated making an improvement in the relevant technical field.
Objects of the Invention
The main object of the present invention is to use four sequential methods created with the Generative Adversarial Network (GAN) in order to convert a human voice into a real face video. In these methods, features are extracted from the voice of the speaker; a fake face image based on GAN is synthesized according to these features; a video is recorded and synthesized in accordance with the artificial/virtual face in order to feed it into the next training step; and a new video is synthesized with the speaker's voice in real time. The most important object of the present invention is to create a non-existing 2D real-looking face with characteristics suitable for the voice by extracting features from the voice data. Here, the features of the sound are extracted by using neural networks, a real-looking face video of a non-existing face is produced to match this sound by using a generative adversarial network, and the system is trained with this video. Thus, a system using a real-looking virtual/digital assistant can be online regardless of where the service is provided, whether it is the contact center office or the agent's home. Since the speaker only needs a high-quality microphone and a stable internet connection, a company using this type of online customer service can reduce the number of seats used, but not the number of employees; that is, the number of people in the same environment at the same time can be reduced. It can also be used to replace the images of teachers or students in online education.
Another important object of the present invention is to enable people connected to virtual agent synthesis to encounter a fake image that is close to the real personality of the speaker. Thus, a virtual agent suitable for the habits and personality of the person connected to the call center can be synthesized by using psychology-based studies on usage habits, and the approach of the person receiving the service may become more positive.
Another object of the present invention is to ignore emotional states and to work only on the synthesis of the lip area. Thus, in order to exclude any emotion, the video created is recorded as if the speaker were giving a political statement; in other words, no emotions or exaggerated facial expressions are recorded. Otherwise, the synthesized face could express a happy emotion while the other party is angry.
Another object of the present invention is to produce a version that can be easily used by the end user and that gives the same quality output, since creating a real-time result requires high computational power.
Yet another object of the present invention is to create a real-looking character with the audio-to-video conversion model by means of the method.
Description of the Figures
FIGURE 1 is the drawing that illustrates the diagram view of the main flow chart of the method according to the present invention.
FIGURE 2 is the drawing that illustrates the diagram view of the extraction of the audio features from the audio file of the method according to the present invention.
FIGURE 3 is the diagram of the method according to the present invention, showing the process of making the virtual agent speak in real time as the voice of the live speaker enters the computer from the microphone.
Reference Numerals
10. VoxCeleb database
110. Transferring the user's voice to the computer with the microphone.
120. Extracting the characteristic features of the voice from the created cluster by the computer.
130. Synthesizing an image output containing the properties of the voice by means of the Generative Adversarial Network, by computer.
140. Recording a 10-15 hour video by the computer and synthesizing it in accordance with the artificial face by the computer.
150. Generating a video with the help of artificial intelligence from the artificial image obtained by synthesizing the voice, by the computer.
151. Extracting the landmark points of the faces in all the frames in the video after recording the video by the computer.
152. After obtaining the landmarks from each frame, transferring these points to the source image by the computer.
160. Preparing the training video set for real-time synchronized synthesis by the computer.
161. Obtaining the landmarks of the mouth shape suitable for the voice coming to the computer, by the computer, by transferring a new simultaneous voice of the speaker to the computer with the microphone.
162. Blending the obtained points with the face in the next frame in the video of the training set by the artificial intelligence assisted computer.
163. Synchronizing the facial and lip movements synthesized on the realistic face created from the sound with the live sound taken from the microphone, and displaying it to the 3rd parties in real time by the computer.
170. Displaying a virtual/fake face image similar to one's self-image on a device screen as synchronized and simultaneously with speech.
Description of the Invention
The method of the present invention enables extracting the features from the speaker's voice, synthesizing a fake face image based on the Generative Adversarial Network (GAN) suitable for these features, recording a video and synthesizing it in accordance with the artificial face to feed into the next training step, and synthesizing new video with the speaker's voice in real time.
The method of the present invention synthesizes a virtual agent from the sound received from a microphone with an artificial intelligence supported computer. In virtual agent synthesis, four different methods are used sequentially in order to convert audio into video generated by the Generative Adversarial Network (GAN). The method of the present invention basically comprises the process steps of extracting the features from the speaker's voice; synthesizing a fake face image based on the Generative Adversarial Network (GAN) suitable for these features; recording 10-15 hours of video and synthesizing it in accordance with the artificial face to feed into the next training step; and synthesizing a new video with the speaker's voice in real time.
The first three steps are necessary and time consuming; however, they are only run once in the system, and when they are finished the training phase is complete. After that, the real-time step 4 starts. The general flow of the method of the present invention is shown in Figure 1. Extracting features from the voice of the speaker, which is one of the basic steps of the method that is the subject of the present invention, starts with the computer converting the speaker's voice into a face synthesized by the GAN. In order to present a realistic/believable image, the input audio should be associated with the output image; for example, a young woman's voice should not be matched with an older man. The speaker's voice, received from the microphone (110), enters the computer along with an audio-visual dataset, and a set of speaker attribute tables of the output is kept by the computer. The outline of this process is shown in Figure 2. A voice dataset is a set containing labels such as age, gender, and ethnicity. The recommended dataset to be used in training is the VoxCeleb database (10).
When a new voice input comes from the microphone to the system trained with the VoxCeleb database (10), the computer extracts the age, gender, and race characteristics of the voice in multiple ways with a deep learning-based architecture. Thus, the voice data is calculated by the computer, and a non-existing 2D real-looking face is created with characteristics suitable for the voice.
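As a concrete illustration of this feature-extraction step, the following is a minimal sketch of predicting age, gender, and ethnicity labels from a voice recording with a small convolutional network over log-mel spectrogram features. The network layout, label counts, and file handling are illustrative assumptions; the patent does not specify a particular architecture.

```python
# Hypothetical sketch: predicting speaker attributes (age group, gender, ethnicity)
# from a voice clip, as assumed for the feature-extraction step.
import librosa
import numpy as np
import torch
import torch.nn as nn

class VoiceAttributeNet(nn.Module):
    """Small CNN over a log-mel spectrogram with one classification head per attribute."""
    def __init__(self, n_age=8, n_gender=2, n_ethnicity=5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.age_head = nn.Linear(32, n_age)
        self.gender_head = nn.Linear(32, n_gender)
        self.ethnicity_head = nn.Linear(32, n_ethnicity)

    def forward(self, mel):
        h = self.backbone(mel)
        return self.age_head(h), self.gender_head(h), self.ethnicity_head(h)

def voice_to_attributes(wav_path, model):
    # Load audio, compute a log-mel spectrogram, and run the classifier heads.
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    mel = np.log(mel + 1e-6)[None, None]                  # shape (1, 1, n_mels, time)
    with torch.no_grad():
        age, gender, eth = model(torch.tensor(mel, dtype=torch.float32))
    return age.argmax(-1), gender.argmax(-1), eth.argmax(-1)
```

In practice such a model would be trained on a labelled corpus such as the VoxCeleb database (10) before being applied to live microphone input.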
In synthesizing and producing a GAN-based fake face image in accordance with the extracted features, which is one of the basic steps of the present invention, after a sound profile (sound properties) has been received, a realistic face must be created that matches the sound with a high level of believability in order to grab the audience's attention. This section can be personalized according to the speaker's desires or the habits of the audience; however, it still needs to comply with certain characteristics such as age, gender, and ethnicity. The CelebA dataset is used as the computer dataset for the training of this section. The dataset comprises real and fake faces and the features of those faces, and when a new feature list arrives, the computer uses this dataset to synthesize (generate) a new face.
The computer transmits the features of the voice and all images obtained by extracting the features from the voice of the speaker to the data set (CelebA). The computer synthesizes an image output containing the properties of the sound by means of the Generative Adversarial Network.
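The face-synthesis step can be pictured with the sketch below of a conditional GAN generator that maps a noise vector plus the voice-derived attribute vector (for example a CelebA-style 40-attribute vector) to a face image. The layer sizes and the 32x32 output resolution are illustrative assumptions, not values given in the patent.

```python
# Hypothetical conditional-GAN generator sketch: noise + attribute vector -> face image.
import torch
import torch.nn as nn

class AttributeConditionedGenerator(nn.Module):
    def __init__(self, z_dim=100, attr_dim=40, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # Project the concatenated (noise, attributes) vector to a 4x4 map, then upsample.
            nn.ConvTranspose2d(z_dim + attr_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, img_channels, 4, 2, 1), nn.Tanh(),   # 32x32 RGB output
        )

    def forward(self, z, attrs):
        # Conditioning: concatenate the attribute vector to the latent code.
        x = torch.cat([z, attrs], dim=1)[:, :, None, None]
        return self.net(x)

# Usage: the attribute vector predicted from the voice drives which face is generated.
generator = AttributeConditionedGenerator()
fake_face = generator(torch.randn(1, 100), torch.rand(1, 40))   # tensor of shape (1, 3, 32, 32)
```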
In producing the video necessary for the training set, which is one of the basic steps of the present invention, 10-15 hours of video is recorded by the computer and synthesized in accordance with the artificial face in order to feed into the training step. Video production research has shown that the longer the video used in training, the more realistic the lip movements produced. Therefore, a single fake image is not sufficient to synchronize lip landmarks, and at least 10-15 hours (or more) of video recording is required for the training set. The fake face of the speaker is reproduced using a 10-15 hour video of the speaker himself/herself in order to obtain a 10-15 hour video of the fake face. For this step of the method, a video is produced from a single image with the help of artificial intelligence.
In video production, the source is the face synthesized in the previous step. After recording the video of the target speaker, the landmark points of the faces in all the frames (every frame in the video) are extracted. The Viola-Jones object detection method is utilized to find faces in a frame and mark the landmark points. After the landmarks are obtained from each frame, these points are transferred to the source image. Thus, the output produced by the computer is a 10-15 hour speech video of the source image.
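A greatly simplified sketch of this step is given below: Viola-Jones face detection with OpenCV's Haar cascade, landmark extraction with the LBF facemark model from opencv-contrib, and transfer of each frame's landmarks onto the synthesized source face via a similarity warp. The model file name is an assumption (the LBF model must be obtained separately), and the plain warp merely stands in for the GAN-based synthesis the method actually uses.

```python
# Hypothetical sketch of the training-video step: detect faces (Viola-Jones), extract
# landmarks per frame, and map the driving frames' landmarks onto the synthesized face.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
facemark = cv2.face.createFacemarkLBF()      # requires opencv-contrib-python
facemark.loadModel("lbfmodel.yaml")          # assumed pre-downloaded LBF landmark model

def frame_landmarks(image):
    """Return the 68 landmark points of the first detected face, or None."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    ok, points = facemark.fit(gray, faces)
    return points[0][0] if ok else None

def transfer_to_source(source_face, source_pts, target_pts):
    """Warp the synthesized source face so its landmarks follow the target frame's landmarks."""
    matrix, _ = cv2.estimateAffinePartial2D(source_pts.astype(np.float32),
                                            target_pts.astype(np.float32))
    h, w = source_face.shape[:2]
    return cv2.warpAffine(source_face, matrix, (w, h))

def build_training_video(source_face, speaker_video_path):
    """Yield frames of the synthesized face driven by the speaker's 10-15 hour recording."""
    source_pts = frame_landmarks(source_face)
    cap = cv2.VideoCapture(speaker_video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        target_pts = frame_landmarks(frame)
        if target_pts is not None:
            yield transfer_to_source(source_face, source_pts, target_pts)
    cap.release()
```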
In obtaining a real-time virtual agent, which is one of the basic steps of the present invention, after a dataset consisting of at least ten hours of video of the speaker has been created with the artificial intelligence assisted computer, the time-consuming part for the speaker and the computer is completed and the training video set is ready for the real-time part. In the last part of the algorithm, where the speaker's video and a new voice enter the system, the oral (mouth) landmarks corresponding to the voice are extracted and then mapped to every frame in the video; a new GAN-based model is used for the production and movement of the oral tissues. In this model, the landmarks of the mouth shape suitable for the sound entering the system are obtained, and these points are then blended with the face in the next frame in the video. In this process, only the mouth part of the video, not the whole video, is shaped according to the incoming sound. The virtual agent video produced by the computer, which is the output for the speaker, can be followed by the user from all smart devices such as a phone, tablet, or computer.
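The real-time stage could be organized roughly as in the sketch below, where each live audio chunk from the microphone drives mouth landmarks that are blended into the next frame of the training video. The functions predict_mouth_landmarks and blend_mouth_region stand for the GAN-based components described above; they, the chunk length, and the sounddevice-based capture are assumptions made for illustration only.

```python
# Hypothetical real-time loop: live audio chunks reshape only the mouth region of
# pre-recorded training-video frames.
import numpy as np
import sounddevice as sd   # assumed audio-capture library
import cv2

SAMPLE_RATE = 16000
CHUNK_SECONDS = 0.04       # roughly one 25 fps video frame per audio chunk

def run_virtual_agent(training_frames, predict_mouth_landmarks, blend_mouth_region):
    """training_frames: iterator over frames of the pre-synthesized training video."""
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1) as stream:
        for frame in training_frames:
            audio, _ = stream.read(int(SAMPLE_RATE * CHUNK_SECONDS))
            # GAN-based model maps the audio chunk to mouth-shape landmarks (placeholder call).
            mouth_pts = predict_mouth_landmarks(np.squeeze(audio))
            # Only the mouth region of the next frame follows the incoming sound.
            output = blend_mouth_region(frame, mouth_pts)
            cv2.imshow("virtual agent", output)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    cv2.destroyAllWindows()
```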
In the method of the present invention, the voice of the user is received by the computer with a microphone, the characteristic features suitable for this voice are extracted by the artificial intelligence assisted computer, and a fake face suitable for these features is also created by the computer. The created face is similar to a real human face. Then, a video is recorded for the training set on the computer containing artificial intelligence, and facial and lip movements are synthesized from the training set. Finally, the user's voice is received with the microphone in real time, the facial and lip movements synthesized on the realistic face created from the sound are synchronized with the live sound, and the result is displayed to third parties in real time. Thus, a virtual/fake face image similar to the person's own image can be displayed on any screen, synchronized and simultaneous with the speech.
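Putting the four stages together, the following orchestration sketch shows how the components could be chained; every callable passed in is a placeholder for the corresponding model described in the text, not an interface defined by the patent.

```python
# Hypothetical end-to-end orchestration of the four stages described above.
from typing import Callable, Iterable, Iterator

def synthesize_virtual_agent(
    extract_voice_attributes: Callable,   # stage 1: voice -> age/gender/ethnicity features
    generate_face: Callable,              # stage 2: features -> non-existing real-looking face
    build_training_video: Callable,       # stage 3: face + 10-15 h speaker video -> frames
    predict_mouth_landmarks: Callable,    # stage 4a: live audio chunk -> mouth landmarks
    blend_mouth_region: Callable,         # stage 4b: frame + landmarks -> output frame
    enrollment_wav: str,
    speaker_video: str,
    live_audio_chunks: Iterable,
) -> Iterator:
    attributes = extract_voice_attributes(enrollment_wav)
    source_face = generate_face(attributes)
    frames = build_training_video(source_face, speaker_video)
    # Real-time stage: only the mouth region of each frame follows the live voice.
    for frame, chunk in zip(frames, live_audio_chunks):
        yield blend_mouth_region(frame, predict_mouth_landmarks(chunk))
```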
The detailed process steps of the virtual agent synthesis method with a realistic face, which is the subject of the invention, are as follows:
- Transferring the user's voice to the computer with the microphone (110),
- Creating a cluster with artificial intelligence from the received sound by the computer,
- Extracting the characteristic features of the voice from the created cluster by the computer (120)
- Transmitting the features of the voice obtained by extracting the features from the voice of the speaker to the dataset by the computer,
- Synthesizing an image output containing the properties of the voice by means of the Generative Adversarial Network, by computer (130),
- Creating a non-existing 2D real-looking face of a face synthesized by the GAN with the speaker's voice by the computer,
- Recording a 10-15 hour video by the computer and synthesizing it in accordance with the artificial face by the computer (140),
- Generating a video with the help of artificial intelligence from the artificial image obtained by synthesizing the voice, by the computer (150),
- Reproducing the fake face produced to represent the speaker with a 10 to 15 hour video of the speaker himself/herself by the computer,
- Extracting the landmark points of the faces in all the frames in the video after recording the video by computer (151),
- After obtaining the landmarks from each frame, transferring these points to the source image by the computer (152),
- Preparing the training video set for real-time synchronized synthesis by the computer (160),
- Obtaining the landmarks of the mouth shape suitable for the voice coming to the computer by the computer by transferring a new simultaneous voice of the speaker to the computer with the microphone (161),
- Blending the obtained points with the face in the next frame in the video of the training set by the artificial intelligence assisted computer (162),
- Synthesizing the computer-generated virtual agent video, which is the speaker output, with the computer in accordance with the mouth and lip shape simultaneously with the speaker's voice,
- Synchronizing the facial and lip movements synthesized on the realistic face created from the sound with the live sound taken from the microphone, and displaying it to the 3rd parties in real time by the computer (163),
- Displaying a virtual/fake face image similar to one's self-image on a device screen as synchronized and simultaneously with speech (170).
The virtual agent synthesis method, which is the subject of the present invention, does not take emotional states into account, but works only on the synthesis of the lip area. In order to exclude any emotion, the computer-generated video is recorded as if the speaker were giving a political statement; in other words, no emotions or exaggerated expressions/mimics are recorded. Otherwise, the synthesized face could express a happy emotion while the other party is angry. However, the method is not necessarily limited to the lip area; all the emotional expressions that can be transferred may be transferred to the face.
Persons connected to the virtual representative produced by the virtual agent synthesis method, which is the subject of the invention, encounter a fake image that is close to the real personality of the speaker. A virtual agent suitable for the habits and personality of the person connected to the call center can be synthesized by using psychology-based studies on usage habits, and the approach of the person receiving the service may become more positive.
REFERENCES
[1] McMahan, A. (2003). Immersion, engagement and presence. The video game theory reader, 67, 86.
[2] Lee, H., & Doh, Y. Y. (2012, August). A study on the relationship between educational achievement and emotional engagement in a gameful interface for video lecture systems. In 2012 International Symposium on Ubiquitous Virtual Reality (pp. 34-37). IEEE.
[3] Mahyar, N., Kim, S. H., & Kwon, B. C. (2015, October). Towards a taxonomy for evaluating user engagement in information visualization. In Workshop on Personal Visualization: Exploring Everyday Life (Vol. 3, p. 2).
[4] Yee, N., & Bailenson, J. (2007). The Proteus effect: The effect of transformed self representation on behavior. Human communication research, 33(3), 271-290.
[5] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672-2680).
[6] “Snapchat,” last accessed: 2020-06-25. [Online]. https://www.snapchat.com/
[7] “Reflect,” last accessed: 2020-06-25. [Online]. https://reflect.tech/
[8] Botha, J., & Pieterse, H. (2020, March). Fake News and Deepfakes: A Dangerous Threat for 21st Century Information Security. In ICCWS 2020 15th International Conference on Cyber Warfare and Security (p. 57). Academic Conferences and publishing limited.
[9] Tariq, S., Lee, S., Kim, H., Shin, Y., & Woo, S. S. (2019, April). GAN is a friend or foe? a framework to detect various fake face images. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (pp. 1296-1303).
[10] Zhang, X., & Ghorbani, A. A. (2020). An overview of online fake news: Characterization, detection, and discussion. Information Processing & Management, 57(2), 102025.
[11] Liu, H., Michelini, P. N., & Zhu, D. (2018, August). Artsy-GAN: A style transfer system with improved quality, diversity and performance. In 2018 24th International Conference on Pattern Recognition (ICPR) (pp. 79-84). IEEE.
[12] Choi, Y., Uh, Y., Yoo, J., & Ha, J. W. (2020). Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8188-8197).
[13] Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (pp. 2223-2232).
[14] Yang, H., Huang, D., Wang, Y., & Jain, A. K. (2019). Learning continuous face age progression: A pyramid of gans. IEEE transactions on pattern analysis and machine intelligence.
[15] Wang, X., Wang, Y., & Li, W. (2019). U-Net Conditional GANs for Photo- Realistic and Identity-Preserving Facial Expression Synthesis. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15(3s), 1-23.
[16] Li, L., Bao, J., Yang, H., Chen, D., & Wen, F. (2019). FaceShifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457.
[17] Tolosana, R., Vera-Rodriguez, R., Fierrez, J., Morales, A., & Ortega-Garcia, J. (2020). Deepfakes and beyond: A survey of face manipulation and fake detection. arXiv preprint arXiv:2001.00179.

Claims

1. A method of generating a virtual representative from one's voice, characterized in that, it comprises the process steps of;
- Transferring the user's voice to the computer with the microphone (110),
- Creating a cluster with artificial intelligence from the received sound by the computer,
- Extracting the characteristic features of the voice from the created cluster by the computer (120),
- Transmitting the features of the voice obtained by extracting the features from the voice of the speaker to the dataset by the computer,
- Synthesizing an image output containing the properties of the voice by means of the Generative Adversarial Network, by computer (130),
- Creating a non-existing 2D real-looking face of a face synthesized by the GAN with the speaker's voice by the computer,
- Recording a 10-15 hour video by the computer and synthesizing it in accordance with the artificial face by the computer (140),
- Generating a video with the help of artificial intelligence from the artificial image obtained by synthesizing the voice, by the computer (150),
- Reproducing the fake face produced to represent the speaker with a 10 to 15 hour video of the speaker himself/herself by the computer,
- Extracting the landmark points of the faces in all the frames in the video after recording the video by computer (151),
- After obtaining the landmarks from each frame, transferring these points to the source image by the computer (152),
- Preparing the training video set for real-time synchronized synthesis by the computer (160),
- Obtaining the landmarks of the mouth shape suitable for the voice coming to the computer by the computer by transferring a new simultaneous voice of the speaker to the computer with the microphone (161),
- Blending the obtained points with the face in the next frame in the video of the training set by the artificial intelligence assisted computer (162),
- Synthesizing the computer-generated virtual agent video, which is the speaker output, with the computer in accordance with the mouth and lip shape simultaneously with the speaker's voice,
- Synchronizing the facial and lip movements synthesized on the realistic face created from the sound with the live sound taken from the microphone, and displaying it to the 3rd parties in real time by the computer (163),
- Displaying a virtual/fake face image similar to one's self-image on a device screen as synchronized and simultaneously with speech (170).
2. Method of generating an agent according to Claim 1, characterized in that, in the process step of creating a cluster with artificial intelligence from the received sound by the computer, it comprises the process steps of;
• Receiving the speaker's voice from the microphone, and entering it to the computer along with an audio-visual dataset, and
• keeping a set of properties table of the speaker of the output by the computer.
3. Method of generating an agent according to Claim 1, characterized in that, in the process step of extracting the characteristic features of the voice from the created cluster by the computer (120), said voice dataset comprises age, gender, and ethnicity tags.
4. Method of generating an agent according to Claim 1, characterized in that, in the process step of extracting the characteristic features of the voice from the created cluster by the computer (120), said voice dataset is the VoxCeleb database (10).
5. Method of generating an agent according to Claim 1, characterized in that, in the process step of extracting the characteristic features of the voice from the created cluster by the computer (120); it comprises the process step of; when a new voice input comes from the microphone to the system trained with the VoxCeleb database (10), multiple extraction of the age, gender and race characteristics with deep learning-based architecture by the computer.
6. Method of generating an agent according to Claim 1, characterized in that, in the process step of synthesizing an image output containing the properties of the voice by means of the Generative Adversarial Network, by computer (130); for the training of said synthesis, the computer uses the CelebA dataset as the dataset.
7. Method of generating an agent according to Claim 1 , characterized in that, in the process step of synthesizing an image output containing the properties of the voice by means of the Generative Adversarial Network, by computer (130); the dataset comprises real and fake faces and features of those faces.
8. Method of generating an agent according to Claim 1, characterized in that, in the process step of synthesizing an image output containing the properties of the voice by means of the Generative Adversarial Network, by computer (130); it comprises the process step of using the dataset to synthesize a new face by the computer when a new feature list comes.
9. Method of generating an agent according to Claim 1, characterized in that, in the process step of blending the obtained points with the face in the next frame in the video of the training set by the artificial intelligence assisted computer (162), the computer only shapes the mouth part according to the incoming sound.
PCT/TR2022/050507 2021-06-02 2022-05-31 Virtual agent synthesis method with audio to video conversion WO2022255980A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
TR2021009134 2021-06-02
TR2021/009134 TR2021009134A1 (en) 2021-06-02 Artificial representation synthesis method with audio-to-video conversion.

Publications (1)

Publication Number Publication Date
WO2022255980A1 true WO2022255980A1 (en) 2022-12-08

Family

ID=84323639

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/TR2022/050507 WO2022255980A1 (en) 2021-06-02 2022-05-31 Virtual agent synthesis method with audio to video conversion

Country Status (1)

Country Link
WO (1) WO2022255980A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145282A (en) * 2019-12-12 2020-05-12 科大讯飞股份有限公司 Virtual image synthesis method and device, electronic equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145282A (en) * 2019-12-12 2020-05-12 科大讯飞股份有限公司 Virtual image synthesis method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DAS DIPANJAN; BISWAS SANDIKA; SINHA SANJANA; BHOWMICK BROJESHWAR: "Speech-Driven Facial Animation Using Cascaded GANs for Learning of Motion and Texture", Advances in Intelligent Data Analysis XIX; Lecture Notes in Computer Science, 23 August 2020 (2020-08-23), pages 408 - 424, XP047562725, ISSN: 0302-9743, ISBN: 9783540781967 *
SINHA SANJANA ET AL.: "Identity-Preserving Realistic Talking Face Generation", 2020 International Joint Conference on Neural Networks (IJCNN), 19 July 2020 (2020-07-19), pages 1 - 10, XP033833875, DOI: 10.1109/IJCNN48605.2020.9206665 *

Similar Documents

Publication Publication Date Title
US11818506B2 (en) Circumstances based 3D representations of participants of virtual 3D communications
US20160134840A1 (en) Avatar-Mediated Telepresence Systems with Enhanced Filtering
US20210392296A1 (en) Predicting behavior changes of a participant of a 3d video conference
US11657557B2 (en) Method and system for generating data to provide an animated visual representation
US20180253895A1 (en) System and method for creating a full head 3d morphable model
Manolova et al. Context-aware holographic communication based on semantic knowledge extraction
US11908056B2 (en) Sentiment-based interactive avatar system for sign language
WO2022106654A2 (en) Methods and systems for video translation
US11790535B2 (en) Foreground and background segmentation related to a virtual three-dimensional (3D) video conference
US11870939B2 (en) Audio quality improvement related to a participant of a virtual three dimensional (3D) video conference
CN115049016A (en) Model driving method and device based on emotion recognition
Ali et al. Lip syncing method for realistic expressive 3D face model
CN116721190A (en) Voice-driven three-dimensional face animation generation method
Nguyen et al. Evaluating the translation of speech to virtually-performed sign language on AR glasses
Liu et al. 4D facial analysis: A survey of datasets, algorithms and applications
CN117171392A (en) Virtual anchor generation method and system based on nerve radiation field and hidden attribute
WO2022255980A1 (en) Virtual agent synthesis method with audio to video conversion
Fu et al. Design and application of virtual avatar framework based on e-commerce live streaming
WO2021155666A1 (en) Method and apparatus for generating image
Zhen et al. Research on the Application of Virtual Human Synthesis Technology in Human-Computer Interaction
CN113436302A (en) Face animation synthesis method and system
Cakir et al. Audio to video: Generating a talking fake agent
Quan et al. Facial animation using CycleGAN
CN114363557B (en) Semantic fidelity-oriented virtual conference method and three-dimensional virtual conference system
Prasetyahadi et al. Eye lip and crying expression for virtual human

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22816578

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE