WO2024117616A1

WO2024117616A1 - System and method for providing metaverse service using digital human capable of real-time synchronization and interaction using camera and motion capture recognition

Info

Publication number: WO2024117616A1
Application number: PCT/KR2023/018290
Authority: WO
Inventors: 고기훈; 조풍연
Original assignee: 메타빌드주식회사
Priority date: 2022-11-30
Filing date: 2023-11-14
Publication date: 2024-06-06

Abstract

In the present invention, the facial expression and motion of the user may be recognized in real time by using a camera provided in a user terminal, and the change in facial expression and motion of the user may be reflected in the face and body of a digital human in real time, so that a natural facial expression, muscle movement, or the like of the digital human may be expressed more realistically. In addition, because customization of a user's desired form may be reflected in real time, there are advantages in that real-time interaction with the user is possible and the digital human with maximized reality may be used for a wider variety of metaverse services, such as music performances and live broadcasting services. In addition, a rendered image of the digital human is generated in a cloud server so that a hyperrealistic digital human may be generated in real time even if the user terminal is not a high-end computer, and thus, more users may more easily generate digital humans and apply the generated digital humans to various platforms.

Description

A system and method for providing a metaverse service using digital humans capable of real-time synchronization and interaction using cameras and motion capture recognition

The present invention relates to a system and method for providing a metaverse service using digital humans capable of real-time synchronization and interaction using cameras and motion capture recognition. More specifically, it relates to a system and method in which changes in the user's facial expression or motion are reflected in a real-time rendering environment using cameras and motion capture recognition, so that digital humans can be expressed in a hyper-realistic manner and customization in the form desired by the user can be reflected in real time.

In general, "metaverse" is a compound of "meta", meaning fictitious or abstract, and "universe", meaning the real world, and refers to a three-dimensional virtual world. The metaverse is a three-dimensional spatial platform that allows users to engage, through avatars, in social, economic, educational, cultural, scientific, and technological activities similar to those of actual reality.

As interest in and use of the metaverse increases, avatars in the virtual world are evolving from IDs made up of text only to two-dimensional or three-dimensional cyber characters. Moreover, recently, interest in 3D digital humans that utilize not only character images but also movements and voices is increasing.

The purpose of the present invention is to provide a system and method for producing more realistic digital humans and providing a metaverse service using digital humans that can be used in various fields requiring real-time interaction.

A metaverse service method using digital humans capable of real-time synchronization and interaction according to the present invention includes the steps of: when an application provided by a cloud server is executed on a user terminal, recognizing, by a camera provided in the user terminal, the user's facial expression and motion, and transmitting tracking information according to facial expression changes and motion changes to the cloud server; checking, by the cloud server, the tracking information, extracting at least one of a plurality of wrinkle maps, a plurality of normal maps, and a plurality of displacement maps, each pre-generated differently for a plurality of facial expressions, changing the extracted map according to the tracking information on the user's facial expression, and rendering it on the face of a digital human in real time; generating, by the cloud server, a rendered image of the digital human by connecting the tracking information on the user's motion change to the bones of the digital human and rendering the body of the digital human in real time to generate an animation; and transmitting, by the cloud server, the rendered image of the digital human to the user terminal in real time.

The method further includes a step in which a voice recognizer provided in the user terminal recognizes the user's voice and transmits the resulting voice data to the cloud server in real time.

The step of generating the rendered image further includes a process in which the cloud server learns the voice data through deep learning and synchronizes the user's voice with the digital human's voice in real time.

The cloud server displays a user interface to enable selection of at least one of skin texture, skin tone, hairstyle, eye color, background image, clothing, accessory, motion, and voice on the user terminal.

The step of generating the rendered image further includes, when information is input through the user interface, rendering the digital human in real time according to the input information.

The cloud server streams the rendered image of the digital human in real time to preset terminals through web real-time communication (WebRTC).

The cloud server extracts all of the plurality of wrinkle maps, the plurality of normal maps, and the plurality of displacement maps according to the tracking information.

The camera includes a TrueDepth camera.

The tracking information is depth information.

A metaverse service method using digital humans capable of real-time synchronization and interaction according to another aspect of the present invention includes the steps of: when an application provided by a cloud server is executed on a user terminal, recognizing, by a camera provided in the user terminal, the user's facial expression and motion, and transmitting tracking information according to facial expression changes and motion changes to the cloud server; checking, by the cloud server, the tracking information, extracting at least one of a plurality of wrinkle maps, a plurality of normal maps, and a plurality of displacement maps, each pre-generated differently for a plurality of facial expressions, changing the extracted map according to the tracking information on the user's facial expression, and rendering it on the face of a digital human in real time; generating, by the cloud server, a rendered image of the digital human by connecting the tracking information on the user's motion change to the bones of the digital human and rendering the body of the digital human in real time to generate an animation; and transmitting, by the cloud server, the rendered image of the digital human to the user terminal in real time. When a voice recognizer provided in the user terminal recognizes the user's voice and transmits the resulting voice data to the cloud server in real time, the method further includes a process in which the cloud server learns the voice data through deep learning and synchronizes the user's voice with the digital human's voice in real time.

The present invention includes a system that provides a metaverse service using digital humans capable of real-time synchronization and interaction as described above.

The present invention recognizes the user's facial expressions and motions in real time using a camera provided in the user terminal and reflects changes in the user's facial expression and motion in real time on the face and body of the digital human, so that the digital human's natural facial expressions and muscle movements can be expressed more realistically.

In addition, different wrinkle maps, normal maps, and displacement maps for various facial expressions are created in advance on the cloud server, and the appropriate map is extracted and applied according to the tracking information recognized by the camera, so that changes in the user's facial expression and motion can be reflected in real time.

In addition, it is possible to reflect the user's desired customization in real time, which has the advantage of enabling real-time interaction with the user and utilizing digital humans with maximized realism in a wider variety of metaverse services such as music performances and live broadcasting services.

In addition, because the rendered image of the digital human is created on the cloud server, hyper-realistic digital humans can be created in real time even if the user terminal is not a high-end computer, allowing more users to create digital humans more easily and apply them to various platforms.

Figure 1 is a diagram schematically showing a system that provides a metaverse service using digital humans according to an embodiment of the present invention.

Figure 2 is a flowchart schematically showing a metaverse service method using digital humans according to an embodiment of the present invention.

Figure 3 shows an example of a screen for creating a basic character appearance of a digital human in the metaverse service method using a digital human according to an embodiment of the present invention.

Figure 4 shows the structure of a skin shader node in the metaverse service method using digital humans according to an embodiment of the present invention.

Figure 5 shows an example of recognizing a user's facial expression and rendering it on the face of a digital human in real time in the metaverse service method using a digital human according to an embodiment of the present invention.

Figure 6 shows an example of recognizing a user's motion and rendering it in real time on the body of a digital human in the metaverse service method using a digital human according to an embodiment of the present invention.

Figure 7 shows an example of an SVS model for high-performance AI vocal voice synthesis technology in the metaverse service method using digital humans according to an embodiment of the present invention.

Figure 8 shows an example of reflecting the clothing selected by the user on the digital human in real time in the metaverse service method using a digital human according to an embodiment of the present invention.

Figure 9 shows a configuration diagram of video streaming network connection middleware according to an embodiment of the present invention.

Figure 10 shows an example of real-time interaction with digital humans in the metaverse service method using digital humans according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the attached drawings.

The present invention relates to a system and method for providing a metaverse service using digital humans capable of real-time synchronization and interaction using cameras and motion capture recognition.

Referring to Figures 1 to 10, the present invention relates to a system and method in which a node structure and an interface are developed so that, in an Unreal-based real-time rendering environment, digital humans can be expressed hyper-realistically and customized in the form desired by the user in real time, and in which hyper-realistic digital humans can be freely interacted with and utilized through cloud rendering based on pixel streaming and WebRTC.

Referring to FIG. 1, a system that provides a metaverse service using digital humans according to an embodiment of the present invention includes a cloud server and a user terminal.

The cloud server is a service server for providing and performing metaverse services using digital humans, and provides applications.

The user terminal is a personal terminal possessed by the user and includes, for example, a smart phone capable of wired or wireless communication, a tablet PC, a computer, etc.

The application provided by the cloud server is installed on the user terminal.

The user terminal is equipped with a camera and a voice recognizer.

The camera recognizes the user's facial expression and body motion and generates tracking information according to changes in the facial expression and motion. In this embodiment, the camera recognizes not only the user's facial expression but also the user's motion and generates motion data. The camera may be a TrueDepth camera, an RGB camera, or the like. However, the invention is not limited to this, and motion capture equipment separate from the camera may additionally be used to recognize the user's motion.

In this embodiment, the voice recognizer is described as being, for example, a microphone that picks up the user's voice.

Middleware and an interface are provided between the cloud server and the user terminal.

The cloud server and the user terminal communicate based on web real-time communication (WebRTC).

With reference to FIG. 2, a metaverse service method using digital humans according to an embodiment of the present invention will be described as follows.

First, the application provided by the cloud server is executed on the user terminal (S1).

When the application is executed, the camera provided in the user terminal recognizes the user's facial expressions and body motions, generates tracking information based on facial expression changes and motion changes, and transmits it to the cloud server (S2).

The tracking information includes change values detected by the camera at a number of preset tracking points according to facial expression changes and motion changes when the camera recognizes the user's face.
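As an illustration of what such tracking information might look like on the wire, the following is a minimal sketch, not the format used by the invention. The class name `TrackingFrame`, the field names, the tracking-point names, and the choice of JSON are all assumptions made for this example.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TrackingFrame:
    """One frame of tracking information (hypothetical layout)."""
    timestamp_ms: int
    # Change value per preset facial tracking point, e.g. {"jaw_open": 0.10}
    face: dict[str, float] = field(default_factory=dict)
    # Positional change per body tracking point, as (dx, dy, dz)
    body: dict[str, tuple[float, float, float]] = field(default_factory=dict)

    def to_json(self) -> str:
        # Tuples become JSON arrays; keys are sorted for stable output.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "TrackingFrame":
        raw = json.loads(payload)
        # Restore the (dx, dy, dz) tuples that JSON turned into lists.
        body = {name: tuple(delta) for name, delta in raw["body"].items()}
        return cls(timestamp_ms=raw["timestamp_ms"], face=raw["face"], body=body)


# The user terminal builds a frame, and the cloud server decodes it.
frame = TrackingFrame(
    timestamp_ms=1000,
    face={"brow_inner_up": 0.42, "jaw_open": 0.10},
    body={"left_wrist": (0.0, 0.05, -0.01)},
)
restored = TrackingFrame.from_json(frame.to_json())
```

In this sketch each frame carries only change values at named tracking points, which keeps the payload small compared with sending camera images.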

In the cloud server, different wrinkle maps, normal maps, and displacement maps are generated and stored in advance according to a plurality of facial expressions. That is, a plurality of wrinkle maps are generated in advance, different from each other according to a plurality of facial expressions. A plurality of normal maps are also generated in advance depending on the plurality of facial expressions. A plurality of displacement maps are also generated in advance depending on the plurality of facial expressions.

Here, the wrinkle map is created to express wrinkles that appear or disappear according to changes in facial expression, and includes a number of dynamic nodes. For example, it expresses the wrinkles that appear on the forehead when the eyes are raised.

The normal map is created to express changes in the height of the face that occur when facial expressions change, and includes a number of dynamic nodes.

The displacement map is created to express deformation that occurs when facial expression changes, and includes a number of dynamic nodes.
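For readers unfamiliar with displacement mapping, the general idea is that each vertex is moved along its surface normal by a height sampled from the map. The sketch below shows that idea only; the function name `displace`, the per-vertex height list, and the `weight` parameter (standing in for how strongly an expression is active) are assumptions for illustration and are not taken from the invention.

```python
def displace(vertices, normals, heights, weight):
    """Offset each vertex along its unit normal by height * weight.

    vertices, normals: lists of (x, y, z) tuples of equal length
    heights: one displacement value per vertex, sampled from a displacement map
    weight: 0.0 (expression inactive) to 1.0 (expression fully active)
    """
    return [
        tuple(v[i] + n[i] * h * weight for i in range(3))
        for v, n, h in zip(vertices, normals, heights)
    ]


# Two vertices of a face mesh with a half-active expression.
moved = displace(
    vertices=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    normals=[(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)],
    heights=[0.02, 0.01],
    weight=0.5,
)
```

A real-time engine would normally do this in a shader rather than on the CPU; the list-based version is only meant to make the arithmetic visible.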

The cloud server extracts maps from among the plurality of wrinkle maps, the plurality of normal maps, and the plurality of displacement maps, changes them according to the tracking information received from the user terminal, and renders them on the face of the digital human in real time (S3).

For example, the cloud server extracts, from among the plurality of wrinkle maps, the wrinkle map most similar to the tracking information and synchronizes the tracking information to each point of the extracted wrinkle map.
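One way to read "most similar" is as a nearest-neighbor lookup between the incoming tracking values and a key vector stored with each pre-generated map. The sketch below assumes exactly that. The library contents, the Euclidean distance metric, and the intensity calculation are hypothetical and are not specified by the invention.

```python
import math

# Hypothetical library: map name -> expression key vector it was generated for.
WRINKLE_MAPS = {
    "neutral": {"brow_inner_up": 0.0, "brow_down": 0.0, "smile": 0.0},
    "brow_raise": {"brow_inner_up": 1.0, "brow_down": 0.0, "smile": 0.0},
    "frown": {"brow_inner_up": 0.0, "brow_down": 1.0, "smile": 0.0},
    "smile": {"brow_inner_up": 0.0, "brow_down": 0.0, "smile": 1.0},
}


def distance(a, b):
    """Euclidean distance between two sparse expression vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def select_wrinkle_map(tracking, library=WRINKLE_MAPS):
    """Return (map name, intensity in [0, 1]) for the closest stored map."""
    name = min(library, key=lambda n: distance(tracking, library[n]))
    key = library[name]
    norm_sq = sum(v * v for v in key.values())
    if norm_sq == 0.0:
        return name, 0.0  # the neutral map has no intensity to scale
    # Project the tracking vector onto the key vector and clamp the result.
    dot = sum(tracking.get(k, 0.0) * v for k, v in key.items())
    return name, max(0.0, min(1.0, dot / norm_sq))


selected, intensity = select_wrinkle_map({"brow_inner_up": 0.8, "smile": 0.1})
```

Here the returned intensity could drive how strongly the selected map is applied, which is one possible interpretation of changing the map according to the tracking information.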

Additionally, the cloud server can express the digital human's skin texture and the like using a pre-built skin shader (S4).

Through the skin shader, the cloud server renders the skin texture in conjunction with the dynamic nodes of the extracted maps.

Referring to FIG. 3, the basic appearance of the digital human is pre-stored in the cloud server, and the user's face is rendered on the basic appearance in real time, so that a rendered image of the digital human in which changes in the user's facial expression are reflected in real time can be created.

Referring to FIG. 4, the skin shader is a shader that reflects scattering beneath the skin surface and enables hyper-realistic expression through effects such as specular reflection, multiple scattering, and single scattering for each skin layer.

Figure 5 shows an example of linking a camera and a digital human in real time in the metaverse service method using a digital human according to an embodiment of the present invention.

Therefore, by applying the skin shader and the plurality of maps based on the tracking information according to the change in the user's facial expression recognized by the camera, the digital human's natural facial expression or muscle movement can be expressed in a real-time rendering environment.

Referring to FIG. 6, based on the tracking information according to the motion change of the user's body recognized by the camera, the cloud server connects the user's motion to the bones of the digital human and renders the body of the digital human in real time to generate an animation (S5).
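Connecting tracked motion to a skeleton can be pictured as mapping pairs of tracked joints onto named bones and deriving a direction for each bone. The following is a minimal sketch under that assumption. The bone names, joint names, and the `BONE_MAP` table are hypothetical, and a production rig would typically work with rotations rather than plain direction vectors.

```python
import math

# Hypothetical mapping: bone name -> (parent joint, child joint) in the tracking data.
BONE_MAP = {
    "upper_arm_l": ("left_shoulder", "left_elbow"),
    "lower_arm_l": ("left_elbow", "left_wrist"),
}


def bone_directions(joints, bone_map=BONE_MAP):
    """Return a unit direction vector per bone from tracked joint positions.

    joints: dict of joint name -> (x, y, z) position
    Bones whose joints are missing or coincide are skipped.
    """
    directions = {}
    for bone, (parent, child) in bone_map.items():
        if parent not in joints or child not in joints:
            continue
        delta = [c - p for p, c in zip(joints[parent], joints[child])]
        length = math.sqrt(sum(d * d for d in delta))
        if length == 0.0:
            continue
        directions[bone] = tuple(d / length for d in delta)
    return directions


pose = bone_directions({
    "left_shoulder": (0.0, 0.0, 0.0),
    "left_elbow": (0.0, -2.0, 0.0),
    "left_wrist": (2.0, -2.0, 0.0),
})
```

Applying the resulting directions to the digital human's skeleton each frame is what would produce the animation in this simplified picture.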

Accordingly, a rendered image of the digital human that reflects the user's facial expression and motion changes in real time can be generated.

In this embodiment, the user's motion is described as being captured by the camera provided in the user terminal, but the invention is not limited to this, and the motion may instead be captured in real time using motion capture equipment provided separately from the user terminal. It is also possible to use both the camera and the motion capture equipment. When the motion capture equipment is used, the motion data it recognizes is transmitted to the cloud server, and the cloud server collects, blends, transforms, and corrects the motion data so that it can be linked to the motion of the digital human in real time.
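The blending and correction of motion data from two sources could, for instance, be a weighted average per joint followed by temporal smoothing. The sketch below illustrates that under stated assumptions: the function names, the default weight of 0.7 for the motion capture source, and the use of exponential smoothing are choices made for this example and are not described in the source.

```python
def blend(camera, mocap, mocap_weight=0.7):
    """Blend joint positions from two sources, joint by joint.

    Joints present in both sources are linearly interpolated;
    joints present in only one source are passed through unchanged.
    """
    blended = {}
    for joint in set(camera) | set(mocap):
        if joint in camera and joint in mocap:
            blended[joint] = tuple(
                (1.0 - mocap_weight) * c + mocap_weight * m
                for c, m in zip(camera[joint], mocap[joint])
            )
        else:
            blended[joint] = camera.get(joint, mocap.get(joint))
    return blended


def smooth(previous, current, alpha=0.5):
    """Exponential smoothing to reduce frame-to-frame jitter."""
    return {
        joint: tuple(
            alpha * c + (1.0 - alpha) * p
            for p, c in zip(previous.get(joint, pos), pos)
        )
        for joint, pos in current.items()
    }


camera_pose = {"hip": (0.0, 1.0, 0.0), "head": (0.0, 2.0, 0.0)}
mocap_pose = {"hip": (1.0, 1.0, 0.0)}
merged = blend(camera_pose, mocap_pose, mocap_weight=0.5)
stable = smooth({"hip": (0.0, 0.0, 0.0)}, merged, alpha=0.5)
```

A joint seen for the first time is passed through `smooth` unchanged, because its previous value defaults to its current one.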

In addition, the cloud server can display a number of pre-created motion samples through the application and allow the user to select and input one of them.

In addition, it is also possible to recognize not only the user's facial expressions and motions but also the user's voice, and to synchronize it with the digital human's voice in real time.

In this embodiment, the user's voice is described as being recognized by a voice recognizer provided in the user terminal. However, the present invention is not limited to this, and the voice recognizer may be provided separately from the user terminal.

Voice data recognized by the voice recognizer is transmitted to the cloud server.

When the cloud server receives the voice data, it learns the voice data through deep learning and synthesizes the user's voice into the digital human's voice in real time.

Additionally, the cloud server can synthesize the user's voice or a singing voice of various tones, generate a vocal voice according to lyrics, notes, and duration, and reflect it as the digital human's vocal voice.
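To make the inputs "lyrics, notes, and duration" concrete, the sketch below expands a short score into a frame-level sequence of (lyric, pitch in Hz) pairs, which is the kind of conditioning input a singing voice synthesis model might consume. The `Note` class, the frame rate of 100 frames per second, and the frame layout are assumptions for illustration. Only the MIDI-to-frequency formula (A4 = MIDI note 69 = 440 Hz, twelve-tone equal temperament) is standard.

```python
from dataclasses import dataclass


@dataclass
class Note:
    lyric: str         # syllable to be sung
    midi_pitch: int    # MIDI note number, 69 = A4
    duration_s: float  # note length in seconds


def midi_to_hz(midi_pitch: int) -> float:
    """Equal-temperament conversion with A4 (MIDI 69) tuned to 440 Hz."""
    return 440.0 * 2 ** ((midi_pitch - 69) / 12)


def score_to_frames(score, frame_rate=100):
    """Expand notes into per-frame (lyric, f0 in Hz) pairs."""
    frames = []
    for note in score:
        count = round(note.duration_s * frame_rate)
        frames.extend([(note.lyric, midi_to_hz(note.midi_pitch))] * count)
    return frames


frames = score_to_frames([Note("la", 69, 0.5), Note("la", 81, 0.25)])
```

The synthesis model itself is outside the scope of this sketch; it only shows how a score could be turned into time-aligned input.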

Figure 7 shows an example of an SVS (Singing Voice Synthesis) model for high-performance AI vocal voice synthesis technology in the metaverse service method using digital humans according to an embodiment of the present invention.

Additionally, the cloud server provides a user interface that allows the user to input or change at least one of skin texture, skin tone, hairstyle, eye color, background image, clothing, accessories, motion, and voice to the user terminal.

Customization information entered through the user interface is transmitted to the cloud server.

The cloud server can reflect the customization information in the digital human in real time.

Therefore, real-time user interaction is possible.
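The customization flow can be sketched as the server merging validated changes into the digital human's current appearance state before the next frame is rendered. The field names below mirror the options listed in the description; the function name `apply_customization`, the dictionary representation, and the example values are assumptions.

```python
# Options the description lists as selectable through the user interface.
ALLOWED_FIELDS = frozenset({
    "skin_texture", "skin_tone", "hairstyle", "eye_color",
    "background_image", "clothing", "accessories", "motion", "voice",
})


def apply_customization(current, changes):
    """Return a new appearance state with the user's changes applied.

    Unknown fields are rejected so a malformed request cannot
    introduce state the renderer does not understand.
    """
    unknown = set(changes) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unsupported customization fields: {sorted(unknown)}")
    return {**current, **changes}


state = {"hairstyle": "short", "eye_color": "brown", "clothing": "suit"}
updated = apply_customization(state, {"clothing": "stage_dress", "skin_tone": "warm"})
```

Returning a new dictionary rather than mutating the old one lets the renderer keep using the previous state until the updated one is ready.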

Figure 8 shows an example of simulating pattern-based costume production and animation changes in the metaverse service method using digital humans according to an embodiment of the present invention.

As described above, the cloud server can reflect the user's facial expression, motion, voice, etc. to the digital human in real time, expressing the digital human more realistically and generating a rendered image.

The cloud server transmits the rendered image of the digital human to the user's terminal in real time (S6).

The cloud server supports displaying HTML5-based real-time video web pages through the Web Real-Time Communication (WebRTC) protocol.

Accordingly, the cloud server can transmit the rendered image to the user terminal through the web real-time communication protocol.

Additionally, the cloud server can stream the rendered image in real time to a plurality of preset client terminals through the web real-time communication.
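The fan-out to several preset client terminals can be illustrated without any networking. The sketch below is not WebRTC and uses no WebRTC library; it only models one producer distributing rendered frames to per-client queues. The queue size, the policy of dropping the oldest frame for a slow client, and the terminal names are assumptions chosen to keep latency bounded in this example.

```python
import asyncio


async def broadcast(frames, client_queues):
    """Deliver every rendered frame to each client's bounded queue."""
    for frame in frames:
        for queue in client_queues.values():
            if queue.full():
                queue.get_nowait()  # drop the oldest frame for a slow client
            queue.put_nowait(frame)


async def demo():
    # Two preset client terminals, each buffering at most two frames.
    queues = {name: asyncio.Queue(maxsize=2) for name in ("terminal_a", "terminal_b")}
    await broadcast([b"frame-1", b"frame-2", b"frame-3"], queues)
    return {
        name: [queue.get_nowait() for _ in range(queue.qsize())]
        for name, queue in queues.items()
    }


received = asyncio.run(demo())
```

In an actual deployment, each queue would be replaced by a media track on a WebRTC peer connection, with the transport handling congestion and frame pacing.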

As a rendered image of the digital human is generated in the cloud server as described above, it is possible to interact with a hyper-realistic digital human in real time even if the user terminal is not a high-end computer.

Additionally, it may become easier for individuals to create digital humans and use them on various platforms.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely exemplary, and those skilled in the art will understand that various modifications and equivalent other embodiments are possible therefrom. Therefore, the true scope of technical protection of the present invention should be determined by the technical spirit of the attached patent claims.

According to the present invention, it is possible to provide a metaverse service using a digital human that reflects the user's facial expression and motion changes in real time.

Claims

1. A metaverse service method using digital humans capable of real-time synchronization and interaction, the method comprising:

when an application provided by a cloud server is executed on a user terminal, recognizing, by a camera provided in the user terminal, the user's facial expression and motion, and transmitting tracking information according to changes in the facial expression and motion to the cloud server;

checking, by the cloud server, the tracking information, extracting at least one of a plurality of wrinkle maps, a plurality of normal maps, and a plurality of displacement maps, each pre-generated differently for a plurality of facial expressions, changing the extracted map according to the tracking information on the user's facial expression, and rendering it on the face of a digital human in real time;

generating, by the cloud server, a rendered image of the digital human by connecting the tracking information on the user's motion change to the bones of the digital human and rendering the body of the digital human in real time to generate an animation; and

transmitting, by the cloud server, the rendered image of the digital human to the user terminal in real time.

2. The method of claim 1, further comprising recognizing, by a voice recognizer provided in the user terminal, the user's voice and transmitting the resulting voice data to the cloud server in real time.

3. The method of claim 2, wherein the generating of the rendered image further comprises learning, by the cloud server, the voice data through deep learning and synchronizing the user's voice with the digital human's voice in real time.

4. The method of claim 1, wherein the cloud server displays, on the user terminal, a user interface enabling selection of at least one of skin texture, skin tone, hairstyle, eye color, background image, costume, accessory, motion, and voice.

5. The method of claim 4, wherein the generating of the rendered image further comprises, when information is input through the user interface, rendering the digital human in real time according to the input information.

6. The method of claim 1, wherein the cloud server streams the rendered image of the digital human in real time to preset terminals through web real-time communication (WebRTC).

7. The method of claim 1, wherein the cloud server extracts all of the plurality of wrinkle maps, the plurality of normal maps, and the plurality of displacement maps according to the tracking information.

8. The method of claim 1, wherein the camera includes a TrueDepth camera.

9. The method of claim 8, wherein the tracking information is depth information.

10. A metaverse service method using digital humans capable of real-time synchronization and interaction, the method comprising:

when an application provided by a cloud server is executed on a user terminal, recognizing, by a camera provided in the user terminal, the user's facial expression and motion, and transmitting tracking information according to changes in the facial expression and motion to the cloud server;

checking, by the cloud server, the tracking information, extracting at least one of a plurality of wrinkle maps, a plurality of normal maps, and a plurality of displacement maps, each pre-generated differently for a plurality of facial expressions, changing the extracted map according to the tracking information on the user's facial expression, and rendering it on the face of a digital human in real time;

generating, by the cloud server, a rendered image of the digital human by connecting the tracking information on the user's motion change to the bones of the digital human and rendering the body of the digital human in real time to generate an animation; and

transmitting, by the cloud server, the rendered image of the digital human to the user terminal in real time,

wherein, when a voice recognizer provided in the user terminal recognizes the user's voice and transmits the resulting voice data to the cloud server in real time, the method further comprises learning, by the cloud server, the voice data through deep learning and synchronizing the user's voice with the digital human's voice in real time.

11. A system that provides a metaverse service using digital humans according to the method of claim 1.

12. A system that provides a metaverse service using digital humans according to the method of claim 10.