CN111724457A - Realistic virtual human multi-modal interaction implementation method based on UE4

Info

Publication number
CN111724457A
CN111724457A
Authority
CN
China
Prior art keywords
module
making
animation
action
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010168192.2A
Other languages
Chinese (zh)
Inventor
郭松睿
贺志武
高春鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Qianbo Information Technology Co., Ltd.
Original Assignee
Changsha Qianbo Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha Qianbo Information Technology Co., Ltd.
Priority to CN202010168192.2A
Publication of CN111724457A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method for realizing multi-modal interaction with a realistic virtual human based on UE4. The method comprises three stages: resource production, resource assembly and function creation. The corresponding system comprises a resource production module, used for creating the character model, creating the facial expression BlendShapes, binding and skinning the skeleton, creating actions, creating texture maps and adjusting materials; a resource assembly module, used for scene construction, lighting design and UI construction; and a function creation module, used for recognizing the user's speech input, generating an intelligent answer, and playing the synthesized voice together with lip animation, expression animation and body motion, thereby embodying the multi-modal nature of the interaction. The function creation module specifically comprises a speech recognition module, an intelligent question-answering module, a Chinese natural language processing module, a speech synthesis module, a lip animation module, an expression animation module and a body motion module. The system has an affinity similar to that of a real person and is more easily accepted by users; it better matches human interaction habits, giving applications a wider range of adoption; and it makes applications truly "intelligent", so that their responses better conform to human logic.

Description

Realistic virtual human multi-modal interaction implementation method based on UE4
Technical Field
The invention relates to the field of computer software, and in particular to a method and system for realizing multi-modal interaction with a realistic virtual human based on UE4.
Background
Research on digital virtual humans is currently very active. As a next-generation intelligent human-computer interaction mode, digital virtual humans have broad prospects for commercial deployment: virtual idols, virtual actors and virtual anchors, for example, can be applied in industries such as games, film, financial services, education and healthcare.
However, most existing digital virtual humans are presented as cartoon characters with only simple expressions or lip movements, which keeps them out of most application fields. Existing game engines, especially the next-generation engines represented by UE4, can render characters and scenes so realistically that the human eye can hardly tell real from rendered, but they lack speech input and provide neither a domain-specific nor a general question-answering system, so they cannot give answers that conform to human behavioral habits.
Disclosure of Invention
To solve the above technical problems, the invention provides a method and system for realizing multi-modal interaction with a realistic digital virtual human based on UE4. The resulting digital virtual human has highly accurate speech recognition, spoken answers for professional domains, and natural, lifelike lip and facial expression animation and body motion; it is a genuine embodiment of the next-generation intelligent interaction mode. Following the implementation method of this system, digital virtual humans can be deeply customized for applications in many industries, so the method has significant commercial prospects.
To achieve this purpose, the invention adopts the following scheme:
a realistic digital virtual human multi-modal interaction implementation method based on UE4 comprises the following steps:
s1, resource making, including:
making a model: making a character model according to a next generation standard, emphasizing details of a mouth, a face and eyes, and carrying out normal map;
preparation of facial expression BlendShape: in Maya, 46 expression units BlendShape are produced by facial motion coding system (FACS);
binding the skeleton covering: in Maya, model binding is carried out according to human bones, and vertex weights are brushed;
and (3) action making: in Maya, making a basic mouth shape action sequence and a non-language behavior action sequence;
making a mapping: making 4K high-definition diffuse reflection mapping, highlight mapping and roughness mapping in Substance;
material adjustment: in the UE4, adjusting skin material parameters by using a sub-surface contour coloring model, adjusting hair material parameters by using a hair coloring model, adjusting eye material parameters by using an eye coloring model, and setting the produced chartlet into each material parameter to finally achieve a vivid rendering effect;
s2, resource assembly, including:
setting up a scene: in the UE4, creating a scene Level, importing a role model and adjusting the position of a camera;
light design: in the UE4, a lighting arrangement of a studio is imitated, a surface light source is built by using a plurality of point light sources, and parameters such as color positions of lighting are set;
building a UI interface: in the UE4, a UI interface such as an answer text, a microphone state indication and the like is constructed by using the UMG;
s3, function making, including:
manufacturing a voice recognition module: starting an independent thread to monitor microphone audio data, continuously sending the audio data and related interface parameters to a speech recognition background service program by using a Websocket protocol, and analyzing data returned by the service to obtain a recognition text;
and (3) manufacturing an intelligent question-answering module: sending the recognized text to a background question-answering system program through an Http Post request, and analyzing an answer text returned by the service;
the Chinese natural language processing module is manufactured: sending the analyzed answer text to a Chinese natural language processing service program through an Http Post request, and analyzing to obtain Chinese word segmentation and emotion results;
and (3) manufacturing a voice synthesis module: sending the analyzed answer text to a background speech synthesis service program through an Httppost request, analyzing to obtain returned speech audio data and Chinese phoneme time sequence data, and directly calling a system playing audio interface to play sound by the audio data;
making an oral lip animation module: calling an interface provided by a lip controller in an animation engine by the analyzed Chinese phoneme time sequence data and Chinese word segmentation, and playing lip animation by performing interpolation operation on a corresponding lip action sequence;
and (3) making an expression animation module: calling an emotion result calculated by the Chinese natural language processing module to an interface provided by a face controller in an animation engine, and performing interpolation operation on blend shape of a corresponding expression unit to play expression animation;
manufacturing a limb action module: and (3) the Chinese word segmentation obtained by analysis calls an interface provided by an NVBG module in the animation engine, head action Bml, eyeball action Bml and hand action Bml are obtained through calculation, and interpolation operation is carried out on the corresponding action sequence by a corresponding head controller, eyeball controller and hand controller triggered by a Bml analysis module to play the head action, the eyeball action and the hand action.
A UE4-based realistic digital virtual human multi-modal interaction system comprises:
a resource production module, used for creating the character model, creating the facial expression BlendShapes, binding and skinning the skeleton, creating actions, creating texture maps and adjusting materials;
a resource assembly module, used for scene construction, lighting design and UI construction;
a function creation module, used for recognizing the user's speech input, generating an intelligent answer, and playing the synthesized voice together with lip animation, expression animation and body motion, thereby embodying the multi-modal nature of the interaction; this module specifically comprises a speech recognition module, an intelligent question-answering module, a Chinese natural language processing module, a speech synthesis module, a lip animation module, an expression animation module and a body motion module.
As a next-generation intelligent interaction mode, the invention has the following beneficial effects:
1. It presents a realistic virtual human image with lip, facial and body responses that conform to human behavioral habits, so applications built on the system have an affinity similar to that of a real person and are more easily accepted by users;
2. It supports speech interaction, which can replace keyboard-and-mouse input on PCs and touch interaction on devices such as mobile phones; this better matches human interaction habits and gives applications a wider range of adoption;
3. It is equipped with a domain-specific question-answering system, so applications become truly "intelligent" and their responses better conform to human logic.
Drawings
The accompanying drawings are briefly described below; they merely serve to illustrate the concept of the invention.
Fig. 1 is a flowchart of the UE4-based method for realizing multi-modal interaction with a realistic digital virtual human.
Fig. 2 is a flowchart of the function creation stage of the invention.
Detailed Description
Hereinafter, the present invention will be further described with reference to the accompanying drawings.
Before making the detailed description, it is necessary to explain some terminology:
UE4 is short for Unreal Engine 4 (Chinese: 虚幻引擎4); UE4 is currently the most widely licensed top-tier game engine in the world.
BlendShape is a vertex-morphing animation technique commonly used for facial expression animation.
Maya is a world-leading 3D digital animation and visual effects software application made by Autodesk.
Substance is a powerful 3D texture-painting software package.
UMG is short for Unreal Motion Graphics (the Unreal Motion Graphics UI Designer); UMG is the UI-building module in the UE4 editor.
NVBG is short for NonVerbal Behavior Generator, a module that generates non-verbal behavior.
BML is short for Behavior Markup Language, a markup language for describing behavior.
Matinee is the track-based animation editor provided by UE4 and can be used, among other things, to author camera animations.
As shown in FIG. 1, the method for implementing the multi-modal interaction of the realistic digital virtual human based on the UE4 comprises the following steps:
I. Resource production
1. Model creation: create the character model to next-generation (AAA) standards and bake normal maps so that the model's contours are sharper and the details of the mouth, face and eyes stand out. In particular, so that the face, mouth and eye corners deform naturally when the character makes expressions or speaks, the model must be built with quad topology and its edge flow must follow human anatomy.
2. Facial expression BlendShape creation: in Maya, create 46 expression-unit BlendShapes according to the Facial Action Coding System (FACS); combined according to the way the facial muscles move, these basic expression units can richly express a person's various expressions.
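The 46 FACS expression units exported from Maya appear in UE4 as morph targets on the character's skeletal mesh. The following C++ sketch is an illustration under assumptions, not code from the patent: it shows how a set of expression-unit weights could be combined at runtime with USkeletalMeshComponent::SetMorphTarget. The morph-target names are hypothetical and would have to match the BlendShapes actually authored in Maya.

```cpp
// Illustrative sketch (not from the patent): driving FACS-style expression units
// as morph targets on a UE4 skeletal mesh. Names such as "AU12_LipCornerPuller"
// are hypothetical and must match the BlendShapes exported from Maya.
#include "CoreMinimal.h"
#include "Components/SkeletalMeshComponent.h"

void ApplyExpressionUnits(USkeletalMeshComponent* Face,
                          const TMap<FName, float>& UnitWeights)
{
    if (!Face) return;
    for (const TPair<FName, float>& Unit : UnitWeights)
    {
        // Clamp each action-unit weight to [0,1] before driving the BlendShape.
        Face->SetMorphTarget(Unit.Key, FMath::Clamp(Unit.Value, 0.f, 1.f));
    }
}

// Example: a mild smile composed from two FACS units.
void MakeSmile(USkeletalMeshComponent* Face)
{
    TMap<FName, float> Smile;
    Smile.Add(TEXT("AU6_CheekRaiser"), 0.4f);
    Smile.Add(TEXT("AU12_LipCornerPuller"), 0.6f);
    ApplyExpressionUnits(Face, Smile);
}
```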
3. Skeleton binding and skinning: in Maya, bind the model to a human skeleton and paint the vertex skin weights, so that transformations of the bones drive the skin vertices and the motion of the human body is expressed vividly.
4. Action creation: in Maya, create basic mouth-shape (viseme) action sequences and non-verbal behavior action sequences. The mouth-shape changes of a speaking character are represented by interpolating a set of basic viseme sequences, while the non-verbal behavior sequences represent the habitual movements of the specific character. Each action sequence consists of a series of keyframes, and in each keyframe the position and rotation of the relevant bones must be set to suitable values.
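At runtime, each exported action sequence is simply a list of timed bone keys that a controller samples and blends. As a minimal illustration (not the patent's code; the FBoneKey layout is an assumption), a bone track can be sampled by linearly interpolating positions and spherically interpolating rotations between the two surrounding keyframes:

```cpp
// Illustrative sketch: sampling a bone action sequence authored in Maya.
// FBoneKey is an assumed layout; FMath::Lerp and FQuat::Slerp are standard UE4 math.
#include "CoreMinimal.h"

struct FBoneKey
{
    float   Time;      // keyframe time in seconds
    FVector Position;  // bone translation at this key
    FQuat   Rotation;  // bone rotation at this key
};

void SampleBoneTrack(const TArray<FBoneKey>& Keys, float Time,
                     FVector& OutPos, FQuat& OutRot)
{
    if (Keys.Num() == 0) return;

    // Clamp to the track's range.
    if (Time <= Keys[0].Time)     { OutPos = Keys[0].Position;     OutRot = Keys[0].Rotation;     return; }
    if (Time >= Keys.Last().Time) { OutPos = Keys.Last().Position; OutRot = Keys.Last().Rotation; return; }

    for (int32 i = 1; i < Keys.Num(); ++i)
    {
        if (Time <= Keys[i].Time)
        {
            // Normalized position between the two surrounding keyframes.
            const float Alpha = (Time - Keys[i - 1].Time) / (Keys[i].Time - Keys[i - 1].Time);
            OutPos = FMath::Lerp(Keys[i - 1].Position, Keys[i].Position, Alpha);
            OutRot = FQuat::Slerp(Keys[i - 1].Rotation, Keys[i].Rotation, Alpha);
            return;
        }
    }
}
```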
5. Texture map creation: in Substance, create 4K high-definition diffuse, specular and roughness maps. The high-resolution maps capture fine character detail, making the model very lifelike and the lighting and shading look more natural.
6. Material adjustment: in UE4, adjust the skin material parameters with the subsurface profile shading model, the hair material parameters with the hair shading model and the eye material parameters with the eye shading model, and assign the created maps to the corresponding material parameters to achieve a lifelike rendering result.
II. Resource assembly
1. Scene construction: in UE4, create a scene Level and import the character model, preserving the original vertex normals during import; place the model at a suitable position and assign its materials one by one; create an observation camera and adjust it to a suitable position as the default camera position. Specifically, camera animations are authored with the UE4 Matinee editor: when a facial expression is to be shown, a camera animation is played that moves the camera to a position where the face can be observed clearly, and when body motion is shown, the camera is moved to a position where the body motion can be observed clearly.
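Since the camera moves themselves are authored in the Matinee editor, there is no single code snippet to quote from the patent. As an illustrative alternative sketch, the same face/body framing switch can be driven at runtime by blending the player controller's view target between two pre-placed cameras; the camera actors and the one-second blend below are assumptions, not the patent's setup.

```cpp
// Illustrative sketch only: switching between a face camera and a body camera at runtime.
// The patent authors these moves in Matinee; here SetViewTargetWithBlend is used instead.
#include "CoreMinimal.h"
#include "GameFramework/PlayerController.h"
#include "Camera/CameraActor.h"

void FocusCamera(APlayerController* PC, ACameraActor* FaceCamera,
                 ACameraActor* BodyCamera, bool bShowFace)
{
    if (!PC) return;
    AActor* Target = bShowFace ? static_cast<AActor*>(FaceCamera)
                               : static_cast<AActor*>(BodyCamera);
    // Blend over one second so the cut is not jarring.
    PC->SetViewTargetWithBlend(Target, 1.0f, VTBlend_Cubic);
}
```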
2. Lighting design: in UE4, imitate the lighting arrangement of a studio, build an area light out of several point lights, and set lighting parameters such as color and position. In particular, to maintain a stable frame rate of about 30 fps, the point lights are organized into an area light with UE4 Blueprint code, and the light directions and positions are adjusted to illuminate the character.
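The patent builds the area light in Blueprint; the following C++ sketch expresses the same idea for illustration. The grid size, spacing, intensity and color values are assumed and would in practice be tuned while watching the frame rate.

```cpp
// Illustrative C++ sketch of the Blueprint idea: a small grid of point lights
// approximating a studio-style area light. All numeric values are assumptions.
#include "CoreMinimal.h"
#include "Engine/World.h"
#include "Engine/PointLight.h"
#include "Components/PointLightComponent.h"

void BuildAreaLight(UWorld* World, const FVector& Center, int32 Rows = 3,
                    int32 Cols = 3, float Spacing = 60.f)
{
    if (!World) return;
    for (int32 r = 0; r < Rows; ++r)
    {
        for (int32 c = 0; c < Cols; ++c)
        {
            // Lay the lights out in a vertical grid centered on Center.
            const FVector Pos = Center + FVector(0.f,
                (c - (Cols - 1) * 0.5f) * Spacing,
                (r - (Rows - 1) * 0.5f) * Spacing);
            APointLight* Light = World->SpawnActor<APointLight>(Pos, FRotator::ZeroRotator);
            UPointLightComponent* Comp =
                Light ? Light->FindComponentByClass<UPointLightComponent>() : nullptr;
            if (Comp)
            {
                // Split the overall brightness across the grid so the sum stays studio-like.
                Comp->SetIntensity(5000.f / (Rows * Cols));
                Comp->SetLightColor(FLinearColor(1.f, 0.95f, 0.9f)); // slightly warm key light
                Comp->SetAttenuationRadius(800.f);
            }
        }
    }
}
```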
3. UI construction: in UE4, build UI elements such as the answer text and the microphone state indicator with UMG.
III. Function creation
As shown in Fig. 2, function creation includes building the following modules:
1. Speech recognition module: start a separate thread to monitor the microphone audio data, continuously send the audio data and the related interface parameters to a background speech recognition service over the WebSocket protocol, and parse the data returned by the service to obtain the recognized text.
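A minimal sketch of the streaming side of this module is shown below, using UE4's WebSockets module. It is an illustration under assumptions: the endpoint URL, the JSON field name and the audio chunking protocol depend entirely on the backend ASR service and are hypothetical here.

```cpp
// Illustrative sketch, not the patent's actual code: streaming microphone audio to a
// backend ASR service over UE4's WebSockets module (add "WebSockets" and "Json" to
// the project's module dependencies). The endpoint and response format are assumptions.
#include "CoreMinimal.h"
#include "WebSocketsModule.h"
#include "IWebSocket.h"
#include "Dom/JsonObject.h"
#include "Serialization/JsonReader.h"
#include "Serialization/JsonSerializer.h"

TSharedPtr<IWebSocket> OpenAsrSocket(TFunction<void(const FString&)> OnText)
{
    TSharedPtr<IWebSocket> Socket =
        FWebSocketsModule::Get().CreateWebSocket(TEXT("ws://localhost:8000/asr")); // hypothetical endpoint

    Socket->OnMessage().AddLambda([OnText](const FString& Message)
    {
        // Assume the service returns {"result": "<recognized text>"}.
        TSharedPtr<FJsonObject> Json;
        const TSharedRef<TJsonReader<>> Reader = TJsonReaderFactory<>::Create(Message);
        if (FJsonSerializer::Deserialize(Reader, Json) && Json.IsValid())
        {
            OnText(Json->GetStringField(TEXT("result")));
        }
    });
    Socket->Connect();
    return Socket;
}

// Called from the microphone capture thread with each PCM buffer.
void SendAudioChunk(const TSharedPtr<IWebSocket>& Socket, const TArray<uint8>& Pcm)
{
    if (Socket.IsValid() && Socket->IsConnected())
    {
        Socket->Send(Pcm.GetData(), Pcm.Num(), /*bIsBinary=*/true);
    }
}
```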
2. Intelligent question-answering module: send the recognized text to a background question-answering service via an HTTP POST request and parse the answer text returned by the service. So as not to block the main program, the request-sending function is encapsulated in a separate worker thread, which listens for the service's response.
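A hedged sketch of the request/response round trip follows, using UE4's HTTP module; FHttpModule already completes requests asynchronously, which satisfies the non-blocking requirement even without an explicit worker thread. The endpoint URL and the JSON fields "question"/"answer" are assumptions about the backend, not details from the patent.

```cpp
// Illustrative sketch only: posting the recognized text to a QA backend with UE4's
// HTTP module (add "HTTP" and "Json" to the project's module dependencies).
#include "CoreMinimal.h"
#include "HttpModule.h"
#include "Interfaces/IHttpRequest.h"
#include "Interfaces/IHttpResponse.h"
#include "Dom/JsonObject.h"
#include "Serialization/JsonWriter.h"
#include "Serialization/JsonReader.h"
#include "Serialization/JsonSerializer.h"

void AskQuestion(const FString& RecognizedText, TFunction<void(const FString&)> OnAnswer)
{
    // Build {"question": "<recognized text>"} as the request body.
    TSharedPtr<FJsonObject> Body = MakeShared<FJsonObject>();
    Body->SetStringField(TEXT("question"), RecognizedText);
    FString Payload;
    const TSharedRef<TJsonWriter<>> Writer = TJsonWriterFactory<>::Create(&Payload);
    FJsonSerializer::Serialize(Body.ToSharedRef(), Writer);

    auto Request = FHttpModule::Get().CreateRequest();
    Request->SetURL(TEXT("http://localhost:8001/qa"));   // hypothetical endpoint
    Request->SetVerb(TEXT("POST"));
    Request->SetHeader(TEXT("Content-Type"), TEXT("application/json"));
    Request->SetContentAsString(Payload);
    Request->OnProcessRequestComplete().BindLambda(
        [OnAnswer](FHttpRequestPtr, FHttpResponsePtr Response, bool bOk)
        {
            if (!bOk || !Response.IsValid()) return;
            TSharedPtr<FJsonObject> Json;
            const TSharedRef<TJsonReader<>> Reader =
                TJsonReaderFactory<>::Create(Response->GetContentAsString());
            if (FJsonSerializer::Deserialize(Reader, Json) && Json.IsValid())
            {
                OnAnswer(Json->GetStringField(TEXT("answer")));
            }
        });
    Request->ProcessRequest();
}
```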
3. Chinese natural language processing module: send the parsed answer text to a Chinese natural language processing service via an HTTP POST request and parse the response to obtain Chinese word segmentation and emotion results. In particular, to express the character's emotion reasonably, each sentence corresponds to one emotional state; compared with assigning an emotion to each word segment, this avoids abrupt emotion changes and lets each emotion last longer, which matches the way humans express emotion.
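For illustration, the response of such an NLP service might be parsed into one record per sentence, each carrying its word segmentation and a single emotion label. The JSON layout used below is hypothetical; only the per-sentence-emotion idea comes from the patent.

```cpp
// Illustrative sketch only: parsing an assumed NLP response of the form
// {"sentences":[{"text":..., "words":[...], "emotion":...}]} into per-sentence records.
#include "CoreMinimal.h"
#include "Dom/JsonObject.h"
#include "Serialization/JsonReader.h"
#include "Serialization/JsonSerializer.h"

struct FSentenceNlp
{
    FString         Text;
    TArray<FString> Words;    // Chinese word segmentation of this sentence
    FString         Emotion;  // one emotion label per sentence, e.g. "happy"
};

bool ParseNlpResponse(const FString& Json, TArray<FSentenceNlp>& Out)
{
    TSharedPtr<FJsonObject> Root;
    const TSharedRef<TJsonReader<>> Reader = TJsonReaderFactory<>::Create(Json);
    if (!FJsonSerializer::Deserialize(Reader, Root) || !Root.IsValid()) return false;

    for (const TSharedPtr<FJsonValue>& Item : Root->GetArrayField(TEXT("sentences")))
    {
        const TSharedPtr<FJsonObject> Obj = Item->AsObject();
        if (!Obj.IsValid()) continue;

        FSentenceNlp S;
        S.Text    = Obj->GetStringField(TEXT("text"));
        S.Emotion = Obj->GetStringField(TEXT("emotion"));
        for (const TSharedPtr<FJsonValue>& W : Obj->GetArrayField(TEXT("words")))
        {
            S.Words.Add(W->AsString());
        }
        Out.Add(S);
    }
    return true;
}
```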
4. Speech synthesis module: send the parsed answer text to a background speech synthesis service via an HTTP POST request, parse the returned speech audio data and Chinese phoneme timing data, and play the audio data directly through the system audio interface. To keep the speech response timely, the answer text must be segmented reasonably, since overly long texts are unsuitable to send; the Chinese phoneme timing data is generated while the speech is synthesized.
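One way to play raw PCM returned by a TTS service inside UE4 is a procedural sound wave, sketched below. This is an assumption-laden illustration: the 16 kHz mono 16-bit format must match whatever the speech synthesis backend actually returns, and the patent does not specify which audio interface it calls.

```cpp
// Illustrative sketch only: playing raw 16-bit PCM returned by the TTS service
// through USoundWaveProcedural. Format values are assumptions.
#include "CoreMinimal.h"
#include "Sound/SoundWaveProcedural.h"
#include "Kismet/GameplayStatics.h"

void PlayTtsAudio(UObject* WorldContext, const TArray<uint8>& Pcm16Mono)
{
    USoundWaveProcedural* Wave = NewObject<USoundWaveProcedural>();
    Wave->SetSampleRate(16000);          // assumed service output format
    Wave->NumChannels = 1;
    Wave->SoundGroup  = SOUNDGROUP_Voice;
    Wave->bLooping    = false;

    // Feed the decoded PCM into the procedural wave and play it as a 2D (non-spatialized) sound.
    Wave->QueueAudio(Pcm16Mono.GetData(), Pcm16Mono.Num());
    UGameplayStatics::PlaySound2D(WorldContext, Wave);
}
```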
5. Lip animation module: feed the parsed Chinese phoneme timing data and word segmentation into the interface provided by the lip controller of the animation engine, which plays the lip animation by interpolating the corresponding mouth-shape action sequences. In particular, to keep the mouth-shape animation naturally continuous, the animation engine uses a cache pool to preload the mouth-shape action sequence data into memory.
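A simplified sketch of the per-frame lip-sync update is given below: each phoneme in the returned timing data is mapped to a viseme morph target and eased in and out over its duration. The phoneme-to-viseme table and morph-target names are hypothetical and must match the mouth-shape assets built in Maya; the real controller interpolates full mouth-shape action sequences rather than single morph targets.

```cpp
// Illustrative sketch only: driving mouth-shape morph targets from TTS phoneme timing data.
#include "CoreMinimal.h"
#include "Components/SkeletalMeshComponent.h"

struct FPhonemeEvent
{
    FString Phoneme;   // e.g. "a", "o", "sh"
    float   StartTime; // seconds from the start of the utterance
    float   EndTime;
};

static FName VisemeFor(const FString& Phoneme)
{
    // Hypothetical phoneme-to-viseme mapping; must match the assets authored in Maya.
    if (Phoneme == TEXT("a")) return FName(TEXT("Viseme_A"));
    if (Phoneme == TEXT("o")) return FName(TEXT("Viseme_O"));
    if (Phoneme == TEXT("i")) return FName(TEXT("Viseme_I"));
    if (Phoneme == TEXT("u")) return FName(TEXT("Viseme_U"));
    return FName(TEXT("Viseme_Closed"));            // fallback / silence
}

// Called every frame with the time elapsed since audio playback started.
void UpdateLipSync(USkeletalMeshComponent* Face,
                   const TArray<FPhonemeEvent>& Timeline, float PlaybackTime)
{
    if (!Face) return;
    for (const FPhonemeEvent& Ev : Timeline)
    {
        if (PlaybackTime >= Ev.StartTime && PlaybackTime < Ev.EndTime)
        {
            // Ease the viseme in and out over the phoneme's duration (0 -> 1 -> 0).
            const float Alpha  = (PlaybackTime - Ev.StartTime) / (Ev.EndTime - Ev.StartTime);
            const float Weight = FMath::Sin(Alpha * PI);
            Face->SetMorphTarget(VisemeFor(Ev.Phoneme), Weight);
        }
        else
        {
            Face->SetMorphTarget(VisemeFor(Ev.Phoneme), 0.f);
        }
    }
}
```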
6. Expression animation module: feed the emotion result computed by the Chinese natural language processing module into the interface provided by the face controller of the animation engine, which plays the expression animation by interpolating the BlendShapes of the corresponding expression units. In particular, to express how an emotion persists while a sentence is spoken, a weight curve of each expression unit over time is edited with the UE4 animation editor, and the BlendShapes are interpolated in real time according to this curve.
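As an illustrative sketch, the per-frame expression update below evaluates a float curve (of the kind authored in the UE4 curve editor) to get the current emotion weight and drives a small set of FACS units; the emotion-to-unit mapping and asset names are assumptions, not the patent's data.

```cpp
// Illustrative sketch only: applying a sentence-level emotion through a time-weight curve.
#include "CoreMinimal.h"
#include "Components/SkeletalMeshComponent.h"
#include "Curves/CurveFloat.h"

// Called every frame while a sentence with the given emotion is being spoken.
void UpdateExpression(USkeletalMeshComponent* Face, const FString& Emotion,
                      UCurveFloat* WeightOverTime, float TimeIntoSentence)
{
    if (!Face || !WeightOverTime) return;
    const float Weight = WeightOverTime->GetFloatValue(TimeIntoSentence);

    // Hypothetical mapping from the NLP emotion label to FACS expression units.
    if (Emotion == TEXT("happy"))
    {
        Face->SetMorphTarget(TEXT("AU6_CheekRaiser"),      Weight * 0.5f);
        Face->SetMorphTarget(TEXT("AU12_LipCornerPuller"), Weight);
    }
    else if (Emotion == TEXT("sad"))
    {
        Face->SetMorphTarget(TEXT("AU1_InnerBrowRaiser"),     Weight * 0.7f);
        Face->SetMorphTarget(TEXT("AU15_LipCornerDepressor"), Weight);
    }
    else // neutral: relax the units this function drives
    {
        Face->SetMorphTarget(TEXT("AU6_CheekRaiser"), 0.f);
        Face->SetMorphTarget(TEXT("AU12_LipCornerPuller"), 0.f);
        Face->SetMorphTarget(TEXT("AU1_InnerBrowRaiser"), 0.f);
        Face->SetMorphTarget(TEXT("AU15_LipCornerDepressor"), 0.f);
    }
}
```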
7. Body motion module: feed the parsed Chinese word segmentation into the interface provided by the NVBG module of the animation engine to compute head-motion, eye-motion and hand-motion BML; the BML parsing module then triggers the corresponding head, eye and hand controllers, which interpolate the corresponding action sequences to play the head, eye and hand motions. In particular, to keep all motions synchronized, the interpolation operations of all controllers are executed sequentially within one frame and the results are rendered together at render time.
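The synchronization idea can be illustrated with a small scheduler that runs every controller's interpolation in sequence once per frame, so all results land in the same rendered frame. The controller interface below is an assumption made for illustration; it is not an engine or patent API.

```cpp
// Illustrative sketch only: running all controllers' interpolation sequentially per frame.
#include "CoreMinimal.h"

class IVirtualHumanController
{
public:
    virtual ~IVirtualHumanController() = default;
    // Advance this controller's animation state and write its pose/morph results.
    virtual void Update(float PlaybackTime) = 0;
};

class FMultiModalAnimator
{
public:
    void Register(IVirtualHumanController* Controller) { Controllers.Add(Controller); }

    // Called once per frame (e.g. from an actor's Tick) before rendering, so every
    // controller's interpolation result is rendered together in the same frame.
    void TickAll(float PlaybackTime)
    {
        for (IVirtualHumanController* C : Controllers)
        {
            C->Update(PlaybackTime);
        }
    }

private:
    TArray<IVirtualHumanController*> Controllers; // lip, face, head, eye, hand controllers
};
```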
A UE4-based realistic digital virtual human multi-modal interaction system comprises:
a resource production module, used for creating the character model, creating the facial expression BlendShapes, binding and skinning the skeleton, creating actions, creating texture maps and adjusting materials;
a resource assembly module, used for scene construction, lighting design and UI construction;
a function creation module, used for recognizing the user's speech input, generating an intelligent answer, and playing the synthesized voice together with lip animation, expression animation and body motion, thereby embodying the multi-modal nature of the interaction; this module specifically comprises a speech recognition module, an intelligent question-answering module, a Chinese natural language processing module, a speech synthesis module, a lip animation module, an expression animation module and a body motion module.
The output of the resource production module is used by the resource assembly module and the function creation module; the function creation module depends on both the resource production module and the resource assembly module.
The resulting realistic digital virtual human has highly accurate speech recognition, spoken answers for professional domains, and natural, lifelike lip and facial expression animation and body motion; it is a genuine embodiment of the next-generation intelligent interaction mode.
Although the embodiments of the invention have been described with reference to the accompanying drawings, this does not limit the scope of the invention, and those skilled in the art should understand that various modifications and variations can be made on the basis of the technical solutions of the invention without inventive effort.

Claims (9)

1. A method for realizing multi-modal interaction with a realistic digital virtual human based on UE4, characterized by comprising the following steps:
S1, resource production, comprising:
model creation: creating a character model to next-generation (AAA) standards, emphasizing the details of the mouth, face and eyes, and baking normal maps;
facial expression BlendShape creation: in Maya, creating 46 expression-unit BlendShapes according to the Facial Action Coding System (FACS);
skeleton binding and skinning: in Maya, binding the model to a human skeleton and painting the vertex skin weights;
action creation: in Maya, creating basic mouth-shape (viseme) action sequences and non-verbal behavior action sequences;
texture map creation: in Substance, creating 4K high-definition diffuse, specular and roughness maps;
material adjustment: in UE4, adjusting the skin material parameters with the subsurface profile shading model, the hair material parameters with the hair shading model and the eye material parameters with the eye shading model, and assigning the created maps to the corresponding material parameters to achieve a lifelike rendering result;
S2, resource assembly, comprising:
scene construction: in UE4, creating a scene Level, importing the character model and adjusting the camera position;
lighting design: in UE4, imitating the lighting arrangement of a studio, building an area light out of several point lights, and setting lighting parameters such as color and position;
UI construction: in UE4, building UI elements such as the answer text and the microphone state indicator with UMG;
S3, function creation, comprising:
speech recognition module: starting a separate thread to monitor microphone audio data, continuously sending the audio data and the related interface parameters to a background speech recognition service over the WebSocket protocol, and parsing the data returned by the service to obtain the recognized text;
intelligent question-answering module: sending the recognized text to a background question-answering service via an HTTP POST request, and parsing the answer text returned by the service;
Chinese natural language processing module: sending the parsed answer text to a Chinese natural language processing service via an HTTP POST request, and parsing the response to obtain Chinese word segmentation and emotion results;
speech synthesis module: sending the parsed answer text to a background speech synthesis service via an HTTP POST request, parsing the returned speech audio data and Chinese phoneme timing data, and playing the audio data directly through the system audio interface;
lip animation module: feeding the parsed Chinese phoneme timing data and word segmentation into the interface provided by the lip controller of the animation engine, which plays the lip animation by interpolating the corresponding mouth-shape action sequences;
expression animation module: feeding the emotion result computed by the Chinese natural language processing module into the interface provided by the face controller of the animation engine, which plays the expression animation by interpolating the BlendShapes of the corresponding expression units;
body motion module: feeding the parsed Chinese word segmentation into the interface provided by the NVBG module of the animation engine to compute head-motion, eye-motion and hand-motion BML, and having the BML parsing module trigger the corresponding head, eye and hand controllers, which interpolate the corresponding action sequences to play the head, eye and hand motions.
2. The method for realizing multi-modal interaction with a realistic digital virtual human based on UE4 according to claim 1, wherein in the model creation of step S1, the model is built with quad topology and its edge flow conforms to human anatomy.
3. The method for realizing multi-modal interaction with a realistic digital virtual human based on UE4 according to claim 1, wherein in the scene construction of step S2, camera animations are authored with the UE4 Matinee editor; when a facial expression is to be shown, the camera animation is played to move the camera to a position where the face can be observed clearly, and when body motion is shown, the camera is moved to a position where the body motion can be observed clearly.
4. The method for realizing multi-modal interaction with a realistic digital virtual human based on UE4 according to claim 1, wherein in the lighting design of step S2, a group of point lights is organized into an area light with UE4 Blueprint code and the light directions and positions are adjusted to illuminate the character, while a stable frame rate of about 30 fps is maintained.
5. The method for realizing multi-modal interaction with a realistic digital virtual human based on UE4 according to claim 1, wherein in the Chinese natural language processing module of step S3, each sentence corresponds to one emotional state, which conforms to the way humans express emotion.
6. The method for realizing multi-modal interaction with a realistic digital virtual human based on UE4 according to claim 1, wherein in the lip animation module of step S3, the animation engine uses a cache pool to preload the mouth-shape action sequence data into memory.
7. The method for realizing multi-modal interaction with a realistic digital virtual human based on UE4 according to claim 1, wherein in the expression animation module of step S3, a weight curve of each expression unit over time is edited with the UE4 animation editor and the BlendShapes are interpolated in real time according to this curve.
8. The method for realizing multi-modal interaction with a realistic digital virtual human based on UE4 according to claim 1, wherein in the body motion module of step S3, the interpolation operations of all controllers are executed sequentially within one frame and the results are rendered together at render time.
9. A system implemented by the method according to any one of claims 1 to 8, comprising:
a resource production module, used for creating the character model, creating the facial expression BlendShapes, binding and skinning the skeleton, creating actions, creating texture maps and adjusting materials;
a resource assembly module, used for scene construction, lighting design and UI construction;
a function creation module, used for recognizing the user's speech input, generating an intelligent answer, and playing the synthesized voice together with lip animation, expression animation and body motion, thereby embodying the multi-modal nature of the interaction, the module specifically comprising a speech recognition module, an intelligent question-answering module, a Chinese natural language processing module, a speech synthesis module, a lip animation module, an expression animation module and a body motion module;
wherein the output of the resource production module is used by the resource assembly module and the function creation module, and the function creation module depends on both the resource production module and the resource assembly module.
CN202010168192.2A 2020-03-11 2020-03-11 Realistic virtual human multi-modal interaction implementation method based on UE4 Pending CN111724457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010168192.2A CN111724457A (en) 2020-03-11 2020-03-11 Realistic virtual human multi-modal interaction implementation method based on UE4

Publications (1)

Publication Number Publication Date
CN111724457A true CN111724457A (en) 2020-09-29

Family

ID=72563836

Country Status (1)

Country Link
CN (1) CN111724457A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106485774A (en) * 2016-12-30 2017-03-08 当家移动绿色互联网技术集团有限公司 Expression based on voice Real Time Drive person model and the method for attitude
CN107765852A (en) * 2017-10-11 2018-03-06 北京光年无限科技有限公司 Multi-modal interaction processing method and system based on visual human
CN110102050A (en) * 2019-04-30 2019-08-09 腾讯科技(深圳)有限公司 Virtual objects display methods, device, electronic equipment and storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI773458B (en) * 2020-11-25 2022-08-01 大陸商北京市商湯科技開發有限公司 Method, device, computer equipment and storage medium for reconstruction of human face
CN113160360A (en) * 2021-05-07 2021-07-23 深圳市灼华互娱科技有限公司 Animation data production method, device, equipment and storage medium
CN113643413A (en) * 2021-08-30 2021-11-12 北京沃东天骏信息技术有限公司 Animation processing method, animation processing device, animation processing medium and electronic equipment
CN113791821A (en) * 2021-09-18 2021-12-14 广州博冠信息科技有限公司 Animation processing method, device, medium and electronic equipment based on illusion engine
CN113791821B (en) * 2021-09-18 2023-11-17 广州博冠信息科技有限公司 Animation processing method and device based on illusion engine, medium and electronic equipment
CN114036257A (en) * 2021-10-18 2022-02-11 北京百度网讯科技有限公司 Information query method and device, electronic equipment and storage medium
CN115762688A (en) * 2022-06-13 2023-03-07 人民卫生电子音像出版社有限公司 Super-simulation virtual standardized patient construction system and diagnosis method
CN115619911A (en) * 2022-10-26 2023-01-17 润芯微科技(江苏)有限公司 Virtual image generation method based on non-real Engine
CN115619911B (en) * 2022-10-26 2023-08-08 润芯微科技(江苏)有限公司 Virtual image generation method based on Unreal Engine
CN117877509A (en) * 2024-03-13 2024-04-12 亚信科技(中国)有限公司 Digital human real-time interaction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination