CN116016986A - Virtual person interactive video rendering method and device - Google Patents

Virtual person interactive video rendering method and device

Info

Publication number
CN116016986A
Authority
CN
China
Prior art keywords
video
limb
lip
voice
broadcasted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310025137.1A
Other languages
Chinese (zh)
Inventor
张雪源
顾文元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuanmeng Human Intelligence International Co ltd
Shanghai Yuanmeng Intelligent Technology Co ltd
Original Assignee
Yuanmeng Human Intelligence International Co ltd
Shanghai Yuanmeng Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuanmeng Human Intelligence International Co ltd, Shanghai Yuanmeng Intelligent Technology Co ltd filed Critical Yuanmeng Human Intelligence International Co ltd
Priority to CN202310025137.1A priority Critical patent/CN116016986A/en
Publication of CN116016986A publication Critical patent/CN116016986A/en
Pending legal-status Critical Current

Abstract

The application provides a virtual person interactive video rendering method and device, wherein the method comprises: obtaining a voice to be broadcasted; selecting limb motion data matched with the voice to be broadcasted from a limb motion video library as target limb motion data, the target limb motion data comprising a limb motion video rendered in advance based on a limb motion of the virtual person, together with lip position information and lip posture information of the limb motion video; rendering a lip video according to the lip gesture information and the voice to be broadcasted; and fusing the lip video and the limb motion video based on the lip position information to obtain a virtual person interactive video for outputting the voice to be broadcasted. With this scheme, only the lip video needs to be rendered in real time, and the lip video and the pre-rendered limb motion video can then be synthesized into a complete virtual person interactive video, so the amount of computation required to render the virtual person interactive video in real time is significantly reduced.

Description

Virtual person interactive video rendering method and device
Technical Field
The invention relates to the technical field of virtual human interaction, in particular to a virtual human interaction video rendering method and device.
Background
The virtual human interactive video is a video with the following characteristics: a three-dimensional character model is displayed in the video picture, and the limbs (including the four limbs and the trunk) and the lips of the character model change along with the voice output by the video as it plays.
During man-machine interaction, a device can play a virtual person interactive video to simulate a real person speaking, which improves the interaction experience. Therefore, virtual person interactive videos are increasingly applied in scenarios such as intelligent shopping guides, intelligent navigation, intelligent reception desks, and mobile phone assistants.
In the above scenarios, the voice output by the virtual person interactive video usually carries a large amount of real-time information that varies over time, such as time, weather, stock prices, service status, and personal information. The virtual person interactive video must therefore be rendered in real time, that is, the video needs to be generated and output within a short time after the user input is obtained.
However, the amount of computation required to render a video containing a three-dimensional character model in real time is substantial; in particular, as modeling techniques develop, the accuracy of three-dimensional character models increases, further increasing the amount of computation required to render the corresponding video. This problem greatly increases the application cost of virtual person interactive video and limits the application range and application scenarios of the technology.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a virtual person interactive video rendering method and device, so as to reduce the amount of computation required for rendering virtual person interactive video in real time.
The first aspect of the present application provides a method for rendering a virtual person interactive video, including:
obtaining voice to be broadcasted; selecting limb motion data matched with the voice to be broadcasted in a limb motion video library as target limb motion data; the target limb action data comprise limb action videos which are rendered in advance based on limb actions of a virtual person, and lip position information and lip posture information of the limb action videos;
rendering lip videos according to the lip gesture information and the voice to be broadcasted;
and based on the lip position information, fusing the lip video and the limb action video to obtain a virtual person interactive video for outputting the voice to be broadcasted.
Optionally, the rendering the lip video according to the lip gesture information and the voice to be broadcasted includes:
synchronizing the time axis of the limb action video and the time axis of the voice to be broadcasted;
for each audio frame in the voice to be broadcasted, lip gesture data of an action video frame corresponding to the audio frame is obtained from the lip gesture information, and a lip video frame corresponding to the audio frame is rendered according to the audio frame and the lip gesture data; wherein, the action video frame refers to a video frame of the limb action video; the lip video frames refer to video frames that make up the lip video.
Optionally, based on the lip position information, fusing the lip video and the limb action video to obtain a virtual person interaction video for outputting the voice to be broadcasted, including:
for each action video frame, acquiring lip position data of the action video frame from the lip position information, and superposing the lip video frame corresponding to the action video frame at a position indicated by the lip position data in the action video frame to obtain an interactive video frame corresponding to the action video frame; and the virtual human interactive video is formed by a plurality of continuous interactive video frames.
Optionally, the selecting, as the target limb motion data, limb motion data matched with the voice to be broadcasted in the limb motion video library includes:
determining a target limb action matched with the voice content of the voice to be broadcasted;
and selecting limb motion data corresponding to the target limb motion from a limb motion video library as target limb motion data.
Optionally, the limb action data includes an expression tag, and the expression tag represents the expression of the virtual person when the limb action video is rendered;
the selecting the limb movement data corresponding to the target limb movement in the limb movement video library as target limb movement data comprises the following steps:
selecting limb motion data which corresponds to the target limb motion and has the same expression label as the voice to be broadcasted from the limb motion video library as target limb motion data; the expression label of the voice to be broadcasted is determined according to the voice content of the voice to be broadcasted.
A second aspect of the present application provides a virtual person interactive video rendering apparatus, including:
the acquisition unit is used for acquiring voice to be broadcasted; selecting limb motion data matched with the voice to be broadcasted in a limb motion video library as target limb motion data; the target limb action data comprise limb action videos which are rendered in advance based on limb actions of a virtual person, and lip position information and lip posture information of the limb action videos;
the rendering unit is used for rendering lip videos according to the lip gesture information and the voice to be broadcasted;
and the fusion unit is used for fusing the lip video and the limb action video based on the lip position information to obtain a virtual person interactive video for outputting the voice to be broadcasted.
Optionally, the rendering unit is specifically configured to, when rendering the lip video according to the lip gesture information and the voice to be broadcasted:
synchronizing the time axis of the limb action video and the time axis of the voice to be broadcasted;
for each audio frame in the voice to be broadcasted, acquiring lip gesture data of an action video frame corresponding to the audio frame from the lip gesture information, and synthesizing a lip video frame corresponding to the audio frame according to the audio frame and the lip gesture data; wherein, the action video frame refers to a video frame of the limb action video; the lip video frames refer to video frames that make up the lip video.
Optionally, the fusion unit fuses the lip video and the limb action video based on the lip position information, so as to obtain the virtual person interactive video for outputting the voice to be broadcasted, which is specifically used for:
for each action video frame, acquiring lip position data of the action video frame from the lip position information, and superposing the lip video frame corresponding to the action video frame at a position indicated by the lip position data in the action video frame to obtain an interactive video frame corresponding to the action video frame; and the virtual human interactive video is formed by a plurality of continuous interactive video frames.
Optionally, when the obtaining unit selects the limb motion data matched with the voice to be broadcasted in the limb motion video library as the target limb motion data, the obtaining unit is specifically configured to:
determining a target limb action matched with the voice content of the voice to be broadcasted;
and selecting limb motion data corresponding to the target limb motion from a limb motion video library as target limb motion data.
Optionally, the limb action data includes an expression tag, and the expression tag represents the expression of the virtual person when the limb action video is rendered;
the acquiring unit selects the limb motion data corresponding to the target limb motion from the limb motion video library as target limb motion data, and is specifically configured to:
selecting limb motion data which corresponds to the target limb motion and has the same expression label as the voice to be broadcasted from the limb motion video library as target limb motion data; the expression label of the voice to be broadcasted is determined according to the voice content of the voice to be broadcasted.
The application provides a virtual person interactive video rendering method and device, wherein the method comprises: obtaining a voice to be broadcasted; selecting limb motion data matched with the voice to be broadcasted from a limb motion video library as target limb motion data, the target limb motion data comprising a limb motion video rendered in advance based on a limb motion of the virtual person, together with lip position information and lip posture information of the limb motion video; rendering a lip video according to the lip gesture information and the voice to be broadcasted; and fusing the lip video and the limb motion video based on the lip position information to obtain a virtual person interactive video for outputting the voice to be broadcasted. With this scheme, only the lip video needs to be rendered in real time, and the lip video and the pre-rendered limb motion video can then be synthesized into a complete virtual person interactive video, so the amount of computation required to render the virtual person interactive video in real time is significantly reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for rendering a virtual person interactive video according to an embodiment of the present application;
fig. 2 is a schematic diagram of lip gesture data according to an embodiment of the present application;
fig. 3 is a schematic diagram of lip position data according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a virtual person interactive video rendering device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
An embodiment of the present application provides a method for rendering an interactive video of a virtual person, please refer to fig. 1, which is a flowchart of the method, and the method may include the following steps.
S101, obtaining voice to be broadcasted; and selecting the limb motion data matched with the voice to be broadcasted in the limb motion video library as target limb motion data.
The target limb motion data comprise limb motion videos which are rendered in advance based on limb motions of the virtual person, and lip position information and lip posture information of the limb motion videos.
The method for obtaining the voice to be broadcasted comprises the following steps:
first, a text to be synthesized is determined according to an input user instruction. The form of the user instruction is not limited, and may be a voice form instruction or a text form instruction.
For example, after the user asks a question aloud, the terminal device records the user's speech and treats it as the user's voice instruction. After the voice instruction is obtained, the question posed by the user is recognized through speech recognition technology; a preset answer text is then looked up according to the question, or a corresponding answer text is generated from the question by an artificial intelligence algorithm, and the answer text is determined as the text to be synthesized corresponding to the user instruction.
After the text to be synthesized is obtained, the voice to be broadcasted for broadcasting the text to be synthesized can be generated by using speech synthesis technology. Speech synthesis is a technique for generating artificial speech from given text by mechanical or electronic means; it can convert text information generated by a computer device or input externally into audible, fluent spoken Chinese output. The method provided in this embodiment may use any existing speech synthesis technique, or any speech synthesizer, to synthesize the voice to be broadcasted.
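As an illustrative sketch only: the patent does not name any particular synthesizer, so the snippet below uses the off-the-shelf pyttsx3 engine, and the function name and output path are assumptions made for this example.

```python
# Minimal sketch of the text-to-speech step, assuming pyttsx3 as the synthesizer.
import pyttsx3

def synthesize_broadcast_voice(text_to_synthesize: str, out_path: str = "broadcast.wav") -> str:
    """Convert the answer text into the voice to be broadcasted and save it as an audio file."""
    engine = pyttsx3.init()
    engine.save_to_file(text_to_synthesize, out_path)
    engine.runAndWait()          # blocks until the audio file has been written
    return out_path
```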
The limb movement video library is described below.
The limb movement video library comprises a plurality of pieces of limb movement data, each piece corresponding to a specific limb movement. A piece of limb movement data may include: a limb movement video in which the three-dimensional character model (hereinafter referred to as the virtual person) performs the specific limb movement, lip position information describing the position of the lips in that video, and lip posture information describing the pose of the lips in that video.
In an alternative embodiment, the virtual person in all limb movement videos may keep a default expression, with the lips closed and motionless. The default expression may be a smiling expression or any other expression, which is not limited here.
Taking a smile as the default expression as an example, when the limb action video library is constructed, materials such as the whole-body texture maps of the virtual person, two-dimensional or three-dimensional scene maps, and related prop maps can be used to render the process in which the virtual person, presenting a smiling expression and keeping its lips still, performs one limb action, so as to obtain a limb action video corresponding to that limb action; at the same time, the lip position data and lip gesture data of each video frame of the limb action video are recorded during rendering. After rendering is completed, the set of lip position data of all video frames in the limb action video is the lip position information corresponding to the limb action video, and the set of lip gesture data of all video frames is the lip gesture information corresponding to the limb action video, so one piece of limb action data corresponding to one limb action is obtained. Repeating the above process for each limb action that the virtual person is preset to perform yields the limb action data corresponding to each limb action, and thus the limb action video library.
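A minimal sketch, under assumed field names, of what one piece of limb action data and the resulting library might look like; the patent itself does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class LipFrameData:
    position: Tuple[float, float, float]   # (x, y, z) of the lip centre point in this video frame
    zenith: float                          # angle A of the lip normal, in radians
    azimuth: float                         # angle B of the lip normal, in radians

@dataclass
class LimbActionData:
    action_name: str                       # which limb action this entry corresponds to
    video_path: str                        # the pre-rendered limb action video
    frames: List[LipFrameData] = field(default_factory=list)  # per-frame lip position/pose data
    expression_tag: Optional[str] = None   # e.g. "smile" or "serious", if expressions differ

# The limb action video library is then simply a collection of such records.
library: List[LimbActionData] = []
```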
In another alternative embodiment, the virtual person in different limb movement videos may keep the lips closed and motionless while having different expressions. For example, one part of the limb movement videos shows the virtual person with a smiling expression, and another part shows the virtual person with a serious expression.
In this case, when the limb action video library is constructed, the expression tags corresponding to the different limb action videos need to be specified in advance; when a limb action video is rendered, the virtual person is rendered with the expression indicated by its expression tag, and after rendering is finished the expression tag is added to the corresponding limb action data. That is, when the expressions of the virtual person differ across limb action videos, each piece of limb action data may include a limb action video, lip position information, lip posture information, and an expression tag representing the expression of the virtual person in that video.
For example, when the limb action video corresponding to a certain limb action is rendered, the expression of the virtual person in the video is designated as serious, and the virtual person is controlled to present a serious expression during rendering; after the limb action video is rendered, the limb action video, its corresponding lip position information and lip posture information, and the "serious" tag are determined as one piece of limb action data corresponding to that limb action.
Referring to fig. 2, the lip pose of the virtual person is defined by the zenith angle and azimuth angle of the lip normal direction in global coordinates. Specifically, the lip region is defined as the part enclosed by the outer edge of the orbicularis oris muscle; its normal is perpendicular to the plane formed by that outer edge and points in the direction the lips face. The directions are defined using a spherical coordinate system: the angle between the normal direction and the z axis is the zenith angle (shown as angle A in fig. 2), and the angle between its projection on the x-y plane and the x axis is the azimuth angle (shown as angle B in fig. 2). Angle A controls the elevation and depression of the lips: when A = pi/2 the virtual person looks straight ahead, when A < pi/2 the virtual person raises its head, and when A > pi/2 the virtual person lowers its head. Angle B controls the orientation of the lips: the lips face forward when B = 0, face toward the virtual person's own left when 0 < B < pi, and face toward the virtual person's own right when pi < B < 2 pi.
Thus, the lip gesture data of one video frame in the limb movement video may be the values of the zenith angle A and the azimuth angle B in that video frame. Illustratively, the lip gesture data of a video frame may be: angle A is pi/4 and angle B is pi/3.
Referring to fig. 3, the lip position of the virtual person is defined as the global three-dimensional coordinates of the center point of the outer edge of the orbicularis oris muscle of the virtual person's mouth. Correspondingly, the lip position data of one video frame in the limb action video may be the coordinate data of the center point of the outer edge of the orbicularis oris muscle in that video frame, namely the (x, y, z) coordinates of that point.
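As an illustration of the two definitions above, the following sketch (assuming numpy) derives the zenith angle A and the azimuth angle B from a lip normal vector expressed in global coordinates; it is not part of the patented method, just a restatement of the spherical-coordinate convention in code.

```python
import numpy as np

def lip_pose_from_normal(normal: np.ndarray) -> tuple:
    """Return (A, B): zenith angle of the lip normal relative to the z axis and
    azimuth angle of its x-y projection relative to the x axis, both in radians."""
    n = normal / np.linalg.norm(normal)
    zenith = np.arccos(np.clip(n[2], -1.0, 1.0))       # angle A: pi/2 means looking straight ahead
    azimuth = np.arctan2(n[1], n[0]) % (2 * np.pi)     # angle B: 0 means the lips face forward
    return float(zenith), float(azimuth)

# Example: a lip normal pointing straight ahead along +x gives A = pi/2 and B = 0.
A, B = lip_pose_from_normal(np.array([1.0, 0.0, 0.0]))
```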
Optionally, the process of selecting, in step S101, the limb motion data in the limb motion video library, which is matched with the voice to be broadcasted, as the target limb motion data may include:
firstly, determining a target limb action matched with voice content of voice to be broadcasted;
and then selecting the limb movement data corresponding to the target limb movement in the limb movement video library as target limb movement data.
The matching relationship between voice content and limb actions can be preset. For example, when the voice content instructs the user to click a certain part of the device, the matched limb action may be pointing at that part with a hand; when the voice content greets the user, the matched limb actions may be nodding and waving.
It should be noted that, a section of voice to be broadcasted may relate to various voice contents, so that one target limb action or a plurality of target limb actions may be matched. Correspondingly, the target limb action data corresponding to the voice to be broadcasted can be one piece of data corresponding to one action or multiple pieces of data corresponding to multiple actions.
After the target limb movement is determined, the target limb movement data can be found in the limb movement video library. In the video of the target limb movement data, the movement executed by the virtual person is consistent with the target limb movement.
As previously described, in some embodiments the virtual person may have different expressions in different limb action videos, in which case the limb action data includes an expression tag that represents the expression of the virtual person when the limb action video was rendered.
In this case, when selecting the target limb movement data, it is necessary to consider the limb movement and the expression label at the same time, in other words, the limb movement data corresponding to the target limb movement in the limb movement video library is selected as the target limb movement data, which may include:
selecting limb motion data which corresponds to a target limb motion and has the same expression label as the voice to be broadcasted from a limb motion video library as target limb motion data; the expression label of the voice to be broadcasted is determined according to the voice content of the voice to be broadcasted.
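A minimal sketch of this selection step, reusing the LimbActionData records sketched earlier; the mapping from voice content to a target action and expression tag is application-specific and assumed to have been done already.

```python
from typing import List, Optional

def select_target_limb_action_data(
    library: List["LimbActionData"],
    target_action: str,                    # limb action matched to the voice content
    expression_tag: Optional[str] = None,  # expression tag derived from the voice content, if used
) -> Optional["LimbActionData"]:
    """Return the library entry whose action (and expression tag, if given) matches."""
    for entry in library:
        if entry.action_name != target_action:
            continue
        if expression_tag is None or entry.expression_tag == expression_tag:
            return entry
    return None   # no matching limb action data in the library
```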
S102, rendering lip videos according to lip gesture information and voice to be broadcasted.
The limb movement videos in steps S102 and S103 refer to the limb movement videos included in the target limb movement data selected in step S101. Similarly, the lip position information and the lip posture information in steps S102 and S103 refer to the lip position information and the lip posture information included in the target limb motion data selected in S101.
Optionally, the performing of step S102 may include:
a1, synchronizing a time axis of a limb action video and a time axis of voice to be broadcasted;
a2, for each audio frame in the voice to be broadcasted, acquiring lip gesture data of an action video frame corresponding to the audio frame from lip gesture information, and rendering according to the audio frame and the lip gesture data to obtain a lip video frame corresponding to the audio frame; wherein, the action video frame refers to a video frame of the limb action video; lip video frames refer to video frames that make up lip video.
The specific implementation process of the step A1 is as follows:
if only one target limb movement data is determined, only one limb movement video needs to be synchronized in A1, in this case, the starting time of the limb movement video is set as the starting time of the voice to be broadcast, and the ending time of the limb movement video is set as the ending time of the voice to be broadcast;
after the setting is completed, if the duration of the limb action video is consistent with the duration of the voice to be broadcasted, the time axis is successfully synchronized, and the step A1 is ended; if the duration of the limb action video is longer than the duration of the voice to be broadcasted, the duration of the limb action video can be shortened by accelerating the video playing speed and deleting a plurality of video frames in the video, so that the duration of the limb action video and the duration of the video are consistent; if the duration of the limb action video is smaller than the duration of the voice to be broadcasted, the duration of the limb action video can be prolonged by slowing down the video playing speed, copying a plurality of video frames in the video, inserting the copied video frames into the original video and the like, so that the duration of the limb action video and the duration of the limb action video are consistent, after the duration of the limb action video and the duration of the video are consistent, the time axis synchronization is successful, and the step A1 is ended.
If a plurality of pieces of target limb motion data are determined in S101, a plurality of limb action videos need to be synchronized in A1. In this case, the play order of the limb action videos is determined first; for example, if the voice content indicates that action 1 is performed first and action 2 later, the limb action video corresponding to action 1 is played first and the limb action video corresponding to action 2 is played after it. The limb action videos are then spliced in sequence according to the play order to obtain a spliced video.
After the spliced video is obtained, the time axis of the spliced video and the time axis of the voice to be broadcasted can be synchronized in the same way as for a single video; the specific steps are described above and are not repeated here.
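A minimal sketch of the time-axis alignment described above: given the number of action video frames (of a single or spliced limb action video) and the number of audio frames, it uniformly drops or duplicates action frames so that both timelines start and end together. The uniform resampling strategy is an assumption for illustration; the patent also allows changing the playback speed instead.

```python
import numpy as np

def synchronize_action_frames(num_action_frames: int, num_audio_frames: int) -> np.ndarray:
    """Map every audio frame to one action video frame by uniformly dropping or
    duplicating action frames (a sketch of step A1)."""
    indices = np.linspace(0, num_action_frames - 1, num=num_audio_frames)
    return np.round(indices).astype(int)

# Example: a 100-frame action video paired with 150 audio frames duplicates roughly
# every second action frame; paired with 50 audio frames it drops every second frame.
frame_map = synchronize_action_frames(100, 150)
```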
In step A2, for each audio frame in the voice to be broadcasted, the audio frame and the lip gesture data of the action video frame corresponding to that audio frame may be input into a mouth-shape real-time synthesizer for mouth-shape rendering, so as to obtain the lip video frame corresponding to that audio frame.
The mouth-shape real-time synthesizer is used to calculate, from the voice signal and the lip gesture data, the mouth shape produced when the specific voice is uttered under the specific lip pose.
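A sketch of the per-frame loop of step A2, reusing the frame map from the previous sketch; `mouth_shape_synthesizer` is a placeholder for whatever real-time mouth-shape synthesis model is actually used, and its call signature is an assumption.

```python
def render_lip_video(audio_frames, frame_map, lip_pose_info, mouth_shape_synthesizer):
    """Render one lip video frame per audio frame of the voice to be broadcasted."""
    lip_frames = []
    for audio_idx, audio_frame in enumerate(audio_frames):
        action_idx = frame_map[audio_idx]        # action video frame matched in step A1
        pose = lip_pose_info[action_idx]         # (zenith, azimuth) lip pose for that frame
        # Render the mouth shape for this audio frame under the given lip pose.
        lip_frames.append(mouth_shape_synthesizer(audio_frame, pose))
    return lip_frames
```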
And S103, based on the lip position information, fusing the lip video and the limb action video to obtain a virtual person interaction video for outputting the voice to be broadcasted.
Optionally, the performing of step S103 may include:
for each action video frame, acquiring lip position data of the action video frame from the lip position information, and superposing the lip video frame corresponding to the action video frame at a position indicated by the lip position data in the action video frame to obtain an interactive video frame corresponding to the action video frame; wherein, a plurality of continuous interactive video frames form virtual human interactive video.
In S102, the time axis of the limb action video has already been synchronized with the time axis of the voice to be broadcasted, so each audio frame of the voice to be broadcasted corresponds to exactly one action video frame of the limb action video; the audio frames and the action video frames are in one-to-one correspondence.
Each lip video frame of the lip video is synthesized from one audio frame of the voice to be broadcasted, so the audio frames and the lip video frames are also in one-to-one correspondence. Therefore, a one-to-one correspondence can be determined between the lip video frames of the lip video and the action video frames of the limb action video.
Based on this correspondence, in step S103, each lip video frame and its corresponding action video frame can be superimposed and fused in real time, as the lip video is rendered, according to the lip position data of that action video frame, to obtain an interactive video frame. The video composed of the resulting sequence of continuous interactive video frames is the interactive video used for outputting the voice to be broadcasted.
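As a minimal sketch of this fusion step, the function below pastes each rendered lip frame into its corresponding action video frame at a pixel position; the conversion of the recorded three-dimensional lip position into that pixel position, and the simple hard paste without edge blending, are assumptions made for illustration.

```python
import numpy as np

def fuse_frame(action_frame: np.ndarray, lip_frame: np.ndarray, top_left: tuple) -> np.ndarray:
    """Superimpose one lip video frame onto its action video frame to get an interactive frame."""
    out = action_frame.copy()
    y, x = top_left                        # pixel position derived from the lip position data
    h, w = lip_frame.shape[:2]
    out[y:y + h, x:x + w] = lip_frame      # hard paste; a real system would blend the edges
    return out

def fuse_videos(action_frames, lip_frames, lip_pixel_positions):
    # One interactive video frame per action frame, in timeline order.
    return [fuse_frame(a, l, p) for a, l, p in zip(action_frames, lip_frames, lip_pixel_positions)]
```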
The interactive video can be played on a terminal device supporting the virtual person interactive video.
The method provided by the embodiment can be executed by the server at the cloud. The server generates the interactive video in real time by executing the method, and pushes the interactive video to the terminal equipment for display in a video stream mode.
The method provided in this embodiment may also be executed by the terminal device. The terminal equipment can download the limb action video library from the server to the local in advance, then execute the method based on the downloaded limb action video library, and generate and play the interactive video in real time at the local.
The application provides a virtual person interactive video rendering method comprising: obtaining a voice to be broadcasted; selecting limb motion data matched with the voice to be broadcasted from a limb motion video library as target limb motion data, the target limb motion data comprising a limb motion video rendered in advance based on a limb motion of the virtual person, together with lip position information and lip posture information of the limb motion video; rendering a lip video according to the lip gesture information and the voice to be broadcasted; and fusing the lip video and the limb motion video based on the lip position information to obtain a virtual person interactive video for outputting the voice to be broadcasted. With this scheme, only the lip video needs to be rendered in real time, and the lip video and the pre-rendered limb motion video can then be synthesized into a complete virtual person interactive video, so the amount of computation required to render the virtual person interactive video in real time is significantly reduced.
Corresponding to the virtual person interactive video rendering method provided in the embodiments of the present application, an embodiment of the present application further provides a virtual person interactive video rendering apparatus. Please refer to fig. 4, which is a schematic structural diagram of the apparatus; the apparatus may include the following units.
An obtaining unit 401, configured to obtain a voice to be broadcasted; selecting limb motion data matched with voice to be broadcasted in a limb motion video library as target limb motion data; the target limb motion data comprise limb motion videos which are rendered in advance based on limb motions of the virtual person, lip position information and lip posture information of the limb motion videos;
a rendering unit 402, configured to render lip video according to lip gesture information and voice to be broadcasted;
and the fusion unit 403 is configured to fuse the lip video and the limb action video based on the lip position information, and obtain a virtual person interaction video for outputting the voice to be broadcasted.
Optionally, the rendering unit 402 is specifically configured to, when rendering the lip video according to the lip gesture information and the voice to be broadcasted:
synchronizing a time axis of the limb action video and a time axis of the voice to be broadcasted;
for each audio frame in the voice to be broadcasted, lip gesture data of an action video frame corresponding to the audio frame is obtained from lip gesture information, and the lip video frame corresponding to the audio frame is synthesized according to the audio frame and the lip gesture data; wherein, the action video frame refers to a video frame of the limb action video; lip video frames refer to video frames that make up lip video.
Optionally, the fusion unit 403 fuses the lip video and the limb action video based on the lip position information, so as to obtain the virtual human interaction video for outputting the voice to be broadcasted, which is specifically used for:
for each action video frame, acquiring lip position data of the action video frame from the lip position information, and superposing the lip video frame corresponding to the action video frame at a position indicated by the lip position data in the action video frame to obtain an interactive video frame corresponding to the action video frame; wherein, a plurality of continuous interactive video frames form virtual human interactive video.
Optionally, when the obtaining unit 401 selects, as the target limb motion data, the limb motion data matching the voice to be broadcasted in the limb motion video library, the method is specifically used for:
determining a target limb action matched with the voice content of the voice to be broadcasted;
and selecting limb motion data corresponding to the target limb motion from the limb motion video library as target limb motion data.
Optionally, the limb action data includes an expression tag, and the expression tag characterizes the expression of the virtual person when rendering the limb action video;
when the obtaining unit 401 selects the limb movement data corresponding to the target limb movement in the limb movement video library as the target limb movement data, the method is specifically used for:
selecting limb motion data which corresponds to a target limb motion and has the same expression label as the voice to be broadcasted from a limb motion video library as target limb motion data; the expression label of the voice to be broadcasted is determined according to the voice content of the voice to be broadcasted.
The specific working principle of the virtual person interactive video rendering device provided in this embodiment may refer to relevant steps in the virtual person interactive video rendering method provided in this embodiment, and will not be described here again.
The application provides a virtual person interactive video rendering apparatus, in which the obtaining unit 401 obtains a voice to be broadcasted and selects limb motion data matched with the voice to be broadcasted from a limb motion video library as target limb motion data, the target limb motion data comprising a limb motion video rendered in advance based on a limb motion of the virtual person, together with lip position information and lip posture information of the limb motion video; the rendering unit 402 renders a lip video according to the lip gesture information and the voice to be broadcasted; and the fusion unit 403 fuses the lip video and the limb motion video based on the lip position information to obtain a virtual person interactive video for outputting the voice to be broadcasted. With this scheme, only the lip video needs to be rendered in real time, and the lip video and the pre-rendered limb action video can then be synthesized into a complete virtual person interactive video, so the amount of computation required to render the virtual person interactive video in real time is significantly reduced.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
It should be noted that the terms "first," "second," and the like herein are merely used for distinguishing between different devices, modules, or units and not for limiting the order or interdependence of the functions performed by such devices, modules, or units.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. The method for rendering the virtual human interactive video is characterized by comprising the following steps of:
obtaining voice to be broadcasted; selecting limb motion data matched with the voice to be broadcasted in a limb motion video library as target limb motion data; the target limb action data comprise limb action videos which are rendered in advance based on limb actions of a virtual person, and lip position information and lip posture information of the limb action videos;
rendering lip videos according to the lip gesture information and the voice to be broadcasted;
and based on the lip position information, fusing the lip video and the limb action video to obtain a virtual person interactive video for outputting the voice to be broadcasted.
2. The method of claim 1, wherein the rendering lip video from the lip pose information and the voice to be announced comprises:
synchronizing the time axis of the limb action video and the time axis of the voice to be broadcasted;
for each audio frame in the voice to be broadcasted, lip gesture data of an action video frame corresponding to the audio frame is obtained from the lip gesture information, and a lip video frame corresponding to the audio frame is rendered according to the audio frame and the lip gesture data; wherein, the action video frame refers to a video frame of the limb action video; the lip video frames refer to video frames that make up the lip video.
3. The method according to claim 2, wherein the fusing the lip video and the limb action video based on the lip position information to obtain a virtual human interactive video for outputting the voice to be broadcasted comprises:
for each action video frame, acquiring lip position data of the action video frame from the lip position information, and superposing the lip video frame corresponding to the action video frame at a position indicated by the lip position data in the action video frame to obtain an interactive video frame corresponding to the action video frame; and the virtual human interactive video is formed by a plurality of continuous interactive video frames.
4. The method according to claim 1, wherein selecting the limb motion data in the limb motion video library that matches the voice to be broadcasted as the target limb motion data comprises:
determining a target limb action matched with the voice content of the voice to be broadcasted;
and selecting limb motion data corresponding to the target limb motion from a limb motion video library as target limb motion data.
5. The method of claim 4, wherein the limb-motion data includes an expression tag that characterizes the expression of the virtual person when rendering a limb-motion video;
the selecting the limb movement data corresponding to the target limb movement in the limb movement video library as target limb movement data comprises the following steps:
selecting limb motion data which corresponds to the target limb motion and has the same expression label as the voice to be broadcasted from the limb motion video library as target limb motion data; the expression label of the voice to be broadcasted is determined according to the voice content of the voice to be broadcasted.
6. A virtual human interactive video rendering apparatus, comprising:
the acquisition unit is used for acquiring voice to be broadcasted; selecting limb motion data matched with the voice to be broadcasted in a limb motion video library as target limb motion data; the target limb action data comprise limb action videos which are rendered in advance based on limb actions of a virtual person, and lip position information and lip posture information of the limb action videos;
the rendering unit is used for rendering lip videos according to the lip gesture information and the voice to be broadcasted;
and the fusion unit is used for fusing the lip video and the limb action video based on the lip position information to obtain a virtual person interactive video for outputting the voice to be broadcasted.
7. The apparatus of claim 6, wherein the rendering unit is configured to, when rendering the lip video according to the lip gesture information and the voice to be broadcasted, specifically:
synchronizing the time axis of the limb action video and the time axis of the voice to be broadcasted;
for each audio frame in the voice to be broadcasted, acquiring lip gesture data of an action video frame corresponding to the audio frame from the lip gesture information, and synthesizing a lip video frame corresponding to the audio frame according to the audio frame and the lip gesture data; wherein, the action video frame refers to a video frame of the limb action video; the lip video frames refer to video frames that make up the lip video.
8. The apparatus of claim 7, wherein the fusion unit is configured to, based on the lip position information, fuse the lip video and the limb action video to obtain a virtual human interactive video for outputting the voice to be broadcasted, when the virtual human interactive video is specifically configured to:
for each action video frame, acquiring lip position data of the action video frame from the lip position information, and superposing the lip video frame corresponding to the action video frame at a position indicated by the lip position data in the action video frame to obtain an interactive video frame corresponding to the action video frame; and the virtual human interactive video is formed by a plurality of continuous interactive video frames.
9. The apparatus of claim 6, wherein the obtaining unit is configured to, when selecting, as the target limb motion data, limb motion data in the limb motion video library that matches the voice to be broadcasted, specifically:
determining a target limb action matched with the voice content of the voice to be broadcasted;
and selecting limb motion data corresponding to the target limb motion from a limb motion video library as target limb motion data.
10. The apparatus of claim 9, wherein the limb-motion data includes an expression tag that characterizes the expression of the virtual person when rendering a limb-motion video;
the acquiring unit selects the limb motion data corresponding to the target limb motion from the limb motion video library as target limb motion data, and is specifically configured to:
selecting limb motion data which corresponds to the target limb motion and has the same expression label as the voice to be broadcasted from the limb motion video library as target limb motion data; the expression label of the voice to be broadcasted is determined according to the voice content of the voice to be broadcasted.
CN202310025137.1A 2023-01-09 2023-01-09 Virtual person interactive video rendering method and device Pending CN116016986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310025137.1A CN116016986A (en) 2023-01-09 2023-01-09 Virtual person interactive video rendering method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310025137.1A CN116016986A (en) 2023-01-09 2023-01-09 Virtual person interactive video rendering method and device

Publications (1)

Publication Number Publication Date
CN116016986A true CN116016986A (en) 2023-04-25

Family

ID=86019051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310025137.1A Pending CN116016986A (en) 2023-01-09 2023-01-09 Virtual person interactive video rendering method and device

Country Status (1)

Country Link
CN (1) CN116016986A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058286A (en) * 2023-10-13 2023-11-14 北京蔚领时代科技有限公司 Method and device for generating video by using word driving digital person
CN117058286B (en) * 2023-10-13 2024-01-23 北京蔚领时代科技有限公司 Method and device for generating video by using word driving digital person

Similar Documents

Publication Publication Date Title
JP7408048B2 (en) Anime character driving method and related device based on artificial intelligence
CN109462776B (en) Video special effect adding method and device, terminal equipment and storage medium
US11158102B2 (en) Method and apparatus for processing information
CN107340859B (en) Multi-modal interaction method and system of multi-modal virtual robot
JP6936298B2 (en) Methods and devices for controlling changes in the mouth shape of 3D virtual portraits
KR100826443B1 (en) Image processing method and system
CN110266973A (en) Method for processing video frequency, device, computer readable storage medium and computer equipment
CN110047121B (en) End-to-end animation generation method and device and electronic equipment
JP2022531057A (en) Interactive target drive methods, devices, devices, and recording media
CN110162598B (en) Data processing method and device for data processing
US20230047858A1 (en) Method, apparatus, electronic device, computer-readable storage medium, and computer program product for video communication
CN113228163A (en) Real-time text and audio based face reproduction
CN114245099B (en) Video generation method and device, electronic equipment and storage medium
CN114357135A (en) Interaction method, interaction device, electronic equipment and storage medium
JP2022530935A (en) Interactive target drive methods, devices, devices, and recording media
CN113923462A (en) Video generation method, live broadcast processing method, video generation device, live broadcast processing device and readable medium
CN113822972B (en) Video-based processing method, device and readable medium
CN112673400A (en) Avatar animation
JP2024513640A (en) Virtual object action processing method, device, and computer program
CN113542624A (en) Method and device for generating commodity object explanation video
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
CN116016986A (en) Virtual person interactive video rendering method and device
CN113704390A (en) Interaction method and device of virtual objects, computer readable medium and electronic equipment
CN113282791A (en) Video generation method and device
CN117519825A (en) Digital personal separation interaction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination