CN117058286B - Method and device for generating video by using word driving digital person - Google Patents
- Publication number
- CN117058286B CN202311322110.5A
- Authority
- CN
- China
- Prior art keywords
- text content
- digital
- person
- generating
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 48
- 230000008451 emotion Effects 0.000 claims abstract description 62
- 230000008921 facial expression Effects 0.000 claims abstract description 55
- 230000009471 action Effects 0.000 claims abstract description 25
- 230000000694 effects Effects 0.000 claims abstract description 25
- 230000001815 facial effect Effects 0.000 claims abstract description 25
- 230000033001 locomotion Effects 0.000 claims abstract description 15
- 230000007704 transition Effects 0.000 claims abstract description 15
- 238000005286 illumination Methods 0.000 claims abstract description 13
- 239000003086 colorant Substances 0.000 claims abstract description 11
- 238000002372 labelling Methods 0.000 claims abstract description 6
- 230000001360 synchronised effect Effects 0.000 claims description 10
- 230000001133 acceleration Effects 0.000 claims description 8
- 238000005457 optimization Methods 0.000 claims description 8
- 238000004140 cleaning Methods 0.000 claims description 7
- 230000002996 emotional effect Effects 0.000 claims description 6
- 230000007935 neutral effect Effects 0.000 claims description 6
- 238000000605 extraction Methods 0.000 claims description 5
- 238000007781 pre-processing Methods 0.000 claims description 4
- 230000014509 gene expression Effects 0.000 claims description 3
- 238000005406 washing Methods 0.000 abstract description 3
- 238000004458 analytical method Methods 0.000 description 12
- 238000005516 engineering process Methods 0.000 description 8
- 230000015572 biosynthetic process Effects 0.000 description 6
- 238000003786 synthesis reaction Methods 0.000 description 6
- 238000010586 diagram Methods 0.000 description 5
- 238000013136 deep learning model Methods 0.000 description 4
- 230000011218 segmentation Effects 0.000 description 4
- 239000000463 material Substances 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 230000008520 organization Effects 0.000 description 2
- 238000004091 panning Methods 0.000 description 2
- 238000009877 rendering Methods 0.000 description 2
- 230000033764 rhythmic process Effects 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/268—Morphological analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
- G10L21/055—Time compression or expansion for synchronising with other signals, e.g. video signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Processing Or Creating Images (AREA)
Abstract
The application provides a method and a device for generating video with a text-driven digital person, comprising the following steps: cleaning and standardizing the input text content, and extracting key information and identifying entities from the text content; labeling emotion polarities so as to determine emotion tendencies; based on the key information and the entities, generating a virtual scene related to the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies; creating a digital human anchor, and placing the generated digital person in the virtual scene; generating facial expressions and actions of the digital person according to the emotion tendencies; generating voice according to the text content; synchronizing the digital person's movements, facial expressions, speech, and mouth shape; and generating a video according to the virtual scene, the digital person, the synchronization information, the facial expressions and the actions, adding transition effects at scene cuts and camera shot changes to produce the final video. The method and the device directly drive a digital person to generate video from text content, improving video generation efficiency.
Description
Technical Field
The present application relates to the field of video creation, and more particularly, to a method and apparatus for driving a digital person with text to generate video.
Background
Automatic video generation can accelerate business processes such as news broadcasting and the production of teaching materials, improving production efficiency and making content creation and dissemination faster. A method for generating video with a digital person can meet users' needs for personalized, targeted content delivery in different fields, and provides a more engaging, customized presentation of information.
There is currently no method to directly drive digital people to generate video based on text content.
Disclosure of Invention
The present application is directed to overcoming the problems in the art by providing a method and apparatus for generating video with a text-driven digital person.
The application provides a method for generating video with a text-driven digital person, comprising the following steps:
cleaning and standardizing the input text content;
extracting key information in the text content;
identifying an entity in the text content;
marking emotion polarities of the text contents to determine emotion tendencies of the text contents;
generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies;
creating a digital human anchor, and placing the generated digital person in the virtual scene;
generating facial expressions and actions of the digital person according to the emotion tendencies;
generating voice according to the text content;
synchronizing the digital person's movements, facial expressions, speech, and mouth shape;
and generating a video according to the virtual scene, the digital person, the synchronization information, the facial expressions and the actions, adding transition effects at scene cuts and camera shot changes to produce the final video.
Optionally, synchronizing the digital person's motion, facial expression, voice, and mouth shape includes:
based on the text content, synchronizing the digital person's motion, facial expression, voice, and mouth shape using a time axis and animation curves.
Optionally, the emotional tendency includes: positive, negative, or neutral.
Optionally, the method further comprises:
displaying a happy facial expression and action when the text content is positive;
displaying a sad or angry facial expression and action when the text content is negative.
Optionally, the transition effect includes: fade in and fade out, rotation and translation.
The application provides an apparatus for a text-driven digital person to generate video, comprising:
the preprocessing module is used for cleaning and standardizing the input text content;
the extraction module is used for extracting key information in the text content;
the identification module is used for identifying the entity in the text content;
the labeling module is used for labeling the emotion polarity of the text content so as to determine the emotion tendency of the text content;
the scene module is used for generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies;
the creation module is used for creating a digital person anchor and placing the generated digital person in the virtual scene;
the expression module is used for generating facial expressions and actions of the digital person according to the emotion tendencies;
the voice module is used for generating voice according to the text content;
the synchronization module is used for synchronizing the digital person's actions, facial expressions, voice, and mouth shape;
and the video module is used for generating a video according to the information of the virtual scene, the digital person, the synchronization, the facial expression and the action, and adding a transition effect when the scene is switched and the lens is changed to generate a final video.
Optionally, the synchronization module synchronizing the digital person's actions, facial expressions, voice, and mouth shape includes:
based on the text content, synchronizing the digital person's motion, facial expression, voice, and mouth shape using a time axis and animation curves.
Optionally, the emotional tendency includes: positive, negative, or neutral.
Optionally, the method further comprises:
displaying a happy facial expression and action when the text content is positive;
displaying a sad or angry facial expression and action when the text content is negative.
Optionally, the transition effect includes: fade in and fade out, rotation and translation.
Advantages and beneficial effects of the present application:
the application provides a method for generating video by an alphanumeric person, comprising the following steps: cleaning and standardizing the input text content; extracting key information in the text content; identifying an entity in the text content; marking emotion polarities of the text contents to determine emotion tendencies of the text contents; generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies; creating a digital man anchor, and placing the generated digital man in the virtual scene; generating facial expressions and actions of the digital person according to the emotion tendencies; generating voice according to the text content; synchronizing the digital person's movements, facial expressions, speech, and digital person's mouth shape; and generating a video according to the information of the virtual scene, the digital person, the synchronization, the facial expression and the action, and adding a transition effect when the scene is switched and the lens is changed to generate a final video. The method and the device directly drive the digital person to generate the video based on the text content, and improve the video generation efficiency.
Drawings
Fig. 1 is a schematic diagram of the method for generating video with a text-driven digital person in the present application.
Fig. 2 is a schematic diagram of the synchronization optimization in the present application.
Fig. 3 is a schematic diagram of the apparatus for generating video with a text-driven digital person in the present application.
Detailed Description
The present application is further described in conjunction with the drawings and detailed embodiments so that those skilled in the art may better understand the present application and practice it.
The following are examples of specific implementation provided for the purpose of illustrating the technical solutions to be protected in this application in detail, but this application may also be implemented in other ways than described herein, and one skilled in the art may implement this application by using different technical means under the guidance of the conception of this application, so this application is not limited by the following specific embodiments.
The application provides a method for generating video with a text-driven digital person, comprising the following steps: cleaning and standardizing the input text content; extracting key information from the text content; identifying entities in the text content; labeling the emotion polarity of the text content to determine its emotion tendency; generating a virtual scene related to the text content based on the key information and the entities, and selecting different background colors, illumination and environment settings according to the emotion tendency; creating a digital human anchor, and placing the generated digital person in the virtual scene; generating the digital person's facial expressions and actions according to the emotion tendency; generating voice according to the text content; synchronizing the digital person's movements, facial expressions, speech, and mouth shape; and generating a video according to the virtual scene, the digital person, the synchronization information, the facial expressions and the actions, adding transition effects at scene cuts and camera shot changes to produce the final video. The method and the device directly drive a digital person to generate video from text content, improving video generation efficiency.
Fig. 1 is a schematic diagram of the method for generating video with a text-driven digital person in the present application.
Referring to fig. 1, the steps of the method for generating video with a text-driven digital person provided in the present application include:
s101 cleans and normalizes the inputted text content.
First, the input text content is processed, primarily to clean and normalize the text. Specifically, the method comprises the following steps:
punctuation marks, spaces, special characters, etc. in the text are removed to facilitate more accurate word segmentation and part-of-speech tagging.
The text is segmented into individual words or vocabularies to facilitate keyword extraction and part-of-speech tagging. Chinese word segmentation may use jieba, THULAC, etc. tools.
Stop words that appear frequently in the text but do not help in extracting key information, such as "is" and "in", are removed.
The text is processed uniformly, e.g. unifying letter case, number formats, and the like.
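The cleaning, segmentation, stop-word removal and normalization steps above can be sketched as follows. This is a minimal, stdlib-only Python illustration: the stop-word list is a hypothetical placeholder, and a real pipeline would use jieba or THULAC for Chinese word segmentation as noted above.

```python
import re

# Hypothetical English stop-word list for illustration; a production
# system would load a full list (and use jieba/THULAC for Chinese).
STOP_WORDS = {"is", "the", "a", "an", "in", "of"}

def clean_and_normalize(text: str) -> list[str]:
    """Clean, normalize, and segment text into content words."""
    text = text.lower()                   # unify case
    text = re.sub(r"[^\w\s]", " ", text)  # strip punctuation/special chars
    return [t for t in text.split() if t not in STOP_WORDS]

print(clean_and_normalize("The CEO, in Beijing, announced a new product!"))
# → ['ceo', 'beijing', 'announced', 'new', 'product']
```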
S102 extracts key information in the text content.
This step is mainly for extracting key information from the text. The process can be performed in the following manner:
Keywords are extracted using the Bag-of-Words model or the TF-IDF (Term Frequency-Inverse Document Frequency) method.
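As one possible reading of the TF-IDF extraction step, the stdlib-only sketch below ranks the words of one tokenized document against a small corpus; the corpus contents are illustrative assumptions, not data from the patent.

```python
import math
from collections import Counter

def tfidf_keywords(docs: list[list[str]], doc_idx: int, top_k: int = 3) -> list[str]:
    """Rank the words of docs[doc_idx] by TF-IDF against the whole corpus."""
    n_docs = len(docs)
    df = Counter()                      # document frequency per word
    for doc in docs:
        df.update(set(doc))
    tf = Counter(docs[doc_idx])         # term frequency in the target doc
    scores = {
        w: (tf[w] / len(docs[doc_idx])) * math.log(n_docs / df[w])
        for w in tf
    }
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

corpus = [
    ["digital", "person", "video", "generation"],
    ["video", "editing", "software"],
    ["digital", "person", "speech", "synthesis"],
]
print(tfidf_keywords(corpus, 0))   # "generation" ranks first: rarest in corpus
```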
S103 identifies an entity in the text content.
This step is mainly for identifying person names, place names, organizations, and other entities in the text. Named Entity Recognition (NER) techniques may be used, such as deep learning models like CRF (conditional random field) models or BERT.
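The patent envisions CRF/BERT-based NER; as a hypothetical stand-in that shows only the input/output shape of this step, the sketch below uses a tiny gazetteer (dictionary) lookup with made-up entries.

```python
# Toy gazetteer; entries and labels are illustrative assumptions,
# not a substitute for a trained CRF/BERT entity recognizer.
GAZETTEER = {
    "Beijing": "PLACE",
    "Alice": "PERSON",
    "Acme Corp": "ORG",
}

def recognize_entities(text: str) -> list[tuple[str, str]]:
    """Return (entity, label) pairs found in the text."""
    return [(name, label) for name, label in GAZETTEER.items() if name in text]

print(recognize_entities("Alice gave a talk in Beijing."))
# → [('Beijing', 'PLACE'), ('Alice', 'PERSON')]
```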
S104, marking emotion polarities of the text contents to determine emotion tendencies of the text contents.
For part of speech tagging, tools such as THULAC may be used.
The purpose of emotion analysis is to determine the emotion tendency of the text; it can be based on dictionary methods or use deep learning models (e.g., LSTM, BERT).
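The dictionary-based variant mentioned above can be sketched as follows; the positive/negative word lists are illustrative assumptions, and a real system would use a full sentiment lexicon or an LSTM/BERT classifier.

```python
# Minimal dictionary-based polarity labeller; word lists are assumptions.
POSITIVE = {"great", "happy", "success", "win"}
NEGATIVE = {"sad", "failure", "angry", "loss"}

def emotion_tendency(tokens: list[str]) -> str:
    """Label tokenized text as positive, negative, or neutral."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(emotion_tendency(["great", "success"]))   # → positive
print(emotion_tendency(["sad", "loss"]))        # → negative
print(emotion_tendency(["report"]))             # → neutral
```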
S105, generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies.
Based on the key information in the text and the entity recognition result, virtual scenes related to the text content are generated by using 3D modeling and rendering technology. Specifically, the method comprises the following steps:
and creating corresponding 3D models, such as buildings, figures, props and the like, according to the entity identification result.
And a material and illumination technology is used for adding a vivid visual effect to the model.
According to the emotion tendencies of the text, different background colors, illumination and environment settings are selected, and the sense of reality and substitution of the scene are enhanced.
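Selecting background colors, illumination and environment settings per emotion tendency amounts to a lookup table; the sketch below shows one such mapping, where the concrete colors, light levels and environment names are illustrative assumptions, not values from the patent.

```python
# Hypothetical emotion-to-scene presets; values are illustrative only.
SCENE_PRESETS = {
    "positive": {"background": "#FFD966", "light_intensity": 1.2, "env": "sunny"},
    "negative": {"background": "#4A4A6A", "light_intensity": 0.6, "env": "overcast"},
    "neutral":  {"background": "#D9D9D9", "light_intensity": 1.0, "env": "studio"},
}

def scene_settings(tendency: str) -> dict:
    """Pick scene settings for an emotion tendency, defaulting to neutral."""
    return SCENE_PRESETS.get(tendency, SCENE_PRESETS["neutral"])

print(scene_settings("positive")["env"])   # → sunny
```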
S106 creates a digital person anchor, and places the generated digital person in the virtual scene.
A realistic 3D digital anchor may be automatically created from the text description using digital person generation techniques. Specifically, the method comprises the following steps:
from the text description, a corresponding 3D model and animation are created.
And setting proper facial expressions and limb actions for the digital person according to the emotion analysis result.
Voice content is set for the digital person using speech synthesis techniques. Speech synthesis may use Text-to-Speech (TTS) technology to convert text to speech.
S107, generating the facial expression and action of the digital person according to the emotion tendencies.
And automatically generating facial expressions and actions of the digital person according to emotion analysis of the text content and emotion characteristics of the digital person. Specifically, the method comprises the following steps:
and selecting corresponding facial expressions and limb actions according to the emotion tendencies of the text.
Different facial expressions and action templates can be preset for different types of emotions.
Further, the emotion tendency includes: positive, negative, or neutral. When the text content is positive, a happy facial expression and action are displayed; when the text content is negative, a sad or angry facial expression and action are displayed.
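The preset expression and action templates described above likewise reduce to a per-tendency lookup; the template names below are hypothetical placeholders for real facial-expression and animation clips.

```python
# Hypothetical expression/action templates per emotion tendency.
EXPRESSION_TEMPLATES = {
    "positive": {"face": "smile", "action": "open_gesture"},
    "negative": {"face": "sad_or_angry", "action": "closed_gesture"},
    "neutral":  {"face": "relaxed", "action": "idle"},
}

def expression_for(tendency: str) -> dict:
    """Select the expression/action template, defaulting to neutral."""
    return EXPRESSION_TEMPLATES.get(tendency, EXPRESSION_TEMPLATES["neutral"])

print(expression_for("positive"))
# → {'face': 'smile', 'action': 'open_gesture'}
```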
S108, generating voice according to the text content.
Based on text content, natural and smooth speech is generated using TTS technology, synchronized with the mouth shape of the digital person. Specifically, the method comprises the following steps:
a high quality Speech synthesis engine is selected, such as Google Text-to-Speech, etc.
S109 synchronizes the digital person's motion, facial expression, voice, and mouth shape.
The generated digital person is placed in the virtual scene, and its actions, facial expressions and voice are synchronized according to the text content. Specifically, the method comprises the following steps:
the model and animation of the digital person are synchronized with the model and animation in the scene.
The time axis and animation curve are used to synchronize the digital person's performance with the text content.
According to the text content and the mouth shape of the digital person, a proper voice rhythm and intonation are selected.
In achieving synchronization of digital persons with virtual scenes, the following mathematical formulas may be used to optimize:
the synchronization may also be optimized as shown in fig. 2.
Synchronizing the animation of the digital person with the text content using a timeline synchronization formula:
t = sqrt(dist / g) + 0.5 * sin(2 * π * freq * t)
where t represents time, dist the distance between the digital person and the target position, g the gravitational acceleration, freq the vibration frequency, and 0.5 * sin(2 * π * freq * t) the influence of vibration. The formula may synchronize the movement of the digital person with the time axis.
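Since t appears on both sides of the timeline formula, it defines t implicitly; one reasonable reading (an assumption here, not stated in the source) is to iterate it as a fixed point starting from the pure sqrt(dist / g) term:

```python
import math

def sync_time(dist: float, g: float = 9.8, freq: float = 1.0,
              iters: int = 50) -> float:
    """Iterate t = sqrt(dist/g) + 0.5*sin(2*pi*freq*t) as a fixed point.

    Fixed-point iteration is an assumed interpretation of the implicit
    formula; it converges when the sine term varies slowly (small freq).
    """
    base = math.sqrt(dist / g)      # pure fall-time term
    t = base
    for _ in range(iters):
        t = base + 0.5 * math.sin(2 * math.pi * freq * t)
    return t

# With a low vibration frequency the iteration settles quickly:
print(sync_time(4.9, freq=0.1))
```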
To make the digital person's actions more natural and lifelike, an animation curve optimization formula is used:
x(t) = x0 + v0 * t + 0.5 * a * t^2
where x(t) represents the position of the object at time t, x0 the initial position, v0 the initial velocity, and a the acceleration. The formula can represent the motion trajectory of an object under an external force, so as to generate a more realistic animation curve.
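The animation-curve formula is ordinary constant-acceleration kinematics and can drive per-frame keyframes directly; the defaults below (an object tossed upward under gravity, sampled at 4 fps) are illustrative assumptions.

```python
def animation_position(t: float, x0: float = 0.0, v0: float = 2.0,
                       a: float = -9.8) -> float:
    """Position x(t) = x0 + v0*t + 0.5*a*t^2 along the animation curve."""
    return x0 + v0 * t + 0.5 * a * t * t

# Sample the curve to produce keyframes, e.g. 4 fps over one second:
frames = [round(animation_position(i / 4), 3) for i in range(5)]
print(frames)
```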
To ensure consistency of the digital person across different media, a cross-media synchronization optimization formula is used:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
where (x1, y1, z1) and (x2, y2, z2) represent the position coordinates of the digital person on the two media, respectively, and d represents the distance between them. The formula can be used to calculate the distance between the digital person's representations on different media, so as to achieve cross-media synchronization.
S110 generates a video according to the virtual scene, the digital person, the synchronization information, the facial expressions and the actions, adding transition effects at scene cuts and camera shot changes to produce the final video.
Smooth transitional effects such as fade-in and fade-out, rotation, panning, etc. are added upon scene cuts and shot changes. Specifically, the method comprises the following steps:
suitable transitional effects are selected, such as rotation, translation, scaling, etc.
The creation and editing of transitional effects is performed using video editing software or a dedicated transitional effect tool.
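A fade (cross-dissolve) transition such as the one named above is a linear blend between the outgoing and incoming frames; the sketch below blends lists of per-pixel brightness values as a stand-in for full video frames, which a real editor would process instead.

```python
def crossfade(frame_a: list[float], frame_b: list[float],
              alpha: float) -> list[float]:
    """Linear blend: alpha=0 shows frame_a, alpha=1 shows frame_b."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(frame_a, frame_b)]

a, b = [1.0, 1.0], [0.0, 0.5]
for step in range(3):        # three-step fade from scene A to scene B
    print(crossfade(a, b, step / 2))
```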
Further, emotion in voice is analyzed through AI, and a proper virtual scene and transition effect are recommended according to emotion characteristics of the digital person. Specifically, the method comprises the following steps:
emotion analysis is performed on the speech using the emotion analysis model.
And selecting corresponding virtual scenes and transition effects according to emotion analysis results.
In addition, the recommendation algorithm can be continuously optimized according to the user feedback and the data, and the recommendation accuracy and the user satisfaction are improved.
Fig. 3 is a schematic diagram of an apparatus for driving digital persons to generate video in the present application.
Referring to fig. 3, the apparatus for generating video with a text-driven digital person provided in the present application includes: a preprocessing module 301, an extraction module 302, a recognition module 303, an annotation module 304, a scene module 305, a creation module 306, an expression module 307, a speech module 308, a synchronization module 309, and a video module 310.
The preprocessing module 301 is used for cleaning and normalizing the input text content.
First, the input text content is processed, primarily to clean and normalize the text. Specifically, the method comprises the following steps:
punctuation marks, spaces, special characters, etc. in the text are removed to facilitate more accurate word segmentation and part-of-speech tagging.
The text is segmented into individual words or vocabularies to facilitate keyword extraction and part-of-speech tagging. Chinese word segmentation may use jieba, THULAC, etc. tools.
Stop words that appear frequently in the text but do not help in extracting key information, such as "is" and "in", are removed.
The text is processed uniformly, e.g. unifying letter case, number formats, and the like.
And the extracting module 302 is used for extracting key information in the text content.
This step is mainly for extracting key information from the text. The process can be performed in the following manner:
Keywords are extracted using the Bag-of-Words model or the TF-IDF (Term Frequency-Inverse Document Frequency) method.
And the identifying module 303 is used for identifying the entity in the text content.
This step is mainly for identifying person names, place names, organizations, and other entities in the text. Named Entity Recognition (NER) techniques may be used, such as deep learning models like CRF (conditional random field) models or BERT.
And the labeling module 304 is configured to label the emotion polarity of the text content, so as to determine emotion tendencies of the text content.
For part of speech tagging, tools such as THULAC may be used.
The purpose of emotion analysis is to determine the emotion tendency of the text; it can be based on dictionary methods or use deep learning models (e.g., LSTM, BERT).
A scene module 305, configured to generate a virtual scene related to the text content based on the key information and the entity in the text content, and select different background colors, illumination, and environment settings according to the emotion tendencies.
Based on the key information in the text and the entity recognition result, virtual scenes related to the text content are generated by using 3D modeling and rendering technology. Specifically, the method comprises the following steps:
Corresponding 3D models, such as buildings, figures and props, are created according to the entity recognition results.
Material and illumination techniques are used to add vivid visual effects to the models.
According to the emotion tendency of the text, different background colors, illumination and environment settings are selected to enhance the realism and immersion of the scene.
A creation module 306, configured to create a digital person anchor, and place the generated digital person in the virtual scene.
A realistic 3D digital anchor may be automatically created from the text description using digital person generation techniques. Specifically, the method comprises the following steps:
from the text description, a corresponding 3D model and animation are created.
And setting proper facial expressions and limb actions for the digital person according to the emotion analysis result.
Voice content is set for the digital person using speech synthesis techniques. Speech synthesis may use Text-to-Speech (TTS) technology to convert text to speech.
And the expression module 307 is used for generating facial expressions and actions of the digital person according to the emotion tendencies.
And automatically generating facial expressions and actions of the digital person according to emotion analysis of the text content and emotion characteristics of the digital person. Specifically, the method comprises the following steps:
and selecting corresponding facial expressions and limb actions according to the emotion tendencies of the text.
Different facial expressions and action templates can be preset for different types of emotions.
Further, the emotion tendency includes: positive, negative, or neutral. When the text content is positive, a happy facial expression and action are displayed; when the text content is negative, a sad or angry facial expression and action are displayed.
A voice module 308, configured to generate voice according to the text content.
Based on text content, natural and smooth speech is generated using TTS technology, synchronized with the mouth shape of the digital person. Specifically, the method comprises the following steps:
a high quality Speech synthesis engine is selected, such as Google Text-to-Speech, etc.
A synchronization module 309, for synchronizing the digital person's actions, facial expressions, voice, and mouth shape.
The generated digital person is placed in the virtual scene, and its actions, facial expressions and voice are synchronized according to the text content. Specifically, the method comprises the following steps:
the model and animation of the digital person are synchronized with the model and animation in the scene.
The time axis and animation curve are used to synchronize the digital person's performance with the text content.
According to the text content and the mouth shape of the digital person, a proper voice rhythm and intonation are selected.
In achieving synchronization of the digital person with the virtual scene, the following mathematical formulas may be used for optimization:
synchronizing the animation of the digital person with the text content using a timeline synchronization formula:
t = sqrt(dist / g) + 0.5 * sin(2 * π * freq * t)
where t represents time, dist represents the distance between the digital person and the target position, g represents gravitational acceleration, freq represents the vibration frequency, and 0.5 * sin(2 * π * freq * t) represents the influence of vibration. The formula may synchronize the movement of the digital person with the time axis.
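Because t appears on both sides of the formula, one way to evaluate it is as a fixed-point equation. The iterative treatment below is an assumed numerical method, not specified by the patent.

```python
import math

def sync_time(dist, g, freq, iterations=100):
    """Evaluate t = sqrt(dist / g) + 0.5 * sin(2 * pi * freq * t) by
    fixed-point iteration, since t appears on both sides of the formula."""
    t = math.sqrt(dist / g)  # initial guess: ignore the vibration term
    for _ in range(iterations):
        t = math.sqrt(dist / g) + 0.5 * math.sin(2 * math.pi * freq * t)
    return t
```

With freq = 0 the vibration term vanishes and t reduces to sqrt(dist / g), which gives a quick sanity check on the iteration.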
To make the actions of the digital person more natural and lifelike, an animation curve optimization formula is used:
x(t) = x0 + v0 * t + 0.5 * a * t^2
where x(t) represents the position of the object at time t, x0 represents the initial position, v0 represents the initial velocity, a represents the acceleration, and t represents time. The formula can be used to represent the motion trajectory of an object under an external force, so as to generate a more realistic animation curve.
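The animation curve formula is ordinary constant-acceleration kinematics and can be evaluated directly:

```python
def animation_position(t, x0, v0, a):
    """Position under constant acceleration: x(t) = x0 + v0*t + 0.5*a*t^2.
    Used here to sample an animation curve at a given time t."""
    return x0 + v0 * t + 0.5 * a * t ** 2
```

Sampling this function at each frame time yields the smooth, physically plausible trajectory the description refers to.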
To ensure the consistency of the digital person across different media, a cross-media synchronization optimization formula is used:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
where (x1, y1, z1) and (x2, y2, z2) represent the position coordinates of the digital person on the two media, respectively, and d represents the distance between those positions. The formula can be used to calculate the distance between the digital person's representations on different media, so as to realize cross-media synchronization.
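The cross-media formula is the Euclidean distance between the two coordinate triples:

```python
import math

def cross_media_distance(p1, p2):
    """Euclidean distance between the digital person's position coordinates
    on two media, each given as an (x, y, z) triple."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p1, p2)))
```

A distance of zero would indicate the two representations are perfectly aligned; a nonzero distance quantifies the cross-media drift to correct.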
The video module 310 is configured to generate a video according to the information of the virtual scene, the digital person, the synchronization, the facial expressions and the actions, and to add a transition effect when the scene is switched and the camera shot changes, so as to generate a final video.
Smooth transition effects such as fade-in and fade-out, rotation, and panning are added at scene cuts and shot changes. Specifically, the method comprises the following steps:
suitable transitional effects are selected, such as rotation, translation, scaling, etc.
The creation and editing of transitional effects is performed using video editing software or a dedicated transitional effect tool.
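A fade-in/fade-out transition can be sketched as a linear alpha blend between the outgoing and incoming frames. The linear ramp below is an assumption for illustration; the patent does not fix a particular blending curve or editing tool.

```python
def crossfade_alpha(frame, total_frames):
    """Linear cross-fade weight for the incoming scene: 0.0 at the first
    frame of the transition, 1.0 at the last."""
    if total_frames <= 1:
        return 1.0
    return frame / (total_frames - 1)

def blend_pixel(old, new, alpha):
    """Blend one pixel value between the outgoing and incoming frames."""
    return (1.0 - alpha) * old + alpha * new
```

Applying `blend_pixel` to every pixel of each transition frame, with `alpha` advancing per frame, produces the smooth cut described above.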
Further, the emotion in the voice is analyzed through AI, and an appropriate virtual scene and transition effect are recommended according to the emotion characteristics of the digital person. Specifically, the method comprises the following steps:
emotion analysis is performed on the speech using the emotion analysis model.
And selecting corresponding virtual scenes and transition effects according to emotion analysis results.
In addition, the recommendation algorithm can be continuously optimized based on user feedback and data, improving recommendation accuracy and user satisfaction.
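The feedback-driven optimization could be sketched as score adjustment over a recommendation table; the scene names, initial scores, and learning step below are illustrative assumptions, not values from the patent.

```python
# Minimal sketch of a feedback-tunable scene recommender; scene names and
# initial scores are assumptions for illustration.
SCENE_SCORES = {
    "positive": {"sunny_park": 1.0, "bright_studio": 0.8},
    "negative": {"rainy_street": 1.0, "dim_studio": 0.7},
    "neutral":  {"plain_studio": 1.0},
}

def recommend_scene(emotion: str) -> str:
    """Pick the highest-scoring virtual scene for an emotion label,
    falling back to the neutral table for unknown labels."""
    scores = SCENE_SCORES.get(emotion, SCENE_SCORES["neutral"])
    return max(scores, key=scores.get)

def apply_feedback(emotion: str, scene: str, liked: bool, step: float = 0.1):
    """Nudge a scene's score up or down based on one piece of user feedback."""
    scores = SCENE_SCORES.setdefault(emotion, {})
    scores[scene] = scores.get(scene, 0.5) + (step if liked else -step)
```

Repeated positive feedback raises a scene's score until it overtakes the current recommendation, which is the continuous optimization loop the description mentions.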
Claims (8)
1. A method for generating video by a text-driven digital person, comprising:
cleaning and standardizing the input text content;
extracting key information in the text content;
identifying an entity in the text content;
marking the emotion polarity of the text content to determine the emotion tendency of the text content;
generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies;
creating a digital man anchor, and placing the generated digital man in the virtual scene;
generating facial expressions and actions of the digital person according to the emotion tendencies;
generating voice according to the text content;
synchronizing the actions, facial expressions, voice and mouth shape of the digital person according to the text content, specifically comprising the following steps: synchronizing the actions, facial expressions, voice and mouth shape of the digital person using a time axis and an animation curve according to the text content; synchronizing the text content with the actions of the digital person according to the text content, adopting the following time axis synchronization formula:
t = sqrt(dist / g) + 0.5 * sin(2 * π * freq * t)
wherein t represents time, dist represents the distance between the digital person and a target position, g represents gravitational acceleration, freq represents the vibration frequency, and 0.5 * sin(2 * π * freq * t) represents the influence of vibration;
the following animation curve optimization formula is adopted:
x(t) = x0 + v0 * t + 0.5 * a * t^2
wherein x(t) represents the position of the object at time t, x0 represents the initial position, v0 represents the initial velocity, a represents the acceleration, and t represents time;
generating a video directly according to the virtual scene, the mouth-shape synchronization of the digital person, the facial expressions and the actions based on the text content, and adding a transition effect when the scene is switched and the camera shot changes, to generate a final video;
wherein, to ensure consistency of digital people on different media, the following cross-media synchronization optimization formula is used:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
wherein (x1, y1, z1) and (x2, y2, z2) represent the position coordinates of the digital person on the two media, respectively, and d represents the distance between those positions.
2. The method of claim 1, wherein the emotional tendency comprises: positive, negative, or neutral.
3. The method of claim 2, further comprising:
displaying a happy facial expression and action when the text content is positive;
when the text content is negative, a sad or angry facial expression and action is displayed.
4. The method of claim 1, wherein the transition effect comprises: fade-in and fade-out, rotation, and translation.
5. An apparatus for generating video by a text-driven digital person, comprising:
the preprocessing module is used for cleaning and standardizing the input text content;
the extraction module is used for extracting key information in the text content;
the identification module is used for identifying the entity in the text content;
the labeling module is used for labeling the emotion polarity of the text content so as to determine the emotion tendency of the text content;
the scene module is used for generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies;
the creation module is used for creating a digital person anchor and placing the generated digital person in the virtual scene;
the expression module is used for generating facial expressions and actions of the digital person according to the emotion tendencies;
the voice module is used for generating voice according to the text content;
the synchronization module is used for synchronizing the actions, facial expressions, voice and mouth shape of the digital person according to the text content, specifically comprising the following steps: synchronizing the actions, facial expressions, voice and mouth shape of the digital person using a time axis and an animation curve according to the text content; synchronizing the text content with the actions of the digital person according to the text content, adopting the following time axis synchronization formula:
t = sqrt(dist / g) + 0.5 * sin(2 * π * freq * t)
wherein t represents time, dist represents the distance between the digital person and a target position, g represents gravitational acceleration, freq represents the vibration frequency, and 0.5 * sin(2 * π * freq * t) represents the influence of vibration;
the following animation curve optimization formula is adopted:
x(t) = x0 + v0 * t + 0.5 * a * t^2
wherein x(t) represents the position of the object at time t, x0 represents the initial position, v0 represents the initial velocity, a represents the acceleration, and t represents time;
the video module is used for directly generating a video according to the virtual scene, the mouth-shape synchronization of the digital person, the facial expressions and the actions based on the text content, and adding a transition effect when the scene is switched and the camera shot changes, to generate a final video;
wherein, to ensure consistency of digital people on different media, the following cross-media synchronization optimization formula is used:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
wherein (x1, y1, z1) and (x2, y2, z2) represent the position coordinates of the digital person on the two media, respectively, and d represents the distance between those positions.
6. The apparatus for generating video by a text-driven digital person of claim 5, wherein the emotional tendency comprises: positive, negative, or neutral.
7. The apparatus for generating video by a text-driven digital person of claim 6, further comprising:
displaying a happy facial expression and action when the text content is positive;
when the text content is negative, a sad or angry facial expression and action is displayed.
8. The apparatus for generating video by a text-driven digital person of claim 5, wherein the transition effect comprises: fade-in and fade-out, rotation, and translation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311322110.5A CN117058286B (en) | 2023-10-13 | 2023-10-13 | Method and device for generating video by using word driving digital person |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117058286A CN117058286A (en) | 2023-11-14 |
CN117058286B (en) | 2024-01-23
Family
ID=88664907
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117058286B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117348736B (en) * | 2023-12-06 | 2024-03-19 | 彩讯科技股份有限公司 | Digital interaction method, system and medium based on artificial intelligence |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106463118A (en) * | 2016-07-07 | 2017-02-22 | 深圳狗尾草智能科技有限公司 | Method, system and robot for synchronizing speech and virtual movement |
CN113096252A (en) * | 2021-03-05 | 2021-07-09 | 华中师范大学 | Multi-movement mechanism fusion method in hybrid enhanced teaching scene |
CN114969282A (en) * | 2022-05-05 | 2022-08-30 | 迈吉客科技(北京)有限公司 | Intelligent interaction method based on rich media knowledge graph multi-modal emotion analysis model |
WO2022182064A1 (en) * | 2021-02-28 | 2022-09-01 | 조지수 | Conversation learning system using artificial intelligence avatar tutor, and method therefor |
CN115220682A (en) * | 2021-08-03 | 2022-10-21 | 达闼机器人股份有限公司 | Method and device for driving virtual portrait by audio and electronic equipment |
CN116016986A (en) * | 2023-01-09 | 2023-04-25 | 上海元梦智能科技有限公司 | Virtual person interactive video rendering method and device |
CN116311456A (en) * | 2023-03-23 | 2023-06-23 | 应急管理部大数据中心 | Personalized virtual human expression generating method based on multi-mode interaction information |
CN116863038A (en) * | 2023-07-07 | 2023-10-10 | 东博未来人工智能研究院(厦门)有限公司 | Method for generating digital human voice and facial animation by text |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110285727A1 (en) * | 2010-05-24 | 2011-11-24 | Microsoft Corporation | Animation transition engine |
US11860925B2 (en) * | 2020-04-17 | 2024-01-02 | Accenture Global Solutions Limited | Human centered computing based digital persona generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||