CN117058286B - Method and device for generating video by using word driving digital person - Google Patents
- Publication number
- CN117058286B CN202311322110.5A
- Authority
- CN
- China
- Prior art keywords
- text content
- digital
- person
- generating
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 48
- 230000008451 emotion Effects 0.000 claims abstract description 62
- 230000008921 facial expression Effects 0.000 claims abstract description 55
- 230000009471 action Effects 0.000 claims abstract description 25
- 230000000694 effects Effects 0.000 claims abstract description 25
- 230000001815 facial effect Effects 0.000 claims abstract description 25
- 230000033001 locomotion Effects 0.000 claims abstract description 15
- 230000007704 transition Effects 0.000 claims abstract description 15
- 238000005286 illumination Methods 0.000 claims abstract description 13
- 239000003086 colorant Substances 0.000 claims abstract description 11
- 238000002372 labelling Methods 0.000 claims abstract description 6
- 230000001360 synchronised effect Effects 0.000 claims description 10
- 230000001133 acceleration Effects 0.000 claims description 8
- 238000005457 optimization Methods 0.000 claims description 8
- 238000004140 cleaning Methods 0.000 claims description 7
- 230000002996 emotional effect Effects 0.000 claims description 6
- 230000007935 neutral effect Effects 0.000 claims description 6
- 238000000605 extraction Methods 0.000 claims description 5
- 238000007781 pre-processing Methods 0.000 claims description 4
- 230000014509 gene expression Effects 0.000 claims description 3
- 238000005406 washing Methods 0.000 abstract description 3
- 238000004458 analytical method Methods 0.000 description 12
- 238000005516 engineering process Methods 0.000 description 8
- 230000015572 biosynthetic process Effects 0.000 description 6
- 238000003786 synthesis reaction Methods 0.000 description 6
- 238000010586 diagram Methods 0.000 description 5
- 238000013136 deep learning model Methods 0.000 description 4
- 230000011218 segmentation Effects 0.000 description 4
- 239000000463 material Substances 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 230000008520 organization Effects 0.000 description 2
- 238000004091 panning Methods 0.000 description 2
- 238000009877 rendering Methods 0.000 description 2
- 230000033764 rhythmic process Effects 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/268—Morphological analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
- G10L21/055—Time compression or expansion for synchronising with other signals, e.g. video signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Processing Or Creating Images (AREA)
Abstract
The application provides a method and a device for generating video with a text-driven digital person, comprising the following steps: cleaning and standardizing the input text content, and extracting key information and identifying entities from the text content; labeling emotion polarities so as to determine emotion tendencies; based on the key information and the entities, generating a virtual scene related to the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies; creating a digital human anchor, and placing the generated digital person in the virtual scene; generating facial expressions and actions of the digital person according to the emotion tendencies; generating voice according to the text content; synchronizing the digital person's movements, facial expressions, speech, and mouth shape; and generating a video according to the virtual scene, the digital person, the synchronization information, the facial expressions and the actions, adding transition effects at scene cuts and camera shot changes to produce the final video. The method and the device directly drive a digital person to generate video from text content, improving video generation efficiency.
Description
Technical Field
The present application relates to the field of video creation, and more particularly, to a method and apparatus for driving a digital person with text to generate video.
Background
Automatic video generation can accelerate business processes such as news broadcasting and the production of teaching materials, improving production efficiency and making content creation and dissemination faster. A method for generating video with a digital person can meet users' needs for personalized, targeted content delivery in different fields, and provides a more engaging, customized presentation of information.
There is currently no method to directly drive digital people to generate video based on text content.
Disclosure of Invention
The present application is directed to overcoming the problems in the art by providing a method and apparatus for generating video with a text-driven digital person.
The application provides a method for generating video with a text-driven digital person, comprising the following steps:
cleaning and standardizing the input text content;
extracting key information in the text content;
identifying an entity in the text content;
marking emotion polarities of the text contents to determine emotion tendencies of the text contents;
generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies;
creating a digital human anchor, and placing the generated digital person in the virtual scene;
generating facial expressions and actions of the digital person according to the emotion tendencies;
generating voice according to the text content;
synchronizing the digital person's movements, facial expressions, speech, and mouth shape;
and generating a video according to the virtual scene, the digital person, the synchronization information, the facial expressions and the actions, adding transition effects at scene cuts and camera shot changes to produce the final video.
Optionally, synchronizing the digital person's motion, facial expression, voice, and mouth shape includes:
based on the text content, synchronizing the digital person's motion, facial expression, voice, and mouth shape using a time axis and animation curves.
Optionally, the emotional tendency includes: positive, negative, or neutral.
Optionally, the method further comprises:
displaying a happy facial expression and action when the text content is positive;
displaying a sad or angry facial expression and action when the text content is negative.
Optionally, the transition effect includes: fade in and fade out, rotation and translation.
The application provides an apparatus for a text-driven digital person to generate video, comprising:
the preprocessing module is used for cleaning and standardizing the input text content;
the extraction module is used for extracting key information in the text content;
the identification module is used for identifying the entity in the text content;
the labeling module is used for labeling the emotion polarity of the text content so as to determine the emotion tendency of the text content;
the scene module is used for generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies;
the creation module is used for creating a digital person anchor and placing the generated digital person in the virtual scene;
the expression module is used for generating facial expressions and actions of the digital person according to the emotion tendencies;
the voice module is used for generating voice according to the text content;
the synchronization module is used for synchronizing the digital person's actions, facial expressions, voice, and mouth shape;
and the video module is used for generating a video according to the information of the virtual scene, the digital person, the synchronization, the facial expression and the action, and adding a transition effect when the scene is switched and the lens is changed to generate a final video.
Optionally, the synchronization module synchronizing the digital person's actions, facial expressions, voice, and mouth shape includes:
based on the text content, synchronizing the digital person's motion, facial expression, voice, and mouth shape using a time axis and animation curves.
Optionally, the emotional tendency includes: positive, negative, or neutral.
Optionally, the method further comprises:
displaying a happy facial expression and action when the text content is positive;
displaying a sad or angry facial expression and action when the text content is negative.
Optionally, the transition effect includes: fade in and fade out, rotation and translation.
Advantages and beneficial effects of the present application:
the application provides a method for generating video by an alphanumeric person, comprising the following steps: cleaning and standardizing the input text content; extracting key information in the text content; identifying an entity in the text content; marking emotion polarities of the text contents to determine emotion tendencies of the text contents; generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies; creating a digital man anchor, and placing the generated digital man in the virtual scene; generating facial expressions and actions of the digital person according to the emotion tendencies; generating voice according to the text content; synchronizing the digital person's movements, facial expressions, speech, and digital person's mouth shape; and generating a video according to the information of the virtual scene, the digital person, the synchronization, the facial expression and the action, and adding a transition effect when the scene is switched and the lens is changed to generate a final video. The method and the device directly drive the digital person to generate the video based on the text content, and improve the video generation efficiency.
Drawings
Fig. 1 is a schematic diagram of the method for generating video with a text-driven digital person in the present application.
Fig. 2 is a schematic diagram of the synchronization optimization in the present application.
Fig. 3 is a schematic diagram of the apparatus for generating video with a text-driven digital person in the present application.
Detailed Description
The present application is further described in conjunction with the drawings and detailed embodiments so that those skilled in the art may better understand the present application and practice it.
The following are examples of specific implementation provided for the purpose of illustrating the technical solutions to be protected in this application in detail, but this application may also be implemented in other ways than described herein, and one skilled in the art may implement this application by using different technical means under the guidance of the conception of this application, so this application is not limited by the following specific embodiments.
The application provides a method for generating video with a text-driven digital person, comprising the following steps: cleaning and standardizing the input text content; extracting key information from the text content; identifying entities in the text content; labeling the emotion polarity of the text content to determine its emotion tendency; generating a virtual scene related to the text content based on the key information and the entities, and selecting different background colors, illumination and environment settings according to the emotion tendency; creating a digital human anchor, and placing the generated digital person in the virtual scene; generating the digital person's facial expressions and actions according to the emotion tendency; generating voice according to the text content; synchronizing the digital person's movements, facial expressions, speech, and mouth shape; and generating a video according to the virtual scene, the digital person, the synchronization information, the facial expressions and the actions, adding transition effects at scene cuts and camera shot changes to produce the final video. The method and the device directly drive a digital person to generate video from text content, improving video generation efficiency.
Fig. 1 is a schematic diagram of the method for generating video with a text-driven digital person in the present application.
Referring to fig. 1, the steps of the method for generating video with a text-driven digital person provided in the present application include:
s101 cleans and normalizes the inputted text content.
First, the input text content is processed, primarily to clean and normalize the text. Specifically, the method comprises the following steps:
punctuation marks, spaces, special characters, etc. in the text are removed to facilitate more accurate word segmentation and part-of-speech tagging.
The text is segmented into individual words or vocabularies to facilitate keyword extraction and part-of-speech tagging. Chinese word segmentation may use jieba, THULAC, etc. tools.
Stop words that appear frequently in the text but do not help in extracting key information, such as "is" and "in", are removed.
The text is processed uniformly, e.g. unifying letter case, number formats, and the like.
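The cleaning, segmentation, stop-word removal and normalization steps above can be sketched as follows. This is a minimal, stdlib-only Python illustration: the stop-word list is a hypothetical placeholder, and a real pipeline would use jieba or THULAC for Chinese word segmentation as noted above.

```python
import re

# Hypothetical English stop-word list for illustration; a production
# system would load a full list (and use jieba/THULAC for Chinese).
STOP_WORDS = {"is", "the", "a", "an", "in", "of"}

def clean_and_normalize(text: str) -> list[str]:
    """Clean, normalize, and segment text into content words."""
    text = text.lower()                   # unify case
    text = re.sub(r"[^\w\s]", " ", text)  # strip punctuation/special chars
    return [t for t in text.split() if t not in STOP_WORDS]

print(clean_and_normalize("The CEO, in Beijing, announced a new product!"))
# → ['ceo', 'beijing', 'announced', 'new', 'product']
```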
S102 extracts key information in the text content.
This step is mainly for extracting key information from the text. The process can be performed in the following manner:
Keywords are extracted using the Bag-of-Words model or the TF-IDF (Term Frequency-Inverse Document Frequency) method.
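As one possible reading of the TF-IDF extraction step, the stdlib-only sketch below ranks the words of one tokenized document against a small corpus; the corpus contents are illustrative assumptions, not data from the patent.

```python
import math
from collections import Counter

def tfidf_keywords(docs: list[list[str]], doc_idx: int, top_k: int = 3) -> list[str]:
    """Rank the words of docs[doc_idx] by TF-IDF against the whole corpus."""
    n_docs = len(docs)
    df = Counter()                      # document frequency per word
    for doc in docs:
        df.update(set(doc))
    tf = Counter(docs[doc_idx])         # term frequency in the target doc
    scores = {
        w: (tf[w] / len(docs[doc_idx])) * math.log(n_docs / df[w])
        for w in tf
    }
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

corpus = [
    ["digital", "person", "video", "generation"],
    ["video", "editing", "software"],
    ["digital", "person", "speech", "synthesis"],
]
print(tfidf_keywords(corpus, 0))   # "generation" ranks first: rarest in corpus
```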
S103 identifies an entity in the text content.
This step is mainly for identifying person names, place names, organizations, and other entities in the text. Named Entity Recognition (NER) techniques may be used, such as deep learning models like CRF (conditional random field) models or BERT.
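The patent envisions CRF/BERT-based NER; as a hypothetical stand-in that shows only the input/output shape of this step, the sketch below uses a tiny gazetteer (dictionary) lookup with made-up entries.

```python
# Toy gazetteer; entries and labels are illustrative assumptions,
# not a substitute for a trained CRF/BERT entity recognizer.
GAZETTEER = {
    "Beijing": "PLACE",
    "Alice": "PERSON",
    "Acme Corp": "ORG",
}

def recognize_entities(text: str) -> list[tuple[str, str]]:
    """Return (entity, label) pairs found in the text."""
    return [(name, label) for name, label in GAZETTEER.items() if name in text]

print(recognize_entities("Alice gave a talk in Beijing."))
# → [('Beijing', 'PLACE'), ('Alice', 'PERSON')]
```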
S104, marking emotion polarities of the text contents to determine emotion tendencies of the text contents.
For part of speech tagging, tools such as THULAC may be used.
The purpose of emotion analysis is to determine the emotion tendency of the text; it can be based on dictionary methods or use deep learning models (e.g., LSTM, BERT).
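The dictionary-based variant mentioned above can be sketched as follows; the positive/negative word lists are illustrative assumptions, and a real system would use a full sentiment lexicon or an LSTM/BERT classifier.

```python
# Minimal dictionary-based polarity labeller; word lists are assumptions.
POSITIVE = {"great", "happy", "success", "win"}
NEGATIVE = {"sad", "failure", "angry", "loss"}

def emotion_tendency(tokens: list[str]) -> str:
    """Label tokenized text as positive, negative, or neutral."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(emotion_tendency(["great", "success"]))   # → positive
print(emotion_tendency(["sad", "loss"]))        # → negative
print(emotion_tendency(["report"]))             # → neutral
```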
S105, generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies.
Based on the key information in the text and the entity recognition result, virtual scenes related to the text content are generated by using 3D modeling and rendering technology. Specifically, the method comprises the following steps:
and creating corresponding 3D models, such as buildings, figures, props and the like, according to the entity identification result.
And a material and illumination technology is used for adding a vivid visual effect to the model.
According to the emotion tendencies of the text, different background colors, illumination and environment settings are selected, and the sense of reality and substitution of the scene are enhanced.
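Selecting background colors, illumination and environment settings per emotion tendency amounts to a lookup table; the sketch below shows one such mapping, where the concrete colors, light levels and environment names are illustrative assumptions, not values from the patent.

```python
# Hypothetical emotion-to-scene presets; values are illustrative only.
SCENE_PRESETS = {
    "positive": {"background": "#FFD966", "light_intensity": 1.2, "env": "sunny"},
    "negative": {"background": "#4A4A6A", "light_intensity": 0.6, "env": "overcast"},
    "neutral":  {"background": "#D9D9D9", "light_intensity": 1.0, "env": "studio"},
}

def scene_settings(tendency: str) -> dict:
    """Pick scene settings for an emotion tendency, defaulting to neutral."""
    return SCENE_PRESETS.get(tendency, SCENE_PRESETS["neutral"])

print(scene_settings("positive")["env"])   # → sunny
```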
S106 creates a digital person anchor, and places the generated digital person in the virtual scene.
A realistic 3D digital anchor may be automatically created from the text description using digital person generation techniques. Specifically, the method comprises the following steps:
from the text description, a corresponding 3D model and animation are created.
And setting proper facial expressions and limb actions for the digital person according to the emotion analysis result.
Voice content is set for the digital person using speech synthesis techniques. Speech synthesis may use Text-to-Speech (TTS) technology to convert text to speech.
S107, generating the facial expression and action of the digital person according to the emotion tendencies.
And automatically generating facial expressions and actions of the digital person according to emotion analysis of the text content and emotion characteristics of the digital person. Specifically, the method comprises the following steps:
and selecting corresponding facial expressions and limb actions according to the emotion tendencies of the text.
Different facial expressions and action templates can be preset for different types of emotions.
Further, the emotion tendency includes: positive, negative, or neutral. When the text content is positive, a happy facial expression and action are displayed; when the text content is negative, a sad or angry facial expression and action are displayed.
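The preset expression and action templates described above likewise reduce to a per-tendency lookup; the template names below are hypothetical placeholders for real facial-expression and animation clips.

```python
# Hypothetical expression/action templates per emotion tendency.
EXPRESSION_TEMPLATES = {
    "positive": {"face": "smile", "action": "open_gesture"},
    "negative": {"face": "sad_or_angry", "action": "closed_gesture"},
    "neutral":  {"face": "relaxed", "action": "idle"},
}

def expression_for(tendency: str) -> dict:
    """Select the expression/action template, defaulting to neutral."""
    return EXPRESSION_TEMPLATES.get(tendency, EXPRESSION_TEMPLATES["neutral"])

print(expression_for("positive"))
# → {'face': 'smile', 'action': 'open_gesture'}
```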
S108, generating voice according to the text content.
Based on text content, natural and smooth speech is generated using TTS technology, synchronized with the mouth shape of the digital person. Specifically, the method comprises the following steps:
a high quality Speech synthesis engine is selected, such as Google Text-to-Speech, etc.
S109 synchronizes the digital person's motion, facial expression, voice, and mouth shape.
The generated digital person is placed in the virtual scene, and its actions, facial expressions and voice are synchronized according to the text content. Specifically, the method comprises the following steps:
the model and animation of the digital person are synchronized with the model and animation in the scene.
The time axis and animation curve are used to synchronize the digital person's performance with the text content.
According to the text content and the mouth shape of the digital person, a proper voice rhythm and intonation are selected.
In achieving synchronization of digital persons with virtual scenes, the following mathematical formulas may be used to optimize:
the synchronization may also be optimized as shown in fig. 2.
Synchronizing the animation of the digital person with the text content using a timeline synchronization formula:
t = sqrt(dist / g) + 0.5 * sin(2 * π * freq * t)
where t represents time, dist the distance between the digital person and the target position, g the gravitational acceleration, freq the vibration frequency, and 0.5 * sin(2 * π * freq * t) the influence of vibration. The formula may synchronize the movement of the digital person with the time axis.
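Since t appears on both sides of the timeline formula, it defines t implicitly; one reasonable reading (an assumption here, not stated in the source) is to iterate it as a fixed point starting from the pure sqrt(dist / g) term:

```python
import math

def sync_time(dist: float, g: float = 9.8, freq: float = 1.0,
              iters: int = 50) -> float:
    """Iterate t = sqrt(dist/g) + 0.5*sin(2*pi*freq*t) as a fixed point.

    Fixed-point iteration is an assumed interpretation of the implicit
    formula; it converges when the sine term varies slowly (small freq).
    """
    base = math.sqrt(dist / g)      # pure fall-time term
    t = base
    for _ in range(iters):
        t = base + 0.5 * math.sin(2 * math.pi * freq * t)
    return t

# With a low vibration frequency the iteration settles quickly:
print(sync_time(4.9, freq=0.1))
```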
To make the digital person's actions more natural and lifelike, an animation curve optimization formula is used:
x(t) = x0 + v0 * t + 0.5 * a * t^2
where x(t) represents the position of the object at time t, x0 the initial position, v0 the initial velocity, and a the acceleration. The formula can represent the motion trajectory of an object under an external force, so as to generate a more realistic animation curve.
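The animation-curve formula is ordinary constant-acceleration kinematics and can drive per-frame keyframes directly; the defaults below (an object tossed upward under gravity, sampled at 4 fps) are illustrative assumptions.

```python
def animation_position(t: float, x0: float = 0.0, v0: float = 2.0,
                       a: float = -9.8) -> float:
    """Position x(t) = x0 + v0*t + 0.5*a*t^2 along the animation curve."""
    return x0 + v0 * t + 0.5 * a * t * t

# Sample the curve to produce keyframes, e.g. 4 fps over one second:
frames = [round(animation_position(i / 4), 3) for i in range(5)]
print(frames)
```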
To ensure consistency of the digital person across different media, a cross-media synchronization optimization formula is used:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
where (x1, y1, z1) and (x2, y2, z2) represent the position coordinates of the digital person on the two media, respectively, and d represents the distance between them. The formula can be used to calculate the distance between the digital person's representations on different media, so as to achieve cross-media synchronization.
S110 generates a video according to the virtual scene, the digital person, the synchronization information, the facial expressions and the actions, adding transition effects at scene cuts and camera shot changes to produce the final video.
Smooth transitional effects such as fade-in and fade-out, rotation, panning, etc. are added upon scene cuts and shot changes. Specifically, the method comprises the following steps:
suitable transitional effects are selected, such as rotation, translation, scaling, etc.
The creation and editing of transitional effects is performed using video editing software or a dedicated transitional effect tool.
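A fade (cross-dissolve) transition such as the one named above is a linear blend between the outgoing and incoming frames; the sketch below blends lists of per-pixel brightness values as a stand-in for full video frames, which a real editor would process instead.

```python
def crossfade(frame_a: list[float], frame_b: list[float],
              alpha: float) -> list[float]:
    """Linear blend: alpha=0 shows frame_a, alpha=1 shows frame_b."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(frame_a, frame_b)]

a, b = [1.0, 1.0], [0.0, 0.5]
for step in range(3):        # three-step fade from scene A to scene B
    print(crossfade(a, b, step / 2))
```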
Further, emotion in voice is analyzed through AI, and a proper virtual scene and transition effect are recommended according to emotion characteristics of the digital person. Specifically, the method comprises the following steps:
emotion analysis is performed on the speech using the emotion analysis model.
And selecting corresponding virtual scenes and transition effects according to emotion analysis results.
In addition, the recommendation algorithm can be continuously optimized according to the user feedback and the data, and the recommendation accuracy and the user satisfaction are improved.
Fig. 3 is a schematic diagram of an apparatus for driving digital persons to generate video in the present application.
Referring to fig. 3, the apparatus for generating video with a text-driven digital person provided in the present application includes: a preprocessing module 301, an extraction module 302, a recognition module 303, an annotation module 304, a scene module 305, a creation module 306, an expression module 307, a speech module 308, a synchronization module 309, and a video module 310.
The preprocessing module 301 is used for cleaning and normalizing the input text content.
First, the input text content is processed, primarily to clean and normalize the text. Specifically, the method comprises the following steps:
punctuation marks, spaces, special characters, etc. in the text are removed to facilitate more accurate word segmentation and part-of-speech tagging.
The text is segmented into individual words or vocabularies to facilitate keyword extraction and part-of-speech tagging. Chinese word segmentation may use jieba, THULAC, etc. tools.
Stop words that appear frequently in the text but do not help in extracting key information, such as "is" and "in", are removed.
The text is processed uniformly, e.g. unifying letter case, number formats, and the like.
And the extracting module 302 is used for extracting key information in the text content.
This step is mainly for extracting key information from the text. The process can be performed in the following manner:
Keywords are extracted using the Bag-of-Words model or the TF-IDF (Term Frequency-Inverse Document Frequency) method.
And the identifying module 303 is used for identifying the entity in the text content.
This step is mainly for identifying person names, place names, organizations, and other entities in the text. Named Entity Recognition (NER) techniques may be used, such as deep learning models like CRF (conditional random field) models or BERT.
And the labeling module 304 is configured to label the emotion polarity of the text content, so as to determine emotion tendencies of the text content.
For part of speech tagging, tools such as THULAC may be used.
The purpose of emotion analysis is to determine the emotion tendency of the text; it can be based on dictionary methods or use deep learning models (e.g., LSTM, BERT).
A scene module 305, configured to generate a virtual scene related to the text content based on the key information and the entity in the text content, and select different background colors, illumination, and environment settings according to the emotion tendencies.
Based on the key information in the text and the entity recognition result, virtual scenes related to the text content are generated by using 3D modeling and rendering technology. Specifically, the method comprises the following steps:
Corresponding 3D models, such as buildings, figures and props, are created according to the entity recognition results.
Material and illumination techniques are used to add vivid visual effects to the models.
According to the emotion tendency of the text, different background colors, illumination and environment settings are selected to enhance the realism and immersion of the scene.
A creation module 306, configured to create a digital person anchor, and place the generated digital person in the virtual scene.
A realistic 3D digital anchor may be automatically created from the text description using digital person generation techniques. Specifically, the method comprises the following steps:
from the text description, a corresponding 3D model and animation are created.
And setting proper facial expressions and limb actions for the digital person according to the emotion analysis result.
Voice content is set for the digital person using speech synthesis techniques. Speech synthesis may use Text-to-Speech (TTS) technology to convert text to speech.
And the expression module 307 is used for generating facial expressions and actions of the digital person according to the emotion tendencies.
And automatically generating facial expressions and actions of the digital person according to emotion analysis of the text content and emotion characteristics of the digital person. Specifically, the method comprises the following steps:
and selecting corresponding facial expressions and limb actions according to the emotion tendencies of the text.
Different facial expressions and action templates can be preset for different types of emotions.
Further, the emotion tendency includes: positive, negative, or neutral. When the text content is positive, a happy facial expression and action are displayed; when the text content is negative, a sad or angry facial expression and action are displayed.
A voice module 308, configured to generate voice according to the text content.
Based on text content, natural and smooth speech is generated using TTS technology, synchronized with the mouth shape of the digital person. Specifically, the method comprises the following steps:
a high quality Speech synthesis engine is selected, such as Google Text-to-Speech, etc.
A synchronization module 309, for synchronizing the digital person's actions, facial expressions, voice, and mouth shape.
The generated digital person is placed in the virtual scene, and its actions, facial expressions and voice are synchronized according to the text content. Specifically, the method comprises the following steps:
the model and animation of the digital person are synchronized with the model and animation in the scene.
The time axis and animation curve are used to synchronize the digital person's performance with the text content.
According to the text content and the mouth shape of the digital person, a proper voice rhythm and intonation are selected.
In achieving synchronization of the digital person with the virtual scene, the following mathematical formulas may be used for optimization:
synchronizing the animation of the digital person with the text content using a timeline synchronization formula:
t = sqrt(dist / g) + 0.5 * sin(2 * π * freq * t)
where t represents time, dist represents the distance between the digital person and the target position, g represents gravitational acceleration, freq represents the vibration frequency, and 0.5 * sin(2 * π * freq * t) represents the influence of vibration. The formula may synchronize the movement of the digital person with the time axis.
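Because t appears on both sides of the formula, one way to evaluate it is as a fixed-point equation. The iterative treatment below is an assumed numerical method, not specified by the patent.

```python
import math

def sync_time(dist, g, freq, iterations=100):
    """Evaluate t = sqrt(dist / g) + 0.5 * sin(2 * pi * freq * t) by
    fixed-point iteration, since t appears on both sides of the formula."""
    t = math.sqrt(dist / g)  # initial guess: ignore the vibration term
    for _ in range(iterations):
        t = math.sqrt(dist / g) + 0.5 * math.sin(2 * math.pi * freq * t)
    return t
```

With freq = 0 the vibration term vanishes and t reduces to sqrt(dist / g), which gives a quick sanity check on the iteration.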
To make the actions of the digital person more natural and lifelike, an animation curve optimization formula is used:
x(t) = x0 + v0 * t + 0.5 * a * t^2
where x(t) represents the position of the object at time t, x0 represents the initial position, v0 represents the initial velocity, a represents the acceleration, and t represents time. The formula can be used to represent the motion trajectory of an object under an external force, so as to generate a more realistic animation curve.
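The animation curve formula is ordinary constant-acceleration kinematics and can be evaluated directly:

```python
def animation_position(t, x0, v0, a):
    """Position under constant acceleration: x(t) = x0 + v0*t + 0.5*a*t^2.
    Used here to sample an animation curve at a given time t."""
    return x0 + v0 * t + 0.5 * a * t ** 2
```

Sampling this function at each frame time yields the smooth, physically plausible trajectory the description refers to.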
To ensure the consistency of the digital person across different media, a cross-media synchronization optimization formula is used:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
where (x1, y1, z1) and (x2, y2, z2) represent the position coordinates of the digital person on the two media, respectively, and d represents the distance between those positions. The formula can be used to calculate the distance between the digital person's representations on different media, so as to realize cross-media synchronization.
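The cross-media formula is the Euclidean distance between the two coordinate triples:

```python
import math

def cross_media_distance(p1, p2):
    """Euclidean distance between the digital person's position coordinates
    on two media, each given as an (x, y, z) triple."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p1, p2)))
```

A distance of zero would indicate the two representations are perfectly aligned; a nonzero distance quantifies the cross-media drift to correct.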
The video module 310 is configured to generate a video according to the information of the virtual scene, the digital person, the synchronization, the facial expressions and the actions, and to add a transition effect when the scene is switched and the camera shot changes, so as to generate a final video.
Smooth transition effects such as fade-in and fade-out, rotation, and panning are added at scene cuts and shot changes. Specifically, the method comprises the following steps:
suitable transitional effects are selected, such as rotation, translation, scaling, etc.
The creation and editing of transitional effects is performed using video editing software or a dedicated transitional effect tool.
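A fade-in/fade-out transition can be sketched as a linear alpha blend between the outgoing and incoming frames. The linear ramp below is an assumption for illustration; the patent does not fix a particular blending curve or editing tool.

```python
def crossfade_alpha(frame, total_frames):
    """Linear cross-fade weight for the incoming scene: 0.0 at the first
    frame of the transition, 1.0 at the last."""
    if total_frames <= 1:
        return 1.0
    return frame / (total_frames - 1)

def blend_pixel(old, new, alpha):
    """Blend one pixel value between the outgoing and incoming frames."""
    return (1.0 - alpha) * old + alpha * new
```

Applying `blend_pixel` to every pixel of each transition frame, with `alpha` advancing per frame, produces the smooth cut described above.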
Further, the emotion in the voice is analyzed through AI, and an appropriate virtual scene and transition effect are recommended according to the emotion characteristics of the digital person. Specifically, the method comprises the following steps:
emotion analysis is performed on the speech using the emotion analysis model.
And selecting corresponding virtual scenes and transition effects according to emotion analysis results.
In addition, the recommendation algorithm can be continuously optimized based on user feedback and data, improving recommendation accuracy and user satisfaction.
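The feedback-driven optimization could be sketched as score adjustment over a recommendation table; the scene names, initial scores, and learning step below are illustrative assumptions, not values from the patent.

```python
# Minimal sketch of a feedback-tunable scene recommender; scene names and
# initial scores are assumptions for illustration.
SCENE_SCORES = {
    "positive": {"sunny_park": 1.0, "bright_studio": 0.8},
    "negative": {"rainy_street": 1.0, "dim_studio": 0.7},
    "neutral":  {"plain_studio": 1.0},
}

def recommend_scene(emotion: str) -> str:
    """Pick the highest-scoring virtual scene for an emotion label,
    falling back to the neutral table for unknown labels."""
    scores = SCENE_SCORES.get(emotion, SCENE_SCORES["neutral"])
    return max(scores, key=scores.get)

def apply_feedback(emotion: str, scene: str, liked: bool, step: float = 0.1):
    """Nudge a scene's score up or down based on one piece of user feedback."""
    scores = SCENE_SCORES.setdefault(emotion, {})
    scores[scene] = scores.get(scene, 0.5) + (step if liked else -step)
```

Repeated positive feedback raises a scene's score until it overtakes the current recommendation, which is the continuous optimization loop the description mentions.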
Claims (8)
1. A method for generating video by a text-driven digital person, comprising:
cleaning and standardizing the input text content;
extracting key information in the text content;
identifying an entity in the text content;
marking the emotion polarity of the text content to determine the emotion tendency of the text content;
generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies;
creating a digital man anchor, and placing the generated digital man in the virtual scene;
generating facial expressions and actions of the digital person according to the emotion tendencies;
generating voice according to the text content;
synchronizing the actions, facial expressions, voice and mouth shape of the digital person according to the text content, specifically comprising the following steps: synchronizing the actions, facial expressions, voice and mouth shape of the digital person using a time axis and an animation curve according to the text content; synchronizing the text content with the actions of the digital person according to the text content, adopting the following time axis synchronization formula:
t = sqrt(dist / g) + 0.5 * sin(2 * π * freq * t)
wherein t represents time, dist represents the distance between the digital person and a target position, g represents gravitational acceleration, freq represents the vibration frequency, and 0.5 * sin(2 * π * freq * t) represents the influence of vibration;
the following animation curve optimization formula is adopted:
x(t) = x0 + v0 * t + 0.5 * a * t^2
wherein x(t) represents the position of the object at time t, x0 represents the initial position, v0 represents the initial velocity, a represents the acceleration, and t represents time;
generating a video directly according to the virtual scene, the mouth-shape synchronization of the digital person, the facial expressions and the actions based on the text content, and adding a transition effect when the scene is switched and the camera shot changes, to generate a final video;
wherein, to ensure consistency of digital people on different media, the following cross-media synchronization optimization formula is used:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
wherein (x1, y1, z1) and (x2, y2, z2) represent the position coordinates of the digital person on the two media, respectively, and d represents the distance between those positions.
2. The method of claim 1, wherein the emotional tendency comprises: positive, negative, or neutral.
3. The method of claim 2, further comprising:
displaying a happy facial expression and action when the text content is positive;
when the text content is negative, a sad or angry facial expression and action is displayed.
4. The method of claim 1, wherein the transition effect comprises: fade-in and fade-out, rotation, and translation.
5. An apparatus for generating video by a text-driven digital person, comprising:
the preprocessing module is used for cleaning and standardizing the input text content;
the extraction module is used for extracting key information in the text content;
the identification module is used for identifying the entity in the text content;
the labeling module is used for labeling the emotion polarity of the text content so as to determine the emotion tendency of the text content;
the scene module is used for generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies;
the creation module is used for creating a digital person anchor and placing the generated digital person in the virtual scene;
the expression module is used for generating facial expressions and actions of the digital person according to the emotion tendencies;
the voice module is used for generating voice according to the text content;
the synchronization module is used for synchronizing the actions, facial expressions, voice and mouth shape of the digital person according to the text content, specifically comprising the following steps: synchronizing the actions, facial expressions, voice and mouth shape of the digital person using a time axis and an animation curve according to the text content; synchronizing the text content with the actions of the digital person according to the text content, adopting the following time axis synchronization formula:
t = sqrt(dist / g) + 0.5 * sin(2 * π * freq * t)
wherein t represents time, dist represents the distance between the digital person and a target position, g represents gravitational acceleration, freq represents the vibration frequency, and 0.5 * sin(2 * π * freq * t) represents the influence of vibration;
the following animation curve optimization formula is adopted:
x(t) = x0 + v0 * t + 0.5 * a * t^2
wherein x(t) represents the position of the object at time t, x0 represents the initial position, v0 represents the initial velocity, a represents the acceleration, and t represents time;
the video module is used for directly generating a video according to the virtual scene, the mouth-shape synchronization of the digital person, the facial expressions and the actions based on the text content, and adding a transition effect when the scene is switched and the camera shot changes, to generate a final video;
wherein, to ensure consistency of digital people on different media, the following cross-media synchronization optimization formula is used:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
wherein (x1, y1, z1) and (x2, y2, z2) represent the position coordinates of the digital person on the two media, respectively, and d represents the distance between those positions.
6. The apparatus for generating video by a text-driven digital person of claim 5, wherein the emotional tendency comprises: positive, negative, or neutral.
7. The apparatus for generating video by a text-driven digital person of claim 6, further comprising:
displaying a happy facial expression and action when the text content is positive;
when the text content is negative, a sad or angry facial expression and action is displayed.
8. The apparatus for generating video by a text-driven digital person of claim 5, wherein the transition effect comprises: fade-in and fade-out, rotation, and translation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311322110.5A CN117058286B (en) | 2023-10-13 | 2023-10-13 | Method and device for generating video by using word driving digital person |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117058286A CN117058286A (en) | 2023-11-14 |
CN117058286B (en) | 2024-01-23
Family
ID=88664907
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117058286B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117348736B (en) * | 2023-12-06 | 2024-03-19 | 彩讯科技股份有限公司 | Digital interaction method, system and medium based on artificial intelligence |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106463118A (en) * | 2016-07-07 | 2017-02-22 | 深圳狗尾草智能科技有限公司 | Method, system and robot for synchronizing speech and virtual movement |
CN113096252A (en) * | 2021-03-05 | 2021-07-09 | 华中师范大学 | Multi-movement mechanism fusion method in hybrid enhanced teaching scene |
CN114969282A (en) * | 2022-05-05 | 2022-08-30 | 迈吉客科技(北京)有限公司 | Intelligent interaction method based on rich media knowledge graph multi-modal emotion analysis model |
WO2022182064A1 (en) * | 2021-02-28 | 2022-09-01 | 조지수 | Conversation learning system using artificial intelligence avatar tutor, and method therefor |
CN115220682A (en) * | 2021-08-03 | 2022-10-21 | 达闼机器人股份有限公司 | Method and device for driving virtual portrait by audio and electronic equipment |
CN116016986A (en) * | 2023-01-09 | 2023-04-25 | 上海元梦智能科技有限公司 | Virtual person interactive video rendering method and device |
CN116311456A (en) * | 2023-03-23 | 2023-06-23 | 应急管理部大数据中心 | Personalized virtual human expression generating method based on multi-mode interaction information |
CN116863038A (en) * | 2023-07-07 | 2023-10-10 | 东博未来人工智能研究院(厦门)有限公司 | Method for generating digital human voice and facial animation by text |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110285727A1 (en) * | 2010-05-24 | 2011-11-24 | Microsoft Corporation | Animation transition engine |
US11860925B2 (en) * | 2020-04-17 | 2024-01-02 | Accenture Global Solutions Limited | Human centered computing based digital persona generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||