CN117058286B - Method and device for generating video with a text-driven digital person - Google Patents

Method and device for generating video with a text-driven digital person

Info

Publication number
CN117058286B
Authority
CN
China
Prior art keywords
text content
digital
person
generating
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311322110.5A
Other languages
Chinese (zh)
Other versions
CN117058286A (en)
Inventor
胡兴凯
郑航
费元华
郭建君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Weiling Times Technology Co Ltd
Original Assignee
Beijing Weiling Times Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Weiling Times Technology Co Ltd filed Critical Beijing Weiling Times Technology Co Ltd
Priority to CN202311322110.5A
Publication of CN117058286A
Application granted
Publication of CN117058286B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04 Time compression or expansion
    • G10L21/055 Time compression or expansion for synchronising with other signals, e.g. video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L2021/105 Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a method and a device for generating video with a text-driven digital person, comprising the following steps: cleaning and standardizing the input text content, and extracting key information and identifying entities from the text content; labeling emotion polarities so as to determine emotion tendencies; based on the key information and the entities, generating a virtual scene related to the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies; creating a digital human anchor, and placing the generated digital person in the virtual scene; generating facial expressions and actions of the digital person according to the emotion tendencies; generating voice according to the text content; synchronizing the digital person's movements, facial expressions, speech, and mouth shape; and generating a video according to the information of the virtual scene, the digital person, the synchronization, the facial expressions and the actions, and adding a transition effect when the scene is switched and the shot changes, so as to generate a final video. The method and the device directly drive the digital person to generate the video based on the text content, and improve the video generation efficiency.

Description

Method and device for generating video with a text-driven digital person
Technical Field
The present application relates to the field of video creation, and more particularly, to a method and apparatus for generating video with a text-driven digital person.
Background
Automatic video generation can accelerate business processes such as news broadcasting and the production of teaching materials, improving production efficiency and making content creation and distribution faster. A method for generating video with a digital person can meet users' demands for personalized and targeted content delivery in different fields, and provides a more attractive and customized presentation of information.
There is currently no method that directly drives a digital person to generate video based on text content.
Disclosure of Invention
The present application is directed to overcoming the problems in the art and providing a method and apparatus for generating video with a text-driven digital person.
The application provides a method for generating video with a text-driven digital person, comprising the following steps:
cleaning and standardizing the input text content;
extracting key information in the text content;
identifying an entity in the text content;
marking emotion polarities of the text contents to determine emotion tendencies of the text contents;
generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies;
creating a digital human anchor, and placing the generated digital person in the virtual scene;
generating facial expressions and actions of the digital person according to the emotion tendencies;
generating voice according to the text content;
synchronizing the digital person's movements, facial expressions, speech, and digital person's mouth shape;
and generating a video according to the information of the virtual scene, the digital person, the synchronization, the facial expression and the action, and adding a transition effect when the scene is switched and the shot changes to generate a final video.
Optionally, the synchronizing the motion, facial expression, voice, and digital mouth shape of the digital person includes:
based on the text content, the motion, facial expression, voice, and digital human mouth shape of the digital human are synchronized using a time axis and an animation curve.
Optionally, the emotional tendency includes: positive, negative, or neutral.
Optionally, the method further comprises:
displaying a happy facial expression and action when the text content is positive;
when the text content is negative, displaying a sad or angry facial expression and action.
Optionally, the transition effect includes: fade in and fade out, rotation and translation.
The application provides an apparatus for a text-driven digital person to generate video, comprising:
the preprocessing module is used for cleaning and standardizing the input text content;
the extraction module is used for extracting key information in the text content;
the identification module is used for identifying the entity in the text content;
the labeling module is used for labeling the emotion polarity of the text content so as to determine the emotion tendency of the text content;
the scene module is used for generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies;
the creation module is used for creating a digital person anchor and placing the generated digital person in the virtual scene;
the expression module is used for generating facial expressions and actions of the digital person according to the emotion tendencies;
the voice module is used for generating voice according to the text content;
the synchronization module is used for synchronizing the actions, facial expressions, voices and digital mouths of the digital people;
and the video module is used for generating a video according to the information of the virtual scene, the digital person, the synchronization, the facial expression and the action, and adding a transition effect when the scene is switched and the lens is changed to generate a final video.
Optionally, the synchronization module synchronizes the actions, facial expressions, voices, and digital mouth shapes of the digital person, including:
based on the text content, the motion, facial expression, voice, and digital human mouth shape of the digital human are synchronized using a time axis and an animation curve.
Optionally, the emotional tendency includes: positive, negative, or neutral.
Optionally, the method further comprises:
displaying a happy facial expression and action when the text content is positive;
when the text content is negative, displaying a sad or angry facial expression and action.
Optionally, the transition effect includes: fade in and fade out, rotation and translation.
The application has the advantages and beneficial effects that:
the application provides a method for generating video by an alphanumeric person, comprising the following steps: cleaning and standardizing the input text content; extracting key information in the text content; identifying an entity in the text content; marking emotion polarities of the text contents to determine emotion tendencies of the text contents; generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies; creating a digital man anchor, and placing the generated digital man in the virtual scene; generating facial expressions and actions of the digital person according to the emotion tendencies; generating voice according to the text content; synchronizing the digital person's movements, facial expressions, speech, and digital person's mouth shape; and generating a video according to the information of the virtual scene, the digital person, the synchronization, the facial expression and the action, and adding a transition effect when the scene is switched and the lens is changed to generate a final video. The method and the device directly drive the digital person to generate the video based on the text content, and improve the video generation efficiency.
Drawings
Fig. 1 is a schematic diagram of the method for generating video with a text-driven digital person in the present application.
Fig. 2 is a schematic diagram of the synchronization optimization in the present application.
Fig. 3 is a schematic diagram of the apparatus for generating video with a text-driven digital person in the present application.
Detailed Description
The present application is further described in conjunction with the drawings and detailed embodiments so that those skilled in the art may better understand the present application and practice it.
The following are examples of specific implementation provided for the purpose of illustrating the technical solutions to be protected in this application in detail, but this application may also be implemented in other ways than described herein, and one skilled in the art may implement this application by using different technical means under the guidance of the conception of this application, so this application is not limited by the following specific embodiments.
The application provides a method for generating video with a text-driven digital person, comprising the following steps: cleaning and standardizing the input text content; extracting key information in the text content; identifying an entity in the text content; marking emotion polarities of the text contents to determine emotion tendencies of the text contents; generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies; creating a digital human anchor, and placing the generated digital person in the virtual scene; generating facial expressions and actions of the digital person according to the emotion tendencies; generating voice according to the text content; synchronizing the digital person's movements, facial expressions, speech, and mouth shape; and generating a video according to the information of the virtual scene, the digital person, the synchronization, the facial expression and the action, and adding a transition effect when the scene is switched and the shot changes to generate a final video. The method and the device directly drive the digital person to generate the video based on the text content, and improve the video generation efficiency.
Fig. 1 is a schematic diagram of the method for generating video with a text-driven digital person in the present application.
Referring to fig. 1, the steps of the method for generating video with a text-driven digital person provided in the present application include:
s101 cleans and normalizes the inputted text content.
First, the input text content is processed, primarily for the purposes of cleaning and normalizing the text. Specifically, the method comprises the following steps:
punctuation marks, spaces, special characters, etc. in the text are removed to facilitate more accurate word segmentation and part-of-speech tagging.
The text is segmented into individual words or vocabularies to facilitate keyword extraction and part-of-speech tagging. Chinese word segmentation may use jieba, THULAC, etc. tools.
Words that frequently appear in the text but do not help in extracting key information, such as "yes", "in", etc., are removed.
Unified processing is performed on the text, such as unified case, digital format, and the like.
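For illustration only, the preprocessing described above may be sketched in Python as follows (a minimal sketch assuming Chinese input and the jieba tokenizer mentioned above; the stop-word list and normalization rules are placeholder assumptions, not the claimed implementation):

import re
import jieba  # assumed Chinese word-segmentation tool, as mentioned above

STOP_WORDS = {"的", "了", "在", "是"}  # placeholder stop-word list

def preprocess(text: str) -> list[str]:
    # Remove punctuation, spaces and special characters
    text = re.sub(r"[^\w\u4e00-\u9fff]+", "", text)
    # Unify case as a simple normalization example
    text = text.lower()
    # Segment into individual words for keyword extraction and part-of-speech tagging
    tokens = jieba.lcut(text)
    # Drop high-frequency words that carry no key information
    return [t for t in tokens if t not in STOP_WORDS]

For example, preprocess("今天在北京，天气是晴朗的！") returns a token list such as ['今天', '北京', '天气', '晴朗'].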
S102 extracts key information in the text content.
This step is mainly for extracting key information from the text. The process can be performed in the following manner:
Keywords are extracted using the Bag-of-Words model or the TF-IDF (Term Frequency-Inverse Document Frequency) method.
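As a minimal sketch of this step (assuming scikit-learn as the TF-IDF implementation, which the application does not prescribe), keywords of a pre-segmented document may be ranked as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(segmented_docs: list[str], top_k: int = 5) -> list[str]:
    # segmented_docs: whitespace-joined tokens, e.g. the output of the preprocessing step
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(segmented_docs)
    terms = vectorizer.get_feature_names_out()
    # Rank the terms of the first document by TF-IDF weight
    weights = tfidf.toarray()[0]
    ranked = sorted(zip(terms, weights), key=lambda pair: pair[1], reverse=True)
    return [term for term, weight in ranked[:top_k] if weight > 0]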
S103 identifies an entity in the text content.
This step is mainly for identifying person names, place names, organizations and other entities in the text. Named Entity Recognition (NER) techniques may be used, such as entity recognition with deep learning models like CRF (Conditional Random Field) models or BERT.
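A minimal named entity recognition sketch using the Hugging Face transformers pipeline (an assumed toolchain; the checkpoint name is a placeholder for any BERT- or CRF-based NER model):

from transformers import pipeline

# The checkpoint name is a placeholder for any Chinese NER model
ner = pipeline("ner", model="ckiplab/bert-base-chinese-ner", aggregation_strategy="simple")

def extract_entities(text: str) -> list[dict]:
    # Returns person names, place names, organizations etc. with their labels
    return [{"text": e["word"], "label": e["entity_group"]} for e in ner(text)]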
S104, marking emotion polarities of the text contents to determine emotion tendencies of the text contents.
For part of speech tagging, tools such as THULAC may be used.
The purpose of emotion analysis is to determine the emotion tendencies of the text, which can be based on dictionary methods or using deep learning models (e.g., LSTM, BERT, etc.).
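The dictionary-based variant named above may be sketched as follows (the word lists are illustrative placeholders; a deep learning model such as LSTM or BERT could replace this scoring):

POSITIVE_WORDS = {"高兴", "成功", "优秀"}  # placeholder sentiment dictionary
NEGATIVE_WORDS = {"悲伤", "失败", "愤怒"}

def label_polarity(tokens: list[str]) -> str:
    # Count positive and negative hits to decide the emotional tendency
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(t in NEGATIVE_WORDS for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"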
S105, generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies.
Based on the key information in the text and the entity recognition result, virtual scenes related to the text content are generated by using 3D modeling and rendering technology. Specifically, the method comprises the following steps:
and creating corresponding 3D models, such as buildings, figures, props and the like, according to the entity identification result.
And a material and illumination technology is used for adding a vivid visual effect to the model.
According to the emotion tendencies of the text, different background colors, illumination and environment settings are selected, and the sense of reality and substitution of the scene are enhanced.
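For illustration, the mapping from emotional tendency to background color, illumination and environment settings may be sketched as a lookup table (the concrete values are assumptions to be tuned for whatever rendering engine is used):

from dataclasses import dataclass

@dataclass
class SceneSettings:
    background_color: tuple[int, int, int]  # RGB
    light_intensity: float
    environment: str

# Illustrative mapping from emotional tendency to scene parameters
SCENE_BY_EMOTION = {
    "positive": SceneSettings((255, 236, 179), 1.2, "bright_studio"),
    "negative": SceneSettings((70, 80, 100), 0.6, "dim_room"),
    "neutral": SceneSettings((200, 200, 200), 0.9, "plain_backdrop"),
}

def build_scene(emotion: str) -> SceneSettings:
    return SCENE_BY_EMOTION.get(emotion, SCENE_BY_EMOTION["neutral"])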
S106 creates a digital person anchor, and places the generated digital person in the virtual scene.
A realistic 3D digital anchor may be automatically created from the text description using digital person generation techniques. Specifically, the method comprises the following steps:
from the text description, a corresponding 3D model and animation are created.
And setting proper facial expressions and limb actions for the digital person according to the emotion analysis result.
Voice content is set for a digital person using voice synthesis techniques. Speech synthesis may use Text-to-Speech (TTS) technology to convert Text to Speech.
S107, generating the facial expression and action of the digital person according to the emotion tendencies.
And automatically generating facial expressions and actions of the digital person according to emotion analysis of the text content and emotion characteristics of the digital person. Specifically, the method comprises the following steps:
and selecting corresponding facial expressions and limb actions according to the emotion tendencies of the text.
Different facial expressions and action templates can be preset for different types of emotions.
Further, the emotional tendency includes: positive, negative, or neutral. When the text content is positive, a happy facial expression and action are displayed; when the text content is negative, a sad or angry facial expression and action are displayed.
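The preset templates mentioned above may be selected with a simple lookup (the template names are hypothetical identifiers of animation assets):

# Hypothetical preset templates for facial expressions and limb actions
EXPRESSION_TEMPLATES = {
    "positive": {"face": "smile", "action": "open_arms"},
    "negative": {"face": "sad", "action": "lowered_head"},
    "neutral": {"face": "calm", "action": "idle_stand"},
}

def select_expression(emotion: str) -> dict:
    return EXPRESSION_TEMPLATES.get(emotion, EXPRESSION_TEMPLATES["neutral"])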
S108, generating voice according to the text content.
Based on text content, natural and smooth speech is generated using TTS technology, synchronized with the mouth shape of the digital person. Specifically, the method comprises the following steps:
a high quality Speech synthesis engine is selected, such as Google Text-to-Speech, etc.
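A minimal sketch of this step using the gTTS package as a stand-in for a production engine such as Google Text-to-Speech (the package choice, language code and file name are assumptions, not part of the claimed method):

from gtts import gTTS  # assumed lightweight Text-to-Speech wrapper

def synthesize(text: str, out_path: str = "narration.mp3", lang: str = "zh-CN") -> str:
    tts = gTTS(text=text, lang=lang)
    tts.save(out_path)  # the audio file is later aligned with the digital person's mouth shape
    return out_path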
S109 synchronizes the digital person's motion, facial expression, voice, and mouth shape.
The generated digital person is placed in the virtual scene, and the digital person's actions, facial expressions and voice are synchronized with the text content. Specifically, the method comprises the following steps:
the model and animation of the digital person are synchronized with the model and animation in the scene.
The time axis and animation curve are used to synchronize the digital person's performance with the text content.
According to the text content and the mouth shape of the digital person, a proper voice rhythm and intonation are selected.
In achieving synchronization of the digital person with the virtual scene, the following mathematical formulas may be used for optimization:
the synchronization may also be optimized as shown in fig. 2.
Synchronizing the animation of the digital person with the text content using a timeline synchronization formula:
t = sqrt(dist / g) + 0.5 * sin(2 * π * freq * t)
where t represents time, dist represents the distance between the digital person and the target position, g represents gravitational acceleration, freq represents the vibration frequency, and 0.5 * sin(2 * π * freq * t) represents the influence of vibration. The formula may synchronize the movement of the digital person with the time axis.
To make the actions of the digital person more natural and lifelike, an animation curve optimization formula is used:
x(t) = x0 + v0 * t + 0.5 * a * t^2
where x(t) represents the position of the object at time t, x0 represents the initial position, v0 represents the initial velocity, a represents the acceleration, and t represents time. The formula can be used to represent the motion trajectory of the object under an external force, so as to generate a more realistic animation curve.
To ensure consistency of the digital person across different media, a cross-media synchronization optimization formula is used:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
where (x1, y1, z1) and (x2, y2, z2) represent the position coordinates of the digital person on the two media, respectively, and d represents the distance between the digital person's positions on the two media. The formula can be used to calculate the distance between the digital person's representations on different media, so as to achieve cross-media synchronization.
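The three formulas above may be transcribed directly into code for illustration (a literal transcription; units and parameter values come from the animation system):

import math

def timeline_sync(dist: float, g: float, freq: float, t: float) -> float:
    # Timeline synchronization formula: travel term plus a vibration term
    return math.sqrt(dist / g) + 0.5 * math.sin(2 * math.pi * freq * t)

def animation_curve(x0: float, v0: float, a: float, t: float) -> float:
    # Animation curve optimization formula: position under constant acceleration
    return x0 + v0 * t + 0.5 * a * t ** 2

def cross_media_distance(p1: tuple[float, float, float], p2: tuple[float, float, float]) -> float:
    # Cross-media synchronization formula: Euclidean distance between the two positions
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p1, p2)))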
S110, generating a video according to the information of the virtual scene, the digital person, the synchronization, the facial expression and the action, and adding a transition effect when the scene is switched and the shot changes, so as to generate a final video.
Smooth transitional effects such as fade-in and fade-out, rotation, panning, etc. are added upon scene cuts and shot changes. Specifically, the method comprises the following steps:
suitable transitional effects are selected, such as rotation, translation, scaling, etc.
The creation and editing of transitional effects is performed using video editing software or a dedicated transitional effect tool.
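A fade transition between two rendered clips may be sketched with the moviepy library (an assumed editing tool standing in for the video editing software mentioned above):

from moviepy.editor import VideoFileClip, concatenate_videoclips

def fade_transition(path_a: str, path_b: str, duration: float = 1.0):
    # Fade the first clip out and the second clip in, then join them
    clip_a = VideoFileClip(path_a).fadeout(duration)
    clip_b = VideoFileClip(path_b).fadein(duration)
    return concatenate_videoclips([clip_a, clip_b])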
Further, emotion in voice is analyzed through AI, and a proper virtual scene and transition effect are recommended according to emotion characteristics of the digital person. Specifically, the method comprises the following steps:
emotion analysis is performed on the speech using the emotion analysis model.
And selecting corresponding virtual scenes and transition effects according to emotion analysis results.
In addition, the recommendation algorithm can be continuously optimized according to the user feedback and the data, and the recommendation accuracy and the user satisfaction are improved.
Fig. 3 is a schematic diagram of an apparatus for driving digital persons to generate video in the present application.
Referring to fig. 3, the apparatus for generating video with a text-driven digital person provided in the present application includes: preprocessing module 301, extraction module 302, identification module 303, labeling module 304, scene module 305, creation module 306, expression module 307, voice module 308, synchronization module 309, and video module 310.
The preprocessing module 301 is used for cleaning and normalizing the input text content.
First, the input text content is processed, primarily for the purposes of cleaning and normalizing the text. Specifically, the method comprises the following steps:
punctuation marks, spaces, special characters, etc. in the text are removed to facilitate more accurate word segmentation and part-of-speech tagging.
The text is segmented into individual words or vocabularies to facilitate keyword extraction and part-of-speech tagging. Chinese word segmentation may use jieba, THULAC, etc. tools.
Words that frequently appear in the text but do not help in extracting key information, such as "yes", "in", etc., are removed.
Unified processing is performed on the text, such as unified case, digital format, and the like.
And the extracting module 302 is used for extracting key information in the text content.
This step is mainly for extracting key information from the text. The process can be performed in the following manner:
Keywords are extracted using the Bag-of-Words model or the TF-IDF (Term Frequency-Inverse Document Frequency) method.
And the identifying module 303 is used for identifying the entity in the text content.
This step is mainly for identifying person names, place names, organizations and other entities in the text. Named Entity Recognition (NER) techniques may be used, such as entity recognition with deep learning models like CRF (Conditional Random Field) models or BERT.
And the labeling module 304 is configured to label the emotion polarity of the text content, so as to determine emotion tendencies of the text content.
For part of speech tagging, tools such as THULAC may be used.
The purpose of emotion analysis is to determine the emotion tendencies of the text, which can be based on dictionary methods or using deep learning models (e.g., LSTM, BERT, etc.).
A scene module 305, configured to generate a virtual scene related to the text content based on the key information and the entity in the text content, and select different background colors, illumination, and environment settings according to the emotion tendencies.
Based on the key information in the text and the entity recognition result, virtual scenes related to the text content are generated by using 3D modeling and rendering technology. Specifically, the method comprises the following steps:
and creating corresponding 3D models, such as buildings, figures, props and the like, according to the entity identification result.
And a material and illumination technology is used for adding a vivid visual effect to the model.
According to the emotion tendencies of the text, different background colors, illumination and environment settings are selected, and the sense of reality and substitution of the scene are enhanced.
A creation module 306, configured to create a digital person anchor, and place the generated digital person in the virtual scene.
A realistic 3D digital anchor may be automatically created from the text description using digital person generation techniques. Specifically, the method comprises the following steps:
from the text description, a corresponding 3D model and animation are created.
And setting proper facial expressions and limb actions for the digital person according to the emotion analysis result.
Voice content is set for a digital person using voice synthesis techniques. Speech synthesis may use Text-to-Speech (TTS) technology to convert Text to Speech.
And the expression module 307 is used for generating facial expressions and actions of the digital person according to the emotion tendencies.
And automatically generating facial expressions and actions of the digital person according to emotion analysis of the text content and emotion characteristics of the digital person. Specifically, the method comprises the following steps:
and selecting corresponding facial expressions and limb actions according to the emotion tendencies of the text.
Different facial expressions and action templates can be preset for different types of emotions.
Further, the emotional tendency includes: positive, negative, or neutral. When the text content is positive, a happy facial expression and action are displayed; when the text content is negative, a sad or angry facial expression and action are displayed.
A voice module 308, configured to generate voice according to the text content.
Based on text content, natural and smooth speech is generated using TTS technology, synchronized with the mouth shape of the digital person. Specifically, the method comprises the following steps:
a high quality Speech synthesis engine is selected, such as Google Text-to-Speech, etc.
A synchronization module 309 for synchronizing the digital person's actions, facial expressions, voices and digital person's mouth shapes.
The generated digital person is placed in the virtual scene, and the digital person's actions, facial expressions and voice are synchronized with the text content. Specifically, the method comprises the following steps:
the model and animation of the digital person are synchronized with the model and animation in the scene.
The time axis and animation curve are used to synchronize the digital person's performance with the text content.
According to the text content and the mouth shape of the digital person, a proper voice rhythm and intonation are selected.
In achieving synchronization of the digital person with the virtual scene, the following mathematical formulas may be used for optimization:
synchronizing the animation of the digital person with the text content using a timeline synchronization formula:
t = sqrt(dist / g) + 0.5 * sin(2 * π * freq * t)
where t represents time, dist represents the distance between the digital person and the target position, g represents gravitational acceleration, freq represents the vibration frequency, and 0.5 * sin(2 * π * freq * t) represents the influence of vibration. The formula may synchronize the movement of the digital person with the time axis.
To make the actions of the digital person more natural and lifelike, an animation curve optimization formula is used:
x(t) = x0 + v0 * t + 0.5 * a * t^2
where x(t) represents the position of the object at time t, x0 represents the initial position, v0 represents the initial velocity, a represents the acceleration, and t represents time. The formula can be used to represent the motion trajectory of the object under an external force, so as to generate a more realistic animation curve.
To ensure consistency of the digital person across different media, a cross-media synchronization optimization formula is used:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
where (x1, y1, z1) and (x2, y2, z2) represent the position coordinates of the digital person on the two media, respectively, and d represents the distance between the digital person's positions on the two media. The formula can be used to calculate the distance between the digital person's representations on different media, so as to achieve cross-media synchronization.
The video module 310 is configured to generate a video according to the information of the virtual scene, the digital person, the synchronization, the facial expression and the motion, and to add a transition effect when the scene is switched and the shot changes, so as to generate a final video.
Smooth transitional effects such as fade-in and fade-out, rotation, panning, etc. are added upon scene cuts and shot changes. Specifically, the method comprises the following steps:
suitable transitional effects are selected, such as rotation, translation, scaling, etc.
The creation and editing of transitional effects is performed using video editing software or a dedicated transitional effect tool.
Further, emotion in voice is analyzed through AI, and a proper virtual scene and transition effect are recommended according to emotion characteristics of the digital person. Specifically, the method comprises the following steps:
emotion analysis is performed on the speech using the emotion analysis model.
And selecting corresponding virtual scenes and transition effects according to emotion analysis results.
In addition, the recommendation algorithm can be continuously optimized according to the user feedback and the data, and the recommendation accuracy and the user satisfaction are improved.
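As a purely structural illustration of how modules 301 to 310 could be wired together, the following hypothetical skeleton uses placeholder stubs for each module; it is not the claimed apparatus, only a sketch of the data flow:

class TextDrivenVideoDevice:
    # Hypothetical skeleton mirroring modules 301-310; every method is a placeholder stub
    def preprocess(self, text: str) -> list[str]:            # preprocessing module 301
        return text.split()
    def extract_keywords(self, tokens: list[str]) -> list[str]:  # extraction module 302
        return tokens[:5]
    def recognize_entities(self, text: str) -> list[str]:    # identification module 303
        return []
    def label_polarity(self, tokens: list[str]) -> str:      # labeling module 304
        return "neutral"
    def build_scene(self, keywords, entities, emotion) -> dict:  # scene module 305
        return {"emotion": emotion, "keywords": keywords, "entities": entities}
    def create_anchor(self, scene: dict) -> dict:             # creation module 306
        return {"scene": scene}
    def apply_expression(self, anchor: dict, emotion: str) -> None:  # expression module 307
        anchor["expression"] = emotion
    def synthesize_speech(self, text: str) -> bytes:          # voice module 308
        return b""
    def synchronize(self, anchor: dict, audio: bytes, text: str) -> None:  # synchronization module 309
        anchor["synced"] = True
    def render_video(self, scene: dict, anchor: dict, audio: bytes) -> str:  # video module 310
        return "final_video.mp4"

    def generate(self, text: str) -> str:
        tokens = self.preprocess(text)
        emotion = self.label_polarity(tokens)
        scene = self.build_scene(self.extract_keywords(tokens), self.recognize_entities(text), emotion)
        anchor = self.create_anchor(scene)
        self.apply_expression(anchor, emotion)
        audio = self.synthesize_speech(text)
        self.synchronize(anchor, audio, text)
        return self.render_video(scene, anchor, audio)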

Claims (8)

1. A method for generating video with a text-driven digital person, comprising:
cleaning and standardizing the input text content;
extracting key information in the text content;
identifying an entity in the text content;
marking emotion polarities of the text contents to determine emotion tendencies of the text contents;
generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies;
creating a digital human anchor, and placing the generated digital person in the virtual scene;
generating facial expressions and actions of the digital person according to the emotion tendencies;
generating voice according to the text content;
synchronizing the actions, facial expressions, voices and digital mouth shapes of the digital person according to the text content, and specifically comprising the following steps: synchronizing motion, facial expression, speech, and digital human mouth shape of the digital human using a time axis and an animation curve according to the text content; according to the text content, the text content is synchronized with the actions of the digital person, and the following time axis synchronization formula is adopted:
t = sqrt(dist / g) + 0.5 * sin(2 * π * freq * t)
wherein t represents time, dist represents the distance between the digital person and a target position, g represents gravitational acceleration, freq represents the vibration frequency, and 0.5 * sin(2 * π * freq * t) represents the influence of vibration;
the following animation curve optimization formula is adopted:
x(t) = x0 + v0 * t + 0.5 * a * t^2
wherein x(t) represents the position of the object at time t, x0 represents the initial position, v0 represents the initial velocity, a represents the acceleration, and t represents time;
generating a video directly according to the information of the virtual scene, the digital human mouth synchronization, the facial expression and the action based on the text content, and adding a transition effect when the scene is switched and the shot changes to generate a final video;
wherein, to ensure consistency of digital people on different media, the following cross-media synchronization optimization formula is used:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
wherein (x1, y1, z1) and (x2, y2, z2) represent the position coordinates of the digital person on the two media, respectively, and d represents the distance between the digital person's positions on the two media.
2. The method of claim 1, wherein the emotional tendency comprises: positive, negative, or neutral.
3. The method of claim 2, further comprising:
displaying a happy facial expression and action when the text content is positive;
when the text content is negative, displaying a sad or angry facial expression and action.
4. The method of claim 1, wherein the transitional effect comprises: fade in and fade out, rotation and translation.
5. An apparatus for generating video with a text-driven digital person, comprising:
the preprocessing module is used for cleaning and standardizing the input text content;
the extraction module is used for extracting key information in the text content;
the identification module is used for identifying the entity in the text content;
the labeling module is used for labeling the emotion polarity of the text content so as to determine the emotion tendency of the text content;
the scene module is used for generating a virtual scene related to the text content based on the key information and the entity in the text content, and selecting different background colors, illumination and environment settings according to the emotion tendencies;
the creation module is used for creating a digital person anchor and placing the generated digital person in the virtual scene;
the expression module is used for generating facial expressions and actions of the digital person according to the emotion tendencies;
the voice module is used for generating voice according to the text content;
the synchronization module is used for synchronizing the actions, the facial expressions, the voices and the digital mouths of the digital people according to the text content and specifically comprises the following steps: synchronizing motion, facial expression, speech, and digital human mouth shape of the digital human using a time axis and an animation curve according to the text content; according to the text content, the text content is synchronized with the actions of the digital person, and the following time axis synchronization formula is adopted:
t = sqrt(dist / g) + 0.5 * sin(2 * π * freq * t)
wherein t represents time, dist represents the distance between the digital person and a target position, g represents gravitational acceleration, freq represents the vibration frequency, and 0.5 * sin(2 * π * freq * t) represents the influence of vibration;
the following animation curve optimization formula is adopted:
x(t) = x0 + v0 * t + 0.5 * a * t^2
wherein x(t) represents the position of the object at time t, x0 represents the initial position, v0 represents the initial velocity, a represents the acceleration, and t represents time;
the video module is directly used for generating a video according to the information of the virtual scene, the digital human mouth synchronization, the facial expression and the action based on the text content, and adding a transition effect when the scene is switched and the lens is changed to generate a final video;
wherein, to ensure consistency of digital people on different media, the following cross-media synchronization optimization formula is used:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
wherein (x1, y1, z1) and (x2, y2, z2) represent the position coordinates of the digital person on the two media, respectively, and d represents the distance between the digital person's positions on the two media.
6. The apparatus for generating video with a text-driven digital person as in claim 5, wherein the emotional tendency comprises: positive, negative, or neutral.
7. The apparatus for generating video with a text-driven digital person of claim 6, further comprising:
displaying a happy facial expression and action when the text content is positive;
when the text content is negative, displaying a sad or angry facial expression and action.
8. The apparatus for generating video with a text-driven digital person of claim 5, wherein the transitional effect comprises: fade in and fade out, rotation and translation.
CN202311322110.5A 2023-10-13 2023-10-13 Method and device for generating video by using word driving digital person Active CN117058286B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311322110.5A CN117058286B (en) 2023-10-13 2023-10-13 Method and device for generating video by using word driving digital person

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311322110.5A CN117058286B (en) 2023-10-13 2023-10-13 Method and device for generating video by using word driving digital person

Publications (2)

Publication Number Publication Date
CN117058286A CN117058286A (en) 2023-11-14
CN117058286B true CN117058286B (en) 2024-01-23

Family

ID=88664907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311322110.5A Active CN117058286B (en) 2023-10-13 2023-10-13 Method and device for generating video by using word driving digital person

Country Status (1)

Country Link
CN (1) CN117058286B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117348736B (en) * 2023-12-06 2024-03-19 彩讯科技股份有限公司 Digital interaction method, system and medium based on artificial intelligence


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110285727A1 (en) * 2010-05-24 2011-11-24 Microsoft Corporation Animation transition engine
US11860925B2 (en) * 2020-04-17 2024-01-02 Accenture Global Solutions Limited Human centered computing based digital persona generation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106463118A (en) * 2016-07-07 2017-02-22 深圳狗尾草智能科技有限公司 Method, system and robot for synchronizing speech and virtual movement
WO2022182064A1 (en) * 2021-02-28 2022-09-01 조지수 Conversation learning system using artificial intelligence avatar tutor, and method therefor
CN113096252A (en) * 2021-03-05 2021-07-09 华中师范大学 Multi-movement mechanism fusion method in hybrid enhanced teaching scene
CN115220682A (en) * 2021-08-03 2022-10-21 达闼机器人股份有限公司 Method and device for driving virtual portrait by audio and electronic equipment
CN114969282A (en) * 2022-05-05 2022-08-30 迈吉客科技(北京)有限公司 Intelligent interaction method based on rich media knowledge graph multi-modal emotion analysis model
CN116016986A (en) * 2023-01-09 2023-04-25 上海元梦智能科技有限公司 Virtual person interactive video rendering method and device
CN116311456A (en) * 2023-03-23 2023-06-23 应急管理部大数据中心 Personalized virtual human expression generating method based on multi-mode interaction information
CN116863038A (en) * 2023-07-07 2023-10-10 东博未来人工智能研究院(厦门)有限公司 Method for generating digital human voice and facial animation by text

Also Published As

Publication number Publication date
CN117058286A (en) 2023-11-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant