CN112992116A - Automatic generation method and system of video content - Google Patents

Automatic generation method and system of video content

Info

Publication number
CN112992116A
CN112992116A
Authority
CN
China
Prior art keywords
word
unit
text
story content
starting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110202986.0A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Shenzhi Technology Co ltd
Original Assignee
Beijing Zhongke Shenzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Shenzhi Technology Co ltd
Priority to CN202110202986.0A
Publication of CN112992116A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser

Abstract

The invention discloses a method and a system for automatically generating video content. The method comprises: generating story content from input data; synthesizing the story content in text form into read-aloud audio with the voice characteristics of a designated character; and using the read-aloud audio as the input of a facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video. During story generation, only a single starting word is used as the basis for predicting each new word, which greatly increases the generation speed of the story content and in turn guarantees the speed at which the story text is subsequently converted into audio and the audio drives the designated character's animation.

Description

Automatic generation method and system of video content
Technical Field
The invention relates to the technical field of video synthesis, and in particular to a method and a system for automatically generating video content.
Background
When people read a novel or a children's story, the scenes described by the text readily call real scenes to mind. For example, a reader may imagine that the witch speaks with a somewhat "evil" voice while the protagonist's voice is sweet and clear. A novel or story presented to the user only as text cannot create a feeling of being on the scene, and so the audio novel came into being. Audio novels, however, have their own limitation: they create immersion only through hearing, lack a sense of picture, and cannot present the story's scenes to the user intuitively in animated form. Voice-driven character facial animation technology was therefore born to solve this problem.
Existing story content, however, is written in advance: once the story has been read, its ending is known and the suspense is gone. People therefore hope to customize story content. For example, a child who wants to hear a story about the ocean world should need to give only the two characters for "ocean" to have a brand-new story about the ocean world generated automatically and presented in audio-video form, preserving the sense of anticipation. Yet existing methods for automatically generating story content are few, and in those few the generation algorithm is complex and the generation speed very slow, so real-time generation cannot be guaranteed, which in turn limits the speed of the subsequent text-to-audio and audio-to-video conversion.
Disclosure of Invention
The present invention aims to provide a method and a system for automatically generating video content, so as to solve the above technical problems.
To achieve this purpose, the invention adopts the following technical scheme:
A method for automatically generating video content comprises the following specific steps:
1) generating story content from input data;
2) synthesizing the story content in text form into read-aloud audio with the voice characteristics of a designated character;
3) using the read-aloud audio as the input of a facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video;
In step 1), the specific method of generating story content from the input data comprises the steps of:
1.1) giving a starting word;
1.2) converting the starting word into a word vector that can represent it;
1.3) calculating, from the word vector associated with the starting word, the probability that each word in the vocabulary can serve as the next word after the starting word;
1.4) selecting the word with the highest probability as a new word, appending it to the tail of the starting word, and forming a new word sequence together with the starting word;
1.5) extracting the last word in the word sequence, taking the extracted word as the given starting word, and repeating steps 1.2)-1.4) to form a plurality of word sequences;
1.6) splicing the word sequences, in order of their formation time from earliest to latest, into the story content in text form.
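Steps 1.1) to 1.6) amount to greedy decoding with a one-word context. Below is a minimal sketch, assuming a word-embedding callable `embed`, a probability model `next_word_probs`, and a fixed number of rounds as the stopping criterion; the patent specifies neither the probability calculation nor when generation stops, so all three are assumptions:

```python
import numpy as np

def generate_story(start_word, vocab, embed, next_word_probs, num_steps=50):
    """Greedy word-by-word story generation following steps 1.1)-1.6).

    embed(word) -> word vector (step 1.2); next_word_probs(vec) -> one
    probability per vocabulary word (step 1.3). Both callables and the
    fixed step count are assumptions, since the patent treats the
    probability model itself as out of scope.
    """
    words = [start_word]                         # step 1.1) the given starting word
    current = start_word
    for _ in range(num_steps):
        vec = embed(current)                     # step 1.2) word -> word vector
        probs = next_word_probs(vec)             # step 1.3) P(next word) over the vocabulary
        new_word = vocab[int(np.argmax(probs))]  # step 1.4) highest-probability word
        words.append(new_word)                   # ...appended to the tail of the sequence
        current = new_word                       # step 1.5) last word becomes the new starting word
    return "".join(words)                        # step 1.6) splice in formation order
```

Because each round keeps only the last word as context, every prediction is cheap; this is the source of the generation-speed advantage described above.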
As a preferred scheme of the present invention, in step 2), the specific method of synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character comprises:
2.1) analyzing the text sentence structure of the story content to identify the text language, and segmenting the input text into clauses;
2.2) performing text regularization on the segmented clauses;
2.3) converting the regularized clause text into phonemes;
2.4) performing prosody prediction on the clauses;
2.5) combining the phonemes and prosody of the clauses into linguistic information;
2.6) determining the pronunciation duration of each character in the clauses through a preset duration model;
2.7) converting the linguistic information into the voice characteristics of the designated character through an acoustic model;
2.8) converting the voice characteristics into sound through a vocoder and outputting the sound.
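Steps 2.1) to 2.8) describe a conventional TTS front-end and back-end. The sketch below captures only the order of operations; the component objects (`frontend`, `duration_model`, `acoustic_model`, `vocoder`) and their method names are hypothetical stand-ins, not the patent's actual models:

```python
def synthesize_speech(story_text, character, frontend, duration_model,
                      acoustic_model, vocoder):
    """Read-aloud audio synthesis following steps 2.1)-2.8); every
    component interface here is an assumption."""
    clauses = frontend.split_clauses(story_text)    # step 2.1) language ID + clause segmentation
    waveform = []
    for clause in clauses:
        clause = frontend.normalize(clause)         # step 2.2) digits/punctuation -> characters
        phonemes = frontend.to_phonemes(clause)     # step 2.3) grapheme-to-phoneme conversion
        prosody = frontend.predict_prosody(clause)  # step 2.4) prosody prediction
        linguistic = (phonemes, prosody)            # step 2.5) combined linguistic information
        durations = duration_model.predict(clause)  # step 2.6) per-character pronunciation duration
        features = acoustic_model.convert(          # step 2.7) voice characteristics of the
            linguistic, durations, speaker=character)  #        designated character
        waveform.append(vocoder.to_audio(features))    # step 2.8) features -> sound
    return b"".join(waveform)                          # concatenated clause audio
```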
The invention also provides an automatic video content generation system that implements the above method, the system comprising:
a story content generation module for generating story content from the input data;
an audio synthesis module, connected with the story content generation module, for synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character;
and a facial animation synthesis module, connected with the audio synthesis module, for using the read-aloud audio as the input of the facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video.
As a preferred scheme of the present invention, the story content generation module specifically comprises:
a starting word giving unit, by which the user gives a starting word;
a word conversion unit, connected with the starting word giving unit, for converting the starting word into a word vector that can represent it;
a word prediction unit, connected to the word conversion unit, for calculating, from the word vector associated with the starting word, the probability that each word in the vocabulary can serve as the next word after the starting word;
a word selection unit, connected with the word prediction unit, for automatically selecting from the probability calculation results the word with the highest probability as the new word to be appended to the tail of the starting word;
a new word adding unit, connected with the word selection unit, for appending the new word to the tail of the starting word;
a word sequence forming unit, connected with the starting word giving unit and the new word adding unit, for forming and storing the starting word and the new word appended to its tail as a word sequence;
a starting word obtaining unit, connected to the word sequence forming unit and the word conversion unit, for extracting the last word from the formed word sequence as the given starting word;
and a story content generation unit, connected with the word sequence forming unit, for splicing the word sequences, in order of their formation time from earliest to latest, into the story content in text form.
As a preferred scheme of the present invention, the audio synthesis module specifically comprises:
a sentence structure analysis unit for analyzing the text sentence structure of the story content to identify the text language and segmenting the input text into clauses;
a text regularization unit for performing text regularization on the segmented clauses;
a clause text conversion unit, connected with the text regularization unit, for converting the regularized clause text into phonemes;
a prosody prediction unit for performing prosody prediction on the clauses;
a clause linguistic information generation unit, respectively connected to the clause text conversion unit and the prosody prediction unit, for combining the phonemes and prosody of the clause into linguistic information;
a pronunciation duration setting unit for determining the pronunciation duration of each character in the clauses through a preset duration model;
a linguistic information conversion unit, connected with the clause linguistic information generation unit, for converting the linguistic information into the voice characteristics of the designated character through an acoustic model and outputting the voice characteristics;
and a voice characteristic conversion unit, connected with the linguistic information conversion unit, for converting the voice characteristics into sound through a vocoder and outputting the sound.
According to the invention, a starting word is obtained, the next word is predicted from the word vector corresponding to that starting word, the predicted word is appended to the tail of the starting word to form a word sequence, and finally the word sequences are spliced, in order of their formation time, into the complete story content. Because only a single starting word serves as the basis for predicting each new word, story generation is fast, which in turn ensures the speed of the subsequent text-to-audio and audio-to-video conversion.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required by the embodiments are briefly described below. It is obvious that the drawings described below show only some embodiments of the invention, and that a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of a method for automatically generating video content according to an embodiment of the present invention;
Fig. 2 is a diagram of the method steps for generating story content from input data;
Fig. 3 is a diagram of the method steps for synthesizing the story content into read-aloud audio with the voice characteristics of a designated character;
Fig. 4 is a schematic structural diagram of an automatic video content generation system according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the internal structure of the story content generation module in the automatic video content generation system according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the internal structure of the audio synthesis module in the automatic video content generation system according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further explained below through specific embodiments in combination with the drawings.
The drawings are for illustrative purposes only and show schematic rather than actual forms; they are not to be construed as limiting this patent. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the size of an actual product. It will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components. In the description of the present invention, it should be understood that terms such as "upper", "lower", "left", "right", "inner", and "outer", if used to indicate an orientation or positional relationship, are based on the orientations or positional relationships shown in the drawings, serve only to simplify the description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore illustrative only, are not to be construed as limitations of this patent, and their specific meanings can be understood by those skilled in the art according to the specific situation.
In the description of the present invention, unless otherwise explicitly specified or limited, terms such as "connected", where they indicate a relationship between components, are to be understood broadly: the connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through intervening media; or an interaction between two components. The specific meanings of these terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
An automatic video content generation method provided in an embodiment of the present invention, as shown in fig. 1, comprises the following steps:
step 1) generating story content from input data;
step 2) synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character;
and step 3) using the read-aloud audio as the input of the facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video (one possible realization of this step is sketched below).
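The patent specifies step 3) only by its input (the read-aloud audio) and its output (a facial animation video). The following is a minimal sketch under stated assumptions: a hypothetical `animation_model` that maps short audio windows to facial control parameters, and a hypothetical `renderer` that turns those parameters into image frames; neither interface is described in the patent.

```python
def drive_facial_animation(audio, animation_model, renderer,
                           sample_rate=16000, frame_rate=25):
    """Audio-driven facial animation sketch for step 3); the model and
    renderer interfaces are assumptions, not the patent's design."""
    hop = sample_rate // frame_rate                # audio samples per video frame
    frames = []
    for start in range(0, len(audio) - hop + 1, hop):
        window = audio[start:start + hop]          # audio window aligned to one frame
        params = animation_model.predict(window)   # e.g. mouth-shape/expression coefficients
        frames.append(renderer.render(params))     # one frame of the facial animation video
    return frames                                  # frame list; muxing with the audio is omitted
```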
In step 1), as shown in fig. 2, the specific method of generating story content from the input data comprises the steps of:
step 1.1) giving a starting word;
step 1.2) converting the starting word into a word vector that can represent it;
step 1.3) calculating, from the word vector associated with the starting word, the probability that each word in the vocabulary can serve as the next word after the starting word; since the specific process of calculating this probability from the word vector in combination with a preset vocabulary is not within the scope of the claimed invention, it is not set forth here;
step 1.4) selecting the word with the highest probability as a new word, appending it to the tail of the starting word, and forming a new word sequence together with the starting word;
step 1.5) extracting the last word in the word sequence, taking the extracted word as the given starting word, and repeating steps 1.2) to 1.4) to form a plurality of word sequences;
and step 1.6) splicing the word sequences, in order of their formation time from earliest to latest, into the story content in text form.
The above principle of generating story content from input data is briefly illustrated as follows:
For example, suppose the starting word is "sea", and the probability calculation finds "ocean" to be the vocabulary word with the highest probability of following the tail of "sea". The word "ocean" is then appended to the tail of "sea", forming the word sequence "sea ocean". Next, "ocean" is extracted from this sequence as the new starting word and the prediction process is repeated: "world" is calculated to be the word with the highest probability of following "ocean" and is appended to its tail, forming the new word sequence "ocean world". When prediction of the story content is complete, the word sequences are spliced, so that "sea", "ocean", and "world" combine into "ocean world" (in the original Chinese example, 海 "sea" plus 洋 "ocean" form 海洋 "ocean", and appending 世界 "world" yields 海洋世界 "ocean world"). A toy version of this walk is sketched below.
In step 2), as shown in fig. 3, the method of synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character specifically comprises the following steps:
step 2.1) analyzing the sentence structure of the input text (the story content) to identify the text language, and segmenting the input text into clauses (i.e., dividing the story content into individual sentences);
step 2.2) performing text regularization on the segmented clauses; the purpose of regularization is to convert punctuation marks and numbers in a sentence into Chinese characters;
step 2.3) converting the clause text into phonemes; because Chinese contains polyphonic characters, how each character should be read must be decided correctly through auxiliary information and algorithms, where the auxiliary information includes the word segmentation and the part of speech of each word; this pronunciation representation is generally referred to as phonemes;
step 2.4) performing prosody prediction on the clauses; prosody is the rhythm with which a sentence is read aloud, and speech without prosody sounds stiff and unnatural, so the prosody of each clause must be predicted;
step 2.5) the linguistic information generation module combines the phonemes and prosody of the clauses into linguistic information;
step 2.6) determining the pronunciation duration of each character in the clause through a preset duration model; when a sentence is read aloud, the pronunciation duration of each character varies with the context, so the duration of each character must be determined to ensure the naturalness of the audio output;
step 2.7) converting the linguistic information into the voice characteristics of the designated character through an acoustic model;
and step 2.8) converting the voice characteristics into sound through a vocoder and outputting the sound. A small sketch of the regularization in step 2.2) follows.
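As a concrete illustration of the regularization in step 2.2), the minimal sketch below spells out digits as Chinese characters; real regularization also handles punctuation, dates, units, and similar patterns, which are omitted here:

```python
# Digit readings are standard Mandarin; every other character passes through.
DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def regularize(clause: str) -> str:
    """Replace each digit with the Chinese character for its reading."""
    return "".join(DIGITS.get(ch, ch) for ch in clause)

print(regularize("小鱼游了3天"))  # -> "小鱼游了三天" (the little fish swam for three days)
```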
The present invention also provides an automatic video content generation system; as shown in fig. 4, the system comprises:
a story content generation module for generating story content from the input data;
an audio synthesis module, connected with the story content generation module, for synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character;
and a facial animation synthesis module, connected with the audio synthesis module, for using the read-aloud audio as the input of the facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video. One possible wiring of these three modules is sketched below.
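A minimal sketch of how the three modules of fig. 4 might be wired together; the class and method names are assumptions, and only the module boundaries and the data flow between them come from the patent:

```python
class VideoContentGenerator:
    """Story generation -> audio synthesis -> facial animation (fig. 4)."""

    def __init__(self, story_module, audio_module, animation_module):
        self.story_module = story_module            # story content generation module
        self.audio_module = audio_module            # audio synthesis module
        self.animation_module = animation_module    # facial animation synthesis module

    def run(self, input_data, character):
        story = self.story_module.generate(input_data)          # text-form story content
        audio = self.audio_module.synthesize(story, character)  # read-aloud audio
        return self.animation_module.animate(audio)             # facial animation video
```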
Specifically, as shown in fig. 5, the story content generation module comprises:
a starting word giving unit, by which the user gives a starting word;
a word conversion unit, connected with the starting word giving unit, for converting the starting word into a word vector that can represent it;
a word prediction unit, connected to the word conversion unit, for calculating, from the word vector associated with the starting word, the probability that each word in the vocabulary can serve as the next word after the starting word;
a word selection unit, connected with the word prediction unit, for automatically selecting from the probability calculation results the word with the highest probability as the new word to be appended to the tail of the starting word;
a new word adding unit, connected with the word selection unit, for appending the new word to the tail of the starting word;
a word sequence forming unit, connected with the starting word giving unit and the new word adding unit, for forming and storing the starting word and the new word appended to its tail as a word sequence;
a starting word obtaining unit, connected to the word sequence forming unit and the word conversion unit, for extracting the last word from the formed word sequence as the given starting word;
and a story content generation unit, connected with the word sequence forming unit, for splicing the word sequences, in order of their formation time from earliest to latest, into the story content in text form.
Specifically, as shown in fig. 6, the audio synthesis module comprises:
a sentence structure analysis unit for analyzing the text sentence structure of the input story content to identify the text language and segmenting the input text into clauses;
a text regularization unit for performing text regularization on the segmented clauses;
a clause text conversion unit, connected with the text regularization unit, for converting the regularized clause text into phonemes;
a prosody prediction unit for performing prosody prediction on the clauses;
a clause linguistic information generation unit, respectively connected with the clause text conversion unit and the prosody prediction unit, for combining the phonemes and prosody of the clause into linguistic information;
a pronunciation duration setting unit for determining the pronunciation duration of each character in the clauses through a preset duration model;
a linguistic information conversion unit, connected with the clause linguistic information generation unit, for converting the linguistic information into the voice characteristics of the designated character through the acoustic model and outputting the voice characteristics;
and a voice characteristic conversion unit, connected with the linguistic information conversion unit, for converting the voice characteristics into sound through the vocoder and outputting the sound.
It should be understood that the above are merely preferred embodiments of the invention and the technical principles applied therein. Those skilled in the art may make various modifications, equivalent substitutions, and changes to the present invention; such variations fall within the scope of the invention as long as they do not depart from its spirit. In addition, certain terms used in the description and claims of the present application are not limiting and are used merely for convenience of description.

Claims (5)

1. An automatic video content generation method, characterized by comprising the following specific steps:
1) generating story content from input data;
2) synthesizing the story content in text form into read-aloud audio with the voice characteristics of a designated character;
3) using the read-aloud audio as the input of a facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video;
in step 1), the specific method of generating story content from the input data comprises the steps of:
1.1) giving a starting word;
1.2) converting the starting word into a word vector that can represent it;
1.3) calculating, from the word vector associated with the starting word, the probability that each word in the vocabulary can serve as the next word after the starting word;
1.4) selecting the word with the highest probability as a new word, appending it to the tail of the starting word, and forming a new word sequence together with the starting word;
1.5) extracting the last word in the word sequence, taking the extracted word as the given starting word, and repeating steps 1.2)-1.4) to form a plurality of word sequences;
1.6) splicing the word sequences, in order of their formation time from earliest to latest, into the story content in text form.
2. The automatic video content generation method according to claim 1, characterized in that, in step 2), the specific method of synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character comprises:
2.1) analyzing the text sentence structure of the story content to identify the text language, and segmenting the input text into clauses;
2.2) performing text regularization on the segmented clauses;
2.3) converting the regularized clause text into phonemes;
2.4) performing prosody prediction on the clauses;
2.5) combining the phonemes and prosody of the clauses into linguistic information;
2.6) determining the pronunciation duration of each character in the clauses through a preset duration model;
2.7) converting the linguistic information into the voice characteristics of the designated character through an acoustic model;
2.8) converting the voice characteristics into sound through a vocoder and outputting the sound.
3. An automatic video content generation system capable of implementing the automatic video content generation method according to any one of claims 1 to 2, the system comprising:
a story content generation module for generating story content from the input data;
an audio synthesis module, connected with the story content generation module, for synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character;
and a facial animation synthesis module, connected with the audio synthesis module, for using the read-aloud audio as the input of the facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video.
4. The automatic video content generation system according to claim 3, characterized in that the story content generation module specifically comprises:
a starting word giving unit, by which the user gives a starting word;
a word conversion unit, connected with the starting word giving unit, for converting the starting word into a word vector that can represent it;
a word prediction unit, connected to the word conversion unit, for calculating, from the word vector associated with the starting word, the probability that each word in the vocabulary can serve as the next word after the starting word;
a word selection unit, connected with the word prediction unit, for automatically selecting from the probability calculation results the word with the highest probability as the new word to be appended to the tail of the starting word;
a new word adding unit, connected with the word selection unit, for appending the new word to the tail of the starting word;
a word sequence forming unit, connected with the starting word giving unit and the new word adding unit, for forming and storing the starting word and the new word appended to its tail as a word sequence;
a starting word obtaining unit, connected to the word sequence forming unit and the word conversion unit, for extracting the last word from the formed word sequence as the given starting word;
and a story content generation unit, connected with the word sequence forming unit, for splicing the word sequences, in order of their formation time from earliest to latest, into the story content in text form.
5. The automatic video content generation system according to claim 3, characterized in that the audio synthesis module specifically comprises:
a sentence structure analysis unit for analyzing the text sentence structure of the story content to identify the text language and segmenting the input text into clauses;
a text regularization unit for performing text regularization on the segmented clauses;
a clause text conversion unit, connected with the text regularization unit, for converting the regularized clause text into phonemes;
a prosody prediction unit for performing prosody prediction on the clauses;
a clause linguistic information generation unit, respectively connected to the clause text conversion unit and the prosody prediction unit, for combining the phonemes and prosody of the clause into linguistic information;
a pronunciation duration setting unit for determining the pronunciation duration of each character in the clauses through a preset duration model;
a linguistic information conversion unit, connected with the clause linguistic information generation unit, for converting the linguistic information into the voice characteristics of the designated character through an acoustic model and outputting the voice characteristics;
and a voice characteristic conversion unit, connected with the linguistic information conversion unit, for converting the voice characteristics into sound through a vocoder and outputting the sound.
CN202110202986.0A 2021-02-24 2021-02-24 Automatic generation method and system of video content Pending CN112992116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110202986.0A CN112992116A (en) 2021-02-24 2021-02-24 Automatic generation method and system of video content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110202986.0A CN112992116A (en) 2021-02-24 2021-02-24 Automatic generation method and system of video content

Publications (1)

Publication Number Publication Date
CN112992116A (en) 2021-06-18

Family

ID=76349738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110202986.0A Pending CN112992116A (en) 2021-02-24 2021-02-24 Automatic generation method and system of video content

Country Status (1)

Country Link
CN (1) CN112992116A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10586369B1 (en) * 2018-01-31 2020-03-10 Amazon Technologies, Inc. Using dialog and contextual data of a virtual reality environment to create metadata to drive avatar animation
CN110867177A (en) * 2018-08-16 2020-03-06 林其禹 Voice playing system with selectable timbre, playing method thereof and readable recording medium
CN110209803A (en) * 2019-06-18 2019-09-06 腾讯科技(深圳)有限公司 Story generation method, device, computer equipment and storage medium
CN110941960A (en) * 2019-11-12 2020-03-31 广州爱学信息科技有限公司 Keyword-based children picture story generation method, system and equipment
CN111402855A (en) * 2020-03-06 2020-07-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴宇晗 [WU Yuhan]: "基于深度学习的自适应游戏剧情生成系统研究" [Research on an adaptive game plot generation system based on deep learning], 《智能计算机与应用》 [Intelligent Computer and Applications], vol. 9, no. 5, 30 September 2019 (2019-09-30), pages 87-94 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116484048A (en) * 2023-04-21 2023-07-25 深圳市吉屋网络技术有限公司 Video content automatic generation method and system

Similar Documents

Publication Publication Date Title
US11837216B2 (en) Speech recognition using unspoken text and speech synthesis
US11514888B2 (en) Two-level speech prosody transfer
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
KR102116309B1 (en) Synchronization animation output system of virtual characters and text
CN115485766A (en) Speech synthesis prosody using BERT models
CN113538641A (en) Animation generation method and device, storage medium and electronic equipment
CN116863038A (en) Method for generating digital human voice and facial animation by text
CN112992116A (en) Automatic generation method and system of video content
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
KR102319753B1 (en) Method and apparatus for producing video contents based on deep learning
US20230148275A1 (en) Speech synthesis device and speech synthesis method
CN114242032A (en) Speech synthesis method, apparatus, device, storage medium and program product
Verma et al. Animating expressive faces across languages
CN113628609A (en) Automatic audio content generation
CN111916054B (en) Lip-based voice generation method, device and system and storage medium
CN113257225B (en) Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
CN113763924B (en) Acoustic deep learning model training method, and voice generation method and device
CN116052640A (en) Speech synthesis method and device
CN114267330A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN115346512A (en) Multi-emotion voice synthesis method based on digital people
CN113870828A (en) Audio synthesis method and device, electronic equipment and readable storage medium
CN117935807A (en) Method, device, equipment and storage medium for driving mouth shape of digital person
CN117789757A (en) Character target expression animation generation method and system based on intelligent semantic analysis
WO2024069471A1 (en) Method and system for producing synthesized speech digital audio content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination