CN112992116A - Automatic generation method and system of video content - Google Patents

Automatic generation method and system of video content

Info

Publication number
CN112992116A
CN112992116A
Authority
CN
China
Prior art keywords
word
unit
text
story content
starting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110202986.0A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Shenzhi Technology Co ltd
Original Assignee
Beijing Zhongke Shenzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Shenzhi Technology Co ltd
Priority to CN202110202986.0A
Publication of CN112992116A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser

Abstract

The invention discloses a method and a system for automatically generating video content. The method comprises: generating story content from input data; synthesizing the story content in text form into read-aloud audio with the voice characteristics of a designated character; and using the read-aloud audio as the input of a facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video. During story generation, only a single starting word is used as the basis for predicting each new word, which greatly increases the generation speed of the story content and in turn guarantees the speed at which the story text is subsequently converted into audio and the audio drives the designated character's animation.

Description

Automatic generation method and system of video content
Technical Field
The invention relates to the technical field of video synthesis, and in particular to a method and a system for automatically generating video content.
Background
When people read a novel or a children's story, the scenes described by the text readily call real scenes to mind. For example, a reader may imagine that the witch speaks with a somewhat "evil" voice while the protagonist's voice is sweet and clear. A novel or story presented to the user only as text cannot create a feeling of being on the scene, and so the audio novel came into being. Audio novels, however, have their own limitation: they create immersion only through hearing, lack a sense of picture, and cannot present the story's scenes to the user intuitively in animated form. Voice-driven character facial animation technology was therefore born to solve this problem.
Existing story content, however, is written in advance: once the story has been read, its ending is known and the suspense is gone. People therefore hope to customize story content. For example, a child who wants to hear a story about the ocean world should need to give only the two characters for "ocean" to have a brand-new story about the ocean world generated automatically and presented in audio-video form, preserving the sense of anticipation. Yet existing methods for automatically generating story content are few, and in those few the generation algorithm is complex and the generation speed very slow, so real-time generation cannot be guaranteed, which in turn limits the speed of the subsequent text-to-audio and audio-to-video conversion.
Disclosure of Invention
The present invention aims to provide a method and a system for automatically generating video content, so as to solve the above technical problems.
To achieve this purpose, the invention adopts the following technical scheme:
A method for automatically generating video content comprises the following specific steps:
1) generating story content from input data;
2) synthesizing the story content in text form into read-aloud audio with the voice characteristics of a designated character;
3) using the read-aloud audio as the input of a facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video;
In step 1), the specific method of generating story content from the input data comprises the steps of:
1.1) giving a starting word;
1.2) converting the starting word into a word vector that can represent it;
1.3) calculating, from the word vector associated with the starting word, the probability that each word in the vocabulary can serve as the next word after the starting word;
1.4) selecting the word with the highest probability as a new word, appending it to the tail of the starting word, and forming a new word sequence together with the starting word;
1.5) extracting the last word in the word sequence, taking the extracted word as the given starting word, and repeating steps 1.2)-1.4) to form a plurality of word sequences;
1.6) splicing the word sequences, in order of their formation time from earliest to latest, into the story content in text form.
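Steps 1.1) to 1.6) amount to greedy decoding with a one-word context. Below is a minimal sketch, assuming a word-embedding callable `embed`, a probability model `next_word_probs`, and a fixed number of rounds as the stopping criterion; the patent specifies neither the probability calculation nor when generation stops, so all three are assumptions:

```python
import numpy as np

def generate_story(start_word, vocab, embed, next_word_probs, num_steps=50):
    """Greedy word-by-word story generation following steps 1.1)-1.6).

    embed(word) -> word vector (step 1.2); next_word_probs(vec) -> one
    probability per vocabulary word (step 1.3). Both callables and the
    fixed step count are assumptions, since the patent treats the
    probability model itself as out of scope.
    """
    words = [start_word]                         # step 1.1) the given starting word
    current = start_word
    for _ in range(num_steps):
        vec = embed(current)                     # step 1.2) word -> word vector
        probs = next_word_probs(vec)             # step 1.3) P(next word) over the vocabulary
        new_word = vocab[int(np.argmax(probs))]  # step 1.4) highest-probability word
        words.append(new_word)                   # ...appended to the tail of the sequence
        current = new_word                       # step 1.5) last word becomes the new starting word
    return "".join(words)                        # step 1.6) splice in formation order
```

Because each round keeps only the last word as context, every prediction is cheap; this is the source of the generation-speed advantage described above.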
As a preferred scheme of the present invention, in step 2), the specific method of synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character comprises:
2.1) analyzing the text sentence structure of the story content to identify the text language, and segmenting the input text into clauses;
2.2) performing text regularization on the segmented clauses;
2.3) converting the regularized clause text into phonemes;
2.4) performing prosody prediction on the clauses;
2.5) combining the phonemes and prosody of the clauses into linguistic information;
2.6) determining the pronunciation duration of each character in the clauses through a preset duration model;
2.7) converting the linguistic information into the voice characteristics of the designated character through an acoustic model;
2.8) converting the voice characteristics into sound through a vocoder and outputting the sound.
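Steps 2.1) to 2.8) describe a conventional TTS front-end and back-end. The sketch below captures only the order of operations; the component objects (`frontend`, `duration_model`, `acoustic_model`, `vocoder`) and their method names are hypothetical stand-ins, not the patent's actual models:

```python
def synthesize_speech(story_text, character, frontend, duration_model,
                      acoustic_model, vocoder):
    """Read-aloud audio synthesis following steps 2.1)-2.8); every
    component interface here is an assumption."""
    clauses = frontend.split_clauses(story_text)    # step 2.1) language ID + clause segmentation
    waveform = []
    for clause in clauses:
        clause = frontend.normalize(clause)         # step 2.2) digits/punctuation -> characters
        phonemes = frontend.to_phonemes(clause)     # step 2.3) grapheme-to-phoneme conversion
        prosody = frontend.predict_prosody(clause)  # step 2.4) prosody prediction
        linguistic = (phonemes, prosody)            # step 2.5) combined linguistic information
        durations = duration_model.predict(clause)  # step 2.6) per-character pronunciation duration
        features = acoustic_model.convert(          # step 2.7) voice characteristics of the
            linguistic, durations, speaker=character)  #        designated character
        waveform.append(vocoder.to_audio(features))    # step 2.8) features -> sound
    return b"".join(waveform)                          # concatenated clause audio
```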
The invention also provides an automatic video content generation system that implements the above method, the system comprising:
a story content generation module for generating story content from the input data;
an audio synthesis module, connected with the story content generation module, for synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character;
and a facial animation synthesis module, connected with the audio synthesis module, for using the read-aloud audio as the input of the facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video.
As a preferred scheme of the present invention, the story content generation module specifically comprises:
a starting word giving unit, by which the user gives a starting word;
a word conversion unit, connected with the starting word giving unit, for converting the starting word into a word vector that can represent it;
a word prediction unit, connected to the word conversion unit, for calculating, from the word vector associated with the starting word, the probability that each word in the vocabulary can serve as the next word after the starting word;
a word selection unit, connected with the word prediction unit, for automatically selecting from the probability calculation results the word with the highest probability as the new word to be appended to the tail of the starting word;
a new word adding unit, connected with the word selection unit, for appending the new word to the tail of the starting word;
a word sequence forming unit, connected with the starting word giving unit and the new word adding unit, for forming and storing the starting word and the new word appended to its tail as a word sequence;
a starting word obtaining unit, connected to the word sequence forming unit and the word conversion unit, for extracting the last word from the formed word sequence as the given starting word;
and a story content generation unit, connected with the word sequence forming unit, for splicing the word sequences, in order of their formation time from earliest to latest, into the story content in text form.
As a preferred scheme of the present invention, the audio synthesis module specifically comprises:
a sentence structure analysis unit for analyzing the text sentence structure of the story content to identify the text language and segmenting the input text into clauses;
a text regularization unit for performing text regularization on the segmented clauses;
a clause text conversion unit, connected with the text regularization unit, for converting the regularized clause text into phonemes;
a prosody prediction unit for performing prosody prediction on the clauses;
a clause linguistic information generation unit, respectively connected to the clause text conversion unit and the prosody prediction unit, for combining the phonemes and prosody of the clause into linguistic information;
a pronunciation duration setting unit for determining the pronunciation duration of each character in the clauses through a preset duration model;
a linguistic information conversion unit, connected with the clause linguistic information generation unit, for converting the linguistic information into the voice characteristics of the designated character through an acoustic model and outputting the voice characteristics;
and a voice characteristic conversion unit, connected with the linguistic information conversion unit, for converting the voice characteristics into sound through a vocoder and outputting the sound.
According to the invention, a starting word is obtained, the next word is predicted from the word vector corresponding to that starting word, the predicted word is appended to the tail of the starting word to form a word sequence, and finally the word sequences are spliced, in order of their formation time, into the complete story content. Because only a single starting word serves as the basis for predicting each new word, story generation is fast, which in turn ensures the speed of the subsequent text-to-audio and audio-to-video conversion.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required by the embodiments are briefly described below. It is obvious that the drawings described below show only some embodiments of the invention, and that a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of a method for automatically generating video content according to an embodiment of the present invention;
Fig. 2 is a diagram of the method steps for generating story content from input data;
Fig. 3 is a diagram of the method steps for synthesizing the story content into read-aloud audio with the voice characteristics of a designated character;
Fig. 4 is a schematic structural diagram of an automatic video content generation system according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the internal structure of the story content generation module in the automatic video content generation system according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the internal structure of the audio synthesis module in the automatic video content generation system according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further explained below through specific embodiments in combination with the drawings.
The drawings are for illustrative purposes only and show schematic rather than actual forms; they are not to be construed as limiting this patent. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the size of an actual product. It will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components. In the description of the present invention, it should be understood that terms such as "upper", "lower", "left", "right", "inner", and "outer", if used to indicate an orientation or positional relationship, are based on the orientations or positional relationships shown in the drawings, serve only to simplify the description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore illustrative only, are not to be construed as limitations of this patent, and their specific meanings can be understood by those skilled in the art according to the specific situation.
In the description of the present invention, unless otherwise explicitly specified or limited, terms such as "connected", where they indicate a relationship between components, are to be understood broadly: the connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through intervening media; or an interaction between two components. The specific meanings of these terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
An automatic video content generation method provided in an embodiment of the present invention, as shown in fig. 1, comprises the following steps:
step 1) generating story content from input data;
step 2) synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character;
and step 3) using the read-aloud audio as the input of the facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video (one possible realization of this step is sketched below).
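The patent specifies step 3) only by its input (the read-aloud audio) and its output (a facial animation video). The following is a minimal sketch under stated assumptions: a hypothetical `animation_model` that maps short audio windows to facial control parameters, and a hypothetical `renderer` that turns those parameters into image frames; neither interface is described in the patent.

```python
def drive_facial_animation(audio, animation_model, renderer,
                           sample_rate=16000, frame_rate=25):
    """Audio-driven facial animation sketch for step 3); the model and
    renderer interfaces are assumptions, not the patent's design."""
    hop = sample_rate // frame_rate                # audio samples per video frame
    frames = []
    for start in range(0, len(audio) - hop + 1, hop):
        window = audio[start:start + hop]          # audio window aligned to one frame
        params = animation_model.predict(window)   # e.g. mouth-shape/expression coefficients
        frames.append(renderer.render(params))     # one frame of the facial animation video
    return frames                                  # frame list; muxing with the audio is omitted
```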
In step 1), as shown in fig. 2, the specific method of generating story content from the input data comprises the steps of:
step 1.1) giving a starting word;
step 1.2) converting the starting word into a word vector that can represent it;
step 1.3) calculating, from the word vector associated with the starting word, the probability that each word in the vocabulary can serve as the next word after the starting word; since the specific process of calculating this probability from the word vector in combination with a preset vocabulary is not within the scope of the claimed invention, it is not set forth here;
step 1.4) selecting the word with the highest probability as a new word, appending it to the tail of the starting word, and forming a new word sequence together with the starting word;
step 1.5) extracting the last word in the word sequence, taking the extracted word as the given starting word, and repeating steps 1.2) to 1.4) to form a plurality of word sequences;
and step 1.6) splicing the word sequences, in order of their formation time from earliest to latest, into the story content in text form.
The above principle of generating story content from input data is briefly illustrated as follows:
For example, suppose the starting word is "sea", and the probability calculation finds "ocean" to be the vocabulary word with the highest probability of following the tail of "sea". The word "ocean" is then appended to the tail of "sea", forming the word sequence "sea ocean". Next, "ocean" is extracted from this sequence as the new starting word and the prediction process is repeated: "world" is calculated to be the word with the highest probability of following "ocean" and is appended to its tail, forming the new word sequence "ocean world". When prediction of the story content is complete, the word sequences are spliced, so that "sea", "ocean", and "world" combine into "ocean world" (in the original Chinese example, 海 "sea" plus 洋 "ocean" form 海洋 "ocean", and appending 世界 "world" yields 海洋世界 "ocean world"). A toy version of this walk is sketched below.
In step 2), as shown in fig. 3, the method of synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character specifically comprises the following steps:
step 2.1) analyzing the sentence structure of the input text (the story content) to identify the text language, and segmenting the input text into clauses (i.e., dividing the story content into individual sentences);
step 2.2) performing text regularization on the segmented clauses; the purpose of regularization is to convert punctuation marks and numbers in a sentence into Chinese characters;
step 2.3) converting the clause text into phonemes; because Chinese contains polyphonic characters, how each character should be read must be decided correctly through auxiliary information and algorithms, where the auxiliary information includes the word segmentation and the part of speech of each word; this pronunciation representation is generally referred to as phonemes;
step 2.4) performing prosody prediction on the clauses; prosody is the rhythm with which a sentence is read aloud, and speech without prosody sounds stiff and unnatural, so the prosody of each clause must be predicted;
step 2.5) the linguistic information generation module combines the phonemes and prosody of the clauses into linguistic information;
step 2.6) determining the pronunciation duration of each character in the clause through a preset duration model; when a sentence is read aloud, the pronunciation duration of each character varies with the context, so the duration of each character must be determined to ensure the naturalness of the audio output;
step 2.7) converting the linguistic information into the voice characteristics of the designated character through an acoustic model;
and step 2.8) converting the voice characteristics into sound through a vocoder and outputting the sound. A small sketch of the regularization in step 2.2) follows.
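As a concrete illustration of the regularization in step 2.2), the minimal sketch below spells out digits as Chinese characters; real regularization also handles punctuation, dates, units, and similar patterns, which are omitted here:

```python
# Digit readings are standard Mandarin; every other character passes through.
DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def regularize(clause: str) -> str:
    """Replace each digit with the Chinese character for its reading."""
    return "".join(DIGITS.get(ch, ch) for ch in clause)

print(regularize("小鱼游了3天"))  # -> "小鱼游了三天" (the little fish swam for three days)
```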
The present invention also provides an automatic video content generation system; as shown in fig. 4, the system comprises:
a story content generation module for generating story content from the input data;
an audio synthesis module, connected with the story content generation module, for synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character;
and a facial animation synthesis module, connected with the audio synthesis module, for using the read-aloud audio as the input of the facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video. One possible wiring of these three modules is sketched below.
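A minimal sketch of how the three modules of fig. 4 might be wired together; the class and method names are assumptions, and only the module boundaries and the data flow between them come from the patent:

```python
class VideoContentGenerator:
    """Story generation -> audio synthesis -> facial animation (fig. 4)."""

    def __init__(self, story_module, audio_module, animation_module):
        self.story_module = story_module            # story content generation module
        self.audio_module = audio_module            # audio synthesis module
        self.animation_module = animation_module    # facial animation synthesis module

    def run(self, input_data, character):
        story = self.story_module.generate(input_data)          # text-form story content
        audio = self.audio_module.synthesize(story, character)  # read-aloud audio
        return self.animation_module.animate(audio)             # facial animation video
```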
Specifically, as shown in fig. 5, the story content generation module comprises:
a starting word giving unit, by which the user gives a starting word;
a word conversion unit, connected with the starting word giving unit, for converting the starting word into a word vector that can represent it;
a word prediction unit, connected to the word conversion unit, for calculating, from the word vector associated with the starting word, the probability that each word in the vocabulary can serve as the next word after the starting word;
a word selection unit, connected with the word prediction unit, for automatically selecting from the probability calculation results the word with the highest probability as the new word to be appended to the tail of the starting word;
a new word adding unit, connected with the word selection unit, for appending the new word to the tail of the starting word;
a word sequence forming unit, connected with the starting word giving unit and the new word adding unit, for forming and storing the starting word and the new word appended to its tail as a word sequence;
a starting word obtaining unit, connected to the word sequence forming unit and the word conversion unit, for extracting the last word from the formed word sequence as the given starting word;
and a story content generation unit, connected with the word sequence forming unit, for splicing the word sequences, in order of their formation time from earliest to latest, into the story content in text form.
Specifically, as shown in fig. 6, the audio synthesis module comprises:
a sentence structure analysis unit for analyzing the text sentence structure of the input story content to identify the text language and segmenting the input text into clauses;
a text regularization unit for performing text regularization on the segmented clauses;
a clause text conversion unit, connected with the text regularization unit, for converting the regularized clause text into phonemes;
a prosody prediction unit for performing prosody prediction on the clauses;
a clause linguistic information generation unit, respectively connected with the clause text conversion unit and the prosody prediction unit, for combining the phonemes and prosody of the clause into linguistic information;
a pronunciation duration setting unit for determining the pronunciation duration of each character in the clauses through a preset duration model;
a linguistic information conversion unit, connected with the clause linguistic information generation unit, for converting the linguistic information into the voice characteristics of the designated character through the acoustic model and outputting the voice characteristics;
and a voice characteristic conversion unit, connected with the linguistic information conversion unit, for converting the voice characteristics into sound through the vocoder and outputting the sound.
It should be understood that the above are merely preferred embodiments of the invention and the technical principles applied therein. Those skilled in the art may make various modifications, equivalent substitutions, and changes to the present invention; such variations fall within the scope of the invention as long as they do not depart from its spirit. In addition, certain terms used in the description and claims of the present application are not limiting and are used merely for convenience of description.

Claims (5)

1. An automatic video content generation method, characterized by comprising the following specific steps:
1) generating story content from input data;
2) synthesizing the story content in text form into read-aloud audio with the voice characteristics of a designated character;
3) using the read-aloud audio as the input of a facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video;
in step 1), the specific method of generating story content from the input data comprises the steps of:
1.1) giving a starting word;
1.2) converting the starting word into a word vector that can represent it;
1.3) calculating, from the word vector associated with the starting word, the probability that each word in the vocabulary can serve as the next word after the starting word;
1.4) selecting the word with the highest probability as a new word, appending it to the tail of the starting word, and forming a new word sequence together with the starting word;
1.5) extracting the last word in the word sequence, taking the extracted word as the given starting word, and repeating steps 1.2)-1.4) to form a plurality of word sequences;
1.6) splicing the word sequences, in order of their formation time from earliest to latest, into the story content in text form.
2. The automatic video content generation method according to claim 1, characterized in that, in step 2), the specific method of synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character comprises:
2.1) analyzing the text sentence structure of the story content to identify the text language, and segmenting the input text into clauses;
2.2) performing text regularization on the segmented clauses;
2.3) converting the regularized clause text into phonemes;
2.4) performing prosody prediction on the clauses;
2.5) combining the phonemes and prosody of the clauses into linguistic information;
2.6) determining the pronunciation duration of each character in the clauses through a preset duration model;
2.7) converting the linguistic information into the voice characteristics of the designated character through an acoustic model;
2.8) converting the voice characteristics into sound through a vocoder and outputting the sound.
3. An automatic video content generation system capable of implementing the automatic video content generation method according to any one of claims 1 to 2, the system comprising:
a story content generation module for generating story content from the input data;
an audio synthesis module, connected with the story content generation module, for synthesizing the story content in text form into read-aloud audio with the voice characteristics of the designated character;
and a facial animation synthesis module, connected with the audio synthesis module, for using the read-aloud audio as the input of the facial animation synthesis model, so that the read-aloud audio drives the character's facial animation and generates a facial animation video.
4. The automatic video content generation system according to claim 3, characterized in that the story content generation module specifically comprises:
a starting word giving unit, by which the user gives a starting word;
a word conversion unit, connected with the starting word giving unit, for converting the starting word into a word vector that can represent it;
a word prediction unit, connected to the word conversion unit, for calculating, from the word vector associated with the starting word, the probability that each word in the vocabulary can serve as the next word after the starting word;
a word selection unit, connected with the word prediction unit, for automatically selecting from the probability calculation results the word with the highest probability as the new word to be appended to the tail of the starting word;
a new word adding unit, connected with the word selection unit, for appending the new word to the tail of the starting word;
a word sequence forming unit, connected with the starting word giving unit and the new word adding unit, for forming and storing the starting word and the new word appended to its tail as a word sequence;
a starting word obtaining unit, connected to the word sequence forming unit and the word conversion unit, for extracting the last word from the formed word sequence as the given starting word;
and a story content generation unit, connected with the word sequence forming unit, for splicing the word sequences, in order of their formation time from earliest to latest, into the story content in text form.
5. The automatic video content generation system according to claim 3, characterized in that the audio synthesis module specifically comprises:
a sentence structure analysis unit for analyzing the text sentence structure of the story content to identify the text language and segmenting the input text into clauses;
a text regularization unit for performing text regularization on the segmented clauses;
a clause text conversion unit, connected with the text regularization unit, for converting the regularized clause text into phonemes;
a prosody prediction unit for performing prosody prediction on the clauses;
a clause linguistic information generation unit, respectively connected to the clause text conversion unit and the prosody prediction unit, for combining the phonemes and prosody of the clause into linguistic information;
a pronunciation duration setting unit for determining the pronunciation duration of each character in the clauses through a preset duration model;
a linguistic information conversion unit, connected with the clause linguistic information generation unit, for converting the linguistic information into the voice characteristics of the designated character through an acoustic model and outputting the voice characteristics;
and a voice characteristic conversion unit, connected with the linguistic information conversion unit, for converting the voice characteristics into sound through a vocoder and outputting the sound.
CN202110202986.0A 2021-02-24 2021-02-24 Automatic generation method and system of video content Pending CN112992116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110202986.0A CN112992116A (en) 2021-02-24 2021-02-24 Automatic generation method and system of video content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110202986.0A CN112992116A (en) 2021-02-24 2021-02-24 Automatic generation method and system of video content

Publications (1)

Publication Number Publication Date
CN112992116A (en) 2021-06-18

Family

ID=76349738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110202986.0A Pending CN112992116A (en) 2021-02-24 2021-02-24 Automatic generation method and system of video content

Country Status (1)

Country Link
CN (1) CN112992116A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10586369B1 (en) * 2018-01-31 2020-03-10 Amazon Technologies, Inc. Using dialog and contextual data of a virtual reality environment to create metadata to drive avatar animation
CN110867177A (en) * 2018-08-16 2020-03-06 林其禹 Voice playing system with selectable timbre, playing method thereof and readable recording medium
CN110209803A (en) * 2019-06-18 2019-09-06 腾讯科技(深圳)有限公司 Story generation method, device, computer equipment and storage medium
CN110941960A (en) * 2019-11-12 2020-03-31 广州爱学信息科技有限公司 Keyword-based children picture story generation method, system and equipment
CN111402855A (en) * 2020-03-06 2020-07-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴宇晗 [WU Yuhan]: "基于深度学习的自适应游戏剧情生成系统研究" [Research on an adaptive game plot generation system based on deep learning], 《智能计算机与应用》 [Intelligent Computer and Applications], vol. 9, no. 5, 30 September 2019 (2019-09-30), pages 87-94 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116484048A (en) * 2023-04-21 2023-07-25 深圳市吉屋网络技术有限公司 Video content automatic generation method and system

Similar Documents

Publication Publication Date Title
US11837216B2 (en) Speech recognition using unspoken text and speech synthesis
US11514888B2 (en) Two-level speech prosody transfer
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
KR102116309B1 (en) Synchronization animation output system of virtual characters and text
CN115485766A (en) Speech synthesis prosody using BERT models
CN113538641A (en) Animation generation method and device, storage medium and electronic equipment
CN116863038A (en) Method for generating digital human voice and facial animation by text
CN112992116A (en) Automatic generation method and system of video content
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
KR102319753B1 (en) Method and apparatus for producing video contents based on deep learning
US20230148275A1 (en) Speech synthesis device and speech synthesis method
CN114242032A (en) Speech synthesis method, apparatus, device, storage medium and program product
Verma et al. Animating expressive faces across languages
CN113628609A (en) Automatic audio content generation
CN111916054B (en) Lip-based voice generation method, device and system and storage medium
CN113257225B (en) Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
CN113763924B (en) Acoustic deep learning model training method, and voice generation method and device
CN116052640A (en) Speech synthesis method and device
CN114267330A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN115346512A (en) Multi-emotion voice synthesis method based on digital people
CN113870828A (en) Audio synthesis method and device, electronic equipment and readable storage medium
CN117935807A (en) Method, device, equipment and storage medium for driving mouth shape of digital person
CN117789757A (en) Character target expression animation generation method and system based on intelligent semantic analysis
WO2024069471A1 (en) Method and system for producing synthesized speech digital audio content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination