CN113096633B - Information film generation method and device - Google Patents

Information film generation method and device

Info

Publication number
CN113096633B
Authority
CN
China
Prior art keywords
text
paging
machine model
generating
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911320539.4A
Other languages
Chinese (zh)
Other versions
CN113096633A (en)
Inventor
黄显诏
丁羿慈
陈誉云
杨崇文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aten International Co Ltd
Original Assignee
Aten International Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aten International Co Ltd filed Critical Aten International Co Ltd
Priority to CN201911320539.4A
Publication of CN113096633A
Application granted
Publication of CN113096633B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management

Abstract

The invention discloses a method and a device for generating an information film. A text is first acquired; a plurality of page summaries are generated from the text through a page summary machine model; a plurality of pages are generated according to the page summaries; a plurality of scripts are generated from the text through a script machine model; a plurality of voice-over speeches are generated from the scripts through a text-to-speech machine model; and the pages and the voice-over speeches are synthesized into an information film.

Description

Information film generation method and device
Technical Field
The present invention relates to information processing technology, and more particularly, to a method and apparatus for generating an information film.
Background
Due to the development of information technology, many video media creators have been drawn into video creation. However, video editing not only requires processing images but also recording narration, and if the work is not planned in advance, and the recordings even have to be repeatedly re-made and adjusted to fit the film's time axis, the creator must spend several times longer than the length of the finished film.
Disclosure of Invention
In view of the above, an embodiment of the present invention provides a method and apparatus for generating an information film.
In one embodiment, an information film generation method includes: acquiring a text; generating a plurality of page summaries from the text through a page summary machine model; generating a plurality of pages according to the page summaries; generating a plurality of scripts from the text through a script machine model; generating a plurality of voice-over speeches from the scripts through a text-to-speech machine model; and synthesizing the pages and the voice-over speeches into an information film.
In one embodiment, an information film generation apparatus includes a page summary module, a page generation module, a script generation module, a text-to-speech module, and a film synthesis module. The page summary module loads a page summary machine model to generate a plurality of page summaries according to a text. The page generation module generates a plurality of pages according to the page summaries. The script generation module loads a script machine model to generate a plurality of scripts according to the text. The text-to-speech module loads a text-to-speech machine model to generate a plurality of voice-over speeches according to the scripts. The film synthesis module synthesizes the pages and the voice-over speeches into an information film.
In summary, according to embodiments of the present invention, an information film may be generated from a text, in which summary text of the important content of the text is presented together with voice-over speech describing the related content. In some embodiments, the voice-over speech may be produced with the voice of a person selected by the user. In some embodiments, a summary model corresponding to the text type of the text may be selected as the page summary machine model to generate the page summaries, so that the obtained page summaries are more accurate.
Drawings
Fig. 1 is a schematic hardware architecture of an information film generating apparatus according to an embodiment of the invention.
Fig. 2 is a schematic software architecture of an information film generating apparatus according to an embodiment of the invention.
Fig. 3 is a schematic diagram of a neural model architecture of an information film generating apparatus according to an embodiment of the invention.
Fig. 4 is a flowchart of an information film generating method according to an embodiment of the invention.
FIG. 5 is a block diagram of a page summary machine model according to an embodiment of the invention.
FIG. 6 is a diagram illustrating a page summary according to an embodiment of the invention.
FIG. 7 is a schematic diagram of a text-to-speech machine model according to an embodiment of the invention.
FIG. 8 is a schematic diagram illustrating a text encoder according to an embodiment of the present invention.
Fig. 9 is a schematic diagram of an audio encoder according to an embodiment of the invention.
Fig. 10 is a schematic diagram of an architecture of an audio decoder according to an embodiment of the invention.
Fig. 11 is a schematic diagram of a neural model architecture of an information film generating apparatus according to another embodiment of the invention.
FIG. 12 is a schematic diagram of a text classification machine model according to an embodiment of the invention.
Wherein, the reference numerals:
information film generating apparatus 100
Processing device 120
Processor 121
Central processing unit 1213
Neural network processor 1215
Memory 122
Volatile memory 1224
Nonvolatile memory 1226
Non-transitory computer readable recording medium 123
Peripheral interface 124
Bus 125
Corpus 200
Page summary module 210
Page generation module 220
Script generation module 230
Text-to-speech module 240
Film synthesis module 250
Text 300
Page summary 310
Page 320
Title 321
Summary item 322
Sub summary item 323
Script 330
Voice-over speech 340
Information film 350
Text segment 360
Text type 370
Page summary machine model 410
Summary model 411
Attention mechanism 412
Decoder 413
Pre-training model 414
Text classification machine model 420
Embedding layer 421
Convolutional layer 422
Pooling layer 423
Fully connected layer 424
Loss layer 425
Script machine model 430
Text-to-speech machine model 440
Encoder 441
Attention mechanism 442
Audio decoder 443
First causal convolution layer 4431
Highway convolutional layer 4432
Second causal convolution layer 4433
Logic function layer 4434
Post network 444
Vocoder 445
Text encoder 446
Character embedding layer 4461
Non-causal convolutional layer 4462
Highway convolutional layer 4463
Audio encoder 447
Causal convolution layer 4471
Highway convolutional layer 4472
Sentence vector 510
Text segment vector 520
Steps S401 to S406
Detailed Description
Referring to fig. 1, a hardware architecture of an information film generating apparatus 100 according to an embodiment of the invention is shown. The information film generating apparatus 100 is one or more computer systems (herein, a processing apparatus 120 is taken as an example) with computing capability, such as a personal computer, a notebook computer, a smart phone, a tablet computer, a server cluster, etc. The information film generation apparatus 100 can generate an information film by itself based on the text.
The hardware of the processing device 120 of the information film generating device 100 has a processor 121, a memory 122, a non-transitory computer readable recording medium 123, a peripheral interface 124, and a bus 125 through which the above elements communicate with each other. Bus 125 includes, but is not limited to, a combination of one or more of a system bus, a memory bus, a peripheral bus, and the like. The processor 121 includes, but is not limited to, a Central Processing Unit (CPU) 1213 and a Neural Network Processor (NPU) 1215. Memory 122 includes, but is not limited to, volatile memory 1224 (e.g., Random Access Memory (RAM)) and non-volatile memory 1226 (e.g., Read-Only Memory (ROM)). The non-transitory computer readable recording medium 123 may be, for example, a hard disk, a solid state disk, or the like, for storing a computer program product (hereinafter referred to as "software") including a plurality of instructions, which when executed by the processor 121 of the computer system, cause the computer system to perform the information film generation method. The peripheral interface 124 is used to connect input/output devices such as keyboards, microphones, speakers, displays, network cards, etc.
In some embodiments, processing device 120 includes two or more computer systems, such as a personal computer and a server. The server performs the information film generation process. The personal computer transmits the text to the server via the network and receives the information film returned by the server via the network.
Reference is made to fig. 2, 3 and 4 in combination. Fig. 2 is a schematic software architecture of an information film generating apparatus 100 according to an embodiment of the invention. Fig. 3 is a schematic diagram of a neural model architecture of an information film generating apparatus 100 according to an embodiment of the invention. Fig. 4 is a flowchart of an information film generating method according to an embodiment of the invention. The software of the information film generating apparatus 100 includes the page summary module 210, the page generation module 220, the script generation module 230, the text-to-speech module 240, and the film synthesis module 250.
First, in step S401, a text 300 is obtained. The text 300 may be a plain text file, a document file (e.g., a word-processor file or an e-book file), a web page, etc., obtained via a network or by other communication means (e.g., by reading an external storage medium through the peripheral interface 124).
Next, in step S402, the page summary module 210 loads the page summary machine model 410 to generate a plurality of page summaries 310 according to the text 300. Each page summary 310 provides the content from which a page 320 (e.g., a page in a presentation) is subsequently generated in step S403. The page summary module 210 divides the text 300 into a plurality of text segments 360 (as shown in fig. 5), and the page summary machine model 410 selectively extracts a page summary 310 from each text segment 360. That is, each page summary 310 is extracted from a different text segment 360, and one or more text segments 360 may yield no page summary 310.
Referring to FIG. 5, a diagram of the page summary machine model 410 according to one embodiment of the invention is shown. The page summary machine model 410 includes an encoder 441, an attention mechanism (Attention) 412, a decoder 413, and a pre-training model 414. Here, the pre-training model 414 is used to understand the semantic relevance between words or sentences in the text. The pre-training model 414 may be, for example, an ELMo model, a GPT2 model, or a BERT model. The ELMo model combines multiple layers of representations of a bi-directional language model (biLM), integrating the outputs of the layers into one weighted vector; it can therefore reference the contextual relevance of words within sentences and compute the maximum likelihood for word prediction. The GPT2 model adopts a decoder architecture similar to that of the Transformer model proposed by Google; after learning from a large number of texts with a great number of parameters, it can predict the next piece of text when the preceding text (such as a sentence) is input. The BERT model utilizes an encoder architecture like that of the Transformer model proposed by Google. Training of the BERT model mainly consists of two parts: first, randomly masking a portion of the words in a sentence and predicting the masked words from the contextual information, so that the meaning of the words can be better understood from the whole text; and second, next-sentence prediction, in which two sentences are given and the model judges whether the second sentence follows the first sentence in the original article, so as to understand the relevance between the two sentences.
The encoder 441 therefore performs transfer learning (Transfer Learning) using the weights of the pre-training model 414. A text segment 360 is input to the encoder 441 and encoded into sentence vectors 510, where a sentence vector may represent the semantic association between its sentence and other sentences (i.e., the closer the vectors, the higher the semantic association between the sentences). The sentence vectors 510 form a text segment vector 520 through the attention mechanism 412, which may represent the semantic association between the text segment and other text segments. The text segment vector 520 is then converted into a page summary 310 by the decoder 413.
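As an illustration of the transfer-learning step above, the following sketch encodes the sentences of a text segment into sentence vectors with a pre-trained BERT encoder; the Hugging Face transformers package, the bert-base-chinese checkpoint, and the mean-pooling step are assumptions for illustration and are not prescribed by this disclosure.

```python
# Sketch: encode the sentences of one text segment into sentence vectors
# with a pre-trained BERT encoder (transfer learning). Checkpoint name
# and mean pooling are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def sentence_vectors(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state     # (N, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)        # (N, T, 1)
    # Mean-pool over tokens to obtain one vector per sentence.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = sentence_vectors(["本文提出一種資訊影片產生方法。",
                            "方法包含分頁摘要與文稿產生。"])
print(vectors.shape)   # torch.Size([2, 768])
```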
Referring back to fig. 2, 3 and 4. After step S402, the page generation module 220 generates a plurality of pages 320 according to the page summaries 310 (step S403). Specifically, the page generation module 220 generates a page frame and lists the contents of a page summary 310 on the page frame to form a page 320. Referring to FIG. 6, a diagram of a page summary 310 according to one embodiment of the present invention is shown. Taking as an example an input text 300 that is an academic paper on machine learning neural networks, the page 320 shown in fig. 6 presents contents including a title 321, summary items 322, sub-summary items 323, etc., but the embodiment of the invention is not limited thereto. Such content is extracted from the text 300. For example, the text 300 may contain a title, authors, article content, etc., which may in turn be divided into paragraphs such as an abstract and an introduction. The page summary module 210 further divides the content of one paragraph into more text segments 360 according to the content of the different paragraphs. For each of the cut text segments 360, the page summary module 210 extracts a page summary 310 including contents such as the title 321, summary item 322, and sub-summary item 323 described above. In some embodiments, the contents of the page summary 310 are not limited to the title 321, summary item 322, and sub-summary item 323; that is, it may include more contents, only part of these contents (e.g., only the title 321 and the summary item 322), or only part of the title 321, summary item 322, and sub-summary item 323 together with other contents.
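A minimal sketch of listing a page summary on a page frame is given below; Pillow, the canvas size, and the example summary contents are illustrative assumptions rather than the layout actually used by the page generation module 220.

```python
# Sketch: render one page 320 from a page summary 310 (title, summary
# items, sub-summary items) as a static image that can later be placed
# into frames of the information film. Contents and sizes are illustrative.
from PIL import Image, ImageDraw, ImageFont

page_summary = {
    "title": "Text-to-Speech with Attention",
    "items": [
        ("Model overview", ["Character sequence to Mel spectrum",
                            "Vocoder converts the spectrum to audio"]),
    ],
}

def render_page(summary, size=(1280, 720)):
    image = Image.new("RGB", size, "white")            # empty page frame
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()
    y = 40
    draw.text((60, y), summary["title"], fill="black", font=font)      # title 321
    y += 80
    for item, sub_items in summary["items"]:
        draw.text((80, y), "- " + item, fill="black", font=font)       # summary item 322
        y += 50
        for sub in sub_items:
            draw.text((120, y), "* " + sub, fill="black", font=font)   # sub-summary item 323
            y += 40
    return image

render_page(page_summary).save("page1.png")
```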
In some embodiments, the pages 320 may be presented in a static manner; in other embodiments, the pages 320 may be presented in a dynamic manner (e.g., the contents of the page summary 310, together with other animated content, are presented with dynamic special effects).
In some embodiments, the text segments 360 are split according to symbols in the text 300. The symbols may be, for example, line feed characters, punctuation marks, and the like.
In some embodiments, the text segments 360 are split according to text titles in the text 300. A text title may be, for example, the name of the text or a subtitle (e.g., a chapter or section heading).
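A minimal sketch of both segmentation strategies follows; the regular expressions for the delimiting symbols and for the text titles are illustrative assumptions.

```python
# Sketch: split a text 300 into text segments 360 either by symbols
# (blank lines / sentence-ending punctuation) or by text titles
# (chapter / section headings). The patterns are illustrative only.
import re

def split_by_symbols(text):
    # A blank line or a sentence-ending mark followed by a newline ends a segment.
    parts = re.split(r"\n\s*\n|(?<=[。！？])\n", text)
    return [p.strip() for p in parts if p.strip()]

def split_by_headings(text):
    # A line such as "第1章 ..." or "1.2 ..." starts a new segment.
    heading = re.compile(r"^(第[一二三四五六七八九十\d]+[章節]|\d+(\.\d+)*\s)", re.M)
    starts = [m.start() for m in heading.finditer(text)]
    if not starts or starts[0] != 0:
        starts = [0] + starts
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:]) if text[a:b].strip()]
```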
After step S401, step S404 is further performed to generate a plurality of scripts 330 from the text 300 through the script machine model 430. The scripts 330 correspond to the contents of the above-mentioned page summaries 310 and provide more detailed explanation. Taking fig. 6 as an example, the script will include an introduction to contents such as the title 321, summary item 322, and sub-summary item 323. For example, for the first point in the sub-summary item 323, an explanation of what a Mel spectrum is and of the principle of converting a character sequence into a Mel spectrum may be provided.
Next, step S405 is performed to generate a plurality of voice-over speeches 340 through the text-to-speech machine model 440 according to the scripts 330 generated in step S404. In other words, the text-to-speech machine model 440 converts a script 330 into a speech signal that pronounces the content of the script 330 as a person would.
There is no necessary execution sequence between steps S402 and S403 and steps S404 and S405. Fig. 4 shows two threads, but they are not limited to being performed at the same time. For example, steps S402 and S403 may be completed first and steps S404 and S405 later; conversely, steps S404 and S405 may be completed first and steps S402 and S403 later.
In step S406, the film synthesis module 250 synthesizes the pages 320 and the voice-over speeches 340 into an information film 350. Specifically, the information film 350 has a plurality of frames, and the film synthesis module 250 sequentially arranges the pages 320 in the frames. Each frame has a fixed display time (e.g., 1/24 second). The number of frames corresponding to each page 320 is not necessarily the same, i.e., the time interval in which each page 320 is displayed in the information film 350 is not necessarily the same, being determined by the length of the corresponding voice-over speech 340. The film synthesis module 250 adds the voice-over speeches 340 to the information film 350 such that each voice-over speech 340 corresponds to the time interval of its page 320.
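A minimal sketch of this synthesis step is given below; moviepy and the file names are illustrative assumptions, while the 24 frames per second follow the fixed display time mentioned above.

```python
# Sketch: synthesize pages and voice-over speeches into an information film.
# Each page is shown for the duration of its voice-over speech.
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

def synthesize_film(page_images, voice_overs, out_path="info_film.mp4"):
    clips = []
    for image_path, audio_path in zip(page_images, voice_overs):
        audio = AudioFileClip(audio_path)
        # The page occupies as many frames as its voice-over speech lasts.
        clip = ImageClip(image_path).set_duration(audio.duration).set_audio(audio)
        clips.append(clip)
    concatenate_videoclips(clips).write_videofile(out_path, fps=24)

synthesize_film(["page1.png", "page2.png"], ["vo1.wav", "vo2.wav"])
```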
In some embodiments, the page summaries 310 and the scripts 330 are generated based on each of the text segments 360. That is, the page summary module 210 provides the cut text segments 360 to the script generation module 230. The same text segment 360 will yield a corresponding page summary 310 (and corresponding page 320) and a corresponding script 330 (and corresponding voice-over speech 340). The information film 350 is formed in such a manner that the page 320 and the voice-over speech 340 corresponding to the same text segment 360 are aligned with each other in time.
The aforementioned page summary machine model 410 is pre-trained and may use a large amount of text 300 as training data. Important content in such training data is annotated manually, so that the page summary machine model 410 can converge to the corresponding weight parameters, which are stored in a weight database (not shown). Accordingly, when the page summary machine model 410 is used later, the page summary module 210 may load the corresponding weight parameters stored in the weight database into the page summary machine model 410.
Similarly, the script machine model 430 is also pre-trained and may use a large number of texts 300 as training data. Important content in the training data may be annotated manually or by a trained electronic device, so that the script machine model 430 can converge to the corresponding weight parameters, which are stored in the weight database (not shown). Accordingly, when the script machine model 430 is used later, the script generation module 230 may load the corresponding weight parameters stored in the weight database into the script machine model 430.
The information film generating apparatus 100 further stores a corpus 200 for training the text-to-speech machine model 440. The corpus 200 provides the speech data of one or more persons, i.e., audio files of speech uttered by those persons. For example, a user may record his or her own voice into the corpus using a microphone. In some embodiments, the corpus 200 also stores the text corresponding to the content of each recording. During training, a plurality of recordings belonging to one person and the corresponding texts are input into the text-to-speech machine model 440 to obtain the model weights corresponding to that person. In the same way, model weights corresponding to different persons can be formed separately. These model weights are stored in the weight database for the text-to-speech module 240 to load. Here, the text-to-speech machine model 440 is a sequence-to-sequence (Sequence to Sequence) model. In some embodiments, the recordings to be input may be pre-processed, for example by filtering, volume adjustment, time-domain-to-frequency-domain conversion, dynamic range compression, denoising, unifying the audio format, and the like.
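A minimal sketch of such pre-processing for one recording is given below; librosa and the parameter values (22050 Hz sampling rate, 80 Mel filter banks) are illustrative assumptions.

```python
# Sketch: pre-process one corpus recording before it enters the
# text-to-speech machine model -- unify the audio format, normalize
# volume, and convert from the time domain to a Mel spectrogram.
import librosa
import numpy as np

def preprocess_corpus(wav_path, sr=22050, n_mels=80):
    y, _ = librosa.load(wav_path, sr=sr)          # resample to a common format
    y, _ = librosa.effects.trim(y, top_db=30)     # drop leading/trailing silence
    y = y / (np.max(np.abs(y)) + 1e-9)            # volume normalization
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)               # F x T Mel spectrogram
```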
In some embodiments, the information film generating apparatus 100 may let the user choose which person's voice is used to generate the voice-over speeches 340 (e.g., by providing a menu of persons). According to the user's selection, the text-to-speech module 240 selects the corresponding model weights from the weight database for the selected person. The text-to-speech module 240 then applies the selected model weights to the text-to-speech machine model 440 to obtain, from the output of the text-to-speech machine model 440, the voice-over speeches 340 having the voice of the selected person.
Referring to fig. 7, a schematic diagram of a text-to-speech machine model 440 according to an embodiment of the invention is shown. The text-to-speech machine model 440 includes an encoder 441, an Attention mechanism (Attention) 442, an audio decoder 443, a post network (PostNet) 444, and a Vocoder (Vocoder) 445.
The encoder 441 includes a text encoder (TextEncoder) 446 and an audio encoder (AudioEncoder) 447. Referring to fig. 8 and 9, fig. 8 is a schematic diagram of the text encoder 446 according to an embodiment of the invention, and fig. 9 is a schematic diagram of the audio encoder 447 according to an embodiment of the invention. In one embodiment, the text encoder 446 includes a character embedding (Character Embedding) layer 4461, a non-causal convolution (Non-causal Convolution) layer 4462, and four highway convolution (Highway Convolution) layers 4463. In one embodiment, the audio encoder 447 includes three causal convolution (Causal Convolution) layers 4471 and four highway convolution layers 4472. However, the text encoder 446 and the audio encoder 447 are not limited to the above configurations.
Referring to fig. 10, a schematic diagram of the audio decoder 443 according to an embodiment of the present invention is shown. In one embodiment, the audio decoder 443 includes a first causal convolution layer 4431, four highway convolution layers 4432, two second causal convolution layers 4433, and a logic function (Sigmoid) layer 4434. The audio decoder 443 is not limited to the above configuration.
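A minimal sketch of one highway convolution layer of the kind stacked in the text encoder 446, the audio encoder 447, and the audio decoder 443 is given below; PyTorch, the channel sizes, and the causal-padding handling are illustrative assumptions.

```python
# Sketch: a highway convolution layer. The convolution produces twice the
# input channels; one half gates the other, and the gate also mixes the
# input back in (highway connection).
import torch
import torch.nn as nn

class HighwayConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1, causal=False):
        super().__init__()
        self.causal = causal
        pad = (kernel_size - 1) * dilation
        self.pad = pad if causal else pad // 2
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              dilation=dilation, padding=self.pad)

    def forward(self, x):                       # x: (batch, channels, time)
        h = self.conv(x)
        if self.causal and self.pad:
            h = h[:, :, :-self.pad]             # keep the convolution causal
        h1, h2 = h.chunk(2, dim=1)
        gate = torch.sigmoid(h1)
        return gate * h2 + (1.0 - gate) * x     # highway gating

layer = HighwayConv1d(256, causal=True)
print(layer(torch.randn(1, 256, 120)).shape)    # torch.Size([1, 256, 120])
```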
In one embodiment, given a query and a key-value table, the attention mechanism 442 maps the query to the correct input, and the output takes the form of a weighted sum whose weights are determined by the query, the key, and the value. Referring to equation (1), the output of the text encoder 446 is the key and the value, where L is the input text, K is the key, and V is the value. Referring to equation (2), the output of the audio encoder 447 is the query Q, where M_{1:F,1:T} is the Mel spectrum of the input training-corpus audio, i.e., two-dimensional data of size F×T; F is the number of Mel filter banks and T is the number of audio time frames. The matching degree between the characters and the speech is Q K^T / √d, and the attention weight A is obtained after normalization by the SoftMax function, as shown in equation (3), where d is the dimension and K^T is the transpose of K. The product of the value and the attention weight (as shown in equation (4)) is input to the audio decoder 443 to obtain the speech feature vector, as shown in equation (5), where Y_{1:F,2:T+1} is the speech feature vector, F is the number of Mel filter banks, T is the number of audio time frames, and R' is the output of the attention mechanism 442.
(K, V) = TextEncoder(L)    (1)
Q = AudioEncoder(M_{1:F,1:T})    (2)
A = SoftMax(Q K^T / √d)    (3)
R = V × A    (4)
Y_{1:F,2:T+1} = AudioDecoder(R')    (5)
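A minimal sketch of equations (1) to (5) in plain NumPy follows; the array shapes and the ordering of the final matrix product are assumptions for illustration.

```python
# Sketch: single-group scaled dot-product attention between the encoded
# text (keys/values) and the encoded audio (queries).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(K, V, Q):
    # K, V: (N, d) from the text encoder; Q: (T, d) from the audio encoder.
    d = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))   # (T, N) attention weights, eq. (3)
    R = A @ V                           # (T, d) weighted sum of values, eq. (4)
    return R, A

K = V = np.random.randn(50, 256)        # 50 encoded characters
Q = np.random.randn(120, 256)           # 120 Mel-spectrum time frames
R, A = attention(K, V, Q)
print(R.shape, A.shape)                 # (120, 256) (120, 50)
```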
The attention mechanism 442 is not limited to the above embodiment. In another embodiment, given a query and a key-value table, the attention mechanism 442 maps the query to the correct input, and the output again takes the form of a weighted sum whose weights are determined by the queries, the keys, and the values. Referring to equation (6), the output of the text encoder 446 is a plurality of keys and values, where L is the input text, K = [K_1, ..., K_n] are the n keys, and V = [V_1, ..., V_n] are the corresponding n values. Referring to equation (7), the output of the audio encoder 447 is n queries Q = [Q_1, ..., Q_n], where M_{1:F,1:T} is the Mel spectrum of the input training-corpus audio, i.e., two-dimensional data of size F×T; F is the number of Mel filter banks and T is the number of audio time frames. For the i-th pairing of key, value, and query, the matching degree between the characters and the speech is Q_i K_i^T / √d, and the attention weight A_i of the i-th group is obtained after normalization by the SoftMax function, as shown in equation (8), where d is the dimension and K_i^T is the transpose of K_i. The values of each group are multiplied by their attention weights and concatenated (as shown in equation (9)), and the result is input to the audio decoder 443 to obtain the speech feature vector (as shown in equation (10)), where Y_{1:F,2:T+1} is the speech feature vector, F is the number of Mel filter banks, T is the number of audio time frames, and R is the output of the attention mechanism 442.
(K, V) = TextEncoder(L)    (6)
where K = [K_1, ..., K_n] are n keys and V = [V_1, ..., V_n] are the corresponding n values; n may be, for example, 10 or 20, but is not limited thereto.
Q = AudioEncoder(M_{1:F,1:T})    (7)
where Q = [Q_1, ..., Q_n] are n queries; n may be, for example, 10 or 20, but is not limited thereto.
A_i = SoftMax(Q_i K_i^T / √d)    (8)
where A_i is calculated from the i-th of the n keys in equation (6) and the i-th of the n queries in equation (7); there are n attention weights A_i in total, as many as the keys, values, and queries.
R = Concatenate(V_i * A_i)    (9)
where A_i is the i-th of the n attention weights in equation (8) and V_i is the i-th of the n values in equation (6); each pair of A_i and V_i is matrix-multiplied, and the products are concatenated to obtain the final R.
Y_{1:F,2:T+1} = AudioDecoder(R)    (10)
The post network 444 optimizes the speech feature vector; in other words, the post network 444 refines the speech feature vector output by the audio decoder 443, so as to reduce noise and popping in the output audio and improve its quality.
Vocoder 445 converts the speech feature vectors into speech output. The vocoder 445 may be implemented using open source software "World" or "bar," but the embodiments of the present invention are not limited thereto.
In some embodiments, the text may be pre-processed before being input into the text-to-speech machine model 440, for example: converting Chinese characters into coded strings corresponding to their phonetic symbols, performing word segmentation on a passage of text (e.g., with the jieba software or the CKIP Chinese word segmentation system of Academia Sinica), and, for polyphonic characters, finding the correct tone through a look-up table or adjusting it according to the third-tone sandhi rule.
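A minimal sketch of such pre-processing is given below; jieba is used for word segmentation as mentioned above, while pypinyin is an assumed stand-in for the phonetic look-up table, so polyphonic characters rely on its built-in heuristics rather than the tone-sandhi rule described here.

```python
# Sketch: segment a Chinese sentence into words and convert each word
# into a phonetic (pinyin-with-tone-number) string before it enters the
# text-to-speech machine model.
import jieba
from pypinyin import lazy_pinyin, Style

def to_phonetic(sentence):
    words = jieba.lcut(sentence)                            # word segmentation
    phones = [lazy_pinyin(w, style=Style.TONE3) for w in words]
    return [" ".join(p) for p in phones]

print(to_phonetic("資訊影片產生方法"))
```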
Referring to fig. 11, a schematic diagram of a neural model architecture of the information film generating apparatus 100 according to another embodiment of the present invention is shown. The difference from the embodiment shown in fig. 3 is that the information film generating apparatus 100 includes a plurality of summary models 411 and also includes a text classification machine model 420. The page summary module 210 loads the text classification machine model 420 to identify the type of the text 300. That is, the text 300 is classified by the text classification machine model 420 into one of a plurality of text types 370. The text type 370 may be, for example, a literature type, an engineering type, a story type, etc., but the embodiments of the present invention are not limited thereto. Since texts 300 of different text types 370 typically have different typesetting and article structures, the features of such typesetting or article structures can be used to distinguish the different text types 370.
Referring to FIG. 12, a schematic diagram of the text classification machine model 420 according to an embodiment of the invention is shown. The text classification machine model 420 includes an embedding layer 421, a convolution layer 422, a pooling layer 423, a fully connected layer 424, and a loss layer 425. The embedding layer 421 receives the text 300 as input and represents the text 300 as vectors. The convolution layer 422 convolves the vectors to extract features. The pooling layer 423 performs dimensionality reduction on the extracted features to reduce the number of feature parameters. The fully connected layer 424 classifies the text 300 according to the input features, and the loss layer 425 finally calculates the probability of each text type 370, so as to determine the text type 370 corresponding to the text 300. Here, the loss layer 425 may be, for example, a softmax function.
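A minimal sketch of a model with this embedding / convolution / pooling / fully-connected / loss structure (a TextCNN) follows; PyTorch, the vocabulary size, the embedding size, and the kernel widths are illustrative assumptions.

```python
# Sketch: a text classification machine model that outputs the
# probability of each text type.
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size=8000, embed_dim=128,
                 num_filters=64, kernel_sizes=(3, 4, 5), num_types=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)              # embedding layer
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)   # convolution layers
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_types)   # fully connected layer

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        # Max-pool each feature map down to one value (pooling layer).
        feats = [conv(x).relu().max(dim=2).values for conv in self.convs]
        logits = self.fc(torch.cat(feats, dim=1))
        return logits.softmax(dim=1)                     # probability per text type (loss layer)

model = TextClassifier()
probs = model(torch.randint(0, 8000, (2, 100)))          # two texts, 100 tokens each
print(probs.shape)                                       # torch.Size([2, 3])
```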
Here, the text classification machine model 420 is likewise pre-trained, using a large number of texts 300 respectively belonging to different text types 370 as training data. Such training data is annotated with the corresponding text type 370 through manual classification, so that the text classification machine model 420 can converge to the corresponding weight parameters, which are stored in the weight database (not shown). Accordingly, when the text classification machine model 420 is used later, the page summary module 210 may load the corresponding weight parameters stored in the weight database into the text classification machine model 420.
After the text type 370 of the text 300 is identified, the one of the plurality of summary models 411 corresponding to the classified text type 370 may be selected as the aforementioned page summary machine model 410, and the page summaries 310 are then generated for the text 300 by the selected page summary machine model 410 as described above. Since the type of the text 300 is recognized in advance, the important content of the text 300 can be extracted more accurately to generate contents such as the title 321, summary item 322, and sub-summary item 323 described above.
Each summary model 411 is trained with texts 300 of a certain text type 370, so that the weight parameters respectively corresponding to the different text types 370 are obtained and stored in the weight database.
In summary, according to embodiments of the present invention, the information film 350 may be generated from the text 300, in which summary text of the important content of the text 300 is presented together with voice-over speech describing the related content. In some embodiments, the voice-over speech 340 may be produced with the voice of a person selected by the user. In some embodiments, the summary model 411 corresponding to the text type 370 of the text 300 may be selected as the page summary machine model 410 to generate the page summaries 310, so that the obtained page summaries 310 are more accurate.
Although the present invention has been described with reference to the above embodiments, it should be understood that the invention is not limited thereto, but rather is capable of modification and variation without departing from the spirit and scope of the present invention.

Claims (16)

1. An information film generation method, characterized by comprising:
acquiring a text;
generating a plurality of page summaries from the text through a page summary machine model;
generating a plurality of pages according to the page summaries;
generating a plurality of scripts from the text through a script machine model;
generating a plurality of voice-over speeches from the scripts through a text-to-speech machine model; and
synthesizing the pages and the voice-over speeches into an information film.
2. The information film generation method as claimed in claim 1, wherein the step of generating the plurality of page summaries from the text through the page summary machine model comprises: classifying the text into one of a plurality of text types through a text classification machine model.
3. The information film generation method as claimed in claim 2, wherein the page summary machine model is one of a plurality of summary models, and the step of generating the plurality of page summaries from the text through the page summary machine model further comprises:
selecting one of the summary models as the page summary machine model according to the classified text type; and
generating the page summaries for the text according to the selected page summary machine model.
4. The information film generation method as claimed in claim 3, further comprising:
training the summary models according to training data respectively belonging to the text types.
5. The information film generation method as claimed in claim 1, further comprising:
training the text-to-speech machine model according to a plurality of corpora of a plurality of persons to obtain a plurality of model weights respectively corresponding to the persons;
selecting one of the persons;
selecting the corresponding model weight according to the selected person; and
applying the selected model weight to the text-to-speech machine model to obtain the voice-over speeches having the voice of the selected person.
6. The information film generation method as claimed in claim 1, further comprising, prior to the step of generating the page summaries and the scripts: segmenting the text into a plurality of text segments;
wherein the page summaries and the scripts are respectively generated according to the text segments, and the information film is formed in such a manner that the page and the voice-over speech corresponding to the same text segment are aligned with each other in time.
7. The information film generation method as claimed in claim 6, wherein the text is segmented into the text segments according to symbols in the text.
8. The information film generation method as claimed in claim 6, wherein the text is segmented into the text segments according to text titles in the text.
9. An information film generation apparatus, comprising:
the paging summary module is loaded with a paging summary machine model to generate a plurality of paging summaries according to the text;
the paging generation module generates a plurality of paging according to the paging abstracts;
a manuscript generating module carrying a manuscript machine model for generating a plurality of manuscripts according to the text;
the text-to-speech module is loaded with a text-to-speech machine model to generate a plurality of side voices according to the text manuscripts; and
And the film synthesizing module synthesizes the paging and the side voice into an information film.
10. The information film generation apparatus as claimed in claim 9, wherein the page summary module further loads a text classification machine model to classify the text into one of a plurality of text types.
11. The information film generation apparatus as claimed in claim 10, wherein the page summary machine model is one of a plurality of summary models, and the page summary module selects one of the summary models as the page summary machine model according to the classified text type, so as to generate the page summaries for the text according to the selected page summary machine model.
12. The information film generation apparatus as claimed in claim 11, wherein the summary models are trained according to training data respectively belonging to the text types.
13. The information film generation apparatus as claimed in claim 9, wherein the text-to-speech machine model is trained with a plurality of corpora of a plurality of persons to obtain a plurality of model weights respectively corresponding to the persons, the information film generation apparatus further comprises a selection module for selecting one of the persons, and the text-to-speech module selects the corresponding model weight according to the selected person and applies the selected model weight to the text-to-speech machine model to obtain the voice-over speeches having the voice of the selected person.
14. The information film generation apparatus as claimed in claim 9, wherein the page summary module divides the text into a plurality of text segments, the page summaries and the scripts are respectively generated according to the text segments, and the information film is formed in such a manner that the page and the voice-over speech corresponding to the same text segment are aligned with each other in time.
15. The information film generation apparatus as claimed in claim 14, wherein the text is segmented into the text segments according to symbols in the text.
16. The information film generation apparatus as claimed in claim 15, wherein the text is segmented into the text segments according to text titles in the text.
CN201911320539.4A 2019-12-19 2019-12-19 Information film generation method and device Active CN113096633B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911320539.4A CN113096633B (en) 2019-12-19 2019-12-19 Information film generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911320539.4A CN113096633B (en) 2019-12-19 2019-12-19 Information film generation method and device

Publications (2)

Publication Number Publication Date
CN113096633A CN113096633A (en) 2021-07-09
CN113096633B true CN113096633B (en) 2024-02-13

Family

ID=76662712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911320539.4A Active CN113096633B (en) 2019-12-19 2019-12-19 Information film generation method and device

Country Status (1)

Country Link
CN (1) CN113096633B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200608366A (en) * 2004-08-27 2006-03-01 Inventec Multimedia & Telecom Method for generating annotation on digital film and digital film transmitting/playing system using the same
CN101252646A (en) * 2008-01-24 2008-08-27 王志远 Method for realizing video frequency propaganda film modularization making
CN106662920A (en) * 2014-10-22 2017-05-10 华为技术有限公司 Interactive video generation
CN109215655A (en) * 2018-10-30 2019-01-15 维沃移动通信有限公司 The method and mobile terminal of text are added in video
CN110390927A (en) * 2019-06-28 2019-10-29 北京奇艺世纪科技有限公司 Audio-frequency processing method, device, electronic equipment and computer readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060069567A1 (en) * 2001-12-10 2006-03-30 Tischer Steven N Methods, systems, and products for translating text to speech
US8996376B2 (en) * 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US8874434B2 (en) * 2010-06-02 2014-10-28 Nec Laboratories America, Inc. Method and apparatus for full natural language parsing
US8947596B2 (en) * 2013-06-27 2015-02-03 Intel Corporation Alignment of closed captions

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200608366A (en) * 2004-08-27 2006-03-01 Inventec Multimedia & Telecom Method for generating annotation on digital film and digital film transmitting/playing system using the same
CN101252646A (en) * 2008-01-24 2008-08-27 王志远 Method for realizing video frequency propaganda film modularization making
CN106662920A (en) * 2014-10-22 2017-05-10 华为技术有限公司 Interactive video generation
CN109215655A (en) * 2018-10-30 2019-01-15 维沃移动通信有限公司 The method and mobile terminal of text are added in video
CN110390927A (en) * 2019-06-28 2019-10-29 北京奇艺世纪科技有限公司 Audio-frequency processing method, device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN113096633A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
KR102582291B1 (en) Emotion information-based voice synthesis method and device
US11205444B2 (en) Utilizing bi-directional recurrent encoders with multi-hop attention for speech emotion recognition
US20180301145A1 (en) System and Method for Using Prosody for Voice-Enabled Search
WO2022156544A1 (en) Speech synthesis method and apparatus, and readable medium and electronic device
CN110097870B (en) Voice processing method, device, equipment and storage medium
CN106486121A (en) It is applied to the voice-optimizing method and device of intelligent robot
US20190206386A1 (en) Method and system for text-to-speech synthesis
CN110136688B (en) Text-to-speech method based on speech synthesis and related equipment
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
CN114298157A (en) Short text sentiment classification method, medium and system based on public sentiment big data analysis
KR20200145776A (en) Method, apparatus and program of voice correcting synthesis
US11645474B2 (en) Computer-implemented method for text conversion, computer device, and non-transitory computer readable storage medium
TWI713363B (en) Device and method for producing an information video
CN113096633B (en) Information film generation method and device
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN115116428B (en) Prosodic boundary labeling method, device, equipment, medium and program product
WO2022141678A1 (en) Speech synthesis method and apparatus, device, and storage medium
CN114005430A (en) Training method and device of speech synthesis model, electronic equipment and storage medium
CN112580325A (en) Rapid text matching method and device
TWI732390B (en) Device and method for producing a voice sticker
CN112580365A (en) Chapter analysis method, electronic device and storage device
CN112733546A (en) Expression symbol generation method and device, electronic equipment and storage medium
JP2003099089A (en) Speech recognition/synthesis device and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant