CN116129868A - Method and system for generating a structured script - Google Patents

Method and system for generating a structured script

Info

Publication number
CN116129868A
Authority: CN (China)
Prior art keywords: text, model, dialogue, emotion, character
Legal status: Pending
Application number: CN202211710214.9A
Other languages: Chinese (zh)
Inventors: 韩太军, 吴杨, 马宇峰, 徐斌, 顾炎, 刘东晓, 杨佳乐, 张松坡, 崔瑞博, 陈炜于
Current Assignee: Shanghai Yuewen Information Technology Co Ltd
Original Assignee: Shanghai Yuewen Information Technology Co Ltd
Application filed by Shanghai Yuewen Information Technology Co Ltd
Priority to CN202211710214.9A
Publication of CN116129868A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G06F 40/35: Discourse or dialogue representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a method for generating a structured script: a text segment of a novel is input and paragraph-level structural analysis is performed on it to generate the structured script. The paragraph-level structural analysis performs character dialogue recognition, text emotion recognition and special-effect scene mining on the input text segment, converting it into a text dialogue script with multiple characters, multiple emotions and multiple scenes. Through natural language processing, the invention fully automates the creation of the novel-text script, from character mining, timbre matching and chapter structuring through to special-effect scene recognition; compared with the current manual script creation process it is far more efficient, improving time efficiency by 30 times.

Description

Method and system for generating a structured script
Technical Field
The invention belongs to the technical field of speech synthesis and relates to a method and a system for generating a structured script.
Background
The audiobook market has grown by more than 35% year over year for the past three years and is still in a growth phase. Online reading platforms are all positioning themselves on the audiobook track, attracting and cultivating user habits through higher-quality and more varied audio content so as to expand the market. Competition among audiobook products is currently fierce, with all kinds of listening apps flourishing. First, an audiobook system shares knowledge through the reach of sound, letting users make full use of fragmented time. Second, it turns dry text into vivid, expressive narration through pauses and intonation, so that listening to a book feels like listening to music. Although current audiobook technology is mature, how to immerse the user in the plot of a novel and give the listener a stronger sense of scene and identification has become a difficult problem for improving the user experience.
There is no similar technology or implementation in the industry. The conventional approach relies on manually combing through the novel for the characters' emotions, actions and so on, and dubbing with real voice actors in combination with the different scenes. For example: on an evening with a cold wind howling, she is crying alone in her room. The dubbing here would be a girl's voice with a sobbing tone over the background sound of a howling cold wind.
For example, in the existing speech synthesis field, the speech synthesis method of Chinese patent CN109523986A proceeds as follows:
1. acquire text information and determine the characters in the text and the text content of each character;
2. perform character recognition on the text content of each character and determine the attribute information of each character;
3. obtain a speaker corresponding to each character according to the character attribute information, followed by manual confirmation;
4. generate multi-character synthesized speech according to the text information and the speakers corresponding to its characters.
The main problems of the existing method are as follows:
(1) Timbre screening: after the characters are mined, suitable timbres are selected manually from the candidates, which takes a long time;
(2) No text preprocessing: the text is not preprocessed, so wrongly written characters and the like are not corrected;
(3) Monotonous audio synthesis: the synthesized text lacks emotion, special effects, scenes and other characteristics;
(4) Long production cycle and high cost: because manual dubbing is required, the cycle is long, the amount of dubbing produced is small and the cost is high;
(5) Complexity and inefficiency: because the dubbing combines the characters' actions and emotions with the current scene, the descriptions of the current scene in the novel and the vocabulary describing the characters' actions or emotions must first be identified manually, and background sounds matching the scene description must then be added specifically. When a character performs an action or has an emotional swing, the human reading also has to change accordingly. The whole process therefore becomes extremely complex.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a method for generating a structured script, which performs character dialogue recognition, text emotion recognition and special-effect scene mining on an input text segment and converts it into a text dialogue script with multiple characters, multiple emotions and multiple scenes, thereby generating the structured script. Chapters are reorganized into dialogue through character-dialogue dependency matching and coreference-resolution text processing; a text emotion classification model built with BERT realizes fine-grained emotion classification and assigns emotion attributes to the text; the positions in the novel segment where special effects occur are extracted through keywords and scene semantics; and the structured script is finally created. The method comprises the following steps:
Step A: perform chapter decomposition on a pre-obtained formatted text containing characters and plot text, and reorganize it into dialogue;
Step B: perform dialogue character recognition, text emotion recognition and special-effect scene recognition on the dialogue-form text content obtained in Step A;
Step C: generate the final structured script.
The coreference-resolution text processing of the invention splits a chapter text into several sentences by semantic rules at the chapter level of the novel, inputs each dialogue sentence together with its context into a semantics-based character-dialogue relation matching model, and matches the correct dialogue speaker for each dialogue sentence. The model can recognize the several names by which one character may appear (personal pronouns, nicknames and aliases, etc.) and resolve them to a single canonical character name, which simplifies character timbre matching.
The semantics-based character-dialogue relation matching model of the invention is a multiple-choice model implemented on BERT. It takes as input the text of the current sentence, the context segment in which the sentence occurs, and the list of characters that may appear in that context. Each character in the list is concatenated with the context segment, and the several concatenated pieces form one input item fed to the BERT model. A classification task over the concatenated segments then achieves the matching of character-dialogue dependencies.
The text emotion classification model of the invention is a BERT-based text emotion classification model: after the novel chapter text is split, the dialogue center sentence and its context are input and classified into several emotion types. Most current text emotion recognition models only score positive versus negative sentiment; in order to represent a wider range of emotions, the invention uses 7 emotion types (calm, angry, happy, sad, surprised, afraid and narration) to label the emotion of each character dialogue fragment in the novel, so that diversified emotional color can be added to the audio.
In Step A of the invention, chapter decomposition means decomposing the formatted text into character dialogue sentences and narration sentences according to a dialogue decomposition strategy;
the character dialogue sentences are the sentences of dialogue between characters in the text, and the rest of the text is treated as narration;
the dialogue decomposition strategy is a strategy that splits the text by regular-expression matching tuned to the author's writing habits and the text itself.
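To make the decomposition step concrete, the following is a minimal sketch assuming a quote-based convention for dialogue; the patent's actual rules are tuned per author and are not disclosed, so the pattern below is an assumption.

```python
# Illustrative sketch only: a regular-expression dialogue decomposition strategy
# that splits a chapter into character dialogue sentences (quoted speech) and
# narration sentences. The quote convention is an assumption.
import re

DIALOGUE_RE = re.compile(r"“[^”]+”")   # spans inside Chinese quotation marks

def decompose_chapter(chapter: str):
    """Return (kind, text) pairs in reading order, kind being 'dialogue' or 'narration'."""
    pieces, cursor = [], 0
    for m in DIALOGUE_RE.finditer(chapter):
        narration = chapter[cursor:m.start()].strip()
        if narration:
            pieces.append(("narration", narration))
        pieces.append(("dialogue", m.group(0)))
        cursor = m.end()
    tail = chapter[cursor:].strip()
    if tail:
        pieces.append(("narration", tail))
    return pieces
```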
In Step B of the invention, dialogue character recognition means mapping the dialogue texts in the chapter body to their corresponding characters through a dialogue character recognition module; the model in the dialogue character recognition module uses a language-model structure to build a context-based character decision structure.
The language-model structure of the invention is a deep-learning language model built on BERT that takes the dialogue text as input and identifies the speaking character; the character decision structure combines the decomposition strategy with the language model, structurally decomposes the text, identifies the dialogue-character relations and finally converts the text into a structured script.
After the character dialogue recognition module architecture is built, samples X = {X1, X2, ..., Xn} are input, where Xn denotes the n-th piece of data and Xn = (Cn, Qn, [Choice1, Choice2, ..., Choicem], label). Here m is the total number of candidate character names; Qn is the dialogue text sentence, called the center sentence; Cn is a context segment of dynamic range around the center sentence, whose length varies with the maximum character length L the model can process: the dynamic window is set so that len(Cn) < L, and the window is shrunk if the text is too long. [Choice1, Choice2, ..., Choicem] is the list of candidate characters appearing in the segment, and label is the index of the true character for the sample.
The training module in the model is then built. It is based on a language model, which serves as the encoder of the whole model and is denoted LM. The model structure of the training module is:
M_role = softmax(concat(Class(LM([Cn, Qn, Choice1])), ..., Class(LM([Cn, Qn, Choicem]))))
where m is the total number of options, Class outputs the score of the current concatenated text through LM, concat splices the scores of the several candidate answers, and softmax yields the m-class output, which is fitted to label.
The steps in the training module are repeated; when the accuracy of the trained model reaches the 90% target, the best model is saved and used to predict on chapter text.
The input text is denoted H = {H1, H2, ..., Hn},
where Hn = (Cn, Qn, [Choice1, Choice2, ..., Choicem]). The output is R = Choice[Max_index(M_role(Batch(H)))], where Max_index is the index of the most probable character among the candidates for the center sentence; finally the highest-scoring item in the candidate list [Choice1, Choice2, ..., Choicem] is output as the result.
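As an illustration of this multiple-choice architecture (not the patent's own implementation), the following sketch scores each (context, center sentence, candidate) concatenation with a BERT multiple-choice head; the model name, tokenizer and all hyperparameters are assumptions.

```python
# Illustrative sketch only: a BERT multiple-choice speaker-attribution model in
# the spirit of M_role = softmax(concat(Class(LM([Cn, Qn, Choice_i])), ...)).
import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed base model
model = BertForMultipleChoice.from_pretrained("bert-base-chinese")

def score_speakers(context: str, center_sentence: str, candidates: list[str]):
    """Concatenate (Cn, Qn) with each candidate and softmax over the m choices."""
    first = [f"{context} {center_sentence}"] * len(candidates)
    enc = tokenizer(first, candidates, truncation=True, padding=True,
                    max_length=512, return_tensors="pt")
    # BertForMultipleChoice expects tensors of shape (batch, num_choices, seq_len).
    enc = {k: v.unsqueeze(0) for k, v in enc.items()}
    probs = torch.softmax(model(**enc).logits, dim=-1)[0]        # (num_choices,)
    return candidates[int(probs.argmax())], probs

# Usage: speaker, probs = score_speakers(Cn, Qn, ["Xu Qian", "Xu Xinnian"])
```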
In Step B of the invention, text emotion recognition means recognizing the emotion of each dialogue text through an emotion algorithm model; the model uses a text classification structure based on a language model.
The text classification structure of the invention is a deep semantic text classification model implemented on BERT that classifies dialogues into several emotion types.
After the text emotion recognition module is built, samples D = {D1, D2, ..., Dn} are input, where Dn denotes the n-th piece of data and Dn = (Cn, label). Cn is the text carrying the emotion, consisting of a body sentence (hereinafter the center sentence) and a dynamic-range context segment around it; label is the emotion label, expressed as an index into the emotion class set E = [E1, E2, ..., Em].
The training module for text emotion recognition is then built; the language model serves as the encoder of the whole model and is denoted LM. The overall model is:
M_emotion = softmax(LM(C1, C2, ..., Cn))
The class scores obtained after the softmax output are fitted to label.
The model training steps are repeated; when the training accuracy reaches 95%, the best model is saved and used to predict the emotion of the dialogue sentences in the text. For input text C = {C1, C2, ..., Cn}, the output is:
R = Choice[Max_index(M_emotion(Batch(C)))]
where Max_index is the index of the most probable class in the emotion class set for the center sentence; finally the highest-scoring item in the emotion class set [E1, E2, ..., Em] is output as the result.
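A minimal sketch of such a sentence-level emotion classifier follows (not the patent's code); the label set mirrors the 7 emotions named above, while the model name and preprocessing are assumptions.

```python
# Illustrative sketch only: a BERT emotion classifier in the spirit of
# M_emotion = softmax(LM(...)), using the 7 emotion classes from the text.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

EMOTIONS = ["calm", "angry", "happy", "sad", "surprised", "afraid", "narration"]

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed base model
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(EMOTIONS))

def classify_emotion(center_sentence: str, context: str) -> str:
    """Encode the center sentence with its context and take the argmax class."""
    enc = tokenizer(center_sentence, context, truncation=True,
                    max_length=512, return_tensors="pt")
    logits = model(**enc).logits                    # shape (1, 7)
    return EMOTIONS[int(torch.softmax(logits, dim=-1).argmax())]
```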
In Step B of the invention, special-effect scene recognition means extracting, through semantic understanding of the text, the positions in a sentence where special-effect sounds occur, so that the corresponding special-effect audio (for example wind, footsteps, etc.) can be added in the sound-effect synthesis step. It is implemented mainly with a BERT sequence-labeling model: the training samples take a text sentence together with its special-effect sounds and their positions of occurrence as input, and the trained model outputs the start position of each special-effect sound in the text and the corresponding special-effect sound type, thereby marking special-effect sound positions at the text level so that special-effect audio can be added during audio fusion.
In the special-effect scene recognition of the invention, a special-effect scene mining module uses a word-granularity text classification model to locate scene words in the text, obtaining the places in the text where a scene effect needs to be added; the model uses a text classification structure based on a language model. In the special-effect scene recognition process, the special-effect sound recognition module architecture is built as follows: input samples D = {D1, D2, ..., Dn}, where Dn denotes the n-th piece of data and Dn = (Cn, (start, end, label)); Cn is a text containing a special-effect sound, start and end are the start and end positions of the special-effect sound, and label is the special-effect sound type label, expressed as an index into the special-effect sound class library Te = [Te1, Te2, ..., Tem];
the training data are input into the special-effect sound model, model training is executed, and when the 90% target is reached the best model is saved:
M_special = NER(LM(Cn))
For an input original text T = {T1, T2, ..., Tn}, the extracted output is (start, end, special_index) = Choice[M_special(Batch(T))], where Choice denotes threshold filtering of the scores output by the model and special_index in the output is the special-effect sound index.
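The sequence-labeling formulation could look like the sketch below (an assumption, not the patent's code): a BERT token-classification model with a BIO tag set over a toy special-effect class library.

```python
# Illustrative sketch only: a token-classification tagger for special-effect
# sounds in the spirit of M_special = NER(LM(Cn)). Tag scheme and classes are
# assumptions; the real class library Te is much larger.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

SFX_TYPES = ["footsteps", "door", "wind", "crying"]            # toy library Te
LABELS = ["O"] + [f"B-{t}" for t in SFX_TYPES] + [f"I-{t}" for t in SFX_TYPES]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(LABELS))

def tag_sfx(sentence: str):
    """Return (start, end, type) character spans of predicted special-effect sounds."""
    enc = tokenizer(sentence, return_offsets_mapping=True, truncation=True,
                    max_length=512, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0].tolist()
    preds = model(**enc).logits.argmax(-1)[0].tolist()
    spans, current = [], None
    for (s, e), p in zip(offsets, preds):
        tag = LABELS[p]
        if tag.startswith("B-"):
            current = [s, e, tag[2:]]
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            current[1] = e
        else:
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans
```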
Based on the above method for generating the structured script, the invention also provides a system for generating the structured script, which comprises a text chapter decomposition module, a dialogue character recognition module, a text emotion recognition module, a special-effect scene recognition module and a structured script output module.
Based on the structured script, the invention also provides an application of the method for generating the structured script in speech synthesis, including a speech synthesis method and a synthesis system. The system can call the TTS engine type required by the plot of the web novel and automatically fuse the current scene with the character's emotion to generate dubbing corresponding to the plot. The invention automatically preprocesses the novel text using text algorithms accumulated in the web-fiction domain, automatically generates a multi-character, multi-emotion, multi-scene structured script, and combines the structured script (manuscript) with TTS speech technology to produce high-quality commercial audio works.
In the invention, the text algorithms accumulated in the web-fiction domain include NLP models such as web-fiction character name mining, text structured script generation, dialogue text emotion recognition and special-effect sound scene recognition.
The web-fiction character name mining model extracts features such as the character names, genders and ages appearing in the novel text, providing character information for the text structuring flow;
the text structured script generation model decomposes a text into several sentences and extracts the dependency between characters and sentences, providing the text structure for audio synthesis;
the dialogue text emotion recognition model performs emotion classification of the text through the language model and extracts the positions of special-effect sound scenes, ultimately improving the listening experience.
TTS speech technology in the invention refers to Text To Speech synthesis, i.e. converting text in a computer into natural and fluent speech output.
In order to produce high-quality commercial audio works intelligently through natural language processing and audio synthesis capabilities, the invention provides a speech synthesis method comprising the following steps:
Step 1: prepare an audio material library, classify the audio in the library into several types, and label the audio;
Step 2: run the labels of the audio materials obtained in Step 1 through a BERT depth model to obtain the word vector corresponding to each audio material; specifically, the label of an audio material is passed through the BERT depth model to obtain a text semantic vector (i.e. the word vector), which is later matched by vector-similarity computation;
Step 3: input a text segment of the novel and perform paragraph-level structural analysis on it to generate the structured script;
Step 4: according to semantic similarity, obtain the audio candidates in the audio material library that are most similar to the different types of words in the structured script;
Step 5: call the applicable TTS engine, fuse the matched candidate audio from the audio material library, and output speech according to the required output structure information.
In Step 1, the audio types include: scene audio, and emotion audio corresponding to different characters and genders;
and/or,
the audio material library is derived from historically accumulated audio and is uniformly denoised;
and/or,
labeling the audio means: tagging the various scenes represented in the audio with corresponding scene labels, and tagging the emotions of different characters and genders represented in the audio with gender and emotion labels.
In Step 2, generating the word vector corresponding to an audio material specifically comprises the following steps:
Step 21: train a BERT deep semantic model whose layers contain the corresponding semantic information; the model comprises several network layers;
Step 22: input the text label of the audio material and extract the word vector from the BERT output layer.
The length of the output word vector is 784.
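A minimal sketch of this label-embedding and similarity-matching step (Steps 2 and 4) is given below; the pooling strategy and model name are assumptions, and note that standard bert-base encoders output 768-dimensional vectors rather than the 784 stated above.

```python
# Illustrative sketch only: embed each audio-material label with BERT and match
# structured-script words to materials by cosine similarity.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed model
encoder = BertModel.from_pretrained("bert-base-chinese")

def embed(text: str) -> torch.Tensor:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state    # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)             # mean-pooled label vector

def best_audio(query_word: str, material_labels: list[str]) -> str:
    """Return the material label most similar to a scene/emotion word."""
    q = embed(query_word)
    sims = [torch.cosine_similarity(q, embed(lbl), dim=0) for lbl in material_labels]
    return material_labels[int(torch.stack(sims).argmax())]

# Usage (toy labels): best_audio("howling cold wind", ["wind", "rain", "footsteps"])
```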
Text content preprocessing is also performed before Step 3, comprising the following steps:
Step 31: input the original text content;
Step 32: proofread the original text content with an error-correction model;
Step 33: analyze the text characters in the proofread text content;
Step 34: generate the formatted text containing characters and plot for use by the structured script.
In Step 31, the original text content includes novels, literary works and historical biographies.
In Step 32, the proofreading includes text error correction, semantic replacement, dialogue-symbol regularization of the content, and non-text filtering. Text error correction first performs a preliminary recall of wrongly written characters with the open-source pycorrector error-correction scheme, then filters them against a high-frequency wrong-character set accumulated from statistics over hundreds of thousands of historical web-fiction texts, so that the recalled wrong characters are those common in web fiction.
The error-correction model merges the open-source error-correction scheme with the error-correction set accumulated in the web-fiction domain.
In the invention, error-prone characters are greatly reduced through text error correction, which improves speech synthesis accuracy; accuracy in error-prone-character scenarios is raised to 90%. Symbol regularization and non-text filtering mainly filter out the non-body content present in novels, using rules accumulated from the non-body text observed in massive corpora.
In particular,
text error correction refers to correcting wrongly written characters in the text; semantic replacement refers, for example, to the sentence "in between, a small dot is moving towards this side", where the mistyped "in between" is corrected to its near-homophone "one sees"; dialogue-symbol regularization and non-text filtering delete literal symbols and content unrelated to the body text, for example author notes beginning with "……ps,", which are filtered out to improve the listening experience after speech synthesis.
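As a concrete illustration of the symbol regularization and non-text filtering step, here is a minimal rule-based sketch; the regex patterns are assumptions, since the patent's own rules are mined from its corpus and not disclosed.

```python
# Illustrative sketch only: dialogue-symbol regularization and non-text
# filtering for one chapter (Step 32). Patterns are assumptions.
import re

# Map alternative dialogue brackets to the standard Chinese quotation marks so
# that downstream dialogue-decomposition rules see one consistent symbol.
SYMBOL_MAP = str.maketrans({"「": "“", "」": "”", "『": "“", "』": "”"})

# Non-body content often seen in web fiction: author postscripts, separators.
NON_BODY_PATTERNS = [
    re.compile(r"^\s*(ps|PS)[:：，,].*$"),     # author notes such as "ps: ..."
    re.compile(r"^[\s\-=~*·]{5,}$"),           # decorative separator lines
]

def preprocess_chapter(raw: str) -> str:
    """Unify dialogue symbols and drop non-body lines before structuring."""
    kept = []
    for line in raw.translate(SYMBOL_MAP).splitlines():
        if any(p.match(line) for p in NON_BODY_PATTERNS):
            continue                            # drop non-body line
        if line.strip():
            kept.append(line.rstrip())
    return "\n".join(kept)
```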
In Step 33, analyzing the text characters requires character mining, character alias alignment, character-relation attribute construction, character attribute recognition and character timbre matching. Character mining performs end-to-end model learning by constructing a named entity recognition task fused with a data-sample enhancement strategy, in order to extract the character names from the novel. Character alias alignment associates the other names under which a character appears with the character's main name; for example, for the character name Gu Nuoer, a nickname also appears in the text, and alias alignment links the two character names. Character-relation attribute construction builds a deep semantic character-relation model from samples combining character pairs with text, splits the novel into paragraphs, and uses the character-relation model (a BERT-based language model) to judge the relation between characters. Character attribute recognition builds a language model to recognize attributes such as the age and gender of the characters in the text. Character timbre matching matches the mined character attributes against a timbre library and selects timbre information suited to the character; the timbre library is collected manually and includes the character's timbre, speaking rate, tone and so on.
In the character mining part, the end-to-end model mainly refers to an NER task implemented with a BERT-based sequence-labeling model, composed mainly of several transformer layers; novel chapter text is input to the model and information such as the character names and genders appearing in the text segments is extracted.
In the character alias alignment part, alias alignment mainly refers to a BERT-based classification model that takes two pieces of character information as input and judges whether they are aliases of the same character.
In the character-relation attribute construction part, this mainly refers to a BERT-based classification model that takes a novel text containing character information as input and outputs the relation class of the characters.
Character attribute recognition is based on a BERT prompt model: the character's dialogue text and character name are input and the character's attribute information is extracted, for example recognizing attributes such as the age and gender of the character in the text.
The model constructed in the invention mainly adopts a sequence-labeling NER (named entity recognition) task: novel chapter text is input, information such as the character names and genders appearing in the text segments is extracted, and a character screening strategy then extracts information such as each character's frequency of occurrence and the chapter segments in which the character appears, providing sufficient screening conditions for character timbre recognition.
The character screening strategy matches the character names mined by the model against the novel text by regular-expression matching, counts each character's frequency and chapters of occurrence, and filters out character names with low frequency or irregular naming.
Data sample enhancement refers to character-name exchange: in the training samples of the named entity recognition task, if an input text paragraph contains several character names (which serve as recognition labels), the character names in the training labels can be exchanged, so that a single training sample can be expanded into several samples, greatly increasing the amount of training data.
The character mining method used in the invention achieves 97% character-extraction accuracy in the web-fiction domain.
For example: given an input novel, the corresponding character names Xu Qian, Wei Yuan, Li Miaozhen, etc. are extracted.
In Step 4 of the invention, the characters, emotions and scenes in the structured script generated in Step 3 are matched by approximate semantic similarity against the word vectors mapped from the audio material labels in the audio material library, obtaining candidate audio materials for each part of the structured script; the most suitable audio materials are then selected by manual review.
In Step 4 of the invention, timbre matching for the structured script comprises the following steps:
Step 41: take the special-effect scene text, character features and novel label features in the structured script, where the novel label features come from the classification attributes of the novel in a self-built novel label library; perform special-effect audio matching, character timbre library matching and background sound matching on the special-effect scene text, character features and novel label features respectively, obtaining the special-effect sound, character voices and background sound;
Step 42: synthesize the character voices after the matching is completed;
Step 43: perform audio format unification on the special-effect sound, the synthesized character voices and the background sound, and output the formatted audio.
Special-effect audio matching uses a self-built special-effect audio library with more than 30 classes and more than 500 clips, with materials collected from the Internet; the text positions where special effects occur in the text segment are extracted through keywords and scene semantics and aligned precisely.
Character timbre library matching means extracting character features through syntactic dependency analysis and mining of readers' interactive comments, and assigning the most suitable timbre in the timbre library to each character through deep semantic similarity matching. The character feature information comes from the character-mining part of the text content preprocessing flow, which extracts features such as the gender and age of the characters in the text through the NER task. After the character relations are built, the character attributes are extracted and a timbre is assigned to each character, for example: Xu Qian (young, spirited) is automatically matched by semantics to the timbre "morning wind" (young, energetic).
Background sound matching means matching a background-music library through the book's classification information in the book library, improving the background atmosphere of the audio and enhancing immersion.
In Step 42 of the invention, character voice synthesis means inputting the character-related text and the character's timbre information through the text-to-speech interface and outputting the audio of that text.
In Step 43 of the invention, audio format unification means using unified frequency processing so that the character voices, special-effect sounds and background sound sound consistent to the listener.
Syntactic dependency analysis means decomposing a sentence into a dependency syntax tree that describes the dependency relations among the words in the sentence, i.e. points out their syntactic collocation relations, which are associated with the semantics.
The character timbre library is rich in content, can fit characters of different genres and personalities, and can be extended with more timbres.
In the timbre matching and synthesis process, the structured text is assembled into speech synthesis markup language (SSML) tags and synthesized with the TTS engine, producing high-quality multi-timbre, multi-emotion audio.
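For illustration, a structured-script sentence might be wrapped into a generic SSML fragment as in the sketch below; the voice name, rate value and audio URL handling are assumptions, and real TTS engines differ in which SSML extensions (e.g. emotion or style tags) they accept.

```python
# Illustrative sketch only: assemble one structured-script sentence into a
# generic SSML fragment before calling a TTS engine.
from xml.sax.saxutils import escape

def sentence_to_ssml(text: str, voice: str, rate: str = "medium",
                     sfx_url: str = "") -> str:
    sfx = f'<audio src="{sfx_url}"/>' if sfx_url else ""
    return (
        '<speak version="1.0" xml:lang="zh-CN">'
        f'<voice name="{voice}"><prosody rate="{rate}">{escape(text)}</prosody></voice>'
        f"{sfx}</speak>"
    )

# Usage: sentence_to_ssml("Wait a moment.", voice="young-male-calm")
```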
In Step 5 of the invention, a Chinese TTS engine and/or an English TTS engine matching the content of the text segment is called; over the background sound corresponding to the scene of the text, the corresponding audio is produced as the character's emotion changes and/or a certain action occurs, and speech is output at the required speaking rate.
The audio fusion method comprises the following steps:
Step 51: perform multi-track audio fusion on the obtained formatted audio, including the special-effect sound, character voices and background sound;
Step 52: unify the audio format of the fused multi-track audio;
Step 53: output the audio in the unified format, obtaining the complete audio of the audiobook.
After audio fusion, the file is stored in the cloud for downstream distribution and delivery. Multi-track audio fusion means combining the audio tracks of several audio files into a new audio file containing all the tracks, so that the special-effect sound and background sound are fused into the character voices to form the complete chapter audio output. Audio format unification means setting the synthesized audio to a unified audio format so that the sampling rate and sound quality remain uniform.
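A minimal sketch of this fusion and format-unification step with the pydub library follows; the file layout, the 18 dB background ducking and the 44.1 kHz stereo MP3 target are assumptions.

```python
# Illustrative sketch only: overlay character speech, special-effect sounds and
# background music, then unify the output format (Steps 51-53).
from pydub import AudioSegment

def fuse_chapter(speech_paths, sfx_events, bgm_path, out_path):
    """speech_paths: ordered sentence audio files; sfx_events: (path, offset_ms)
    pairs relative to the chapter start; bgm_path: background music file."""
    chapter = AudioSegment.empty()
    for p in speech_paths:                        # concatenate sentence audio
        chapter += AudioSegment.from_file(p)
    bgm = AudioSegment.from_file(bgm_path) - 18   # duck the background by 18 dB
    chapter = chapter.overlay(bgm[:len(chapter)])
    for sfx_path, offset_ms in sfx_events:        # drop SFX at their time positions
        chapter = chapter.overlay(AudioSegment.from_file(sfx_path), position=offset_ms)
    chapter = chapter.set_frame_rate(44100).set_channels(2)   # unify the format
    chapter.export(out_path, format="mp3")
    return out_path
```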
The above method mainly comprises two parts, automatic script generation and AI speech synthesis, and these two parts can be replaced by the following two modes:
1) Manual script import + AI speech synthesis: import a manually written script, automatically align it with the automatic script format, and feed it into AI speech synthesis to realize automatic synthesis from the manual script;
2) Automatic script generation + mixed human/AI recording: the system automatically creates the script and synthesizes multiple timbres and multiple emotions for each sentence; part of the AI-recorded speech is then replaced with human-recorded speech, realizing mixed human and AI recording.
The key technical points of the invention are mainly:
1. Web-fiction NLP processing technology: based on the text technology accumulated in the web-fiction domain and the exploration of new front-line techniques, covering character-graph mining, chapter structuring, text emotion classification, scene recognition and semantic matching.
2. AI timbre library: high-quality timbre adaptation for the characters of web novels across many genres, realizing high-quality multi-timbre AI narration.
The invention also provides application of the text content preprocessing method in natural language text processing.
The invention also provides a text content preprocessing system for realizing the above text content preprocessing method, the system comprising a character information mining module, a character relation construction module, a character timbre matching module and a character information storage module. The character information mining module mines character attributes and generates attribute information including character name, gender and age; the character relation construction module aligns character aliases and constructs character-relation attributes; the character timbre matching module matches each character to a corresponding timbre according to the character attributes, including the character's timbre and tone information; the character information storage module stores the generated character information into the corresponding database tables, building a character-information query system for subsequent structured-script generation.
The invention also provides an application of the method for generating the structured script in text structuring.
The invention also provides a structured script generation system for realizing the above method for generating the structured script, which comprises a text chapter decomposition module, a dialogue character recognition module, a text emotion recognition module, a special-effect scene recognition module and a structured script output module.
The invention also provides an application of the timbre matching method in timbre matching for text.
The invention also provides a timbre matching system for realizing the above timbre matching method, the system comprising a character timbre matching module, a special-effect sound matching module, a scene sound matching module and an audio format unification module.
The invention also provides an application of the audio fusion method in audio fusion for audiobooks.
The invention also provides an audio fusion system for realizing the above audio fusion method, the system comprising an audio volume normalization module and an audio fusion module.
The beneficial effects of the invention include:
(1) High output efficiency
According to the invention, fully automatic processing of novel-text script creation is realized through natural language processing, from character mining, timbre matching and chapter structuring through to special-effect scene recognition. Compared with the current manual script creation process, it is far more efficient, improving time efficiency by 30 times.
(2) High sound quality and richness
The invention maintains a timbre library suited to many genres that can cover the timbre characteristics of most novel characters, and as timbre-migration technology matures the platform timbre library keeps expanding with new timbres. Compared with manual dubbing, the invention covers more kinds of timbres; moreover, one AI timbre can be used to synthesize many books, and the synthesis time for a single book is reduced about 10-fold compared with manual work (300 days -> 30 days), whereas using the same human voice across many books is limited by the availability of the voice actor and takes even longer.
In an era of fierce market competition, all kinds of audiobook software are flourishing; how to immerse the user in the plot of the novel and give a stronger sense of scene and identification has enormous commercial value. As a new technology, the invention can fuse the TTS engine with the specific scenes of the web novel and with the characters' genders and emotions, convey more expressive information to the user, and raise the commercial value of related products.
Drawings
FIG. 1 is a flow chart of a prior art speech synthesis method.
FIG. 2 is a flow chart of a speech synthesis method of the present invention.
Fig. 3 is a schematic diagram of the character graph of the present invention.
Fig. 4 is a schematic diagram of the scene timbres of the present invention.
Fig. 5 is a block diagram of the modules of the present invention in operation.
Fig. 6 is a diagram of steps performed in the content preprocessing module of the present invention.
FIG. 7 is a diagram of steps performed in the structured script module of the present invention.
Fig. 8 is a diagram of steps performed in the timbre matching module of the present invention.
Fig. 9 is a diagram of steps performed in the audio fusion module of the present invention.
Fig. 10 is an exemplary diagram of an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following specific examples and drawings. Except where specifically noted below, the procedures, conditions and experimental methods for carrying out the invention are common general knowledge in the art, and the invention has no particular limitations on them.
The invention realizes a speech synthesis method. The method can call the required TTS engine type according to the plot of the web novel and automatically fuse the current scene with the character's emotion to generate dubbing corresponding to the plot. The method mainly comprises five parts:
1. Prepare a material library, in which the audio is divided into two types:
(1) scene audio, such as the sound of a howling cold wind, flowing water, etc.; this type of audio serves as background sound.
(2) emotion audio corresponding to the gender of different characters, such as a girl crying, a man laughing heartily, and so on.
The above audio types are labeled.
2. Train on the material labels: train with the BERT depth model and obtain the word vectors corresponding to the labels of the scene and emotion materials.
3. Input the segments of the novel text, structure them using the parsing techniques described above, and process the following types of structured content separately, specifically:
(1) scene words, for example: a howling cold wind, a dark moonless and windy night, running water;
(2) gender words, for example: she, he, elder sister, mother, mister;
(3) vocabulary describing a character's emotion, for example: crying, laughing.
4. For the deconstructed structured script, obtain the most similar audio in the material library by semantic similarity: candidates are produced from the semantic similarity between the scene-class words and the audio in the material library, and the most suitable audio for that part is chosen by manual preference.
5. Call the required TTS engine; under the background sound corresponding to the scene, output speech at the required speaking rate as the character's emotion changes or a certain action occurs. The TTS engines are mainly Chinese and English TTS engines.
More specifically, the structured script means converting the body of a novel chapter, through the system, into a text dialogue script with multiple characters, multiple emotions and multiple scenes. This function mainly comprises a character dialogue recognition module, a text emotion recognition module and a special-effect scene mining module; an example of the structured script is shown in Fig. 10.
The character dialogue recognition module maps the dialogue texts in the chapter body to the corresponding characters in the novel; for example, the sentences with ids 2, 4 and 10 shown in Fig. 10 are recognized as being spoken by their respective characters (such as Xu Qian and Xu Xinnian). The structure of the model can use a language-model structure to build a context-based character decision structure.
The text emotion recognition module recognizes the emotion of a dialogue text through an in-house emotion algorithm model, currently supporting emotions such as happy, sad, calm, angry, surprised, afraid and narration. The model structure can also use a text classification structure based on a language model.
The special-effect scene mining module locates scene words in the text through a word-granularity text classification model; for example, for a sentence describing someone striding away it outputs [2, footstep sound]. Nearly a thousand scene sounds are currently supported. The model structure can also use a text classification structure based on a language model.
The deconstruction is performed as follows:
Step 1: build the character dialogue recognition module architecture and input samples X = {X1, X2, ..., Xn}, where Xn = (Cn, Qn, [Choice1, Choice2, ..., Choicem], label) and label is the index of the true character for the sample. Taking the example in Fig. 10:
Qn is: "Wait a moment.";
Cn is the context passage: Xu Qian, whose logical reasoning had been honed through his previous life, felt a pang at the form of address; before this, Xu Xinnian would not have set aside his pride to address him as elder brother, and perhaps never would. He answered his younger brother's last request in a low voice: "Wait a moment." The footsteps moved away and faded in the corridor; Xu Qian sat leaning against the railing, his feelings complicated.;
[Choice1, Choice2, ..., Choicem] is [Xu Qian, Xu Xinnian];
label is 1.
Step 2: build the training module based on the language model. The language model serves as the encoder part of the whole model and is denoted LM. The overall model is:
M_role = softmax(concat(Class(LM([Cn, Qn, Choice1])), ..., Class(LM([Cn, Qn, Choicem]))))
where m is the total number of options, Class outputs the score of the current concatenated text through LM, concat splices the scores of the several candidate answers, and softmax yields the m-class output, which is fitted to label.
Step 3: repeat Step 2; when the target (90% accuracy) is reached, save the best model and predict on chapter text. The input text is H = {H1, H2, ..., Hn},
where Hn = (Cn, Qn, [Choice1, Choice2, ..., Choicem]); the output is R = Choice[Max_index(M_role(Batch(H)))], where Max_index is the index of the most probable character among the candidates for the center sentence.
With the language model as encoder and corpus-sample learning on web-fiction text, the difficulty of understanding the text semantics is greatly reduced. For example:
"After the scarred and exhausted Xu Pingzhi had gone to the hospital, Xu Qian ran over to Li Ru: 'xx'." Model result: Xu Qian.
"The scarred and exhausted Xu Pingzhi ran over to Li Ru: 'xx'." Model result: Xu Pingzhi.
These are complex scenes in which several characters interact, and the model can still accurately identify the dialogue sender.
Step 4: build the text emotion recognition module and input samples D = {D1, D2, ..., Dn}, where Dn denotes the n-th piece of data and Dn = (Cn, label); Cn is the text carrying the emotion, consisting of the body sentence (hereinafter the center sentence) and a dynamic-range context segment around it, and label is the emotion label expressed as an index into the emotion class set E = [E1, E2, ..., Em], for example: [narration, angry, happy, ...].
Taking the example in Fig. 10,
Cn is: "'I have to solve this case,' Xu Qian said in a low voice: 'I want to know what happened in this case, so that even if I die, I die understanding; otherwise I will not rest.' Saying outright that he could crack the case would probably make Xu Xinnian think his brain was broken, so Xu Qian changed his wording."
label is 2.
Step 5: build the training module for text emotion recognition. The language model serves as the encoder part of the whole model and is denoted LM. The overall model is:
M_emotion = softmax(LM(C1, C2, ..., Cn))
The class scores obtained after the softmax are fitted to label.
Step 6: repeat Step 5; when the model training accuracy reaches 95%, save the best model and predict the emotion of the dialogue sentences in the text. Input text C = {C1, C2, ..., Cn};
the output result is:
R = Choice[Max_index(M_emotion(Batch(C)))]
where Max_index is the index of the most probable class in the emotion class set for the center sentence.
Step 7: build the special-effect sound recognition module architecture: input samples D = {D1, D2, ..., Dn}, where Dn denotes the n-th piece of data and Dn = (Cn, (start, end, label)); Cn is a text containing a special-effect sound, start and end are the start and end positions of the special-effect sound, and label is the special-effect sound type label, expressed as an index into the special-effect sound class library E = [E1, E2, ..., Em], for example: footstep sound, door-opening sound, and so on. A sample in detail:
Cn = "The woman's crying and shouting rang in her ears; she opened her eyes with a start, her back covered in cold sweat."
(start, end, label) = (0, 6, '94'), where label = 94 is the index of crying in the special-effect library.
The training data are input into the special-effect sound model and model training is executed; when the target (90% accuracy) is reached, the best model is saved:
M_special = NER(LM(Cn))
Step 8: special-effect sound recognition: input the original text T = {T1, T2, ..., Tn} and extract the output (start, end, special_index) = Choice[M_special(Batch(T))], where Choice denotes threshold filtering of the scores output by the model and special_index in the output is the special-effect sound index.
Step 9: timbre matching and synthesis: after the structured script is generated, the article has been decomposed into a sentence-granularity text structure, and timbre matching is performed on the structured script, including character timbre matching, special-effect sound matching and background sound matching. After timbre matching is completed, audio synthesis is performed on the sentence-granularity text, generating multiple audio segments.
In character timbre matching, features such as the gender, age and personality of the mined characters are used. The character information is input into a character timbre database for matching; the character timbre database is a database of timbres corresponding to different character features, built by manually labeling and collecting data.
Special-effect sound matching means that after the text passes through the special-effect sound recognition model, the start position of the special-effect sound is extracted from the text; after the text audio is generated, the time position at which the special-effect audio should be merged is computed by the following formula and the audio is combined. Background music can be fused in the same way.
special_local = (seconds - symbol_num * 0.7) * (start - symbol_length) / (words_length - symbol_num) + symbol_length_before_special * 0.7 + duration_all
where special_local is the time position at which the special-effect sound is inserted into the audio, seconds is the total audio duration of the current sentence, symbol_num is the number of punctuation symbols in the current sentence, start is the start position of the special-effect sound in the text, words_length is the total number of characters in the sentence, symbol_length_before_special is the number of punctuation symbols before the special-effect position, and duration_all is the accumulated duration of the audio preceding the current sentence.
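Read plainly, the formula maps the character offset of the special-effect sound to a time offset within the spoken words of the sentence, adds back the pauses contributed by the punctuation before it (0.7 s each), and shifts by the audio already accumulated. A direct sketch of this computation follows, with variable names mirroring the formula; the interpretation of symbol_length as the punctuation count before the start position is an assumption.

```python
# Illustrative sketch only: compute the insertion time (in seconds) of a
# special-effect sound from the patent's special_local formula.
def special_local(seconds: float, symbol_num: int, start: int, symbol_length: int,
                  words_length: int, symbol_length_before_special: int,
                  duration_all: float) -> float:
    speech_time = seconds - symbol_num * 0.7            # time spent on words only
    char_ratio = (start - symbol_length) / (words_length - symbol_num)
    return (speech_time * char_ratio
            + symbol_length_before_special * 0.7        # pauses before the SFX
            + duration_all)                             # audio accumulated so far
```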
Step 10: audio fusion: merge the multiple sentence-granularity audio segments into chapter-granularity multi-track audio and unify the audio format to form the complete chapter audio. After synthesis is finished, the audio can be uploaded to cloud storage and distributed and delivered downstream.
The protection of the present invention is not limited to the above embodiments. Variations and advantages that would occur to those skilled in the art are included in the invention without departing from the spirit and scope of the inventive concept, and the scope of protection is defined by the appended claims.

Claims (10)

1. A method for generating a structured script, characterized in that a text segment of a novel is input, paragraph-level structural analysis is performed on the text segment, and the structured script is generated; the paragraph-level structural analysis performs character dialogue recognition, text emotion recognition and special-effect scene mining on the input text segment and converts the text segment into a text dialogue script with multiple characters, multiple emotions and multiple scenes; wherein
the text segment is reorganized into dialogue through character-dialogue dependency and coreference-resolution text processing; a text emotion classification model built with BERT realizes fine-grained emotion classification and assigns emotion attributes to the text; the text positions with special effects in the text segment are extracted through keywords and scene semantics; and the structured script is finally created; the method specifically comprises the following steps:
Step A: perform chapter decomposition on a pre-obtained formatted text containing characters and plot text, and reorganize it into dialogue;
Step B: perform dialogue character recognition, text emotion recognition and special-effect scene recognition on the dialogue-form text content obtained in Step A;
Step C: generate the final structured script.
2. The method for generating a structured script according to claim 1, wherein in Step A the chapter decomposition means decomposing the formatted text into character dialogue sentences and narration sentences according to a dialogue decomposition strategy;
the character dialogue sentences are the sentences of dialogue between characters in the text, and the rest of the text is treated as narration;
the dialogue decomposition strategy is a strategy that splits the text by regular-expression matching tuned to the author's writing habits and the text itself.
3. The method for generating a structured book according to claim 1, wherein in the step B, the dialog character recognition means mapping the dialog text in the chapter body to the corresponding character through a dialog character recognition module; the model structure in the dialogue role recognition module adopts a language model structure to construct a context-based role decision structure.
4. The method for generating a structured codebook according to claim 3, wherein the language model structure refers to a deep learning language model constructed based on bert, and a dialogue text is input to identify a dialogue character; the role decision structure is a role decision structure which combines a disassembly strategy and a language model, carries out structural disassembly on the text, identifies the dialogue role relation and finally converts the text into a structural picture;
And/or the number of the groups of groups,
after the role dialogue recognition module model architecture is built, inputting samples X= { X1, X2, … …, xn }, xn represents nth data, xn= (Cn, qn, choice1, choice2, …, choice m, label), m represents m candidate role names in total, qn represents dialogue text sentences, called central sentences, cn is a context segment of dynamic range before and after the central sentences, the length of the context segment is changed based on the recognizable character length L of the model, and the range of the dynamic window is given, so that the character length len (Cn) < L of the Cn segment is reduced if the character is too long; [ Choice1, choice2, …, choice m ] is a role candidate appearing in the segment, and label is a real role sequence number corresponding to the actual sample;
the training module in the model is constructed on the basis of the language model, the language model being the encoding part of the whole model and denoted LM; the model structure of the training module is as follows:
M_role = softmax(concat(Class(LM([Cn, Qn, Choice1])), …, Class(LM([Cn, Qn, Choicem])))),
wherein m represents the total number of options, Class outputs the score of the current combined text through LM, concat splices the scores of the multiple candidate answers, and softmax yields the m-category output; the output result is fitted to label;
the training procedure is repeated, and when the accuracy of the trained model reaches the 90% metric, the optimal model is saved and used to predict on chapter text;
the input text is denoted H = {h1, h2, …, hn},
wherein hn = (Cn, Qn, Choice1, Choice2, …, Choicem), and the output result is R = Choice[Max_index(M_role(H))], wherein Max_index is the serial number of the most probable character among the candidate characters corresponding to the central sentence; finally, the item with the highest model score in the candidate list [Choice1, Choice2, …, Choicem] is output as the result.
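A minimal sketch of the M_role structure of claim 4, assuming the HuggingFace transformers library and a BERT multiple-choice head: each candidate character name Choicei is paired with (Cn, Qn), scored by the language model, and the concatenated scores are passed through softmax; the library, the model name and the function names are assumptions, not part of the claim.

import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")      # model name is an assumption
model = BertForMultipleChoice.from_pretrained("bert-base-chinese")

def predict_role(context: str, centre_sentence: str, choices: list[str]) -> str:
    # Build one [Cn, Qn, Choicei] pair per candidate character name.
    first = [context + centre_sentence] * len(choices)
    enc = tokenizer(first, choices, truncation=True, padding=True, return_tensors="pt")
    enc = {k: v.unsqueeze(0) for k, v in enc.items()}   # shape (1, m, seq_len)
    with torch.no_grad():
        logits = model(**enc).logits                    # concatenated per-candidate scores
    probs = torch.softmax(logits, dim=-1)               # softmax over the m candidates
    return choices[int(probs.argmax(dim=-1))]           # Max_index over the candidate list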
5. The method for generating a structured script according to claim 1, wherein in step B, the text emotion recognition means that the emotion of the dialogue text is recognized through an emotion algorithm model, the model adopting a text classification structure based on a language model.
6. The method for generating a structured script according to claim 5, wherein the text classification structure classifies a dialogue into multiple emotion types based on a deep semantic text classification model implemented with BERT;
and/or,
after the text emotion recognition module is constructed, samples D = {d1, d2, …, dn} are input, where dn represents the nth data item and dn = (Cn, label); Cn is the emotion-bearing text, comprising a text sentence, hereinafter referred to as the central sentence, together with the context segments of dynamic range before and after the central sentence; label is the emotion label, expressed by its serial number in the emotion classification set E = [E1, E2, …, Em];
a training module for recognizing text emotion is constructed; the language model serves as the encoder part of the whole model and is denoted LM; the overall model is as follows:
M_emotion = softmax(LM(C1, C2, …, Cn)),
the class score output is obtained after the softmax function and fitted to label;
the model training steps are repeated, and when the training accuracy of the model reaches 95%, the optimal model is saved and used to predict the emotion of dialogue sentences in the text; for input text C = {C1, C2, …, Cn}, the output result is:
R = Choice[Max_index(M_emotion(Batch(C)))];
wherein Max_index is the serial number of the class with the highest probability in the emotion classification set corresponding to the central sentence, and finally the item with the highest model score in the emotion classification set [E1, E2, …, Em] is output as the result.
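A minimal sketch of M_emotion as a BERT sequence classifier over the emotion classification set E, again assuming the HuggingFace transformers library; the emotion labels and the model name are placeholders, since the claim does not enumerate E.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

EMOTIONS = ["neutral", "happy", "angry", "sad", "afraid", "surprised"]  # stands in for E = [E1, ..., Em]

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=len(EMOTIONS))

def predict_emotion(batch_texts: list[str]) -> list[str]:
    # Classify a batch of central sentences (with their context) into emotion classes.
    enc = tokenizer(batch_texts, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)           # softmax(LM(C1, ..., Cn))
    return [EMOTIONS[i] for i in probs.argmax(dim=-1).tolist()]      # Max_index per sentence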
7. The method for generating a structured script according to claim 1, wherein in step B, the special effect scene recognition means that a special effect scene mining module uses a word-granularity text classification model to locate scene words in the text and obtain the positions at which scene special effects subsequently need to be added; the model adopts a text classification structure based on a language model.
8. The method for generating a structured script according to claim 7, wherein in the special effect scene recognition process, the model architecture of a special effect sound recognition module is constructed: input samples D = {d1, d2, …, dn}, where dn represents the nth data item and dn = (Cn, (start, end, label)); Cn is a text containing a special effect sound, start and end respectively denote the start and end positions at which the special effect sound appears, and label is the special effect sound type label, expressed by its serial number in the special effect sound class library Te = [Te1, Te2, …, Tem];
training data is input into the special effect sound model, model training is executed, and the optimal model is saved when the 90% metric is reached:
M_special = NER(LM(Cn)),
for an input original text T = {T1, T2, …, Tn}, the output result (start, end, special_index) = Choice[M_special(Batch(T))] is extracted, wherein Choice denotes threshold screening of the scores output by the model, and special_index in the output is the serial number of the special effect sound.
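A minimal sketch of M_special = NER(LM(Cn)) as token-level classification with threshold screening, assuming the HuggingFace transformers library; the special effect sound types, the threshold value and the model name are placeholders, and merging adjacent tokens of the same type into longer spans is omitted for brevity.

import torch
from transformers import BertTokenizerFast, BertForTokenClassification

SOUND_TYPES = ["O", "thunder", "rain", "footsteps", "explosion"]  # stands in for Te = [Te1, ..., Tem]; "O" = no effect
THRESHOLD = 0.5                                                   # screening threshold is an assumption

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=len(SOUND_TYPES))

def extract_effects(text: str):
    # Return (start, end, special_index) triples for tokens whose score passes the threshold.
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt", truncation=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits[0], dim=-1)
    results = []
    for (start, end), p in zip(offsets, probs.tolist()):
        best = max(range(len(SOUND_TYPES)), key=lambda i: p[i])
        if best != 0 and p[best] >= THRESHOLD and end > start:    # "Choice": threshold screening, skip "O"
            results.append((start, end, best))
    return results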
9. Use of the method for generating a structured script according to any one of claims 1 to 8 in a speech synthesis method.
10. A structured script generation system, characterized in that the system adopts the method for generating a structured script according to any one of claims 1 to 8, and comprises a text chapter disassembly module, a dialogue character recognition module, a text emotion recognition module, a special effect scene recognition module and a structured script output module.
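A minimal sketch of how the five modules of claim 10 could be wired together; the class and parameter names are assumptions, and each module is passed in as a callable standing for the corresponding recognizer described in claims 1 to 8.

class StructuredScriptSystem:
    def __init__(self, splitter, role_recognizer, emotion_recognizer, effect_recognizer, writer):
        self.splitter = splitter                        # text chapter disassembly module
        self.role_recognizer = role_recognizer          # dialogue character recognition module
        self.emotion_recognizer = emotion_recognizer    # text emotion recognition module
        self.effect_recognizer = effect_recognizer      # special effect scene recognition module
        self.writer = writer                            # structured script output module

    def run(self, chapter_text: str):
        segments = self.splitter(chapter_text)          # (kind, sentence) pairs
        lines = []
        for kind, sentence in segments:
            speaker = self.role_recognizer(sentence) if kind == "dialogue" else "narrator"
            lines.append({"text": sentence,
                          "speaker": speaker,
                          "emotion": self.emotion_recognizer(sentence),
                          "effects": self.effect_recognizer(sentence)})
        return self.writer(lines)                       # emits the structured script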
CN202211710214.9A 2022-12-29 2022-12-29 Method and system for generating structured photo Pending CN116129868A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211710214.9A CN116129868A (en) 2022-12-29 2022-12-29 Method and system for generating structured photo

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211710214.9A CN116129868A (en) 2022-12-29 2022-12-29 Method and system for generating structured photo

Publications (1)

Publication Number Publication Date
CN116129868A true CN116129868A (en) 2023-05-16

Family

ID=86307357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211710214.9A Pending CN116129868A (en) 2022-12-29 2022-12-29 Method and system for generating structured photo

Country Status (1)

Country Link
CN (1) CN116129868A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116561286A (en) * 2023-07-06 2023-08-08 杭州华鲤智能科技有限公司 Dialogue method and device
CN116561286B (en) * 2023-07-06 2023-10-27 杭州华鲤智能科技有限公司 Dialogue method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination