CN108091321B - Speech synthesis method - Google Patents

Speech synthesis method

Info

Publication number
CN108091321B
CN108091321B (application CN201711080122.6A)
Authority
CN
China
Prior art keywords
character
speaking
role
text
synthesizer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711080122.6A
Other languages
Chinese (zh)
Other versions
CN108091321A (en)
Inventor
孟猛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yutou Technology Hangzhou Co Ltd
Original Assignee
Yutou Technology Hangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yutou Technology Hangzhou Co Ltd
Priority to CN201711080122.6A
Publication of CN108091321A
Application granted
Publication of CN108091321B
Status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech synthesis method, belonging to the technical field of speech processing. In the method, a plurality of character roles are preset, each with a preset synthesizer parameter set, and the method further includes: obtaining a sentence text; parsing each quoted part and the speaking role corresponding to each quoted part from the sentence text; globally normalizing the speaking roles over the sentence text, matching each speaking role with a preset character role, and determining the character role corresponding to each speaking role and its synthesizer parameter set according to the matching result; and performing speech synthesis on the corresponding quoted parts according to the synthesizer parameter set of each speaking role, thereby forming and outputting synthesized speech corresponding to the sentence text. The beneficial effects of this technical scheme are: the traits of different characters are distinguished and reflected in the synthesized speech, the recognizability of each character is improved, the synthesized speech comes closer to the way people narrate a text, and the user experience is improved.

Description

Speech synthesis method
Technical Field
The invention relates to the technical field of voice processing, in particular to a voice synthesis method.
Background
With the continuous development of speech technology, more and more software applications incorporate speech recognition and processing. For example, an application may recognize text input by the user and synthesize and output the corresponding speech according to the recognition result.
Generally, in real life, and especially when narrating stories, the same speaker often distinguishes different characters and scenes by changing the tone of voice. For example, when a mother tells her child the story of the wolf and the lamb, she uses a relatively deep, dull voice for the wolf and a relatively sweet, high-pitched voice for the lamb. As another example, in some narrations different characters are shaped with different voice qualities, so that the dialogue between characters can easily be distinguished without the need for narration.
However, in conventional speech software applications, the speech synthesized from a long text is generally played uniformly in a flat tone, which gives the user the experience of listening to a machine speaking without any fluctuation of mood. This manner of playback easily confuses the speaking roles of the different characters set in the text, and the user has to listen to the synthesized speech carefully and distinguish the speaking roles from the content itself. The output synthesized speech is therefore completely unlike the way people narrate a text in real life, which degrades the user experience.
Disclosure of Invention
In view of the above problems in the prior art, a technical solution of a speech synthesis method is provided, which aims to distinguish the traits of different characters and reflect them in the synthesized speech, and to improve the recognizability of each character in the synthesized speech, so that the synthesized speech comes closer to the way people narrate a text, thereby improving the user experience.
The technical scheme specifically comprises the following steps:
A speech synthesis method, wherein a plurality of types of character roles are preset and a synthesizer parameter set is preset for each type of character role, the method further comprising the following steps:
step S1, obtaining a sentence text to be synthesized;
step S2, analyzing each quoted part from the sentence text and obtaining the speaking role corresponding to each quoted part;
step S3, globally normalizing the speaking roles over the sentence text to identify multiple occurrences of the same speaking role in the sentence text, matching the speaking roles with the preset character roles, and determining, according to the matching result, the character role corresponding to each speaking role and its synthesizer parameter set;
step S4, performing speech synthesis on the corresponding quoted part according to the synthesizer parameter set of each speaking role, thereby forming and outputting the synthesized speech corresponding to the sentence text;
wherein the quoted part is a statement between two quotation marks;
globally normalizing the speaking roles over the sentence text unifies multiple occurrences of the same speaking role so that they correspond to one character role and its synthesizer parameter set;
and matching a speaking role with a preset character role finds the type of character role corresponding to that speaking role and its synthesizer parameter set.
Preferably, the speech synthesis method is characterized in that step S2 specifically includes:
step S21, decomposing the sentence text into a plurality of independent sentences;
step S22, analyzing, from each sentence, the quoted part and the speaking role corresponding to that quoted part.
Preferably, the speech synthesis method is characterized in that, in step S22, a quoted part is obtained from each sentence by text analysis means according to the constraints of punctuation marks, and the corresponding speaking role is obtained by analysis based on the quoted part.
Preferably, the speech synthesis method is characterized in that the synthesizer parameter set comprises a plurality of synthesizer parameters;
the synthesizer parameters include a formant parameter, and/or a fundamental frequency fluctuation ratio parameter, and/or a speech rate parameter.
Preferably, the speech synthesis method is characterized in that the preset character roles comprise a voice-over character role for representing narration (voice-over);
in step S3, the part of the sentence text other than the quoted parts and the speaking roles is matched with the voice-over character role;
in step S4, speech synthesis is performed on the part of the sentence text other than the quoted parts and the speaking roles by using the synthesizer parameter set corresponding to the voice-over character role.
Preferably, the speech synthesis method is characterized in that each preset character role comprises a plurality of sub-characters;
in step S3, for a speaking character, a sub-character is selected from the corresponding characters according to the matching result and determined as the character corresponding to the speaking character.
Preferably, in the speech synthesis method, in step S3, the matching result is output for the user to view after the corresponding character is matched for each speaking character, and the process goes to step S4 after the user views and confirms the matching result.
Preferably, the speech synthesis method is characterized in that a role label is set in advance for each type of character;
in step S3, the output matching result is a character text formed by adding a corresponding character label to each position of the speaking character in the sentence text.
The beneficial effects of the above technical scheme are: the speech synthesis method can distinguish the traits of different characters and reflect them in the synthesized speech, improving the recognizability of each character in the synthesized speech, so that the synthesized speech comes closer to the way people narrate a text, which improves the user experience.
Drawings
FIG. 1 is a schematic flow chart of a speech synthesis method according to a preferred embodiment of the present invention;
FIG. 2 is a flow chart illustrating the finding of a reference part and a corresponding speaking role in the preferred embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
In light of the above problems in the prior art, a speech synthesis method is provided which, while synthesizing speech from a text, distinguishes the speaking roles with different traits and settings in the text, so that the output synthesized speech comes closer to the way people narrate.
In the speech synthesis method, a plurality of types of character roles are preset, and a synthesizer parameter set is preset for each type of character role. The steps shown in Fig. 1 are then performed:
step S1, obtaining a sentence text to be synthesized;
step S2, analyzing each quoted part from the sentence text and obtaining the speaking role corresponding to each quoted part;
step S3, globally normalizing the speaking roles according to the sentence text, matching the speaking roles with the preset character roles, and respectively determining the character role corresponding to each speaking role and its synthesizer parameter set according to the matching result;
and step S4, performing voice synthesis on the corresponding quoted part according to the synthesizer parameter set of each speaking role, thereby forming and outputting synthesized voice corresponding to the sentence text.
Specifically, in this embodiment, before the speech synthesis method is executed, a plurality of types of character roles are preset, and a synthesizer parameter set is preset for each type of character role. The preset character roles may correspond to the basic character types frequently involved in real-life speech, such as men and women, or, at a finer granularity, men, women, the elderly and children. A different synthesizer parameter set is set for each category of character role. Each synthesizer parameter set comprises a plurality of synthesizer parameters; feeding a given synthesizer parameter set into a speech synthesis engine simulates the specific voice, intonation and even speech rate of the corresponding character, so that the speaking effect of that character is realized in the synthesized speech.
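As a minimal sketch of this presetting step (the parameter names, values and role categories below are illustrative assumptions, not values prescribed by the patent), the character roles and their synthesizer parameter sets could be held in a simple mapping:

```python
from dataclasses import dataclass

@dataclass
class SynthesizerParams:
    """One synthesizer parameter set; field names and units are assumptions."""
    fundamental_frequency_hz: float  # base pitch of the voice
    f0_fluctuation_ratio: float      # how strongly the pitch rises and falls
    formant_shift: float             # shifts formants to change the timbre
    speech_rate: float               # 1.0 = normal speaking speed

# One preset parameter set per character-role category (values are illustrative).
PRESET_ROLES = {
    "man":        SynthesizerParams(120.0, 1.0, 0.95, 1.00),
    "woman":      SynthesizerParams(220.0, 1.2, 1.05, 1.00),
    "elderly":    SynthesizerParams(110.0, 0.8, 0.90, 0.85),
    "child":      SynthesizerParams(300.0, 1.4, 1.15, 1.10),
    "voice_over": SynthesizerParams(160.0, 0.6, 1.00, 1.00),
}
```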
In this embodiment, a standard sentence text to be synthesized usually includes large sections of sentences, and these sentences can be roughly classified into several categories:
1) A statement between two quotation marks, which may express either a word or phrase with a special meaning, or the words spoken by some speaking role. Which of the two is intended can usually be distinguished by the length of the content between the quotation marks. Such statements are hereinafter referred to as "quoted parts".
2) Words preceding a quoted part that represents a segment of speech, which typically indicate the speaking role to which that quoted part corresponds. Such words are hereinafter referred to as "speaking roles".
3) All statements other than the above quoted parts and speaking roles, which usually express descriptive content, such as a description of the scene in which the dialogue takes place or a description of the speaking roles. Such statements are hereinafter referred to as "non-quoted parts". Quoted parts of category 1) that merely express words or phrases with special meanings are also treated as non-quoted parts.
In this embodiment, the sentence text to be recognized is first obtained. It may be input directly by the user through an input device, captured from the network by a crawler engine, or downloaded from a network address specified by the user, which is not described in detail here.
In this embodiment, after the sentence text to be recognized is obtained, it is first segmented so that the whole text is decomposed into a plurality of independent sentences, which facilitates subsequent analysis and processing. The sentence segmentation may be performed by a processor.
In this embodiment, after segmentation the whole text forms a plurality of independent sentences, which are then analyzed to obtain the quoted parts in the sentences and the speaking role corresponding to each quoted part. The analysis after segmentation may likewise be executed by the processor.
In this embodiment, since one speaking role may appear many times in the sentence text, after all speaking roles in the full text have been identified, the speaking roles are normalized over the full text, so that multiple occurrences of the same speaking role are grouped for uniform processing. For example, if "the woman says" appears several times in the sentence text to be recognized, these occurrences are processed uniformly after normalization, that is, they all correspond to the same type of character role and its synthesizer parameter set.
In this embodiment, after the full-text normalization, each speaking role is matched against the preset character roles to find the type of character role corresponding to that speaking role and its synthesizer parameter set. Then, for each speaking role, the corresponding synthesizer parameter set is fed into the speech synthesis engine to perform speech synthesis on the quoted part attached to that speaking role, thereby forming and outputting the synthesized speech for the whole sentence text. The above processing may be performed by a speech synthesis engine or speech synthesizer in the processor.
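A minimal sketch of this normalization-and-matching step follows (the keyword table, the role names and the normalize_and_match helper are illustrative assumptions; the patent does not prescribe a concrete matching rule):

```python
# Keywords that map a speaking-role phrase onto a preset character-role category.
# The keyword lists are assumptions for this sketch, not part of the patent.
ROLE_KEYWORDS = {
    "woman": ["woman", "girl", "mother", "grandma"],
    "man":   ["man", "boy", "father", "grandpa"],
    "child": ["child", "kid", "baby"],
}

def normalize_and_match(speaking_roles):
    """Unify identical speaking-role strings and match each unique role to one
    preset character-role category (falling back to a default category)."""
    matched = {}
    for role in speaking_roles:
        if role in matched:           # repeated occurrences of the same speaking
            continue                  # role share one character role / parameter set
        words = role.lower().split()
        category = "default"
        for cat, keys in ROLE_KEYWORDS.items():
            if any(k in words for k in keys):
                category = cat
                break
        matched[role] = category
    return matched

print(normalize_and_match(["the woman", "the wolf", "the woman"]))
# {'the woman': 'woman', 'the wolf': 'default'}
```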
In a preferred embodiment of the present invention, as shown in fig. 2, the step S2 specifically includes:
step S21, decomposing the sentence text into a plurality of independent sentences;
step S22, analyzing, from each sentence, the quoted part and the speaking role corresponding to that quoted part.
In this embodiment, the sentence text is first decomposed into a plurality of independent sentences according to the punctuation marks in the sentence text. Specifically, the sentence text may be split at punctuation marks such as periods, commas, exclamation marks, question marks and semicolons, while the content between two quotation marks is kept within the same sentence so that each quoted part remains complete.
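A minimal sketch of this splitting step (the sketch splits only at sentence-final punctuation and assumes straight double quotation marks; per the description above, commas could be added as delimiters as well):

```python
def split_sentences(text):
    """Split a text into independent sentences at sentence-final punctuation,
    but never inside a pair of quotation marks, so that each quoted part
    stays within a single sentence."""
    sentences, current, in_quotes = [], "", False
    for ch in text:
        current += ch
        if ch == '"':
            in_quotes = not in_quotes
            # a quote that closes right after sentence punctuation also ends the sentence
            if not in_quotes and len(current) >= 2 and current[-2] in ".!?":
                sentences.append(current.strip())
                current = ""
        elif ch in ".!?;" and not in_quotes:
            sentences.append(current.strip())
            current = ""
    if current.strip():
        sentences.append(current.strip())
    return sentences

print(split_sentences('The woman said: "Come here! Quickly." Then she left.'))
# ['The woman said: "Come here! Quickly."', 'Then she left.']
```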
Subsequently, in the above step S22, the quoted part is extracted from each sentence by text analysis means according to the constraints of the punctuation marks, and the corresponding speaking role is obtained by analysis relative to the quoted part.
Specifically, the quoted parts and the corresponding speaking roles can be identified by text analysis means such as syntactic analysis and part-of-speech tagging, as follows:
For the quoted part, the judgment may be made according to one or several of the following rules:
1) a statement between two adjacent quotation marks;
2) the length of the statement between the adjacent quotation marks is greater than a preset value, for example more than 4 words, or the statement between the adjacent quotation marks contains a certain symbol, for example a period, a question mark, an exclamation mark or another punctuation mark;
3) a comma or a colon appears before the opening quotation mark of the statement between the adjacent quotation marks.
For the speaking role, once a quoted part has been confirmed, the word or phrase associated with that quoted part can be taken as its speaking role. For example:
According to the structure of the sentence, if a colon precedes the quoted part, the words before the colon are taken as the speaking role of that quoted part. For example, in: A says: "XXXX.", A is taken as the speaking role of the quoted part "XXXX.".
Or, if a quoted part is followed by a comma, the comma is followed by a word, and the word is followed by another quoted part, the word between the two quoted parts is taken as the speaking role of both. For example, in: "XXXX.", A says, "XXX.", A is taken as the speaking role of the two quoted parts "XXXX." and "XXX.".
Or, if a quoted part is followed by a comma, the comma is followed by a word, and the word is followed by a period, the word is taken as the speaking role of the preceding quoted part. For example, in: "XXXX.", A says. In this case A is taken as the speaking role of the preceding quoted part "XXXX.".
Moreover, the position of a speaking role in the sentence text can also be located by means of verbs of speech such as "say", "speak" or other similar verbs used to indicate speaking.
The analysis of the quoted parts and the corresponding speaking roles can also be realized by other text analysis means such as syntactic analysis and part-of-speech tagging; the above are only some typical analysis methods and do not limit the protection scope of the present invention.
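As a minimal sketch of the colon pattern and the quote-then-speaker pattern in the examples above (the regular expressions and the English verbs "says"/"said" are assumptions for illustration; the patent leaves the concrete text analysis means open):

```python
import re

# Pattern 1: <speaker> [says|said] : "<quote>"
COLON_BEFORE_QUOTE = re.compile(r'(?P<speaker>[^,.!?"]+?)\s*(?:says|said)?\s*:\s*"(?P<quote>[^"]+)"')
# Pattern 2: "<quote>", <speaker> says|said
QUOTE_THEN_SPEAKER = re.compile(r'"(?P<quote>[^"]+)"\s*,\s*(?P<speaker>[^,."]+?)\s+(?:says|said)')

def extract_quotes(sentence):
    """Return (speaking_role, quoted_part) pairs found in one sentence."""
    pairs = []
    for pattern in (COLON_BEFORE_QUOTE, QUOTE_THEN_SPEAKER):
        for m in pattern.finditer(sentence):
            pairs.append((m.group("speaker").strip(), m.group("quote")))
    return pairs

print(extract_quotes('The wolf said: "I will eat you."'))
# [('The wolf', 'I will eat you.')]
print(extract_quotes('"Please do not.", the lamb said.'))
# [('the lamb', 'Please do not.')]
```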
In a preferred embodiment of the present invention, the synthesizer parameter set includes a plurality of synthesizer parameters;
the synthesizer parameters include a formant parameter, and/or a fundamental frequency fluctuation ratio parameter, and/or a speech rate parameter.
Specifically, in this embodiment, each synthesizer parameter mentioned above may affect a certain aspect of the human voice. For example:
the fundamental frequency parameter is determined by the conditions of the length, width, tightness and the like of the vocal cords of the speaker, and the longer, thicker and looser the vocal cords are, the fewer the times of vocal cord vibration are, and the lower the sound is; the shorter, thinner, and stronger the vocal cords are, the more the vocal cords vibrate, the sharper the sound is made. Therefore, the fundamental frequency parameters of female voice and child voice are higher, and the fundamental frequency parameters of male voice are lower.
The fundamental frequency fluctuation ratio parameter can be used to adjust the range over which the fundamental frequency varies: the larger the variation of the fundamental frequency, the more pronounced the sense of cadence, with the voice rising and falling; the smaller the variation, the more the produced sound resembles a monotone, robot-like voice. The fundamental frequency fluctuation ratio parameter can thus be used to adjust and simulate the mood fluctuations of an utterance.
Timbre is related to the fundamental frequency parameters and is also closely related to the size, shape and use of the resonance cavities. After a sound is produced, it resonates through the vocal resonance cavities (pharyngeal cavity, oral cavity, nasal cavity and so on); the thoracic cavity, nasal cavity and similar cavities are fixed, non-adjustable resonance cavities, while the laryngeal cavity, pharyngeal cavity, oral cavity and the like are adjustable, variable resonance cavities. The formant parameter can be used as the main characteristic parameter of the resonance-cavity variation.
In this embodiment, different timbres can be obtained by adjusting a series of parameters such as the fundamental frequency parameter, the fundamental frequency fluctuation ratio parameter, the formant parameter and the speech rate parameter. The synthesizer parameter sets are combined and tuned in advance to obtain a preset timbre for each type of character role, each preset timbre corresponding to one synthesizer parameter set.
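As a small numeric illustration of the fundamental frequency fluctuation ratio (assuming, for this sketch only, that the synthesizer exposes a per-frame pitch contour in Hz):

```python
def apply_f0_fluctuation(contour_hz, base_hz, fluctuation_ratio):
    """Scale the deviations of a pitch contour around a base frequency:
    a ratio above 1 exaggerates the intonation, a ratio below 1 flattens
    the voice toward a monotone, robot-like delivery."""
    return [base_hz + (f - base_hz) * fluctuation_ratio for f in contour_hz]

contour = [200, 220, 260, 230, 190]             # original per-frame pitch in Hz
print(apply_f0_fluctuation(contour, 220, 1.5))  # livelier, wider pitch swings
# [190.0, 220.0, 280.0, 235.0, 175.0]
print(apply_f0_fluctuation(contour, 220, 0.2))  # almost monotone
# [216.0, 220.0, 228.0, 222.0, 214.0]
```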
In a preferred embodiment of the present invention, the preset character roles include a voice-over character role for representing narration (voice-over);
in the step S3, the part of the sentence text other than the quoted parts and the speaking roles is matched with the voice-over character role;
in step S4, speech synthesis is performed on the part of the sentence text other than the quoted parts and the speaking roles by using the synthesizer parameter set corresponding to the voice-over character role.
Specifically, in this embodiment, a voice-over role is set among the preset character roles, together with its corresponding synthesizer parameter set. The synthesizer parameters in this set may all be default parameters, and the resulting timbre may be a voice without mood fluctuation, similar to machine speech, or a voice whose timbre differs from that of the other character roles, for example one modeled on the timbre of a newscaster or a well-known host.
In this embodiment, the non-quoted parts are uniformly synthesized using the synthesizer parameter set corresponding to the voice-over role, finally forming a sound that can be regarded as the "voice-over" and that is distinguished from the sounds corresponding to the speaking roles.
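A minimal sketch of how the segments of a sentence text could be routed to their parameter sets in step S4 (the synthesize function is only a stand-in for an unspecified speech synthesis engine; its name and signature are assumptions):

```python
def synthesize(text, params):
    """Stand-in for a real speech-synthesis engine call (assumption)."""
    return f"<audio of {text!r} using {params}>"

def synthesize_sentence_text(segments, role_to_params, voice_over_params):
    """segments: list of (kind, text, speaking_role) tuples with kind either
    'quoted' or 'narration'. Quoted segments use the parameter set of their
    speaking role; all other segments use the voice-over parameter set."""
    audio = []
    for kind, text, role in segments:
        if kind == "quoted":
            params = role_to_params.get(role, voice_over_params)
        else:
            params = voice_over_params
        audio.append(synthesize(text, params))
    return audio  # concatenated in order, this forms the output speech

segments = [
    ("narration", "The wolf said:", None),
    ("quoted", "I will eat you.", "the wolf"),
]
print(synthesize_sentence_text(segments, {"the wolf": "man-params"}, "voice-over-params"))
```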
In a preferred embodiment of the present invention, each preset character comprises a plurality of sub-characters;
then, in step S3, for a speaking character, a sub-character is selected from the corresponding group of characters according to the matching result and determined as the character corresponding to the speaking character.
Specifically, in this embodiment, each type of character role includes a plurality of sub-characters. For example, the character role "man" may include three sub-characters "man 1", "man 2" and "man 3", each with a different synthesizer parameter set, so that each sub-character presents a clearly different timbre: "man 1" presents the voice of a young male, "man 2" the voice of a more mature and steady adult male, and "man 3" the voice of a slightly older middle-aged male. If a speaking role is matched with a character role such as "man", one of its sub-characters can be selected as the character role corresponding to that speaking role, and speech synthesis is performed on the quoted part corresponding to that speaking role according to the corresponding synthesizer parameter set.
Further, in a preferred embodiment of the present invention, the manner of selecting the sub-character may include the following (see the sketch after this list):
1) random selection, that is, after a speaking role is matched with a type of character role, one of its sub-characters is selected at random and determined as the character role corresponding to the speaking role.
2) assigning two sub-characters with clearly different timbres to two adjacent speaking roles belonging to the same character role. For example, for two adjacent speaking roles that both belong to the character role "man", the former speaking role is assigned the sub-character "man 1" and the latter the sub-character "man 2", so that the user can distinguish the two speaking roles.
3) if the words representing the speaking role have a specific reference, for example "young man", the sub-character "man 1" is assigned to that speaking role directly. That is, a plurality of specifically referring words are preset for each sub-character, and if the words representing a speaking role match a preset word under some sub-character, that sub-character is assigned to the speaking role directly.
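A minimal sketch of these three selection strategies (the sub-character names, the keyword table and the pick_sub_role helper are assumptions for illustration):

```python
import random

SUB_ROLES = {"man": ["man 1", "man 2", "man 3"]}
# Specifically referring words that directly select a sub-character (assumption).
DIRECTED_WORDS = {"young man": "man 1", "middle-aged man": "man 3"}

def pick_sub_role(speaking_role, category, previous_pick=None):
    """Choose a sub-character within a matched character-role category."""
    # 3) direct assignment when the speaking-role wording points at a sub-character
    for phrase, sub in DIRECTED_WORDS.items():
        if phrase in speaking_role.lower():
            return sub
    candidates = SUB_ROLES[category]
    # 2) keep adjacent speakers of the same category apart by avoiding the previous pick
    if previous_pick in candidates and len(candidates) > 1:
        candidates = [c for c in candidates if c != previous_pick]
    # 1) otherwise choose randomly among the remaining sub-characters
    return random.choice(candidates)

first = pick_sub_role("a man", "man")
second = pick_sub_role("another man", "man", previous_pick=first)
print(first, second)                         # two different male sub-characters
print(pick_sub_role("a young man", "man"))   # always 'man 1'
```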
In a preferred embodiment of the present invention, a default type of character role can also be set, to which speaking roles that cannot otherwise be classified are assigned, for example a speaking role represented directly by a person's name. The default character role likewise includes a plurality of sub-characters, so that if a sentence text to be recognized contains several unclassifiable speaking roles, different sub-characters are assigned to them to prevent confusion between different speaking roles. Meanwhile, the synthesizer parameter sets corresponding to the default character role need to differ from the synthesizer parameter set corresponding to the voice-over role, to avoid confusion.
In a preferred embodiment of the present invention, in the step S3, the matching result is output for the user to view after the corresponding human character is matched for each speaking character, and the process goes to the step S4 after the user views and confirms the matching result.
Specifically, in this embodiment, in order to ensure that the output synthesized speech is consistent with the way people narrate in real life, the matching result of the character roles is output to the user for review before the synthesized speech is finally output, so as to obtain the user's confirmation.
Further, in a preferred embodiment of the present invention, a role label is set in advance for each type of character;
in step S3, the matching result is output as a role text formed by adding the corresponding role label at the position of each speaking role in the sentence text.
Specifically, when each type of character role and even each sub-character is set, a corresponding role label is set, such as "man", "man 1", "man 2" and "man 3" described above, or more descriptive labels such as "man", "young man", "mature man" and "middle-aged man". When the matching result is output, the role labels corresponding to the assigned character roles can be added at the corresponding positions of the original sentence text, so that a role text is formed and output to the user for review. After the user confirms the role text, the final synthesized speech is output. The user can also modify the assigned character roles, so that a new synthesized speech is produced and output; that is, the user can intervene manually in the synthesized speech to achieve a better output effect.
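A minimal sketch of building such a role text (the bracketed label format and the build_role_text helper are assumptions; the patent only requires that the corresponding role label be added at the position of each speaking role):

```python
def build_role_text(sentence_text, role_assignments):
    """Insert the assigned role label after each speaking-role occurrence so the
    user can review, confirm or modify the matching result.
    role_assignments maps a speaking-role string to its role label."""
    annotated = sentence_text
    for speaking_role, label in role_assignments.items():
        annotated = annotated.replace(speaking_role, f"{speaking_role} [{label}]")
    return annotated

text = 'The wolf said: "I will eat you." The lamb replied: "Please do not."'
print(build_role_text(text, {"The wolf": "man 1", "The lamb": "child 1"}))
# The wolf [man 1] said: "I will eat you." The lamb [child 1] replied: "Please do not."
```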
In summary, the technical solution of the present invention provides a speech synthesis method that can distinguish each quoted part and its speaking role in a sentence text and assign synthesizer parameter sets representing different timbres, so as to distinguish the voice of the narration from the voices of the roles, and the voices of the roles from one another. The output synthesized speech is thereby closer to the speaking habits of people narrating in real life, and the synthesis effect is significantly improved compared with traditional speech synthesis.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims (8)

1. A speech synthesis method is characterized in that a plurality of types of personas are preset and synthesizer parameter sets are preset for each type of persona respectively, and the method further comprises the following steps:
step S1, the obtaining unit obtains the sentence text to be synthesized;
step S2, analyzing, from the sentence text, each quoted part and the speaking role corresponding to each quoted part;
step S3, globally normalizing the speaking roles according to the sentence text to identify multiple occurrences of the same speaking role in the sentence text, matching the speaking roles with the preset character roles, and determining, according to the matching result, the character role corresponding to each speaking role and its synthesizer parameter set;
step S4, according to the synthesizer parameter set of each speaking role, carrying out voice synthesis on the corresponding quoted part, thereby forming and outputting the synthesized voice corresponding to the sentence text;
wherein the quoted part is a statement between two quotation marks;
globally normalizing the speaking roles over the sentence text unifies them so that they correspond to one type of the character roles and its synthesizer parameter set;
and matching the speaking role with the preset character roles finds the type of character role corresponding to the speaking role and its synthesizer parameter set.
2. The speech synthesis method according to claim 1, wherein the step S2 specifically includes:
step S21, decomposing the sentence text into a plurality of independent sentences;
step S22, analyzing, from each sentence, the quoted part and the speaking role corresponding to each quoted part.
3. The speech synthesis method according to claim 2, wherein in step S22, the quote part is analyzed from each sentence by using a text analysis means according to the constraint of punctuation, and the corresponding speaking role is analyzed according to the quote part.
4. The speech synthesis method of claim 1, wherein the synthesizer parameter set includes a plurality of synthesizer parameters;
the synthesizer parameters comprise a formant parameter, and/or a fundamental frequency fluctuation ratio parameter, and/or a speech speed parameter.
5. The speech synthesis method of claim 1, wherein the predetermined plurality of human characters includes a voice-over character for representing voice-over;
in step S3, matching the part of the sentence text excluding the quoted parts and the speaking roles with the voice-over character;
in step S4, the synthesizer parameter set corresponding to the voice-over character is used to perform speech synthesis on the part of the sentence text excluding the quoted parts and the speaking roles.
6. The speech synthesis method of claim 1, wherein each preset type of the human character comprises a plurality of sub-characters;
in step S3, for one of the speaking roles, one of the sub-characters is selected from the corresponding type of character role according to the matching result, and the sub-character is determined as the character role corresponding to the speaking role.
7. The speech synthesis method according to claim 1, wherein in step S3, the matching result is output for a user to view after the corresponding character role is matched for each of the speaking roles, and the process goes to step S4 after the user views and confirms the matching result.
8. The speech synthesis method according to claim 7, wherein a character tag is set in advance for each type of the human character;
in step S3, the output matching result is a character text formed by adding the corresponding character label to the position of each speaking character in the sentence text.
CN201711080122.6A 2017-11-06 2017-11-06 Speech synthesis method Active CN108091321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711080122.6A CN108091321B (en) 2017-11-06 2017-11-06 Speech synthesis method

Publications (2)

Publication Number Publication Date
CN108091321A CN108091321A (en) 2018-05-29
CN108091321B (en) 2021-07-16

Family

ID=62170675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711080122.6A Active CN108091321B (en) 2017-11-06 2017-11-06 Speech synthesis method

Country Status (1)

Country Link
CN (1) CN108091321B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036375B (en) 2018-07-25 2023-03-24 腾讯科技(深圳)有限公司 Speech synthesis method, model training device and computer equipment
CN109036372B (en) * 2018-08-24 2021-10-08 科大讯飞股份有限公司 Voice broadcasting method, device and system
CN109273001B (en) * 2018-10-25 2021-06-18 珠海格力电器股份有限公司 Voice broadcasting method and device, computing device and storage medium
CN109523988B (en) * 2018-11-26 2021-11-05 安徽淘云科技股份有限公司 Text deduction method and device
CN109658916B (en) * 2018-12-19 2021-03-09 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, storage medium and computer equipment
CN109523986B (en) * 2018-12-20 2022-03-08 百度在线网络技术(北京)有限公司 Speech synthesis method, apparatus, device and storage medium
CN110364139B (en) * 2019-06-27 2023-04-18 上海麦克风文化传媒有限公司 Character-to-speech working method for intelligent role matching
CN110399461A (en) * 2019-07-19 2019-11-01 腾讯科技(深圳)有限公司 Data processing method, device, server and storage medium
CN110337030B (en) * 2019-08-08 2020-08-11 腾讯科技(深圳)有限公司 Video playing method, device, terminal and computer readable storage medium
CN110534131A (en) * 2019-08-30 2019-12-03 广州华多网络科技有限公司 A kind of audio frequency playing method and system
CN112837672B (en) * 2019-11-01 2023-05-09 北京字节跳动网络技术有限公司 Method and device for determining conversation attribution, electronic equipment and storage medium
CN112908292B (en) * 2019-11-19 2023-04-07 北京字节跳动网络技术有限公司 Text voice synthesis method and device, electronic equipment and storage medium
CN111158630B (en) * 2019-12-25 2023-06-23 网易(杭州)网络有限公司 Playing control method and device
CN111415650A (en) * 2020-03-25 2020-07-14 广州酷狗计算机科技有限公司 Text-to-speech method, device, equipment and storage medium
CN113539230A (en) * 2020-03-31 2021-10-22 北京奔影网络科技有限公司 Speech synthesis method and device
CN111696518A (en) * 2020-06-05 2020-09-22 四川纵横六合科技股份有限公司 Automatic speech synthesis method based on text
CN112203153B (en) * 2020-09-21 2021-10-08 腾讯科技(深圳)有限公司 Live broadcast interaction method, device, equipment and readable storage medium
CN112270167B (en) 2020-10-14 2022-02-08 北京百度网讯科技有限公司 Role labeling method and device, electronic equipment and storage medium
CN112992147A (en) * 2021-02-26 2021-06-18 平安科技(深圳)有限公司 Voice processing method, device, computer equipment and storage medium
CN112966490A (en) * 2021-03-15 2021-06-15 掌阅科技股份有限公司 Electronic book-based dialog character recognition method, electronic device and storage medium
CN113409766A (en) * 2021-05-31 2021-09-17 北京搜狗科技发展有限公司 Recognition method, device for recognition and voice synthesis method
CN113539234B (en) * 2021-07-13 2024-02-13 标贝(青岛)科技有限公司 Speech synthesis method, device, system and storage medium
CN113539235B (en) * 2021-07-13 2024-02-13 标贝(青岛)科技有限公司 Text analysis and speech synthesis method, device, system and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN201336138Y (en) * 2008-12-19 2009-10-28 众智瑞德科技(北京)有限公司 Text reading device
US20110166861A1 (en) * 2010-01-04 2011-07-07 Kabushiki Kaisha Toshiba Method and apparatus for synthesizing a speech with information
US20110320198A1 (en) * 2010-06-28 2011-12-29 Threewits Randall Lee Interactive environment for performing arts scripts
CN104508629A (en) * 2012-07-25 2015-04-08 托伊托克有限公司 Artificial intelligence script tool
CN104809923A (en) * 2015-05-13 2015-07-29 苏州清睿信息技术有限公司 Self-complied and self-guided method and system for generating intelligent voice communication

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0335296A (en) * 1989-06-30 1991-02-15 Sharp Corp Text voice synthesizing device
JPH11109991A (en) * 1997-10-08 1999-04-23 Mitsubishi Electric Corp Man machine interface system
CN1945692B (en) * 2006-10-16 2010-05-12 安徽中科大讯飞信息科技有限公司 Intelligent method for improving prompting voice matching effect in voice synthetic system
CN102486922B (en) * 2010-12-03 2014-12-03 株式会社理光 Speaker recognition method, device and system
CN102237089B (en) * 2011-08-15 2012-11-14 哈尔滨工业大学 Method for reducing error identification rate of text irrelevant speaker identification system
CN105340003B (en) * 2013-06-20 2019-04-05 株式会社东芝 Speech synthesis dictionary creating apparatus and speech synthesis dictionary creating method
CN104485100B (en) * 2014-12-18 2018-06-15 天津讯飞信息科技有限公司 Phonetic synthesis speaker adaptive approach and system
CN106326303B (en) * 2015-06-30 2019-09-13 芋头科技(杭州)有限公司 A kind of spoken semantic analysis system and method
CN106571136A (en) * 2016-10-28 2017-04-19 努比亚技术有限公司 Voice output device and method

Also Published As

Publication number Publication date
CN108091321A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
CN108091321B (en) Speech synthesis method
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
CN111048062B (en) Speech synthesis method and apparatus
CN108492817B (en) Song data processing method based on virtual idol and singing interaction system
CN110211563B (en) Chinese speech synthesis method, device and storage medium for scenes and emotion
CN112771607B (en) Electronic apparatus and control method thereof
US20200279553A1 (en) Linguistic style matching agent
Pierre-Yves The production and recognition of emotions in speech: features and algorithms
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
CN112334973B (en) Method and system for creating object-based audio content
JP4745036B2 (en) Speech translation apparatus and speech translation method
CN106653052A (en) Virtual human face animation generation method and device
CN109801349B (en) Sound-driven three-dimensional animation character real-time expression generation method and system
CN111260761B (en) Method and device for generating mouth shape of animation character
CN109949791A (en) Emotional speech synthesizing method, device and storage medium based on HMM
Mori et al. Conversational and Social Laughter Synthesis with WaveNet.
CN110556092A (en) Speech synthesis method and device, storage medium and electronic device
CN117558259A (en) Digital man broadcasting style control method and device
Hill et al. Real-time articulatory speech-synthesis-by-rules
CN111402919B (en) Method for identifying style of playing cavity based on multi-scale and multi-view
JP6222465B2 (en) Animation generating apparatus, animation generating method and program
CN116631434A (en) Video and voice synchronization method and device based on conversion system and electronic equipment
CN114446268B (en) Audio data processing method, device, electronic equipment, medium and program product
WO2023059818A1 (en) Acoustic-based linguistically-driven automated text formatting
JP2020134719A (en) Translation device, translation method, and translation program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant