CN114333763A - Stress-based voice synthesis method and related device - Google Patents


Info

Publication number
CN114333763A
CN114333763A
Authority
CN
China
Prior art keywords
reading, speech, preset, rereading, accent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210255579.0A
Other languages
Chinese (zh)
Inventor
余勇
钟少恒
陈志刚
王翊
蔡勇超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Power Supply Bureau of Guangdong Power Grid Corp
Original Assignee
Foshan Power Supply Bureau of Guangdong Power Grid Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Power Supply Bureau of Guangdong Power Grid Corp
Priority to CN202210255579.0A
Publication of CN114333763A
Legal status: Pending


Abstract

The application discloses an accent-based speech synthesis method and a related device. The method comprises: performing part-of-speech classification on a target sentence based on an artificial-intelligence word segmentation technique to obtain a category part-of-speech set, wherein the categories of the category part-of-speech set comprise verbs, nouns and modifiers; determining a sentence structure of the target sentence according to the category part-of-speech set, wherein the sentence structure comprises subject-predicate phrases and verb-object phrases; ranking a preset stress vocabulary set by stress importance to obtain a stress sequence; and performing stressed speech synthesis on the target sentence based on a preset stress probability, the stress sequence, a preset stress mode and the sentence structure to obtain a target synthesized speech, wherein the stress mode comprises slow reading and pause reading. The application can solve the technical problem that speech produced by existing speech synthesis technology is stiff and not vivid enough, giving listeners a relatively poor experience.

Description

Stress-based voice synthesis method and related device
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to an accent-based speech synthesis method and related apparatus.
Background
Accents are words or phrases that play an important role in the expression of a text and are given special emphasis when read aloud. Stress highlights a word through vocal emphasis and lends colour to expressive words, so stress placement is particularly important in converting text to speech, i.e. in speech synthesis. Existing speech synthesis (TTS) technology merely performs a plain text-to-speech conversion; the output is stiff and not vivid, and the listener experience is poor.
Disclosure of Invention
The application provides an accent-based speech synthesis method and a related device, which solve the technical problem that speech produced by existing speech synthesis technology is stiff and not vivid enough, giving listeners a poor experience.
In view of the above, a first aspect of the present application provides an accent-based speech synthesis method, including:
performing part-of-speech classification on a target sentence based on an artificial-intelligence word segmentation technique to obtain a category part-of-speech set, wherein the categories of the category part-of-speech set comprise verbs, nouns and modifiers;
determining a sentence structure of the target sentence according to the category part-of-speech set, wherein the sentence structure comprises subject-predicate phrases and verb-object phrases;
ranking a preset stress vocabulary set by stress importance to obtain a stress sequence;
and performing stressed speech synthesis on the target sentence based on a preset stress probability, the stress sequence, a preset stress mode and the sentence structure to obtain a target synthesized speech, wherein the stress mode comprises slow reading and pause reading.
Optionally, performing part-of-speech classification on the target sentence based on the artificial-intelligence word segmentation technique to obtain the category part-of-speech set, the categories of which comprise verbs, nouns and modifiers, includes:
performing word segmentation on the target sentence based on the artificial-intelligence word segmentation technique to obtain a word segmentation set;
and matching the words in the word segmentation set against a preset part-of-speech library by part of speech to obtain the category part-of-speech set.
Optionally, ranking the preset stress vocabulary set by stress importance to obtain the stress sequence includes:
configuring stress vocabulary according to Chinese grammar rules to obtain the preset stress vocabulary set, wherein the stress vocabulary comprises predicates, objects, attributives and adverbials;
and ranking the stress vocabulary in the preset stress vocabulary set by stress importance to obtain the stress sequence, wherein in the stress sequence the predicate has higher priority than the object, the object has higher priority than the attributive, and the attributive has the same priority as the adverbial.
Optionally, performing stressed speech synthesis on the target sentence based on the preset stress probability, the stress sequence, the preset stress mode and the sentence structure to obtain the target synthesized speech, the stress mode comprising slow reading and pause reading, includes:
performing stressed speech synthesis on the subject-predicate phrases in the target sentence based on a subject-predicate stress gradient probability, the stress sequence and the preset stress mode;
performing stressed speech synthesis on the verb-object phrases in the target sentence based on a verb-object stress gradient probability, the stress sequence and the preset stress mode to obtain the target synthesized speech;
wherein the preset stress probability comprises the subject-predicate stress gradient probability and the verb-object stress gradient probability, and the stress mode comprises slow reading and pause reading.
A second aspect of the present application provides an accent-based speech synthesis apparatus, comprising:
a part-of-speech classification module, used for performing part-of-speech classification on a target sentence based on an artificial-intelligence word segmentation technique to obtain a category part-of-speech set, wherein the categories of the category part-of-speech set comprise verbs, nouns and modifiers;
a structure analysis module, used for determining a sentence structure of the target sentence according to the category part-of-speech set, wherein the sentence structure comprises subject-predicate phrases and verb-object phrases;
a stress ranking module, used for ranking a preset stress vocabulary set by stress importance to obtain a stress sequence;
and a stress synthesis module, used for performing stressed speech synthesis on the target sentence based on a preset stress probability, the stress sequence, a preset stress mode and the sentence structure to obtain a target synthesized speech, wherein the stress mode comprises slow reading and pause reading.
Optionally, the part-of-speech classification module is specifically configured to:
performing word segmentation on the target sentence based on the artificial-intelligence word segmentation technique to obtain a word segmentation set;
and matching the words in the word segmentation set against a preset part-of-speech library by part of speech to obtain the category part-of-speech set.
Optionally, the stress ranking module is specifically configured to:
configure stress vocabulary according to Chinese grammar rules to obtain the preset stress vocabulary set, wherein the stress vocabulary comprises predicates, objects, attributives and adverbials;
and rank the stress vocabulary in the preset stress vocabulary set by stress importance to obtain the stress sequence, wherein in the stress sequence the predicate has higher priority than the object, the object has higher priority than the attributive, and the attributive has the same priority as the adverbial.
Optionally, the stress synthesis module is specifically configured to:
perform stressed speech synthesis on the subject-predicate phrases in the target sentence based on a subject-predicate stress gradient probability, the stress sequence and the preset stress mode;
perform stressed speech synthesis on the verb-object phrases in the target sentence based on a verb-object stress gradient probability, the stress sequence and the preset stress mode to obtain the target synthesized speech;
wherein the preset stress probability comprises the subject-predicate stress gradient probability and the verb-object stress gradient probability, and the stress mode comprises slow reading and pause reading.
A third aspect of the present application provides an accent-based speech synthesis apparatus, the apparatus comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the accent-based speech synthesis method of the first aspect according to instructions in the program code.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code for performing the accent-based speech synthesis method of the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
the application provides a method for synthesizing voice based on accents, which comprises the following steps: performing part-of-speech classification processing on a target sentence based on an artificial intelligence word segmentation technology to obtain a category part-of-speech set, wherein categories of the category part-of-speech set comprise verbs, nouns and modifiers; determining a sentence structure of a target sentence according to the category part-of-speech set, wherein the sentence structure comprises a major phrase and a minor phrase; performing importance sequencing of rereading on a preset stress vocabulary set to obtain a rereading sequence; and performing re-reading voice synthesis on the target sentence based on the preset re-reading probability, the re-reading sequence, the preset re-reading mode and the sentence structure to obtain target synthesized voice, wherein the re-reading mode comprises slow reading and pause reading.
In the accent-based speech synthesis method, a target sentence is divided into different phrase structures through part-of-speech analysis; the target sentence is then given different degrees of stress according to the parts of speech, the differences in stress importance and the sentence structure, so that the target synthesized speech better meets listeners' expectations, conveys the emotion of different contexts and sounds more vivid, improving the listener experience. The method can therefore solve the technical problem that speech produced by existing speech synthesis technology is stiff and not vivid enough, giving listeners a poor experience.
Drawings
Fig. 1 is a schematic flowchart of an accent-based speech synthesis method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an accent-based speech synthesis apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For easy understanding, please refer to fig. 1, an embodiment of an accent-based speech synthesis method provided in the present application includes:
Step 101: performing part-of-speech classification on a target sentence based on an artificial-intelligence word segmentation technique to obtain a category part-of-speech set, wherein the categories of the category part-of-speech set comprise verbs, nouns and modifiers.
Further, step 101 includes:
performing word segmentation on the target sentence based on the artificial-intelligence word segmentation technique to obtain a word segmentation set;
and matching the words in the word segmentation set against a preset part-of-speech library by part of speech to obtain the category part-of-speech set.
Regarding Chinese word segmentation, existing algorithms fall into five major categories: dictionary-based, statistics-based, rule-based, word-labelling-based, and artificial-intelligence-based (also known as understanding-based) methods. Text information processing has three levels: lexical analysis, syntactic analysis and semantic analysis; Chinese word segmentation is the first step of lexical analysis and is therefore very important. It underlies most downstream applications, from POS tagging and named entity recognition (NER) to automatic classification, automatic summarization, automatic proofreading, language modelling, machine translation, search engines, speech synthesis, and so on.
The artificial-intelligence word segmentation technique selected in this embodiment performs syntactic and semantic analysis while segmenting, using syntactic and semantic information to resolve ambiguity. Such techniques mainly include neural-network word segmentation, expert-system word segmentation and the like; the choice can be made according to the actual text and is not limited herein.
To facilitate organizing and managing the segmented words, this embodiment classifies them according to the grammar of the text. The main categories are verbs, nouns, modifiers and the like; the words may be further divided into finer categories, for example modifiers into adjectives, degree adverbs and so on, and the number of categories is not limited. The words are sorted into the category part-of-speech set according to their categories for use in the subsequent speech synthesis.
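The classification step above can be sketched in Python. This is a minimal illustration, not the patented implementation: the tiny `POS_LEXICON` dictionary and the function name are hypothetical stand-ins for the artificial-intelligence segmenter and the preset part-of-speech library, and the input is assumed to be already segmented.

```python
# Hypothetical preset part-of-speech library; a real system would use a
# full lexicon behind an AI-based segmenter rather than a hand-made dict.
POS_LEXICON = {
    "我们": "noun", "热爱": "verb", "伟大的": "modifier", "祖国": "noun",
}

def classify_parts_of_speech(tokens):
    """Match each segmented token against the preset POS library and
    group the tokens into a category part-of-speech set."""
    categories = {"verb": [], "noun": [], "modifier": []}
    for tok in tokens:
        pos = POS_LEXICON.get(tok)
        if pos in categories:
            categories[pos].append(tok)
    return categories

# "We love our great motherland", pre-segmented for the sketch.
result = classify_parts_of_speech(["我们", "热爱", "伟大的", "祖国"])
print(result)
```

Words absent from the library are simply skipped here; a production system would instead fall back to statistical or neural tagging.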
Step 102: determining the sentence structure of the target sentence according to the category part-of-speech set, wherein the sentence structure comprises subject-predicate phrases and verb-object phrases.
Each word in the category part-of-speech set belongs to a specific sentence and occupies a definite position in it. The subject-predicate components of the sentence are determined from the order or position of the words, and components such as attributives, adverbials and complements are then determined, giving the overall structure of the sentence, i.e. a sentence structure comprising subject-predicate phrases and verb-object phrases. A subject-predicate phrase comprises components such as a subject and a predicate, while a verb-object phrase comprises components such as a predicate (verb) and an object. In this embodiment the sentence structure is determined by a model trained through machine learning to perform sentence analysis; other methods may equally be used, and the approach is not specifically limited so long as the structural analysis is completed.
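Step 102 can be illustrated with a deliberately simplified heuristic. The patent leaves the method open (e.g. a machine-learned model), so the rule below — first pre-verb noun as subject, first verb as predicate, first post-verb noun as object — is only an assumption for illustration, and the function name is invented.

```python
def analyse_structure(tagged):
    """tagged: list of (token, pos) pairs in sentence order.
    Simplified heuristic: the first noun before the verb is the subject,
    the first verb is the predicate, the first noun after it the object."""
    subject = predicate = obj = None
    for token, pos in tagged:
        if pos == "verb" and predicate is None:
            predicate = token
        elif pos == "noun" and predicate is None and subject is None:
            subject = token
        elif pos == "noun" and predicate is not None and obj is None:
            obj = token
    return {
        "subject_predicate_phrase": (subject, predicate),  # e.g. 我们 + 热爱
        "verb_object_phrase": (predicate, obj),            # e.g. 热爱 + 祖国
    }

structure = analyse_structure([("我们", "noun"), ("热爱", "verb"), ("祖国", "noun")])
print(structure)
```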
Step 103: ranking the preset stress vocabulary set by stress importance to obtain a stress sequence.
Further, step 103 includes:
configuring stress vocabulary according to Chinese grammar rules to obtain the preset stress vocabulary set, wherein the stress vocabulary comprises predicates, objects, attributives and adverbials;
and ranking the stress vocabulary in the preset stress vocabulary set by stress importance to obtain the stress sequence, wherein in the stress sequence the predicate has higher priority than the object, the object has higher priority than the attributive, and the attributive has the same priority as the adverbial.
The preset stress vocabulary set is configured in advance according to Chinese grammar rules, which can be described as follows: in a subject-predicate phrase the predicate is stressed, in a verb-object phrase the object is stressed, and modifiers such as attributives and adverbials may also need stress. Furthermore, for the words designated for stress, a stress importance, i.e. a priority or gradient of stress degree, must be set; for example, predicate > object, and object > attributive = adverbial. Ranking importance by this rule arranges the stressed words into a sequence, the stress sequence. Once the sentence structures of the whole text have been arranged, stress can be applied according to the stress sequence.
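The priority rule predicate > object > attributive = adverbial can be sketched as a sort. The numeric weights below are illustrative choices, not values specified by the application; only their ordering matters.

```python
# Illustrative priority weights encoding predicate > object > attributive = adverbial.
STRESS_PRIORITY = {"predicate": 3, "object": 2, "attributive": 1, "adverbial": 1}

def build_stress_sequence(stress_words):
    """stress_words: list of (word, role) pairs. Returns the words sorted
    by descending stress priority — the stress sequence of the sentence.
    Python's sort is stable, so equal-priority words keep sentence order."""
    return sorted(stress_words, key=lambda wr: STRESS_PRIORITY[wr[1]], reverse=True)

seq = build_stress_sequence(
    [("伟大的", "attributive"), ("热爱", "predicate"), ("祖国", "object")]
)
print([word for word, _ in seq])
```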
Step 104: performing stressed speech synthesis on the target sentence based on the preset stress probability, the stress sequence, the preset stress mode and the sentence structure to obtain the target synthesized speech, wherein the stress mode comprises slow reading and pause reading.
Further, step 104 includes:
performing stressed speech synthesis on the subject-predicate phrases in the target sentence based on a subject-predicate stress gradient probability, the stress sequence and the preset stress mode;
performing stressed speech synthesis on the verb-object phrases in the target sentence based on a verb-object stress gradient probability, the stress sequence and the preset stress mode to obtain the target synthesized speech;
wherein the preset stress probability comprises the subject-predicate stress gradient probability and the verb-object stress gradient probability, and the stress mode comprises slow reading and pause reading.
The preset stress probability reflects that each stressed word in a sentence carries a different degree of stress: the most heavily stressed word has the largest probability, and the rest decrease in turn. For example, if the preset stress probability of the most stressed word is set to 75%, the probabilities of the other stressed words in the sentence can decrease in steps of 10%. Different probability distributions can be set for different sentence components, namely the subject-predicate stress gradient probability and the verb-object stress gradient probability; other distributions can be set as needed on the same principle and are not detailed here.
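Using the example figures above (75% for the most stressed word, steps of 10%), the gradient probabilities could be computed as follows; the function name, the clamping at zero and the rounding are assumptions made for the sketch.

```python
def gradient_probabilities(stress_sequence, top=0.75, step=0.10):
    """Assign each word in the stress sequence a stress probability:
    the first (most important) word gets `top`, each later word steps
    down by `step`, clamped at zero. Rounded to avoid float noise."""
    return {
        word: round(max(top - i * step, 0.0), 2)
        for i, word in enumerate(stress_sequence)
    }

probs = gradient_probabilities(["热爱", "祖国", "伟大的"])
print(probs)
```

Separate `top`/`step` values could be configured per phrase type to realise the subject-predicate and verb-object gradient probabilities.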
The stress sequence identifies the words in each sentence that need stress and their relative stress importance.
The preset stress modes comprise slow reading and pause reading, i.e. stressed slow reading and stressed pause reading. Stressed slow reading prolongs the duration of the sound so that emotionally charged words are expressed more effectively; stressed pause reading inserts a slight pause before or after the emphasized word so that the emotion is conveyed more fully. Stressing a word, i.e. increasing the force of pronunciation and raising the volume, can express strong, intense emotion.
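A hedged sketch of how the two stress modes might drive a synthesizer's prosody controls: slow reading lengthens the word's duration, pause reading inserts a short break before it, and both raise the volume. The control names (`volume_gain`, `duration_scale`, `pause_before_ms`) are invented for illustration and do not correspond to any real TTS API.

```python
def render_stress(word, mode, probability):
    """Map a stressed word, its stress mode and its stress probability
    to illustrative prosody controls for a downstream synthesizer."""
    controls = {"word": word, "volume_gain": probability}  # louder when stressed
    if mode == "slow":
        controls["duration_scale"] = 1.0 + probability     # elongate the sound
    elif mode == "pause":
        controls["pause_before_ms"] = int(200 * probability)  # brief pause
    return controls

print(render_stress("热爱", "slow", 0.75))
print(render_stress("祖国", "pause", 0.65))
```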
The resulting target synthesized speech clearly reflects the stress on key words, improves the vividness and listenability of the synthesis, and effectively addresses the stiffness of text-to-speech conversion.
The accent-based speech synthesis method provided by this embodiment divides a target sentence into different phrase structures through part-of-speech analysis; the target sentence is then given different degrees of stress according to the parts of speech, the differences in stress importance and the sentence structure, so that the target synthesized speech better meets listeners' expectations, conveys the emotion of different contexts and sounds more vivid, improving the listener experience. The method can therefore solve the technical problem that speech produced by existing speech synthesis technology is stiff and not vivid enough, giving listeners a poor experience.
For ease of understanding, referring to fig. 2, the present application provides an embodiment of an accent-based speech synthesis apparatus, comprising:
the part-of-speech classification module 201, configured to perform part-of-speech classification on a target sentence based on an artificial-intelligence word segmentation technique to obtain a category part-of-speech set, wherein the categories of the category part-of-speech set comprise verbs, nouns and modifiers;
the structure analysis module 202, configured to determine a sentence structure of the target sentence according to the category part-of-speech set, wherein the sentence structure comprises subject-predicate phrases and verb-object phrases;
the stress ranking module 203, configured to rank a preset stress vocabulary set by stress importance to obtain a stress sequence;
and the stress synthesis module 204, configured to perform stressed speech synthesis on the target sentence based on a preset stress probability, the stress sequence, a preset stress mode and the sentence structure to obtain a target synthesized speech, wherein the stress mode comprises slow reading and pause reading.
Further, the part-of-speech classification module 201 is specifically configured to:
performing word segmentation on the target sentence based on the artificial-intelligence word segmentation technique to obtain a word segmentation set;
and matching the words in the word segmentation set against a preset part-of-speech library by part of speech to obtain the category part-of-speech set.
Further, the stress ranking module 203 is specifically configured to:
configure stress vocabulary according to Chinese grammar rules to obtain the preset stress vocabulary set, wherein the stress vocabulary comprises predicates, objects, attributives and adverbials;
and rank the stress vocabulary in the preset stress vocabulary set by stress importance to obtain the stress sequence, wherein in the stress sequence the predicate has higher priority than the object, the object has higher priority than the attributive, and the attributive has the same priority as the adverbial.
Further, the stress synthesis module 204 is specifically configured to:
perform stressed speech synthesis on the subject-predicate phrases in the target sentence based on a subject-predicate stress gradient probability, the stress sequence and the preset stress mode;
perform stressed speech synthesis on the verb-object phrases in the target sentence based on a verb-object stress gradient probability, the stress sequence and the preset stress mode to obtain the target synthesized speech;
wherein the preset stress probability comprises the subject-predicate stress gradient probability and the verb-object stress gradient probability, and the stress mode comprises slow reading and pause reading.
The application also provides accent-based speech synthesis equipment, which comprises a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is configured to execute the accent-based speech synthesis method of the above method embodiments according to instructions in the program code.
The present application also provides a computer-readable storage medium for storing program code for performing the stress-based speech synthesis method in the above-described method embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for executing all or part of the steps of the method described in the embodiments of the present application through a computer device (which may be a personal computer, a server, or a network device). And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. An accent-based speech synthesis method, comprising:
performing part-of-speech classification on a target sentence based on an artificial-intelligence word segmentation technique to obtain a category part-of-speech set, wherein the categories of the category part-of-speech set comprise verbs, nouns and modifiers;
determining a sentence structure of the target sentence according to the category part-of-speech set, wherein the sentence structure comprises subject-predicate phrases and verb-object phrases;
ranking a preset stress vocabulary set by stress importance to obtain a stress sequence;
and performing stressed speech synthesis on the target sentence based on a preset stress probability, the stress sequence, a preset stress mode and the sentence structure to obtain a target synthesized speech, wherein the stress mode comprises slow reading and pause reading.
2. The accent-based speech synthesis method of claim 1, wherein performing part-of-speech classification on the target sentence based on the artificial-intelligence word segmentation technique to obtain the category part-of-speech set, the categories of which comprise verbs, nouns and modifiers, comprises:
performing word segmentation on the target sentence based on the artificial-intelligence word segmentation technique to obtain a word segmentation set;
and matching the words in the word segmentation set against a preset part-of-speech library by part of speech to obtain the category part-of-speech set.
3. The accent-based speech synthesis method of claim 1, wherein ranking the preset stress vocabulary set by stress importance to obtain the stress sequence comprises:
configuring stress vocabulary according to Chinese grammar rules to obtain the preset stress vocabulary set, wherein the stress vocabulary comprises predicates, objects, attributives and adverbials;
and ranking the stress vocabulary in the preset stress vocabulary set by stress importance to obtain the stress sequence, wherein in the stress sequence the predicate has higher priority than the object, the object has higher priority than the attributive, and the attributive has the same priority as the adverbial.
4. The accent-based speech synthesis method of claim 1, wherein performing stressed speech synthesis on the target sentence based on the preset stress probability, the stress sequence, the preset stress mode and the sentence structure to obtain the target synthesized speech, the stress mode comprising slow reading and pause reading, comprises:
performing stressed speech synthesis on the subject-predicate phrases in the target sentence based on a subject-predicate stress gradient probability, the stress sequence and the preset stress mode; and
performing stressed speech synthesis on the verb-object phrases in the target sentence based on a verb-object stress gradient probability, the stress sequence and the preset stress mode to obtain the target synthesized speech;
wherein the preset stress probability comprises the subject-predicate stress gradient probability and the verb-object stress gradient probability, and the stress mode comprises slow reading and pause reading.
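Claim 4's front end can be sketched as a planning step that combines the phrase type's gradient probability with each word's stress priority and picks a rendering mode. The gradient values, thresholds and mode labels are illustrative assumptions; the waveform-synthesis back end is outside this sketch:

```python
# Stress priority per grammatical role, per claim 3's ordering.
STRESS_PRIORITY = {"predicate": 3, "object": 2, "attributive": 1, "adverbial": 1}

# Hypothetical stress gradient probabilities per phrase type.
GRADIENT = {"subject-predicate": 0.9, "verb-object": 0.6}

def stress_plan(phrase_type, words, threshold=1.0):
    """Annotate each (word, role) pair with a stress mode when its gradient-
    weighted priority clears the threshold: high scores get 'slow' reading,
    lower ones get a 'pause' before the word, and the rest stay plain."""
    plan = []
    for word, role in words:
        score = GRADIENT[phrase_type] * STRESS_PRIORITY.get(role, 0)
        if score >= 2 * threshold:
            plan.append((word, "slow"))
        elif score >= threshold:
            plan.append((word, "pause"))
        else:
            plan.append((word, "plain"))
    return plan

phrase = [("engineer", "subject"), ("repairs", "predicate"),
          ("transformer", "object")]
print(stress_plan("subject-predicate", phrase))
```

A plan like this could then be rendered by a TTS engine, e.g. by mapping "slow" to a reduced speaking rate and "pause" to a short break before the word.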
5. An accent-based speech synthesis apparatus, comprising:
a part-of-speech classification module configured to perform part-of-speech classification on a target sentence based on an artificial-intelligence word segmentation technique to obtain a category part-of-speech set, wherein categories of the category part-of-speech set comprise verbs, nouns and modifiers;
a structure analysis module configured to determine a sentence structure of the target sentence according to the category part-of-speech set, wherein the sentence structure comprises subject-predicate phrases and verb-object phrases;
a stress ranking module configured to rank a preset stress vocabulary set by stress importance to obtain a stress sequence; and
a stress synthesis module configured to perform stressed speech synthesis on the target sentence based on a preset stress probability, the stress sequence, a preset stress mode and the sentence structure to obtain target synthesized speech, wherein the stress mode comprises slow reading and pause reading.
6. The accent-based speech synthesis apparatus of claim 5, wherein the part-of-speech classification module is specifically configured to:
perform word segmentation on the target sentence based on the artificial-intelligence word segmentation technique to obtain a word segmentation set; and
perform part-of-speech matching on the words in the word segmentation set against a preset part-of-speech library to obtain the category part-of-speech set.
7. The accent-based speech synthesis apparatus of claim 5, wherein the stress ranking module is specifically configured to:
configure stress vocabularies according to Chinese grammar rules to obtain the preset stress vocabulary set, wherein the stress vocabularies comprise predicates, objects, attributives and adverbials; and
rank the stress vocabularies in the preset stress vocabulary set by stress importance to obtain the stress sequence, wherein in the stress sequence the predicate has a higher priority than the object, the object has a higher priority than the attributive, and the attributive has a priority equal to that of the adverbial.
8. The accent-based speech synthesis apparatus of claim 5, wherein the stress synthesis module is specifically configured to:
perform stressed speech synthesis on the subject-predicate phrases in the target sentence based on a subject-predicate stress gradient probability, the stress sequence and the preset stress mode; and
perform stressed speech synthesis on the verb-object phrases in the target sentence based on a verb-object stress gradient probability, the stress sequence and the preset stress mode to obtain the target synthesized speech;
wherein the preset stress probability comprises the subject-predicate stress gradient probability and the verb-object stress gradient probability, and the stress mode comprises slow reading and pause reading.
9. An accent-based speech synthesis device, comprising a processor and a memory, wherein:
the memory is configured to store program code and transmit the program code to the processor; and
the processor is configured to perform the accent-based speech synthesis method of any one of claims 1-4 according to instructions in the program code.
10. A computer-readable storage medium storing program code for performing the accent-based speech synthesis method of any one of claims 1-4.
CN202210255579.0A 2022-03-16 2022-03-16 Stress-based voice synthesis method and related device Pending CN114333763A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210255579.0A CN114333763A (en) 2022-03-16 2022-03-16 Stress-based voice synthesis method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210255579.0A CN114333763A (en) 2022-03-16 2022-03-16 Stress-based voice synthesis method and related device

Publications (1)

Publication Number Publication Date
CN114333763A true CN114333763A (en) 2022-04-12

Family

ID=81033843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210255579.0A Pending CN114333763A (en) 2022-03-16 2022-03-16 Stress-based voice synthesis method and related device

Country Status (1)

Country Link
CN (1) CN114333763A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000765A (en) * 2007-01-09 2007-07-18 黑龙江大学 Speech synthetic method based on rhythm character
US20080027725A1 (en) * 2006-07-26 2008-01-31 Microsoft Corporation Automatic Accent Detection With Limited Manually Labeled Data
CN101449319A (en) * 2006-03-29 2009-06-03 株式会社东芝 Speech synthesis apparatus and method thereof
CN108364632A (en) * 2017-12-22 2018-08-03 东南大学 A kind of Chinese text voice synthetic method having emotion
RU2692051C1 (en) * 2017-12-29 2019-06-19 Общество С Ограниченной Ответственностью "Яндекс" Method and system for speech synthesis from text
CN110782870A (en) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113191140A (en) * 2021-07-01 2021-07-30 北京世纪好未来教育科技有限公司 Text processing method and device, electronic equipment and storage medium
Similar Documents

Publication Publication Date Title
CN108711420B (en) Multilingual hybrid model establishing method, multilingual hybrid model establishing device, multilingual hybrid model data obtaining device and electronic equipment
US7949530B2 (en) Conversation controller
US7949531B2 (en) Conversation controller
Syrdal et al. Automatic ToBI prediction and alignment to speed manual labeling of prosody
Watts Unsupervised learning for text-to-speech synthesis
CN114580382A (en) Text error correction method and device
JP2000353161A (en) Method and device for controlling style in generation of natural language
CN112466279B (en) Automatic correction method and device for spoken English pronunciation
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
Dunbar et al. Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge
CN114120985A (en) Pacifying interaction method, system and equipment of intelligent voice terminal and storage medium
López-Ludeña et al. LSESpeak: A spoken language generator for Deaf people
Zhao et al. Tibetan Multi-Dialect Speech and Dialect Identity Recognition.
Suni et al. The simple4all entry to the blizzard challenge 2014
CN111968646A (en) Voice recognition method and device
CN115116428B (en) Prosodic boundary labeling method, device, equipment, medium and program product
CN115101042A (en) Text processing method, device and equipment
Parcollet et al. LeBenchmark 2.0: A standardized, replicable and enhanced framework for self-supervised representations of French speech
CN114333763A (en) Stress-based voice synthesis method and related device
NithyaKalyani et al. Speech summarization for tamil language
Watts et al. based speech synthesis
Ghorpade et al. ITTS model: speech generation for image captioning using feature extraction for end-to-end synthesis
Gurunath Shivakumar et al. Confusion2Vec 2.0: Enriching ambiguous spoken language representations with subwords
Zhang et al. Improving sequence-to-sequence Tibetan speech synthesis with prosodic information
Yolchuyeva Novel NLP Methods for Improved Text-To-Speech Synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220412