CN115249472A - Voice synthesis method and device for realizing stress overall planning by combining context - Google Patents
Voice synthesis method and device for realizing stress overall planning by combining context
- Publication number
- CN115249472A (application number CN202110455076.3A)
- Authority
- CN
- China
- Prior art keywords
- stress
- information
- sentence
- target
- multidimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
The application discloses a speech synthesis method and device that orchestrate stress by taking the preceding context into account. The method comprises the following steps: acquiring a target sentence for which speech is to be synthesized and the preceding context sentence of the target sentence; determining multidimensional features of the target sentence according to a preset prediction algorithm, wherein the multidimensional features comprise semantic, syntactic, and lexical features; determining information features of the target sentence from the text of the context sentence, wherein the information features indicate the information focus to which stress should be assigned in the target sentence; inputting the multidimensional features and the information features into a preset stress-determination model and outputting multidimensional stress and information stress, wherein the multidimensional stress comprises semantic stress, grammatical stress, and lexical stress; and determining the target speech corresponding to the target sentence according to the multidimensional stress and the information stress.
Description
Technical Field
The present invention relates to the field of speech synthesis technology, and in particular to a speech synthesis method and apparatus that orchestrate stress by taking the preceding context into account.
Background
TTS is the process by which a machine converts language from a written carrier into a sound carrier, and it is a key module in systems such as human-machine dialogue and intelligent broadcasting. As the underlying technologies mature, the naturalness of synthesized speech has increasingly become the focus for vendors. Whether prosodic boundaries are predicted correctly, whether tone sandhi in the speech flow is handled, whether stress is placed appropriately, and fluency are four key factors affecting the naturalness of speech. Unlike stress-accented languages (such as English), stress in Chinese is a phonetic feature with no explicit marking. How to exploit this hidden phonetic feature is therefore a key technique for improving the naturalness of synthesized speech.
Most current stress-determination methods select a suitable stress description system, build a hierarchically labeled stress corpus, train a prosodic-word stress prediction model and a sentence stress prediction model, and finally combine the two models to generate a stress label for each syllable. This approach has the following shortcomings. 1) Stress is divided into only two levels, prosodic-word stress and sentence stress, without considering information stress; according to the Focus Stress Assignment rule, the information focus (Focus of information) should be realized by means of stress. Moreover, information stress is the most important of all stresses, since communication is the most essential function of language. 2) Although the stress label of each syllable is generated by combining prosodic-word stress and sentence stress, the variation of stress caused by the preceding context (the context preceding the sentence whose stress is to be predicted) is not considered, such as the addition and loss of stress, stress shift, and the raising and lowering of stress priority. 3) The fact that the number of stresses in a fixed-length target text is limited is not considered. At the syntactic and semantic levels, stress serves to emphasize; too many stresses in a text defeat that purpose, so the number of stresses in a text should be kept within a certain range.
In addition, most current text-to-speech systems comprise: 1) a text analysis module, which analyzes the special symbols and punctuation of the input text, splits the text according to punctuation and paragraph marks, and then segments the sentences into words, thereby converting the text to be synthesized into corresponding contextual information; 2) a prosody generation module, which converts the character codes in the contextual information produced by the text analysis module into corresponding phonetic codes, obtaining prosodic information that follows the phonological prosody rules of the language and has accurate pitch, duration, intensity, inter-syllable pauses, and so on; and 3) an acoustic module, which converts the prosodic information produced by the prosody generation module into audio and delivers it to the user through a player. This kind of system has the following shortcomings: 1) although contextual information is mentioned, it is in essence information about the text to be synthesized (the target text) itself, viewed at a small granularity (characters/words); it is not sentence-level context that treats the target text as a whole; 2) the influence of the content of the preceding text, i.e., its semantics, on the prosody of the target sentence is ignored.
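The three-module pipeline described above can be sketched as a minimal skeleton in Python. Everything here is illustrative: the pitch and duration values are placeholders, and the acoustic module merely serializes the prosody plan instead of generating a waveform; none of the names come from the patent.

```python
import re

def text_analysis(text: str) -> list[str]:
    # Split on sentence-final punctuation; a real module would also handle
    # special symbols, paragraph marks, and word segmentation
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def prosody_generation(sentence: str) -> list[dict]:
    # Assign placeholder pitch/duration values per token; a real module
    # derives these from the language's prosody rules
    return [{"token": t, "pitch_hz": 120.0, "duration_ms": 180}
            for t in sentence.split()]

def acoustic_module(prosody: list[dict]) -> bytes:
    # Stand-in for waveform generation: serialize the prosody plan
    return repr(prosody).encode("utf-8")

sentences = text_analysis("Hello world. How are you?")
audio = acoustic_module(prosody_generation(sentences[0]))
```

Note how the data flows text → sentences → prosody plan → audio; the patent's criticism is precisely that this flow never looks beyond the target text itself.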
Aiming at the technical problems in the prior art that information stress is not considered in the speech synthesis process and that the limit on the number of stresses in a fixed-length target text is not considered, so that the synthesized speech has poor naturalness, low stress accuracy, and low accuracy of information transmission, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the present disclosure provide a speech synthesis method and device that orchestrate stress by taking the preceding context into account, so as to solve at least the technical problems in the prior art that information stress is not considered in the speech synthesis process and that the limit on the number of stresses in a fixed-length target text is not considered, so that the synthesized speech has poor naturalness, low stress accuracy, and low accuracy of information transmission.
According to one aspect of the embodiments of the present disclosure, there is provided a speech synthesis method that orchestrates stress by taking the preceding context into account, comprising: acquiring a target sentence for which speech is to be synthesized and the preceding context sentence of the target sentence; determining multidimensional features of the target sentence according to a preset prediction algorithm, wherein the multidimensional features comprise semantic, syntactic, and lexical features; determining information features of the target sentence from the text of the context sentence, wherein the information features indicate the information focus to which stress should be assigned in the target sentence; inputting the multidimensional features and the information features into a preset stress-determination model and outputting multidimensional stress and information stress, wherein the multidimensional stress comprises semantic stress, grammatical stress, and lexical stress; and determining the target speech corresponding to the target sentence according to the multidimensional stress and the information stress.
According to another aspect of the embodiments of the present disclosure, there is also provided a storage medium comprising a stored program, wherein, when the program is executed, a processor performs any one of the methods described above.
According to another aspect of the embodiments of the present disclosure, there is also provided a speech synthesis apparatus that orchestrates stress by taking the preceding context into account, comprising: an acquisition module for acquiring a target sentence for which speech is to be synthesized and the preceding context sentence of the target sentence; a multidimensional-feature determination module for determining multidimensional features of the target sentence according to a preset prediction algorithm, wherein the multidimensional features comprise semantic, grammatical, and lexical features; an information-feature determination module for determining information features of the target sentence from the text of the context sentence, wherein the information features indicate the information focus to which stress should be assigned in the target sentence; a stress determination module for inputting the multidimensional features and the information features into a preset stress-determination model and outputting multidimensional stress and information stress, wherein the multidimensional stress comprises semantic stress, grammatical stress, and lexical stress; and a target-speech determination module for determining the target speech corresponding to the target sentence according to the multidimensional stress and the information stress.
According to another aspect of the embodiments of the present disclosure, there is also provided a speech synthesis apparatus that orchestrates stress by taking the preceding context into account, comprising: a processor; and a memory coupled to the processor for providing the processor with instructions for the following processing steps: acquiring a target sentence for which speech is to be synthesized and the preceding context sentence of the target sentence; determining multidimensional features of the target sentence according to a preset prediction algorithm, wherein the multidimensional features comprise semantic, syntactic, and lexical features; determining information features of the target sentence from the text of the context sentence, wherein the information features indicate the information focus to which stress should be assigned in the target sentence; inputting the multidimensional features and the information features into a preset stress-determination model and outputting multidimensional stress and information stress, wherein the multidimensional stress comprises semantic stress, grammatical stress, and lexical stress; and determining the target speech corresponding to the target sentence according to the multidimensional stress and the information stress.
In the embodiments of the present disclosure, when stress analysis is performed on the target sentence, the text of the preceding context sentence is used to assist in analyzing the information stress of the target sentence, so that information stress serving the dialogue is incorporated into the stress system and combined with the hierarchical stresses (semantic, grammatical, and lexical stress) into a multi-level, fused treatment of stress. Specifically, semantic features, grammatical features, lexical features, and information features are first determined through a comprehensive analysis of the target sentence and its preceding context sentence. Then, semantic stress, grammatical stress, lexical stress, and information stress are determined through a preset stress-determination model, thereby locating stress at multiple levels. Finally, the target speech corresponding to the target sentence is determined according to the semantic, grammatical, lexical, and information stresses. To account for the limited number of stresses a fixed-length target sentence can carry, the present application uses a pre-trained stress-orchestration model whose inputs are the sentence length (text length) and the semantic, grammatical, lexical, and information stresses of the target sentence; the model infers from the sentence length how many stresses the sentence can carry, and then orchestrates the semantic, grammatical, lexical, and information stresses by weighing their priorities. In this way, information stress is fully considered in the speech synthesis process, and the naturalness, stress accuracy, and information-transmission accuracy of the synthesized target speech are effectively improved.
In addition, the limit on the number of stresses in a fixed-length target text is fully considered in the speech synthesis process: the pre-trained stress-orchestration model weighs the stress priorities and orchestrates the stresses accordingly, which effectively improves the naturalness, stress accuracy, and information-transmission accuracy of the synthesized target speech. This further solves the technical problems in the prior art that information stress is not considered in the speech synthesis process and that the limit on the number of stresses in a fixed-length target text is not considered, so that the synthesized speech has poor naturalness, low stress accuracy, and low accuracy of information transmission.
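The orchestration step can be illustrated with a small sketch. The priority ordering (information > semantic > grammatical > lexical) and the one-stress-per-four-words budget are assumptions made purely for the example; the patent's orchestration model is trained, not hand-written.

```python
# Candidate stresses come from the stress-determination model; the
# orchestrator keeps only as many as the sentence length allows.
PRIORITY = {"informational": 0, "semantic": 1, "grammatical": 2, "lexical": 3}

def accent_budget(num_words: int) -> int:
    # Assumed rule: roughly one stress per four words, at least one
    return max(1, num_words // 4)

def orchestrate(num_words: int, candidates: list) -> list:
    # candidates: (stress_type, word_index) pairs; retain the
    # highest-priority ones within the budget
    ranked = sorted(candidates, key=lambda c: PRIORITY[c[0]])
    return ranked[:accent_budget(num_words)]

kept = orchestrate(8, [("lexical", 1), ("informational", 5), ("semantic", 3)])
```

For an 8-word sentence the assumed budget is two stresses, so the lexical candidate is dropped in favor of the information and semantic stresses.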
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
fig. 1 is a hardware block diagram of a computing device for implementing the method according to embodiment 1 of the present disclosure;
fig. 2 is a schematic flow chart of a speech synthesis method for realizing accent orchestration in combination with the above context according to the first aspect of embodiment 1 of the present disclosure;
fig. 3 is an overall flowchart of a speech synthesis method for realizing accent orchestration in combination with the above context according to embodiment 1 of the present disclosure;
fig. 4a is a schematic flowchart of training an accent determination model according to embodiment 1 of the present disclosure;
fig. 4b is a schematic flow chart of training an emphasis orchestration model according to embodiment 1 of the present disclosure;
fig. 5 is a schematic diagram of a speech synthesis apparatus for implementing accent orchestration in combination with the above context according to embodiment 2 of the present disclosure; and
fig. 6 is a schematic diagram of a speech synthesis apparatus for implementing accent orchestration in conjunction with the above context according to embodiment 3 of the present disclosure.
Detailed Description
To enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are merely some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art from the disclosed embodiments without creative effort shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the terms appearing in the description of the embodiments of the present disclosure are explained as follows:
Term 1: TTS (text to speech), also called text-to-speech technology or speech synthesis.
Term 2: SVM (support vector machine), a binary classification model whose basic form is a maximum-margin linear classifier defined on a feature space.
Term 3: the information-stress principle. Every language has sentence stress and follows the same principle, the information-stress principle: words carrying a large amount of information should be stressed, and words carrying little information need not be. This principle can explain a series of stress phenomena, including focus stress and contrastive stress in Mandarin sentences, and the relations between stress and grammar and between stress and word frequency. Stress can be divided into word stress, phrase stress, and sentence stress; the latter two can jointly be called sentential stress. Word stress is generally stable, but sentential stress is highly variable: in fact, any word can be stressed to express a different emphasis, and different speakers may or may not stress the same word;
the term 4: morphemes, linguistic terms, refer to the smallest combinations of sound and meaning in a language;
the term 5: simple words, in Chinese, a word composed of a morpheme is called a simple word;
the term 6: and the prosodic boundary is a product of prosodic segmentation, namely a boundary generated between prosodic units when the speech blocks are divided and combined during speaking. Prosodic boundaries play an important role in speech interaction, syntactic disambiguation, and in improving the naturalness and intelligibility of speech synthesis.
Example 1
According to the present embodiment, an embodiment of a speech synthesis method that orchestrates stress by taking the preceding context into account is provided. It should be noted that the steps illustrated in the flowcharts of the accompanying drawings may be performed in a computer system, such as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
The method embodiment provided here may be executed on a server or a similar computing device. Fig. 1 shows a block diagram of the hardware architecture of a computing device for implementing the speech synthesis method that orchestrates stress by taking the preceding context into account. As shown in fig. 1, the computing device may include one or more processors (which may include, but are not limited to, processing devices such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory for storing data, and a transmission device for communication. The computing device may further include a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. Those skilled in the art will understand that the structure shown in fig. 1 is merely illustrative and does not limit the structure of the electronic device. For example, the computing device may include more or fewer components than shown in fig. 1, or have a different configuration.
It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computing device. As referred to in the disclosed embodiments, the data processing circuit acts as a processor control (e.g., selection of variable resistance termination paths connected to the interface).
The memory may be configured to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the voice synthesis method for implementing accent coordination in combination with the context in the embodiment of the present disclosure, and the processor may execute various functional applications and data processing by running the software programs and modules stored in the memory, that is, implement the voice synthesis method for implementing accent coordination in combination with the context in the application program. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located from the processor, which may be connected to the computing device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device is used for receiving or transmitting data via a network. Specific examples of such networks may include wireless networks provided by communication providers of the computing devices. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.
It should be noted here that in some alternative embodiments, the computing device shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that FIG. 1 is only one example of a particular specific example and is intended to illustrate the types of components that may be present in a computing device as described above.
In the operating environment described above, according to a first aspect of the present embodiment, there is provided a speech synthesis method for realizing accent orchestration in combination with the above context. Fig. 2 shows a flow diagram of the method, and referring to fig. 2, the method comprises:
S201: acquiring a target sentence for which speech is to be synthesized and the preceding context sentence of the target sentence;
S202: determining multidimensional features of the target sentence according to a preset prediction algorithm, wherein the multidimensional features comprise semantic, syntactic, and lexical features;
S203: determining information features of the target sentence from the text of the context sentence, wherein the information features indicate the information focus to which stress should be assigned in the target sentence;
S204: inputting the multidimensional features and the information features into a preset stress-determination model and outputting multidimensional stress and information stress, wherein the multidimensional stress comprises semantic stress, grammatical stress, and lexical stress; and
S205: determining the target speech corresponding to the target sentence according to the multidimensional stress and the information stress.
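The steps S201 to S205 above can be sketched end to end. Everything here is an illustrative assumption: the feature values are placeholders, and the "new information" heuristic (words absent from the context sentence become focus candidates) stands in for the trained stress-determination model.

```python
from dataclasses import dataclass, field

@dataclass
class Features:
    semantic: dict = field(default_factory=dict)
    syntactic: dict = field(default_factory=dict)
    lexical: dict = field(default_factory=dict)
    info_focus: list = field(default_factory=list)  # word indices of information focus

def extract_multidimensional_features(sentence: str) -> Features:
    # S202: placeholder for the preset prediction algorithm
    words = sentence.split()
    return Features(semantic={"num_arguments": 2},
                    syntactic={"pattern": "subject-predicate"},
                    lexical={"num_words": len(words)})

def extract_information_features(context: str, sentence: str,
                                 feats: Features) -> Features:
    # S203: words absent from the context are treated as new information
    # and become candidate information foci
    context_words = set(context.split())
    feats.info_focus = [i for i, w in enumerate(sentence.split())
                        if w not in context_words]
    return feats

def determine_stress(feats: Features) -> dict:
    # S204: stand-in for the trained stress-determination model
    return {"semantic": [], "grammatical": [], "lexical": [],
            "informational": feats.info_focus}

def synthesize(sentence: str, context: str) -> dict:
    # S201 + S205: end-to-end flow; a real system would render audio here
    feats = extract_multidimensional_features(sentence)
    feats = extract_information_features(context, sentence, feats)
    return determine_stress(feats)

stress = synthesize("the leader told me", "who told you")
```

In this toy run, "told" and "you" already appear in the context, so only the remaining words are marked as information-focus candidates.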
As described in the background section, conventional stress-determination methods simply divide stress into two levels, prosodic-word stress and sentence stress, and do not consider information stress, even though, according to the Focus Stress Assignment rule, the information focus (Focus of information) should be realized by means of stress. Although the stress label of each syllable is generated by combining prosodic-word stress and sentence stress, the variation of stress caused by the preceding context (the context preceding the sentence whose stress is to be predicted) is not considered, such as the addition and loss of stress, stress shift, and the raising and lowering of stress priority. In addition, although some speech synthesis systems refer to contextual information, that information is in essence information about the text to be synthesized (the target text) itself, viewed at a small granularity (characters/words); it is not sentence-level context that treats the target text as a whole, and the influence of the content of the preceding text, i.e., its semantics, on the prosody of the target sentence is ignored.
In view of this, referring to fig. 3, in the speech synthesis process of the target sentence, the speech synthesis method proposed in this embodiment first obtains the target sentence of the speech to be synthesized and the preceding context sentence of the target sentence. For example, in an intelligent customer service system, a target sentence in a dialog system and the previous sentence of the target sentence (corresponding to the preceding context sentence) are obtained. Then, the semantic features, grammatical features, and lexical features of the target sentence are determined according to a preset prediction algorithm. The semantic features mainly capture the semantic relations of the target sentence, including: the number of arguments in the sentence, and whether an argument is an agent, a patient, an instrument, a manner, a goal, etc. For example, if the target sentence is "I ate an apple", the number of arguments within the sentence is two: the argument "I" and the argument "apple". The grammatical features mainly capture the sentence pattern (such as subject-predicate versus non-subject-predicate sentences) and the relations (subject-predicate, coordinate, modifying, complement, etc.) between the grammatical components (words and phrases) in the sentence. The lexical features mainly capture the internal structure of words, including the following sub-features: the number of words, the number of constituent morphemes (single-morpheme/two-morpheme words), whether a word is a simple word ("tiger", "grape"), whether it contains an affix, whether it is a compound word, and whether the morphemes within a word stand in a coordinate relation ("hands and feet"), a subject-predicate relation ("earthquake"), a modifying relation ("white snow", "run quickly"), a complement relation ("finish writing"), etc.
Further, the information feature of the target sentence is determined according to the text information of the preceding context sentence, wherein the information feature is used for indicating the information focus to which stress is to be assigned in the target sentence. Specifically, the information focus of the target sentence is determined from the text information of the preceding context sentence in combination with the semantics of the target sentence. For example, in a dialog system, if the preceding sentence is "Who told you that you don't have to pay?", it can be inferred that the information focus of the target sentence is the agent of "telling you that you don't have to pay". Combined with the target sentence "The one who told me this was your leader", it can then be concluded that "your leader" in the target sentence is the information focus and needs to be assigned stress.
Then, the multidimensional features and the information feature are input into a preset stress determination model, which outputs the multidimensional stresses and the information stress, wherein the multidimensional stresses comprise semantic stress, grammar stress, and vocabulary stress. The preset stress determination model is composed of four SVM models, which respectively predict stress at the semantic level, the grammar level, the vocabulary level, and the information level, outputting the semantic stress, the grammar stress, the vocabulary stress, and the information stress. Finally, the target speech corresponding to the target sentence is determined according to the semantic stress, the grammar stress, the vocabulary stress, and the information stress.
Therefore, in this way, when performing stress analysis on the target sentence, the present embodiment uses the text information of the preceding context sentence to assist in analyzing the stress of the target sentence, so that information stress, which arises in interactive dialog, is brought into the stress system and combined with stresses at the semantic, grammar, and vocabulary levels to form a multi-level, fused treatment of stress. Specifically, the semantic features, grammatical features, lexical features, and information features are first determined through comprehensive analysis of the target sentence and its preceding context sentence. Then, the semantic stress, grammar stress, vocabulary stress, and information stress are determined through the preset stress determination model, thereby determining the positions of the multi-level stresses. Finally, the target speech corresponding to the target sentence is determined according to the semantic stress, grammar stress, vocabulary stress, and information stress. Thus, information stress is fully considered in the process of speech synthesis, effectively improving the naturalness, the stress accuracy, and the information-transmission accuracy of the synthesized target speech. This solves the technical problems in the prior art that, because information stress is not considered in the speech synthesis process, the synthesized speech has poor naturalness, low stress accuracy, and low information-transmission accuracy.
Optionally, the operation of inputting the multidimensional features and the information feature into the preset stress determination model and outputting the multidimensional stresses and the information stress includes: inputting the semantic features into a semantic stress determination model and outputting the semantic stress; inputting the grammatical features into a grammar stress determination model and outputting the grammar stress; inputting the lexical features into a vocabulary stress determination model and outputting the vocabulary stress; and inputting the information features into an information stress determination model and outputting the information stress.
Specifically, the preset stress determination model may be designed as multiple independent per-level stress determination models, including a semantic stress determination model, a grammar stress determination model, a vocabulary stress determination model, and an information stress determination model. Each of these may be a binary classification model (for example, an SVM, support vector machine) that predicts stress from the features of its own dimension. Specifically, after the features of each dimension are vectorized, normalization is performed; the normalized semantic, grammatical, lexical, and information features are respectively input into the semantic, grammar, vocabulary, and information stress determination models, and each corresponding model performs an independent prediction at its own level, so that the semantic stress, grammar stress, vocabulary stress, and information stress of the target sentence are output efficiently and accurately.
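The per-level prediction step described above can be sketched as follows. This is a minimal illustrative stand-in, not the patent's implementation: the patent uses trained SVMs, whereas here each level's classifier is a toy linear scorer, and the feature vectors, weights, and biases are invented for the example. In a real system the normalization statistics would come from the training corpus, and a trained `sklearn.svm.SVC` (or similar) would replace `LinearStressClassifier`.

```python
def normalize(vec):
    """Min-max normalize a feature vector to [0, 1].
    (Illustrative only: real normalization uses corpus-wide statistics.)"""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return [0.0 for _ in vec]
    return [(v - lo) / (hi - lo) for v in vec]

class LinearStressClassifier:
    """Toy stand-in for a trained binary SVM: sign of a linear score."""
    def __init__(self, weights, bias=0.0):
        self.weights, self.bias = weights, bias

    def predict(self, features):
        score = sum(w * f for w, f in zip(self.weights, normalize(features)))
        return 1 if score + self.bias > 0 else 0  # 1 = stressed, 0 = unstressed

# One independent model per level, mirroring the four-model design above.
# Weights/biases are arbitrary placeholders.
models = {
    "semantic":    LinearStressClassifier([0.8, 0.3], bias=-0.5),
    "grammar":     LinearStressClassifier([0.5, 0.5], bias=-0.5),
    "vocabulary":  LinearStressClassifier([0.2, 0.9], bias=-0.5),
    "information": LinearStressClassifier([1.0, 0.1], bias=-0.5),
}

def predict_stresses(features_by_level):
    """features_by_level: {level: list of per-word feature vectors}.
    Each level is predicted independently by its own model."""
    return {
        level: [models[level].predict(f) for f in feats]
        for level, feats in features_by_level.items()
    }
```

Each level is predicted in isolation; reconciling conflicts between the four resulting label sequences is deferred to the stress overall planning step described later.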
Optionally, the operation of determining the target speech corresponding to the target sentence according to the multidimensional stresses and the information stress includes: inputting the sentence length, the multidimensional stresses, and the information stress of the target sentence into a preset stress overall planning model, and determining a stress point planning result corresponding to the target sentence, wherein the preset stress overall planning model is trained based on the sentence lengths, the multidimensional stresses, and the information stresses of sample sentences; determining the acoustic features of the target sentence according to the stress point planning result; and determining the target speech corresponding to the target sentence according to the acoustic features.
As described in the background art, stresses serve emphasis at the syntactic and semantic levels; if a text carries too many stresses, none of them actually emphasizes anything, so the number of stresses in a text should be kept within a certain range. However, existing speech synthesis methods do not consider that the number of stresses in a target text of a given length is limited, so the synthesized speech has poor naturalness, low stress accuracy, and low information-transmission accuracy.
In view of this, as shown in fig. 3, the present embodiment further provides a stress overall planning model dedicated to orchestrating stresses. In a preferred embodiment, since the sentence length affects the number of stresses finally assigned to the text to be synthesized (for example, the longer the sentence, the more stresses it carries), in the process of training the stress overall planning model, as shown in fig. 4b, the sentence length of the sample sentence (computed directly from the text length) together with its information stress, semantic stress, grammar stress, and vocabulary stress may be used as the input x, and the manually annotated final stress point planning result may be used as the output y. (This annotation must be completed at the first labeling pass: the manual stress annotation includes not only the independent semantic, grammar, vocabulary, and information stress labels, but also the final stress labeling result for the text.) The stress overall planning model is thus trained under the constraint on the number of stresses.
Therefore, in the present embodiment, considering that the number of stresses in a target sentence of a given length is limited, a stress overall planning model trained with the sentence length (text length), semantic stress, grammar stress, vocabulary stress, and information stress of the target sentence as input is selected; the model analyzes the number of stresses the target sentence can carry based on its sentence length, and then, comprehensively weighing the stress priorities, orchestrates the semantic stress, grammar stress, vocabulary stress, and information stress. In particular, when stresses at multiple levels conflict, information stress should take priority over the other stresses. In this way, under the condition that the number of stresses (i.e., stress positions) of the target sentence is limited, the pre-trained stress overall planning model can comprehensively weigh the stress priorities and orchestrate the stresses, so as to output a reasonable and accurate stress point planning result; the acoustic features of the target sentence are then determined according to the stress point planning result, and finally the target speech corresponding to the target sentence is determined according to the acoustic features. Because the limited number of stresses in a fixed-length target text is fully considered during speech synthesis, and the pre-trained stress overall planning model comprehensively weighs the stress priorities, the naturalness, the stress accuracy, and the information-transmission accuracy of the synthesized target speech are effectively improved.
This further solves the technical problems in the prior art that, because the limited number of stresses in a fixed-length target text is not considered during speech synthesis, the synthesized speech has poor naturalness, low stress accuracy, and low information-transmission accuracy.
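The construction of one training pair for the stress overall planning model, as described for fig. 4b, could look like the sketch below. All field names, the example sentence, and the toy 0/1 label sequences are assumptions for illustration; the patent does not specify a data format.

```python
def build_training_pair(sample):
    """sample: dict with the sentence's words, one 0/1 stress label per word at
    each level, and the manually annotated final stress plan.
    Returns (x, y) for training the stress overall planning model."""
    x = {
        "sentence_length":    len(sample["words"]),  # text length, computed directly
        "information_stress": sample["information"],
        "semantic_stress":    sample["semantic"],
        "grammar_stress":     sample["grammar"],
        "vocabulary_stress":  sample["vocabulary"],
    }
    y = sample["final_plan"]  # manual annotation: the final stress positions
    return x, y

# Invented example: per-level labels disagree, and the annotated final plan
# keeps the information-stress positions, reflecting its higher priority.
sample = {
    "words":       ["the", "one", "who", "told", "me", "was", "your", "leader"],
    "information": [0, 0, 0, 0, 0, 0, 1, 1],
    "semantic":    [0, 0, 0, 1, 0, 0, 0, 1],
    "grammar":     [0, 0, 0, 0, 0, 0, 0, 1],
    "vocabulary":  [0, 0, 0, 1, 0, 0, 1, 0],
    "final_plan":  [0, 0, 0, 0, 0, 0, 1, 1],
}
x, y = build_training_pair(sample)
```

A supervised model trained on many such (x, y) pairs then learns both how many stresses a sentence of a given length should carry and which candidate stresses to keep.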
Optionally, the operation of determining the multidimensional features of the target sentence according to a preset prediction algorithm includes: performing prosodic boundary prediction on the target sentence, and determining the prosodic units of the target sentence at each level, wherein the prosodic units at each level comprise prosodic words, prosodic phrases, and intonation phrases; and determining the multidimensional features of the target sentence according to the prosodic units at each level.
Specifically, referring to fig. 3, in the operation of determining the multidimensional features of the target sentence according to the preset prediction algorithm, prosodic boundary prediction is first performed on the target sentence, thereby determining the prosodic units of the target sentence at each level. The prosodic units mainly comprise prosodic words, prosodic phrases, and intonation phrases. A prosodic unit is a stretch of speech delimited by prosodic boundaries. For example, if the target sentence is "He bought one jin of peaches" and the prosodic boundaries fall after "He", after "bought", and after "jin", the sentence delimited by the prosodic boundaries takes the form "He | bought | one jin | peaches", so the prosodic units are, in turn, "He", "bought", "one jin", and "peaches". Prosodic units can be divided into three levels, from small to large: prosodic words, prosodic phrases, and intonation phrases. Because the semantic, grammatical, and lexical features are needed when predicting the stresses of the prosodic units at these three levels, the multidimensional features of the target sentence are analyzed and determined from the prosodic units at each level. The multidimensional features comprise the semantic features, grammatical features, and lexical features. In this way, the semantic, grammatical, and lexical features of the target sentence can be determined quickly and accurately.
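Once the boundary positions have been predicted (the boundary predictor itself is a separate trained model, not shown), segmenting the sentence into prosodic units is straightforward. A minimal sketch, using the "He | bought | one jin | peaches" example from above with English tokens standing in for the original Chinese:

```python
def split_prosodic_units(words, boundaries):
    """Split a word sequence into prosodic units.
    boundaries: 0-based word indices after which a prosodic boundary falls."""
    units, start = [], 0
    for b in sorted(boundaries):
        units.append(words[start:b + 1])
        start = b + 1
    if start < len(words):
        units.append(words[start:])  # trailing unit after the last boundary
    return units

# Boundaries after words 0 ("He"), 1 ("bought"), and 3 ("jin").
words = ["He", "bought", "one", "jin", "peaches"]
units = split_prosodic_units(words, [0, 1, 3])
# units -> [["He"], ["bought"], ["one", "jin"], ["peaches"]]
```

The same routine applies at each of the three levels; only the predicted boundary set differs between prosodic words, prosodic phrases, and intonation phrases.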
Optionally, the operation of determining the target speech corresponding to the target sentence according to the multidimensional stresses and the information stress further includes: evaluating and selecting or discarding the multidimensional stresses and the information stress by using a preset stress priority judgment rule in combination with the sentence length of the target sentence; determining the acoustic features of the target sentence according to the result of the evaluation and selection; and determining the target speech corresponding to the target sentence according to the acoustic features.
Specifically, in the present embodiment, in the operation of determining the target speech corresponding to the target sentence according to the multidimensional stresses and the information stress, the multidimensional stresses and the information stress may be evaluated and selected or discarded by using a preset stress priority judgment rule in combination with the sentence length of the target sentence. That is, when orchestrating the stresses at each level, the stress overall planning model may be replaced by priority judgment and selection/rejection rules given in advance, so as to realize the orchestration of the multi-level stresses. Then, the acoustic features of the target sentence are determined based on the result of the evaluation and selection (i.e., the stress orchestration result). Finally, the target speech corresponding to the target sentence is determined according to the acoustic features. In this way, the limited number of stresses in a fixed-length target text is fully considered during speech synthesis, and the preset stress priority judgment rule comprehensively weighs the stress priorities and orchestrates the stresses, effectively improving the naturalness, stress accuracy, and information-transmission accuracy of the synthesized target speech. This further solves the technical problems in the prior art that, because neither information stress nor the limited number of stresses in a fixed-length target text is considered during speech synthesis, the synthesized speech has poor naturalness, low stress accuracy, and low information-transmission accuracy.
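One possible shape for this rule-based alternative is sketched below. The priority order (information first, per the embodiment's statement that information stress takes priority in conflicts) follows the source; the length-based budget formula and the `words_per_stress` parameter are assumptions invented for illustration, since the patent does not give a concrete rule.

```python
PRIORITY = ["information", "semantic", "grammar", "vocabulary"]  # assumed order

def max_stresses(sentence_length, words_per_stress=4):
    """Heuristic cap on the number of stresses: longer sentences allow more.
    (The divisor is an invented parameter, not from the patent.)"""
    return max(1, sentence_length // words_per_stress)

def orchestrate(candidates, sentence_length):
    """candidates: list of (word position, level) stress candidates.
    Keeps candidates in priority order until the length-based budget is spent."""
    budget = max_stresses(sentence_length)
    ranked = sorted(candidates, key=lambda c: PRIORITY.index(c[1]))
    accepted = []
    for pos, level in ranked:
        if pos not in accepted:
            accepted.append(pos)
        if len(accepted) == budget:
            break
    return sorted(accepted)

# 8-word sentence -> budget of 2; the information-level candidate survives first.
plan = orchestrate([(3, "vocabulary"), (7, "information"), (1, "grammar")], 8)
# plan -> [1, 7]
```

The trained overall planning model and this rule set are interchangeable back ends: both consume the same per-level candidates plus sentence length and emit a final stress point plan.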
Optionally, the method further comprises: taking the per-dimension features of a sample sentence as the input x and the manually annotated stress result of the sample sentence as the output y, and respectively training the semantic stress determination model, the grammar stress determination model, the vocabulary stress determination model, and the information stress determination model, wherein the per-dimension features of the sample sentence comprise sample semantic features, sample grammatical features, sample lexical features, and sample information features.
Specifically, as shown in fig. 4a, in the process of training the stress determination models, the features of each dimension of the sample sentence (the semantic, grammatical, lexical, and information features, where the information features distinguish stressed from unstressed material) may be used as the input x, the manually annotated stress result of the sample sentence (including the information stress, semantic stress, grammar stress, and vocabulary stress) is used as the output y, and the stress at each level is predicted independently, thereby realizing the independent training of the semantic, grammar, vocabulary, and information stress determination models.
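The independent per-level training loop can be sketched as follows. To keep the example self-contained and runnable, a trivial threshold learner stands in for the SVM of fig. 4a; the per-level score/label data is invented. In practice each `ThresholdLearner` would be replaced by `sklearn.svm.SVC().fit(...)` on the vectorized, normalized features of that level.

```python
class ThresholdLearner:
    """Toy stand-in for a per-level SVM: learns one cut on a scalar score."""
    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        # Midpoint between the lowest stressed and highest unstressed score.
        self.cut = (min(pos) + max(neg)) / 2 if pos and neg else 0.5
        return self

    def predict(self, x):
        return 1 if x >= self.cut else 0

def train_per_level(data):
    """data: {level: (feature scores, manual 0/1 stress labels)}.
    Each level is trained in isolation, as in fig. 4a."""
    return {level: ThresholdLearner().fit(xs, ys) for level, (xs, ys) in data.items()}

# Invented training data: one scalar score and one manual label per word.
models = train_per_level({
    "semantic":    ([0.1, 0.9, 0.8, 0.2], [0, 1, 1, 0]),
    "information": ([0.3, 0.7],           [0, 1]),
})
```

Because the four models never see each other's features or labels, any one of them can be retrained or swapped out without touching the others; only the downstream orchestration step combines their outputs.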
In addition, the present embodiment may also visualize the result of the context-based stress orchestration produced in the speech synthesis process and draw a speech-stream stress transfer graph.
It should be noted that, although the present application is described in terms of stress prediction for Mandarin Chinese, the process of orchestrating stresses according to the context is not limited to any particular language. For some specific language varieties, dimensions beyond the stress dimensions mentioned in the present application will be required, which is left for later research. For example, minimal pairs distinguished by stress alone are more prominent in Beijing Mandarin than in Standard Mandarin: in the word for "fennel", the final syllable is weakened, with a duration shorter than a full syllable, while the preceding syllable is relatively strong with a longer duration; in the homophonous phrase meaning "return home", by contrast, the two syllables carry comparable weight, each with the duration of a full syllable.
The method and the device are mainly intended for dialog systems, and can also be used for voice broadcast by a single role. The contexts in the two scenarios differ in whether the preceding sentence comes from the same role or a different role, but both are contextual, which is crucial for the determination of information stress.
Therefore, when performing stress analysis on the target sentence, the present application uses the text information of the preceding context sentence to assist in analyzing the information stress of the target sentence, so that information stress, which arises in interactive dialog, is brought into the stress system and combined with stresses at the semantic, grammar, and vocabulary levels to form a multi-level, fused treatment of stress. Specifically, the semantic features, grammatical features, lexical features, and information features are first determined through comprehensive analysis of the target sentence and its preceding context sentence. Then, the semantic stress, grammar stress, vocabulary stress, and information stress are determined through the preset stress determination model, thereby determining the positions of the multi-level stresses. Finally, the target speech corresponding to the target sentence is determined according to these stresses. Moreover, considering that the number of stresses in a target sentence of a given length is limited, the present application selects a stress overall planning model trained with the sentence length (text length), semantic stress, grammar stress, vocabulary stress, and information stress of the target sentence as input; the model analyzes the number of stresses the target sentence can carry from its sentence length, and then, comprehensively weighing the stress priorities, orchestrates the semantic, grammar, vocabulary, and information stresses.
Therefore, by this method, information stress is fully considered in the process of speech synthesis, effectively improving the naturalness, stress accuracy, and information-transmission accuracy of the synthesized target speech. In addition, the limited number of stresses in a fixed-length target text is fully considered during speech synthesis, and the pre-trained stress overall planning model comprehensively weighs the stress priorities and orchestrates the stresses, which likewise improves the naturalness, stress accuracy, and information-transmission accuracy of the synthesized target speech. This solves the technical problems in the prior art that, because neither information stress nor the limited number of stresses in a fixed-length target text is considered during speech synthesis, the synthesized speech has poor naturalness, low stress accuracy, and low information-transmission accuracy.
Further, referring to fig. 1, according to a second aspect of the present embodiment, there is provided a storage medium. The storage medium comprises a stored program, wherein, when the program is run, a processor performs any of the methods described above.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
Fig. 5 shows a speech synthesis apparatus 500 for implementing stress overall planning in combination with the preceding context according to the present embodiment, which apparatus 500 corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 5, the apparatus 500 includes: an obtaining module 510, configured to obtain a target sentence of the speech to be synthesized and the preceding context sentence of the target sentence; a multidimensional feature determining module 520, configured to determine the multidimensional features of the target sentence according to a preset prediction algorithm, where the multidimensional features include semantic features, grammatical features, and lexical features; an information feature determining module 530, configured to determine the information feature of the target sentence according to the text information of the preceding context sentence, where the information feature is used to indicate the information focus to which stress is to be assigned in the target sentence; a stress determination module 540, configured to input the multidimensional features and the information feature into a preset stress determination model and output the multidimensional stresses and the information stress, where the multidimensional stresses include semantic stress, grammar stress, and vocabulary stress; and a target speech determination module 550, configured to determine the target speech corresponding to the target sentence according to the multidimensional stresses and the information stress.
Optionally, the preset stress determination models include a semantic stress determination model, a grammar stress determination model, a vocabulary stress determination model, and an information stress determination model, and the stress determination module 540 includes: a semantic stress determining submodule, configured to input the semantic features into the semantic stress determination model and output the semantic stress; a grammar stress determining submodule, configured to input the grammatical features into the grammar stress determination model and output the grammar stress; a vocabulary stress determining submodule, configured to input the lexical features into the vocabulary stress determination model and output the vocabulary stress; and an information stress determining submodule, configured to input the information features into the information stress determination model and output the information stress.
Optionally, the target speech determination module 550 includes: a stress overall planning submodule, configured to input the sentence length, the multidimensional stresses, and the information stress of the target sentence into a preset stress overall planning model and determine the stress point planning result corresponding to the target sentence, where the preset stress overall planning model is trained based on the sentence lengths, the multidimensional stresses, and the information stresses of sample sentences; a first acoustic feature determining submodule, configured to determine the acoustic features of the target sentence according to the stress point planning result; and a first target speech determining submodule, configured to determine the target speech corresponding to the target sentence according to the acoustic features.
Optionally, the multidimensional feature determination module 520 includes: a prosodic boundary prediction submodule, configured to perform prosodic boundary prediction on the target sentence and determine the prosodic units of the target sentence at each level, where the prosodic units at each level include prosodic words, prosodic phrases, and intonation phrases; and a multidimensional feature determining submodule, configured to determine the multidimensional features of the target sentence according to the prosodic units at each level.
Optionally, the target speech determining module 550 further includes: a stress priority judging submodule, configured to evaluate and select or discard the multidimensional stresses and the information stress by using a preset stress priority judgment rule in combination with the sentence length of the target sentence; a second acoustic feature determining submodule, configured to determine the acoustic features of the target sentence according to the result of the evaluation and selection; and a second target speech determining submodule, configured to determine the target speech corresponding to the target sentence according to the acoustic features.
Optionally, the apparatus 500 further comprises: a stress determination model training module, configured to respectively train the semantic stress determination model, the grammar stress determination model, the vocabulary stress determination model, and the information stress determination model by taking the per-dimension features of a sample sentence as the input x and the manually annotated stress result of the sample sentence as the output y, where the per-dimension features of the sample sentence include sample semantic features, sample grammatical features, sample lexical features, and sample information features.
Optionally, the apparatus 500 further comprises: a stress overall planning model training module, configured to train the stress overall planning model by taking the sentence length of a sample sentence together with its information stress, semantic stress, grammar stress, and vocabulary stress as the input x, and the manually annotated stress planning result of the sample sentence as the output y, where the manually annotated stress planning result includes the manually annotated semantic stress, grammar stress, and vocabulary stress as well as the final stress labeling result of the sample sentence.
Therefore, according to the present embodiment, when performing stress analysis on the target sentence, the text information of the preceding context sentence is used to assist in analyzing the information stress of the target sentence, so that information stress, which arises in dialog, is brought into the stress system and combined with stresses at the semantic, grammar, and vocabulary levels to form a multi-level, fused treatment of stress. Specifically, the semantic features, grammatical features, lexical features, and information features are first determined through comprehensive analysis of the target sentence and its preceding context sentence. Then, the semantic stress, grammar stress, vocabulary stress, and information stress are determined through the preset stress determination model, thereby determining the positions of the multi-level stresses. Finally, the target speech corresponding to the target sentence is determined according to these stresses. Moreover, considering that the number of stresses in a target sentence of a given length is limited, the present application selects a stress overall planning model trained with the sentence length (text length), semantic stress, grammar stress, vocabulary stress, and information stress of the target sentence as input; the model analyzes the number of stresses the target sentence can carry from its sentence length, and then, comprehensively weighing the stress priorities, orchestrates the semantic, grammar, vocabulary, and information stresses.
Therefore, by this method, information stress is fully considered in the process of speech synthesis, effectively improving the naturalness, stress accuracy, and information-transmission accuracy of the synthesized target speech. In addition, the limited number of stresses in a fixed-length target text is fully considered during speech synthesis, and the pre-trained stress overall planning model comprehensively weighs the stress priorities and orchestrates the stresses, which likewise improves the naturalness, stress accuracy, and information-transmission accuracy of the synthesized target speech. This further solves the technical problems in the prior art that, because neither information stress nor the limited number of stresses in a fixed-length target text is considered during speech synthesis, the synthesized speech has poor naturalness, low stress accuracy, and low information-transmission accuracy.
Embodiment 3
Fig. 6 shows a speech synthesis apparatus 600 according to the present embodiment for implementing stress overall planning in combination with the preceding context, the apparatus 600 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 6, the apparatus 600 includes: a processor 610; and a memory 620 coupled to the processor 610 and configured to provide the processor 610 with instructions for the following processing steps: acquiring a target sentence for which speech is to be synthesized and a preceding context sentence of the target sentence; determining multidimensional features of the target sentence according to a preset prediction algorithm, wherein the multidimensional features include semantic features, syntactic features and vocabulary features; determining information features of the target sentence according to the text information of the preceding context sentence, wherein the information features indicate the information focus to which a stress is to be assigned in the target sentence; inputting the multidimensional features and the information features into preset stress determination models and outputting multidimensional stress and information stress, wherein the multidimensional stress includes semantic stress, grammatical stress and vocabulary stress; and determining the target speech corresponding to the target sentence according to the multidimensional stress and the information stress.
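The information-feature step above, which derives an information focus from the preceding context sentence, can be illustrated with a minimal given/new heuristic. This is only a sketch under assumptions (the function name and the heuristic itself are illustrative), not the patent's actual prediction algorithm:

```python
def information_focus(target_sentence: str, context_sentence: str) -> list:
    """Flag words of the target sentence that do not occur in the preceding
    context sentence as 'new information': candidate positions for the
    information stress. A simplified given/new heuristic for illustration."""
    given = set(context_sentence.lower().split())
    return [word.lower() not in given for word in target_sentence.split()]


# "peaches" is new relative to the context, so it attracts the information
# focus; "like" is given information and does not.
print(information_focus("I like peaches", "Do you like apples"))  # [True, False, True]
```

In the claimed method, such a focus vector would play the role of the information feature that is fed to the information stress determination model.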
Optionally, the preset stress determination models include a semantic stress determination model, a grammatical stress determination model, a vocabulary stress determination model and an information stress determination model, and the operation of inputting the multidimensional features and the information features into the preset stress determination models and outputting the multidimensional stress and the information stress includes: inputting the semantic features into the semantic stress determination model and outputting the semantic stress; inputting the grammatical features into the grammatical stress determination model and outputting the grammatical stress; inputting the vocabulary features into the vocabulary stress determination model and outputting the vocabulary stress; and inputting the information features into the information stress determination model and outputting the information stress.
Optionally, the operation of determining the target speech corresponding to the target sentence according to the multidimensional stress and the information stress includes: inputting the sentence length of the target sentence, the multidimensional stress and the information stress into a preset stress overall-planning model, and determining a stress planning result corresponding to the target sentence, wherein the preset stress overall-planning model is trained based on the sentence length, the sample multidimensional stress and the sample information stress of sample sentences; determining acoustic features of the target sentence according to the stress planning result; and determining the target speech corresponding to the target sentence according to the acoustic features.
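As a rough, rule-based stand-in for the trained stress overall-planning model described above, the length-dependent stress budget and priority-based selection could look as follows. The budget of roughly one stress per five words and the priority ordering are assumptions for illustration, not values from the patent:

```python
def plan_accents(sentence_length, candidates,
                 priority=("information", "semantic", "grammar", "vocabulary")):
    """Cap the number of stresses of a fixed-length sentence, then keep the
    highest-priority candidates. Mimics what the overall-planning model is
    described as learning from sentence length and per-type stresses.

    candidates: list of (position, stress_type) pairs.
    """
    budget = max(1, sentence_length // 5)          # assumed stress budget
    rank = {t: i for i, t in enumerate(priority)}  # lower rank = higher priority
    kept = sorted(candidates, key=lambda c: rank[c[1]])[:budget]
    return sorted(p for p, _ in kept)              # planned stress positions


cands = [(2, "vocabulary"), (5, "information"), (7, "grammar"), (9, "semantic")]
print(plan_accents(10, cands))  # [5, 9]
```

With a 10-word sentence, the assumed budget allows two stresses, so the information and semantic candidates win and the grammar and vocabulary candidates are rejected.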
Optionally, the operation of determining the multidimensional features of the target sentence according to the preset prediction algorithm includes: performing prosodic boundary prediction on the target sentence and determining prosodic units of each level of the target sentence, wherein the prosodic units of each level include prosodic words, prosodic phrases and intonation phrases; and determining the multidimensional features of the target sentence according to the prosodic units of each level.
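The three-level prosodic hierarchy named above (prosodic word, prosodic phrase, intonation phrase) can be represented as nested spans from which simple multidimensional features are read off. The segmentation below is hand-written for illustration, not the output of the claimed boundary predictor:

```python
# One intonation phrase containing two prosodic phrases, each split into
# prosodic words (an assumed, hand-segmented example).
intonation_phrase = {
    "text": "I really like eating peaches",
    "prosodic_phrases": [
        {"text": "I really like", "prosodic_words": ["I", "really like"]},
        {"text": "eating peaches", "prosodic_words": ["eating peaches"]},
    ],
}


def prosodic_word_count(ip):
    """Count prosodic words across all prosodic phrases: one simple
    structural feature derivable from the predicted prosodic units."""
    return sum(len(pp["prosodic_words"]) for pp in ip["prosodic_phrases"])


print(prosodic_word_count(intonation_phrase))  # 3
```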
Optionally, the operation of determining the target speech corresponding to the target sentence according to the multidimensional stress and the information stress further includes: accepting or rejecting the multidimensional stress and the information stress according to a preset stress priority judgment rule in combination with the sentence length of the target sentence; determining acoustic features of the target sentence according to the acceptance and rejection result; and determining the target speech corresponding to the target sentence according to the acoustic features.
Optionally, the memory 620 is further configured to provide the processor 610 with instructions for the following processing steps: training the semantic stress determination model, the grammatical stress determination model, the vocabulary stress determination model and the information stress determination model respectively, by taking the features of each dimension of a sample sentence as input x and taking the manually labelled stress result of the sample sentence as output y, wherein the features of each dimension of the sample sentence comprise a sample semantic feature, a sample grammatical feature, a sample vocabulary feature and a sample information feature.
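A minimal sketch of this per-dimension training setup, substituting a toy threshold classifier for the SVM-style models mentioned in the description; all feature values and labels below are made up for illustration:

```python
class StressModel:
    """Toy per-dimension stress classifier: learns a threshold on a scalar
    feature from (feature, label) pairs. A stand-in for the real models;
    assumes the training data contains both labels."""

    def fit(self, x, y):
        pos = [xi for xi, yi in zip(x, y) if yi == 1]
        neg = [xi for xi, yi in zip(x, y) if yi == 0]
        self.threshold = (min(pos) + max(neg)) / 2
        return self

    def predict(self, x):
        return [int(xi >= self.threshold) for xi in x]


# Train one independent model per stress dimension, as the text describes:
# per-dimension features of sample sentences as input x, manually labelled
# stress as output y (toy data).
dims = {
    "semantic":    ([0.1, 0.9, 0.8, 0.2], [0, 1, 1, 0]),
    "grammar":     ([0.3, 0.7, 0.6, 0.1], [0, 1, 1, 0]),
    "vocabulary":  ([0.2, 0.8, 0.9, 0.3], [0, 1, 1, 0]),
    "information": ([0.4, 0.6, 0.7, 0.2], [0, 1, 1, 0]),
}
models = {d: StressModel().fit(x, y) for d, (x, y) in dims.items()}
print(models["semantic"].predict([0.95, 0.05]))  # [1, 0]
```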
Optionally, the memory 620 is further configured to provide the processor 610 with instructions to process the following processing steps:
training the stress overall-planning model by taking the sentence length of a sample sentence and the information stress, semantic stress, grammatical stress and vocabulary stress of the sample sentence as input x, and taking the manually labelled stress planning result of the sample sentence as output y, wherein the manually labelled stress planning result comprises the manually labelled semantic stress, grammatical stress and vocabulary stress, and the final stress labelling result of the sample sentence.
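The (x, y) training pairs described for the stress overall-planning model can be assembled as in the following sketch; the field names and sample values are assumptions for illustration only:

```python
def make_training_pair(sample):
    """Assemble one (x, y) pair for the stress overall-planning model:
    x bundles the sentence length with the four per-type stress position
    lists, y is the manually labelled final stress plan."""
    x = {
        "length": sample["length"],
        "semantic": sample["semantic_accents"],
        "grammar": sample["grammar_accents"],
        "vocabulary": sample["vocabulary_accents"],
        "information": sample["information_accents"],
    }
    y = sample["final_accents"]  # manually labelled overall-planning result
    return x, y


sample = {
    "length": 10,
    "semantic_accents": [9], "grammar_accents": [7],
    "vocabulary_accents": [2], "information_accents": [5],
    "final_accents": [5, 9],  # annotator kept information + semantic stress
}
x, y = make_training_pair(sample)
print(x["length"], y)  # prints: 10 [5, 9]
```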
Therefore, according to this embodiment, when accent analysis is performed on the target sentence, the text information of the preceding context sentence is used to assist in analyzing the information stress of the target sentence, so that information stress oriented to dialogue is brought into the stress system and combined with the hierarchical stresses such as semantic stress, grammatical stress and vocabulary stress, forming a multi-level, fused treatment of stress. Specifically, semantic features, grammatical features, vocabulary features and information features are first determined through a comprehensive analysis of the target sentence and its preceding context sentences. Then, semantic stress, grammatical stress, vocabulary stress and information stress are determined through preset stress determination models, thereby locating the stresses at each level. Finally, the target speech corresponding to the target sentence is determined according to the semantic stress, grammatical stress, vocabulary stress and information stress. Considering that the number of stresses in a target sentence of fixed length is limited, the present application employs a pre-trained stress overall-planning model: the sentence length (text length), semantic stress, grammatical stress, vocabulary stress and information stress of the target sentence are fed into the model, the model infers the permissible number of stresses from the sentence length, and the semantic, grammatical, vocabulary and information stresses are then planned as a whole by comprehensively weighing their stress priorities.
Therefore, by the above method, information stress is fully taken into account in the speech synthesis process, which effectively improves the naturalness, stress accuracy and information-transmission accuracy of the synthesized target speech. In addition, the fact that the stresses in a fixed-length target text are limited is fully considered during synthesis: the pre-trained stress overall-planning model comprehensively weighs stress priorities and performs stress overall planning, further improving the naturalness, stress accuracy and information-transmission accuracy of the synthesized target speech. This solves the technical problems in the prior art that neither information stress nor the limit on stresses in a fixed-length target text is considered during speech synthesis, resulting in synthesized speech with poor naturalness, low stress accuracy and low information-transmission accuracy.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and amendments can be made without departing from the principle of the present invention, and these modifications and amendments should also be considered as the protection scope of the present invention.
Claims (10)
1. A speech synthesis method for realizing stress overall planning in combination with the preceding context, characterized by comprising the following steps:
acquiring a target sentence for which speech is to be synthesized and a preceding context sentence of the target sentence;
determining multidimensional features of the target sentence according to a preset prediction algorithm, wherein the multidimensional features comprise semantic features, syntactic features and vocabulary features;
determining information features of the target sentence according to text information of the preceding context sentence, wherein the information features indicate an information focus to which a stress is to be assigned in the target sentence;
inputting the multidimensional features and the information features into preset stress determination models, and outputting multidimensional stress and information stress, wherein the multidimensional stress comprises semantic stress, grammatical stress and vocabulary stress; and
determining target speech corresponding to the target sentence according to the multidimensional stress and the information stress.
2. The method of claim 1, wherein the preset stress determination models comprise a semantic stress determination model, a grammatical stress determination model, a vocabulary stress determination model and an information stress determination model, and the operation of inputting the multidimensional features and the information features into the preset stress determination models and outputting the multidimensional stress and the information stress comprises:
inputting the semantic features into the semantic stress determination model and outputting the semantic stress;
inputting the grammatical features into the grammatical stress determination model and outputting the grammatical stress;
inputting the vocabulary features into the vocabulary stress determination model and outputting the vocabulary stress; and
inputting the information features into the information stress determination model and outputting the information stress.
3. The method of claim 1, wherein the operation of determining the target speech corresponding to the target sentence according to the multidimensional stress and the information stress comprises:
inputting the sentence length of the target sentence, the multidimensional stress and the information stress into a preset stress overall-planning model, and determining a stress planning result corresponding to the target sentence, wherein the preset stress overall-planning model is trained based on the sentence length, the sample multidimensional stress and the sample information stress of sample sentences;
determining acoustic features of the target sentence according to the stress planning result; and
determining the target speech corresponding to the target sentence according to the acoustic features.
4. The method of claim 1, wherein the operation of determining the multidimensional features of the target sentence according to the preset prediction algorithm comprises:
performing prosodic boundary prediction on the target sentence and determining prosodic units of each level of the target sentence, wherein the prosodic units of each level comprise prosodic words, prosodic phrases and intonation phrases; and
determining the multidimensional features of the target sentence according to the prosodic units of each level.
5. The method of claim 1, wherein the operation of determining the target speech corresponding to the target sentence according to the multidimensional stress and the information stress further comprises:
accepting or rejecting the multidimensional stress and the information stress according to a preset stress priority judgment rule in combination with the sentence length of the target sentence;
determining acoustic features of the target sentence according to the acceptance and rejection result; and
determining the target speech corresponding to the target sentence according to the acoustic features.
6. The method of claim 2, further comprising:
training the semantic stress determination model, the grammatical stress determination model, the vocabulary stress determination model and the information stress determination model respectively, by taking the features of each dimension of a sample sentence as input x and taking the manually labelled stress result of the sample sentence as output y, wherein the features of each dimension of the sample sentence comprise a sample semantic feature, a sample grammatical feature, a sample vocabulary feature and a sample information feature.
7. The method of claim 3, further comprising:
training the stress overall-planning model by taking the sentence length of a sample sentence and the information stress, semantic stress, grammatical stress and vocabulary stress of the sample sentence as input x, and taking the manually labelled stress planning result of the sample sentence as output y, wherein the manually labelled stress planning result comprises the manually labelled semantic stress, grammatical stress and vocabulary stress, and the final stress labelling result of the sample sentence.
8. A storage medium comprising a stored program, wherein when the program runs, a processor is caused to perform the method of any one of claims 1 to 7.
9. A speech synthesis apparatus for realizing stress overall planning in combination with the preceding context, comprising:
an acquisition module, configured to acquire a target sentence for which speech is to be synthesized and a preceding context sentence of the target sentence;
a multidimensional feature determination module, configured to determine multidimensional features of the target sentence according to a preset prediction algorithm, wherein the multidimensional features comprise semantic features, syntactic features and vocabulary features;
an information feature determination module, configured to determine information features of the target sentence according to text information of the preceding context sentence, wherein the information features indicate an information focus to which a stress is to be assigned in the target sentence;
a stress determination module, configured to input the multidimensional features and the information features into preset stress determination models and output multidimensional stress and information stress, wherein the multidimensional stress comprises semantic stress, grammatical stress and vocabulary stress; and
a target speech determination module, configured to determine target speech corresponding to the target sentence according to the multidimensional stress and the information stress.
10. A speech synthesis apparatus for realizing stress overall planning in combination with the preceding context, comprising:
a processor; and
a memory coupled to the processor and configured to provide the processor with instructions for the following processing steps:
acquiring a target sentence for which speech is to be synthesized and a preceding context sentence of the target sentence;
determining multidimensional features of the target sentence according to a preset prediction algorithm, wherein the multidimensional features comprise semantic features, syntactic features and vocabulary features;
determining information features of the target sentence according to text information of the preceding context sentence, wherein the information features indicate an information focus to which a stress is to be assigned in the target sentence;
inputting the multidimensional features and the information features into preset stress determination models, and outputting multidimensional stress and information stress, wherein the multidimensional stress comprises semantic stress, grammatical stress and vocabulary stress; and
determining target speech corresponding to the target sentence according to the multidimensional stress and the information stress.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110455076.3A CN115249472B (en) | 2021-04-26 | 2021-04-26 | Speech synthesis method and device for realizing accent overall planning by combining with above context |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115249472A true CN115249472A (en) | 2022-10-28 |
CN115249472B CN115249472B (en) | 2024-09-27 |
Family
ID=83696474
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110455076.3A Active CN115249472B (en) | 2021-04-26 | 2021-04-26 | Speech synthesis method and device for realizing accent overall planning by combining with above context |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115249472B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12073822B2 (en) | 2021-12-23 | 2024-08-27 | Beijing Baidu Netcom Science Technology Co., Ltd. | Voice generating method and apparatus, electronic device and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101000764A (en) * | 2006-12-18 | 2007-07-18 | 黑龙江大学 | Speech synthetic text processing method based on rhythm structure |
WO2009021183A1 (en) * | 2007-08-08 | 2009-02-12 | Lessac Technologies, Inc. | System-effected text annotation for expressive prosody in speech synthesis and recognition |
KR20100085433A (en) * | 2009-01-20 | 2010-07-29 | 주식회사 보이스웨어 | High quality voice synthesizing method using multiple target prosody |
CN102254554A (en) * | 2011-07-18 | 2011-11-23 | 中国科学院自动化研究所 | Method for carrying out hierarchical modeling and predicating on mandarin accent |
CN112002302A (en) * | 2020-07-27 | 2020-11-27 | 北京捷通华声科技股份有限公司 | Speech synthesis method and device |
CN112331176A (en) * | 2020-11-03 | 2021-02-05 | 北京有竹居网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
Non-Patent Citations (1)
Title |
---|
朱维彬 (ZHU Weibin): "A Chinese Speech Synthesis System Supporting Stress Synthesis" (支持重音合成的汉语语音合成系统), Journal of Chinese Information Processing, no. 03, 15 May 2007 (2007-05-15) *
Also Published As
Publication number | Publication date |
---|---|
CN115249472B (en) | 2024-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107924394B (en) | Natural language processor for providing natural language signals in natural language output | |
CN112309366B (en) | Speech synthesis method, speech synthesis device, storage medium and electronic equipment | |
KR20210070891A (en) | Method and apparatus for evaluating translation quality | |
CN103714048B (en) | Method and system for correcting text | |
CN106710592A (en) | Speech recognition error correction method and speech recognition error correction device used for intelligent hardware equipment | |
CN114580382A (en) | Text error correction method and device | |
CN112331176B (en) | Speech synthesis method, speech synthesis device, storage medium and electronic equipment | |
CN110010136B (en) | Training and text analysis method, device, medium and equipment for prosody prediction model | |
CN110852075B (en) | Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium | |
CN1495641B (en) | Method and device for converting speech character into text character | |
CN112466279B (en) | Automatic correction method and device for spoken English pronunciation | |
CN112309367B (en) | Speech synthesis method, speech synthesis device, storage medium and electronic equipment | |
CN111916062B (en) | Voice recognition method, device and system | |
Álvarez et al. | Towards customized automatic segmentation of subtitles | |
CN112463942A (en) | Text processing method and device, electronic equipment and computer readable storage medium | |
CN110503956A (en) | Audio recognition method, device, medium and electronic equipment | |
WO2023045186A1 (en) | Intention recognition method and apparatus, and electronic device and storage medium | |
CN114254649A (en) | Language model training method and device, storage medium and equipment | |
CN105895076B (en) | A kind of phoneme synthesizing method and system | |
CN115249472B (en) | Speech synthesis method and device for realizing accent overall planning by combining with above context | |
Coto‐Solano | Computational sociophonetics using automatic speech recognition | |
CN109872718A (en) | The answer acquisition methods and device of voice data, storage medium, computer equipment | |
KR20120045906A (en) | Apparatus and method for correcting error of corpus | |
Zine et al. | Towards a high-quality lemma-based text to speech system for the Arabic language | |
CN114373445B (en) | Voice generation method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |