CN110444191B - Prosodic hierarchy labeling method, model training method and device

Prosodic hierarchy labeling method, model training method and device

Info

Publication number: CN110444191B
Authority: CN (China)
Prior art keywords: word, trained, feature set, text, target
Legal status: Active
Application number: CN201910751371.6A
Other languages: Chinese (zh)
Other versions: CN110444191A (en)
Inventors: 吴志勇, 杜耀, 康世胤, 苏丹, 俞栋
Current Assignee: Tencent Technology Shenzhen Co Ltd; Shenzhen Graduate School Tsinghua University
Original Assignee: Tencent Technology Shenzhen Co Ltd; Shenzhen Graduate School Tsinghua University
Application filed by Tencent Technology Shenzhen Co Ltd and Shenzhen Graduate School Tsinghua University
Priority to CN201910751371.6A
Publication of CN110444191A
Application granted
Publication of CN110444191B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation

Abstract

The application discloses a prosodic hierarchy labeling method applied in the field of artificial intelligence, in particular to speech synthesis. The method includes: acquiring text data to be labeled and audio data, where the text data to be labeled and the audio data correspond to each other; extracting a text feature set to be labeled for each word from the text data to be labeled; extracting an acoustic feature set for each word from the audio data; and obtaining a prosodic hierarchy through a prosodic hierarchy labeling model according to the word identifier, the text feature set to be labeled, and the acoustic feature set of each word. The application also discloses a model training method, a prosodic hierarchy labeling apparatus, and a model training apparatus. Because the prosodic hierarchy labeling model is built by combining text features and acoustic features, richer features are available for prosodic hierarchy labeling, which improves labeling accuracy and the quality of speech synthesis.

Description

Prosodic hierarchy labeling method, model training method and device
This application is a divisional application of Chinese patent application No. 201910060152.3, entitled "Prosodic hierarchy labeling method, model training method and device", filed with the Chinese Patent Office on January 22, 2019.
Technical Field
The present application relates to the field of intelligent speech synthesis, and in particular, to a prosodic hierarchy labeling method, a model training method, and a related apparatus.
Background
To build a high-quality speech synthesis system, it is important to accurately label massive amounts of data with a prosodic hierarchy. The prosodic hierarchy models the rhythm and pauses of speech, so a method that can accurately and automatically label the prosodic hierarchy is of great significance for quickly constructing a speech synthesis corpus and improving the naturalness of synthesized speech.
At present, automatic labeling of the prosodic hierarchy requires training an automatic labeling model with machine learning, and feature selection mainly falls into two categories. One category uses text features: the text is first segmented into words, the text features of each word are extracted, and the prosodic hierarchy type of each word is judged with a machine learning method. The other category uses acoustic features: it mainly relies on detecting pause positions in the audio and distinguishes prosodic hierarchy types according to pause duration.
However, in practice, a labeling task that uses only text data ignores phenomena such as the lengthening of the syllable before a prosodic hierarchy boundary and the short pause that often accompanies an intonation phrase boundary, while using only acoustic features makes it difficult to label all three prosodic levels accurately at the same time. Both approaches ignore the intrinsic relation between text features and acoustic features, which degrades the prosodic hierarchy labeling and affects the quality of the corpus on which speech synthesis relies.
Disclosure of Invention
The embodiment of the application provides a prosody hierarchy labeling method, a model training method and a prosody hierarchy labeling device.
In view of the above, a first aspect of the present application provides a method for prosody hierarchy annotation, including:
acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be annotated of each word according to the text data to be annotated, wherein the text feature set to be annotated comprises part of speech, word length and post-word punctuation types;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
and acquiring a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word.
A second aspect of the present application provides a method of model training, comprising:
acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be trained of each word according to the text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
and training the word identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure.
A third aspect of the present application provides a prosodic hierarchy labeling apparatus, including:
the system comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring text data to be labeled and audio data, the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
the extraction module is used for extracting a text feature set to be labeled of each word according to the text data to be labeled obtained by the obtaining module, wherein the text feature set to be labeled comprises part of speech, word length and word post punctuation types;
the extraction module is further configured to extract an acoustic feature set of each word according to the audio data acquired by the acquisition module, where the acoustic feature set includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value;
and the prediction module is used for acquiring a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word extracted by the extraction module and the acoustic feature set of each word.
In one possible design, in a first implementation of the third aspect of an embodiment of the present application,
the prediction module is specifically configured to determine at least one of a prosodic word, a prosodic phrase, and an intonation phrase through the prosodic hierarchy labeling model;
or, alternatively,
and determining prosodic words and/or prosodic phrases through the prosodic hierarchy annotation model.
A fourth aspect of the present application provides a model training apparatus, comprising:
the training device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring text data to be trained and audio data to be trained, the text data to be trained and the audio data to be trained have a corresponding relation, the text data to be trained comprises at least one word, and each word corresponds to a word identifier;
the extraction module is used for extracting a text feature set to be trained of each word according to the text data to be trained acquired by the acquisition module, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation types;
the extraction module is further configured to extract an acoustic feature set to be trained of each word according to the audio data to be trained acquired by the acquisition module, where the acoustic feature set to be trained includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value;
and the training module is used for training the word identifier corresponding to each word, the text feature set to be trained of each word extracted by the extraction module and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy.
In one possible design, in a first implementation manner of the fourth aspect of the embodiment of the present application, the model training apparatus further includes a processing module and a generating module;
the processing module is used for performing word segmentation processing on the text data to be trained after the acquisition module acquires the text data to be trained and the audio data to be trained to obtain at least one word;
the obtaining module is further configured to obtain a target word identifier corresponding to a target word according to a preset word identifier relationship, where the preset word identifier relationship is used to indicate a relationship between each preset word and a word identifier, and the target word belongs to any one of the at least one word processed by the processing module;
the generating module is used for generating a target word vector corresponding to the target word in the text data to be trained;
the training module is specifically configured to train the target word identifier obtained by the obtaining module and the target word vector generated by the generating module to obtain a first model parameter, where the first model parameter is used to generate a word embedding layer in the prosody level labeling model.
In one possible design, in a second implementation of the fourth aspect of the embodiments of the present application,
the extraction module is specifically configured to acquire a part of speech, a word length, and a post-word punctuation type of a target word in the text data to be trained, where the part of speech indicates a result of grammar classification of the word, the word length indicates a word number of the word, and the post-word punctuation type is used to indicate a punctuation type corresponding to the post-word;
acquiring the part of speech, word length and post-word punctuation types of associated words in the text data to be trained, wherein the associated words are words having an association relation with the target words;
the training module is specifically configured to train the part of speech, the word length, and the post-word punctuation type of the target word and the part of speech, the word length, and the post-word punctuation type of the associated word to obtain a second model parameter, where the second model parameter is used to generate a text neural network in the prosody level labeling model.
In one possible design, in a third implementation manner of the fourth aspect of the embodiment of the present application, the model training apparatus further includes an alignment module;
the alignment module is used for forcibly aligning the text data to be trained and the audio data to be trained after the acquisition module acquires the text data to be trained and the audio data to be trained to obtain a time-aligned text;
the extraction module is specifically configured to determine the word end syllable duration of the target word according to the time alignment text.
In one possible design, in a fourth implementation of the fourth aspect of the embodiment of the present application,
the extraction module is specifically configured to determine post-word pause duration of the target word according to the time alignment text.
In one possible design, in a fifth implementation form of the fourth aspect of the embodiments of the present application,
the extraction module is specifically configured to calculate a frame number of a final syllable voiced initial frame and a frame number of a voiced end frame of the target word according to the time alignment text and fundamental frequency information extracted from the audio data to be trained;
extracting a logarithmic fundamental frequency curve and a logarithmic energy curve of the audio data to be trained;
and calculating the acoustic statistical characteristics of the end-of-speech syllables of the target word according to the frame number of the end-of-speech syllable voiced initial frame of the target word, the frame number of the voiced end frame, the logarithmic fundamental frequency curve and the logarithmic energy curve, wherein the acoustic statistical characteristics of the end-of-speech syllables comprise at least one of the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic fundamental frequency curve, and the acoustic statistical characteristics of the end-of-speech syllables further comprise at least one of the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic energy curve.
In one possible design, in a sixth implementation form of the fourth aspect of the embodiment of the present application,
the extraction module is specifically configured to calculate, according to the time-aligned text and fundamental frequency information extracted from the audio data to be trained, a frame number of a last voiced frame of the target word and a frame number of a voiced frame of a next adjacent word prefix of the target word;
determining a fundamental frequency value and an energy value between the end voiced frame of the target word and the next adjacent word beginning voiced frame according to the frame number of the last voiced frame of the target word, the frame number of the voiced frame of the next adjacent word beginning of the target word, and fundamental frequency information and energy information which are extracted from the audio data to be trained in a framing manner;
and calculating to obtain a logarithmic difference value of the fundamental frequency value according to the fundamental frequency value between the suffix voiced frame of the target word and the next adjacent word prefix voiced frame, and calculating to obtain a logarithmic difference value of the energy value according to the energy value between the suffix voiced frame of the target word and the next adjacent word prefix voiced frame, wherein the logarithmic difference value of the fundamental frequency value and the logarithmic difference value of the energy value belong to the acoustic feature change value between words.
In one possible design, in a seventh implementation form of the fourth aspect of the embodiment of the present application,
the training module is specifically configured to obtain a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, where the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter;
acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter;
training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model;
and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
A fifth aspect of the present application provides a prosodic hierarchy labeling apparatus, including: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory and comprises the following steps:
acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be annotated of each word according to the text data to be annotated, wherein the text feature set to be annotated comprises part of speech, word length and post-word punctuation types;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
acquiring a prosodic hierarchy structure through a prosodic hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
A sixth aspect of the present application provides a model training apparatus, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory and comprises the following steps:
acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be trained of each word according to the text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
training the word identifier corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
A seventh aspect of the present application provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the method of the above-described aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
In the embodiment of the application, a prosodic hierarchy labeling method is provided. First, text data to be labeled and audio data are obtained, where the text data to be labeled and the audio data correspond to each other, the text data to be labeled includes at least one word, and each word corresponds to one word identifier. A text feature set to be labeled is then extracted for each word from the text data to be labeled, where the text feature set to be labeled includes the part of speech, the word length, and the post-word punctuation type. An acoustic feature set is also extracted for each word from the audio data, where the acoustic feature set includes the word-end syllable duration, the post-word pause duration, the word-end syllable acoustic statistical features, and the inter-word acoustic feature variation value. Finally, a prosodic hierarchy is obtained through a prosodic hierarchy labeling model according to the word identifier, the text feature set to be labeled, and the acoustic feature set of each word. In this way, the prosodic hierarchy labeling model is built by combining text features and acoustic features, which provides richer features for prosodic hierarchy labeling; adopting this more accurate prosodic hierarchy labeling model improves labeling accuracy and helps improve the naturalness of synthesized speech.
Drawings
FIG. 1 is a block diagram of a speech synthesis system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a prosody hierarchy in an embodiment of the present application;
FIG. 3 is a diagram of an embodiment of a method for prosody hierarchy labeling in an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating an application of the prosodic hierarchy labeling system of the embodiment of the present application;
FIG. 5 is a flow chart illustrating prosodic hierarchy labeling in an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of a method for model training in an embodiment of the present application;
FIG. 7 is a schematic flow chart of extracting an acoustic feature set according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an embodiment of a fundamental frequency curve in an embodiment of the present application;
FIG. 9 is a schematic diagram of an embodiment of an energy curve in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a prosody hierarchy labeling model according to an embodiment of the present application;
FIG. 11 is a schematic diagram of an embodiment of a prosodic hierarchy labeling apparatus according to the present application;
FIG. 12 is a schematic diagram of an embodiment of a model training apparatus according to an embodiment of the present application;
FIG. 13 is a schematic diagram of another embodiment of a model training apparatus according to an embodiment of the present application;
FIG. 14 is a schematic diagram of another embodiment of a model training apparatus according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a terminal device in an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a server in an embodiment of the present application.
Detailed Description
The embodiment of the application provides a prosody hierarchy labeling method, a model training method and a prosody hierarchy labeling device.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be appreciated that the present application is primarily applicable to automatically labeling the prosodic hierarchy of text data during data preparation for building a speech synthesis corpus. Speech synthesis is the task of converting text into speech, and building a high-quality speech synthesis system requires preparing massive amounts of data. Data labeled with a prosodic hierarchy has an important influence on the naturalness of speech synthesis. The traditional approach is manual labeling, which is time-consuming and labor-intensive for massive data, and different annotators may label the same words inconsistently. A system that automatically labels the prosodic hierarchy is therefore of great significance for quickly building the massive labeled prosody data required by a speech synthesis system and for resolving inconsistencies between annotators.
Key speech technologies include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes.
For ease of understanding, the present application provides a prosodic hierarchy labeling method and a model training method, applied to the speech synthesis system shown in FIG. 1. Referring to FIG. 1, FIG. 1 is an architecture diagram of the speech synthesis system in an embodiment of the present application. As shown in the figure, a terminal device or a server first obtains text data and audio data that correspond to each other; for example, the text data is "today is a good day" and the audio data is a recording of "today is a good day". A forced alignment tool is used to align the text data and the audio data. Next, a text feature set is extracted for each word in the text data, including the part of speech, the word length, and the post-word punctuation type. Feature extraction is also performed on the audio data to obtain an acoustic feature set for each word, including the word-end syllable duration, the post-word pause duration, the word-end syllable acoustic statistical features, and the inter-word acoustic feature variation value, where the inter-word acoustic feature variation value is the logarithmic difference in fundamental frequency and in energy between the last voiced frame of the current word and the first voiced frame of the next word. In addition, a word identifier (ID) can be extracted for each word from the text data. The word identifiers, text feature sets, and acoustic feature sets of all words in the sentence are input into the trained prosodic hierarchy labeling model, which outputs the prosodic hierarchy labeling result. If the prosodic hierarchy labeling model is deployed on the terminal device, the terminal device can directly play the corresponding sentence according to the prosodic hierarchy obtained from the model. If the model is deployed on the server, the server feeds the prosodic hierarchy back to the terminal device, which then plays the corresponding sentence according to it.
It should be noted that the terminal device includes, but is not limited to, a tablet computer, a notebook computer, a palmtop computer, a mobile phone, a voice interaction device, and a personal computer (PC), which is not limited here. The voice interaction device includes, but is not limited to, smart speakers and smart home appliances. The voice interaction device also has the following characteristics:
1. Networking: voice interaction devices can be connected together through a local area network, connected to a manufacturer's service site through a home gateway interface, and finally connected to the Internet to share information.
2. Intelligence: the voice interaction device can respond automatically according to the surrounding environment without human intervention.
3. Openness and compatibility: since a user's voice interaction devices may come from different vendors, they need to be open and compatible with one another.
4. Energy saving: smart home appliances can automatically adjust their working time and working state according to the surrounding environment, thereby saving energy.
5. Ease of use: the user only needs to know very simple operations, because the complex control flow is handled by the controller embedded in the voice interaction device. The voice interaction device is not a single device but a technical system; as application requirements develop and the devices become more intelligent, their content becomes richer, and the functions of different voice interaction devices differ according to the actual application environment, but they generally share intelligent control technology.
It should be understood that the prosodic hierarchy output by the speech synthesis system may specifically be the prosodic hierarchy of Chinese, a tonal language whose prosodic features are very complex. The prosodic hierarchy models prosodic features such as the pauses and rhythm of speech and is important for the naturalness of speech synthesized by a speech synthesis system. A typical prosodic hierarchy division is shown in FIG. 2. Referring to FIG. 2, FIG. 2 is a schematic structural diagram of a prosodic hierarchy in the embodiment of the present application, which is divided, from bottom to top, into prosodic words (PW), prosodic phrases (PPH), and intonation phrases (IPH). For example, in the sentence "receive a sincere greeting and a heartfelt blessing", the PWs are "receive sincere", "greeting", "heartfelt", and "blessing"; the PPHs are "receive", "sincere greeting", and "and heartfelt blessing"; and the IPHs are "sincere greeting" and "heartfelt blessing".
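For illustration, the sketch below shows one possible per-word representation of these boundary labels for such a sentence; the label names and the English word segmentation are assumptions made for this example only and are not the patent's exact encoding.

```python
# Purely illustrative: per-word prosodic boundary labels for the example
# sentence. Label names and the English segmentation are assumptions.
BOUNDARY_LABELS = ["NB", "PW", "PPH", "IPH"]   # no boundary, prosodic word,
                                               # prosodic phrase, intonation phrase

# Each word carries the highest-level prosodic boundary that follows it.
labeled_sentence = [
    ("receive", "PW"),
    ("sincere", "PW"),
    ("greeting", "IPH"),
    ("and", "PW"),
    ("heartfelt", "PW"),
    ("blessing", "IPH"),
]
```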
With reference to fig. 3, a method for prosody hierarchy labeling in the present application will be described below, where an embodiment of the method for prosody hierarchy labeling in the present application includes:
101. acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
In this embodiment, text data to be labeled and the corresponding audio data are first obtained. The text data to be labeled may be a sentence or a paragraph, and the language includes, but is not limited to, Chinese, Japanese, English, or Korean. The audio data may specifically be an audio file. The text data to be labeled includes at least one word, so word segmentation is possible; for example, "receive a sincere greeting and a heartfelt blessing" can be divided into five words, and different words correspond to different word identifiers.
102. Extracting a text feature set to be annotated of each word according to text data to be annotated, wherein the text feature set to be annotated comprises part of speech, word length and post-word punctuation types;
In this embodiment, feature extraction is then performed on each word, and it includes two aspects: extraction of text features and extraction of acoustic features. In the text feature extraction process, text features are extracted for each word in the text data to be labeled. Taking the text data to be labeled "receive a sincere greeting and a heartfelt blessing" as an example, a text feature set to be labeled can be extracted for each word, where the text feature set to be labeled includes, but is not limited to, the part of speech, the word length, and the post-word punctuation type.
Parts of speech are usually divided into content words and function words. Content words are a class of Chinese words that carry actual meaning and can serve as sentence components on their own, i.e., they have both lexical and grammatical meaning. Taking grammatical function as the main basis, a word that can independently act as a syntactic component and has both lexical and grammatical meaning is a content word. Content words include nouns, verbs, adjectives, numerals, measure words, and pronouns. Function words have no complete lexical meaning but have grammatical meaning or function. They must attach to content words or sentences to express grammatical meaning, cannot form sentences or grammatical components on their own, and cannot be reduplicated. Function words include adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia.
The word length indicates the number of characters in the word; for example, the word for "greeting" has a word length of 2, and the word for "and" has a word length of 1.
The post-word punctuation type indicates whether a punctuation mark immediately follows the word; if it does, the type of that punctuation mark is recorded. Punctuation marks represent pause durations in spoken language and help express thoughts and emotions and understand written language precisely.
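To make the three text features concrete, here is a minimal extraction sketch in Python. It assumes the jieba toolkit is used for Chinese word segmentation and part-of-speech tagging (the patent does not name a specific segmenter), and the punctuation table here is an illustrative assumption.

```python
import jieba.posseg as pseg

# Illustrative punctuation table; the patent's punctuation-type table is not
# reproduced here.
PUNCT_TYPES = {"，": "comma", "。": "period", "、": "enum_comma",
               "！": "exclamation", "？": "question"}

def extract_text_features(sentence):
    tokens = [(w, flag) for w, flag in pseg.cut(sentence)]
    features = []
    for i, (word, pos) in enumerate(tokens):
        if word in PUNCT_TYPES:        # punctuation is a feature of the
            continue                   # preceding word, not a word itself
        next_tok = tokens[i + 1][0] if i + 1 < len(tokens) else ""
        features.append({
            "word": word,
            "pos": pos,                              # part-of-speech tag
            "word_length": len(word),                # number of characters
            "post_punct": PUNCT_TYPES.get(next_tok, "none"),
        })
    return features

print(extract_text_features("成立合作社，借助电商平台组成新模式"))
```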
103. Extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word ending syllable duration, word post pause duration, acoustic statistical features of word ending syllables and acoustic feature variation values among words;
In this embodiment, in the acoustic feature extraction process, acoustic features are extracted for each word in the audio data. Taking the text data to be labeled "receive a sincere greeting and a heartfelt blessing" as an example, five groups of acoustic feature sets can be extracted, where each acoustic feature set includes, but is not limited to, the word-end syllable duration, the post-word pause duration, the word-end syllable acoustic statistical features, and the inter-word acoustic feature variation value.
The term "end syllable duration" refers to the time length of the last syllable of a word, the syllable refers to a voiced syllable, such as a "waiting" word of "greeting", the pronunciation is "hou", the unvoiced sound is "h", the voiced sound is "ou", the term "end syllable duration refers to the time length of the syllable of" ou ", the duration is detected by a special tool, and the discussion is omitted here.
The post-word pause duration refers to the length of time from the end of a word until the next word begins to be spoken, for example, the pause between "greeting" and "and".
The acoustic statistical features of the word-end syllable typically include ten parameters. Five of them relate to the logarithmic fundamental frequency curve of the last syllable: the maximum, minimum, range, mean, and variance of the log fundamental frequency curve. The other five relate to the logarithmic energy curve of the last syllable: the maximum, minimum, range, mean, and variance of the log energy curve.
The inter-word acoustic feature variation value represents the log fundamental frequency difference and log energy difference between the voiced sound at the tail of a word and the voiced sound at the head of the next word.
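The following Python sketch illustrates how this acoustic feature set might be assembled. It assumes a forced aligner has already produced word and word-end-syllable time boundaries and that frame-level log F0 and log energy have been extracted with unvoiced regions interpolated; the 10 ms hop and all names are assumptions for illustration, not the patent's exact procedure.

```python
import numpy as np

HOP = 0.01  # seconds per frame (illustrative)

def curve_stats(curve):
    """Max, min, range, mean, and variance of a (log) curve segment."""
    return [float(curve.max()), float(curve.min()),
            float(curve.max() - curve.min()),
            float(curve.mean()), float(curve.var())]

def acoustic_features(words, log_f0, log_energy):
    """words: dicts with 'end_syllable_start', 'end', 'next_start' times in
    seconds from forced alignment (the sentence-final word is omitted here);
    log_f0 / log_energy: per-frame arrays with unvoiced regions interpolated
    (an assumption of this sketch)."""
    feats = []
    for w in words:
        s = int(w["end_syllable_start"] / HOP)
        e = int(w["end"] / HOP)
        n = int(w["next_start"] / HOP)
        feats.append({
            "end_syllable_duration": w["end"] - w["end_syllable_start"],
            "pause_duration": w["next_start"] - w["end"],
            # ten statistics over the last syllable's log F0 and log energy
            "end_syllable_stats": curve_stats(log_f0[s:e]) + curve_stats(log_energy[s:e]),
            # inter-word change: log difference between the word-final frame
            # and the first frame of the next word
            "delta_log_f0": float(log_f0[n] - log_f0[e - 1]),
            "delta_log_energy": float(log_energy[n] - log_energy[e - 1]),
        })
    return feats
```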
104. And acquiring a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word.
In this embodiment, text data to be labeled and audio data are input to a prosody hierarchy labeling model, and the prosody hierarchy labeling model outputs a corresponding prosody hierarchy structure according to a word identifier of each word, a text feature set to be labeled of each word, and an acoustic feature set of each word.
For convenience of introduction, referring to FIG. 4, FIG. 4 is a schematic illustration of an application of the prosodic hierarchy labeling system in the embodiment of the present application. As shown in the figure, a user provides the text data and audio data that need prosodic hierarchy labeling; for example, when the user inputs the text "receive a sincere greeting and a heartfelt blessing" to be labeled, the text data and the corresponding audio data are provided to the prosodic hierarchy labeling model. The model extracts features: the text feature set to be labeled and the acoustic feature set are extracted for each word, the prosodic hierarchy is then obtained by a forward pass of the deep neural network, and the prosodic hierarchy labeling model returns the text labeled with the prosodic hierarchy to the user.
Referring to FIG. 5, FIG. 5 is a schematic flow chart of prosodic hierarchy labeling in the embodiment of the present application. As shown in the figure, in step S1, the text data and audio data of the sentence to be labeled are first obtained. In step S2, the text data is segmented into words, and the text data and the audio data are forcibly aligned. In step S3, after the forced alignment, the corresponding text features and acoustic features can be extracted. In step S4, the extracted text features and acoustic features are input into the prosodic hierarchy labeling model, which includes a feedforward neural network and a bidirectional long short-term memory network. In step S5, the prosodic hierarchy of the sentence is output by the prosodic hierarchy labeling model.
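As a rough sketch of the network shape described in steps S4 and S5, the following PyTorch code combines a word-embedding layer, small feed-forward layers over the text and acoustic features, and a bidirectional LSTM over the word sequence. All layer sizes and the four-class boundary output are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

class ProsodyLabeler(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, text_dim=3, acoustic_dim=14,
                 hidden=128, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)            # word-ID embedding
        self.text_ff = nn.Sequential(nn.Linear(text_dim, 32), nn.ReLU())
        self.acoustic_ff = nn.Sequential(nn.Linear(acoustic_dim, 32), nn.ReLU())
        self.bilstm = nn.LSTM(emb_dim + 32 + 32, hidden,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, word_ids, text_feats, acoustic_feats):
        # word_ids: (batch, seq); text/acoustic_feats: (batch, seq, dim)
        x = torch.cat([self.embed(word_ids),
                       self.text_ff(text_feats),
                       self.acoustic_ff(acoustic_feats)], dim=-1)
        h, _ = self.bilstm(x)
        return self.out(h)    # per-word boundary logits
```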
In the embodiment of the application, a prosodic hierarchy labeling method is provided. First, text data to be labeled and audio data are obtained, where they correspond to each other, the text data to be labeled includes at least one word, and each word corresponds to one word identifier. A text feature set to be labeled is then extracted for each word from the text data to be labeled, including the part of speech, the word length, and the post-word punctuation type. An acoustic feature set is also extracted for each word from the audio data, including the word-end syllable duration, the post-word pause duration, the word-end syllable acoustic statistical features, and the inter-word acoustic feature variation value. Finally, a prosodic hierarchy is obtained through a prosodic hierarchy labeling model according to the word identifier, the text feature set to be labeled, and the acoustic feature set of each word. In this way, the prosodic hierarchy labeling model is built by combining text features and acoustic features, providing richer features for prosodic hierarchy labeling; adopting this more accurate prosodic hierarchy labeling model improves labeling accuracy and the speech synthesis effect.
Optionally, on the basis of the embodiment corresponding to fig. 3, in a first optional embodiment of the method for providing prosody hierarchy annotation in the embodiment of the present application, obtaining a prosody hierarchy structure through a prosody hierarchy annotation model may include:
determining at least one of prosodic words, prosodic phrases and intonation phrases through a prosodic hierarchy annotation model;
or, determining prosodic words and/or prosodic phrases through a prosodic hierarchy labeling model.
In this embodiment, two common prosodic hierarchies are introduced. In the first case, at least one of prosodic words, prosodic phrases, and intonation phrases is determined by the prosodic hierarchy labeling model; that is, the model is trained on four cases: non-prosodic-hierarchy boundary, prosodic word boundary, prosodic phrase boundary, and intonation phrase boundary. In the second case, prosodic words and/or prosodic phrases are determined by the prosodic hierarchy labeling model; that is, the model is trained on three cases: non-prosodic-hierarchy boundary, prosodic word boundary, and prosodic phrase boundary.
When the prosody hierarchy is labeled, prosody hierarchy labeling is carried out on input text data after text processing by adopting a prosody hierarchy labeling model generated in a training stage, so that a text with a prosody hierarchy structure is obtained and used for quickly constructing a corpus required by a voice synthesis system.
Secondly, in the embodiment of the present application, two common prosody hierarchy labeling methods are introduced, one is to determine prosodic words, prosodic phrases and intonation phrases through a prosody hierarchy labeling model, and the other is to determine prosodic words and prosodic phrases through a prosody hierarchy labeling model. Through the mode, the user can select a more detailed labeling scheme with three-layer prosody hierarchical structures of prosodic words, prosodic phrases and intonation phrases, and can also select a labeling scheme with two-layer prosody hierarchical structures of the prosodic words and the prosodic phrases. Therefore, the prosody hierarchy output can be selected according to the requirement, and the flexibility of the scheme is improved.
With reference to fig. 6, an embodiment of the method for training a model in this application includes:
201. acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, and each word corresponds to a word identifier;
In this embodiment, text data to be trained and the corresponding audio data to be trained are first obtained. The text data to be trained may specifically be a sentence or a paragraph, and the language includes, but is not limited to, Chinese, Japanese, English, or Korean. The audio data to be trained may specifically be an audio file. The text data to be trained includes at least one word, so word segmentation is possible; for example, "receive a sincere greeting and a heartfelt blessing" can be divided into five words, and different words correspond to different word identifiers.
It can be understood that a large number of samples are often required during training, where the text data to be trained and the audio data to be trained are samples, and for convenience of description, the text data to be trained and the audio data to be trained are described as one sample, which should not be construed as a limitation to the present application.
202. Extracting a text feature set to be trained of each word according to text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
In this embodiment, feature extraction is then performed on each word, and it includes two aspects: extraction of text features and extraction of acoustic features. In the text feature extraction process, text features are extracted for each word in the text data to be trained. Taking the text data to be trained "receive a sincere greeting and a heartfelt blessing" as an example, a text feature set to be trained can be extracted for each word, where the text feature set to be trained includes, but is not limited to, the part of speech, the word length, and the post-word punctuation type.
It should be noted that the parts of speech, the word length, and the word punctuation type have been introduced in the above embodiments, and therefore, the description thereof is omitted here.
203. Extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
In this embodiment, in the acoustic feature extraction process, acoustic features are extracted for each word in the audio data to be trained. Taking the text data to be trained "receive a sincere greeting and a heartfelt blessing" as an example, an acoustic feature set to be trained can be extracted for each word, where the acoustic feature set to be trained includes, but is not limited to, the word-end syllable duration, the post-word pause duration, the word-end syllable acoustic statistical features, and the inter-word acoustic feature variation value.
It should be noted that the word end syllable duration, the post-word pause duration, the word end syllable acoustic statistical characteristics, and the inter-word acoustic characteristic variation values have been described in the above embodiments, and therefore are not described herein again.
204. And training the word identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy.
In this embodiment, the training process of the prosodic hierarchy labeling model is introduced. The training data are text data labeled with a prosodic hierarchy and the corresponding audio data, and a deep neural network is used to model the sequence. Each sentence contains a plurality of words, so a sentence is a word sequence, and the features and label of each word serve as the input and output of the deep neural network at one time step. Each word has a corresponding label y, so the labels of a sentence can be represented as a vector Y. The word identifier, text features, and acoustic features of each word in the sentence can be extracted from the text data and the corresponding audio data to form the feature x of the word, and the words of a sentence can be represented as an input vector X. The loss function is denoted as L(Y, f(X)). By training on a large number of samples to minimize the loss function, the parameters of the neural network are obtained, yielding an automatic prosodic hierarchy labeling model, i.e., the prosodic hierarchy labeling model.
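A minimal training-loop sketch for this sequence-labeling objective is shown below, using cross-entropy over per-word boundary labels and the illustrative ProsodyLabeler from the earlier sketch; data loading, padding, and masking are omitted, and all shapes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

model = ProsodyLabeler(vocab_size=10000)   # illustrative model from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(word_ids, text_feats, acoustic_feats, labels):
    """word_ids: (B, T) longs; text/acoustic_feats: (B, T, dim) floats;
    labels: (B, T) integer boundary classes. Padding/masking omitted."""
    logits = model(word_ids, text_feats, acoustic_feats)          # (B, T, C)
    loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```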
In the embodiment of the application, a model training method is provided. First, text data to be trained and audio data to be trained are obtained, where they correspond to each other and each word corresponds to one word identifier. A text feature set to be trained is then extracted for each word from the text data to be trained, including the part of speech, the word length, and the post-word punctuation type, and an acoustic feature set to be trained is extracted for each word from the audio data to be trained, including the word-end syllable duration, the post-word pause duration, the word-end syllable acoustic statistical features, and the inter-word acoustic feature variation value. Finally, the word identifier, the text feature set to be trained, and the acoustic feature set to be trained of each word are trained to obtain the prosodic hierarchy labeling model. In this way, the prosodic hierarchy labeling model is built by combining text features and acoustic features, providing richer features for prosodic hierarchy labeling; adopting this more accurate prosodic hierarchy labeling model improves labeling accuracy and the speech synthesis effect.
Optionally, on the basis of the embodiment corresponding to fig. 6, in a first optional embodiment of the method for model training provided in the embodiment of the present application, after obtaining text data to be trained and audio data to be trained, the method may further include:
performing word segmentation processing on text data to be trained to obtain at least one word;
acquiring a target word identifier corresponding to a target word according to a preset word identifier relationship, wherein the preset word identifier relationship is used for representing a preset relationship between each word and the word identifier, and the target word belongs to any one of at least one word;
generating a target word vector corresponding to a target word in text data to be trained;
training the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model, which may include:
and training the target word identifier and the target word vector to obtain a first model parameter, wherein the first model parameter is used for generating a word embedding layer in a prosody hierarchy labeling model.
In this embodiment, a method for training the word embedding layer in the prosodic hierarchy labeling model is provided. First, text data to be trained needs to be obtained and then segmented into words. For example, the text data to be trained "set up a cooperative society and form a new pattern with the help of an e-commerce platform" is segmented into "set up", "cooperative society", "with the help of", "e-commerce", "platform", "form", and "new pattern". The word identifier corresponding to each word then needs to be determined according to the preset word identifier relationship. For ease of understanding, please refer to Table 1, which is an illustration of a preset word identifier relationship.
TABLE 1
Word identifier    Word
0                  set up
1                  cooperative society
2                  with the help of
3                  e-commerce
4                  platform
5                  form
6                  new pattern
As can be seen from Table 1, the preset word identifier relationship indicates the relationship between a word and its word identifier, and the same word always corresponds to the same word identifier. If the target word is "set up", its word identifier is "0", and "0" is used as the input of the word embedding layer.
Following the method for generating the target word identifier and the target word vector, the word identifiers and word vectors of the other words are generated. The word identifiers and word vectors are trained according to the mapping relationship between them, and the first model parameter can be obtained by minimizing the loss function; the first model parameter is used to generate the word embedding layer in the prosodic hierarchy labeling model. In practical applications, the word embedding layer can be updated periodically to improve its accuracy.
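The following sketch shows a Table 1 style word-identifier mapping feeding a trainable embedding lookup. The vocabulary and the 128-dimensional embedding size are illustrative assumptions; in the patent, the embedding weights correspond to the first model parameter learned during training.

```python
import torch
import torch.nn as nn

# Word-identifier mapping in the style of Table 1 (English glosses used here).
word_to_id = {"set up": 0, "cooperative society": 1, "with the help of": 2,
              "e-commerce": 3, "platform": 4, "form": 5, "new pattern": 6}

# The embedding weights play the role of the "first model parameter":
# they are learned jointly with the rest of the labeling network.
embedding = nn.Embedding(num_embeddings=len(word_to_id), embedding_dim=128)

ids = torch.tensor([word_to_id[w] for w in ["set up", "cooperative society",
                                            "with the help of", "e-commerce",
                                            "platform", "form", "new pattern"]])
word_vectors = embedding(ids)   # shape (7, 128): one trainable vector per word
```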
Secondly, in the embodiment of the application, a method for training the word embedding layer is introduced: word segmentation is performed on the text data to be trained, the target word identifier corresponding to the target word is obtained according to the preset word identifier relationship, the target word vector corresponding to the target word in the text data to be trained is generated, and the target word identifier and the target word vector are then trained to obtain the first model parameter, which is used to generate the word embedding layer in the prosodic hierarchy labeling model. In this way, the word embedding layer in the prosodic hierarchy labeling model can be trained directly, and the other neural networks in the model can be trained at the same time, which avoids training a separate word vector model with an additional neural network and improves training efficiency.
Optionally, on the basis of the embodiment corresponding to fig. 6, in a second optional embodiment of the method for model training provided in the embodiment of the present application, extracting a feature set of a text to be trained for each word according to text data to be trained may include:
acquiring the part of speech, the word length and the post-word punctuation type of a target word in the text data to be trained, wherein the part of speech represents the grammatical category of the word, the word length represents the number of characters in the word, and the post-word punctuation type represents the type of punctuation that follows the word;
acquiring the part of speech, word length and post-word punctuation types of associated words in text data to be trained, wherein the associated words are words having an association relation with target words;
training the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model, which may include:
and training the part of speech, the word length and the post-word punctuation type of the target word and the part of speech, the word length and the post-word punctuation type of the associated word to obtain a second model parameter, wherein the second model parameter is used for generating a text neural network in a rhythm level labeling model.
In the embodiment, a method for training a text neural network in a prosodic hierarchy labeling model is provided. For convenience of understanding, the following description will continue to use the target word in the text data to be trained as an example, and it can be understood that the processing manner of other words in the text data to be trained is similar to that of the target word, and is not described herein again.
Specifically, word segmentation is performed on the text data to be trained. For example, if the text data to be trained is "establish a cooperative, and form a new pattern with the help of an e-commerce platform", then "establish", "cooperative", "with the help of", "e-commerce", "platform", "form" and "new pattern" are obtained after word segmentation. Assuming that the target word is "cooperative", the part of speech of the target word is noun, the word length is 3 characters, and the post-word punctuation type is comma. For ease of understanding, the relationship between parts of speech and identifiers, and the relationship between post-word punctuation types and identifiers, are described below in conjunction with tables 2 and 3. In practical applications, text features are usually represented by numbers, and therefore it is necessary to convert these word-level categories into numerical form.
TABLE 2
Part-of-speech identifier | Part of speech | Examples
0 | Noun | "Shanghai", "cucumber", "cabbage", "tractor", "quality", "moral character"
1 | Verb | "come", "walk", "run", "learn", "take off", "know"
2 | Adjective | "short", "thin", "tall", "ugly", "snow-white", "beautiful", "red"
3 | Adverb | "very", "rather", "extremely", "just", "all", "immediately", "once"
4 | Pronoun | "I", "you", "he", "she", "it", "we"
5 | Preposition | "from", "toward", "at", "by", "than"
6 | Measure word | "mu", "zhan", "xie", "zhi", "ben", "che", "ke", "tou", "jiao"
7 | Conjunction | "then", "so", "and", "or"
8 | Particle | "de", "me", "ba"
9 | Numeral | "one", "two", "three", "seven", "ten", "hundred", "thousand", "ten thousand", "hundred million"
10 | Interjection | "hey", "hi", "hum"
11 | Onomatopoeia | "wu", "wang", "honglong", "gele", "shasha", "hula"
TABLE 3
Post-word punctuation type identifier | Post-word punctuation type
0 | Period
1 | Question mark
2 | Exclamation mark
3 | Enumeration comma
4 | Comma
5 | Semicolon
6 | Colon
7 | No punctuation
As can be seen from tables 2 and 3, when the target word is "cooperative", the corresponding feature is "noun, 3, comma", which can be expressed as "034". In order to enrich the text features, the words around the target word also need to be considered, that is, associated words are obtained. An associated word may be the previous word, the next word, or the two previous words of the target word, and the like, which is not limited herein. Assuming that the associated words are the previous word and the next word of the target word, and the target word is "cooperative", the associated words are "establish" and "with the help of". As can be seen from tables 2 and 3, the feature corresponding to "establish" is "verb, 2, no punctuation". The number of part-of-speech categories, the maximum word length and the number of punctuation categories in the corpus are counted, so that the part-of-speech feature, the word-length feature and the post-word punctuation feature can all be represented by one-hot vectors. The three one-hot vectors are spliced to obtain the text feature of the current target word, and the text feature of the target word is spliced with the text features of the associated words to obtain the text feature vector of the target word, namely the text feature set to be trained.
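The following sketch illustrates, under stated assumptions, how the one-hot splicing described above can be implemented. The category counts (NUM_POS, MAX_WORD_LEN, NUM_PUNCT) and the feature values of the associated words are illustrative and would in practice be counted from the corpus, as described above.

```python
import numpy as np

NUM_POS = 12        # part-of-speech categories (Table 2); counted from the corpus in practice
MAX_WORD_LEN = 10   # assumed maximum word length in characters
NUM_PUNCT = 8       # post-word punctuation categories (Table 3)

def one_hot(index, size):
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def word_text_feature(pos_id, word_len, punct_id):
    # Splice the three one-hot vectors into the text feature of a single word.
    return np.concatenate([one_hot(pos_id, NUM_POS),
                           one_hot(word_len - 1, MAX_WORD_LEN),
                           one_hot(punct_id, NUM_PUNCT)])

# Target word "cooperative": noun (0), 3 characters, followed by a comma (4) -> "034".
target = word_text_feature(0, 3, 4)
# Associated words: previous word "establish" (verb, 2 characters, no punctuation)
# and next word "with the help of" (illustrative values).
prev_word = word_text_feature(1, 2, 7)
next_word = word_text_feature(5, 2, 7)

# Text feature vector of the target word, i.e. the text feature set to be trained.
text_feature_vector = np.concatenate([prev_word, target, next_word])
print(text_feature_vector.shape)   # (90,) with the illustrative category counts above
```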
Following the method for extracting the text feature set to be trained described above, the text feature set to be trained of each word is extracted and then trained, and a second model parameter can be obtained by minimizing the loss function, where the second model parameter is used for generating the text neural network in the prosody hierarchy labeling model. In practical application, the text neural network can be updated regularly, so that the accuracy of the text neural network is improved.
It can be understood that the text neural network may be a feedforward neural network, a convolutional neural network or another type of neural network, and the bidirectional long short-term memory network may be replaced by a variant thereof, such as a recurrent neural network with gated recurrent units; this is only an illustration and should not be construed as limiting the present application. The number of layers and the number of neurons of the text neural network are also not limited by the present application.
Secondly, in the embodiment of the application, a method for training a text neural network is introduced, namely, the part of speech, the word length and the post-word punctuation type of a target word in text data to be trained are firstly obtained, the part of speech, the word length and the post-word punctuation type of a related word in the text data to be trained are also obtained, then the part of speech, the word length and the post-word punctuation type of the target word and the part of speech, the word length and the post-word punctuation type of the related word are trained to obtain second model parameters, and the second model parameters are used for generating the text neural network in a rhythm level labeling model. By the mode, the system can automatically learn the high-level feature expression which is favorable for prosodic hierarchy structure labeling through the neural network, and automatically learn the high-level feature which is favorable for labeling from the original input text feature set, so that the automatic labeling performance of the prosodic hierarchy structure is improved.
Optionally, on the basis of the embodiment corresponding to fig. 6, in a third optional embodiment of the method for model training provided in the embodiment of the present application, after obtaining text data to be trained and audio data to be trained, the method may further include:
forcibly aligning the text data to be trained and the audio data to be trained to obtain a time-aligned text;
extracting the set of acoustic features to be trained of each word according to the audio data to be trained may include:
and determining the syllable duration of the word end of the target word according to the time alignment text.
In this embodiment, how to extract the acoustic feature set of a word is introduced. The text data to be trained and the audio data to be trained are forcibly aligned to obtain a time-aligned text; specifically, frame boundaries at the phoneme level can be obtained, so that the frame boundary of the word-end syllable can also be obtained, and the word-end syllable duration of the target word is calculated according to the start frame number and the end frame number of the word-end syllable.
For convenience of introduction, please refer to fig. 7, where fig. 7 is a schematic flowchart of a process of extracting an acoustic feature set according to an embodiment of the present application. As shown in the figure, in step a1, text data and audio data are first obtained; specifically, the text data to be trained and the audio data to be trained may be obtained. In step a2, the text data to be trained is word-segmented, and the text data and the audio data are aligned by using a forced alignment tool to obtain a time-aligned text, i.e. frame boundary information at the phoneme level. In step a4, the start and end frame numbers corresponding to the boundary of the word-end syllable are determined, and similarly, the frame numbers of the last voiced frame at the end of the word and of the first voiced frame at the beginning of the next word can also be determined. In step a3, a logarithmic fundamental frequency curve and a logarithmic energy curve are extracted from the audio data frame by frame. In step a5, combining the time-aligned text, the logarithmic fundamental frequency curve and logarithmic energy curve of the word-end syllable are obtained, as well as the logarithmic fundamental frequency values and logarithmic energy values of the word-end voiced frame and of the voiced frame at the beginning of the next word. In step a6, the logarithmic fundamental frequency statistical characteristics and logarithmic energy statistical characteristics of the word-end syllable are calculated, together with the logarithmic fundamental frequency difference and logarithmic energy difference between the word-end voiced frame and the voiced frame at the beginning of the next word. In step a7, these acoustic features are spliced to form the acoustic feature set of each word for the prosody hierarchy automatic labeling task.
Specifically, after the text data to be trained and the audio data to be trained are forcibly aligned, frame boundary information at the phoneme level is obtained. Assuming that the text data to be trained is "sincere greeting and heartfelt blessing" and the target word is "greeting", the word-end syllable duration refers to the time length of the last syllable of the word, and the frame boundary of the word-end syllable can be obtained from the forced alignment information. For example, the last character of the target word "greeting" is pronounced "hou", and the word-end syllable is "ou"; the start frame number of "ou" in the audio is frame 101 and the end frame number is frame 120, so "ou" lasts 20 frames. At 5 milliseconds per frame, the pronunciation duration of "ou" is 100 milliseconds, that is, the word-end syllable duration of "greeting" is 100 milliseconds.
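A minimal sketch of this word-end syllable duration computation follows, assuming a 5 millisecond frame shift as in the example; the function name and frame numbers are illustrative.

```python
FRAME_SHIFT_MS = 5  # assumed frame shift, matching the 5 ms per frame in the example

def syllable_duration_ms(start_frame, end_frame):
    # Duration of the word-end syllable computed from its frame boundary
    # in the forced-alignment (time-aligned) result.
    return (end_frame - start_frame + 1) * FRAME_SHIFT_MS

# Word-end syllable "ou" of "greeting": frames 101..120 -> 20 frames -> 100 ms.
print(syllable_duration_ms(101, 120))   # 100
```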
Secondly, in the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, the text data to be trained and the audio data to be trained are forcibly aligned to obtain a time-aligned text, and the word end syllable duration of the target word is determined according to the time-aligned text. By the method, the time-aligned text can be obtained, the word end syllable duration is extracted, the word end syllable duration is used as one item in the acoustic feature set, and high-level features beneficial to labeling are automatically learned from the originally input acoustic feature set, so that the accuracy of the rhythm level labeling model is improved.
Optionally, on the basis of the third embodiment corresponding to fig. 6, in a fourth optional embodiment of the method for model training provided in the embodiment of the present application, extracting an acoustic feature set to be trained of each word according to audio data to be trained may include:
and determining post-word pause duration of the target word according to the time alignment text.
In this embodiment, how to obtain the post-word pause duration of a word will be described. Specifically, after the text data to be trained and the audio data to be trained are forcibly aligned, a time-aligned text is obtained. Assuming that the text data to be trained is "sincere greeting and heartfelt blessing" and the target word is "greeting", the next adjacent word of the target word is "and". According to the time-aligned text, the short pause between the end of "greeting" and the beginning of "and" can be obtained; the pause lasts 20 frames, each frame is 5 milliseconds, and the post-word pause duration of the target word is therefore 100 milliseconds.
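A corresponding sketch for the post-word pause duration, again assuming 5 millisecond frames; the frame numbers in the example call are illustrative placeholders.

```python
FRAME_SHIFT_MS = 5  # assumed frame shift

def pause_duration_ms(word_end_frame, next_word_start_frame):
    # Pause between the end of the target word and the start of the next word,
    # both frame numbers taken from the time-aligned text.
    return (next_word_start_frame - word_end_frame - 1) * FRAME_SHIFT_MS

# 20 silent frames between "greeting" and "and" -> 100 ms post-word pause duration.
print(pause_duration_ms(120, 141))   # 100
```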
In the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, the text data to be trained and the audio data to be trained are aligned forcibly to obtain a time aligned text, and then the post-word pause duration can be determined according to the time aligned text. By the method, the post-word pause duration of each word can be determined after the text data and the audio data are aligned forcibly, the post-word pause duration is used as one item in the acoustic feature set, and high-level features beneficial to labeling are automatically learned from the acoustic feature set input originally, so that the accuracy of the rhythm level labeling model is improved.
Optionally, on the basis of the third embodiment corresponding to fig. 6, in a fifth optional embodiment of the method for model training provided in the embodiment of the present application, extracting an acoustic feature set to be trained of each word according to audio data to be trained may include:
calculating the frame number of the voiced start frame and the frame number of the voiced end frame of the word-end syllable of the target word according to the time-aligned text and the fundamental frequency information extracted from the audio data to be trained;
extracting a logarithmic fundamental frequency curve and a logarithmic energy curve of audio data to be trained;
and calculating the acoustic statistical characteristics of the word-end syllable of the target word according to the frame number of the voiced start frame of the word-end syllable of the target word, the frame number of the voiced end frame, the logarithmic fundamental frequency curve and the logarithmic energy curve, wherein the acoustic statistical characteristics of the word-end syllable comprise at least one of the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic fundamental frequency curve, and further comprise at least one of the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic energy curve.
In this embodiment, how to obtain the acoustic statistical characteristics of the word-end syllable will be described. Specifically, after the text data to be trained and the audio data to be trained are forcibly aligned, a time-aligned text is obtained. Assuming that the text data to be trained is "sincere greeting and heartfelt blessing", the fundamental frequency and energy of the corresponding audio are extracted frame by frame, so as to generate a fundamental frequency curve and an energy curve. For ease of understanding, please refer to fig. 8 and fig. 9, where fig. 8 is a schematic diagram of a fundamental frequency curve in the embodiment of the present application and fig. 9 is a schematic diagram of an energy curve in the embodiment of the present application. In order to normalize the data, the logarithm of each of the two curves is taken to obtain a logarithmic fundamental frequency curve and a logarithmic energy curve; near a prosody hierarchy boundary, the fundamental frequency and the energy tend to weaken. Assuming that the target word is "greeting", according to the frame number of the voiced start frame and the frame number of the voiced end frame of its word-end syllable, the logarithmic fundamental frequency curve and the logarithmic energy curve corresponding to the word-end syllable of the target word are intercepted from the logarithmic fundamental frequency curve and the logarithmic energy curve of the audio, and ten-dimensional acoustic statistical characteristics of the word-end syllable are calculated from them, namely the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic fundamental frequency curve, and the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic energy curve.
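A sketch of the ten-dimensional word-end syllable statistics follows, under the assumption that the per-frame logarithmic fundamental frequency and logarithmic energy curves are already available as arrays; the curves generated here are random placeholders rather than real audio features.

```python
import numpy as np

def syllable_acoustic_stats(log_f0, log_energy, start, end):
    # Ten-dimensional word-end syllable statistics: maximum, minimum, interval range,
    # average and variance of the log fundamental frequency and of the log energy,
    # computed over the voiced frames [start, end] of the word-end syllable.
    seg_f0 = log_f0[start:end + 1]
    seg_en = log_energy[start:end + 1]
    stats = []
    for seg in (seg_f0, seg_en):
        stats.extend([seg.max(), seg.min(), seg.max() - seg.min(),
                      seg.mean(), seg.var()])
    return np.array(stats, dtype=np.float32)

# Illustrative per-frame curves of one utterance (values are placeholders).
log_f0 = np.log(np.random.uniform(100, 300, size=500))
log_energy = np.log(np.random.uniform(1e-3, 1.0, size=500))
features = syllable_acoustic_stats(log_f0, log_energy, start=101, end=120)
print(features.shape)   # (10,)
```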
In the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, the text data to be trained and the audio data to be trained are forcibly aligned to obtain a time aligned text, then the frame number of the final syllable voiced initial frame and the frame number of the voiced end frame of the target word are obtained through calculation according to the time aligned text and the fundamental frequency information extracted from the audio data to be trained, the logarithmic fundamental frequency curve and the logarithmic energy curve of the audio data to be trained are extracted, and finally the acoustic statistical characteristics of the final syllable of the target word are obtained through calculation according to the frame number of the final syllable voiced initial frame, the frame number of the voiced end frame, the logarithmic fundamental frequency curve and the logarithmic energy curve of the target word. By the method, the time-aligned text data is obtained, the frame numbers of the initial frame and the ending frame of the voiced speech segment at the end of the word can be obtained according to the fundamental frequency information extracted from the audio, and the high-level features beneficial to labeling are automatically learned from the acoustic feature set input originally, so that the accuracy of the rhythm level labeling model is improved.
Optionally, on the basis of the third embodiment corresponding to fig. 6, in a seventh optional embodiment of the method for model training provided in the embodiment of the present application, extracting an acoustic feature set to be trained of each word according to audio data to be trained may include:
calculating the frame number of the last voiced sound frame of the target word and the frame number of the voiced sound frame of the next adjacent word prefix of the target word according to the time alignment text and fundamental frequency information extracted from the audio data to be trained;
determining a fundamental frequency value and an energy value between a voiced frame at the tail of the target word and a voiced frame at the head of the next adjacent word according to the frame number of the last voiced frame of the target word, the frame number of the voiced frame at the head of the next adjacent word of the target word and fundamental frequency information and energy information which are extracted from audio data to be trained in a framing manner;
calculating to obtain a logarithmic difference value of the fundamental frequency value according to the fundamental frequency value between the suffix voiced frame of the target word and the next adjacent word prefix voiced frame, and calculating to obtain a logarithmic difference value of the energy value according to the energy value between the suffix voiced frame of the target word and the next adjacent word prefix voiced frame, wherein the logarithmic difference value of the fundamental frequency value and the logarithmic difference value of the energy value belong to the acoustic characteristic change value between words.
In this embodiment, how to obtain the inter-word acoustic feature change value will be described. Specifically, after the text data to be trained and the audio data to be trained are forcibly aligned, a time-aligned text is obtained. Assuming that the text data to be trained is "sincere greeting and heartfelt blessing", the fundamental frequency information and the energy information corresponding to the audio data to be trained are extracted frame by frame, so as to generate a fundamental frequency curve and an energy curve, and in order to normalize the data, the logarithm of each of the two curves is taken to obtain a logarithmic fundamental frequency curve and a logarithmic energy curve. Assuming that the target word is "greeting", the frame number of the last voiced frame of "greeting" and the frame number of the first voiced frame of the next word "and" can be determined according to the time-aligned text and the fundamental frequency information extracted frame by frame from the audio; the fundamental frequency values and energy values of these two frames can then be obtained, and the logarithmic fundamental frequency difference and the logarithmic energy difference of the two frames are calculated.
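A sketch of this inter-word change computation, assuming per-frame fundamental frequency and energy arrays; the function and variable names, as well as the frame indices in the example call, are illustrative.

```python
import numpy as np

def interword_change(f0, energy, word_end_voiced_frame, next_word_start_voiced_frame):
    # Logarithmic fundamental frequency difference and logarithmic energy difference
    # between the last voiced frame of the target word and the first voiced frame
    # of the next adjacent word; both frame indices come from the time-aligned text.
    log_f0_diff = np.log(f0[next_word_start_voiced_frame]) - np.log(f0[word_end_voiced_frame])
    log_en_diff = np.log(energy[next_word_start_voiced_frame]) - np.log(energy[word_end_voiced_frame])
    return log_f0_diff, log_en_diff

# Illustrative per-frame curves; frame indices are placeholders.
f0 = np.random.uniform(100, 300, size=500)
energy = np.random.uniform(1e-3, 1.0, size=500)
print(interword_change(f0, energy, word_end_voiced_frame=118, next_word_start_voiced_frame=142))
```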
In the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, they are forcibly aligned to obtain a time-aligned text. Then, according to the frame number of the last voiced frame of the target word, the frame number of the voiced frame at the beginning of the next adjacent word, and the fundamental frequency and energy data extracted frame by frame from the audio, the fundamental frequency values and energy values of the voiced frame at the end of the target word and of the voiced frame at the beginning of the next word are determined, and the logarithmic difference of the fundamental frequency values and the logarithmic difference of the energy values are calculated and used as the inter-word acoustic feature change value. In this way, high-level features beneficial to labeling can be automatically learned from the acoustic feature set of the original input, so that the accuracy of the prosody hierarchy labeling model is improved.
Optionally, on the basis of the embodiment corresponding to fig. 6 or any one of the first to seventh optional embodiments corresponding to fig. 6, in an eighth optional embodiment of the method for model training provided in the embodiment of the present application, training the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model may include:
acquiring a first output result of a target word identifier through a word embedding layer in a rhythm level labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of at least one word, and the word embedding layer is obtained through training according to a first model parameter;
acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target words, and the text neural network is obtained through training according to second model parameters;
training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to a target word, and the third model parameter is used for generating an acoustic neural network in a prosody hierarchy labeling model;
and generating a rhythm level labeling model according to the first model parameter, the second model parameter and the third model parameter.
In this embodiment, a method for obtaining the prosody hierarchy labeling model through training will be described. For ease of understanding, please refer to fig. 10, where fig. 10 is a schematic structural diagram of the prosody hierarchy labeling model in the embodiment of the present application. As shown in the figure, the target word is taken as an example, that is, the word identifier is the target word identifier, the text feature set is the target text feature set to be trained corresponding to the target word, and the acoustic feature set is the target acoustic feature set to be trained corresponding to the target word. The target word identifier is used as the input of the word embedding layer, which outputs a first output result, where the first output result is the word vector to which the target word identifier is mapped, and the word vector may be 200-dimensional. The target text feature set to be trained (part of speech, word length and post-word punctuation type) is used as the input of the text neural network (such as a feedforward neural network), which outputs a second output result. The target acoustic feature set to be trained, the first output result and the second output result are used together as the input of the acoustic neural network (such as a bidirectional long short-term memory network), and the posterior probabilities of all prosodic hierarchy structure types of the target word are output through a softmax layer; for example, the probability of a non-prosodic-hierarchy boundary is 0.1, the probability of a prosodic word is 0.1, the probability of a prosodic phrase is 0.2, and the probability of an intonation phrase is 0.6. The prosodic hierarchy structure corresponding to the maximum posterior probability is taken as the labeling result, so the labeling result of the target word is an intonation phrase. The labeling result is a predicted result obtained during training and needs to be compared with the real result; that is, a loss function is adopted, and the loss between the two is minimized to determine the third model parameter of the acoustic neural network. The prosody hierarchy labeling model is obtained by training with the first model parameter, the second model parameter and the third model parameter combined. The prosody hierarchy labeling model adopts a stacked structure of a feedforward neural network and a bidirectional long short-term memory network, and can label three prosody hierarchies, namely prosodic words, prosodic phrases and intonation phrases, at the same time.
The loss function is used for measuring the degree of inconsistency between the predicted value and the real value of the model, and is a non-negative real-valued function. The loss function adopted in the present application may be cross entropy, or cross entropy with class weights.
It can be understood that the word embedding layer, the feedforward neural network and the bidirectional long short-term memory network are trained together. The word embedding layer is used for training word vectors, and the feedforward neural network is used for automatically extracting, from the original input features (part of speech, word length and post-word punctuation type), high-level feature representations that are more beneficial to the labeling task. These features are spliced together at the input of the bidirectional long short-term memory network, so that the text features and the acoustic features are utilized jointly.
The bidirectional long short-term memory network can learn the dependency relationships between contexts, and the labeling task also needs context information; for example, if the previous word is an intonation phrase boundary, the current word is unlikely to be an intonation phrase boundary. Therefore, the stacked structure of the feedforward neural network and the bidirectional long short-term memory network is utilized jointly, and a trainable word embedding layer is adopted, so that not only can the text and acoustic feature information be utilized, but high-level features can also be automatically extracted from the text features, and the context features are utilized, which makes the structure suitable for the prosody hierarchy structure labeling task.
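As a rough sketch (not the authoritative implementation), the stacked structure described above could be assembled as follows, assuming PyTorch. The 200-dimensional embedding, the feedforward text network, the bidirectional LSTM over the spliced features and the four boundary types follow the description above; the hidden sizes, the text feature dimension (here 90, matching the earlier three-word one-hot sketch), the acoustic feature dimension (here 14: two durations, ten word-end syllable statistics and two inter-word differences) and the class weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProsodyLabeler(nn.Module):
    # Sketch of the stacked structure: trainable word embedding layer, feedforward
    # text network, bidirectional LSTM over the spliced features, and a linear
    # output whose softmax gives the posterior over the four boundary types.
    def __init__(self, vocab_size, text_dim=90, acoustic_dim=14,
                 embed_dim=200, text_hidden=128, lstm_hidden=256, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)            # first model parameter
        self.text_net = nn.Sequential(                                  # second model parameter
            nn.Linear(text_dim, text_hidden), nn.ReLU(),
            nn.Linear(text_hidden, text_hidden), nn.ReLU())
        self.blstm = nn.LSTM(embed_dim + text_hidden + acoustic_dim,    # third model parameter
                             lstm_hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, word_ids, text_feats, acoustic_feats):
        # word_ids: (B, T); text_feats: (B, T, text_dim); acoustic_feats: (B, T, acoustic_dim)
        first_output = self.embedding(word_ids)        # word vectors (first output result)
        second_output = self.text_net(text_feats)      # high-level text features (second output result)
        spliced = torch.cat([first_output, second_output, acoustic_feats], dim=-1)
        hidden, _ = self.blstm(spliced)
        return self.out(hidden)                        # logits over the 4 prosody boundary types

# Cross entropy (optionally weighted, as mentioned above) over the per-word labels.
model = ProsodyLabeler(vocab_size=10000)
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.0, 1.5, 2.0]))
```

During training, the embedding weights, the feedforward weights and the bidirectional LSTM weights are optimized jointly by minimizing the (weighted) cross entropy, which corresponds to learning the first, second and third model parameters as one whole.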
Further, in the embodiment of the present application, a method for obtaining a prosody level labeling model through training is introduced, that is, three types of model parameters are obtained through training, which are a first model parameter, a second model parameter, and a third model parameter, respectively, and the first model parameter, the second model parameter, and the third model parameter are taken as a whole, and a prosody level labeling model is generated through training at the same time. Through the mode, the three parts of neural networks are stacked to form a complete prosody level labeling model and are used as a whole for model training, the training content comprises training between word identifications and word vectors, training of word texts and word text characteristics and training of audio and acoustic characteristics, and therefore richer characteristics can be obtained, and sentence labeling accuracy is improved.
Referring to fig. 11, fig. 11 is a schematic diagram of an embodiment of a prosody hierarchy labeling apparatus 30 according to the present application, which includes:
an obtaining module 301, configured to obtain text data to be labeled and audio data, where the text data to be labeled and the audio data have a corresponding relationship, the text data to be labeled includes at least one word, and each word corresponds to a word identifier;
an extracting module 302, configured to extract a to-be-labeled text feature set of each word according to the to-be-labeled text data obtained by the obtaining module 301, where the to-be-labeled text feature set includes a part of speech, a word length, and a post-word punctuation type;
the extracting module 302 is further configured to extract an acoustic feature set of each word according to the audio data acquired by the acquiring module 301, where the acoustic feature set includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value;
the prediction module 303 is configured to obtain a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word extracted by the extraction module 302, and the acoustic feature set of each word.
In this embodiment, the obtaining module 301 obtains text data to be labeled and audio data, where the text data to be labeled and the audio data have a corresponding relationship, the text data to be labeled includes at least one word, and each word corresponds to a word identifier; the extracting module 302 extracts the text feature set to be labeled of each word according to the text data to be labeled obtained by the obtaining module 301, where the text feature set to be labeled includes a part of speech, a word length and a post-word punctuation type; the extracting module 302 further extracts the acoustic feature set of each word according to the audio data obtained by the obtaining module 301, where the acoustic feature set includes a word-end syllable duration, a post-word pause duration, a word-end syllable acoustic statistical feature and an inter-word acoustic feature variation value; and the prediction module 303 obtains a prosody hierarchy structure through the prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word extracted by the extraction module 302, and the acoustic feature set of each word.
In the embodiment of the application, a prosody hierarchy labeling apparatus is provided, which obtains text data to be labeled and audio data, where the text data to be labeled and the audio data have a corresponding relationship, the text data to be labeled includes at least one word, and each word corresponds to a word identifier; extracts the text feature set to be labeled of each word according to the text data to be labeled, where the text feature set to be labeled includes a part of speech, a word length and a post-word punctuation type; extracts the acoustic feature set of each word according to the audio data, where the acoustic feature set includes a word-end syllable duration, a post-word pause duration, a word-end syllable acoustic statistical feature and an inter-word acoustic feature variation value; and finally, according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word, obtains a prosody hierarchy structure through the prosody hierarchy labeling model. In this way, the prosody hierarchy labeling model is established by combining the text features and the acoustic features, richer features can be provided for labeling prosody hierarchies, and the accuracy of prosody hierarchy labeling can be improved by adopting a more accurate prosody hierarchy labeling model.
Optionally, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the prosody hierarchy labeling apparatus 30 provided in the embodiment of the present application,
The prediction module 303 is specifically configured to determine at least one of a prosodic word, a prosodic phrase and an intonation phrase through the prosody hierarchy labeling model;
or, alternatively,
and determining prosodic words and/or prosodic phrases through the prosodic hierarchy annotation model.
Secondly, in the embodiment of the present application, two common prosody hierarchy labeling methods are introduced, one is to determine prosodic words, prosodic phrases and intonation phrases through a prosody hierarchy labeling model, and the other is to determine prosodic words and prosodic phrases through a prosody hierarchy labeling model. Through the mode, the user can select a more detailed labeling scheme with three-layer prosody hierarchical structures of prosodic words, prosodic phrases and intonation phrases, and can also select a labeling scheme with two-layer prosody hierarchical structures of the prosodic words and the prosodic phrases. Therefore, the prosody hierarchy output can be selected according to the requirement, and the flexibility of the scheme is improved.
Referring to fig. 12, fig. 12 is a schematic view of an embodiment of the model training apparatus according to the embodiment of the present application, and the model training apparatus 40 includes:
an obtaining module 401, configured to obtain text data to be trained and audio data to be trained, where the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained includes at least one word, and each word corresponds to a word identifier;
an extracting module 402, configured to extract a to-be-trained text feature set of each word according to the to-be-trained text data acquired by the acquiring module 401, where the to-be-trained text feature set includes a part of speech, a word length, and a word post-punctuation type;
the extracting module 402 is further configured to extract an acoustic feature set to be trained of each word according to the audio data to be trained acquired by the acquiring module 401, where the acoustic feature set to be trained includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value;
a training module 403, configured to train the word identifier corresponding to each word, the text feature set to be trained of each word extracted by the extraction module 402, and the acoustic feature set to be trained of each word, so as to obtain a prosody hierarchy labeling model, where the prosody hierarchy labeling model is used to label a prosody hierarchy structure.
In this embodiment, an obtaining module 401 obtains text data to be trained and audio data to be trained, where the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained includes at least one word, and each word corresponds to a word identifier, an extracting module 402 extracts a text feature set to be trained of each word according to the text data to be trained obtained by the obtaining module 401, where the text feature set to be trained includes a part of speech, a word length, and a post-word punctuation type, the extracting module 402 extracts an acoustic feature set to be trained of each word according to the audio data to be trained obtained by the obtaining module 401, where the acoustic feature set to be trained includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value, the training module 403 trains the word identifier corresponding to each word, the text feature set to be trained of each word extracted by the extraction module 402, and the acoustic feature set to be trained of each word, to obtain a prosody hierarchy labeling model, where the prosody hierarchy labeling model is used to label a prosody hierarchy.
In the embodiment of the application, a method for model training is provided, which includes firstly, obtaining text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, each word corresponds to a word identifier, then extracting a feature set of the text to be trained of each word according to the text data to be trained, wherein the feature set of the text to be trained includes a part of speech, a word length and a post-word punctuation type, and extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained includes a word end syllable duration, a post-word pause duration, a word end acoustic statistical feature and an inter-word acoustic feature variation value, and finally training the word identifier corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word, and obtaining a rhythm level labeling model. Through the mode, the prosody hierarchy labeling model is established by combining the text features and the acoustic features, richer features can be provided for prosody hierarchy labeling tasks, more accurate prosody hierarchy labeling models can be adopted to improve the accuracy of prosody hierarchy labeling, and the effect of voice synthesis is improved.
Optionally, on the basis of the embodiment corresponding to fig. 12, please refer to fig. 13, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application, the model training apparatus 40 further includes a processing module 404 and a generating module 405;
the processing module 404 is configured to perform word segmentation processing on the text data to be trained after the obtaining module 401 obtains the text data to be trained and the audio data to be trained, so as to obtain at least one word;
the obtaining module 401 is further configured to obtain a target word identifier corresponding to a target word according to a preset word identifier relationship, where the preset word identifier relationship is used to indicate a relationship between each preset word and a word identifier, and the target word belongs to any one of the at least one word processed by the processing module;
the generating module 405 is configured to generate a target word vector corresponding to the target word in the text data to be trained;
the training module 403 is specifically configured to train the target word identifier obtained by the obtaining module 401 and the target word vector generated by the generating module 405 to obtain a first model parameter, where the first model parameter is used to generate a word embedding layer in the prosody level labeling model.
Secondly, in the embodiment of the application, a method for training the word embedding layer is introduced: word segmentation processing is performed on the text data to be trained, the target word identifier corresponding to the target word is obtained according to the preset word identifier relationship, the target word vector corresponding to the target word in the text data to be trained is generated, and then the target word identifier and the target word vector are trained to obtain the first model parameter, where the first model parameter is used for generating the word embedding layer in the prosody hierarchy labeling model. In this way, the word embedding layer in the prosody hierarchy labeling model can be obtained through direct training, and the other neural networks in the prosody hierarchy labeling model can be trained at the same time as the word embedding layer, so that the process of additionally training a word vector model with a separate neural network is saved, and the training efficiency is improved.
Alternatively, on the basis of the embodiment corresponding to fig. 12, in another embodiment of the model training device 40 provided in the embodiment of the present application,
the extraction module 402 is specifically configured to obtain a part of speech, a word length, and a post-word punctuation type of a target word in the text data to be trained, where the part of speech represents a result of a grammar classification of the word, the word length represents a word number of the word, and the post-word punctuation type is used to represent a punctuation type corresponding to the post-word;
acquiring the part of speech, word length and post-word punctuation types of associated words in the text data to be trained, wherein the associated words are words having an association relation with the target words;
the training module 403 is specifically configured to train the part of speech, the word length, and the post-word punctuation type of the target word and the part of speech, the word length, and the post-word punctuation type of the associated word to obtain a second model parameter, where the second model parameter is used to generate a text neural network in the prosody level labeling model.
Secondly, in the embodiment of the application, a method for training a text neural network is introduced, namely, the part of speech, the word length and the post-word punctuation type of a target word in text data to be trained are firstly obtained, the part of speech, the word length and the post-word punctuation type of a related word in the text data to be trained are also obtained, then the part of speech, the word length and the post-word punctuation type of the target word and the part of speech, the word length and the post-word punctuation type of the related word are trained to obtain second model parameters, and the second model parameters are used for generating the text neural network in a rhythm level labeling model. By the mode, the system can automatically learn the high-level feature expression which is favorable for prosody hierarchy structure labeling through the neural network, and automatically learn the high-level feature which is favorable for labeling from the originally input text feature set, so that the accuracy of the prosody hierarchy labeling model is improved.
Optionally, on the basis of the embodiment corresponding to fig. 12, please refer to fig. 14, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application, the model training apparatus 40 further includes an alignment module 406;
the alignment module 406 is configured to, after the obtaining module 401 obtains text data to be trained and audio data to be trained, perform forced alignment on the text data to be trained and the audio data to be trained to obtain a time-aligned text;
the extracting module 402 is specifically configured to determine the word end syllable duration of the target word according to the time alignment text.
Secondly, in the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, the text data to be trained and the audio data to be trained are forcibly aligned to obtain a time-aligned text, and the word end syllable duration of the target word is determined according to the time-aligned text. By the method, the time-aligned text can be obtained, the word end syllable duration is extracted, the word end syllable duration is used as one item in the acoustic feature set, and high-level features beneficial to labeling are automatically learned from the originally input acoustic feature set, so that the accuracy of the rhythm level labeling model is improved.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application,
the extracting module 402 is specifically configured to determine a post-word pause duration of the target word according to the time-aligned text.
In the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, the text data to be trained and the audio data to be trained are aligned forcibly to obtain a time aligned text, and then the post-word pause duration can be determined according to the time aligned text. By the method, the post-word pause duration of each word can be determined after the text data and the audio data are aligned forcibly, the post-word pause duration is used as one item in the acoustic feature set, and high-level features beneficial to labeling are automatically learned from the acoustic feature set input originally, so that the accuracy of the rhythm level labeling model is improved.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application,
the extracting module 402 is specifically configured to calculate, according to the time-aligned text and the fundamental frequency information extracted from the audio data to be trained, a frame number of a voiced beginning frame and a frame number of a voiced ending frame of the end-of-word syllable of the target word;
extracting a logarithmic fundamental frequency curve and a logarithmic energy curve of the audio data to be trained;
and calculating the acoustic statistical characteristics of the word-end syllable of the target word according to the frame number of the voiced start frame of the word-end syllable of the target word, the frame number of the voiced end frame, the logarithmic fundamental frequency curve and the logarithmic energy curve, wherein the acoustic statistical characteristics of the word-end syllable comprise at least one of the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic fundamental frequency curve, and further comprise at least one of the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic energy curve.
In the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, the text data to be trained and the audio data to be trained are forcibly aligned to obtain a time aligned text, then the frame number of the final syllable voiced initial frame and the frame number of the voiced end frame of the target word are obtained through calculation according to the time aligned text and the fundamental frequency information extracted from the audio data to be trained, the logarithmic fundamental frequency curve and the logarithmic energy curve of the audio data to be trained are extracted, and finally the acoustic statistical characteristics of the final syllable of the target word are obtained through calculation according to the frame number of the final syllable voiced initial frame, the frame number of the voiced end frame, the logarithmic fundamental frequency curve and the logarithmic energy curve of the target word. By the method, the time-aligned text data is obtained, the frame numbers of the initial frame and the ending frame of the voiced speech segment at the end of the word can be obtained according to the fundamental frequency information extracted from the audio, and the high-level features beneficial to labeling are automatically learned from the acoustic feature set input originally, so that the accuracy of the rhythm level labeling model is improved.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application,
the extracting module 402 is specifically configured to calculate, according to the time-aligned text and fundamental frequency information extracted from the audio data to be trained, a frame number of a last voiced frame of the target word and a frame number of a voiced frame of a next adjacent word prefix of the target word;
determining a fundamental frequency value and an energy value between the end voiced frame of the target word and the next adjacent word beginning voiced frame according to the frame number of the last voiced frame of the target word, the frame number of the voiced frame of the next adjacent word beginning of the target word, and fundamental frequency information and energy information which are extracted from the audio data to be trained in a framing manner;
and calculating to obtain a logarithmic difference value of the fundamental frequency value according to the fundamental frequency value between the suffix voiced frame of the target word and the next adjacent word prefix voiced frame, and calculating to obtain a logarithmic difference value of the energy value according to the energy value between the suffix voiced frame of the target word and the next adjacent word prefix voiced frame, wherein the logarithmic difference value of the fundamental frequency value and the logarithmic difference value of the energy value belong to the acoustic feature change value between words.
In the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, they are forcibly aligned to obtain a time-aligned text. Then, according to the frame number of the last voiced frame of the target word, the frame number of the voiced frame at the beginning of the next adjacent word, and the fundamental frequency and energy data extracted frame by frame from the audio, the fundamental frequency values and energy values of the voiced frame at the end of the target word and of the voiced frame at the beginning of the next word are determined, and the logarithmic difference of the fundamental frequency values and the logarithmic difference of the energy values are calculated and used as the inter-word acoustic feature change value. In this way, high-level features beneficial to labeling can be automatically learned from the acoustic feature set of the original input, so that the accuracy of the prosody hierarchy labeling model is improved.
Optionally, on the basis of the embodiments corresponding to fig. 12, fig. 13 or fig. 14, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application,
the training module 403 is specifically configured to obtain a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, where the target word identifier corresponds to a target word, the target word belongs to any word in the at least one word, and the word embedding layer is obtained through training according to a first model parameter;
acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter;
training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model;
and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
Further, in the embodiment of the present application, a method for obtaining a prosody level labeling model through training is introduced, that is, three types of model parameters are obtained through training, which are a first model parameter, a second model parameter, and a third model parameter, respectively, and the first model parameter, the second model parameter, and the third model parameter are taken as a whole, and a prosody level labeling model is generated through training at the same time. Through the mode, the three parts of neural networks are stacked to form a complete prosody level labeling model and are used as a whole for model training, the training content comprises training between word identifications and word vectors, training of word texts and word text characteristics and training of audio and acoustic characteristics, and therefore richer characteristics can be obtained, and sentence labeling accuracy is improved.
As shown in fig. 15, for convenience of description, only the relevant parts of the embodiments of the present application are shown, and details of the specific technology are not disclosed, please refer to the method part of the embodiments of the present application. The terminal device may be any terminal device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a point of sale (POS), a vehicle-mounted computer, and the like, taking the terminal device as the mobile phone as an example:
fig. 15 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 15, the cellular phone includes: radio Frequency (RF) circuitry 510, memory 520, input unit 530, display unit 540, sensor 550, audio circuitry 560, wireless fidelity (WiFi) module 570, processor 580, and power supply 590. Those skilled in the art will appreciate that the handset configuration shown in fig. 15 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 15:
RF circuit 510 may be used for receiving and transmitting signals during information transmission and reception or during a call; in particular, after receiving downlink information of a base station, the RF circuit 510 sends the downlink information to the processor 580 for processing, and in addition, transmits uplink data to the base station. In general, RF circuit 510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, RF circuit 510 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short message service (SMS), etc.
The memory 520 may be used to store software programs and modules, and the processor 580 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 520. The memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. Further, the memory 520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 530 may include a touch panel 531 and other input devices 532. The touch panel 531, also called a touch screen, can collect touch operations of a user on or near the touch panel 531 (for example, operations of the user on or near the touch panel 531 by using any suitable object or accessory such as a finger or a stylus pen), and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 531 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, and sends the touch point coordinates to the processor 580, and can receive and execute commands sent by the processor 580. In addition, the touch panel 531 may be implemented by various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 530 may include other input devices 532 in addition to the touch panel 531. In particular, other input devices 532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 540 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The display unit 540 may include a display panel 541; optionally, the display panel 541 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 531 may cover the display panel 541; when the touch panel 531 detects a touch operation on or near it, the touch operation is transmitted to the processor 580 to determine the type of the touch event, and the processor 580 then provides a corresponding visual output on the display panel 541 according to the type of the touch event. Although the touch panel 531 and the display panel 541 are shown as two separate components in fig. 15 to implement the input and output functions of the mobile phone, in some embodiments the touch panel 531 and the display panel 541 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 550, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 541 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 541 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
The audio circuit 560, speaker 561, and microphone 562 may provide an audio interface between the user and the mobile phone. The audio circuit 560 may convert received audio data into an electrical signal and transmit it to the speaker 561, which converts the electrical signal into a sound signal for output; on the other hand, the microphone 562 converts collected sound signals into electrical signals, which are received by the audio circuit 560 and converted into audio data. The audio data are then output to the processor 580 for processing and afterwards sent through the RF circuit 510 to, for example, another mobile phone, or output to the memory 520 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 570, the mobile phone can help the user send and receive e-mail, browse web pages, access streaming media, and so on, providing wireless broadband internet access for the user. Although fig. 15 shows the WiFi module 570, it is understood that the module is not an essential part of the mobile phone and may be omitted as needed without changing the essence of the invention.
The processor 580 is a control center of the mobile phone, connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 520 and calling data stored in the memory 520, thereby performing overall monitoring of the mobile phone. Alternatively, processor 580 may include one or more processing units; optionally, processor 580 may integrate an application processor, which handles primarily the operating system, user interface, applications, etc., and a modem processor, which handles primarily the wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 580.
The handset also includes a power supply 590 (e.g., a battery) for powering the various components, which may optionally be logically connected to the processor 580 via a power management system, such that the power management system may be used to manage charging, discharging, and power consumption.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In the embodiment of the present application, the processor 580 included in the terminal device further has the following functions:
acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be annotated of each word according to the text data to be annotated, wherein the text feature set to be annotated comprises part of speech, word length and post-word punctuation types;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
and acquiring a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word.
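For illustration, the per-word inputs listed above can be collected into a simple record before being fed to the prosodic hierarchy labeling model; the following Python sketch shows only an assumed data layout, not a structure defined by this application.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class WordFeatures:
    # Inputs gathered for one word of the text to be labeled (illustrative layout).
    word_id: int                      # word identifier
    pos: str                          # part of speech
    word_length: int                  # word length
    post_punct: str                   # post-word punctuation type, e.g. "none", "comma", "period"
    end_syllable_duration: float      # word-end syllable duration (seconds)
    post_word_pause: float            # post-word pause duration (seconds)
    f0_stats: Tuple[float, ...]       # word-end syllable log-F0 statistics (max, min, range, mean, variance)
    energy_stats: Tuple[float, ...]   # word-end syllable log-energy statistics
    delta_log_f0: float               # inter-word log-F0 change value
    delta_log_energy: float           # inter-word log-energy change value

# One word of a sentence (all numbers invented for illustration):
example = WordFeatures(word_id=1024, pos="n", word_length=2, post_punct="comma",
                       end_syllable_duration=0.21, post_word_pause=0.15,
                       f0_stats=(5.4, 5.1, 0.3, 5.2, 0.01),
                       energy_stats=(4.0, 3.6, 0.4, 3.8, 0.02),
                       delta_log_f0=-0.12, delta_log_energy=-0.30)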
In the embodiment of the present application, the processor 580 included in the terminal device further has the following functions:
acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be trained of each word according to the text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
and training the word identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure.
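As an illustrative sketch of the step of extracting the text feature set to be trained (part of speech, word length and post-word punctuation type), the Python code below assumes that word segmentation and part-of-speech tags are already available from some tagger, which this application does not prescribe; the punctuation mapping is likewise only an assumption.

# tagged_words: list of (word, part_of_speech) pairs from any Chinese word segmenter
# and POS tagger (punctuation marks appear as their own tokens).
PUNCT_TYPES = {"，": "comma", ",": "comma", "。": "period", ".": "period",
               "？": "question", "?": "question", "！": "exclamation", "!": "exclamation"}

def text_feature_set(tagged_words):
    feats = []
    for i, (word, pos) in enumerate(tagged_words):
        if word in PUNCT_TYPES:
            continue                                   # punctuation is not itself a labelled word
        following = tagged_words[i + 1][0] if i + 1 < len(tagged_words) else ""
        feats.append({
            "word": word,
            "pos": pos,                                          # part of speech
            "word_length": len(word),                            # word length (number of characters)
            "post_punct": PUNCT_TYPES.get(following, "none"),    # post-word punctuation type
        })
    return feats

# Example:
# text_feature_set([("今天", "t"), ("天气", "n"), ("很", "d"), ("好", "a"), ("，", "x"),
#                   ("我们", "r"), ("出去", "v"), ("走走", "v"), ("吧", "y"), ("。", "x")])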
Fig. 16 is a schematic structural diagram of a server according to an embodiment of the present application. The server 600 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 622 (e.g., one or more processors), a memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) for storing application programs 642 or data 644. The memory 632 and the storage medium 630 may be transient or persistent storage. The programs stored in the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 622 may be configured to communicate with the storage medium 630 and execute, on the server 600, the series of instruction operations stored in the storage medium 630.
The server 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 16.
In the embodiment of the present application, the CPU 622 included in the server also has the following functions:
acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be annotated of each word according to the text data to be annotated, wherein the text feature set to be annotated comprises part of speech, word length and post-word punctuation types;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
and acquiring a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word.
In the embodiment of the present application, the CPU 622 included in the server also has the following functions:
acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be trained of each word according to the text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
and training the word identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure.
With continuing research and progress in artificial intelligence technology, artificial intelligence has been developed and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. A method for prosodic hierarchy annotation, comprising:
acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be labeled of each word according to the text data to be labeled, wherein the text feature set to be labeled comprises part of speech, word length and post-word punctuation types, and the audio data is voice data;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
obtaining a prosodic hierarchy structure through a prosodic hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word, wherein the prosodic hierarchy structure comprises at least one of a prosodic word, a prosodic phrase and a intonation phrase, or the prosodic hierarchy structure comprises at least one of a prosodic word and a prosodic phrase, the prosodic hierarchy labeling model is obtained by training according to the word identifier corresponding to the word, the text feature set to be trained of the word and the acoustic feature set to be trained of the word, and the training process of the prosodic hierarchy labeling model is as follows: obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter; acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter; training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model; and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
2. A method of model training, comprising:
acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, each word corresponds to a word identifier, and the audio data to be trained is voice data;
extracting a text feature set to be trained of each word according to the text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
training the word identifier corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model, wherein the prosodic hierarchy labeling model is used for labeling a prosodic hierarchy structure, and the prosodic hierarchy structure comprises at least one of prosodic words, prosodic phrases and intonation phrases, or the prosodic hierarchy structure comprises at least one of prosodic words and prosodic phrases;
training the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model, including:
obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter;
acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter;
training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model;
and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
3. The method of claim 2, wherein after obtaining the text data to be trained and the audio data to be trained, the method further comprises:
performing word segmentation processing on the text data to be trained to obtain at least one word;
acquiring a target word identifier corresponding to a target word according to a preset word identifier relationship, wherein the preset word identifier relationship is used for representing a preset relationship between each word and the word identifier, and the target word belongs to any one of the at least one word;
generating a target word vector corresponding to the target word in the text data to be trained;
the training of the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model includes:
training the target word identifier and the target word vector to obtain a first model parameter, wherein the first model parameter is used for generating a word embedding layer in the prosodic hierarchy labeling model, and the word embedding layer is updated at a target time.
4. The method according to claim 2, wherein the extracting a feature set of the text to be trained for each word according to the text data to be trained comprises:
acquiring the part of speech, word length and post-word punctuation type of target words in the text data to be trained, wherein the part of speech represents a grammatical classification of a word, the word length represents the number of characters in a word, and the post-word punctuation type represents the type of punctuation that follows a word;
acquiring the part of speech, word length and post-word punctuation types of associated words in the text data to be trained, wherein the associated words are words having an association relation with the target words;
the training of the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model includes:
training the part of speech, word length and post-word punctuation types of the target words and the part of speech, word length and post-word punctuation types of the associated words by adopting a loss function;
and when the loss function reaches the minimum value, obtaining a second model parameter, wherein the second model parameter is used for generating a text neural network in the prosody hierarchy labeling model.
5. The method of claim 2, wherein after obtaining the text data to be trained and the audio data to be trained, the method further comprises:
performing forced alignment on the text data to be trained and the audio data to be trained to obtain a time-aligned text;
the extracting the acoustic feature set to be trained of each word according to the audio data to be trained includes:
and determining the word end syllable duration of the target word according to the time alignment text.
6. The method of claim 5, wherein the extracting the set of acoustic features to be trained for each word from the audio data to be trained comprises:
and determining post-word pause duration of the target word according to the time alignment text.
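By way of illustration, assuming the time-aligned text is available as per-syllable start and end times in seconds (the concrete alignment format is not fixed here), the word-end syllable duration of claim 5 and the post-word pause duration of claim 6 can be computed as in the following Python sketch:

def duration_features(aligned_words):
    # aligned_words: list of {"word": str, "syllables": [{"start": s, "end": e}, ...]} entries
    feats = []
    for i, w in enumerate(aligned_words):
        last_syl = w["syllables"][-1]
        end_syl_dur = last_syl["end"] - last_syl["start"]            # word-end syllable duration
        if i + 1 < len(aligned_words):
            pause = aligned_words[i + 1]["syllables"][0]["start"] - last_syl["end"]
        else:
            pause = 0.0                                               # no following word
        feats.append({"word": w["word"],
                      "end_syllable_duration": end_syl_dur,
                      "post_word_pause": max(pause, 0.0)})            # post-word pause duration
    return feats

# Example with a hypothetical alignment of two words:
# duration_features([
#     {"word": "今天", "syllables": [{"start": 0.00, "end": 0.18}, {"start": 0.18, "end": 0.40}]},
#     {"word": "天气", "syllables": [{"start": 0.55, "end": 0.74}, {"start": 0.74, "end": 0.95}]},
# ])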
7. The method of claim 5, wherein the extracting the set of acoustic features to be trained for each word from the audio data to be trained comprises:
calculating the frame number of the first voiced frame and the frame number of the last voiced frame of the word-end syllable of the target word according to the time alignment text and the fundamental frequency information extracted from the audio data to be trained;
extracting a logarithmic fundamental frequency curve and a logarithmic energy curve of the audio data to be trained;
and calculating the acoustic statistical features of the word-end syllable of the target word according to the frame number of the first voiced frame of the word-end syllable of the target word, the frame number of the last voiced frame, the logarithmic fundamental frequency curve and the logarithmic energy curve, wherein the acoustic statistical features of the word-end syllable comprise at least one of the maximum value, the minimum value, the range, the mean value and the variance of the logarithmic fundamental frequency curve, and further comprise at least one of the maximum value, the minimum value, the range, the mean value and the variance of the logarithmic energy curve.
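An illustrative sketch of the statistics described above, assuming frame-level logarithmic fundamental frequency and logarithmic energy arrays (one value per analysis frame) and the frame numbers of the first and last voiced frames of the word-end syllable; the array layout and frame shift are assumptions:

import numpy as np

def end_syllable_stats(log_f0, log_energy, voiced_start, voiced_end):
    # log_f0, log_energy: 1-D arrays indexed by frame; voiced_start/voiced_end: frame numbers
    # of the first and last voiced frames of the word-end syllable.
    f0_seg = np.asarray(log_f0)[voiced_start:voiced_end + 1]
    en_seg = np.asarray(log_energy)[voiced_start:voiced_end + 1]
    stats = {}
    for name, seg in (("log_f0", f0_seg), ("log_energy", en_seg)):
        stats[name] = {"max": float(seg.max()), "min": float(seg.min()),
                       "range": float(seg.max() - seg.min()),
                       "mean": float(seg.mean()), "variance": float(seg.var())}
    return stats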
8. The method of claim 5, wherein the extracting the set of acoustic features to be trained for each word from the audio data to be trained comprises:
calculating the frame number of the last voiced frame of the target word and the frame number of the word-initial voiced frame of the next adjacent word of the target word according to the time alignment text and the fundamental frequency information extracted from the audio data to be trained;
determining a fundamental frequency value and an energy value between the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word according to the frame number of the last voiced frame of the target word, the frame number of the word-initial voiced frame of the next adjacent word, and the fundamental frequency information and energy information extracted frame by frame from the audio data to be trained;
and calculating a logarithmic difference of the fundamental frequency values according to the fundamental frequency values of the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word, and calculating a logarithmic difference of the energy values according to the energy values of the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word, wherein the logarithmic difference of the fundamental frequency values and the logarithmic difference of the energy values belong to the inter-word acoustic feature variation values.
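An illustrative sketch of the inter-word change values described above, assuming frame-level fundamental frequency and energy arrays in linear scale and the frame numbers of the two voiced frames obtained from the alignment (both frames are voiced, so the values are positive):

import math

def inter_word_changes(f0, energy, cur_final_voiced, next_initial_voiced):
    # f0, energy: per-frame values; cur_final_voiced / next_initial_voiced: frame numbers of
    # the word-final voiced frame of the current word and the word-initial voiced frame of the next word.
    delta_log_f0 = math.log(f0[next_initial_voiced]) - math.log(f0[cur_final_voiced])
    delta_log_energy = math.log(energy[next_initial_voiced]) - math.log(energy[cur_final_voiced])
    return delta_log_f0, delta_log_energy     # inter-word acoustic feature variation values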
9. A prosodic hierarchy labeling apparatus, comprising:
the system comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring text data to be labeled and audio data, the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, each word corresponds to a word identifier, and the audio data is voice data;
the extraction module is used for extracting a text feature set to be labeled of each word according to the text data to be labeled obtained by the obtaining module, wherein the text feature set to be labeled comprises part of speech, word length and word post punctuation types;
the extraction module is further configured to extract an acoustic feature set of each word according to the audio data acquired by the acquisition module, where the acoustic feature set includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value;
the prediction module is configured to obtain a prosodic hierarchy structure through a prosodic hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word extracted by the extraction module, and the acoustic feature set of each word, where the prosodic hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and a intonation phrase, or the prosodic hierarchy structure includes at least one of a prosodic word and a prosodic phrase, the prosodic hierarchy labeling model is obtained by training according to the word identifier corresponding to the word, the text feature set to be trained of the word, and the acoustic feature set to be trained of the word, and the training process of the prosodic hierarchy labeling model is as follows: obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter; acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter; training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model; and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
10. A model training apparatus, comprising:
the training device comprises an acquisition module, a comparison module and a comparison module, wherein the acquisition module is used for acquiring text data to be trained and audio data to be trained, the text data to be trained and the audio data to be trained have a corresponding relation, the text data to be trained comprises at least one word, each word corresponds to a word identifier, and the audio data to be trained is voice data;
the extraction module is used for extracting a text feature set to be trained of each word according to the text data to be trained acquired by the acquisition module, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation types;
the extraction module is further configured to extract an acoustic feature set to be trained of each word according to the audio data to be trained acquired by the acquisition module, where the acoustic feature set to be trained includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value;
a training module, configured to train a word identifier corresponding to each word, the text feature set to be trained of each word extracted by the extraction module, and the acoustic feature set to be trained of each word, so as to obtain a prosodic hierarchy labeling model, where the prosodic hierarchy labeling model is used to label a prosodic hierarchy structure, and the prosodic hierarchy structure includes at least one of prosodic words, prosodic phrases, and intonation phrases, or the prosodic hierarchy structure includes at least one of prosodic words and prosodic phrases;
wherein the training module is specifically configured to:
obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter;
acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter;
training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model;
and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
11. A terminal device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory and comprises the following steps:
acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be labeled of each word according to the text data to be labeled, wherein the text feature set to be labeled comprises part of speech, word length and post-word punctuation types, and the audio data is voice data;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
obtaining a prosodic hierarchy structure through a prosodic hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word, wherein the prosodic hierarchy structure comprises at least one of a prosodic word, a prosodic phrase and a intonation phrase, or the prosodic hierarchy structure comprises at least one of a prosodic word and a prosodic phrase, the prosodic hierarchy labeling model is obtained by training according to the word identifier corresponding to the word, the text feature set to be trained of the word and the acoustic feature set to be trained of the word, and the training process of the prosodic hierarchy labeling model is as follows: obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter; acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter; training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model; generating the prosody hierarchy labeling model according to the first model parameter, the second model parameter and the third model parameter;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
12. A server, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory and comprises the following steps:
acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, each word corresponds to a word identifier, and the audio data to be trained is voice data;
extracting a text feature set to be trained of each word according to the text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
training the word identifier corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model, wherein the prosodic hierarchy labeling model is used for labeling a prosodic hierarchy structure, and the prosodic hierarchy structure comprises at least one of prosodic words, prosodic phrases and intonation phrases, or the prosodic hierarchy structure comprises at least one of prosodic words and prosodic phrases;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate;
training the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model, including:
obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter;
acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter;
training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model;
and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
13. An intelligent voice interaction system is characterized by comprising a voice acquisition module, a voice processing and analyzing module and a storage module;
the voice acquisition module is used for acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
the voice processing and analyzing module is used for extracting a text feature set to be labeled of each word according to the text data to be labeled, wherein the text feature set to be labeled comprises part of speech, word length and post-word punctuation types, and the audio data is voice data;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
obtaining a prosodic hierarchy structure through a prosodic hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word, wherein the prosodic hierarchy structure comprises at least one of a prosodic word, a prosodic phrase and a intonation phrase, or the prosodic hierarchy structure comprises at least one of a prosodic word and a prosodic phrase, the prosodic hierarchy labeling model is obtained by training according to the word identifier corresponding to the word, the text feature set to be trained of the word and the acoustic feature set to be trained of the word, and the training process of the prosodic hierarchy labeling model is as follows: obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter; acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter; training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model; generating the prosody hierarchy labeling model according to the first model parameter, the second model parameter and the third model parameter;
the storage module is used for storing the prosody hierarchy.
14. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of claim 1, or perform the method of any of claims 2 to 8.
CN201910751371.6A 2019-01-22 2019-01-22 Rhythm level labeling method, model training method and device Active CN110444191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910751371.6A CN110444191B (en) 2019-01-22 2019-01-22 Rhythm level labeling method, model training method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910751371.6A CN110444191B (en) 2019-01-22 2019-01-22 Rhythm level labeling method, model training method and device
CN201910060152.3A CN109697973A (en) 2019-01-22 2019-01-22 A kind of method, the method and device of model training of prosody hierarchy mark

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910060152.3A Division CN109697973A (en) 2019-01-22 2019-01-22 A kind of method, the method and device of model training of prosody hierarchy mark

Publications (2)

Publication Number Publication Date
CN110444191A CN110444191A (en) 2019-11-12
CN110444191B true CN110444191B (en) 2021-11-26

Family

ID=66234262

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910751371.6A Active CN110444191B (en) 2019-01-22 2019-01-22 Rhythm level labeling method, model training method and device
CN201910060152.3A Pending CN109697973A (en) 2019-01-22 2019-01-22 A kind of method, the method and device of model training of prosody hierarchy mark

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910060152.3A Pending CN109697973A (en) 2019-01-22 2019-01-22 A kind of method, the method and device of model training of prosody hierarchy mark

Country Status (1)

Country Link
CN (2) CN110444191B (en)

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020218635A1 (en) * 2019-04-23 2020-10-29 LG Electronics Inc. Voice synthesis apparatus using artificial intelligence, method for operating voice synthesis apparatus, and computer-readable recording medium
CN110164413B (en) * 2019-05-13 2021-06-04 北京百度网讯科技有限公司 Speech synthesis method, apparatus, computer device and storage medium
CN110619035B (en) * 2019-08-01 2023-07-25 平安科技(深圳)有限公司 Method, device, equipment and storage medium for identifying keywords in interview video
CN112528014B (en) * 2019-08-30 2023-04-18 成都启英泰伦科技有限公司 Method and device for predicting word segmentation, part of speech and rhythm of language text
CN110556093B (en) * 2019-09-17 2021-12-10 浙江同花顺智富软件有限公司 Voice marking method and system
CN110459202B (en) * 2019-09-23 2022-03-15 浙江同花顺智能科技有限公司 Rhythm labeling method, device, equipment and medium
CN110675896B (en) * 2019-09-30 2021-10-22 北京字节跳动网络技术有限公司 Character time alignment method, device and medium for audio and electronic equipment
CN110797005B (en) * 2019-11-05 2022-06-10 百度在线网络技术(北京)有限公司 Prosody prediction method, apparatus, device, and medium
CN110767213A (en) * 2019-11-08 2020-02-07 四川长虹电器股份有限公司 Rhythm prediction method and device
CN112863476A (en) * 2019-11-27 2021-05-28 阿里巴巴集团控股有限公司 Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing
CN111164674A (en) * 2019-12-31 2020-05-15 深圳市优必选科技股份有限公司 Speech synthesis method, device, terminal and storage medium
CN111128120B (en) * 2019-12-31 2022-05-10 思必驰科技股份有限公司 Text-to-speech method and device
WO2021134581A1 (en) * 2019-12-31 2021-07-08 深圳市优必选科技股份有限公司 Prosodic feature prediction-based speech synthesis method, apparatus, terminal, and medium
CN113129863A (en) * 2019-12-31 2021-07-16 科大讯飞股份有限公司 Voice time length prediction method, device, equipment and readable storage medium
CN111261162B (en) * 2020-03-09 2023-04-18 北京达佳互联信息技术有限公司 Speech recognition method, speech recognition apparatus, and storage medium
CN111369971B (en) * 2020-03-11 2023-08-04 北京字节跳动网络技术有限公司 Speech synthesis method, device, storage medium and electronic equipment
CN111681641B (en) * 2020-05-26 2024-02-06 微软技术许可有限责任公司 Phrase-based end-to-end text-to-speech (TTS) synthesis
CN111710326B (en) * 2020-06-12 2024-01-23 携程计算机技术(上海)有限公司 English voice synthesis method and system, electronic equipment and storage medium
CN111754978B (en) * 2020-06-15 2023-04-18 北京百度网讯科技有限公司 Prosodic hierarchy labeling method, device, equipment and storage medium
CN111667816B (en) 2020-06-15 2024-01-23 北京百度网讯科技有限公司 Model training method, speech synthesis method, device, equipment and storage medium
CN111785247A (en) * 2020-07-13 2020-10-16 北京字节跳动网络技术有限公司 Voice generation method, device, equipment and computer readable medium
CN114064964A (en) * 2020-07-30 2022-02-18 华为技术有限公司 Text time labeling method and device, electronic equipment and readable storage medium
CN112102847B (en) * 2020-09-09 2022-08-09 四川大学 Audio and slide content alignment method
CN112216267A (en) * 2020-09-15 2021-01-12 北京捷通华声科技股份有限公司 Rhythm prediction method, device, equipment and storage medium
CN112466277B (en) * 2020-10-28 2023-10-20 北京百度网讯科技有限公司 Prosody model training method and device, electronic equipment and storage medium
CN112382270A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Speech synthesis method, apparatus, device and storage medium
CN112863484B (en) * 2021-01-25 2024-04-09 中国科学技术大学 Prosodic phrase boundary prediction model training method and prosodic phrase boundary prediction method
CN113178188A (en) * 2021-04-26 2021-07-27 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113421550A (en) * 2021-06-25 2021-09-21 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN113421543A (en) * 2021-06-30 2021-09-21 深圳追一科技有限公司 Data labeling method, device and equipment and readable storage medium
CN113327615B (en) * 2021-08-02 2021-11-16 北京世纪好未来教育科技有限公司 Voice evaluation method, device, equipment and storage medium
CN114420089B (en) * 2022-03-30 2022-06-21 北京世纪好未来教育科技有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN115116428B (en) * 2022-05-19 2024-03-15 腾讯科技(深圳)有限公司 Prosodic boundary labeling method, device, equipment, medium and program product
CN115116427B (en) * 2022-06-22 2023-11-14 马上消费金融股份有限公司 Labeling method, voice synthesis method, training method and training device
CN115188365B (en) * 2022-09-09 2022-12-27 中邮消费金融有限公司 Pause prediction method and device, electronic equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7136816B1 (en) * 2002-04-05 2006-11-14 At&T Corp. System and method for predicting prosodic parameters
US20070055526A1 (en) * 2005-08-25 2007-03-08 International Business Machines Corporation Method, apparatus and computer program product providing prosodic-categorical enhancement to phrase-spliced text-to-speech synthesis
JP4539537B2 (en) * 2005-11-17 2010-09-08 沖電気工業株式会社 Speech synthesis apparatus, speech synthesis method, and computer program
CN103035241A (en) * 2012-12-07 2013-04-10 中国科学院自动化研究所 Model complementary Chinese rhythm interruption recognition system and method
US20160365087A1 (en) * 2015-06-12 2016-12-15 Geulah Holdings Llc High end speech synthesis
CN105185373B (en) * 2015-08-06 2017-04-05 百度在线网络技术(北京)有限公司 The generation of prosody hierarchy forecast model and prosody hierarchy Forecasting Methodology and device
CN105244020B (en) * 2015-09-24 2017-03-22 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN105185372B (en) * 2015-10-20 2017-03-22 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN107039034B (en) * 2016-02-04 2020-05-01 科大讯飞股份有限公司 Rhythm prediction method and system
CN108305612B (en) * 2017-11-21 2020-07-31 腾讯科技(深圳)有限公司 Text processing method, text processing device, model training method, model training device, storage medium and computer equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8554566B2 (en) * 2008-08-12 2013-10-08 Morphism Llc Training and applying prosody models
TW201432668A (en) * 2013-02-05 2014-08-16 Univ Nat Chiao Tung Streaming encoder, prosody information encoding device, prosody-analyzing device, and device and method for speech-synthesizing
CN105374350A (en) * 2015-09-29 2016-03-02 百度在线网络技术(北京)有限公司 Speech marking method and device
CN105551481A (en) * 2015-12-21 2016-05-04 百度在线网络技术(北京)有限公司 Rhythm marking method of voice data and apparatus thereof
CN106601228A (en) * 2016-12-09 2017-04-26 百度在线网络技术(北京)有限公司 Sample marking method and device based on artificial intelligence prosody prediction
CN106971709A (en) * 2017-04-19 2017-07-21 腾讯科技(上海)有限公司 Statistic parameter model method for building up and device, phoneme synthesizing method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Automatic Prosody Boundary Labeling of Mandarin Using Both Text and Acoustic Information; Chongjia Ni et al.; 2008 6th International Symposium on Chinese Spoken Language Processing; 2008-12-19; full text *
Emphatic Speech Synthesis and Control Based on Characteristic Transferring in End-to-End Speech Synthesis; Mu Wang et al.; 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia); 2018-12-31; full text *
Research on Chinese Prosodic Phrase Prediction Based on Semi-supervised Learning; Su Dan; China Master's Theses Full-text Database, Information Science and Technology; No. 10, 2012-10-15; full text *
Chinese Prosodic Phrase Boundary Prediction Based on Text and Speech Features; Li Xiao; China Master's Theses Full-text Database, Information Science and Technology; No. 03, 2018-03-15; full text *

Also Published As

Publication number Publication date
CN110444191A (en) 2019-11-12
CN109697973A (en) 2019-04-30

Similar Documents

Publication Publication Date Title
CN110444191B (en) Rhythm level labeling method, model training method and device
CN110288077B (en) Method and related device for synthesizing speaking expression based on artificial intelligence
CN110853618B (en) Language identification method, model training method, device and equipment
CN107481718B (en) Audio recognition method, device, storage medium and electronic equipment
WO2021036644A1 (en) Voice-driven animation method and apparatus based on artificial intelligence
CN110853617B (en) Model training method, language identification method, device and equipment
CN111261144B (en) Voice recognition method, device, terminal and storage medium
CN110838286A (en) Model training method, language identification method, device and equipment
CN110570840B (en) Intelligent device awakening method and device based on artificial intelligence
CN110890093A (en) Intelligent device awakening method and device based on artificial intelligence
CN110634474B (en) Speech recognition method and device based on artificial intelligence
CN112840396A (en) Electronic device for processing user words and control method thereof
WO2020098269A1 (en) Speech synthesis method and speech synthesis device
CN113393828A (en) Training method of voice synthesis model, and voice synthesis method and device
CN112735418B (en) Voice interaction processing method, device, terminal and storage medium
CN114360510A (en) Voice recognition method and related device
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
CN102063282A (en) Chinese speech input system and method
CN111292727B (en) Voice recognition method and electronic equipment
CN112906369A (en) Lyric file generation method and device
CN112488157A (en) Dialog state tracking method and device, electronic equipment and storage medium
CN111522592A (en) Intelligent terminal awakening method and device based on artificial intelligence
CN111145734A (en) Voice recognition method and electronic equipment
CN114708849A (en) Voice processing method and device, computer equipment and computer readable storage medium
CN113948060A (en) Network training method, data processing method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant