CN110444191B - Prosodic hierarchy labeling method, model training method and device

Prosodic hierarchy labeling method, model training method and device

Info

Publication number: CN110444191B
Authority: CN (China)
Prior art keywords: word, trained, feature set, text, target
Legal status: Active
Application number: CN201910751371.6A
Other languages: Chinese (zh)
Other versions: CN110444191A (en)
Inventors: 吴志勇, 杜耀, 康世胤, 苏丹, 俞栋
Current Assignee: Tencent Technology Shenzhen Co Ltd; Shenzhen Graduate School Tsinghua University
Original Assignee: Tencent Technology Shenzhen Co Ltd; Shenzhen Graduate School Tsinghua University
Application filed by Tencent Technology Shenzhen Co Ltd and Shenzhen Graduate School Tsinghua University
Priority to CN201910751371.6A
Publication of CN110444191A
Application granted
Publication of CN110444191B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation

Abstract

The application discloses a prosodic hierarchy labeling method applied in the field of artificial intelligence, in particular to speech synthesis. The method includes: acquiring text data to be labeled and audio data, where the text data to be labeled and the audio data correspond to each other; extracting a text feature set to be labeled for each word from the text data to be labeled; extracting an acoustic feature set for each word from the audio data; and obtaining a prosodic hierarchy through a prosodic hierarchy labeling model according to the word identifier, the text feature set to be labeled, and the acoustic feature set of each word. The application also discloses a model training method, a prosodic hierarchy labeling apparatus, and a model training apparatus. Because the prosodic hierarchy labeling model is built by combining text features and acoustic features, richer features are available for prosodic hierarchy labeling, which improves labeling accuracy and the quality of speech synthesis.

Description

Prosodic hierarchy labeling method, model training method and device
This application is a divisional application of Chinese patent application No. 201910060152.3, entitled "Prosodic hierarchy labeling method, model training method and device", filed with the Chinese Patent Office on January 22, 2019.
Technical Field
The present application relates to the field of intelligent speech synthesis, and in particular, to a prosodic hierarchy labeling method, a model training method, and a related apparatus.
Background
To build a high-quality speech synthesis system, it is important to accurately label massive amounts of data with a prosodic hierarchy. The prosodic hierarchy models the rhythm and pauses of speech, so a method that can accurately and automatically label the prosodic hierarchy is of great significance for quickly constructing a speech synthesis corpus and improving the naturalness of synthesized speech.
At present, automatic labeling of the prosodic hierarchy requires training an automatic labeling model with machine learning, and feature selection mainly falls into two categories. One category uses text features: the text is first segmented into words, the text features of each word are extracted, and the prosodic hierarchy type of each word is judged with a machine learning method. The other category uses acoustic features: it mainly relies on detecting pause positions in the audio and distinguishes prosodic hierarchy types according to pause duration.
However, in practice, a labeling task that uses only text data ignores phenomena such as the lengthening of the syllable before a prosodic hierarchy boundary and the short pause that often accompanies an intonation phrase boundary, while using only acoustic features makes it difficult to label all three prosodic levels accurately at the same time. Both approaches ignore the intrinsic relation between text features and acoustic features, which degrades the prosodic hierarchy labeling and affects the quality of the corpus on which speech synthesis relies.
Disclosure of Invention
The embodiment of the application provides a prosody hierarchy labeling method, a model training method and a prosody hierarchy labeling device.
In view of the above, a first aspect of the present application provides a method for prosody hierarchy annotation, including:
acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be annotated of each word according to the text data to be annotated, wherein the text feature set to be annotated comprises part of speech, word length and post-word punctuation types;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
and acquiring a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word.
A second aspect of the present application provides a method of model training, comprising:
acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be trained of each word according to the text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
and training the word identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure.
A third aspect of the present application provides a prosodic hierarchy labeling apparatus, including:
the system comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring text data to be labeled and audio data, the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
the extraction module is used for extracting a text feature set to be labeled of each word according to the text data to be labeled obtained by the obtaining module, wherein the text feature set to be labeled comprises part of speech, word length and word post punctuation types;
the extraction module is further configured to extract an acoustic feature set of each word according to the audio data acquired by the acquisition module, where the acoustic feature set includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value;
and the prediction module is used for acquiring a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word extracted by the extraction module and the acoustic feature set of each word.
In one possible design, in a first implementation of the third aspect of an embodiment of the present application,
the prediction module is specifically configured to determine at least one of a prosodic word, a prosodic phrase, and an intonation phrase through the prosodic hierarchy labeling model;
or, alternatively,
and determining prosodic words and/or prosodic phrases through the prosodic hierarchy annotation model.
A fourth aspect of the present application provides a model training apparatus, comprising:
the training device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring text data to be trained and audio data to be trained, the text data to be trained and the audio data to be trained have a corresponding relation, the text data to be trained comprises at least one word, and each word corresponds to a word identifier;
the extraction module is used for extracting a text feature set to be trained of each word according to the text data to be trained acquired by the acquisition module, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation types;
the extraction module is further configured to extract an acoustic feature set to be trained of each word according to the audio data to be trained acquired by the acquisition module, where the acoustic feature set to be trained includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value;
and the training module is used for training the word identifier corresponding to each word, the text feature set to be trained of each word extracted by the extraction module and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy.
In one possible design, in a first implementation manner of the fourth aspect of the embodiment of the present application, the model training apparatus further includes a processing module and a generating module;
the processing module is used for performing word segmentation processing on the text data to be trained after the acquisition module acquires the text data to be trained and the audio data to be trained to obtain at least one word;
the obtaining module is further configured to obtain a target word identifier corresponding to a target word according to a preset word identifier relationship, where the preset word identifier relationship is used to indicate a relationship between each preset word and a word identifier, and the target word belongs to any one of the at least one word processed by the processing module;
the generating module is used for generating a target word vector corresponding to the target word in the text data to be trained;
the training module is specifically configured to train the target word identifier obtained by the obtaining module and the target word vector generated by the generating module to obtain a first model parameter, where the first model parameter is used to generate a word embedding layer in the prosody level labeling model.
In one possible design, in a second implementation of the fourth aspect of the embodiments of the present application,
the extraction module is specifically configured to acquire a part of speech, a word length, and a post-word punctuation type of a target word in the text data to be trained, where the part of speech indicates a result of grammar classification of the word, the word length indicates a word number of the word, and the post-word punctuation type is used to indicate a punctuation type corresponding to the post-word;
acquiring the part of speech, word length and post-word punctuation types of associated words in the text data to be trained, wherein the associated words are words having an association relation with the target words;
the training module is specifically configured to train the part of speech, the word length, and the post-word punctuation type of the target word and the part of speech, the word length, and the post-word punctuation type of the associated word to obtain a second model parameter, where the second model parameter is used to generate a text neural network in the prosody level labeling model.
In one possible design, in a third implementation manner of the fourth aspect of the embodiment of the present application, the model training apparatus further includes an alignment module;
the alignment module is used for forcibly aligning the text data to be trained and the audio data to be trained after the acquisition module acquires the text data to be trained and the audio data to be trained to obtain a time-aligned text;
the extraction module is specifically configured to determine the word end syllable duration of the target word according to the time alignment text.
In one possible design, in a fourth implementation of the fourth aspect of the embodiment of the present application,
the extraction module is specifically configured to determine post-word pause duration of the target word according to the time alignment text.
In one possible design, in a fifth implementation form of the fourth aspect of the embodiments of the present application,
the extraction module is specifically configured to calculate a frame number of a final syllable voiced initial frame and a frame number of a voiced end frame of the target word according to the time alignment text and fundamental frequency information extracted from the audio data to be trained;
extracting a logarithmic fundamental frequency curve and a logarithmic energy curve of the audio data to be trained;
and calculating the acoustic statistical characteristics of the end-of-speech syllables of the target word according to the frame number of the end-of-speech syllable voiced initial frame of the target word, the frame number of the voiced end frame, the logarithmic fundamental frequency curve and the logarithmic energy curve, wherein the acoustic statistical characteristics of the end-of-speech syllables comprise at least one of the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic fundamental frequency curve, and the acoustic statistical characteristics of the end-of-speech syllables further comprise at least one of the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic energy curve.
In one possible design, in a sixth implementation form of the fourth aspect of the embodiment of the present application,
the extraction module is specifically configured to calculate, according to the time-aligned text and fundamental frequency information extracted from the audio data to be trained, a frame number of a last voiced frame of the target word and a frame number of a voiced frame of a next adjacent word prefix of the target word;
determining a fundamental frequency value and an energy value between the end voiced frame of the target word and the next adjacent word beginning voiced frame according to the frame number of the last voiced frame of the target word, the frame number of the voiced frame of the next adjacent word beginning of the target word, and fundamental frequency information and energy information which are extracted from the audio data to be trained in a framing manner;
and calculating to obtain a logarithmic difference value of the fundamental frequency value according to the fundamental frequency value between the suffix voiced frame of the target word and the next adjacent word prefix voiced frame, and calculating to obtain a logarithmic difference value of the energy value according to the energy value between the suffix voiced frame of the target word and the next adjacent word prefix voiced frame, wherein the logarithmic difference value of the fundamental frequency value and the logarithmic difference value of the energy value belong to the acoustic feature change value between words.
In one possible design, in a seventh implementation form of the fourth aspect of the embodiment of the present application,
the training module is specifically configured to obtain a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, where the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter;
acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter;
training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model;
and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
A fifth aspect of the present application provides a prosodic hierarchy labeling apparatus, including: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory and comprises the following steps:
acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be annotated of each word according to the text data to be annotated, wherein the text feature set to be annotated comprises part of speech, word length and post-word punctuation types;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
acquiring a prosodic hierarchy structure through a prosodic hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
A sixth aspect of the present application provides a model training apparatus, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory and comprises the following steps:
acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be trained of each word according to the text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
training the word identifier corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
A seventh aspect of the present application provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the method of the above-described aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
In the embodiment of the application, a prosodic hierarchy labeling method is provided. First, text data to be labeled and audio data are obtained, where the text data to be labeled and the audio data correspond to each other, the text data to be labeled includes at least one word, and each word corresponds to one word identifier. A text feature set to be labeled is then extracted for each word from the text data to be labeled, where the text feature set to be labeled includes the part of speech, the word length, and the post-word punctuation type. An acoustic feature set is also extracted for each word from the audio data, where the acoustic feature set includes the word-end syllable duration, the post-word pause duration, the word-end syllable acoustic statistical features, and the inter-word acoustic feature variation value. Finally, a prosodic hierarchy is obtained through a prosodic hierarchy labeling model according to the word identifier, the text feature set to be labeled, and the acoustic feature set of each word. In this way, the prosodic hierarchy labeling model is built by combining text features and acoustic features, which provides richer features for prosodic hierarchy labeling; adopting this more accurate prosodic hierarchy labeling model improves labeling accuracy and helps improve the naturalness of synthesized speech.
Drawings
FIG. 1 is a block diagram of a speech synthesis system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a prosody hierarchy in an embodiment of the present application;
FIG. 3 is a diagram of an embodiment of a method for prosody hierarchy labeling in an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating an application of the prosodic hierarchy labeling system of the embodiment of the present application;
FIG. 5 is a flow chart illustrating prosodic hierarchy labeling in an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of a method for model training in an embodiment of the present application;
FIG. 7 is a schematic flow chart of extracting an acoustic feature set according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an embodiment of a fundamental frequency curve in an embodiment of the present application;
FIG. 9 is a schematic diagram of an embodiment of an energy curve in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a prosody hierarchy labeling model according to an embodiment of the present application;
FIG. 11 is a schematic diagram of an embodiment of a prosodic hierarchy labeling apparatus according to the present application;
FIG. 12 is a schematic diagram of an embodiment of a model training apparatus according to an embodiment of the present application;
FIG. 13 is a schematic diagram of another embodiment of a model training apparatus according to an embodiment of the present application;
FIG. 14 is a schematic diagram of another embodiment of a model training apparatus according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a terminal device in an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a server in an embodiment of the present application.
Detailed Description
The embodiment of the application provides a prosody hierarchy labeling method, a model training method and a prosody hierarchy labeling device.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be appreciated that the present application is primarily applicable to automatically labeling the prosodic hierarchy of text data during data preparation for building a speech synthesis corpus. Speech synthesis is the task of converting text into speech, and building a high-quality speech synthesis system requires preparing massive amounts of data. Data labeled with a prosodic hierarchy has an important influence on the naturalness of speech synthesis. The traditional approach is manual labeling, which is time-consuming and labor-intensive for massive data, and different annotators may label the same words inconsistently. A system that automatically labels the prosodic hierarchy is therefore of great significance for quickly building the massive labeled prosody data required by a speech synthesis system and for resolving inconsistencies between annotators.
Key speech technologies include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes.
For ease of understanding, the present application provides a prosodic hierarchy labeling method and a model training method, applied to the speech synthesis system shown in FIG. 1. Referring to FIG. 1, FIG. 1 is an architecture diagram of the speech synthesis system in an embodiment of the present application. As shown in the figure, a terminal device or a server first obtains text data and audio data that correspond to each other; for example, the text data is "today is a good day" and the audio data is a recording of "today is a good day". A forced alignment tool is used to align the text data and the audio data. Next, a text feature set is extracted for each word in the text data, including the part of speech, the word length, and the post-word punctuation type. Feature extraction is also performed on the audio data to obtain an acoustic feature set for each word, including the word-end syllable duration, the post-word pause duration, the word-end syllable acoustic statistical features, and the inter-word acoustic feature variation value, where the inter-word acoustic feature variation value is the logarithmic difference in fundamental frequency and in energy between the last voiced frame of the current word and the first voiced frame of the next word. In addition, a word identifier (ID) can be extracted for each word from the text data. The word identifiers, text feature sets, and acoustic feature sets of all words in the sentence are input into the trained prosodic hierarchy labeling model, which outputs the prosodic hierarchy labeling result. If the prosodic hierarchy labeling model is deployed on the terminal device, the terminal device can directly play the corresponding sentence according to the prosodic hierarchy obtained from the model. If the model is deployed on the server, the server feeds the prosodic hierarchy back to the terminal device, which then plays the corresponding sentence according to it.
It should be noted that the terminal device includes, but is not limited to, a tablet computer, a notebook computer, a palmtop computer, a mobile phone, a voice interaction device, and a personal computer (PC), which is not limited here. The voice interaction device includes, but is not limited to, smart speakers and smart home appliances. The voice interaction device also has the following characteristics:
1. Networking: voice interaction devices can be connected together through a local area network, connected to a manufacturer's service site through a home gateway interface, and finally connected to the Internet to share information.
2. Intelligence: the voice interaction device can respond automatically according to the surrounding environment without human intervention.
3. Openness and compatibility: since a user's voice interaction devices may come from different vendors, they need to be open and compatible with one another.
4. Energy saving: smart home appliances can automatically adjust their working time and working state according to the surrounding environment, thereby saving energy.
5. Ease of use: the user only needs to know very simple operations, because the complex control flow is handled by the controller embedded in the voice interaction device. The voice interaction device is not a single device but a technical system; as application requirements develop and the devices become more intelligent, their content becomes richer, and the functions of different voice interaction devices differ according to the actual application environment, but they generally share intelligent control technology.
It should be understood that the prosodic hierarchy output by the speech synthesis system may specifically be the prosodic hierarchy of Chinese, a tonal language whose prosodic features are very complex. The prosodic hierarchy models prosodic features such as the pauses and rhythm of speech and is important for the naturalness of speech synthesized by a speech synthesis system. A typical prosodic hierarchy division is shown in FIG. 2. Referring to FIG. 2, FIG. 2 is a schematic structural diagram of a prosodic hierarchy in the embodiment of the present application, which is divided, from bottom to top, into prosodic words (PW), prosodic phrases (PPH), and intonation phrases (IPH). For example, in the sentence "receive a sincere greeting and a heartfelt blessing", the PWs are "receive sincere", "greeting", "heartfelt", and "blessing"; the PPHs are "receive", "sincere greeting", and "and heartfelt blessing"; and the IPHs are "sincere greeting" and "heartfelt blessing".
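For illustration, the sketch below shows one possible per-word representation of these boundary labels for such a sentence; the label names and the English word segmentation are assumptions made for this example only and are not the patent's exact encoding.

```python
# Purely illustrative: per-word prosodic boundary labels for the example
# sentence. Label names and the English segmentation are assumptions.
BOUNDARY_LABELS = ["NB", "PW", "PPH", "IPH"]   # no boundary, prosodic word,
                                               # prosodic phrase, intonation phrase

# Each word carries the highest-level prosodic boundary that follows it.
labeled_sentence = [
    ("receive", "PW"),
    ("sincere", "PW"),
    ("greeting", "IPH"),
    ("and", "PW"),
    ("heartfelt", "PW"),
    ("blessing", "IPH"),
]
```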
With reference to fig. 3, a method for prosody hierarchy labeling in the present application will be described below, where an embodiment of the method for prosody hierarchy labeling in the present application includes:
101. acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
In this embodiment, text data to be labeled and the corresponding audio data are first obtained. The text data to be labeled may be a sentence or a paragraph, and the language includes, but is not limited to, Chinese, Japanese, English, or Korean. The audio data may specifically be an audio file. The text data to be labeled includes at least one word, so word segmentation is possible; for example, "receive a sincere greeting and a heartfelt blessing" can be divided into five words, and different words correspond to different word identifiers.
102. Extracting a text feature set to be annotated of each word according to text data to be annotated, wherein the text feature set to be annotated comprises part of speech, word length and post-word punctuation types;
In this embodiment, feature extraction is then performed on each word, and it includes two aspects: extraction of text features and extraction of acoustic features. In the text feature extraction process, text features are extracted for each word in the text data to be labeled. Taking the text data to be labeled "receive a sincere greeting and a heartfelt blessing" as an example, a text feature set to be labeled can be extracted for each word, where the text feature set to be labeled includes, but is not limited to, the part of speech, the word length, and the post-word punctuation type.
Parts of speech are usually divided into content words and function words. Content words are a class of Chinese words that carry actual meaning and can serve as sentence components on their own, i.e., they have both lexical and grammatical meaning. Taking grammatical function as the main basis, a word that can independently act as a syntactic component and has both lexical and grammatical meaning is a content word. Content words include nouns, verbs, adjectives, numerals, measure words, and pronouns. Function words have no complete lexical meaning but have grammatical meaning or function. They must attach to content words or sentences to express grammatical meaning, cannot form sentences or grammatical components on their own, and cannot be reduplicated. Function words include adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia.
The word length indicates the number of characters in the word; for example, the word for "greeting" has a word length of 2, and the word for "and" has a word length of 1.
The post-word punctuation type indicates whether a punctuation mark immediately follows the word; if it does, the type of that punctuation mark is recorded. Punctuation marks represent pause durations in spoken language and help express thoughts and emotions and understand written language precisely.
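To make the three text features concrete, here is a minimal extraction sketch in Python. It assumes the jieba toolkit is used for Chinese word segmentation and part-of-speech tagging (the patent does not name a specific segmenter), and the punctuation table here is an illustrative assumption.

```python
import jieba.posseg as pseg

# Illustrative punctuation table; the patent's punctuation-type table is not
# reproduced here.
PUNCT_TYPES = {"，": "comma", "。": "period", "、": "enum_comma",
               "！": "exclamation", "？": "question"}

def extract_text_features(sentence):
    tokens = [(w, flag) for w, flag in pseg.cut(sentence)]
    features = []
    for i, (word, pos) in enumerate(tokens):
        if word in PUNCT_TYPES:        # punctuation is a feature of the
            continue                   # preceding word, not a word itself
        next_tok = tokens[i + 1][0] if i + 1 < len(tokens) else ""
        features.append({
            "word": word,
            "pos": pos,                              # part-of-speech tag
            "word_length": len(word),                # number of characters
            "post_punct": PUNCT_TYPES.get(next_tok, "none"),
        })
    return features

print(extract_text_features("成立合作社，借助电商平台组成新模式"))
```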
103. Extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word ending syllable duration, word post pause duration, acoustic statistical features of word ending syllables and acoustic feature variation values among words;
In this embodiment, in the acoustic feature extraction process, acoustic features are extracted for each word in the audio data. Taking the text data to be labeled "receive a sincere greeting and a heartfelt blessing" as an example, five groups of acoustic feature sets can be extracted, where each acoustic feature set includes, but is not limited to, the word-end syllable duration, the post-word pause duration, the word-end syllable acoustic statistical features, and the inter-word acoustic feature variation value.
The term "end syllable duration" refers to the time length of the last syllable of a word, the syllable refers to a voiced syllable, such as a "waiting" word of "greeting", the pronunciation is "hou", the unvoiced sound is "h", the voiced sound is "ou", the term "end syllable duration refers to the time length of the syllable of" ou ", the duration is detected by a special tool, and the discussion is omitted here.
The post-word pause duration refers to the length of time from the end of a word until the next word begins to be spoken, for example, the pause between "greeting" and "and".
The acoustic statistical features of the word-end syllable typically include ten parameters. Five of them relate to the logarithmic fundamental frequency curve of the last syllable: the maximum, minimum, range, mean, and variance of the log fundamental frequency curve. The other five relate to the logarithmic energy curve of the last syllable: the maximum, minimum, range, mean, and variance of the log energy curve.
The inter-word acoustic feature variation value represents the log fundamental frequency difference and log energy difference between the voiced sound at the tail of a word and the voiced sound at the head of the next word.
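The following Python sketch illustrates how this acoustic feature set might be assembled. It assumes a forced aligner has already produced word and word-end-syllable time boundaries and that frame-level log F0 and log energy have been extracted with unvoiced regions interpolated; the 10 ms hop and all names are assumptions for illustration, not the patent's exact procedure.

```python
import numpy as np

HOP = 0.01  # seconds per frame (illustrative)

def curve_stats(curve):
    """Max, min, range, mean, and variance of a (log) curve segment."""
    return [float(curve.max()), float(curve.min()),
            float(curve.max() - curve.min()),
            float(curve.mean()), float(curve.var())]

def acoustic_features(words, log_f0, log_energy):
    """words: dicts with 'end_syllable_start', 'end', 'next_start' times in
    seconds from forced alignment (the sentence-final word is omitted here);
    log_f0 / log_energy: per-frame arrays with unvoiced regions interpolated
    (an assumption of this sketch)."""
    feats = []
    for w in words:
        s = int(w["end_syllable_start"] / HOP)
        e = int(w["end"] / HOP)
        n = int(w["next_start"] / HOP)
        feats.append({
            "end_syllable_duration": w["end"] - w["end_syllable_start"],
            "pause_duration": w["next_start"] - w["end"],
            # ten statistics over the last syllable's log F0 and log energy
            "end_syllable_stats": curve_stats(log_f0[s:e]) + curve_stats(log_energy[s:e]),
            # inter-word change: log difference between the word-final frame
            # and the first frame of the next word
            "delta_log_f0": float(log_f0[n] - log_f0[e - 1]),
            "delta_log_energy": float(log_energy[n] - log_energy[e - 1]),
        })
    return feats
```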
104. And acquiring a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word.
In this embodiment, text data to be labeled and audio data are input to a prosody hierarchy labeling model, and the prosody hierarchy labeling model outputs a corresponding prosody hierarchy structure according to a word identifier of each word, a text feature set to be labeled of each word, and an acoustic feature set of each word.
For convenience of introduction, referring to FIG. 4, FIG. 4 is a schematic illustration of an application of the prosodic hierarchy labeling system in the embodiment of the present application. As shown in the figure, a user provides the text data and audio data that need prosodic hierarchy labeling; for example, when the user inputs the text "receive a sincere greeting and a heartfelt blessing" to be labeled, the text data and the corresponding audio data are provided to the prosodic hierarchy labeling model. The model extracts features: the text feature set to be labeled and the acoustic feature set are extracted for each word, the prosodic hierarchy is then obtained by a forward pass of the deep neural network, and the prosodic hierarchy labeling model returns the text labeled with the prosodic hierarchy to the user.
Referring to FIG. 5, FIG. 5 is a schematic flow chart of prosodic hierarchy labeling in the embodiment of the present application. As shown in the figure, in step S1, the text data and audio data of the sentence to be labeled are first obtained. In step S2, the text data is segmented into words, and the text data and the audio data are forcibly aligned. In step S3, after the forced alignment, the corresponding text features and acoustic features can be extracted. In step S4, the extracted text features and acoustic features are input into the prosodic hierarchy labeling model, which includes a feedforward neural network and a bidirectional long short-term memory network. In step S5, the prosodic hierarchy of the sentence is output by the prosodic hierarchy labeling model.
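As a rough sketch of the network shape described in steps S4 and S5, the following PyTorch code combines a word-embedding layer, small feed-forward layers over the text and acoustic features, and a bidirectional LSTM over the word sequence. All layer sizes and the four-class boundary output are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

class ProsodyLabeler(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, text_dim=3, acoustic_dim=14,
                 hidden=128, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)            # word-ID embedding
        self.text_ff = nn.Sequential(nn.Linear(text_dim, 32), nn.ReLU())
        self.acoustic_ff = nn.Sequential(nn.Linear(acoustic_dim, 32), nn.ReLU())
        self.bilstm = nn.LSTM(emb_dim + 32 + 32, hidden,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, word_ids, text_feats, acoustic_feats):
        # word_ids: (batch, seq); text/acoustic_feats: (batch, seq, dim)
        x = torch.cat([self.embed(word_ids),
                       self.text_ff(text_feats),
                       self.acoustic_ff(acoustic_feats)], dim=-1)
        h, _ = self.bilstm(x)
        return self.out(h)    # per-word boundary logits
```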
In the embodiment of the application, a prosodic hierarchy labeling method is provided. First, text data to be labeled and audio data are obtained, where they correspond to each other, the text data to be labeled includes at least one word, and each word corresponds to one word identifier. A text feature set to be labeled is then extracted for each word from the text data to be labeled, including the part of speech, the word length, and the post-word punctuation type. An acoustic feature set is also extracted for each word from the audio data, including the word-end syllable duration, the post-word pause duration, the word-end syllable acoustic statistical features, and the inter-word acoustic feature variation value. Finally, a prosodic hierarchy is obtained through a prosodic hierarchy labeling model according to the word identifier, the text feature set to be labeled, and the acoustic feature set of each word. In this way, the prosodic hierarchy labeling model is built by combining text features and acoustic features, providing richer features for prosodic hierarchy labeling; adopting this more accurate prosodic hierarchy labeling model improves labeling accuracy and the speech synthesis effect.
Optionally, on the basis of the embodiment corresponding to fig. 3, in a first optional embodiment of the method for providing prosody hierarchy annotation in the embodiment of the present application, obtaining a prosody hierarchy structure through a prosody hierarchy annotation model may include:
determining at least one of prosodic words, prosodic phrases and intonation phrases through a prosodic hierarchy annotation model;
or, determining prosodic words and/or prosodic phrases through a prosodic hierarchy labeling model.
In this embodiment, two common prosodic hierarchies are introduced. In the first case, at least one of prosodic words, prosodic phrases, and intonation phrases is determined by the prosodic hierarchy labeling model; that is, the model is trained on four cases: non-prosodic-hierarchy boundary, prosodic word boundary, prosodic phrase boundary, and intonation phrase boundary. In the second case, prosodic words and/or prosodic phrases are determined by the prosodic hierarchy labeling model; that is, the model is trained on three cases: non-prosodic-hierarchy boundary, prosodic word boundary, and prosodic phrase boundary.
When the prosody hierarchy is labeled, prosody hierarchy labeling is carried out on input text data after text processing by adopting a prosody hierarchy labeling model generated in a training stage, so that a text with a prosody hierarchy structure is obtained and used for quickly constructing a corpus required by a voice synthesis system.
Secondly, in the embodiment of the present application, two common prosody hierarchy labeling methods are introduced, one is to determine prosodic words, prosodic phrases and intonation phrases through a prosody hierarchy labeling model, and the other is to determine prosodic words and prosodic phrases through a prosody hierarchy labeling model. Through the mode, the user can select a more detailed labeling scheme with three-layer prosody hierarchical structures of prosodic words, prosodic phrases and intonation phrases, and can also select a labeling scheme with two-layer prosody hierarchical structures of the prosodic words and the prosodic phrases. Therefore, the prosody hierarchy output can be selected according to the requirement, and the flexibility of the scheme is improved.
With reference to fig. 6, an embodiment of the method for training a model in this application includes:
201. acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, and each word corresponds to a word identifier;
In this embodiment, text data to be trained and the corresponding audio data to be trained are first obtained. The text data to be trained may specifically be a sentence or a paragraph, and the language includes, but is not limited to, Chinese, Japanese, English, or Korean. The audio data to be trained may specifically be an audio file. The text data to be trained includes at least one word, so word segmentation is possible; for example, "receive a sincere greeting and a heartfelt blessing" can be divided into five words, and different words correspond to different word identifiers.
It can be understood that a large number of samples are often required during training, where the text data to be trained and the audio data to be trained are samples, and for convenience of description, the text data to be trained and the audio data to be trained are described as one sample, which should not be construed as a limitation to the present application.
202. Extracting a text feature set to be trained of each word according to text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
In this embodiment, feature extraction is then performed on each word, and it includes two aspects: extraction of text features and extraction of acoustic features. In the text feature extraction process, text features are extracted for each word in the text data to be trained. Taking the text data to be trained "receive a sincere greeting and a heartfelt blessing" as an example, a text feature set to be trained can be extracted for each word, where the text feature set to be trained includes, but is not limited to, the part of speech, the word length, and the post-word punctuation type.
It should be noted that the parts of speech, the word length, and the word punctuation type have been introduced in the above embodiments, and therefore, the description thereof is omitted here.
203. Extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
In this embodiment, in the acoustic feature extraction process, acoustic features are extracted for each word in the audio data to be trained. Taking the text data to be trained "receive a sincere greeting and a heartfelt blessing" as an example, an acoustic feature set to be trained can be extracted for each word, where the acoustic feature set to be trained includes, but is not limited to, the word-end syllable duration, the post-word pause duration, the word-end syllable acoustic statistical features, and the inter-word acoustic feature variation value.
It should be noted that the word end syllable duration, the post-word pause duration, the word end syllable acoustic statistical characteristics, and the inter-word acoustic characteristic variation values have been described in the above embodiments, and therefore are not described herein again.
204. And training the word identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy.
In this embodiment, the training process of the prosodic hierarchy labeling model is introduced. The training data are text data labeled with a prosodic hierarchy and the corresponding audio data, and a deep neural network is used to model the sequence. Each sentence contains a plurality of words, so a sentence is a word sequence, and the features and label of each word serve as the input and output of the deep neural network at one time step. Each word has a corresponding label y, so the labels of a sentence can be represented as a vector Y. The word identifier, text features, and acoustic features of each word in the sentence can be extracted from the text data and the corresponding audio data to form the feature x of the word, and the words of a sentence can be represented as an input vector X. The loss function is denoted as L(Y, f(X)). By training on a large number of samples to minimize the loss function, the parameters of the neural network are obtained, yielding an automatic prosodic hierarchy labeling model, i.e., the prosodic hierarchy labeling model.
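A minimal training-loop sketch for this sequence-labeling objective is shown below, using cross-entropy over per-word boundary labels and the illustrative ProsodyLabeler from the earlier sketch; data loading, padding, and masking are omitted, and all shapes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

model = ProsodyLabeler(vocab_size=10000)   # illustrative model from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(word_ids, text_feats, acoustic_feats, labels):
    """word_ids: (B, T) longs; text/acoustic_feats: (B, T, dim) floats;
    labels: (B, T) integer boundary classes. Padding/masking omitted."""
    logits = model(word_ids, text_feats, acoustic_feats)          # (B, T, C)
    loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```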
In the embodiment of the application, a model training method is provided. First, text data to be trained and audio data to be trained are obtained, where they correspond to each other and each word corresponds to one word identifier. A text feature set to be trained is then extracted for each word from the text data to be trained, including the part of speech, the word length, and the post-word punctuation type, and an acoustic feature set to be trained is extracted for each word from the audio data to be trained, including the word-end syllable duration, the post-word pause duration, the word-end syllable acoustic statistical features, and the inter-word acoustic feature variation value. Finally, the word identifier, the text feature set to be trained, and the acoustic feature set to be trained of each word are trained to obtain the prosodic hierarchy labeling model. In this way, the prosodic hierarchy labeling model is built by combining text features and acoustic features, providing richer features for prosodic hierarchy labeling; adopting this more accurate prosodic hierarchy labeling model improves labeling accuracy and the speech synthesis effect.
Optionally, on the basis of the embodiment corresponding to fig. 6, in a first optional embodiment of the method for model training provided in the embodiment of the present application, after obtaining text data to be trained and audio data to be trained, the method may further include:
performing word segmentation processing on text data to be trained to obtain at least one word;
acquiring a target word identifier corresponding to a target word according to a preset word identifier relationship, wherein the preset word identifier relationship is used for representing a preset relationship between each word and the word identifier, and the target word belongs to any one of at least one word;
generating a target word vector corresponding to a target word in text data to be trained;
training the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model, which may include:
and training the target word identifier and the target word vector to obtain a first model parameter, wherein the first model parameter is used for generating a word embedding layer in a prosody hierarchy labeling model.
In this embodiment, a method for training the word embedding layer in the prosodic hierarchy labeling model is provided. First, text data to be trained needs to be obtained and then segmented into words. For example, the text data to be trained "set up a cooperative society and form a new pattern with the help of an e-commerce platform" is segmented into "set up", "cooperative society", "with the help of", "e-commerce", "platform", "form", and "new pattern". The word identifier corresponding to each word then needs to be determined according to the preset word identifier relationship. For ease of understanding, please refer to Table 1, which is an illustration of a preset word identifier relationship.
TABLE 1
Word identifier    Word
0                  set up
1                  cooperative society
2                  with the help of
3                  e-commerce
4                  platform
5                  form
6                  new pattern
As can be seen from Table 1, the preset word identifier relationship indicates the relationship between a word and its word identifier, and the same word always corresponds to the same word identifier. If the target word is "set up", its word identifier is "0", and "0" is used as the input of the word embedding layer.
Following the method for generating the target word identifier and the target word vector, the word identifiers and word vectors of the other words are generated. The word identifiers and word vectors are trained according to the mapping relationship between them, and the first model parameter can be obtained by minimizing the loss function; the first model parameter is used to generate the word embedding layer in the prosodic hierarchy labeling model. In practical applications, the word embedding layer can be updated periodically to improve its accuracy.
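The following sketch shows a Table 1 style word-identifier mapping feeding a trainable embedding lookup. The vocabulary and the 128-dimensional embedding size are illustrative assumptions; in the patent, the embedding weights correspond to the first model parameter learned during training.

```python
import torch
import torch.nn as nn

# Word-identifier mapping in the style of Table 1 (English glosses used here).
word_to_id = {"set up": 0, "cooperative society": 1, "with the help of": 2,
              "e-commerce": 3, "platform": 4, "form": 5, "new pattern": 6}

# The embedding weights play the role of the "first model parameter":
# they are learned jointly with the rest of the labeling network.
embedding = nn.Embedding(num_embeddings=len(word_to_id), embedding_dim=128)

ids = torch.tensor([word_to_id[w] for w in ["set up", "cooperative society",
                                            "with the help of", "e-commerce",
                                            "platform", "form", "new pattern"]])
word_vectors = embedding(ids)   # shape (7, 128): one trainable vector per word
```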
Secondly, in the embodiment of the application, a method for training the word embedding layer is introduced: word segmentation is performed on the text data to be trained, the target word identifier corresponding to the target word is obtained according to the preset word identifier relationship, the target word vector corresponding to the target word in the text data to be trained is generated, and the target word identifier and the target word vector are then trained to obtain the first model parameter, which is used to generate the word embedding layer in the prosodic hierarchy labeling model. In this way, the word embedding layer in the prosodic hierarchy labeling model can be trained directly, and the other neural networks in the model can be trained at the same time, which avoids training a separate word vector model with an additional neural network and improves training efficiency.
Optionally, on the basis of the embodiment corresponding to fig. 6, in a second optional embodiment of the method for model training provided in the embodiment of the present application, extracting a feature set of a text to be trained for each word according to text data to be trained may include:
acquiring the part of speech, the word length and the post-word punctuation type of a target word in the text data to be trained, wherein the part of speech represents the grammatical category of the word, the word length represents the number of characters in the word, and the post-word punctuation type represents the type of punctuation that follows the word;
acquiring the part of speech, word length and post-word punctuation types of associated words in text data to be trained, wherein the associated words are words having an association relation with target words;
training the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model, which may include:
and training the part of speech, the word length and the post-word punctuation type of the target word and the part of speech, the word length and the post-word punctuation type of the associated word to obtain a second model parameter, wherein the second model parameter is used for generating a text neural network in a rhythm level labeling model.
In the embodiment, a method for training a text neural network in a prosodic hierarchy labeling model is provided. For convenience of understanding, the following description will continue to use the target word in the text data to be trained as an example, and it can be understood that the processing manner of other words in the text data to be trained is similar to that of the target word, and is not described herein again.
Specifically, word segmentation is performed on the text data to be trained. For example, if the text data to be trained is "establish a cooperative, and form a new pattern with the help of an e-commerce platform", then "establish", "cooperative", "with the help of", "e-commerce", "platform", "form" and "new pattern" are obtained after word segmentation. Assuming that the target word is "cooperative", the part of speech of the target word is noun, the word length is 3 characters, and the post-word punctuation type is comma. For ease of understanding, the relationship between parts of speech and identifiers, and the relationship between post-word punctuation types and identifiers, are described below in conjunction with tables 2 and 3. In practical applications, text features are usually represented by numbers, and therefore it is necessary to convert these word-level categories into numerical form.
TABLE 2
Part-of-speech identifier | Part of speech | Examples
0 | Noun | "Shanghai", "cucumber", "cabbage", "tractor", "quality", "moral character"
1 | Verb | "come", "walk", "run", "learn", "take off", "know"
2 | Adjective | "short", "thin", "tall", "ugly", "snow-white", "beautiful", "red"
3 | Adverb | "very", "rather", "extremely", "just", "all", "immediately", "once"
4 | Pronoun | "I", "you", "he", "she", "it", "we"
5 | Preposition | "from", "toward", "at", "by", "than"
6 | Measure word | "mu", "zhan", "xie", "zhi", "ben", "che", "ke", "tou", "jiao"
7 | Conjunction | "then", "so", "and", "or"
8 | Particle | "de", "me", "ba"
9 | Numeral | "one", "two", "three", "seven", "ten", "hundred", "thousand", "ten thousand", "hundred million"
10 | Interjection | "hey", "hi", "hum"
11 | Onomatopoeia | "wu", "wang", "honglong", "gele", "shasha", "hula"
TABLE 3
Post-word punctuation type identifier | Post-word punctuation type
0 | Period
1 | Question mark
2 | Exclamation mark
3 | Enumeration comma
4 | Comma
5 | Semicolon
6 | Colon
7 | No punctuation
As can be seen from tables 2 and 3, when the target word is "cooperative", the corresponding feature is "noun, 3, comma", which can be expressed as "034". In order to enrich the text features, the words around the target word also need to be considered, that is, associated words are obtained. An associated word may be the previous word, the next word, or the two previous words of the target word, and the like, which is not limited herein. Assuming that the associated words are the previous word and the next word of the target word, and the target word is "cooperative", the associated words are "establish" and "with the help of". As can be seen from tables 2 and 3, the feature corresponding to "establish" is "verb, 2, no punctuation". The number of part-of-speech categories, the maximum word length and the number of punctuation categories in the corpus are counted, so that the part-of-speech feature, the word-length feature and the post-word punctuation feature can all be represented by one-hot vectors. The three one-hot vectors are spliced to obtain the text feature of the current target word, and the text feature of the target word is spliced with the text features of the associated words to obtain the text feature vector of the target word, namely the text feature set to be trained.
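The following sketch illustrates, under stated assumptions, how the one-hot splicing described above can be implemented. The category counts (NUM_POS, MAX_WORD_LEN, NUM_PUNCT) and the feature values of the associated words are illustrative and would in practice be counted from the corpus, as described above.

```python
import numpy as np

NUM_POS = 12        # part-of-speech categories (Table 2); counted from the corpus in practice
MAX_WORD_LEN = 10   # assumed maximum word length in characters
NUM_PUNCT = 8       # post-word punctuation categories (Table 3)

def one_hot(index, size):
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def word_text_feature(pos_id, word_len, punct_id):
    # Splice the three one-hot vectors into the text feature of a single word.
    return np.concatenate([one_hot(pos_id, NUM_POS),
                           one_hot(word_len - 1, MAX_WORD_LEN),
                           one_hot(punct_id, NUM_PUNCT)])

# Target word "cooperative": noun (0), 3 characters, followed by a comma (4) -> "034".
target = word_text_feature(0, 3, 4)
# Associated words: previous word "establish" (verb, 2 characters, no punctuation)
# and next word "with the help of" (illustrative values).
prev_word = word_text_feature(1, 2, 7)
next_word = word_text_feature(5, 2, 7)

# Text feature vector of the target word, i.e. the text feature set to be trained.
text_feature_vector = np.concatenate([prev_word, target, next_word])
print(text_feature_vector.shape)   # (90,) with the illustrative category counts above
```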
Following the method for extracting the text feature set to be trained described above, the text feature set to be trained of each word is extracted and then trained, and a second model parameter can be obtained by minimizing the loss function, where the second model parameter is used for generating the text neural network in the prosody hierarchy labeling model. In practical application, the text neural network can be updated regularly, so that the accuracy of the text neural network is improved.
It can be understood that the text neural network may be a feedforward neural network, a convolutional neural network or another type of neural network, and the bidirectional long short-term memory network may be replaced by a variant thereof, such as a recurrent neural network with gated recurrent units; this is only an illustration and should not be construed as limiting the present application. The number of layers and the number of neurons of the text neural network are also not limited by the present application.
Secondly, in the embodiment of the application, a method for training a text neural network is introduced, namely, the part of speech, the word length and the post-word punctuation type of a target word in text data to be trained are firstly obtained, the part of speech, the word length and the post-word punctuation type of a related word in the text data to be trained are also obtained, then the part of speech, the word length and the post-word punctuation type of the target word and the part of speech, the word length and the post-word punctuation type of the related word are trained to obtain second model parameters, and the second model parameters are used for generating the text neural network in a rhythm level labeling model. By the mode, the system can automatically learn the high-level feature expression which is favorable for prosodic hierarchy structure labeling through the neural network, and automatically learn the high-level feature which is favorable for labeling from the original input text feature set, so that the automatic labeling performance of the prosodic hierarchy structure is improved.
Optionally, on the basis of the embodiment corresponding to fig. 6, in a third optional embodiment of the method for model training provided in the embodiment of the present application, after obtaining text data to be trained and audio data to be trained, the method may further include:
forcibly aligning the text data to be trained and the audio data to be trained to obtain a time-aligned text;
extracting the set of acoustic features to be trained of each word according to the audio data to be trained may include:
and determining the syllable duration of the word end of the target word according to the time alignment text.
In this embodiment, how to extract the acoustic feature set of a word is introduced. The text data to be trained and the audio data to be trained are forcibly aligned to obtain a time-aligned text; specifically, frame boundaries at the phoneme level can be obtained, so that the frame boundary of the word-end syllable can also be obtained, and the word-end syllable duration of the target word is calculated according to the start frame number and the end frame number of the word-end syllable.
For convenience of introduction, please refer to fig. 7, where fig. 7 is a schematic flowchart of a process of extracting an acoustic feature set according to an embodiment of the present application. As shown in the figure, in step a1, text data and audio data are first obtained; specifically, the text data to be trained and the audio data to be trained may be obtained. In step a2, the text data to be trained is word-segmented, and the text data and the audio data are aligned by using a forced alignment tool to obtain a time-aligned text, i.e. frame boundary information at the phoneme level. In step a4, the start and end frame numbers corresponding to the boundary of the word-end syllable are determined, and similarly, the frame numbers of the last voiced frame at the end of the word and of the first voiced frame at the beginning of the next word can also be determined. In step a3, a logarithmic fundamental frequency curve and a logarithmic energy curve are extracted from the audio data frame by frame. In step a5, combining the time-aligned text, the logarithmic fundamental frequency curve and logarithmic energy curve of the word-end syllable are obtained, as well as the logarithmic fundamental frequency values and logarithmic energy values of the word-end voiced frame and of the voiced frame at the beginning of the next word. In step a6, the logarithmic fundamental frequency statistical characteristics and logarithmic energy statistical characteristics of the word-end syllable are calculated, together with the logarithmic fundamental frequency difference and logarithmic energy difference between the word-end voiced frame and the voiced frame at the beginning of the next word. In step a7, these acoustic features are spliced to form the acoustic feature set of each word for the prosody hierarchy automatic labeling task.
Specifically, after the text data to be trained and the audio data to be trained are forcibly aligned, frame boundary information at the phoneme level is obtained. Assuming that the text data to be trained is "sincere greeting and heartfelt blessing" and the target word is "greeting", the word-end syllable duration refers to the time length of the last syllable of the word, and the frame boundary of the word-end syllable can be obtained from the forced alignment information. For example, the last character of the target word "greeting" is pronounced "hou", and the word-end syllable is "ou"; the start frame number of "ou" in the audio is frame 101 and the end frame number is frame 120, so "ou" lasts 20 frames. At 5 milliseconds per frame, the pronunciation duration of "ou" is 100 milliseconds, that is, the word-end syllable duration of "greeting" is 100 milliseconds.
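A minimal sketch of this word-end syllable duration computation follows, assuming a 5 millisecond frame shift as in the example; the function name and frame numbers are illustrative.

```python
FRAME_SHIFT_MS = 5  # assumed frame shift, matching the 5 ms per frame in the example

def syllable_duration_ms(start_frame, end_frame):
    # Duration of the word-end syllable computed from its frame boundary
    # in the forced-alignment (time-aligned) result.
    return (end_frame - start_frame + 1) * FRAME_SHIFT_MS

# Word-end syllable "ou" of "greeting": frames 101..120 -> 20 frames -> 100 ms.
print(syllable_duration_ms(101, 120))   # 100
```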
Secondly, in the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, the text data to be trained and the audio data to be trained are forcibly aligned to obtain a time-aligned text, and the word end syllable duration of the target word is determined according to the time-aligned text. By the method, the time-aligned text can be obtained, the word end syllable duration is extracted, the word end syllable duration is used as one item in the acoustic feature set, and high-level features beneficial to labeling are automatically learned from the originally input acoustic feature set, so that the accuracy of the rhythm level labeling model is improved.
Optionally, on the basis of the third embodiment corresponding to fig. 6, in a fourth optional embodiment of the method for model training provided in the embodiment of the present application, extracting an acoustic feature set to be trained of each word according to audio data to be trained may include:
and determining post-word pause duration of the target word according to the time alignment text.
In this embodiment, how to obtain the post-word pause duration of a word will be described. Specifically, after the text data to be trained and the audio data to be trained are forcibly aligned, a time-aligned text is obtained. Assuming that the text data to be trained is "sincere greeting and heartfelt blessing" and the target word is "greeting", the next adjacent word of the target word is "and". According to the time-aligned text, the short pause between the end of "greeting" and the beginning of "and" can be obtained; the pause lasts 20 frames, each frame is 5 milliseconds, and the post-word pause duration of the target word is therefore 100 milliseconds.
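A corresponding sketch for the post-word pause duration, again assuming 5 millisecond frames; the frame numbers in the example call are illustrative placeholders.

```python
FRAME_SHIFT_MS = 5  # assumed frame shift

def pause_duration_ms(word_end_frame, next_word_start_frame):
    # Pause between the end of the target word and the start of the next word,
    # both frame numbers taken from the time-aligned text.
    return (next_word_start_frame - word_end_frame - 1) * FRAME_SHIFT_MS

# 20 silent frames between "greeting" and "and" -> 100 ms post-word pause duration.
print(pause_duration_ms(120, 141))   # 100
```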
In the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, the text data to be trained and the audio data to be trained are aligned forcibly to obtain a time aligned text, and then the post-word pause duration can be determined according to the time aligned text. By the method, the post-word pause duration of each word can be determined after the text data and the audio data are aligned forcibly, the post-word pause duration is used as one item in the acoustic feature set, and high-level features beneficial to labeling are automatically learned from the acoustic feature set input originally, so that the accuracy of the rhythm level labeling model is improved.
Optionally, on the basis of the third embodiment corresponding to fig. 6, in a fifth optional embodiment of the method for model training provided in the embodiment of the present application, extracting an acoustic feature set to be trained of each word according to audio data to be trained may include:
calculating the frame number of the voiced start frame and the frame number of the voiced end frame of the word-end syllable of the target word according to the time-aligned text and the fundamental frequency information extracted from the audio data to be trained;
extracting a logarithmic fundamental frequency curve and a logarithmic energy curve of audio data to be trained;
and calculating the acoustic statistical characteristics of the word-end syllable of the target word according to the frame number of the voiced start frame of the word-end syllable of the target word, the frame number of the voiced end frame, the logarithmic fundamental frequency curve and the logarithmic energy curve, wherein the acoustic statistical characteristics of the word-end syllable comprise at least one of the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic fundamental frequency curve, and further comprise at least one of the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic energy curve.
In this embodiment, how to obtain the acoustic statistical characteristics of the word-end syllable will be described. Specifically, after the text data to be trained and the audio data to be trained are forcibly aligned, a time-aligned text is obtained. Assuming that the text data to be trained is "sincere greeting and heartfelt blessing", the fundamental frequency and energy of the corresponding audio are extracted frame by frame, so as to generate a fundamental frequency curve and an energy curve. For ease of understanding, please refer to fig. 8 and fig. 9, where fig. 8 is a schematic diagram of a fundamental frequency curve in the embodiment of the present application and fig. 9 is a schematic diagram of an energy curve in the embodiment of the present application. In order to normalize the data, the logarithm of each of the two curves is taken to obtain a logarithmic fundamental frequency curve and a logarithmic energy curve; near a prosody hierarchy boundary, the fundamental frequency and the energy tend to weaken. Assuming that the target word is "greeting", according to the frame number of the voiced start frame and the frame number of the voiced end frame of its word-end syllable, the logarithmic fundamental frequency curve and the logarithmic energy curve corresponding to the word-end syllable of the target word are intercepted from the logarithmic fundamental frequency curve and the logarithmic energy curve of the audio, and ten-dimensional acoustic statistical characteristics of the word-end syllable are calculated from them, namely the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic fundamental frequency curve, and the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic energy curve.
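A sketch of the ten-dimensional word-end syllable statistics follows, under the assumption that the per-frame logarithmic fundamental frequency and logarithmic energy curves are already available as arrays; the curves generated here are random placeholders rather than real audio features.

```python
import numpy as np

def syllable_acoustic_stats(log_f0, log_energy, start, end):
    # Ten-dimensional word-end syllable statistics: maximum, minimum, interval range,
    # average and variance of the log fundamental frequency and of the log energy,
    # computed over the voiced frames [start, end] of the word-end syllable.
    seg_f0 = log_f0[start:end + 1]
    seg_en = log_energy[start:end + 1]
    stats = []
    for seg in (seg_f0, seg_en):
        stats.extend([seg.max(), seg.min(), seg.max() - seg.min(),
                      seg.mean(), seg.var()])
    return np.array(stats, dtype=np.float32)

# Illustrative per-frame curves of one utterance (values are placeholders).
log_f0 = np.log(np.random.uniform(100, 300, size=500))
log_energy = np.log(np.random.uniform(1e-3, 1.0, size=500))
features = syllable_acoustic_stats(log_f0, log_energy, start=101, end=120)
print(features.shape)   # (10,)
```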
In the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, the text data to be trained and the audio data to be trained are forcibly aligned to obtain a time aligned text, then the frame number of the final syllable voiced initial frame and the frame number of the voiced end frame of the target word are obtained through calculation according to the time aligned text and the fundamental frequency information extracted from the audio data to be trained, the logarithmic fundamental frequency curve and the logarithmic energy curve of the audio data to be trained are extracted, and finally the acoustic statistical characteristics of the final syllable of the target word are obtained through calculation according to the frame number of the final syllable voiced initial frame, the frame number of the voiced end frame, the logarithmic fundamental frequency curve and the logarithmic energy curve of the target word. By the method, the time-aligned text data is obtained, the frame numbers of the initial frame and the ending frame of the voiced speech segment at the end of the word can be obtained according to the fundamental frequency information extracted from the audio, and the high-level features beneficial to labeling are automatically learned from the acoustic feature set input originally, so that the accuracy of the rhythm level labeling model is improved.
Optionally, on the basis of the third embodiment corresponding to fig. 6, in a seventh optional embodiment of the method for model training provided in the embodiment of the present application, extracting an acoustic feature set to be trained of each word according to audio data to be trained may include:
calculating the frame number of the last voiced sound frame of the target word and the frame number of the voiced sound frame of the next adjacent word prefix of the target word according to the time alignment text and fundamental frequency information extracted from the audio data to be trained;
determining a fundamental frequency value and an energy value between a voiced frame at the tail of the target word and a voiced frame at the head of the next adjacent word according to the frame number of the last voiced frame of the target word, the frame number of the voiced frame at the head of the next adjacent word of the target word and fundamental frequency information and energy information which are extracted from audio data to be trained in a framing manner;
calculating to obtain a logarithmic difference value of the fundamental frequency value according to the fundamental frequency value between the suffix voiced frame of the target word and the next adjacent word prefix voiced frame, and calculating to obtain a logarithmic difference value of the energy value according to the energy value between the suffix voiced frame of the target word and the next adjacent word prefix voiced frame, wherein the logarithmic difference value of the fundamental frequency value and the logarithmic difference value of the energy value belong to the acoustic characteristic change value between words.
In this embodiment, how to obtain the inter-word acoustic feature change value will be described. Specifically, after the text data to be trained and the audio data to be trained are forcibly aligned, a time-aligned text is obtained. Assuming that the text data to be trained is "sincere greeting and heartfelt blessing", the fundamental frequency information and the energy information corresponding to the audio data to be trained are extracted frame by frame, so as to generate a fundamental frequency curve and an energy curve, and in order to normalize the data, the logarithm of each of the two curves is taken to obtain a logarithmic fundamental frequency curve and a logarithmic energy curve. Assuming that the target word is "greeting", the frame number of the last voiced frame of "greeting" and the frame number of the first voiced frame of the next word "and" can be determined according to the time-aligned text and the fundamental frequency information extracted frame by frame from the audio; the fundamental frequency values and energy values of these two frames can then be obtained, and the logarithmic fundamental frequency difference and the logarithmic energy difference of the two frames are calculated.
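A sketch of this inter-word change computation, assuming per-frame fundamental frequency and energy arrays; the function and variable names, as well as the frame indices in the example call, are illustrative.

```python
import numpy as np

def interword_change(f0, energy, word_end_voiced_frame, next_word_start_voiced_frame):
    # Logarithmic fundamental frequency difference and logarithmic energy difference
    # between the last voiced frame of the target word and the first voiced frame
    # of the next adjacent word; both frame indices come from the time-aligned text.
    log_f0_diff = np.log(f0[next_word_start_voiced_frame]) - np.log(f0[word_end_voiced_frame])
    log_en_diff = np.log(energy[next_word_start_voiced_frame]) - np.log(energy[word_end_voiced_frame])
    return log_f0_diff, log_en_diff

# Illustrative per-frame curves; frame indices are placeholders.
f0 = np.random.uniform(100, 300, size=500)
energy = np.random.uniform(1e-3, 1.0, size=500)
print(interword_change(f0, energy, word_end_voiced_frame=118, next_word_start_voiced_frame=142))
```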
In the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, they are forcibly aligned to obtain a time-aligned text. Then, according to the frame number of the last voiced frame of the target word, the frame number of the voiced frame at the beginning of the next adjacent word, and the fundamental frequency and energy data extracted frame by frame from the audio, the fundamental frequency values and energy values of the voiced frame at the end of the target word and of the voiced frame at the beginning of the next word are determined, and the logarithmic difference of the fundamental frequency values and the logarithmic difference of the energy values are calculated and used as the inter-word acoustic feature change value. In this way, high-level features beneficial to labeling can be automatically learned from the acoustic feature set of the original input, so that the accuracy of the prosody hierarchy labeling model is improved.
Optionally, on the basis of the embodiment corresponding to fig. 6 or any one of the first to seventh optional embodiments corresponding to fig. 6, in an eighth optional embodiment of the method for model training provided in the embodiment of the present application, training the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model may include:
acquiring a first output result of a target word identifier through a word embedding layer in a rhythm level labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of at least one word, and the word embedding layer is obtained through training according to a first model parameter;
acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target words, and the text neural network is obtained through training according to second model parameters;
training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to a target word, and the third model parameter is used for generating an acoustic neural network in a prosody hierarchy labeling model;
and generating a rhythm level labeling model according to the first model parameter, the second model parameter and the third model parameter.
In this embodiment, a method for obtaining the prosody hierarchy labeling model through training will be described. For ease of understanding, please refer to fig. 10, where fig. 10 is a schematic structural diagram of the prosody hierarchy labeling model in the embodiment of the present application. As shown in the figure, the target word is taken as an example, that is, the word identifier is the target word identifier, the text feature set is the target text feature set to be trained corresponding to the target word, and the acoustic feature set is the target acoustic feature set to be trained corresponding to the target word. The target word identifier is used as the input of the word embedding layer, which outputs a first output result, where the first output result is the word vector to which the target word identifier is mapped, and the word vector may be 200-dimensional. The target text feature set to be trained (part of speech, word length and post-word punctuation type) is used as the input of the text neural network (such as a feedforward neural network), which outputs a second output result. The target acoustic feature set to be trained, the first output result and the second output result are used together as the input of the acoustic neural network (such as a bidirectional long short-term memory network), and the posterior probabilities of all prosodic hierarchy structure types of the target word are output through a softmax layer; for example, the probability of a non-prosodic-hierarchy boundary is 0.1, the probability of a prosodic word is 0.1, the probability of a prosodic phrase is 0.2, and the probability of an intonation phrase is 0.6. The prosodic hierarchy structure corresponding to the maximum posterior probability is taken as the labeling result, so the labeling result of the target word is an intonation phrase. The labeling result is a predicted result obtained during training and needs to be compared with the real result; that is, a loss function is adopted, and the loss between the two is minimized to determine the third model parameter of the acoustic neural network. The prosody hierarchy labeling model is obtained by training with the first model parameter, the second model parameter and the third model parameter combined. The prosody hierarchy labeling model adopts a stacked structure of a feedforward neural network and a bidirectional long short-term memory network, and can label three prosody hierarchies, namely prosodic words, prosodic phrases and intonation phrases, at the same time.
The loss function is used for measuring the degree of inconsistency between the predicted value and the real value of the model, and is a non-negative real-valued function. The loss function adopted in the present application may be cross entropy, or cross entropy with class weights.
It can be understood that the word embedding layer, the feedforward neural network and the bidirectional long short-term memory network are trained together. The word embedding layer is used for training word vectors, and the feedforward neural network is used for automatically extracting, from the original input features (part of speech, word length and post-word punctuation type), high-level feature representations that are more beneficial to the labeling task. These features are spliced together at the input of the bidirectional long short-term memory network, so that the text features and the acoustic features are utilized jointly.
The bidirectional long short-term memory network can learn the dependency relationships between contexts, and the labeling task also needs context information; for example, if the previous word is an intonation phrase boundary, the current word is unlikely to be an intonation phrase boundary. Therefore, the stacked structure of the feedforward neural network and the bidirectional long short-term memory network is utilized jointly, and a trainable word embedding layer is adopted, so that not only can the text and acoustic feature information be utilized, but high-level features can also be automatically extracted from the text features, and the context features are utilized, which makes the structure suitable for the prosody hierarchy structure labeling task.
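As a rough sketch (not the authoritative implementation), the stacked structure described above could be assembled as follows, assuming PyTorch. The 200-dimensional embedding, the feedforward text network, the bidirectional LSTM over the spliced features and the four boundary types follow the description above; the hidden sizes, the text feature dimension (here 90, matching the earlier three-word one-hot sketch), the acoustic feature dimension (here 14: two durations, ten word-end syllable statistics and two inter-word differences) and the class weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProsodyLabeler(nn.Module):
    # Sketch of the stacked structure: trainable word embedding layer, feedforward
    # text network, bidirectional LSTM over the spliced features, and a linear
    # output whose softmax gives the posterior over the four boundary types.
    def __init__(self, vocab_size, text_dim=90, acoustic_dim=14,
                 embed_dim=200, text_hidden=128, lstm_hidden=256, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)            # first model parameter
        self.text_net = nn.Sequential(                                  # second model parameter
            nn.Linear(text_dim, text_hidden), nn.ReLU(),
            nn.Linear(text_hidden, text_hidden), nn.ReLU())
        self.blstm = nn.LSTM(embed_dim + text_hidden + acoustic_dim,    # third model parameter
                             lstm_hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, word_ids, text_feats, acoustic_feats):
        # word_ids: (B, T); text_feats: (B, T, text_dim); acoustic_feats: (B, T, acoustic_dim)
        first_output = self.embedding(word_ids)        # word vectors (first output result)
        second_output = self.text_net(text_feats)      # high-level text features (second output result)
        spliced = torch.cat([first_output, second_output, acoustic_feats], dim=-1)
        hidden, _ = self.blstm(spliced)
        return self.out(hidden)                        # logits over the 4 prosody boundary types

# Cross entropy (optionally weighted, as mentioned above) over the per-word labels.
model = ProsodyLabeler(vocab_size=10000)
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.0, 1.5, 2.0]))
```

During training, the embedding weights, the feedforward weights and the bidirectional LSTM weights are optimized jointly by minimizing the (weighted) cross entropy, which corresponds to learning the first, second and third model parameters as one whole.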
Further, in the embodiment of the present application, a method for obtaining a prosody level labeling model through training is introduced, that is, three types of model parameters are obtained through training, which are a first model parameter, a second model parameter, and a third model parameter, respectively, and the first model parameter, the second model parameter, and the third model parameter are taken as a whole, and a prosody level labeling model is generated through training at the same time. Through the mode, the three parts of neural networks are stacked to form a complete prosody level labeling model and are used as a whole for model training, the training content comprises training between word identifications and word vectors, training of word texts and word text characteristics and training of audio and acoustic characteristics, and therefore richer characteristics can be obtained, and sentence labeling accuracy is improved.
Referring to fig. 11, fig. 11 is a schematic diagram of an embodiment of a prosody hierarchy labeling apparatus 30 according to the present application, which includes:
an obtaining module 301, configured to obtain text data to be labeled and audio data, where the text data to be labeled and the audio data have a corresponding relationship, the text data to be labeled includes at least one word, and each word corresponds to a word identifier;
an extracting module 302, configured to extract a to-be-labeled text feature set of each word according to the to-be-labeled text data obtained by the obtaining module 301, where the to-be-labeled text feature set includes a part of speech, a word length, and a post-word punctuation type;
the extracting module 302 is further configured to extract an acoustic feature set of each word according to the audio data acquired by the acquiring module 301, where the acoustic feature set includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value;
the prediction module 303 is configured to obtain a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word extracted by the extraction module 302, and the acoustic feature set of each word.
In this embodiment, the obtaining module 301 obtains text data to be labeled and audio data, where the text data to be labeled and the audio data have a corresponding relationship, the text data to be labeled includes at least one word, and each word corresponds to a word identifier; the extracting module 302 extracts the text feature set to be labeled of each word according to the text data to be labeled obtained by the obtaining module 301, where the text feature set to be labeled includes a part of speech, a word length and a post-word punctuation type; the extracting module 302 further extracts the acoustic feature set of each word according to the audio data obtained by the obtaining module 301, where the acoustic feature set includes a word-end syllable duration, a post-word pause duration, a word-end syllable acoustic statistical feature and an inter-word acoustic feature variation value; and the prediction module 303 obtains a prosody hierarchy structure through the prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word extracted by the extraction module 302, and the acoustic feature set of each word.
In the embodiment of the application, a prosody hierarchy labeling apparatus is provided, which obtains text data to be labeled and audio data, where the text data to be labeled and the audio data have a corresponding relationship, the text data to be labeled includes at least one word, and each word corresponds to a word identifier; extracts the text feature set to be labeled of each word according to the text data to be labeled, where the text feature set to be labeled includes a part of speech, a word length and a post-word punctuation type; extracts the acoustic feature set of each word according to the audio data, where the acoustic feature set includes a word-end syllable duration, a post-word pause duration, a word-end syllable acoustic statistical feature and an inter-word acoustic feature variation value; and finally, according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word, obtains a prosody hierarchy structure through the prosody hierarchy labeling model. In this way, the prosody hierarchy labeling model is established by combining the text features and the acoustic features, richer features can be provided for labeling prosody hierarchies, and the accuracy of prosody hierarchy labeling can be improved by adopting a more accurate prosody hierarchy labeling model.
Optionally, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the prosody hierarchy labeling apparatus 30 provided in the embodiment of the present application,
The prediction module 303 is specifically configured to determine at least one of a prosodic word, a prosodic phrase and an intonation phrase through the prosody hierarchy labeling model;
or, alternatively,
and determining prosodic words and/or prosodic phrases through the prosodic hierarchy annotation model.
Secondly, in the embodiment of the present application, two common prosody hierarchy labeling methods are introduced, one is to determine prosodic words, prosodic phrases and intonation phrases through a prosody hierarchy labeling model, and the other is to determine prosodic words and prosodic phrases through a prosody hierarchy labeling model. Through the mode, the user can select a more detailed labeling scheme with three-layer prosody hierarchical structures of prosodic words, prosodic phrases and intonation phrases, and can also select a labeling scheme with two-layer prosody hierarchical structures of the prosodic words and the prosodic phrases. Therefore, the prosody hierarchy output can be selected according to the requirement, and the flexibility of the scheme is improved.
Referring to fig. 12, fig. 12 is a schematic view of an embodiment of the model training apparatus according to the embodiment of the present application, and the model training apparatus 40 includes:
an obtaining module 401, configured to obtain text data to be trained and audio data to be trained, where the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained includes at least one word, and each word corresponds to a word identifier;
an extracting module 402, configured to extract a to-be-trained text feature set of each word according to the to-be-trained text data acquired by the acquiring module 401, where the to-be-trained text feature set includes a part of speech, a word length, and a word post-punctuation type;
the extracting module 402 is further configured to extract an acoustic feature set to be trained of each word according to the audio data to be trained acquired by the acquiring module 401, where the acoustic feature set to be trained includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value;
a training module 403, configured to train the word identifier corresponding to each word, the text feature set to be trained of each word extracted by the extraction module 402, and the acoustic feature set to be trained of each word, so as to obtain a prosody hierarchy labeling model, where the prosody hierarchy labeling model is used to label a prosody hierarchy structure.
In this embodiment, an obtaining module 401 obtains text data to be trained and audio data to be trained, where the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained includes at least one word, and each word corresponds to a word identifier, an extracting module 402 extracts a text feature set to be trained of each word according to the text data to be trained obtained by the obtaining module 401, where the text feature set to be trained includes a part of speech, a word length, and a post-word punctuation type, the extracting module 402 extracts an acoustic feature set to be trained of each word according to the audio data to be trained obtained by the obtaining module 401, where the acoustic feature set to be trained includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value, the training module 403 trains the word identifier corresponding to each word, the text feature set to be trained of each word extracted by the extraction module 402, and the acoustic feature set to be trained of each word, to obtain a prosody hierarchy labeling model, where the prosody hierarchy labeling model is used to label a prosody hierarchy.
In the embodiment of the application, a method for model training is provided, which includes firstly, obtaining text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, each word corresponds to a word identifier, then extracting a feature set of the text to be trained of each word according to the text data to be trained, wherein the feature set of the text to be trained includes a part of speech, a word length and a post-word punctuation type, and extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained includes a word end syllable duration, a post-word pause duration, a word end acoustic statistical feature and an inter-word acoustic feature variation value, and finally training the word identifier corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word, and obtaining a rhythm level labeling model. Through the mode, the prosody hierarchy labeling model is established by combining the text features and the acoustic features, richer features can be provided for prosody hierarchy labeling tasks, more accurate prosody hierarchy labeling models can be adopted to improve the accuracy of prosody hierarchy labeling, and the effect of voice synthesis is improved.
Optionally, on the basis of the embodiment corresponding to fig. 12, please refer to fig. 13, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application, the model training apparatus 40 further includes a processing module 404 and a generating module 405;
the processing module 404 is configured to perform word segmentation processing on the text data to be trained after the obtaining module 401 obtains the text data to be trained and the audio data to be trained, so as to obtain at least one word;
the obtaining module 401 is further configured to obtain a target word identifier corresponding to a target word according to a preset word identifier relationship, where the preset word identifier relationship is used to indicate a relationship between each preset word and a word identifier, and the target word belongs to any one of the at least one word processed by the processing module;
the generating module 405 is configured to generate a target word vector corresponding to the target word in the text data to be trained;
the training module 403 is specifically configured to train the target word identifier obtained by the obtaining module 401 and the target word vector generated by the generating module 405 to obtain a first model parameter, where the first model parameter is used to generate a word embedding layer in the prosody level labeling model.
Secondly, in the embodiment of the application, a method for training the word embedding layer is introduced: word segmentation processing is performed on the text data to be trained, the target word identifier corresponding to the target word is obtained according to the preset word identifier relationship, the target word vector corresponding to the target word in the text data to be trained is generated, and then the target word identifier and the target word vector are trained to obtain the first model parameter, where the first model parameter is used for generating the word embedding layer in the prosody hierarchy labeling model. In this way, the word embedding layer in the prosody hierarchy labeling model can be obtained through direct training, and the other neural networks in the prosody hierarchy labeling model can be trained at the same time as the word embedding layer, so that the process of additionally training a word vector model with a separate neural network is saved, and the training efficiency is improved.
Alternatively, on the basis of the embodiment corresponding to fig. 12, in another embodiment of the model training device 40 provided in the embodiment of the present application,
the extraction module 402 is specifically configured to obtain a part of speech, a word length, and a post-word punctuation type of a target word in the text data to be trained, where the part of speech represents a result of a grammar classification of the word, the word length represents a word number of the word, and the post-word punctuation type is used to represent a punctuation type corresponding to the post-word;
acquiring the part of speech, word length and post-word punctuation types of associated words in the text data to be trained, wherein the associated words are words having an association relation with the target words;
the training module 403 is specifically configured to train the part of speech, the word length, and the post-word punctuation type of the target word and the part of speech, the word length, and the post-word punctuation type of the associated word to obtain a second model parameter, where the second model parameter is used to generate a text neural network in the prosody level labeling model.
Secondly, in the embodiment of the application, a method for training a text neural network is introduced, namely, the part of speech, the word length and the post-word punctuation type of a target word in text data to be trained are firstly obtained, the part of speech, the word length and the post-word punctuation type of a related word in the text data to be trained are also obtained, then the part of speech, the word length and the post-word punctuation type of the target word and the part of speech, the word length and the post-word punctuation type of the related word are trained to obtain second model parameters, and the second model parameters are used for generating the text neural network in a rhythm level labeling model. By the mode, the system can automatically learn the high-level feature expression which is favorable for prosody hierarchy structure labeling through the neural network, and automatically learn the high-level feature which is favorable for labeling from the originally input text feature set, so that the accuracy of the prosody hierarchy labeling model is improved.
Optionally, on the basis of the embodiment corresponding to fig. 12, please refer to fig. 14, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application, the model training apparatus 40 further includes an alignment module 406;
the alignment module 406 is configured to, after the obtaining module 401 obtains text data to be trained and audio data to be trained, perform forced alignment on the text data to be trained and the audio data to be trained to obtain a time-aligned text;
the extracting module 402 is specifically configured to determine the word end syllable duration of the target word according to the time alignment text.
Secondly, in the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, the text data to be trained and the audio data to be trained are forcibly aligned to obtain a time-aligned text, and the word end syllable duration of the target word is determined according to the time-aligned text. By the method, the time-aligned text can be obtained, the word end syllable duration is extracted, the word end syllable duration is used as one item in the acoustic feature set, and high-level features beneficial to labeling are automatically learned from the originally input acoustic feature set, so that the accuracy of the rhythm level labeling model is improved.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application,
the extracting module 402 is specifically configured to determine a post-word pause duration of the target word according to the time-aligned text.
In the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, the text data to be trained and the audio data to be trained are aligned forcibly to obtain a time aligned text, and then the post-word pause duration can be determined according to the time aligned text. By the method, the post-word pause duration of each word can be determined after the text data and the audio data are aligned forcibly, the post-word pause duration is used as one item in the acoustic feature set, and high-level features beneficial to labeling are automatically learned from the acoustic feature set input originally, so that the accuracy of the rhythm level labeling model is improved.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application,
the extracting module 402 is specifically configured to calculate, according to the time-aligned text and the fundamental frequency information extracted from the audio data to be trained, a frame number of a voiced beginning frame and a frame number of a voiced ending frame of the end-of-word syllable of the target word;
extracting a logarithmic fundamental frequency curve and a logarithmic energy curve of the audio data to be trained;
and calculating the acoustic statistical characteristics of the word-end syllable of the target word according to the frame number of the voiced start frame of the word-end syllable of the target word, the frame number of the voiced end frame, the logarithmic fundamental frequency curve and the logarithmic energy curve, wherein the acoustic statistical characteristics of the word-end syllable comprise at least one of the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic fundamental frequency curve, and further comprise at least one of the maximum value, the minimum value, the interval range, the average value and the variance of the logarithmic energy curve.
In the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, the text data to be trained and the audio data to be trained are forcibly aligned to obtain a time aligned text, then the frame number of the final syllable voiced initial frame and the frame number of the voiced end frame of the target word are obtained through calculation according to the time aligned text and the fundamental frequency information extracted from the audio data to be trained, the logarithmic fundamental frequency curve and the logarithmic energy curve of the audio data to be trained are extracted, and finally the acoustic statistical characteristics of the final syllable of the target word are obtained through calculation according to the frame number of the final syllable voiced initial frame, the frame number of the voiced end frame, the logarithmic fundamental frequency curve and the logarithmic energy curve of the target word. By the method, the time-aligned text data is obtained, the frame numbers of the initial frame and the ending frame of the voiced speech segment at the end of the word can be obtained according to the fundamental frequency information extracted from the audio, and the high-level features beneficial to labeling are automatically learned from the acoustic feature set input originally, so that the accuracy of the rhythm level labeling model is improved.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application,
the extracting module 402 is specifically configured to calculate, according to the time-aligned text and fundamental frequency information extracted from the audio data to be trained, a frame number of a last voiced frame of the target word and a frame number of a voiced frame of a next adjacent word prefix of the target word;
determining a fundamental frequency value and an energy value between the end voiced frame of the target word and the next adjacent word beginning voiced frame according to the frame number of the last voiced frame of the target word, the frame number of the voiced frame of the next adjacent word beginning of the target word, and fundamental frequency information and energy information which are extracted from the audio data to be trained in a framing manner;
and calculating to obtain a logarithmic difference value of the fundamental frequency value according to the fundamental frequency value between the suffix voiced frame of the target word and the next adjacent word prefix voiced frame, and calculating to obtain a logarithmic difference value of the energy value according to the energy value between the suffix voiced frame of the target word and the next adjacent word prefix voiced frame, wherein the logarithmic difference value of the fundamental frequency value and the logarithmic difference value of the energy value belong to the acoustic feature change value between words.
In the embodiment of the application, after the text data to be trained and the audio data to be trained are obtained, they are forcibly aligned to obtain a time-aligned text. Then, according to the frame number of the last voiced frame of the target word, the frame number of the voiced frame at the beginning of the next adjacent word, and the fundamental frequency and energy data extracted frame by frame from the audio, the fundamental frequency values and energy values of the voiced frame at the end of the target word and of the voiced frame at the beginning of the next word are determined, and the logarithmic difference of the fundamental frequency values and the logarithmic difference of the energy values are calculated and used as the inter-word acoustic feature change value. In this way, high-level features beneficial to labeling can be automatically learned from the acoustic feature set of the original input, so that the accuracy of the prosody hierarchy labeling model is improved.
Optionally, on the basis of the embodiments corresponding to fig. 12, fig. 13 or fig. 14, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application,
the training module 403 is specifically configured to obtain a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, where the target word identifier corresponds to a target word, the target word belongs to any word in the at least one word, and the word embedding layer is obtained through training according to a first model parameter;
acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter;
training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model;
and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
Further, in the embodiment of the present application, a method for obtaining a prosody level labeling model through training is introduced, that is, three types of model parameters are obtained through training, which are a first model parameter, a second model parameter, and a third model parameter, respectively, and the first model parameter, the second model parameter, and the third model parameter are taken as a whole, and a prosody level labeling model is generated through training at the same time. Through the mode, the three parts of neural networks are stacked to form a complete prosody level labeling model and are used as a whole for model training, the training content comprises training between word identifications and word vectors, training of word texts and word text characteristics and training of audio and acoustic characteristics, and therefore richer characteristics can be obtained, and sentence labeling accuracy is improved.
As shown in fig. 15, for convenience of description, only the relevant parts of the embodiments of the present application are shown, and details of the specific technology are not disclosed, please refer to the method part of the embodiments of the present application. The terminal device may be any terminal device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a point of sale (POS), a vehicle-mounted computer, and the like, taking the terminal device as the mobile phone as an example:
fig. 15 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 15, the cellular phone includes: radio Frequency (RF) circuitry 510, memory 520, input unit 530, display unit 540, sensor 550, audio circuitry 560, wireless fidelity (WiFi) module 570, processor 580, and power supply 590. Those skilled in the art will appreciate that the handset configuration shown in fig. 15 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 15:
RF circuit 510 may be used for receiving and transmitting signals during information transmission and reception or during a call; in particular, after receiving downlink information of a base station, the RF circuit 510 sends the downlink information to the processor 580 for processing, and in addition, transmits uplink data to the base station. In general, RF circuit 510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, RF circuit 510 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short message service (SMS), etc.
The memory 520 may be used to store software programs and modules, and the processor 580 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 520. The memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. Further, the memory 520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 530 may include a touch panel 531 and other input devices 532. The touch panel 531, also called a touch screen, can collect touch operations of a user on or near the touch panel 531 (for example, operations of the user on or near the touch panel 531 by using any suitable object or accessory such as a finger or a stylus pen), and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 531 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, and sends the touch point coordinates to the processor 580, and can receive and execute commands sent by the processor 580. In addition, the touch panel 531 may be implemented by various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 530 may include other input devices 532 in addition to the touch panel 531. In particular, other input devices 532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 540 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The display unit 540 may include a display panel 541; optionally, the display panel 541 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 531 may cover the display panel 541; when the touch panel 531 detects a touch operation on or near it, the touch operation is transmitted to the processor 580 to determine the type of the touch event, and the processor 580 then provides a corresponding visual output on the display panel 541 according to the type of the touch event. Although the touch panel 531 and the display panel 541 are shown as two separate components in fig. 15 to implement the input and output functions of the mobile phone, in some embodiments the touch panel 531 and the display panel 541 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 550, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 541 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 541 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
The audio circuit 560, speaker 561, and microphone 562 may provide an audio interface between the user and the mobile phone. The audio circuit 560 may convert received audio data into an electrical signal and transmit it to the speaker 561, which converts the electrical signal into a sound signal for output; on the other hand, the microphone 562 converts collected sound signals into electrical signals, which are received by the audio circuit 560 and converted into audio data. The audio data are then output to the processor 580 for processing and afterwards sent through the RF circuit 510 to, for example, another mobile phone, or output to the memory 520 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 570, the mobile phone can help the user send and receive e-mail, browse web pages, access streaming media, and so on, providing wireless broadband internet access for the user. Although fig. 15 shows the WiFi module 570, it is understood that the module is not an essential part of the mobile phone and may be omitted as needed without changing the essence of the invention.
The processor 580 is a control center of the mobile phone, connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 520 and calling data stored in the memory 520, thereby performing overall monitoring of the mobile phone. Alternatively, processor 580 may include one or more processing units; optionally, processor 580 may integrate an application processor, which handles primarily the operating system, user interface, applications, etc., and a modem processor, which handles primarily the wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 580.
The handset also includes a power supply 590 (e.g., a battery) for powering the various components, which may optionally be logically connected to the processor 580 via a power management system, such that the power management system may be used to manage charging, discharging, and power consumption.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In the embodiment of the present application, the processor 580 included in the terminal device further has the following functions:
acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be annotated of each word according to the text data to be annotated, wherein the text feature set to be annotated comprises part of speech, word length and post-word punctuation types;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
and acquiring a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word.
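For illustration, the per-word inputs listed above can be collected into a simple record before being fed to the prosodic hierarchy labeling model; the following Python sketch shows only an assumed data layout, not a structure defined by this application.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class WordFeatures:
    # Inputs gathered for one word of the text to be labeled (illustrative layout).
    word_id: int                      # word identifier
    pos: str                          # part of speech
    word_length: int                  # word length
    post_punct: str                   # post-word punctuation type, e.g. "none", "comma", "period"
    end_syllable_duration: float      # word-end syllable duration (seconds)
    post_word_pause: float            # post-word pause duration (seconds)
    f0_stats: Tuple[float, ...]       # word-end syllable log-F0 statistics (max, min, range, mean, variance)
    energy_stats: Tuple[float, ...]   # word-end syllable log-energy statistics
    delta_log_f0: float               # inter-word log-F0 change value
    delta_log_energy: float           # inter-word log-energy change value

# One word of a sentence (all numbers invented for illustration):
example = WordFeatures(word_id=1024, pos="n", word_length=2, post_punct="comma",
                       end_syllable_duration=0.21, post_word_pause=0.15,
                       f0_stats=(5.4, 5.1, 0.3, 5.2, 0.01),
                       energy_stats=(4.0, 3.6, 0.4, 3.8, 0.02),
                       delta_log_f0=-0.12, delta_log_energy=-0.30)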
In the embodiment of the present application, the processor 580 included in the terminal device further has the following functions:
acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be trained of each word according to the text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
and training the word identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure.
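As an illustrative sketch of the step of extracting the text feature set to be trained (part of speech, word length and post-word punctuation type), the Python code below assumes that word segmentation and part-of-speech tags are already available from some tagger, which this application does not prescribe; the punctuation mapping is likewise only an assumption.

# tagged_words: list of (word, part_of_speech) pairs from any Chinese word segmenter
# and POS tagger (punctuation marks appear as their own tokens).
PUNCT_TYPES = {"，": "comma", ",": "comma", "。": "period", ".": "period",
               "？": "question", "?": "question", "！": "exclamation", "!": "exclamation"}

def text_feature_set(tagged_words):
    feats = []
    for i, (word, pos) in enumerate(tagged_words):
        if word in PUNCT_TYPES:
            continue                                   # punctuation is not itself a labelled word
        following = tagged_words[i + 1][0] if i + 1 < len(tagged_words) else ""
        feats.append({
            "word": word,
            "pos": pos,                                          # part of speech
            "word_length": len(word),                            # word length (number of characters)
            "post_punct": PUNCT_TYPES.get(following, "none"),    # post-word punctuation type
        })
    return feats

# Example:
# text_feature_set([("今天", "t"), ("天气", "n"), ("很", "d"), ("好", "a"), ("，", "x"),
#                   ("我们", "r"), ("出去", "v"), ("走走", "v"), ("吧", "y"), ("。", "x")])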
Fig. 16 is a schematic structural diagram of a server according to an embodiment of the present application. The server 600 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 622 (e.g., one or more processors), a memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) for storing application programs 642 or data 644. The memory 632 and the storage medium 630 may be transient or persistent storage. The programs stored in the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 622 may be configured to communicate with the storage medium 630 and execute, on the server 600, the series of instruction operations stored in the storage medium 630.
The server 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 16.
In the embodiment of the present application, the CPU 622 included in the server also has the following functions:
acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be annotated of each word according to the text data to be annotated, wherein the text feature set to be annotated comprises part of speech, word length and post-word punctuation types;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
and acquiring a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word.
In the embodiment of the present application, the CPU 622 included in the server also has the following functions:
acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be trained of each word according to the text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
and training the word identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure.
With continuing research and progress in artificial intelligence technology, artificial intelligence has been developed and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. A method for prosodic hierarchy annotation, comprising:
acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be labeled of each word according to the text data to be labeled, wherein the text feature set to be labeled comprises part of speech, word length and post-word punctuation types, and the audio data is voice data;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
obtaining a prosodic hierarchy structure through a prosodic hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word, wherein the prosodic hierarchy structure comprises at least one of a prosodic word, a prosodic phrase and a intonation phrase, or the prosodic hierarchy structure comprises at least one of a prosodic word and a prosodic phrase, the prosodic hierarchy labeling model is obtained by training according to the word identifier corresponding to the word, the text feature set to be trained of the word and the acoustic feature set to be trained of the word, and the training process of the prosodic hierarchy labeling model is as follows: obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter; acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter; training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model; and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
2. A method of model training, comprising:
acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, each word corresponds to a word identifier, and the audio data to be trained is voice data;
extracting a text feature set to be trained of each word according to the text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
training the word identifier corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model, wherein the prosodic hierarchy labeling model is used for labeling a prosodic hierarchy structure, and the prosodic hierarchy structure comprises at least one of prosodic words, prosodic phrases and intonation phrases, or the prosodic hierarchy structure comprises at least one of prosodic words and prosodic phrases;
training the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model, including:
obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter;
acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter;
training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model;
and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
3. The method of claim 2, wherein after obtaining the text data to be trained and the audio data to be trained, the method further comprises:
performing word segmentation processing on the text data to be trained to obtain at least one word;
acquiring a target word identifier corresponding to a target word according to a preset word identifier relationship, wherein the preset word identifier relationship is used for representing a preset relationship between each word and the word identifier, and the target word belongs to any one of the at least one word;
generating a target word vector corresponding to the target word in the text data to be trained;
the training of the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model includes:
training the target word identifier and the target word vector to obtain a first model parameter, wherein the first model parameter is used for generating a word embedding layer in the prosodic hierarchy labeling model, and the word embedding layer is updated at a target time.
4. The method according to claim 2, wherein the extracting a feature set of the text to be trained for each word according to the text data to be trained comprises:
acquiring the part of speech, word length and post-word punctuation type of target words in the text data to be trained, wherein the part of speech represents a grammatical classification of a word, the word length represents the number of characters in a word, and the post-word punctuation type represents the type of punctuation that follows a word;
acquiring the part of speech, word length and post-word punctuation types of associated words in the text data to be trained, wherein the associated words are words having an association relation with the target words;
the training of the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model includes:
training the part of speech, word length and post-word punctuation types of the target words and the part of speech, word length and post-word punctuation types of the associated words by adopting a loss function;
and when the loss function reaches the minimum value, obtaining a second model parameter, wherein the second model parameter is used for generating a text neural network in the prosody hierarchy labeling model.
5. The method of claim 2, wherein after obtaining the text data to be trained and the audio data to be trained, the method further comprises:
performing forced alignment on the text data to be trained and the audio data to be trained to obtain a time-aligned text;
the extracting the acoustic feature set to be trained of each word according to the audio data to be trained includes:
and determining the word end syllable duration of the target word according to the time alignment text.
6. The method of claim 5, wherein the extracting the set of acoustic features to be trained for each word from the audio data to be trained comprises:
and determining post-word pause duration of the target word according to the time alignment text.
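By way of illustration, assuming the time-aligned text is available as per-syllable start and end times in seconds (the concrete alignment format is not fixed here), the word-end syllable duration of claim 5 and the post-word pause duration of claim 6 can be computed as in the following Python sketch:

def duration_features(aligned_words):
    # aligned_words: list of {"word": str, "syllables": [{"start": s, "end": e}, ...]} entries
    feats = []
    for i, w in enumerate(aligned_words):
        last_syl = w["syllables"][-1]
        end_syl_dur = last_syl["end"] - last_syl["start"]            # word-end syllable duration
        if i + 1 < len(aligned_words):
            pause = aligned_words[i + 1]["syllables"][0]["start"] - last_syl["end"]
        else:
            pause = 0.0                                               # no following word
        feats.append({"word": w["word"],
                      "end_syllable_duration": end_syl_dur,
                      "post_word_pause": max(pause, 0.0)})            # post-word pause duration
    return feats

# Example with a hypothetical alignment of two words:
# duration_features([
#     {"word": "今天", "syllables": [{"start": 0.00, "end": 0.18}, {"start": 0.18, "end": 0.40}]},
#     {"word": "天气", "syllables": [{"start": 0.55, "end": 0.74}, {"start": 0.74, "end": 0.95}]},
# ])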
7. The method of claim 5, wherein the extracting the set of acoustic features to be trained for each word from the audio data to be trained comprises:
calculating the frame number of the first voiced frame and the frame number of the last voiced frame of the word-end syllable of the target word according to the time alignment text and the fundamental frequency information extracted from the audio data to be trained;
extracting a logarithmic fundamental frequency curve and a logarithmic energy curve of the audio data to be trained;
and calculating the acoustic statistical features of the word-end syllable of the target word according to the frame number of the first voiced frame of the word-end syllable of the target word, the frame number of the last voiced frame, the logarithmic fundamental frequency curve and the logarithmic energy curve, wherein the acoustic statistical features of the word-end syllable comprise at least one of the maximum value, the minimum value, the range, the mean value and the variance of the logarithmic fundamental frequency curve, and further comprise at least one of the maximum value, the minimum value, the range, the mean value and the variance of the logarithmic energy curve.
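An illustrative sketch of the statistics described above, assuming frame-level logarithmic fundamental frequency and logarithmic energy arrays (one value per analysis frame) and the frame numbers of the first and last voiced frames of the word-end syllable; the array layout and frame shift are assumptions:

import numpy as np

def end_syllable_stats(log_f0, log_energy, voiced_start, voiced_end):
    # log_f0, log_energy: 1-D arrays indexed by frame; voiced_start/voiced_end: frame numbers
    # of the first and last voiced frames of the word-end syllable.
    f0_seg = np.asarray(log_f0)[voiced_start:voiced_end + 1]
    en_seg = np.asarray(log_energy)[voiced_start:voiced_end + 1]
    stats = {}
    for name, seg in (("log_f0", f0_seg), ("log_energy", en_seg)):
        stats[name] = {"max": float(seg.max()), "min": float(seg.min()),
                       "range": float(seg.max() - seg.min()),
                       "mean": float(seg.mean()), "variance": float(seg.var())}
    return stats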
8. The method of claim 5, wherein the extracting the set of acoustic features to be trained for each word from the audio data to be trained comprises:
calculating the frame number of the last voiced frame of the target word and the frame number of the word-initial voiced frame of the next adjacent word of the target word according to the time alignment text and the fundamental frequency information extracted from the audio data to be trained;
determining a fundamental frequency value and an energy value between the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word according to the frame number of the last voiced frame of the target word, the frame number of the word-initial voiced frame of the next adjacent word, and the fundamental frequency information and energy information extracted frame by frame from the audio data to be trained;
and calculating a logarithmic difference of the fundamental frequency values according to the fundamental frequency values of the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word, and calculating a logarithmic difference of the energy values according to the energy values of the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word, wherein the logarithmic difference of the fundamental frequency values and the logarithmic difference of the energy values belong to the inter-word acoustic feature variation values.
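An illustrative sketch of the inter-word change values described above, assuming frame-level fundamental frequency and energy arrays in linear scale and the frame numbers of the two voiced frames obtained from the alignment (both frames are voiced, so the values are positive):

import math

def inter_word_changes(f0, energy, cur_final_voiced, next_initial_voiced):
    # f0, energy: per-frame values; cur_final_voiced / next_initial_voiced: frame numbers of
    # the word-final voiced frame of the current word and the word-initial voiced frame of the next word.
    delta_log_f0 = math.log(f0[next_initial_voiced]) - math.log(f0[cur_final_voiced])
    delta_log_energy = math.log(energy[next_initial_voiced]) - math.log(energy[cur_final_voiced])
    return delta_log_f0, delta_log_energy     # inter-word acoustic feature variation values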
9. A prosodic hierarchy labeling apparatus, comprising:
the system comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring text data to be labeled and audio data, the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, each word corresponds to a word identifier, and the audio data is voice data;
the extraction module is used for extracting a text feature set to be labeled of each word according to the text data to be labeled obtained by the obtaining module, wherein the text feature set to be labeled comprises part of speech, word length and word post punctuation types;
the extraction module is further configured to extract an acoustic feature set of each word according to the audio data acquired by the acquisition module, where the acoustic feature set includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value;
the prediction module is configured to obtain a prosodic hierarchy structure through a prosodic hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word extracted by the extraction module, and the acoustic feature set of each word, where the prosodic hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and a intonation phrase, or the prosodic hierarchy structure includes at least one of a prosodic word and a prosodic phrase, the prosodic hierarchy labeling model is obtained by training according to the word identifier corresponding to the word, the text feature set to be trained of the word, and the acoustic feature set to be trained of the word, and the training process of the prosodic hierarchy labeling model is as follows: obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter; acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter; training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model; and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
10. A model training apparatus, comprising:
the training device comprises an acquisition module, a comparison module and a comparison module, wherein the acquisition module is used for acquiring text data to be trained and audio data to be trained, the text data to be trained and the audio data to be trained have a corresponding relation, the text data to be trained comprises at least one word, each word corresponds to a word identifier, and the audio data to be trained is voice data;
the extraction module is used for extracting a text feature set to be trained of each word according to the text data to be trained acquired by the acquisition module, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation types;
the extraction module is further configured to extract an acoustic feature set to be trained of each word according to the audio data to be trained acquired by the acquisition module, where the acoustic feature set to be trained includes a word end syllable duration, a post-word pause duration, a word end syllable acoustic statistical feature, and an inter-word acoustic feature variation value;
a training module, configured to train a word identifier corresponding to each word, the text feature set to be trained of each word extracted by the extraction module, and the acoustic feature set to be trained of each word, so as to obtain a prosodic hierarchy labeling model, where the prosodic hierarchy labeling model is used to label a prosodic hierarchy structure, and the prosodic hierarchy structure includes at least one of prosodic words, prosodic phrases, and intonation phrases, or the prosodic hierarchy structure includes at least one of prosodic words and prosodic phrases;
wherein the training module is specifically configured to:
obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter;
acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter;
training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model;
and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
11. A terminal device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory and comprises the following steps:
acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be labeled of each word according to the text data to be labeled, wherein the text feature set to be labeled comprises part of speech, word length and post-word punctuation types, and the audio data is voice data;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
obtaining a prosodic hierarchy structure through a prosodic hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word, wherein the prosodic hierarchy structure comprises at least one of a prosodic word, a prosodic phrase and a intonation phrase, or the prosodic hierarchy structure comprises at least one of a prosodic word and a prosodic phrase, the prosodic hierarchy labeling model is obtained by training according to the word identifier corresponding to the word, the text feature set to be trained of the word and the acoustic feature set to be trained of the word, and the training process of the prosodic hierarchy labeling model is as follows: obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter; acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter; training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model; generating the prosody hierarchy labeling model according to the first model parameter, the second model parameter and the third model parameter;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
12. A server, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory and comprises the following steps:
acquiring text data to be trained and audio data to be trained, wherein the text data to be trained and the audio data to be trained have a corresponding relationship, the text data to be trained comprises at least one word, each word corresponds to a word identifier, and the audio data to be trained is voice data;
extracting a text feature set to be trained of each word according to the text data to be trained, wherein the text feature set to be trained comprises part of speech, word length and word post punctuation type;
extracting an acoustic feature set to be trained of each word according to the audio data to be trained, wherein the acoustic feature set to be trained comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
training the word identifier corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model, wherein the prosodic hierarchy labeling model is used for labeling a prosodic hierarchy structure, and the prosodic hierarchy structure comprises at least one of prosodic words, prosodic phrases and intonation phrases, or the prosodic hierarchy structure comprises at least one of prosodic words and prosodic phrases;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate;
training the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word to obtain a prosodic hierarchy labeling model, including:
obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter;
acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter;
training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model;
and generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter and the third model parameter.
13. An intelligent voice interaction system is characterized by comprising a voice acquisition module, a voice processing and analyzing module and a storage module;
the voice acquisition module is used for acquiring text data to be labeled and audio data, wherein the text data to be labeled and the audio data have a corresponding relation, the text data to be labeled comprises at least one word, and each word corresponds to a word identifier;
the voice processing and analyzing module is used for extracting a text feature set to be labeled of each word according to the text data to be labeled, wherein the text feature set to be labeled comprises part of speech, word length and post-word punctuation types, and the audio data is voice data;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set comprises word end syllable duration, post-word pause duration, word end syllable acoustic statistical features and inter-word acoustic feature variation values;
obtaining a prosodic hierarchy structure through a prosodic hierarchy labeling model according to the word identifier of each word, the text feature set to be labeled of each word and the acoustic feature set of each word, wherein the prosodic hierarchy structure comprises at least one of a prosodic word, a prosodic phrase and a intonation phrase, or the prosodic hierarchy structure comprises at least one of a prosodic word and a prosodic phrase, the prosodic hierarchy labeling model is obtained by training according to the word identifier corresponding to the word, the text feature set to be trained of the word and the acoustic feature set to be trained of the word, and the training process of the prosodic hierarchy labeling model is as follows: obtaining a first output result of a target word identifier through a word embedding layer in the prosodic hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word belongs to any one of the at least one word, and the word embedding layer is obtained through training according to a first model parameter; acquiring a second output result of a target text feature set to be trained through a text neural network in the prosodic hierarchy labeling model, wherein the target text feature set to be trained corresponds to the target word, and the text neural network is obtained through training according to a second model parameter; training the first output result, the second output result and a target acoustic feature set to be trained to obtain a third model parameter, wherein the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used for generating an acoustic neural network in the prosody level labeling model; generating the prosody hierarchy labeling model according to the first model parameter, the second model parameter and the third model parameter;
the storage module is used for storing the prosody hierarchy.
14. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of claim 1, or perform the method of any of claims 2 to 8.
CN201910751371.6A 2019-01-22 2019-01-22 Rhythm level labeling method, model training method and device Active CN110444191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910751371.6A CN110444191B (en) 2019-01-22 2019-01-22 Rhythm level labeling method, model training method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910751371.6A CN110444191B (en) 2019-01-22 2019-01-22 Rhythm level labeling method, model training method and device
CN201910060152.3A CN109697973A (en) 2019-01-22 2019-01-22 A kind of method, the method and device of model training of prosody hierarchy mark

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910060152.3A Division CN109697973A (en) 2019-01-22 2019-01-22 A kind of method, the method and device of model training of prosody hierarchy mark

Publications (2)

Publication Number Publication Date
CN110444191A CN110444191A (en) 2019-11-12
CN110444191B true CN110444191B (en) 2021-11-26

Family

ID=66234262

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910751371.6A Active CN110444191B (en) 2019-01-22 2019-01-22 Rhythm level labeling method, model training method and device
CN201910060152.3A Pending CN109697973A (en) 2019-01-22 2019-01-22 A kind of method, the method and device of model training of prosody hierarchy mark

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910060152.3A Pending CN109697973A (en) 2019-01-22 2019-01-22 A kind of method, the method and device of model training of prosody hierarchy mark

Country Status (1)

Country Link
CN (2) CN110444191B (en)

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020218635A1 (en) * 2019-04-23 2020-10-29 LG Electronics Inc. Voice synthesis apparatus using artificial intelligence, method for operating voice synthesis apparatus, and computer-readable recording medium
CN110164413B (en) * 2019-05-13 2021-06-04 北京百度网讯科技有限公司 Speech synthesis method, apparatus, computer device and storage medium
CN110619035B (en) * 2019-08-01 2023-07-25 平安科技(深圳)有限公司 Method, device, equipment and storage medium for identifying keywords in interview video
CN112528014B (en) * 2019-08-30 2023-04-18 成都启英泰伦科技有限公司 Method and device for predicting word segmentation, part of speech and rhythm of language text
CN110556093B (en) * 2019-09-17 2021-12-10 浙江同花顺智富软件有限公司 Voice marking method and system
CN110459202B (en) * 2019-09-23 2022-03-15 浙江同花顺智能科技有限公司 Rhythm labeling method, device, equipment and medium
CN110675896B (en) * 2019-09-30 2021-10-22 北京字节跳动网络技术有限公司 Character time alignment method, device and medium for audio and electronic equipment
CN110797005B (en) * 2019-11-05 2022-06-10 百度在线网络技术(北京)有限公司 Prosody prediction method, apparatus, device, and medium
CN110767213A (en) * 2019-11-08 2020-02-07 四川长虹电器股份有限公司 Rhythm prediction method and device
CN112863476A (en) * 2019-11-27 2021-05-28 阿里巴巴集团控股有限公司 Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing
CN111164674A (en) * 2019-12-31 2020-05-15 深圳市优必选科技股份有限公司 Speech synthesis method, device, terminal and storage medium
CN111128120B (en) * 2019-12-31 2022-05-10 思必驰科技股份有限公司 Text-to-speech method and device
WO2021134581A1 (en) * 2019-12-31 2021-07-08 深圳市优必选科技股份有限公司 Prosodic feature prediction-based speech synthesis method, apparatus, terminal, and medium
CN113129863A (en) * 2019-12-31 2021-07-16 科大讯飞股份有限公司 Voice time length prediction method, device, equipment and readable storage medium
CN111261162B (en) * 2020-03-09 2023-04-18 北京达佳互联信息技术有限公司 Speech recognition method, speech recognition apparatus, and storage medium
CN111369971B (en) * 2020-03-11 2023-08-04 北京字节跳动网络技术有限公司 Speech synthesis method, device, storage medium and electronic equipment
CN111681641B (en) * 2020-05-26 2024-02-06 微软技术许可有限责任公司 Phrase-based end-to-end text-to-speech (TTS) synthesis
CN111710326B (en) * 2020-06-12 2024-01-23 携程计算机技术(上海)有限公司 English voice synthesis method and system, electronic equipment and storage medium
CN111754978B (en) * 2020-06-15 2023-04-18 北京百度网讯科技有限公司 Prosodic hierarchy labeling method, device, equipment and storage medium
CN111667816B (en) 2020-06-15 2024-01-23 北京百度网讯科技有限公司 Model training method, speech synthesis method, device, equipment and storage medium
CN111785247A (en) * 2020-07-13 2020-10-16 北京字节跳动网络技术有限公司 Voice generation method, device, equipment and computer readable medium
CN114064964A (en) * 2020-07-30 2022-02-18 华为技术有限公司 Text time labeling method and device, electronic equipment and readable storage medium
CN112102847B (en) * 2020-09-09 2022-08-09 四川大学 Audio and slide content alignment method
CN112216267A (en) * 2020-09-15 2021-01-12 北京捷通华声科技股份有限公司 Rhythm prediction method, device, equipment and storage medium
CN112466277B (en) * 2020-10-28 2023-10-20 北京百度网讯科技有限公司 Prosody model training method and device, electronic equipment and storage medium
CN112382270A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Speech synthesis method, apparatus, device and storage medium
CN112863484B (en) * 2021-01-25 2024-04-09 中国科学技术大学 Prosodic phrase boundary prediction model training method and prosodic phrase boundary prediction method
CN113178188A (en) * 2021-04-26 2021-07-27 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113421550A (en) * 2021-06-25 2021-09-21 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN113421543A (en) * 2021-06-30 2021-09-21 深圳追一科技有限公司 Data labeling method, device and equipment and readable storage medium
CN113327615B (en) * 2021-08-02 2021-11-16 北京世纪好未来教育科技有限公司 Voice evaluation method, device, equipment and storage medium
CN114420089B (en) * 2022-03-30 2022-06-21 北京世纪好未来教育科技有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN115116428B (en) * 2022-05-19 2024-03-15 腾讯科技(深圳)有限公司 Prosodic boundary labeling method, device, equipment, medium and program product
CN115116427B (en) * 2022-06-22 2023-11-14 马上消费金融股份有限公司 Labeling method, voice synthesis method, training method and training device
CN115188365B (en) * 2022-09-09 2022-12-27 中邮消费金融有限公司 Pause prediction method and device, electronic equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7136816B1 (en) * 2002-04-05 2006-11-14 At&T Corp. System and method for predicting prosodic parameters
US20070055526A1 (en) * 2005-08-25 2007-03-08 International Business Machines Corporation Method, apparatus and computer program product providing prosodic-categorical enhancement to phrase-spliced text-to-speech synthesis
JP4539537B2 (en) * 2005-11-17 2010-09-08 沖電気工業株式会社 Speech synthesis apparatus, speech synthesis method, and computer program
CN103035241A (en) * 2012-12-07 2013-04-10 中国科学院自动化研究所 Model complementary Chinese rhythm interruption recognition system and method
US20160365087A1 (en) * 2015-06-12 2016-12-15 Geulah Holdings Llc High end speech synthesis
CN105185373B (en) * 2015-08-06 2017-04-05 百度在线网络技术(北京)有限公司 The generation of prosody hierarchy forecast model and prosody hierarchy Forecasting Methodology and device
CN105244020B (en) * 2015-09-24 2017-03-22 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN105185372B (en) * 2015-10-20 2017-03-22 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN107039034B (en) * 2016-02-04 2020-05-01 科大讯飞股份有限公司 Rhythm prediction method and system
CN108305612B (en) * 2017-11-21 2020-07-31 腾讯科技(深圳)有限公司 Text processing method, text processing device, model training method, model training device, storage medium and computer equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8554566B2 (en) * 2008-08-12 2013-10-08 Morphism Llc Training and applying prosody models
TW201432668A (en) * 2013-02-05 2014-08-16 Univ Nat Chiao Tung Streaming encoder, prosody information encoding device, prosody-analyzing device, and device and method for speech-synthesizing
CN105374350A (en) * 2015-09-29 2016-03-02 百度在线网络技术(北京)有限公司 Speech marking method and device
CN105551481A (en) * 2015-12-21 2016-05-04 百度在线网络技术(北京)有限公司 Rhythm marking method of voice data and apparatus thereof
CN106601228A (en) * 2016-12-09 2017-04-26 百度在线网络技术(北京)有限公司 Sample marking method and device based on artificial intelligence prosody prediction
CN106971709A (en) * 2017-04-19 2017-07-21 腾讯科技(上海)有限公司 Statistic parameter model method for building up and device, phoneme synthesizing method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Automatic Prosody Boundary Labeling of Mandarin Using Both Text and Acoustic Information; Chongjia Ni et al.; 2008 6th International Symposium on Chinese Spoken Language Processing; 2008-12-19; full text *
Emphatic Speech Synthesis and Control Based on Characteristic Transferring in End-to-End Speech Synthesis; Mu Wang et al.; 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia); 2018-12-31; full text *
Research on Chinese Prosodic Phrase Prediction Based on Semi-supervised Learning; Su Dan; China Master's Theses Full-text Database, Information Science and Technology; No. 10, 2012-10-15; full text *
Chinese Prosodic Phrase Boundary Prediction Based on Text and Speech Features; Li Xiao; China Master's Theses Full-text Database, Information Science and Technology; No. 03, 2018-03-15; full text *

Also Published As

Publication number Publication date
CN110444191A (en) 2019-11-12
CN109697973A (en) 2019-04-30

Similar Documents

Publication Publication Date Title
CN110444191B (en) Rhythm level labeling method, model training method and device
CN110288077B (en) Method and related device for synthesizing speaking expression based on artificial intelligence
CN110853618B (en) Language identification method, model training method, device and equipment
CN107481718B (en) Audio recognition method, device, storage medium and electronic equipment
WO2021036644A1 (en) Voice-driven animation method and apparatus based on artificial intelligence
CN110853617B (en) Model training method, language identification method, device and equipment
CN111261144B (en) Voice recognition method, device, terminal and storage medium
CN110838286A (en) Model training method, language identification method, device and equipment
CN110570840B (en) Intelligent device awakening method and device based on artificial intelligence
CN110890093A (en) Intelligent device awakening method and device based on artificial intelligence
CN110634474B (en) Speech recognition method and device based on artificial intelligence
CN112840396A (en) Electronic device for processing user words and control method thereof
WO2020098269A1 (en) Speech synthesis method and speech synthesis device
CN113393828A (en) Training method of voice synthesis model, and voice synthesis method and device
CN112735418B (en) Voice interaction processing method, device, terminal and storage medium
CN114360510A (en) Voice recognition method and related device
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
CN102063282A (en) Chinese speech input system and method
CN111292727B (en) Voice recognition method and electronic equipment
CN112906369A (en) Lyric file generation method and device
CN112488157A (en) Dialog state tracking method and device, electronic equipment and storage medium
CN111522592A (en) Intelligent terminal awakening method and device based on artificial intelligence
CN111145734A (en) Voice recognition method and electronic equipment
CN114708849A (en) Voice processing method and device, computer equipment and computer readable storage medium
CN113948060A (en) Network training method, data processing method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant