CN110930975B - Method and device for outputting information - Google Patents

Method and device for outputting information

Info

Publication number
CN110930975B
Authority
CN
China
Prior art keywords
sample
sequence
class
word
syllable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811597465.4A
Other languages
Chinese (zh)
Other versions
CN110930975A (en)
Inventor
周志平
盖于涛
陈昌滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Publication of CN110930975A
Application granted
Publication of CN110930975B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiments of the present application disclose a method and a device for outputting information. One embodiment of the method comprises the following steps: acquiring a fundamental frequency curve corresponding to a sample syllable to be marked; extracting a fundamental frequency sequence from the fundamental frequency curve; converting the fundamental frequency sequence into a sample value sequence; clustering the sample value sequence with reference sequences of known boundary tone types to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be marked; and outputting the boundary tone type of the sample syllable to be marked. This implementation realizes automatic labeling of boundary tones in an English speech synthesis system, thereby shortening labeling time and saving cost.

Description

Method and device for outputting information
Technical Field
The embodiments of the present application relate to the technical field of speech synthesis, and in particular to a method and a device for outputting information.
Background
English pronunciation has no lexical tones; emotion is expressed mainly through intonation changes, for example, the end of a question is generally marked by rising intonation. Therefore, in an English speech synthesis system, intonation information is added to better synthesize emotion-carrying speech, and most existing English emotional synthesis systems add boundary tone information to represent the type of intonation change.
Related English emotional synthesis systems can synthesize emotion-carrying speech well, but the emotion types of the training data need to be labeled manually, which not only requires annotators with strong English expertise but also consumes considerable manpower and financial resources.
Disclosure of Invention
The embodiment of the application provides a method and a device for outputting information.
In a first aspect, an embodiment of the present application provides a method for outputting information, including: acquiring a fundamental frequency curve corresponding to a sample syllable to be marked; extracting a fundamental frequency sequence from the fundamental frequency curve; converting the fundamental frequency sequence into a sample value sequence; clustering the sample value sequence with a reference sequence of known boundary tone types to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be marked; and outputting the boundary tone type of the sample syllable to be marked.
In some embodiments, converting the fundamental frequency sequence into a sample value sequence comprises: sampling and interpolating the fundamental frequency sequence to obtain a logarithmic fundamental frequency sequence of a predetermined length as the sample value sequence.
In some embodiments, converting the fundamental frequency sequence into a sample value sequence comprises: performing a discrete cosine transform on the fundamental frequency sequence, and using the discrete cosine transform coefficients as the sample value sequence.
In some embodiments, clustering the sample value sequence with a reference sequence of known boundary tone types includes: clustering the sample value sequence and a first reference sequence through the Pearson correlation coefficient, and classifying the sample value sequence into two classes according to the sign of the correlation coefficient, wherein the class with a positive slope is marked as a first class and the class with a negative slope is marked as a second class; clustering the sample value sequence and a second reference sequence through the Euclidean distance, and, according to the overall height of the fundamental frequency, clustering the first class into a third class and a fourth class and the second class into a fifth class and a sixth class; and clustering the sample value sequence and a third reference sequence through the Euclidean distance, and classifying each of the third class, the fourth class, the fifth class and the sixth class into two classes according to the variation amplitude of the fundamental frequency.
In some embodiments, the method further comprises: obtaining an English text to be synthesized, wherein the English text comprises at least one word and a word comprises at least one syllable; for a word in the at least one word, extracting features of the word, inputting the features of the word into a pre-trained front-end prediction model, and outputting the boundary tone type of the last syllable of the word; inputting the English text and the boundary tone type of the last syllable of each word in the English text into a pre-trained back-end acoustic model, and outputting acoustic parameters; and synthesizing the English text into English speech based on the output acoustic parameters.
In some embodiments, the front-end predictive model is trained by: acquiring a first training sample set, wherein the first training sample comprises a sample word and a boundary tone type corresponding to the last sample syllable of the sample word; and taking a sample word of a first training sample in the first training sample set as input, taking a boundary tone type corresponding to the last sample syllable of the input sample word as output, and training to obtain a front-end prediction model.
In some embodiments, the back-end acoustic model is trained by: acquiring a second training sample set, wherein the second training sample comprises a boundary tone type of a sample syllable corresponding to a sample phoneme sequence and acoustic parameters corresponding to the sample phoneme sequence; and taking a sample phoneme sequence of a second training sample in the second training sample set and a boundary tone type of a sample syllable corresponding to the sample phoneme sequence as input, taking acoustic parameters corresponding to the input sample phoneme sequence as output, and training to obtain a back-end acoustic model.
In a second aspect, an embodiment of the present application provides an apparatus for outputting information, including: an acquisition unit configured to acquire a fundamental frequency curve corresponding to a sample syllable to be marked; an extraction unit configured to extract a fundamental frequency sequence from the fundamental frequency curve; a conversion unit configured to convert the fundamental frequency sequence into a sample value sequence; a clustering unit configured to cluster the sample value sequence with a reference sequence of known boundary tone types to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be marked; and an output unit configured to output the boundary tone type of the sample syllable to be marked.
In some embodiments, the conversion unit is further configured to: sample and interpolate the fundamental frequency sequence to obtain a logarithmic fundamental frequency sequence of a predetermined length as the sample value sequence.
In some embodiments, the conversion unit is further configured to: perform a discrete cosine transform on the fundamental frequency sequence, and use the discrete cosine transform coefficients as the sample value sequence.
In some embodiments, the clustering unit is further configured to: cluster the sample value sequence and a first reference sequence through the Pearson correlation coefficient, and classify the sample value sequence into two classes according to the sign of the correlation coefficient, wherein the class with a positive slope is marked as a first class and the class with a negative slope is marked as a second class; cluster the sample value sequence and a second reference sequence through the Euclidean distance, and, according to the overall height of the fundamental frequency, cluster the first class into a third class and a fourth class and the second class into a fifth class and a sixth class; and cluster the sample value sequence and a third reference sequence through the Euclidean distance, and classify each of the third class, the fourth class, the fifth class and the sixth class into two classes according to the variation amplitude of the fundamental frequency.
In some embodiments, the apparatus further comprises a synthesis unit configured to: obtain an English text to be synthesized, wherein the English text comprises at least one word and a word comprises at least one syllable; for a word in the at least one word, extract features of the word, input the features of the word into a pre-trained front-end prediction model, and output the boundary tone type of the last syllable of the word; input the English text and the boundary tone type of the last syllable of each word in the English text into a pre-trained back-end acoustic model, and output acoustic parameters; and synthesize the English text into English speech based on the output acoustic parameters.
In some embodiments, the apparatus further comprises a first training unit configured to: acquiring a first training sample set, wherein the first training sample comprises a sample word and a boundary tone type corresponding to the last sample syllable of the sample word; and taking a sample word of a first training sample in the first training sample set as input, taking a boundary tone type corresponding to the last sample syllable of the input sample word as output, and training to obtain a front-end prediction model.
In some embodiments, the apparatus further comprises a second training unit configured to: acquiring a second training sample set, wherein the second training sample comprises a boundary tone type of a sample syllable corresponding to a sample phoneme sequence and acoustic parameters corresponding to the sample phoneme sequence; and taking a sample phoneme sequence of a second training sample in the second training sample set and a boundary tone type of a sample syllable corresponding to the sample phoneme sequence as input, taking acoustic parameters corresponding to the input sample phoneme sequence as output, and training to obtain a back-end acoustic model.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method as in any of the first aspects.
In a fourth aspect, embodiments of the present application provide a computer readable medium having a computer program stored thereon, wherein the program when executed by a processor implements a method as in any of the first aspects.
According to the method and device for outputting information provided by the embodiments of the present application, a fundamental frequency sequence is extracted from the fundamental frequency curve of a syllable, the fundamental frequency sequence is converted into a sample value sequence, and clustering is then performed to obtain the boundary tone type of the syllable. Automatic labeling of boundary tones in an English speech synthesis system is thus realized, shortening labeling time and saving cost.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method for outputting information according to the present application;
FIG. 3 is a schematic illustration of one application scenario of a method for outputting information according to the present application;
FIG. 4 is a flow chart of yet another embodiment of a method for outputting information according to the present application;
FIG. 5 is a schematic diagram of yet another application scenario of a method for outputting information according to the present application;
FIG. 6 is a schematic structural diagram of one embodiment of an apparatus for outputting information according to the present application;
fig. 7 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The present application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the methods for outputting information or the apparatus for outputting information of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a speech synthesis class application, a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting audio playback, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, such as a background speech synthesis server providing support for audio played on the terminal devices 101, 102, 103. The background speech synthesis server can analyze received data such as sample syllables to obtain the boundary tone types of the syllables. A front-end prediction model for predicting the boundary tone type and a back-end acoustic model for generating acoustic parameters may then be trained based on the boundary tone types of a number of sample syllables. When receiving an English text to be synthesized, the server can then synthesize intoned speech through the front-end prediction model and the back-end acoustic model and feed the synthesized speech back to the terminal device. The terminal device may also obtain the front-end prediction model and the back-end acoustic model from the server and then perform speech synthesis locally.
The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., a plurality of software or software modules for providing distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should be noted that, the method for outputting information provided in the embodiments of the present application is generally performed by the server 105, and accordingly, the device for outputting information is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for outputting information according to the present application is shown. The method for outputting information comprises the following steps:
step 201, obtaining a fundamental frequency curve corresponding to a syllable of a sample to be marked.
In this embodiment, the execution body of the method for outputting information (e.g., the server shown in fig. 1) may obtain, through a wired or wireless connection, the fundamental frequency curves corresponding to the sample syllables to be marked from the sound library of a third-party server; the local server may also extract the fundamental frequency curves from the sample syllables directly. Syllables are the basic units of English pronunciation, and the pronunciation of any word can be decomposed into the pronunciations of its syllables. In English, the vowels (a, e, i, o, u, five in total) are particularly sonorous; one vowel phoneme can constitute a syllable, and a combination of one vowel phoneme with one or several consonant phonemes can also constitute a syllable. The vowel phoneme is the main body of a syllable, and the consonants mark the syllable boundary. In acoustics, the fundamental frequency refers to the frequency of the fundamental tone in a complex tone. Among the tones constituting a complex tone, the fundamental tone has the lowest frequency and the greatest intensity; the level of the fundamental frequency determines the pitch of a tone, and the frequency of speech usually refers to the frequency of the fundamental tone. Components for extracting the fundamental frequency are commonly found in vocoders and various speech signal processing systems. Fundamental frequency recognition extracts the fundamental frequency of the speech signal and displays its magnitude and changing shape in the form of a dynamic graph, i.e., forms a fundamental frequency curve. This example may employ a STRAIGHT-based fundamental frequency extraction algorithm.
Step 202, extracting a fundamental frequency sequence from the fundamental frequency curve.
In this embodiment, the fundamental frequency curve can be quantized in segments according to the boundary range of the syllable. The fundamental frequency sequence may be formed by sampling at fixed time intervals, for example, taking one fundamental frequency value from the fundamental frequency curve every 5 milliseconds.
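As a non-limiting illustration of steps 201-202, the following Python sketch extracts a fundamental frequency curve and takes the fundamental frequency sequence of one syllable at 5 millisecond intervals. The file name, the syllable boundary times, and the use of the pyworld library (a WORLD-vocoder implementation) in place of the STRAIGHT-based extractor mentioned above are assumptions of the sketch.

```python
# Sketch of steps 201-202: extract an F0 curve for an utterance and take the
# fundamental frequency sequence of one syllable at fixed 5 ms intervals.
# pyworld (WORLD) is used here as a stand-in for a STRAIGHT-based extractor.
import numpy as np
import soundfile as sf
import pyworld as pw

wav, fs = sf.read("sample.wav")            # hypothetical mono sample utterance
wav = wav.astype(np.float64)

# Fundamental frequency curve: one F0 value every 5 ms (0 for unvoiced frames).
f0, times = pw.harvest(wav, fs, frame_period=5.0)

# Illustrative boundary (in seconds) of the sample syllable to be marked.
syl_start, syl_end = 0.80, 1.05
in_syllable = (times >= syl_start) & (times < syl_end)
f0_sequence = f0[in_syllable]
f0_sequence = f0_sequence[f0_sequence > 0]  # keep voiced frames only
```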
Step 203, converting the fundamental frequency sequence into a sample value sequence.
In this embodiment, since raw fundamental frequency values are large, the fundamental frequency sequence needs to be converted into a sample value sequence of a predetermined length to facilitate the subsequent clustering process. A sample value sequence of uniform length can be obtained by sampling and interpolation. For example, a fundamental frequency sequence of length 50 is down-sampled into a sample value sequence of length 30, while a fundamental frequency sequence of length 20 is interpolated into a sample value sequence of length 30.
In some alternative implementations of this embodiment, converting the fundamental frequency sequence into a sample value sequence includes: sampling and interpolating the fundamental frequency sequence to obtain a logarithmic fundamental frequency sequence of a predetermined length as the sample value sequence. The fundamental frequency sequence may first be subjected to a logarithmic operation and then converted into a logarithmic fundamental frequency sequence of a predetermined length to be used as the sample value sequence, which reduces the amount of data to be processed.
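For instance, a minimal sketch of this sampling-and-interpolation variant could look as follows; the target length of 30 is an assumption standing in for the predetermined length.

```python
import numpy as np

def to_fixed_length_log_f0(f0_sequence, target_len=30):
    """Resample an F0 sequence to a fixed length by down-sampling or
    interpolation, then take the logarithm to obtain the sample value
    sequence. target_len=30 is an assumed predetermined length."""
    f0_sequence = np.asarray(f0_sequence, dtype=np.float64)
    positions_in = np.linspace(0.0, 1.0, num=len(f0_sequence))
    positions_out = np.linspace(0.0, 1.0, num=target_len)
    resampled = np.interp(positions_out, positions_in, f0_sequence)
    return np.log(resampled)
```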
In some alternative implementations of this embodiment, converting the fundamental frequency sequence into a sample value sequence includes: performing a discrete cosine transform on the fundamental frequency sequence, and using the discrete cosine transform coefficients as the sample value sequence. The discrete cosine transform (DCT, Discrete Cosine Transform) is a transform related to the Fourier transform; it is similar to the discrete Fourier transform but uses only real numbers. The discrete cosine transform has a strong "energy compaction" property: after the transform, most of the energy of natural signals (including sound and images) is concentrated in the low-frequency part, and when the signal has statistical properties close to a Markov process, the decorrelation performance of the discrete cosine transform approaches that of the K-L transform (Karhunen-Loeve transform, which has optimal decorrelation performance). Discrete cosine transform coefficients of a predetermined length may be taken, in order from front to back, as the sample value sequence.
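A corresponding sketch of the DCT variant, assuming scipy is available and an illustrative coefficient count of 10, is shown below.

```python
import numpy as np
from scipy.fft import dct

def to_dct_sample_values(f0_sequence, num_coeffs=10):
    """Take the leading DCT coefficients of an F0 sequence, in order from
    front to back, as the sample value sequence. num_coeffs is an assumption."""
    coeffs = dct(np.asarray(f0_sequence, dtype=np.float64), norm="ortho")
    return coeffs[:num_coeffs]
```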
Step 204, clustering the sample value sequence with a reference sequence of known boundary tone types to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be marked.
In this embodiment, the boundary tone is the intonation at the syllable boundary. Different emotion types have corresponding intonation changes, so boundary tones can be divided into four types, namely L-L, L-H, H-L and H-H, which respectively correspond to the general trend of the fundamental frequency: L-H represents a fundamental frequency rising from low and generally corresponds to interrogative intonation; H-H represents a fundamental frequency that is high overall and generally corresponds to praising intonation; and H-L and L-L represent the general cases. The classification can be performed by a clustering method such as K-means. Reference fundamental frequency sequences of known boundary tone types may be prepared in advance, and the reference sequences may be obtained in the same manner as the sample value sequence is generated above. The sample value sequence is clustered with the reference sequences of known boundary tone types; when the sample value sequence falls into the same class as a reference sequence, the boundary tone type of that reference sequence is taken as the boundary tone type of the sample value sequence. For example, the sample value sequence may be clustered with reference sequences of the four types L-L, L-H, H-L and H-H, and if the sample value sequence is of the same class as the reference sequence of the L-L type, the sample value sequence is of the L-L type.
In some optional implementations of the present embodiment, the sample value sequence is clustered with a reference sequence of a known boundary tone type, and a hierarchical clustering method of 3 layers may be selected, as shown in fig. 3:
in step 2041, the sample value sequence and the first reference sequence are clustered by pearson correlation coefficient, the sample value sequence is clustered into two classes according to positive and negative of the correlation coefficient, the class with positive slope is marked as the first class, and the class with negative slope is marked as the second class.
In this embodiment, the slope refers to the correlation coefficient. The first reference sequence is a reference sequence obtained from a reference fundamental frequency sequence of known boundary tone type in the same manner as the sample value sequence is generated above. The sample value sequence is clustered with the first reference sequence of known boundary tone type; if the slope is positive, the sample value sequence is classified into the first class, which is represented by circle 1 in fig. 3. If the slope is negative, the sample value sequence is classified into the second class, which is represented by circle 2 in fig. 3.
And 2042, clustering the sample value sequence and the second reference sequence through Euclidean distance, clustering the first class into a third class and a fourth class according to the overall height of the fundamental frequency, and clustering the second class into a fifth class and a sixth class.
In this embodiment, the second reference sequence is a reference base frequency sequence whose base frequency is high overall or a reference base frequency sequence whose base frequency is low overall, which is obtained by the same manner as the above-described method of generating the sample value sequence, simply referred to as a high-frequency sequence and a low-frequency sequence. And respectively calculating Euclidean distances between the sample value sequence and the high-frequency sequence and between the sample value sequence and the low-frequency sequence, and if the Euclidean distance between the sample value sequence and the high-frequency sequence is smaller than the Euclidean distance between the sample value sequence and the low-frequency sequence, considering that the sample value sequence belongs to the high-frequency class. And if the Euclidean distance between the sample value sequence and the high-frequency sequence is larger than the Euclidean distance between the sample value sequence and the low-frequency sequence, the sample value sequence is considered to belong to the low-frequency class. If the distances are equal, the second reference sequence is replaced to continue the iterative comparison. As shown in fig. 3, the sample value sequences with closer euclidean distance between the first class and the low frequency sequence are classified into a third class, and a circle 3 in fig. 3 represents the third class. Sample value sequences with closer euclidean distances to high frequency sequences in the first class are classified into a fourth class, which is represented by circle 4 in fig. 3. Sample value sequences in the second class that are closer to the euclidean distance between the low frequency sequences are classified into a fifth class, which is represented by circle 5 in fig. 3. Sample value sequences in the second class that are closer to the euclidean distance between the high frequency sequences are classified into a sixth class, which is indicated by circle 6 in fig. 3.
Step 2043, clustering the sample value sequence and the third reference sequence by Euclidean distance, and respectively clustering the third class, the fourth class, the fifth class and the sixth class into two classes according to the variation amplitude of the fundamental frequency.
In this embodiment, the third reference sequence is a reference sequence obtained by the same method as the sample value sequence generation method described above for four types of reference baseband sequences with a baseband frequency variation range of L-H, L-L, H-H, H-L, and is abbreviated as L-H sequence, L-L sequence, H-H sequence, and H-L sequence.
For node 3 in fig. 3, euclidean distances between the third class sample value sequence and the L-H sequence, L-L sequence, respectively, are calculated. If the Euclidean distance between the third class sample value sequence and the L-H sequence is smaller than the Euclidean distance between the third class sample value sequence and the L-L sequence, the third class sample value sequence is considered to belong to the L-H class, otherwise, the third class sample value sequence is considered to belong to the L-L class. If the distances are equal, the third reference sequence is replaced to continue the iterative comparison.
For node 4 in fig. 3, euclidean distances between the fourth class of sample value sequences and the L-H sequence and H-H sequence are calculated, respectively. If the Euclidean distance between the fourth class sample value sequence and the L-H sequence is smaller than the Euclidean distance between the fourth class sample value sequence and the H-H sequence, the fourth class sample value sequence is considered to belong to the L-H class, otherwise, the fourth class sample value sequence is considered to belong to the H-H class. If the distances are equal, the third reference sequence is replaced to continue the iterative comparison.
For node 5 in fig. 3, Euclidean distances between the fifth class of sample value sequences and the L-L sequence and the H-L sequence are calculated, respectively. If the Euclidean distance between a fifth class sample value sequence and the L-L sequence is smaller than its Euclidean distance to the H-L sequence, the sequence is considered to belong to the L-L class; otherwise, it is considered to belong to the H-L class. If the distances are equal, the third reference sequence is replaced to continue the iterative comparison.
For node 6 in fig. 3, euclidean distances between the sixth class of sample value sequences and the H-H and H-L sequences are calculated, respectively. If the Euclidean distance between the sixth class sample value sequence and the H-H sequence is smaller than the Euclidean distance between the sixth class sample value sequence and the H-L sequence, the sixth class sample value sequence is considered to belong to the H-H class, otherwise, the sixth class sample value sequence is considered to belong to the H-L class. If the distances are equal, the third reference sequence is replaced to continue the iterative comparison.
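The three-layer decision of steps 2041-2043 can be sketched in Python as follows; the reference sequences are assumed to be precomputed in the same way as the sample value sequences, and the tie-handling rule (replacing the reference sequence and comparing again) is omitted for brevity.

```python
import numpy as np

def pearson(a, b):
    # Pearson correlation coefficient between two equal-length sequences.
    return np.corrcoef(a, b)[0, 1]

def euclidean(a, b):
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def classify_boundary_tone(sample, first_ref, low_ref, high_ref, refs):
    """Three-layer decision of steps 2041-2043.

    sample           -- sample value sequence of one syllable
    first_ref        -- first reference sequence (Pearson layer)
    low_ref/high_ref -- second reference sequences (overall F0 height)
    refs             -- third reference sequences, e.g. {"L-H": ..., "L-L": ...,
                        "H-H": ..., "H-L": ...}
    """
    # Layer 1: sign of the correlation coefficient (slope).
    rising = pearson(sample, first_ref) > 0        # first class vs. second class

    # Layer 2: overall F0 height via Euclidean distance.
    high = euclidean(sample, high_ref) < euclidean(sample, low_ref)

    # Layer 3: two candidate boundary tone types per branch (nodes 3-6 of fig. 3).
    if rising and not high:        # third class
        candidates = ("L-H", "L-L")
    elif rising and high:          # fourth class
        candidates = ("L-H", "H-H")
    elif not rising and not high:  # fifth class
        candidates = ("L-L", "H-L")
    else:                          # sixth class
        candidates = ("H-H", "H-L")
    return min(candidates, key=lambda k: euclidean(sample, refs[k]))
```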
Step 205, outputting the boundary tone type of the syllable of the sample to be marked.
In this embodiment, after the clustering is completed, the clustering result may be marked in the text of the training data, so as to be used for training the front-end prediction model and the back-end acoustic model respectively.
According to the method provided by this embodiment of the present application, the sample value sequence derived from the fundamental frequency sequence of the sample syllable is clustered with the reference sequences of known boundary tone types to obtain the boundary tone type of the syllable, so that automatic labeling of the boundary tone type of the sample syllable is realized.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for outputting information is shown. The flow 400 of the method for outputting information comprises the steps of:
step 401, obtaining an english text to be synthesized.
In this embodiment, the execution subject of the method for outputting information (e.g., the server shown in fig. 1) may acquire the english text to be synthesized from the terminal by a wired connection or a wireless connection. Wherein the english text comprises at least one word, the word comprising at least one syllable. One syllable may include at least one phoneme, which is divided into a vowel phoneme and a consonant phoneme.
Step 402, for each word in the at least one word, extracting features of the word, inputting the features of the word into the front-end prediction model, and outputting the boundary tone type of the last syllable of the word.
In this embodiment, the front-end prediction model may be trained by the present server or by a third-party server. For each word in the English text, the features of the word are extracted. Features may include the word vector, part of speech, case feature, prosodic pause type, punctuation information, and syllable count of the word. The word vector here refers to the vector into which a word is converted by a word embedding technique. Word embedding is a generic term for a set of language modeling and feature learning techniques in natural language processing in which words or phrases from a vocabulary are mapped to vectors of real numbers; conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension. Part of speech refers to noun, adjective, verb, past participle, and the like. The same English word may be pronounced differently under different parts of speech; for example, "read" is pronounced differently as a past participle than in its base form. The case feature indicates whether a word is uppercase or lowercase. Prosodic pause types can be classified into three types: no pause, short pause, and long pause. The pause type may be determined based on sentence structure or punctuation; for example, there is no pause within a phrase and no pause between "take" and "off", but there is a short pause after "off", and if a word is followed by punctuation there is a long pause. Punctuation information refers to whether the word is followed by punctuation and, if so, which punctuation. The syllable count of a word refers to the number of syllables included in the word.
The output of the front-end prediction model is the boundary tone type of the last syllable of the word. The 6 boundary tone types plus the no-boundary-tone case may be represented by a 7-dimensional one-hot vector. In addition to the 4 types L-L, L-H, H-L and H-H, the 6 boundary tone types include an H type and an L type. The H type is the combination of an L-H type or an H-H type with a short pause, i.e., the L-H type or the H-H type is merged into the H type when a short pause is encountered. The L type is the combination of an L-L type or an H-L type with a short pause, i.e., the L-L type or the H-L type is merged into the L type when a short pause is encountered. Model choices include DNN (Deep Neural Network), SVM (Support Vector Machine), LSTM (Long Short-Term Memory network), CRF (Conditional Random Field), Attention models, WaveNet, etc.
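The following sketch illustrates the 7-dimensional one-hot label space and one possible shape of the word features described above; the feature names, embedding dimension and label order are assumptions of the sketch, not the exact scheme of this embodiment.

```python
import numpy as np

# Boundary tone label set used by the front-end prediction model:
# the four basic types, the merged H and L types, and "no boundary tone".
BOUNDARY_TONES = ["L-L", "L-H", "H-L", "H-H", "H", "L", "NONE"]

def one_hot_boundary_tone(tone):
    """7-dimensional one-hot vector for a boundary tone label."""
    vec = np.zeros(len(BOUNDARY_TONES))
    vec[BOUNDARY_TONES.index(tone)] = 1.0
    return vec

# Illustrative input features for one word (names and dimensions assumed).
word_features = {
    "word_vector": np.zeros(64),   # word embedding, dimension assumed
    "pos": "VERB",                 # part of speech
    "is_capitalized": False,       # case feature
    "pause_type": "short",         # none / short / long prosodic pause
    "punctuation": ",",            # punctuation following the word, if any
    "num_syllables": 2,            # syllable count of the word
}
```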
The front-end predictive model may be trained as follows.
In step 4021, a first set of training samples is obtained.
In this embodiment, the executing entity of the training step may obtain the first training sample set locally or remotely from another electronic device connected to the executing entity network. Wherein each first training sample includes a sample word and a boundary tone type corresponding to a last sample syllable of the sample word. For example, the type of boundary tone corresponding to the last sample syllable of the sample word may be automatically annotated by steps 201-205.
In step 4022, a sample word of a first training sample in the first training sample set is used as input, and a boundary tone type corresponding to a last sample syllable of the input sample word is used as output, so as to train to obtain a front-end prediction model.
In this embodiment, the execution body of the training step may input the sample words in the first training sample set into the initial neural network to obtain the boundary tone type corresponding to the last sample syllable of each sample word, take the boundary tone type labeled in the first training sample as the expected output of the initial neural network, and train the initial neural network by a machine learning method. The initial neural network may include, but is not limited to, at least one of: DNN, SVM, LSTM, CRF, Attention model, WaveNet. Specifically, the difference between the obtained boundary tone type and the boundary tone type in the first training sample may first be calculated using a preset loss function; for example, the L2 norm may be used as the loss function to calculate the difference between the obtained boundary tone type and the boundary tone type in the first training sample. Then, based on the calculated difference, the network parameters of the initial neural network may be adjusted, and training is ended when a preset training end condition is satisfied. For example, the preset training end conditions may include, but are not limited to, at least one of: the training time exceeds a preset duration; the number of training iterations exceeds a preset number; the calculated difference is smaller than a preset difference threshold.
Here, various implementations may be employed to adjust network parameters of the initial neural network based on the difference between the generated boundary tone type and the boundary tone type in the first training sample. For example, a BP (Back Propagation) algorithm or an SGD (Stochastic Gradient Descent, random gradient descent) algorithm may be employed to adjust network parameters of the initial neural network.
And determining the initial neural network obtained through training as a front-end prediction model.
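A minimal sketch of such a training step is given below, assuming a small feed-forward DNN, fixed-length word features, one-hot boundary tone targets and the L2 (mean squared error) loss mentioned above; all dimensions and optimizer settings are assumptions of the sketch.

```python
import torch
import torch.nn as nn

# Minimal DNN front-end prediction model sketch (step 4022).
class FrontEndPredictor(nn.Module):
    def __init__(self, feature_dim=70, num_classes=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes), nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)

model = FrontEndPredictor()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()   # L2-style loss against one-hot boundary tone targets

def train_step(features, one_hot_labels):
    """features: (batch, feature_dim) word features;
    one_hot_labels: (batch, 7) boundary tone targets."""
    optimizer.zero_grad()
    loss = loss_fn(model(features), one_hot_labels)
    loss.backward()      # back propagation (BP)
    optimizer.step()     # gradient descent update
    return loss.item()
```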
Step 403, inputting the English text and the boundary tone type of the last syllable of each word in the English text into the back-end acoustic model, and outputting acoustic parameters.
In this embodiment, the input of the back-end acoustic model is the English text with the boundary tone features added, and the output is acoustic parameters, where the acoustic parameters include fundamental frequency and spectrum. Model choices include HMM (Hidden Markov Model), DNN, LSTM, CBHG, Attention models, WaveNet, etc.
The back-end acoustic model can be obtained by training the following steps:
step 4031, a second set of training samples is obtained.
In this embodiment, the executing entity of the training step may obtain the second training sample set locally or remotely from another electronic device connected to the executing entity network. Wherein each second training sample includes a boundary tone type of a sample syllable corresponding to the sample phoneme sequence and an acoustic parameter corresponding to the sample phoneme sequence.
Step 4032, taking the sample phoneme sequence of the second training sample in the second training sample set and the boundary tone type of the sample syllable corresponding to the sample phoneme sequence as input, taking the acoustic parameter corresponding to the input sample phoneme sequence as output, and training to obtain the back-end acoustic model.
In this embodiment, the execution body of the training step may input the sample phoneme sequence of a second training sample in the second training sample set and the boundary tone type of the sample syllable corresponding to the sample phoneme sequence into the initial neural network to obtain the acoustic parameters corresponding to the sample phoneme sequence, take the acoustic parameters labeled in the second training sample as the expected output of the initial neural network, and train the initial neural network by a machine learning method. The initial neural network may include, but is not limited to, at least one of: HMM, DNN, LSTM, CBHG, Attention model, WaveNet. For the specific training steps, reference may be made to step 4022, which is not described in detail here.
And determining the initial neural network obtained through training as a back-end acoustic model.
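By analogy with the front-end sketch above, a back-end acoustic model could be sketched as an LSTM regressor trained with a mean-squared-error loss against the labeled acoustic parameters; the input and output dimensions below are assumptions of the sketch.

```python
import torch
import torch.nn as nn

# Minimal LSTM back-end acoustic model sketch (step 4032): the input is a
# sequence of phoneme features concatenated with the boundary tone one-hot
# vector of the containing syllable; the output is a sequence of acoustic
# parameters (e.g. log-F0 plus spectral coefficients). Dimensions assumed.
class BackEndAcousticModel(nn.Module):
    def __init__(self, in_dim=57, acoustic_dim=61, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, acoustic_dim)

    def forward(self, phoneme_and_tone_seq):
        out, _ = self.lstm(phoneme_and_tone_seq)   # (batch, time, hidden)
        return self.proj(out)                      # (batch, time, acoustic_dim)

model = BackEndAcousticModel()
loss_fn = nn.MSELoss()   # trained against the labeled acoustic parameters
```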
Step 404, synthesizing the English text into English voice based on the output acoustic parameters.
In this embodiment, emotional speech is synthesized by a vocoder or by a unit concatenation method. Prosodic pauses may also be taken into account during synthesis.
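As one hedged example of the vocoder route, the sketch below passes predicted acoustic parameters through a WORLD-style parametric vocoder (pyworld); the sampling rate, frame period, parameter shapes and placeholder values are assumptions standing in for the back-end model's outputs.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

fs = 16000                                       # assumed sampling rate
# Placeholders standing in for the acoustic parameters predicted in step 403.
f0 = np.full(200, 150.0)                         # predicted F0 per 5 ms frame (Hz)
spectrogram = np.full((200, 513), 1e-6)          # predicted spectral envelope
aperiodicity = np.full((200, 513), 0.5)          # predicted band aperiodicity

wav = pw.synthesize(f0, spectrogram, aperiodicity, fs, frame_period=5.0)
sf.write("synthesized.wav", wav, fs)
```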
As can be seen from fig. 4, compared with the corresponding embodiment of fig. 2, the flow 400 of the method for outputting information in this embodiment embodies the steps of speech synthesis with boundary tones. Therefore, the scheme described in this embodiment can add boundary tone information during speech synthesis, so as to achieve a better English emotional speech synthesis effect.
With continued reference to fig. 5, fig. 5 is a schematic diagram of an application scenario of the method for outputting information according to the present embodiment. In the application scenario of fig. 5, during the training phase, the server automatically labels the syllables in the original sound library with the boundary tone types of the sample syllables through steps 201-205, and stores the labeled syllables into a labeled sound library. The front-end prediction model and the back-end acoustic model are then trained through steps 4021 and 4022 using the labeled syllables in the labeled sound library. In the synthesis stage, the features of each word in the text to be synthesized are extracted and input into the front-end prediction model to obtain the boundary tone type of the last syllable of each word. The text to be synthesized and the boundary tone type of the last syllable of each word are then input together into the back-end acoustic model to obtain acoustic parameters. The acoustic parameters are finally converted into parametric speech by the vocoder; concatenated speech can also be obtained by the unit concatenation method.
The method provided by this embodiment of the present application trains a front-end prediction model and a back-end acoustic model using sample syllables labeled with boundary tone types, so that the front-end prediction model and the back-end acoustic model can be used to synthesize intoned speech during the speech synthesis stage. Compared with manual labeling, the system-building period and labor cost of speech synthesis are greatly reduced, and experiments verify that a good English emotional speech synthesis effect is obtained.
With further reference to fig. 6, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of an apparatus for outputting information, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 6, the apparatus 600 for outputting information of the present embodiment includes: an acquisition unit 601, an extraction unit 602, a conversion unit 603, a clustering unit 604, and an output unit 605. The acquisition unit 601 is configured to acquire a fundamental frequency curve corresponding to a sample syllable to be marked. The extraction unit 602 is configured to extract a fundamental frequency sequence from the fundamental frequency curve. The conversion unit 603 is configured to convert the fundamental frequency sequence into a sample value sequence. The clustering unit 604 is configured to cluster the sample value sequence with a reference sequence of known boundary tone types to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be marked. The output unit 605 is configured to output the boundary tone type of the sample syllable to be marked.
In the present embodiment, specific processes of the acquisition unit 601, the extraction unit 602, the conversion unit 603, the clustering unit 604, and the output unit 605 of the apparatus 600 for outputting information may refer to step 201, step 202, step 203, step 204, and step 205 in the corresponding embodiment of fig. 2.
In some optional implementations of the present embodiment, the conversion unit 603 is further configured to: sample and interpolate the fundamental frequency sequence to obtain a logarithmic fundamental frequency sequence of a predetermined length as the sample value sequence.
In some optional implementations of the present embodiment, the conversion unit 603 is further configured to: perform a discrete cosine transform on the fundamental frequency sequence, and use the discrete cosine transform coefficients as the sample value sequence.
In some optional implementations of the present embodiment, the clustering unit 604 is further configured to: cluster the sample value sequence and a first reference sequence through the Pearson correlation coefficient, and classify the sample value sequence into two classes according to the sign of the correlation coefficient, wherein the class with a positive slope is marked as a first class and the class with a negative slope is marked as a second class; cluster the sample value sequence and a second reference sequence through the Euclidean distance, and, according to the overall height of the fundamental frequency, cluster the first class into a third class and a fourth class and the second class into a fifth class and a sixth class; and cluster the sample value sequence and a third reference sequence through the Euclidean distance, and classify each of the third class, the fourth class, the fifth class and the sixth class into two classes according to the variation amplitude of the fundamental frequency.
In some optional implementations of this embodiment, the apparatus 600 further comprises a synthesis unit configured to: obtain an English text to be synthesized, wherein the English text comprises at least one word and a word comprises at least one syllable. For a word in the at least one word, extract features of the word and input the features of the word into a pre-trained front-end prediction model to output the boundary tone type of the last syllable of the word. Input the English text and the boundary tone type of the last syllable of each word in the English text into a pre-trained back-end acoustic model, and output acoustic parameters. Synthesize the English text into English speech based on the output acoustic parameters.
In some optional implementations of the present embodiment, the apparatus 600 further comprises a first training unit (not shown) configured to: a first set of training samples is obtained, the first training samples including sample words and boundary tone types corresponding to last sample syllables of the sample words. And taking a sample word of a first training sample in the first training sample set as input, taking a boundary tone type corresponding to the last sample syllable of the input sample word as output, and training to obtain a front-end prediction model.
In some optional implementations of the present embodiment, the apparatus 600 further comprises a second training unit (not shown) configured to: and acquiring a second training sample set, wherein the second training sample comprises the boundary tone type of the sample syllable corresponding to the sample phoneme sequence and the acoustic parameter corresponding to the sample phoneme sequence. And taking a sample phoneme sequence of a second training sample in the second training sample set and a boundary tone type of a sample syllable corresponding to the sample phoneme sequence as input, taking acoustic parameters corresponding to the input sample phoneme sequence as output, and training to obtain a back-end acoustic model.
Referring now to fig. 7, there is illustrated a schematic diagram of a computer system 700 suitable for use in implementing an electronic device (e.g., a terminal device/server as illustrated in fig. 1) in accordance with an embodiment of the present application. The electronic device shown in fig. 7 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present application.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the system 700 are also stored. The CPU 701, ROM 702, and RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 708 including a hard disk or the like; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as necessary, so that a computer program read therefrom is installed into the storage section 708 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 709, and/or installed from the removable medium 711. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 701. It should be noted that, the computer readable medium described in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising an acquisition unit, an extraction unit, a conversion unit, a clustering unit, and an output unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the acquisition unit may also be described as "a unit that acquires a fundamental frequency curve corresponding to a sample syllable to be labeled".
As another aspect, the present application also provides a computer readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire a fundamental frequency curve corresponding to a sample syllable to be marked; extract a fundamental frequency sequence from the fundamental frequency curve; convert the fundamental frequency sequence into a sample value sequence; cluster the sample value sequence with a reference sequence of a known boundary tone type to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be marked; and output the boundary tone type of the sample syllable to be marked.
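As an editorial illustration only (not part of the application text), the following minimal Python sketch shows one way the steps listed above could be wired together; the helper names, the fixed sequence length, and the nearest-reference assignment are assumptions rather than anything prescribed here.

```python
import numpy as np


def extract_f0_sequence(f0_curve):
    """Keep only the voiced (non-zero) F0 values of the syllable's F0 curve."""
    f0_curve = np.asarray(f0_curve, dtype=float)
    return f0_curve[f0_curve > 0]


def to_sample_values(f0_sequence, length=10):
    """Resample the F0 sequence to a fixed length so sequences are comparable."""
    positions = np.linspace(0, len(f0_sequence) - 1, num=length)
    return np.interp(positions, np.arange(len(f0_sequence)), f0_sequence)


def label_boundary_tone(f0_curve, reference_sequences):
    """Assign the boundary tone type of the closest reference sequence."""
    samples = to_sample_values(extract_f0_sequence(f0_curve))
    distances = {tone: np.linalg.norm(samples - ref)
                 for tone, ref in reference_sequences.items()}
    return min(distances, key=distances.get)


# Hypothetical usage with two reference contours of known boundary tone type.
references = {"rise": np.linspace(100.0, 180.0, 10),
              "fall": np.linspace(180.0, 100.0, 10)}
print(label_boundary_tone([0, 120, 130, 150, 170, 0], references))
```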
The foregoing description is only of the preferred embodiments of the present application and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the invention referred to in this application is not limited to the specific combinations of the features described above, but is also intended to cover other technical solutions formed by any combination of the features described above or their equivalents without departing from the spirit of the invention, for example, technical solutions formed by replacing the features described above with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (14)

1. A method for outputting information, comprising:
acquiring a fundamental frequency curve corresponding to a sample syllable to be marked;
extracting a fundamental frequency sequence from the fundamental frequency curve;
converting the fundamental frequency sequence into a sample value sequence;
clustering the sample value sequence with a reference sequence of a known boundary tone type to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be marked;
outputting the boundary tone type of the sample syllable to be marked;
wherein the clustering the sample value sequence with a reference sequence of a known boundary tone type comprises:
clustering the sample value sequence with a first reference sequence through a Pearson correlation coefficient, and classifying sample value sequences into two classes according to whether the correlation coefficient is positive or negative, wherein the class with a positive slope is marked as a first class, and the class with a negative slope is marked as a second class;
clustering the sample value sequence with a second reference sequence through the Euclidean distance, clustering the first class into a third class and a fourth class according to the overall height of the fundamental frequency, and clustering the second class into a fifth class and a sixth class;
and clustering the sample value sequence with a third reference sequence through the Euclidean distance, and clustering each of the third class, the fourth class, the fifth class and the sixth class into two classes according to the variation amplitude of the fundamental frequency.
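Purely as an editor's illustrative sketch (not part of claim 1), the three clustering stages above could look roughly as follows in Python. One plausible reading is assumed here, namely a pair of reference contours per stage; the reference sequences and the rise/fall, high/low, large/small labels are placeholder names, not terms from the application.

```python
import numpy as np
from scipy.stats import pearsonr


def three_stage_cluster(sample_values, ref_rising,
                        ref_high, ref_low, ref_large, ref_small):
    """Return a coarse (direction, height, amplitude) label for one sequence."""
    sample_values = np.asarray(sample_values, dtype=float)

    # Stage 1: the sign of the Pearson correlation with a rising reference
    # separates contours with a positive slope (first class) from those
    # with a negative slope (second class).
    correlation, _ = pearsonr(sample_values, ref_rising)
    direction = "rise" if correlation > 0 else "fall"

    # Stage 2: Euclidean distance to high/low references splits each class
    # by the overall height of the fundamental frequency.
    height = ("high" if np.linalg.norm(sample_values - ref_high)
              < np.linalg.norm(sample_values - ref_low) else "low")

    # Stage 3: Euclidean distance to large/small-excursion references splits
    # each class again by the variation amplitude of the fundamental frequency.
    amplitude = ("large" if np.linalg.norm(sample_values - ref_large)
                 < np.linalg.norm(sample_values - ref_small) else "small")

    return direction, height, amplitude
```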
2. The method of claim 1, wherein the converting the fundamental frequency sequence into a sample value sequence comprises:
sampling and interpolating the fundamental frequency sequence to obtain a logarithmic fundamental frequency sequence of a preset length as the sample value sequence.
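A minimal sketch of this conversion, for illustration only: the preset length, the use of natural logarithms, and linear interpolation are assumptions.

```python
import numpy as np


def f0_to_log_sequence(f0_sequence, preset_length=20):
    """Sample/interpolate the F0 sequence into a fixed-length log-F0 sequence."""
    log_f0 = np.log(np.asarray(f0_sequence, dtype=float))
    positions = np.linspace(0.0, len(log_f0) - 1, num=preset_length)
    return np.interp(positions, np.arange(len(log_f0)), log_f0)
```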
3. The method of claim 1, wherein the converting the fundamental frequency sequence into a sample value sequence comprises:
performing a discrete cosine transform on the fundamental frequency sequence, and taking the discrete cosine transform coefficients as the sample value sequence.
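Again as an editorial sketch only, the discrete cosine transform could be computed with SciPy; keeping only the first few coefficients is an assumption, used here because low-order DCT coefficients already summarize the overall contour shape.

```python
import numpy as np
from scipy.fft import dct


def f0_to_dct_coefficients(f0_sequence, num_coefficients=5):
    """Use low-order type-II DCT coefficients as the sample value sequence."""
    f0 = np.asarray(f0_sequence, dtype=float)
    return dct(f0, type=2, norm="ortho")[:num_coefficients]
```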
4. A method according to one of claims 1-3, wherein the method further comprises:
obtaining an English text to be synthesized, wherein the English text comprises at least one word, and the word comprises at least one syllable;
for a word in the at least one word, extracting features of the word, inputting the features of the word into a pre-trained front-end prediction model, and outputting the boundary tone type of the last syllable of the word;
inputting the English text and the boundary tone type of the last syllable of each word in the English text into a pre-trained back-end acoustic model, and outputting acoustic parameters;
and synthesizing the English text into English speech based on the output acoustic parameters.
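The synthesis steps of claim 4 are sketched below for illustration only; the naive word segmentation, the feature extraction, the model interfaces, and the vocoder are hypothetical callables, not components defined in the application.

```python
def synthesize_english_text(text, extract_word_features,
                            frontend_model, acoustic_model, vocoder):
    """Wire the front-end boundary tone prediction into back-end synthesis."""
    words = text.split()
    # Front end: predict the boundary tone type of each word's last syllable.
    boundary_tones = [frontend_model(extract_word_features(word))
                      for word in words]
    # Back end: map the text plus boundary tone types to acoustic parameters.
    acoustic_parameters = acoustic_model(text, boundary_tones)
    # Generate the English speech waveform from the acoustic parameters.
    return vocoder(acoustic_parameters)
```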
5. The method of claim 4, wherein the front-end predictive model is trained by:
acquiring a first training sample set, wherein the first training sample comprises a sample word and a boundary tone type corresponding to the last sample syllable of the sample word;
and taking a sample word of a first training sample in the first training sample set as input, taking a boundary tone type corresponding to the last sample syllable of the input sample word as output, and training to obtain a front-end prediction model.
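One hedged realization of this training step: any supervised classifier that maps word features to the boundary tone type of the word's last syllable fits the description; scikit-learn's logistic regression is used below only as a stand-in, and the feature encoding is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_frontend_model(training_samples):
    """training_samples: list of (word_feature_vector, boundary_tone_label) pairs."""
    features = np.array([vector for vector, _ in training_samples])
    labels = np.array([label for _, label in training_samples])
    model = LogisticRegression(max_iter=1000)
    model.fit(features, labels)
    return model
```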
6. The method of claim 4, wherein the back-end acoustic model is trained by:
acquiring a second training sample set, wherein the second training sample comprises a boundary tone type of a sample syllable corresponding to a sample phoneme sequence and acoustic parameters corresponding to the sample phoneme sequence;
and taking a sample phoneme sequence of a second training sample in the second training sample set and a boundary tone type of a sample syllable corresponding to the sample phoneme sequence as input, taking acoustic parameters corresponding to the input sample phoneme sequence as output, and training to obtain a back-end acoustic model.
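Similarly hedged, the back-end training could be realized with any regression model from (phoneme-sequence features, boundary tone type) to acoustic parameters; the fixed-length encodings and the multilayer perceptron below are assumptions, not a model family prescribed by the application.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor


def train_backend_acoustic_model(training_samples):
    """training_samples: list of (phoneme_features, tone_encoding, acoustic_params)."""
    inputs = np.array([np.concatenate([phonemes, tone])
                       for phonemes, tone, _ in training_samples])
    targets = np.array([params for _, _, params in training_samples])
    model = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=500)
    model.fit(inputs, targets)
    return model
```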
7. An apparatus for outputting information, comprising:
an acquisition unit configured to acquire a fundamental frequency curve corresponding to a sample syllable to be marked;
an extraction unit configured to extract a fundamental frequency sequence from the fundamental frequency curve;
a conversion unit configured to convert the fundamental frequency sequence into a sample value sequence;
a clustering unit configured to cluster the sample value sequence with a reference sequence of a known boundary tone type to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be marked;
an output unit configured to output the boundary tone type of the sample syllable to be marked;
wherein the clustering unit is further configured to:
cluster the sample value sequence with a first reference sequence through a Pearson correlation coefficient, and classify sample value sequences into two classes according to whether the correlation coefficient is positive or negative, wherein the class with a positive slope is marked as a first class, and the class with a negative slope is marked as a second class;
cluster the sample value sequence with a second reference sequence through the Euclidean distance, cluster the first class into a third class and a fourth class according to the overall height of the fundamental frequency, and cluster the second class into a fifth class and a sixth class;
and cluster the sample value sequence with a third reference sequence through the Euclidean distance, and cluster each of the third class, the fourth class, the fifth class and the sixth class into two classes according to the variation amplitude of the fundamental frequency.
8. The apparatus of claim 7, wherein the conversion unit is further configured to:
sample and interpolate the fundamental frequency sequence to obtain a logarithmic fundamental frequency sequence of a preset length as the sample value sequence.
9. The apparatus of claim 7, wherein the conversion unit is further configured to:
perform a discrete cosine transform on the fundamental frequency sequence, and take the discrete cosine transform coefficients as the sample value sequence.
10. The apparatus according to one of claims 7-9, wherein the apparatus further comprises a synthesis unit configured to:
obtain an English text to be synthesized, wherein the English text comprises at least one word, and the word comprises at least one syllable;
for a word in the at least one word, extract features of the word, input the features of the word into a pre-trained front-end prediction model, and output the boundary tone type of the last syllable of the word;
input the English text and the boundary tone type of the last syllable of each word in the English text into a pre-trained back-end acoustic model, and output acoustic parameters;
and synthesize the English text into English speech based on the output acoustic parameters.
11. The apparatus of claim 10, wherein the apparatus further comprises a first training unit configured to:
acquire a first training sample set, wherein the first training sample comprises a sample word and a boundary tone type corresponding to the last sample syllable of the sample word;
and take a sample word of a first training sample in the first training sample set as input, take a boundary tone type corresponding to the last sample syllable of the input sample word as output, and train to obtain a front-end prediction model.
12. The apparatus of claim 10, wherein the apparatus further comprises a second training unit configured to:
acquire a second training sample set, wherein the second training sample comprises a boundary tone type of a sample syllable corresponding to a sample phoneme sequence and acoustic parameters corresponding to the sample phoneme sequence;
and take a sample phoneme sequence of a second training sample in the second training sample set and a boundary tone type of a sample syllable corresponding to the sample phoneme sequence as input, take acoustic parameters corresponding to the input sample phoneme sequence as output, and train to obtain a back-end acoustic model.
13. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
14. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-6.
CN201811597465.4A 2018-08-31 2018-12-26 Method and device for outputting information Active CN110930975B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2018110127435 2018-08-31
CN201811012743 2018-08-31

Publications (2)

Publication Number Publication Date
CN110930975A CN110930975A (en) 2020-03-27
CN110930975B true CN110930975B (en) 2023-08-04

Family

ID=69855647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811597465.4A Active CN110930975B (en) 2018-08-31 2018-12-26 Method and device for outputting information

Country Status (1)

Country Link
CN (1) CN110930975B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11430431B2 (en) * 2020-02-06 2022-08-30 Tencent America LLC Learning singing from speech
CN111785247A (en) * 2020-07-13 2020-10-16 北京字节跳动网络技术有限公司 Voice generation method, device, equipment and computer readable medium
CN113835786B (en) * 2021-09-30 2023-04-28 四川新网银行股份有限公司 Data docking system, method and computer readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3667950B2 (en) * 1997-09-16 2005-07-06 株式会社東芝 Pitch pattern generation method
CN107924678B (en) * 2015-09-16 2021-12-17 株式会社东芝 Speech synthesis device, speech synthesis method, and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000766A (en) * 2007-01-09 2007-07-18 黑龙江大学 Chinese intonation base frequency contour generating method based on intonation model
CN101950560A (en) * 2010-09-10 2011-01-19 中国科学院声学研究所 Continuous voice tone identification method
CN102201234A (en) * 2011-06-24 2011-09-28 北京宇音天下科技有限公司 Speech synthesizing method based on tone automatic tagging and prediction
CN103035252A (en) * 2011-09-30 2013-04-10 西门子公司 Chinese speech signal processing method, Chinese speech signal processing device and hearing aid device
CN104916282A (en) * 2015-03-27 2015-09-16 北京捷通华声语音技术有限公司 Speech synthesis method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tao Jianhua et al. Research on the prosody model in Chinese speech synthesis based on syllable prosodic feature classification. Acta Acustica, 2003, No. 5, pp. 395-402, Figures 1-5. *

Also Published As

Publication number Publication date
CN110930975A (en) 2020-03-27

Similar Documents

Publication Publication Date Title
CN107945786B (en) Speech synthesis method and device
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
US11929059B2 (en) Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
CN111402855B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
US8682670B2 (en) Statistical enhancement of speech output from a statistical text-to-speech synthesis system
CN112786006B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
CN111899719A (en) Method, apparatus, device and medium for generating audio
CN112786007A (en) Speech synthesis method, device, readable medium and electronic equipment
EP0970466A4 (en) Voice conversion system and methodology
CN110930975B (en) Method and device for outputting information
CN110827805A (en) Speech recognition model training method, speech recognition method and device
CN111161695B (en) Song generation method and device
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
US20230343319A1 (en) speech processing system and a method of processing a speech signal
CN112786008A (en) Speech synthesis method, device, readable medium and electronic equipment
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
CN113053357A (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN114550702A (en) Voice recognition method and device
CN112185340B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN107910005B (en) Target service positioning method and device for interactive text
CN116129859A (en) Prosody labeling method, acoustic model training method, voice synthesis method and voice synthesis device
CN114708876A (en) Audio processing method and device, electronic equipment and storage medium
US11670292B2 (en) Electronic device, method and computer program
CN114512121A (en) Speech synthesis method, model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant