CN110930975A - Method and apparatus for outputting information - Google Patents

Method and apparatus for outputting information

Info

Publication number
CN110930975A
Authority
CN
China
Prior art keywords
sample
sequence
class
syllable
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811597465.4A
Other languages
Chinese (zh)
Other versions
CN110930975B (en)
Inventor
周志平
盖于涛
陈昌滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Publication of CN110930975A publication Critical patent/CN110930975A/en
Application granted granted Critical
Publication of CN110930975B publication Critical patent/CN110930975B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the application discloses a method and an apparatus for outputting information. One embodiment of the method comprises: obtaining a fundamental frequency curve corresponding to a sample syllable to be labeled; extracting a fundamental frequency sequence from the fundamental frequency curve; converting the fundamental frequency sequence into a sample value sequence; clustering the sample value sequence with a reference sequence of a known boundary tone type to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be labeled; and outputting the boundary tone type of the sample syllable to be labeled. This implementation realizes automatic labeling of boundary tones in an English speech synthesis system, thereby shortening the labeling time and saving cost.

Description

Method and apparatus for outputting information
Technical Field
Embodiments of the present application relate to the technical field of speech synthesis, and in particular to a method and an apparatus for outputting information.
Background
English pronunciation has no lexical tone; emotion is mainly expressed through intonation variation. For example, the end of a question is generally spoken with a rising tone. Therefore, in an English synthesis system, intonation information needs to be added in order to synthesize emotionally expressive speech well, and most current English emotional synthesis systems add boundary tone information to represent the type of intonation variation.
Related English emotional synthesis systems can synthesize emotionally expressive speech well, but the emotion types of the training data need to be labeled manually. This requires annotators with strong English expertise and consumes considerable manpower and financial resources.
Disclosure of Invention
The embodiment of the application provides a method and a device for outputting information.
In a first aspect, an embodiment of the present application provides a method for outputting information, including: obtaining a fundamental frequency curve corresponding to a sample syllable to be labeled; extracting a fundamental frequency sequence from the fundamental frequency curve; converting the fundamental frequency sequence into a sample value sequence; clustering the sample value sequence with a reference sequence of a known boundary tone type to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be labeled; and outputting the boundary tone type of the sample syllable to be labeled.
In some embodiments, converting the fundamental frequency sequence into a sample value sequence comprises: sampling and interpolating the fundamental frequency sequence to obtain a logarithmic fundamental frequency sequence of a preset length as the sample value sequence.
In some embodiments, converting the fundamental frequency sequence into a sample value sequence comprises: performing a discrete cosine transform on the fundamental frequency sequence and taking the discrete cosine transform coefficients as the sample value sequence.
In some embodiments, clustering the sample value sequence with a reference sequence of a known boundary tone type comprises: clustering the sample value sequence with a first reference sequence using the Pearson correlation coefficient, dividing the sample value sequences into two classes according to the sign of the correlation coefficient, marking the class with a positive slope as a first class and the class with a negative slope as a second class; clustering the sample value sequence with a second reference sequence using the Euclidean distance, clustering the first class into a third class and a fourth class according to the overall height of the fundamental frequency, and clustering the second class into a fifth class and a sixth class; and clustering the sample value sequence with a third reference sequence using the Euclidean distance, clustering each of the third, fourth, fifth and sixth classes into two classes according to the variation amplitude of the fundamental frequency.
In some embodiments, the method further comprises: obtaining an English text to be synthesized, wherein the English text comprises at least one word, and the word comprises at least one syllable; for a word in at least one word, extracting the features of the word, inputting the features of the word into a pre-trained front-end prediction model, and outputting the boundary tone type of the last syllable of the word; inputting the English text and the boundary tone type of the last syllable of each word in the English text into a pre-trained back-end acoustic model, and outputting acoustic parameters; and synthesizing the English text into English speech based on the output acoustic parameters.
In some embodiments, the front-end prediction model is trained by: obtaining a first training sample set, wherein the first training sample comprises a sample word and a boundary tone type corresponding to the last sample syllable of the sample word; and taking the sample word of the first training sample in the first training sample set as input, taking the boundary tone type corresponding to the last sample syllable of the input sample word as output, and training to obtain a front-end prediction model.
In some embodiments, the back-end acoustic model is trained by: acquiring a second training sample set, wherein the second training sample comprises a boundary tone type of a sample syllable corresponding to a sample phoneme sequence and an acoustic parameter corresponding to the sample phoneme sequence; and taking the sample phoneme sequence of the second training sample in the second training sample set and the boundary tone type of the sample syllable corresponding to the sample phoneme sequence as input, taking the acoustic parameters corresponding to the input sample phoneme sequence as output, and training to obtain a back-end acoustic model.
In a second aspect, an embodiment of the present application provides an apparatus for outputting information, including: an acquisition unit configured to acquire a fundamental frequency curve corresponding to a sample syllable to be labeled; an extraction unit configured to extract a fundamental frequency sequence from the fundamental frequency curve; a conversion unit configured to convert the fundamental frequency sequence into a sample value sequence; a clustering unit configured to cluster the sample value sequence with a reference sequence of a known boundary tone type to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be labeled; and an output unit configured to output the boundary tone type of the sample syllable to be labeled.
In some embodiments, the conversion unit is further configured to: sample and interpolate the fundamental frequency sequence to obtain a logarithmic fundamental frequency sequence of a preset length as the sample value sequence.
In some embodiments, the conversion unit is further configured to: perform a discrete cosine transform on the fundamental frequency sequence and take the discrete cosine transform coefficients as the sample value sequence.
In some embodiments, the clustering unit is further configured to: cluster the sample value sequence with a first reference sequence using the Pearson correlation coefficient, dividing the sample value sequences into two classes according to the sign of the correlation coefficient, marking the class with a positive slope as a first class and the class with a negative slope as a second class; cluster the sample value sequence with a second reference sequence using the Euclidean distance, clustering the first class into a third class and a fourth class according to the overall height of the fundamental frequency, and clustering the second class into a fifth class and a sixth class; and cluster the sample value sequence with a third reference sequence using the Euclidean distance, clustering each of the third, fourth, fifth and sixth classes into two classes according to the variation amplitude of the fundamental frequency.
In some embodiments, the apparatus further comprises a synthesis unit configured to: obtain an English text to be synthesized, wherein the English text comprises at least one word, and the word comprises at least one syllable; for a word in at least one word, extract the features of the word, input the features of the word into a pre-trained front-end prediction model, and output the boundary tone type of the last syllable of the word; input the English text and the boundary tone type of the last syllable of each word in the English text into a pre-trained back-end acoustic model, and output acoustic parameters; and synthesize the English text into English speech based on the output acoustic parameters.
In some embodiments, the apparatus further comprises a first training unit configured to: obtaining a first training sample set, wherein the first training sample comprises a sample word and a boundary tone type corresponding to the last sample syllable of the sample word; and taking the sample word of the first training sample in the first training sample set as input, taking the boundary tone type corresponding to the last sample syllable of the input sample word as output, and training to obtain a front-end prediction model.
In some embodiments, the apparatus further comprises a second training unit configured to: acquire a second training sample set, wherein the second training sample comprises the boundary tone type of the sample syllable corresponding to the sample phoneme sequence and the acoustic parameters corresponding to the sample phoneme sequence; and take the sample phoneme sequence of the second training sample in the second training sample set and the boundary tone type of the sample syllable corresponding to the sample phoneme sequence as input, take the acoustic parameters corresponding to the input sample phoneme sequence as output, and train to obtain a back-end acoustic model.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon which, when executed by one or more processors, cause the one or more processors to implement a method as in any one of the first aspects.
In a fourth aspect, the present application provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of the first aspect.
According to the method and the apparatus for outputting information, the fundamental frequency sequence is extracted from the fundamental frequency curve of a syllable, then converted into a sample value sequence and clustered to obtain the boundary tone type of the syllable. Automatic labeling of boundary tones in an English speech synthesis system is thus realized, which shortens the labeling time and saves cost.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram for one embodiment of a method for outputting information, in accordance with the present application;
FIG. 3 is a schematic diagram of an application scenario of a method for outputting information according to the present application;
FIG. 4 is a flow diagram of yet another embodiment of a method for outputting information according to the present application;
FIG. 5 is a schematic illustration of yet another application scenario of a method for outputting information according to the present application;
FIG. 6 is a schematic block diagram illustrating one embodiment of an apparatus for outputting information according to the present application;
FIG. 7 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the present method for outputting information or apparatus for outputting information may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a speech synthesis application, a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting Audio playing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III, mpeg compression standard Audio Layer 3), MP4 players (Moving Picture Experts Group Audio Layer IV, mpeg compression standard Audio Layer 4), laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server providing various services, such as a background speech synthesis server providing support for audio played on the terminal devices 101, 102, 103. The background speech synthesis server can analyze and process received data such as sample syllables to obtain the boundary tone types of the syllables. A front-end prediction model for predicting boundary tone types and a back-end acoustic model for generating acoustic parameters can then be trained based on the boundary tone types of a large number of sample syllables. Then, when receiving English text to be synthesized, the server can synthesize speech with intonation through the front-end prediction model and the back-end acoustic model and feed the synthesized speech back to the terminal device. The terminal device can also obtain the front-end prediction model and the back-end acoustic model from the server and then perform speech synthesis locally.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for outputting information provided in the embodiment of the present application is generally performed by the server 105, and accordingly, the apparatus for outputting information is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for outputting information in accordance with the present application is shown. The method for outputting information comprises the following steps:
step 201, obtaining a fundamental frequency curve corresponding to a sample syllable to be labeled.
In this embodiment, the executing entity (for example, the server shown in fig. 1) of the method for outputting information may obtain the fundamental frequency curve corresponding to the sample syllable to be labeled from the sound library of a third-party server through a wired or wireless connection. The fundamental frequency curve can also be extracted from the sample syllable by the local server. Syllables are the basic units of English pronunciation, and the pronunciation of any word can be decomposed into a syllable-by-syllable reading. Vowels (a, e, i, o, u) are particularly sonorous in English; a vowel phone may constitute a syllable on its own, and a combination of a vowel phone with one or more consonant phones may also constitute a syllable. Vowel phones form the body of a syllable, and consonants form the syllable boundaries. In acoustics, the fundamental frequency refers to the frequency of the fundamental tone in a complex tone. Among the tones constituting a complex tone, the fundamental tone has the lowest frequency and the highest intensity, and its level determines the pitch. The frequency of speech is usually taken to be the frequency of the fundamental tone. Vocoders and various speech signal processing systems generally contain components for extracting the fundamental frequency. Fundamental frequency identification extracts the fundamental frequency of the speech signal and displays its magnitude and variation over time, forming a fundamental frequency curve. This embodiment may employ a STRAIGHT-based fundamental frequency extraction algorithm.
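The following Python sketch is provided for illustration only and is not part of the original disclosure. It shows one way such a fundamental frequency curve could be obtained; the use of pyworld (a WORLD vocoder implementation) and soundfile is an assumption made for the example, since the embodiment only names a STRAIGHT-based extraction algorithm.

```python
# A minimal sketch of F0 curve extraction, assuming pyworld as a stand-in
# for a STRAIGHT-based extractor.
import numpy as np
import soundfile as sf
import pyworld

def extract_f0_curve(wav_path, frame_period_ms=5.0):
    """Return the F0 curve (Hz per frame) and the frame times for one utterance."""
    audio, sr = sf.read(wav_path)
    audio = audio.astype(np.float64)
    f0, times = pyworld.harvest(audio, sr, frame_period=frame_period_ms)
    return f0, times
```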
Step 202, extracting a fundamental frequency sequence from the fundamental frequency curve.
In this embodiment, the fundamental frequency curve can be segmented according to the syllable boundaries. The fundamental frequency sequence may be formed by sampling at a fixed interval, for example, taking one fundamental frequency value every 5 milliseconds from the fundamental frequency curve.
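Building on the sketch above, the following assumes that syllable boundaries are available as start and end times in seconds; the helper name and the handling of unvoiced frames are illustrative assumptions, not part of the original disclosure.

```python
# A minimal sketch of slicing the F0 curve to one syllable at a fixed frame step.
import numpy as np

def syllable_f0_sequence(f0, times, syl_start, syl_end):
    """Keep one F0 value per frame inside the syllable, dropping unvoiced frames (F0 == 0)."""
    f0, times = np.asarray(f0), np.asarray(times)
    mask = (times >= syl_start) & (times < syl_end) & (f0 > 0)
    return f0[mask]
```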
Step 203, convert the base frequency sequence into a sample value sequence.
In this embodiment, because the fundamental frequency values are large, the fundamental frequency sequence needs to be converted into a sample value sequence of a predetermined length to facilitate the subsequent clustering process. A sample value sequence of uniform length can be obtained by sampling and interpolation. For example, a fundamental frequency sequence of length 50 is down-sampled to a sample value sequence of length 30, while a fundamental frequency sequence of length 20 is interpolated to a sample value sequence of length 30.
In some optional implementations of this embodiment, converting the fundamental frequency sequence into the sample value sequence comprises: sampling and interpolating the fundamental frequency sequence to obtain a logarithmic fundamental frequency sequence of a preset length as the sample value sequence. The fundamental frequency sequence may first be subjected to a logarithm operation and then converted into a logarithmic fundamental frequency sequence of a preset length to serve as the sample value sequence. This reduces the amount of data to be processed.
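A minimal sketch of this optional implementation is given below, for illustration only: the fundamental frequency sequence is converted to the log domain and resampled to a fixed length by linear interpolation. The target length of 30 is an illustrative choice, not a value prescribed by the embodiment.

```python
import numpy as np

def to_log_f0_sample_sequence(f0_seq, target_len=30):
    """Resample a variable-length F0 sequence to a fixed-length log-F0 sequence."""
    log_f0 = np.log(np.asarray(f0_seq, dtype=np.float64))
    src = np.linspace(0.0, 1.0, num=len(log_f0))
    dst = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(dst, src, log_f0)
```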
In some optional implementations of this embodiment, converting the fundamental frequency sequence into the sample value sequence comprises: performing a discrete cosine transform on the fundamental frequency sequence and taking the discrete cosine transform coefficients as the sample value sequence. The Discrete Cosine Transform (DCT) is a transform related to the Fourier transform, similar to the discrete Fourier transform but using only real numbers. The DCT has a strong "energy concentration" property: after the transform, most of the energy of natural signals (including sounds and images) is concentrated in the low-frequency part, and when a signal has statistical properties close to a Markov process, the decorrelation of the DCT approaches the performance of the Karhunen-Loève transform (K-L transform, which has optimal decorrelation). Discrete cosine transform coefficients of a predetermined length may be taken as the sample value sequence in front-to-back order.
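Likewise, the DCT-based alternative can be sketched as follows; keeping the first 10 coefficients in front-to-back order is an illustrative assumption rather than a value fixed by the embodiment.

```python
from scipy.fftpack import dct

def dct_sample_value_sequence(f0_seq, num_coeffs=10):
    """Take the leading DCT coefficients of the F0 sequence as the sample value sequence."""
    coeffs = dct(f0_seq, type=2, norm='ortho')
    return coeffs[:num_coeffs]
```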
And 204, clustering the sample value sequence and a reference sequence of a known boundary tone type to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be labeled.
In this embodiment, boundary tones are the intonation at syllable boundaries. Different emotion types have corresponding intonation changes, so the boundary tones can be divided into four types, namely L-L, L-H, H-L and H-H, which correspond to the general trend of the fundamental frequency: L-H indicates that the fundamental frequency rises from low to high and generally corresponds to interrogative intonation, H-H indicates that the fundamental frequency is high overall and generally corresponds to imperative intonation, and H-L and L-L represent the general cases. Classification can be carried out with a clustering method such as K-means. Reference fundamental frequency sequences of known boundary tone types may be prepared in advance, obtained in the same manner as the sample value sequences described above. The sample value sequence is clustered with a reference sequence of a known boundary tone type, and if the two fall into the same cluster, the boundary tone type of the reference sequence is taken as the boundary tone type of the sample value sequence. The sample value sequence can be clustered with the four types of reference sequences L-L, L-H, H-L and H-H; if the sample value sequence and the L-L reference sequence fall into the same cluster, the sample value sequence is of the L-L type.
In some optional implementations of this embodiment, the sample value sequence is clustered with a reference sequence of a known boundary tone type, and a 3-layer hierarchical clustering method may be selected, as shown in fig. 3:
step 2041, clustering the sample value sequence with a first reference sequence by using a pearson correlation coefficient, clustering the sample value sequence into two classes according to the positive and negative of the correlation coefficient, marking the class with positive slope as a first class, and marking the class with negative slope as a second class.
In this embodiment, the slope is the correlation coefficient. The first reference sequence is a reference base frequency sequence of a known boundary tone type obtained in the same manner as the above-described generation method of the sample value sequence. The sample value sequence is clustered with a first reference sequence of a known boundary tone type, if the slope is positive, the sample value sequence is classified into a first type, and a circle 1 in fig. 3 represents the first type. If the slope is negative, the sequence of sample values is classified into a second class, which is indicated by the circle 2 in fig. 3.
Step 2042, clustering the sample value sequence and the second reference sequence by Euclidean distance, clustering the first class into a third class and a fourth class according to the overall height of the fundamental frequency, and clustering the second class into a fifth class and a sixth class.
In this embodiment, the second reference sequences are a reference fundamental frequency sequence whose fundamental frequency is high overall and one whose fundamental frequency is low overall, obtained in the same manner as the sample value sequences and referred to as the high-frequency sequence and the low-frequency sequence for short. The Euclidean distances between the sample value sequence and the high-frequency sequence and between the sample value sequence and the low-frequency sequence are calculated respectively. If the distance to the high-frequency sequence is smaller than the distance to the low-frequency sequence, the sample value sequence is considered to belong to the high-frequency category; if it is greater, the sample value sequence is considered to belong to the low-frequency category. If the distances are equal, the second reference sequence is replaced and the comparison is repeated. As shown in fig. 3, sample value sequences in the first class that are closer to the low-frequency sequence in Euclidean distance are classified into the third class, which the circle 3 in fig. 3 represents. Sample value sequences in the first class that are closer to the high-frequency sequence are classified into the fourth class, which the circle 4 represents. Sample value sequences in the second class that are closer to the low-frequency sequence are classified into the fifth class, which the circle 5 represents. Sample value sequences in the second class that are closer to the high-frequency sequence are classified into the sixth class, which the circle 6 represents.
Step 2043, clustering the sample value sequence and the third reference sequence by Euclidean distance, and respectively clustering the third class, the fourth class, the fifth class and the sixth class into two classes according to the variation amplitude of the fundamental frequency.
In this embodiment, the third reference sequences are reference fundamental frequency sequences of the four fundamental frequency variation types L-H, L-L, H-H and H-L, obtained in the same manner as the sample value sequences and referred to as the L-H sequence, L-L sequence, H-H sequence and H-L sequence for short.
For the node 3 in fig. 3, the Euclidean distances between the third type sample value sequence and the L-H sequence and the L-L sequence are calculated respectively. If the Euclidean distance between the third type sample value sequence and the L-H sequence is smaller than the Euclidean distance between the third type sample value sequence and the L-L sequence, the third type sample value sequence is considered to belong to the L-H category; otherwise, it is considered to belong to the L-L category. If the distances are equal, the third reference sequence is replaced and the iterative comparison is continued.
For the node 4 in fig. 3, the Euclidean distances between the fourth type sample value sequence and the L-H sequence and the H-H sequence are calculated respectively. If the Euclidean distance between the fourth type sample value sequence and the L-H sequence is smaller than the Euclidean distance between the fourth type sample value sequence and the H-H sequence, the fourth type sample value sequence is considered to belong to the L-H category; otherwise, it is considered to belong to the H-H category. If the distances are equal, the third reference sequence is replaced and the iterative comparison is continued.
For the node 5 in fig. 3, the Euclidean distances between the fifth type sample value sequence and the L-L sequence and the H-L sequence are calculated respectively. If the Euclidean distance between the fifth type sample value sequence and the L-L sequence is smaller than the Euclidean distance between the fifth type sample value sequence and the H-L sequence, the fifth type sample value sequence is considered to belong to the L-L category; otherwise, it is considered to belong to the H-L category. If the distances are equal, the third reference sequence is replaced and the iterative comparison is continued.
For the node 6 in fig. 3, the Euclidean distances between the sixth type sample value sequence and the H-H sequence and the H-L sequence are calculated respectively. If the Euclidean distance between the sixth type sample value sequence and the H-H sequence is smaller than the Euclidean distance between the sixth type sample value sequence and the H-L sequence, the sixth type sample value sequence is considered to belong to the H-H category; otherwise, it is considered to belong to the H-L category. If the distances are equal, the third reference sequence is replaced and the iterative comparison is continued.
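The three-layer procedure of steps 2041 to 2043 can be summarized in the following illustrative sketch. The reference sequences are assumed to be precomputed in the same fixed-length representation as the sample value sequences; the function and parameter names and the dictionary layout are assumptions made for the example, not part of the original disclosure.

```python
# A minimal sketch of the 3-layer hierarchical boundary tone labeling.
from scipy.stats import pearsonr
from scipy.spatial.distance import euclidean

def label_boundary_tone(seq, ref_slope, ref_high, ref_low, refs_third):
    """refs_third maps type names ('L-H', 'L-L', 'H-H', 'H-L') to reference sequences."""
    # Layer 1: sign of the Pearson correlation with the first reference sequence.
    r, _ = pearsonr(seq, ref_slope)
    rising = r > 0                       # first class (positive slope) vs second class

    # Layer 2: overall F0 height via Euclidean distance to the second reference sequences.
    high = euclidean(seq, ref_high) < euclidean(seq, ref_low)

    # Layer 3: variation amplitude via Euclidean distance to the third reference sequences.
    if rising:
        candidates = ('L-H', 'H-H') if high else ('L-H', 'L-L')   # nodes 4 and 3
    else:
        candidates = ('H-H', 'H-L') if high else ('L-L', 'H-L')   # nodes 6 and 5
    return min(candidates, key=lambda t: euclidean(seq, refs_third[t]))
```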
Step 205, outputting the boundary tone type of the sample syllable to be labeled.
In this embodiment, after the clustering is completed, the clustering result may be labeled onto the text of the training data so that it can subsequently be used for training the front-end prediction model and the back-end acoustic model.
The method provided by the above embodiment of the application clusters the base frequency sequence of the sample syllable with the reference sequence of the known boundary tone type to obtain the boundary tone type of the syllable, thereby realizing the automatic labeling of the boundary tone type of the sample syllable.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for outputting information is shown. The process 400 of the method for outputting information includes the steps of:
step 401, obtaining an English text to be synthesized.
In the present embodiment, the execution subject (e.g., the server shown in fig. 1) of the method for outputting information may acquire the English text to be synthesized from a terminal through a wired or wireless connection. The English text comprises at least one word, and a word comprises at least one syllable. A syllable may include at least one phone, and phones are divided into vowel phones and consonant phones.
Step 402, for a word in at least one word, extracting the feature of the word, inputting the feature of the word into a front-end prediction model, and outputting the boundary tone type of the last syllable of the word.
In this embodiment, the front-end prediction model may be trained by the server or by a third-party server. For each word in the English text, the features of the word are extracted. Features may include word vectors, part of speech, case features, prosodic pause type, punctuation information, and the number of syllables of the word. A word vector here refers to the vector into which a word is converted by a word embedding technique. Word embedding is the general term for a set of language modeling and feature learning techniques in natural language processing in which words or phrases from a vocabulary are mapped to vectors of real numbers; conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of lower dimension. Part of speech refers to nouns, adjectives, verbs, past participles, and the like. The same English word may be pronounced differently for different parts of speech; for example, "read" is pronounced differently as a past participle than in its original form. The case feature indicates whether a word is uppercase or lowercase. Prosodic pause types can be divided into three kinds: no pause, short pause, and long pause. The pause type can be determined from sentence structure or punctuation; for example, there is no pause within a phrase such as "take off", but there is a short pause after "off". A long pause occurs when the word is followed by punctuation. Punctuation information refers to whether a word is followed by a punctuation mark, and which kind of punctuation mark. The number of syllables of a word refers to the number of syllables the word contains.
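As a rough illustration of how the listed features might be assembled into a single input vector for the front-end prediction model, the sketch below is provided; the word-vector source, the integer encodings, and the function name are assumptions for the example and are not prescribed by the embodiment.

```python
import numpy as np

PAUSE_TYPES = {'none': 0, 'short': 1, 'long': 2}   # illustrative encoding of pause types

def word_features(word, word_vec, pos_id, pause_type, followed_by_punct, n_syllables):
    """Concatenate the word vector with the remaining per-word features."""
    extra = np.array([
        float(pos_id),                   # part of speech, as an integer id
        float(word[0].isupper()),        # case feature: first letter capitalized?
        float(PAUSE_TYPES[pause_type]),  # prosodic pause type
        float(followed_by_punct),        # punctuation information
        float(n_syllables),              # number of syllables in the word
    ], dtype=np.float32)
    return np.concatenate([np.asarray(word_vec, dtype=np.float32), extra])
```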
The output of the front-end prediction model is the boundary tone type of the last syllable of the word. The 6 boundary tone types plus "no boundary tone" can be represented by a 7-dimensional one-hot vector. In addition to the 4 types L-L, L-H, H-L and H-H, the 6 boundary tone types include the H type and the L type. The H type is the combination of an L-H or H-H type with a short pause, i.e., an L-H or H-H type is merged into the H type when it co-occurs with a short pause. The L type is the combination of an L-L or H-L type with a short pause, i.e., an L-L or H-L type is merged into the L type when it co-occurs with a short pause. Candidate models include DNN (Deep Neural Networks), SVM (Support Vector Machine), LSTM (Long Short-Term Memory network), CRF (Conditional Random Field), attention models, and WaveNet (a speech generation model).
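The 7-dimensional one-hot target described above could be encoded as in the following sketch; the ordering of the labels is an illustrative assumption.

```python
import numpy as np

BT_LABELS = ['L-L', 'L-H', 'H-L', 'H-H', 'L', 'H', 'none']   # 6 boundary tone types + no boundary tone

def one_hot_boundary_tone(label):
    vec = np.zeros(len(BT_LABELS), dtype=np.float32)
    vec[BT_LABELS.index(label)] = 1.0
    return vec
```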
The front-end prediction model can be obtained by training as follows.
Step 4021, a first set of training samples is obtained.
In this embodiment, the executing agent of the training step may obtain the first training sample set locally or remotely from other electronic devices connected to it over a network. Each first training sample comprises a sample word and the boundary tone type corresponding to the last sample syllable of the sample word. For example, the boundary tone type corresponding to the last sample syllable of the sample word can be automatically labeled through steps 201 to 205.
Step 4022, training to obtain a front-end prediction model by using the sample word of the first training sample in the first training sample set as input and using the boundary tone type corresponding to the last sample syllable of the input sample word as output.
In this embodiment, the subject of the training step may input the sample words in the first training sample set into an initial neural network to obtain the boundary tone type corresponding to the last sample syllable of each sample word, take the labeled boundary tone type in the first training sample as the expected output of the initial neural network, and train the initial neural network by a machine learning method. The initial neural network may include, but is not limited to, at least one of: DNN, SVM, LSTM, CRF, attention models, and WaveNet. Specifically, the difference between the obtained boundary tone type and the boundary tone type in the first training sample may first be calculated using a preset loss function; for example, the L2 norm may be used as the loss function to calculate the difference. Then, the network parameters of the initial neural network may be adjusted based on the calculated difference, and training ends when a preset training end condition is satisfied. For example, the preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset number; the calculated difference is less than a preset difference threshold.
Here, various implementations may be employed to adjust the network parameters of the initial neural network based on the difference between the generated boundary tone type and the boundary tone type in the first training sample. For example, the BP (Back Propagation) algorithm or the SGD (Stochastic Gradient Descent) algorithm may be used to adjust the network parameters of the initial neural network.
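For illustration only, a minimal sketch of such a training step is shown below, using a plain feed-forward network trained with back propagation and stochastic gradient descent. The network shape, feature dimension, and hyper-parameters are assumptions made for the example; the embodiment equally allows SVM, LSTM, CRF, attention models, or WaveNet.

```python
import torch
import torch.nn as nn

NUM_CLASSES, FEAT_DIM = 7, 305          # e.g. a 300-d word vector plus 5 extra features (assumed)

model = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(), nn.Linear(128, NUM_CLASSES))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()                   # squared-error loss between prediction and one-hot target

def train_step(features, target_one_hot):
    """features: (batch, FEAT_DIM) tensor; target_one_hot: (batch, NUM_CLASSES) tensor."""
    optimizer.zero_grad()
    pred = torch.softmax(model(features), dim=-1)
    loss = loss_fn(pred, target_one_hot)
    loss.backward()                      # back propagation (BP)
    optimizer.step()                     # stochastic gradient descent (SGD) update
    return loss.item()
```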
And determining the initial neural network obtained by training as a front-end prediction model.
Step 403, inputting the English text and the boundary tone type of the last syllable of each word in the English text into the back-end acoustic model, and outputting acoustic parameters.
In this embodiment, the input of the back-end acoustic model is the English text augmented with boundary tone features, and the output is acoustic parameters, where the acoustic parameters include the fundamental frequency and the spectrum. Candidate models include HMM (Hidden Markov Model), DNN, LSTM, CBHG, attention models, WaveNet, and the like.
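One way such a back-end acoustic model could look is sketched below as an LSTM that maps per-phoneme inputs (a phoneme identity plus the boundary tone one-hot of its syllable) to acoustic parameter frames (one fundamental frequency value plus spectral features). All dimensions and the class name are illustrative assumptions; the embodiment does not prescribe this particular architecture.

```python
import torch
import torch.nn as nn

class BackendAcousticModel(nn.Module):
    def __init__(self, phone_vocab=60, bt_dim=7, hidden=256, spec_dim=60):
        super().__init__()
        self.embed = nn.Embedding(phone_vocab, 64)
        self.rnn = nn.LSTM(64 + bt_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1 + spec_dim)   # 1 F0 value + spectral features per frame

    def forward(self, phone_ids, boundary_tone_onehot):
        # phone_ids: (batch, T) long; boundary_tone_onehot: (batch, T, bt_dim) float
        x = torch.cat([self.embed(phone_ids), boundary_tone_onehot], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)
```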
The back-end acoustic model can be obtained by training the following steps:
step 4031, a second training sample set is obtained.
In this embodiment, the executing agent of the training step may obtain the second training sample set locally or remotely from other electronic devices connected to it over a network. Each second training sample comprises the boundary tone type of the sample syllable corresponding to a sample phoneme sequence and the acoustic parameters corresponding to the sample phoneme sequence.
Step 4032, the sample phoneme sequence of the second training sample in the second training sample set and the boundary tone type of the sample syllable corresponding to the sample phoneme sequence are used as input, and the acoustic parameters corresponding to the input sample phoneme sequence are used as output to obtain a back-end acoustic model through training.
In this embodiment, the subject of the training step may input the sample phoneme sequences in the second training sample set and the boundary tone types of the sample syllables corresponding to the sample phoneme sequences into an initial neural network to obtain the acoustic parameters corresponding to the sample phoneme sequences, take the labeled acoustic parameters in the second training sample as the expected output of the initial neural network, and train the initial neural network by a machine learning method. The initial neural network may include, but is not limited to, at least one of: HMM, DNN, LSTM, CBHG, attention models, and WaveNet. For the specific training steps, refer to step 4022, which are not repeated here.
And determining the initial neural network obtained by training as a back-end acoustic model.
And step 404, synthesizing the English text into English voice based on the output acoustic parameters.
In the present embodiment, the emotional speech is synthesized by a vocoder or a unit concatenation method. Prosodic pauses may also be considered in the synthesis.
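For the vocoder route, a minimal sketch of parametric synthesis from predicted acoustic parameters is shown below, again using pyworld as an illustrative stand-in; the embodiment does not prescribe a specific vocoder, and the unit concatenation route is not shown.

```python
# A minimal sketch: f0 has shape (frames,), spectral_envelope and aperiodicity
# have shape (frames, fft_size // 2 + 1), as expected by the WORLD vocoder.
import numpy as np
import pyworld
import soundfile as sf

def synthesize(f0, spectral_envelope, aperiodicity, sr=16000, frame_period_ms=5.0):
    audio = pyworld.synthesize(f0.astype(np.float64),
                               spectral_envelope.astype(np.float64),
                               aperiodicity.astype(np.float64),
                               sr, frame_period_ms)
    sf.write('synthesized.wav', audio, sr)
    return audio
```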
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for outputting information in the present embodiment embodies the steps of performing speech synthesis using boundary tones. Therefore, the scheme described in the embodiment can add the boundary tone information during the speech synthesis, so as to achieve a better English emotion speech synthesis effect.
With continued reference to fig. 5, fig. 5 is a schematic diagram of an application scenario of the method for outputting information according to the present embodiment. In the application scenario of fig. 5, in the training phase, the server automatically labels the boundary tone types of the sample syllables in the original sound library through steps 201 to 205 and stores the labeled syllables in a labeled syllable library. The labeled syllables in the labeled syllable library are then used to train the front-end prediction model and the back-end acoustic model through steps 4021 and 4022. In the synthesis stage, features are extracted from each word in the text to be synthesized and input into the front-end prediction model to obtain the boundary tone type of the last syllable of each word. The text to be synthesized and the boundary tone type of the last syllable of each word are then input together into the back-end acoustic model to obtain acoustic parameters. The acoustic parameters are finally converted into parametric speech by a vocoder; spliced speech can also be obtained by a unit concatenation method.
The method provided by the above embodiment of the present application trains a front-end prediction model and a back-end acoustic model using sample syllables labeled with boundary tone types, so that the front-end prediction model and the back-end acoustic model can be used to synthesize intonation-bearing speech in the speech synthesis stage. Compared with manual labeling, this greatly reduces the system construction period and labor cost of speech synthesis, and experimental verification shows a good English emotional speech synthesis effect.
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for outputting information, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 6, the apparatus 600 for outputting information of the present embodiment includes: an acquisition unit 601, an extraction unit 602, a conversion unit 603, a clustering unit 604, and an output unit 605. The obtaining unit 601 is configured to obtain a fundamental frequency curve corresponding to a sample syllable to be labeled. The extraction unit 602 is configured to extract the fundamental frequency sequence from the fundamental frequency curve. The conversion unit 603 is configured to convert the fundamental frequency sequence into a sample value sequence. The clustering unit 604 is configured to cluster the sample value sequence with a reference sequence of a known boundary tone type, resulting in the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be labeled. The output unit 605 is configured to output the boundary tone type of the sample syllable to be labeled.
In the present embodiment, specific processing of the acquiring unit 601, the extracting unit 602, the converting unit 603, the clustering unit 604, and the outputting unit 605 of the apparatus 600 for outputting information may refer to step 201, step 202, step 203, step 204, and step 205 in the corresponding embodiment of fig. 2.
In some optional implementations of this embodiment, the conversion unit 603 is further configured to: sample and interpolate the fundamental frequency sequence to obtain a logarithmic fundamental frequency sequence of a preset length as the sample value sequence.
In some optional implementations of this embodiment, the conversion unit 603 is further configured to: perform a discrete cosine transform on the fundamental frequency sequence and take the discrete cosine transform coefficients as the sample value sequence.
In some optional implementations of this embodiment, the clustering unit 604 is further configured to: cluster the sample value sequence with a first reference sequence using the Pearson correlation coefficient, dividing the sample value sequences into two classes according to the sign of the correlation coefficient, marking the class with a positive slope as a first class and the class with a negative slope as a second class; cluster the sample value sequence with a second reference sequence using the Euclidean distance, clustering the first class into a third class and a fourth class according to the overall height of the fundamental frequency, and clustering the second class into a fifth class and a sixth class; and cluster the sample value sequence with a third reference sequence using the Euclidean distance, clustering each of the third, fourth, fifth and sixth classes into two classes according to the variation amplitude of the fundamental frequency.
In some optional implementations of this embodiment, the apparatus 600 further comprises a synthesizing unit configured to: acquire an English text to be synthesized, wherein the English text comprises at least one word, and the word comprises at least one syllable; for a word in at least one word, extract the features of the word, input the features of the word into a pre-trained front-end prediction model, and output the boundary tone type of the last syllable of the word; input the English text and the boundary tone type of the last syllable of each word in the English text into a pre-trained back-end acoustic model, and output acoustic parameters; and synthesize the English text into English speech based on the output acoustic parameters.
In some optional implementations of this embodiment, the apparatus 600 further comprises a first training unit (not shown) configured to: obtain a first training sample set, where a first training sample includes a sample word and the boundary tone type corresponding to the last sample syllable of the sample word; and take the sample word of the first training sample in the first training sample set as input, take the boundary tone type corresponding to the last sample syllable of the input sample word as output, and train to obtain a front-end prediction model.
In some optional implementations of this embodiment, the apparatus 600 further comprises a second training unit (not shown) configured to: acquire a second training sample set, wherein the second training sample comprises the boundary tone type of the sample syllable corresponding to the sample phoneme sequence and the acoustic parameters corresponding to the sample phoneme sequence; and take the sample phoneme sequence of the second training sample in the second training sample set and the boundary tone type of the sample syllable corresponding to the sample phoneme sequence as input, take the acoustic parameters corresponding to the input sample phoneme sequence as output, and train to obtain a back-end acoustic model.
Referring now to FIG. 7, a block diagram of a computer system 700 suitable for use in implementing an electronic device (e.g., the terminal device/server shown in FIG. 1) of an embodiment of the present application is shown. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by a Central Processing Unit (CPU)701, performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising an acquisition unit, an extraction unit, a conversion unit, a clustering unit, and an output unit. The names of these units do not, in some cases, constitute a limitation of the units themselves; for example, the acquisition unit may also be described as "a unit for obtaining a fundamental frequency curve corresponding to a sample syllable to be labeled".
As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: obtain a fundamental frequency curve corresponding to a sample syllable to be labeled; extract a fundamental frequency sequence from the fundamental frequency curve; convert the fundamental frequency sequence into a sample value sequence; cluster the sample value sequence with a reference sequence of a known boundary tone type to obtain the boundary tone type of the sample value sequence as the boundary tone type of the sample syllable to be labeled; and output the boundary tone type of the sample syllable to be labeled.
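By way of illustration only, the following is a minimal Python sketch of the above processing flow, assuming the fundamental frequency curve is available as a frame-wise F0 array (unvoiced frames marked as 0), the reference sequences are given as a dictionary mapping boundary tone types to fixed-length reference contours, and a simple nearest-reference rule (Euclidean distance) stands in for the clustering; the function name, the preset length of 10 points, and the example tone labels are assumptions, not part of the disclosed embodiments.

    import numpy as np

    def label_boundary_tone(f0_curve, references, preset_len=10):
        # Extract the fundamental frequency sequence: keep only voiced frames.
        f0_seq = np.asarray(f0_curve, dtype=float)
        f0_seq = f0_seq[f0_seq > 0]
        # Convert it into a sample value sequence of a preset length
        # (resample onto evenly spaced points and take the logarithm).
        grid = np.linspace(0.0, 1.0, num=preset_len)
        support = np.linspace(0.0, 1.0, num=len(f0_seq))
        sample_seq = np.log(np.interp(grid, support, f0_seq))
        # Assign the boundary tone type of the nearest reference sequence.
        return min(references, key=lambda t: np.linalg.norm(sample_seq - references[t]))

    # Hypothetical usage with two made-up reference contours:
    refs = {"H%": np.linspace(5.0, 5.4, 10), "L%": np.linspace(5.3, 4.9, 10)}
    print(label_boundary_tone([0.0, 180.0, 190.0, 200.0, 215.0, 0.0], refs))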
The above description is only a preferred embodiment of the present application and an illustration of the technical principles employed. It should be understood by those skilled in the art that the scope of the invention referred to in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions in which the above features are replaced with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (16)

1. A method for outputting information, comprising:
obtaining a fundamental frequency curve corresponding to a sample syllable to be labeled;
extracting a fundamental frequency sequence from the fundamental frequency curve;
converting the fundamental frequency sequence into a sample value sequence;
clustering the sample value sequence and a reference sequence of a known boundary tone type to obtain a boundary tone type of the sample value sequence as a boundary tone type of the sample syllable to be labeled;
and outputting the boundary tone type of the sample syllable to be labeled.
2. The method of claim 1, wherein the converting the fundamental frequency sequence into a sample value sequence comprises:
sampling and interpolating the fundamental frequency sequence to obtain a logarithmic fundamental frequency sequence of a preset length as the sample value sequence.
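Shown in isolation (the same conversion used in the end-to-end sketch above), a non-limiting Python reading of claim 2 could be as follows; the preset length of 10 points and the function name are assumptions, and unvoiced frames (F0 = 0) are assumed to be discarded before interpolation.

    import numpy as np

    def f0_to_log_sequence(f0_seq, preset_len=10):
        # Keep voiced frames, resample onto a preset number of evenly spaced
        # points, and take the logarithm to obtain the sample value sequence.
        voiced = np.asarray([f for f in f0_seq if f > 0], dtype=float)
        grid = np.linspace(0.0, 1.0, num=preset_len)
        support = np.linspace(0.0, 1.0, num=len(voiced))
        return np.log(np.interp(grid, support, voiced))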
3. The method of claim 1, wherein the converting the fundamental frequency sequence into a sample value sequence comprises:
performing a discrete cosine transform on the fundamental frequency sequence, and taking the discrete cosine transform coefficients as the sample value sequence.
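A non-limiting reading of claim 3 using SciPy's type-II DCT is sketched below; keeping only the first few coefficients (four here, an assumed number) retains the overall level and coarse shape of the F0 contour while discarding fine detail.

    import numpy as np
    from scipy.fftpack import dct

    def f0_to_dct_coefficients(f0_seq, num_coeffs=4):
        # Orthonormal type-II DCT of the fundamental frequency sequence;
        # the leading coefficients serve as the sample value sequence.
        coeffs = dct(np.asarray(f0_seq, dtype=float), type=2, norm='ortho')
        return coeffs[:num_coeffs]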
4. The method of claim 1, wherein the clustering the sample value sequence with a reference sequence of a known boundary tone type comprises:
clustering the sample value sequence with a first reference sequence by means of a Pearson correlation coefficient, dividing the sample value sequence into two classes according to the sign of the correlation coefficient, marking the class with a positive slope as a first class, and marking the class with a negative slope as a second class;
clustering the sample value sequence with a second reference sequence by means of a Euclidean distance, dividing the first class into a third class and a fourth class according to the overall height of the fundamental frequency, and dividing the second class into a fifth class and a sixth class;
and clustering the sample value sequence with a third reference sequence by means of a Euclidean distance, and dividing each of the third class, the fourth class, the fifth class and the sixth class into two classes according to the variation amplitude of the fundamental frequency.
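By way of illustration, the three-stage clustering of claim 4 might be organized as below; the reference sequences are assumed to be plain NumPy arrays supplied by the caller (one first reference sequence, two second reference sequences per direction keyed by height, and two third reference sequences per direction-and-height pair keyed by amplitude), and the resulting label triples are hypothetical names rather than the types used in the embodiments.

    import numpy as np
    from scipy.stats import pearsonr

    def cluster_boundary_tone(sample_seq, first_ref, second_refs, third_refs):
        sample_seq = np.asarray(sample_seq, dtype=float)
        # Stage 1: the sign of the Pearson correlation with the first reference
        # sequence separates rising contours (first class) from falling ones
        # (second class).
        r, _ = pearsonr(sample_seq, first_ref)
        direction = "rise" if r > 0 else "fall"
        # Stage 2: Euclidean distance to the second reference sequences splits
        # each class by the overall height of the fundamental frequency.
        height = min(second_refs[direction],
                     key=lambda k: np.linalg.norm(sample_seq - second_refs[direction][k]))
        # Stage 3: Euclidean distance to the third reference sequences splits
        # each of the four classes by the amplitude of the F0 change.
        span = min(third_refs[(direction, height)],
                   key=lambda k: np.linalg.norm(sample_seq - third_refs[(direction, height)][k]))
        return (direction, height, span)  # one of eight boundary tone types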
5. The method according to one of claims 1-4, wherein the method further comprises:
obtaining an English text to be synthesized, wherein the English text comprises at least one word, and each word comprises at least one syllable;
for each word of the at least one word, extracting features of the word, inputting the features of the word into a pre-trained front-end prediction model, and outputting a boundary tone type of the last syllable of the word;
inputting the English text and the boundary tone type of the last syllable of each word in the English text into a pre-trained back-end acoustic model, and outputting acoustic parameters;
and synthesizing the English text into English speech based on the output acoustic parameters.
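For claim 5, a structural sketch of the synthesis flow is given below; the feature extractor, front-end prediction model, back-end acoustic model and vocoder are passed in as hypothetical objects, since the claim does not fix their concrete interfaces.

    def synthesize_english_text(text, extract_features, frontend_model, backend_model, vocoder):
        # Predict the boundary tone type of the last syllable of every word.
        words = text.split()
        boundary_tones = [frontend_model.predict(extract_features(word)) for word in words]
        # The back-end acoustic model maps the text plus the per-word boundary
        # tone types to acoustic parameters, which the vocoder renders as audio.
        acoustic_params = backend_model.predict(text, boundary_tones)
        return vocoder.synthesize(acoustic_params)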
6. The method of claim 5, wherein the front-end prediction model is trained by:
obtaining a first training sample set, wherein the first training sample comprises a sample word and a boundary tone type corresponding to the last sample syllable of the sample word;
and taking the sample word of the first training sample in the first training sample set as input, taking the boundary tone type corresponding to the last sample syllable of the input sample word as output, and training to obtain a front-end prediction model.
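Claim 6 describes ordinary supervised training of the front-end prediction model; the scikit-learn sketch below assumes the sample words have already been turned into numeric feature vectors and uses a decision tree purely as a placeholder, since the claim does not specify the model family.

    from sklearn.tree import DecisionTreeClassifier

    def train_frontend_model(word_feature_vectors, boundary_tone_labels):
        # Input: features of the sample words; output: the boundary tone type
        # of the last sample syllable of each word.
        model = DecisionTreeClassifier()
        model.fit(word_feature_vectors, boundary_tone_labels)
        return model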
7. The method of claim 5, wherein the back-end acoustic model is trained by:
obtaining a second training sample set, wherein the second training sample comprises a boundary tone type of a sample syllable corresponding to a sample phoneme sequence and acoustic parameters corresponding to the sample phoneme sequence;
and taking the sample phoneme sequence of the second training sample in the second training sample set and the boundary tone type of the sample syllable corresponding to the sample phoneme sequence as input, taking the acoustic parameters corresponding to the input sample phoneme sequence as output, and training to obtain a back-end acoustic model.
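Likewise, the back-end acoustic model of claim 7 maps a sample phoneme sequence plus boundary tone types to acoustic parameters; the PyTorch module below is only a structural sketch under assumed embedding sizes and a frame-wise regression head, not the model used in the embodiments.

    import torch
    import torch.nn as nn

    class BackendAcousticModel(nn.Module):
        def __init__(self, num_phonemes, num_tone_types, acoustic_dim, hidden=256):
            super().__init__()
            self.phoneme_emb = nn.Embedding(num_phonemes, 64)
            self.tone_emb = nn.Embedding(num_tone_types, 8)
            # Frame-wise regression from (phoneme, boundary tone) to acoustic parameters.
            self.net = nn.Sequential(nn.Linear(64 + 8, hidden), nn.ReLU(),
                                     nn.Linear(hidden, acoustic_dim))

        def forward(self, phoneme_ids, tone_ids):
            x = torch.cat([self.phoneme_emb(phoneme_ids), self.tone_emb(tone_ids)], dim=-1)
            return self.net(x)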
8. An apparatus for outputting information, comprising:
the acquisition unit is configured to acquire a fundamental frequency curve corresponding to a sample syllable to be labeled;
an extraction unit configured to extract a fundamental frequency sequence from the fundamental frequency curve;
a conversion unit configured to convert the fundamental frequency sequence into a sample value sequence;
a clustering unit configured to cluster the sample value sequence with a reference sequence of a known boundary tone type to obtain a boundary tone type of the sample value sequence as a boundary tone type of the sample syllable to be labeled;
an output unit configured to output the boundary tone type of the sample syllable to be labeled.
9. The apparatus of claim 8, wherein the conversion unit is further configured to:
sample and interpolate the fundamental frequency sequence to obtain a logarithmic fundamental frequency sequence of a preset length as the sample value sequence.
10. The apparatus of claim 8, wherein the conversion unit is further configured to:
perform a discrete cosine transform on the fundamental frequency sequence, and take the discrete cosine transform coefficients as the sample value sequence.
11. The apparatus of claim 8, wherein the clustering unit is further configured to:
cluster the sample value sequence with a first reference sequence by means of a Pearson correlation coefficient, divide the sample value sequence into two classes according to the sign of the correlation coefficient, mark the class with a positive slope as a first class, and mark the class with a negative slope as a second class;
cluster the sample value sequence with a second reference sequence by means of a Euclidean distance, divide the first class into a third class and a fourth class according to the overall height of the fundamental frequency, and divide the second class into a fifth class and a sixth class;
and cluster the sample value sequence with a third reference sequence by means of a Euclidean distance, and divide each of the third class, the fourth class, the fifth class and the sixth class into two classes according to the variation amplitude of the fundamental frequency.
12. The apparatus according to one of claims 8-11, wherein the apparatus further comprises a synthesis unit configured to:
obtain an English text to be synthesized, wherein the English text comprises at least one word, and each word comprises at least one syllable;
for each word of the at least one word, extract features of the word, input the features of the word into a pre-trained front-end prediction model, and output a boundary tone type of the last syllable of the word;
input the English text and the boundary tone type of the last syllable of each word in the English text into a pre-trained back-end acoustic model, and output acoustic parameters;
and synthesize the English text into English speech based on the output acoustic parameters.
13. The apparatus of claim 12, wherein the apparatus further comprises a first training unit configured to:
obtain a first training sample set, wherein the first training sample comprises a sample word and a boundary tone type corresponding to the last sample syllable of the sample word;
and take the sample word of the first training sample in the first training sample set as input, take the boundary tone type corresponding to the last sample syllable of the input sample word as output, and train to obtain a front-end prediction model.
14. The apparatus of claim 12, wherein the apparatus further comprises a second training unit configured to:
obtain a second training sample set, wherein the second training sample comprises a boundary tone type of a sample syllable corresponding to a sample phoneme sequence and acoustic parameters corresponding to the sample phoneme sequence;
and take the sample phoneme sequence of the second training sample in the second training sample set and the boundary tone type of the sample syllable corresponding to the sample phoneme sequence as input, take the acoustic parameters corresponding to the input sample phoneme sequence as output, and train to obtain a back-end acoustic model.
15. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
16. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
CN201811597465.4A 2018-08-31 2018-12-26 Method and device for outputting information Active CN110930975B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2018110127435 2018-08-31
CN201811012743 2018-08-31

Publications (2)

Publication Number Publication Date
CN110930975A 2020-03-27
CN110930975B 2023-08-04

Family

ID=69855647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811597465.4A Active CN110930975B (en) 2018-08-31 2018-12-26 Method and device for outputting information

Country Status (1)

Country Link
CN (1) CN110930975B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010051872A1 (en) * 1997-09-16 2001-12-13 Takehiko Kagoshima Clustered patterns for text-to-speech synthesis
CN101000766A (en) * 2007-01-09 2007-07-18 黑龙江大学 Chinese intonation base frequency contour generating method based on intonation model
CN101950560A (en) * 2010-09-10 2011-01-19 中国科学院声学研究所 Continuous voice tone identification method
CN102201234A (en) * 2011-06-24 2011-09-28 北京宇音天下科技有限公司 Speech synthesizing method based on tone automatic tagging and prediction
CN103035252A (en) * 2011-09-30 2013-04-10 西门子公司 Chinese speech signal processing method, Chinese speech signal processing device and hearing aid device
CN104916282A (en) * 2015-03-27 2015-09-16 北京捷通华声语音技术有限公司 Speech synthesis method and apparatus
US20180174570A1 (en) * 2015-09-16 2018-06-21 Kabushiki Kaisha Toshiba Speech synthesis device, speech synthesis method, speech synthesis model training device, speech synthesis model training method, and computer program product

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JOHN LEVIS, ET AL.: "Teaching intonation in discourse using speech visualization technology" *
LI JINGJIAO, ET AL.: "Fundamental frequency extraction of Chinese speech and fuzzy clustering analysis of tones" (in Chinese) *
TAO JIANHUA, ET AL.: "Research on prosodic models in Chinese speech synthesis based on syllable prosodic feature classification" (in Chinese) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11430431B2 (en) * 2020-02-06 2022-08-30 Tencent America LLC Learning singing from speech
CN111785247A (en) * 2020-07-13 2020-10-16 北京字节跳动网络技术有限公司 Voice generation method, device, equipment and computer readable medium
CN113539231A (en) * 2020-12-30 2021-10-22 腾讯科技(深圳)有限公司 Audio processing method, vocoder, device, equipment and storage medium
CN113835786A (en) * 2021-09-30 2021-12-24 四川新网银行股份有限公司 Data docking system, method and computer-readable storage medium
CN113835786B (en) * 2021-09-30 2023-04-28 四川新网银行股份有限公司 Data docking system, method and computer readable storage medium

Also Published As

Publication number Publication date
CN110930975B (en) 2023-08-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant