CN105679308A - Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence - Google Patents
Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence
- Publication number
- CN105679308A CN105679308A CN201610122171.0A CN201610122171A CN105679308A CN 105679308 A CN105679308 A CN 105679308A CN 201610122171 A CN201610122171 A CN 201610122171A CN 105679308 A CN105679308 A CN 105679308A
- Authority
- CN
- China
- Prior art keywords
- model
- artificial intelligence
- blstm
- english speech
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The invention provides a method and a device for generating a grapheme-to-phoneme (g2p) model based on artificial intelligence, and a method and a device for synthesizing English speech based on artificial intelligence. The method for generating the g2p model includes the steps of: obtaining a corpus for training the g2p model; and training on the corpus with a neural network to obtain the g2p model. The method can reduce the size of the g2p model while improving its performance, and thus further improves the English speech synthesis effect.
Description
Technical field
The present invention relates to the technical field of speech synthesis, and in particular to a method and a device for generating a g2p model based on artificial intelligence and a method and a device for synthesizing English speech based on artificial intelligence.
Background technology
Artificial intelligence (AI) is a new technological science that researches and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Research in this field includes intelligent robots, speech recognition, pattern recognition, natural language processing, expert systems, and the like.
Speech synthesis, also known as text-to-speech (TTS) technology, can convert any text information into standard, fluent speech in real time and read it aloud. In English speech synthesis, an important module is the one that performs conversion using a g2p model; the full name of the g2p model is grapheme-to-phoneme model, and it is used to convert letters into phonemes.
In the related art, the training of the g2p model mainly relies on statistical language models, and the smoothing strategy of the model is likewise that of statistical language models. However, as the model order increases, the space resources occupied by the model also grow.
In the related art, guaranteeing the performance of the g2p model requires occupying larger space resources, while reducing the occupied space resources to shrink the model sacrifices the performance of the model, which inevitably degrades the English speech synthesis effect.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, an object of the present invention is to propose a method for generating a g2p model based on artificial intelligence, which can reduce the size of the g2p model while improving its performance, and thus improve the English speech synthesis effect.
Another object of the present invention is to propose a device for generating a g2p model based on artificial intelligence.
Another object of the present invention is to propose an English speech synthesis method based on artificial intelligence.
Another object of the present invention is to propose an English speech synthesis device based on artificial intelligence.
To achieve the above objects, the method for generating a g2p model based on artificial intelligence proposed by the embodiment of the first aspect of the present invention comprises: obtaining a corpus for training the g2p model; and training on the corpus with a neural network to obtain the g2p model.
The method for generating a g2p model based on artificial intelligence proposed by the embodiment of the first aspect of the present invention generates the g2p model through a neural network, and can therefore reduce the size of the g2p model while improving its performance.
To achieve the above objects, the device for generating a g2p model based on artificial intelligence proposed by the embodiment of the second aspect of the present invention comprises: an acquisition module configured to obtain a corpus for training the g2p model; and a training module configured to train on the corpus with a neural network to obtain the g2p model.
The device for generating a g2p model based on artificial intelligence proposed by the embodiment of the second aspect of the present invention generates the g2p model through a neural network, and can therefore reduce the size of the g2p model while improving its performance.
To achieve the above objects, the English speech synthesis method based on artificial intelligence proposed by the embodiment of the third aspect of the present invention comprises: obtaining a g2p model; and performing English speech synthesis with the g2p model; wherein the g2p model is generated by the method according to any embodiment of the first aspect of the present invention.
The English speech synthesis method based on artificial intelligence proposed by the embodiment of the third aspect of the present invention performs English speech synthesis with a g2p model generated by the above neural network training, and can thus improve the English speech synthesis effect.
To achieve the above objects, the English speech synthesis device based on artificial intelligence proposed by the embodiment of the fourth aspect of the present invention comprises: an acquisition module configured to obtain a g2p model; and a synthesis module configured to perform English speech synthesis with the g2p model; wherein the g2p model is generated by the method according to any embodiment of the first aspect of the present invention.
The English speech synthesis device based on artificial intelligence proposed by the embodiment of the fourth aspect of the present invention performs English speech synthesis with a g2p model generated by the above neural network training, and can thus improve the English speech synthesis effect.
Additional aspects and advantages of the present invention will be set forth in part in the following description, will become apparent in part from the following description, or will be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of the method for generating a g2p model based on artificial intelligence proposed by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a BLSTM network in an embodiment of the present invention;
Fig. 3 is a schematic diagram of another BLSTM network in an embodiment of the present invention;
Fig. 4 is a schematic flowchart of the English speech synthesis method based on artificial intelligence proposed by another embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the device for generating a g2p model based on artificial intelligence proposed by another embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the English speech synthesis device based on artificial intelligence proposed by another embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar modules or modules having identical or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and shall not be construed as limiting the present invention. On the contrary, the embodiments of the present invention cover all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
Fig. 1 is a schematic flowchart of the method for generating a g2p model based on artificial intelligence proposed by an embodiment of the present invention. Referring to Fig. 1, the method comprises:
S11: obtaining a corpus for training the g2p model.
For example, a large number of English words and their corresponding phoneme sequences may be collected in advance as the corpus.
Specifically, a common corpus collection method may be adopted.
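For illustration only (the patent does not specify a storage format), such a corpus can be kept in a CMU-dictionary-style text file with one word per line followed by its phoneme sequence. The following minimal Python sketch assumes that hypothetical format and file name:

```python
# A minimal sketch of loading a word/phoneme-sequence corpus. The assumed
# format (word followed by its phonemes, one entry per line, e.g.
# "ACCENT AH0 K S EH1 N T") mirrors the CMU pronouncing dictionary;
# the file name is hypothetical.
def load_corpus(path="cmudict_cleaned.txt"):
    corpus = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word, phonemes = parts[0], parts[1:]
            corpus.append((list(word), phonemes))  # letters in, phonemes out
    return corpus
```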
S12: training on the corpus with a neural network to obtain the g2p model.
In the related art, a statistical language model, i.e., n-gram training, is adopted to obtain the g2p model.
In the present embodiment, when the g2p model is generated by training, a neural network is adopted for the training.
Preferably, the neural network is a bidirectional LSTM (Bidirectional LSTM, BLSTM) network, where LSTM stands for long short-term memory model.
Specifically, a BLSTM network with a specific structure suitable for the g2p model can be obtained through a large number of experiments.
The g2p model is used to convert letters into phonemes. For example, for the word ACCENT, the letter sequence is A C C E N T and the corresponding phoneme sequence is AH0 K S EH1 N T. The task of the g2p model is to convert the letter sequence of a word into the corresponding phoneme sequence, which is in fact a sequence labeling problem. This is the strength of the LSTM. However, a unidirectional LSTM model, when predicting the phoneme at a certain time point, can only make the prediction from the letter sequence and phoneme sequence before that time point. A bidirectional LSTM, i.e., a BLSTM, is composed of two unidirectional LSTMs: one LSTM scans the whole sequence from time 1 to the end, while the other scans the whole sequence in reverse from the end back to the starting position. When predicting the phoneme at a certain time point, the BLSTM can therefore use the whole sequence before and after that time point, so the BLSTM is better suited than a unidirectional LSTM to this type of problem.
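As a minimal illustrative sketch (the patent provides no code, so PyTorch is an assumed implementation choice), one BLSTM layer can be realized as two LSTMs whose per-step outputs are concatenated, giving every time step both left and right context:

```python
import torch
import torch.nn as nn

# One BLSTM layer = two unidirectional LSTMs, one scanning left-to-right
# and one right-to-left; their per-step outputs are concatenated. The
# sizes match the 38 input units and 128-dimensional hidden layer
# described below; the PyTorch realization itself is an assumption.
blstm = nn.LSTM(input_size=38, hidden_size=128,
                bidirectional=True, batch_first=True)
letters = torch.zeros(1, 6, 38)   # e.g. one-hot letters of "ACCENT"
outputs, _ = blstm(letters)
print(outputs.shape)              # torch.Size([1, 6, 256]): 128 forward + 128 backward
```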
Referring to Fig. 2, which is a schematic diagram of a BLSTM network: as shown in Fig. 2, from top to bottom, the first layer of the BLSTM network is the input layer, the second and third layers form the BLSTM layer, and the last layer is the output layer. The layers between the input layer and the output layer may be called hidden layers; in Fig. 2, the hidden layer is a single BLSTM layer, and one BLSTM layer is composed of two LSTM layers, one of which processes the sequence in its input order while the other processes the sequence in the reverse order.
For the g2p model to be realized by the present invention, the input layer of the BLSTM network has 38 units and the output layer has 70 units. The 38 units of the input layer cover 26 English letters, 10 digits, and two special symbols ("_" and "'"). Of the 70 units of the output layer, 69 are phonemes and one is a blank unit.
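These inventories can be sketched in Python as follows; note that the patent does not enumerate the 69 phonemes, so any concrete phoneme list (for example a stressed ARPAbet set, as in the CMU pronouncing dictionary) is an assumption:

```python
import string

# Sketch of the 38-unit input inventory: 26 letters, 10 digits, and the
# two special symbols "_" and "'".
INPUT_SYMBOLS = list(string.ascii_uppercase) + list(string.digits) + ["_", "'"]
assert len(INPUT_SYMBOLS) == 38
letter_to_id = {s: i for i, s in enumerate(INPUT_SYMBOLS)}

# The 70 output units are 69 phonemes plus one CTC blank. The patent does
# not list the phonemes; a stressed ARPAbet set would be one concrete choice.
NUM_PHONEMES = 69
BLANK_ID = 0                      # the blank occupies one output unit
NUM_OUTPUTS = NUM_PHONEMES + 1    # = 70
```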
In the BLSTM network, different numbers of hidden units and different numbers of hidden layers produce different results; the experimental results given later present some of the network structures designed by the present invention.
Further, when the g2p model is generated, the output layer of the above BLSTM network is a connectionist temporal classification (Connectionist Temporal Classification, CTC) output layer. A CTC output layer means that CTC technology is adopted at the output layer. CTC technology frees the training of the g2p model from the preprocessing step of corpus alignment and from any postprocessing step, thereby achieving end-to-end training and prediction of the g2p model.
In traditional g2p model training based on statistical language models, the training data must first be aligned, and only the aligned corpus can be trained by the statistical language model method; the final result often also requires some postprocessing before the optimal phoneme sequence can be obtained. CTC technology enables the neural network to predict a phoneme at any position of the input sequence, so the model no longer needs the corpus-alignment preprocessing step. More importantly, CTC directly outputs the probability of the whole sequence, so no cumbersome postprocessing is needed, because the sequence obtained is already the optimal sequence.
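The following sketch illustrates this property with a CTC loss in PyTorch (an assumed implementation choice, not named in the patent): the loss is computed directly from the per-step output distributions and the unaligned target phoneme sequence, with no alignment supplied:

```python
import torch
import torch.nn as nn

# Sketch of one CTC training step. CTC sums over all alignments of the
# target phoneme sequence to the input positions, so neither a
# pre-computed alignment nor any postprocessing is required.
ctc = nn.CTCLoss(blank=0)

T, N, C = 6, 1, 70                                  # time steps, batch, output units
logits = torch.randn(T, N, C, requires_grad=True)   # stand-in for network outputs
log_probs = logits.log_softmax(2)
targets = torch.tensor([[5, 23, 41, 17, 30, 44]])   # illustrative phoneme ids
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([6])

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow end to end, without any alignment step
```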
Preferably, the hidden layers of the above BLSTM network comprise three layers, which are, from the input layer toward the output layer: a BLSTM layer of 128 dimensions, a BLSTM layer of 128 dimensions, and a BLSTM layer of 64 dimensions. The structure of the BLSTM network is then as shown in Fig. 3.
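A minimal reconstruction of this preferred structure, again assuming PyTorch, might look as follows; the linear projection to the 70 CTC output units is implied by the CTC output layer described above rather than stated explicitly:

```python
import torch
import torch.nn as nn

# Illustrative reconstruction of the Fig. 3 structure: three BLSTM hidden
# layers of 128, 128, and 64 dimensions, followed by a projection to the
# 70 CTC output units. Only the layer sizes come from the patent.
class G2PModel(nn.Module):
    def __init__(self, num_inputs=38, num_outputs=70):
        super().__init__()
        self.blstm1 = nn.LSTM(num_inputs, 128, bidirectional=True, batch_first=True)
        self.blstm2 = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
        self.blstm3 = nn.LSTM(256, 64, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(128, num_outputs)    # 64 forward + 64 backward

    def forward(self, x):                          # x: (batch, time, 38)
        x, _ = self.blstm1(x)                      # -> (batch, time, 256)
        x, _ = self.blstm2(x)                      # -> (batch, time, 256)
        x, _ = self.blstm3(x)                      # -> (batch, time, 128)
        return self.proj(x).log_softmax(-1)        # per-step log-probs for CTC
```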
To better understand the advantages of the present invention, Table 1 gives experimental results for generating the g2p model with a neural network and for the traditional generation of the g2p model with a statistical language model.
Table 1
The data used in this experiment come from the CMU pronouncing dictionary. This dictionary has 133,853 words in total; after simple corpus cleaning, 121,864 words are obtained, of which the training set has 104,194 words, the validation set has 5,484 words, and the test set has 12,186 words. The error rate tests in the above table all come from the test set.
In the above table, the n-gram models of the first 3 rows represent g2p models obtained based on statistical language models, namely 4-gram, 5-gram, and 7-gram. It should be noted that when a g2p model is trained based on a statistical language model, there is a parameter L in addition to the parameter M. M represents the model order, and L represents the maximum length of each state: the larger L is, the more states the model has, hence the larger the model is and, correspondingly, the better its performance can be. In the experiments of the present invention, the value of L is 2. If L were increased for training, the performance of the corresponding n-gram model would improve further, but the corresponding n-gram model would also become larger.
The models after the first 3 rows are the g2p models trained based on BLSTM-CTC; for these models the table gives only the description of the network hidden layers. For example, 128-BLSTM+64-BLSTM is a neural network model whose hidden layers are two BLSTM layers of 128 and 64 dimensions, respectively.
As can be seen from the table, at almost identical performance, the model size of the g2p model based on BLSTM-CTC is much smaller than that of the g2p model obtained based on the statistical language model. In other words, for the g2p model based on the statistical language model to reach the performance of the g2p model based on BLSTM-CTC, the parameter L would have to be increased or the order of the model expanded further, with the consequence that the model size would become even larger. For example, the 4-gram model in the above table performs comparably to the 128-BLSTM model, but the 128-BLSTM model is smaller than the 4-gram model; similarly, the 7-gram model in the above table is 43 MB in size, while the 128-BLSTM+64-BLSTM+64-BLSTM model outperforms the 7-gram model and is nonetheless nearly 6 times smaller. To improve the performance of the 7-gram model further, L would have to be increased and the model retrained, which would make the 7-gram model much larger than 43 MB.
From the above analysis, the g2p model based on BLSTM-CTC can significantly improve the performance of an English speech synthesis system: for an online English speech synthesis system it reduces the error rate and improves the performance of the system, and for an embedded English speech synthesis system (which may also be called an offline English speech synthesis system) it not only improves the performance of the system but also significantly reduces the model size.
In the present embodiment, the g2p model is generated through a neural network, which can reduce the size of the g2p model while improving its performance. Specifically, as shown above, when the g2p model is generated by training the BLSTM network with the above CTC output layer, the model size is greatly reduced at an error rate comparable to that of the traditional g2p model, making the g2p model better suited to embedded English speech synthesis systems; and at a size comparable to that of the traditional g2p model, the error rate is reduced, effectively improving the performance of offline and online English speech synthesis systems.
Fig. 4 is a schematic flowchart of the English speech synthesis method based on artificial intelligence proposed by another embodiment of the present invention. Referring to Fig. 4, the method comprises:
S41: obtaining a g2p model.
Here, the g2p model is generated by neural network training.
For example, the g2p model is generated in advance by training a BLSTM network with a CTC output layer; during English speech synthesis, this g2p model generated in advance can then be obtained. For the specific generation flow of the g2p model, reference may be made to the above embodiment, which will not be detailed here.
S42: performing English speech synthesis on the text to be synthesized, wherein the English speech synthesis comprises:
adopting the g2p model to perform letter-to-phoneme conversion.
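At synthesis time, the conversion step can be sketched as follows, reusing the hypothetical G2PModel and letter_to_id table from the earlier sketches; greedy CTC decoding (collapsing repeats, then dropping blanks) is assumed here, whereas the patent only states that CTC outputs the optimal sequence directly:

```python
import torch

# Sketch of letter-to-phoneme conversion at synthesis time, reusing the
# hypothetical G2PModel and symbol tables from the earlier sketches.
# Greedy CTC decoding is assumed: collapse repeats, then drop blanks.
def letters_to_phonemes(model, word, letter_to_id, id_to_phoneme, blank_id=0):
    x = torch.zeros(1, len(word), len(letter_to_id))
    for t, ch in enumerate(word.upper()):
        x[0, t, letter_to_id[ch]] = 1.0            # one-hot encode each letter
    with torch.no_grad():
        ids = model(x).argmax(-1)[0].tolist()      # best output unit per step
    phonemes, prev = [], blank_id
    for i in ids:
        if i != blank_id and i != prev:            # collapse repeats, drop blanks
            phonemes.append(id_to_phoneme[i])
        prev = i
    return phonemes
```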
Of course, it can be understood that the English speech synthesis may also comprise other steps, and those other steps may be performed in the usual manner.
In the present embodiment, English speech synthesis is performed with the g2p model generated by the above neural network training, which can improve the English speech synthesis effect.
Fig. 5 is a schematic structural diagram of the device for generating a g2p model based on artificial intelligence proposed by another embodiment of the present invention. Referring to Fig. 5, the device 50 comprises an acquisition module 51 and a training module 52.
The acquisition module 51 is configured to obtain a corpus for training the g2p model.
For example, a large number of English words and their corresponding phoneme sequences may be collected in advance as the corpus.
Specifically, a common corpus collection method may be adopted.
The training module 52 is configured to train on the corpus with a neural network to obtain the g2p model.
In the related art, a statistical language model, i.e., n-gram training, is adopted to obtain the g2p model.
In the present embodiment, when the g2p model is generated by training, a neural network is adopted for the training.
In some embodiments, the neural network adopted by the training module is a BLSTM network.
In some embodiments, the output layer of the neural network adopted by the training module is a CTC output layer.
In some embodiments, the hidden layers of the BLSTM network comprise three layers, which are, from the input layer toward the output layer: a BLSTM layer of 128 dimensions, a BLSTM layer of 128 dimensions, and a BLSTM layer of 64 dimensions.
Optionally, the g2p model is used for online or offline English speech synthesis.
It should be understood that the device of the present embodiment corresponds to the method embodiment for generating the g2p model; for specific content, reference may be made to the related content in the method embodiment for generating the g2p model, which will not be detailed here.
In the present embodiment, the g2p model is generated through a neural network, which can reduce the size of the g2p model while ensuring its performance. Specifically, as shown above, when the g2p model is generated by training the BLSTM network with the above CTC output layer, the model size is greatly reduced at an error rate comparable to that of the traditional g2p model, making the g2p model better suited to embedded English speech synthesis systems; and at a size comparable to that of the traditional g2p model, the error rate is reduced, effectively improving the performance of offline and online English speech synthesis systems.
Fig. 6 is a schematic structural diagram of the English speech synthesis device based on artificial intelligence proposed by another embodiment of the present invention. Referring to Fig. 6, the device 60 comprises an acquisition module 61 and a synthesis module 62.
The acquisition module 61 is configured to obtain a g2p model.
Here, the g2p model is generated by neural network training.
For example, the g2p model is generated in advance by training a BLSTM network with a CTC output layer; during English speech synthesis, this g2p model generated in advance can then be obtained. For the specific generation flow of the g2p model, reference may be made to the above embodiment, which will not be detailed here.
The synthesis module 62 is configured to perform English speech synthesis on the text to be synthesized, wherein the English speech synthesis comprises: adopting the g2p model to perform letter-to-phoneme conversion.
Of course, it can be understood that the English speech synthesis may also comprise other steps, and those other steps may be performed in the usual manner.
It should be understood that the device of the present embodiment corresponds to the English speech synthesis method embodiment; for specific content, reference may be made to the related content in the English speech synthesis method embodiment, which will not be detailed here.
In the present embodiment, English speech synthesis is performed with the g2p model generated by the above neural network training, which can improve the English speech synthesis effect.
It should be noted that, in the description of the present invention, the terms "first", "second", etc. are used for descriptive purposes only and shall not be construed as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise specified, "a plurality of" means at least two.
Any process or method described in a flowchart or otherwise described herein may be understood as representing a module, fragment, or portion of code comprising one or more executable instructions for implementing the steps of a specific logical function or process, and the scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
It should be understood that the various parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, a plurality of steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following technologies known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those of ordinary skill in the art can understand that all or part of the steps carried by the method of the above embodiments may be completed by hardware instructed by a program, the program may be stored in a computer-readable storage medium, and when executed, the program performs one of or a combination of the steps of the method embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. When the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, reference to the terms "an embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention, and those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.
Claims (10)
1. A method for generating a g2p model based on artificial intelligence, characterized by comprising:
obtaining a corpus for training the g2p model; and
training on the corpus with a neural network to obtain the g2p model.
2. The method according to claim 1, characterized in that the neural network is a BLSTM network.
3. The method according to claim 2, characterized in that the output layer of the neural network is a CTC output layer.
4. The method according to claim 2 or 3, characterized in that the hidden layers of the BLSTM network comprise three layers, which are, from the input layer toward the output layer: a BLSTM layer of 128 dimensions, a BLSTM layer of 128 dimensions, and a BLSTM layer of 64 dimensions.
5. The method according to claim 1, characterized in that the g2p model is used for online or offline English speech synthesis.
6. An English speech synthesis method based on artificial intelligence, characterized by comprising:
obtaining a g2p model; and
performing English speech synthesis on text to be synthesized, wherein the English speech synthesis comprises:
adopting the g2p model to perform letter-to-phoneme conversion;
wherein the g2p model is generated by the method according to any one of claims 1-5.
7. A device for generating a g2p model based on artificial intelligence, characterized by comprising:
an acquisition module configured to obtain a corpus for training the g2p model; and
a training module configured to train on the corpus with a neural network to obtain the g2p model.
8. The device according to claim 7, characterized in that the neural network adopted by the training module is a BLSTM network.
9. The device according to claim 8, characterized in that the output layer of the neural network adopted by the training module is a CTC output layer.
10. An English speech synthesis device based on artificial intelligence, characterized by comprising:
an acquisition module configured to obtain a g2p model; and
a synthesis module configured to perform English speech synthesis on text to be synthesized, wherein the English speech synthesis comprises: adopting the g2p model to perform letter-to-phoneme conversion;
wherein the g2p model is generated by the method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610122171.0A CN105679308A (en) | 2016-03-03 | 2016-03-03 | Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610122171.0A CN105679308A (en) | 2016-03-03 | 2016-03-03 | Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105679308A (en) | 2016-06-15 |
Family ID: 56306702
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610122171.0A Pending CN105679308A (en) | 2016-03-03 | 2016-03-03 | Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105679308A (en) |
-
2016
- 2016-03-03 CN CN201610122171.0A patent/CN105679308A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080004878A1 (en) * | 2006-06-30 | 2008-01-03 | Robert Bosch Corporation | Method and apparatus for generating features through logical and functional operations |
CN103632663A (en) * | 2013-11-25 | 2014-03-12 | 飞龙 | HMM-based method of Mongolian speech synthesis and front-end processing |
CN105095185A (en) * | 2015-07-21 | 2015-11-25 | 北京旷视科技有限公司 | Author analysis method and author analysis system |
CN105244020A (en) * | 2015-09-24 | 2016-01-13 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy model training method, text-to-speech method and text-to-speech device |
CN105355193A (en) * | 2015-10-30 | 2016-02-24 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and device |
CN105590623A (en) * | 2016-02-24 | 2016-05-18 | 百度在线网络技术(北京)有限公司 | Letter-to-phoneme conversion model generating method and letter-to-phoneme conversion generating device based on artificial intelligence |
Non-Patent Citations (4)
Title |
---|
KANISHKA RAO ET AL.: "Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks", ICASSP 2015 *
商俊蓓: "Online Handwritten Numeric Formula Character Recognition Based on Bidirectional Long Short-Term Memory Recurrent Neural Networks", China Master's Theses Full-text Database, Information Science and Technology *
王永生 et al.: "A DFGA-based Grapheme-to-Phoneme Conversion Algorithm for English Speech Synthesis", Computer Engineering and Applications *
银珠: "Baidu Chinese Speech Recognition Achieves a Major Breakthrough", Computer & Network *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107808660A (en) * | 2016-09-05 | 2018-03-16 | 株式会社东芝 | Method and apparatus for training a neural network language model, and speech recognition method and device |
CN109147766A (en) * | 2018-07-06 | 2019-01-04 | 北京爱医声科技有限公司 | Audio recognition method and system based on end-to-end deep learning model |
CN109147766B (en) * | 2018-07-06 | 2020-08-18 | 北京爱医声科技有限公司 | Speech recognition method and system based on end-to-end deep learning model |
CN109616102A (en) * | 2019-01-09 | 2019-04-12 | 百度在线网络技术(北京)有限公司 | Training method, device and the storage medium of acoustic model |
CN109616102B (en) * | 2019-01-09 | 2021-08-31 | 百度在线网络技术(北京)有限公司 | Acoustic model training method and device and storage medium |
CN110941427A (en) * | 2019-11-15 | 2020-03-31 | 珠海豹趣科技有限公司 | Code generation method and code generator |
CN110941427B (en) * | 2019-11-15 | 2023-10-20 | 珠海豹趣科技有限公司 | Code generation method and code generator |
CN110889987A (en) * | 2019-12-16 | 2020-03-17 | 安徽必果科技有限公司 | Intelligent comment method for correcting spoken English |
CN113160804A (en) * | 2021-02-26 | 2021-07-23 | 深圳市北科瑞讯信息技术有限公司 | Hybrid voice recognition method and device, storage medium and electronic device |
US12099904B2 (en) | 2021-03-10 | 2024-09-24 | International Business Machines Corporation | Uniform artificial intelligence model conversion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105679308A (en) | Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence | |
CN106919646B (en) | Chinese text abstract generating system and method | |
CN106228980B (en) | Data processing method and device | |
CN105957518B (en) | A kind of method of Mongol large vocabulary continuous speech recognition | |
CN106503805B (en) | Bimodal human-human dialogue sentiment analysis method based on machine learning | |
CN110765759B (en) | Intention recognition method and device | |
CN108986797B (en) | A method and system for speech subject recognition | |
CN107408384A (en) | Deployed end-to-end speech recognition | |
CN104217226B (en) | Dialogue act recognition method based on deep neural networks and conditional random fields | |
CN109376775B (en) | Online News Multimodal Sentiment Analysis Method | |
CN109829058A (en) | A kind of classifying identification method improving accent recognition accuracy rate based on multi-task learning | |
CN106652999A (en) | System and method for voice recognition | |
CN108427665A (en) | A kind of text automatic generation method based on LSTM type RNN models | |
CN108231066B (en) | Speech recognition system and method thereof and vocabulary establishing method | |
CN109949796B (en) | End-to-end architecture Lasa dialect voice recognition method based on Tibetan component | |
CN109086865B (en) | Sequence model establishing method based on segmented recurrent neural network | |
CN108538285A (en) | A kind of various keyword detection method based on multitask neural network | |
CN108268442A (en) | Sentence intention prediction method and system | |
CN113051887A (en) | Method, system and device for extracting announcement information elements | |
CN111161703B (en) | Speech synthesis method and device with language, computing equipment and storage medium | |
CN110263345A (en) | Keyword extracting method, device and storage medium | |
CN112163410A (en) | Ancient text pre-training system based on deep learning and training method thereof | |
CN111553157A (en) | Entity replacement-based dialog intention identification method | |
Zulfiqar et al. | Logical layout analysis using deep learning | |
CN110276081A (en) | Document creation method, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20160615 |