CN105679308A - Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence - Google Patents
Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence
- Publication number
- CN105679308A CN105679308A CN201610122171.0A CN201610122171A CN105679308A CN 105679308 A CN105679308 A CN 105679308A CN 201610122171 A CN201610122171 A CN 201610122171A CN 105679308 A CN105679308 A CN 105679308A
- Authority
- CN
- China
- Prior art keywords
- model
- artificial intelligence
- blstm
- english speech
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The invention provides a method and a device for generating a grapheme-to-phoneme (g2p) model based on artificial intelligence, and a method and a device for synthesizing English speech based on artificial intelligence. The method for generating the g2p model includes the steps of: obtaining a corpus for training the g2p model; and training on the corpus with a neural network to obtain the g2p model. The method can reduce the size of the g2p model while improving its performance, and thus further improves the English speech synthesis effect.
Description
Technical field
The present invention relates to the technical field of speech synthesis, and in particular to a method and a device for generating a g2p model based on artificial intelligence and a method and a device for synthesizing English speech based on artificial intelligence.
Background technology
Artificial intelligence (AI) is a new technological science that researches and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Research in this field includes intelligent robots, speech recognition, pattern recognition, natural language processing, expert systems, and the like.
Speech synthesis, also known as text-to-speech (TTS) technology, can convert any text information into standard, fluent speech in real time and read it aloud. In English speech synthesis, an important module is the one that performs conversion using a g2p model; the full name of the g2p model is grapheme-to-phoneme model, and it is used to convert letters into phonemes.
In the related art, the training of the g2p model mainly relies on statistical language models, and the smoothing strategy of the model is likewise that of statistical language models. However, as the model order increases, the space resources occupied by the model also grow.
In the related art, guaranteeing the performance of the g2p model requires occupying larger space resources, while reducing the occupied space resources to shrink the model sacrifices the performance of the model, which inevitably degrades the English speech synthesis effect.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, an object of the present invention is to propose a method for generating a g2p model based on artificial intelligence, which can reduce the size of the g2p model while improving its performance, and thus improve the English speech synthesis effect.
Another object of the present invention is to propose a device for generating a g2p model based on artificial intelligence.
Another object of the present invention is to propose an English speech synthesis method based on artificial intelligence.
Another object of the present invention is to propose an English speech synthesis device based on artificial intelligence.
To achieve the above objects, the method for generating a g2p model based on artificial intelligence proposed by the embodiment of the first aspect of the present invention comprises: obtaining a corpus for training the g2p model; and training on the corpus with a neural network to obtain the g2p model.
The method for generating a g2p model based on artificial intelligence proposed by the embodiment of the first aspect of the present invention generates the g2p model through a neural network, and can therefore reduce the size of the g2p model while improving its performance.
To achieve the above objects, the device for generating a g2p model based on artificial intelligence proposed by the embodiment of the second aspect of the present invention comprises: an acquisition module configured to obtain a corpus for training the g2p model; and a training module configured to train on the corpus with a neural network to obtain the g2p model.
The device for generating a g2p model based on artificial intelligence proposed by the embodiment of the second aspect of the present invention generates the g2p model through a neural network, and can therefore reduce the size of the g2p model while improving its performance.
To achieve the above objects, the English speech synthesis method based on artificial intelligence proposed by the embodiment of the third aspect of the present invention comprises: obtaining a g2p model; and performing English speech synthesis with the g2p model; wherein the g2p model is generated by the method according to any embodiment of the first aspect of the present invention.
The English speech synthesis method based on artificial intelligence proposed by the embodiment of the third aspect of the present invention performs English speech synthesis with a g2p model generated by the above neural network training, and can thus improve the English speech synthesis effect.
To achieve the above objects, the English speech synthesis device based on artificial intelligence proposed by the embodiment of the fourth aspect of the present invention comprises: an acquisition module configured to obtain a g2p model; and a synthesis module configured to perform English speech synthesis with the g2p model; wherein the g2p model is generated by the method according to any embodiment of the first aspect of the present invention.
The English speech synthesis device based on artificial intelligence proposed by the embodiment of the fourth aspect of the present invention performs English speech synthesis with a g2p model generated by the above neural network training, and can thus improve the English speech synthesis effect.
Additional aspects and advantages of the present invention will be set forth in part in the following description, will become apparent in part from the following description, or will be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of the method for generating a g2p model based on artificial intelligence proposed by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a BLSTM network in an embodiment of the present invention;
Fig. 3 is a schematic diagram of another BLSTM network in an embodiment of the present invention;
Fig. 4 is a schematic flowchart of the English speech synthesis method based on artificial intelligence proposed by another embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the device for generating a g2p model based on artificial intelligence proposed by another embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the English speech synthesis device based on artificial intelligence proposed by another embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar modules or modules having identical or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and shall not be construed as limiting the present invention. On the contrary, the embodiments of the present invention cover all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
Fig. 1 is a schematic flowchart of the method for generating a g2p model based on artificial intelligence proposed by an embodiment of the present invention. Referring to Fig. 1, the method comprises:
S11: obtaining a corpus for training the g2p model.
For example, a large number of English words and their corresponding phoneme sequences may be collected in advance as the corpus.
Specifically, a common corpus collection method may be adopted.
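For illustration only (the patent does not specify a storage format), such a corpus can be kept in a CMU-dictionary-style text file with one word per line followed by its phoneme sequence. The following minimal Python sketch assumes that hypothetical format and file name:

```python
# A minimal sketch of loading a word/phoneme-sequence corpus. The assumed
# format (word followed by its phonemes, one entry per line, e.g.
# "ACCENT AH0 K S EH1 N T") mirrors the CMU pronouncing dictionary;
# the file name is hypothetical.
def load_corpus(path="cmudict_cleaned.txt"):
    corpus = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word, phonemes = parts[0], parts[1:]
            corpus.append((list(word), phonemes))  # letters in, phonemes out
    return corpus
```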
S12: training on the corpus with a neural network to obtain the g2p model.
In the related art, a statistical language model, i.e., n-gram training, is adopted to obtain the g2p model.
In the present embodiment, when the g2p model is generated by training, a neural network is adopted for the training.
Preferably, the neural network is a bidirectional LSTM (Bidirectional LSTM, BLSTM) network, where LSTM stands for long short-term memory model.
Specifically, a BLSTM network with a specific structure suitable for the g2p model can be obtained through a large number of experiments.
The g2p model is used to convert letters into phonemes. For example, for the word ACCENT, the letter sequence is A C C E N T and the corresponding phoneme sequence is AH0 K S EH1 N T. The task of the g2p model is to convert the letter sequence of a word into the corresponding phoneme sequence, which is in fact a sequence labeling problem. This is the strength of the LSTM. However, a unidirectional LSTM model, when predicting the phoneme at a certain time point, can only make the prediction from the letter sequence and phoneme sequence before that time point. A bidirectional LSTM, i.e., a BLSTM, is composed of two unidirectional LSTMs: one LSTM scans the whole sequence from time 1 to the end, while the other scans the whole sequence in reverse from the end back to the starting position. When predicting the phoneme at a certain time point, the BLSTM can therefore use the whole sequence before and after that time point, so the BLSTM is better suited than a unidirectional LSTM to this type of problem.
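As a minimal illustrative sketch (the patent provides no code, so PyTorch is an assumed implementation choice), one BLSTM layer can be realized as two LSTMs whose per-step outputs are concatenated, giving every time step both left and right context:

```python
import torch
import torch.nn as nn

# One BLSTM layer = two unidirectional LSTMs, one scanning left-to-right
# and one right-to-left; their per-step outputs are concatenated. The
# sizes match the 38 input units and 128-dimensional hidden layer
# described below; the PyTorch realization itself is an assumption.
blstm = nn.LSTM(input_size=38, hidden_size=128,
                bidirectional=True, batch_first=True)
letters = torch.zeros(1, 6, 38)   # e.g. one-hot letters of "ACCENT"
outputs, _ = blstm(letters)
print(outputs.shape)              # torch.Size([1, 6, 256]): 128 forward + 128 backward
```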
Referring to Fig. 2, which is a schematic diagram of a BLSTM network: as shown in Fig. 2, from top to bottom, the first layer of the BLSTM network is the input layer, the second and third layers form the BLSTM layer, and the last layer is the output layer. The layers between the input layer and the output layer may be called hidden layers; in Fig. 2, the hidden layer is a single BLSTM layer, and one BLSTM layer is composed of two LSTM layers, one of which processes the sequence in its input order while the other processes the sequence in the reverse order.
For the g2p model to be realized by the present invention, the input layer of the BLSTM network has 38 units and the output layer has 70 units. The 38 units of the input layer cover 26 English letters, 10 digits, and two special symbols ("_" and "'"). Of the 70 units of the output layer, 69 are phonemes and one is a blank unit.
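These inventories can be sketched in Python as follows; note that the patent does not enumerate the 69 phonemes, so any concrete phoneme list (for example a stressed ARPAbet set, as in the CMU pronouncing dictionary) is an assumption:

```python
import string

# Sketch of the 38-unit input inventory: 26 letters, 10 digits, and the
# two special symbols "_" and "'".
INPUT_SYMBOLS = list(string.ascii_uppercase) + list(string.digits) + ["_", "'"]
assert len(INPUT_SYMBOLS) == 38
letter_to_id = {s: i for i, s in enumerate(INPUT_SYMBOLS)}

# The 70 output units are 69 phonemes plus one CTC blank. The patent does
# not list the phonemes; a stressed ARPAbet set would be one concrete choice.
NUM_PHONEMES = 69
BLANK_ID = 0                      # the blank occupies one output unit
NUM_OUTPUTS = NUM_PHONEMES + 1    # = 70
```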
In the BLSTM network, different numbers of hidden units and different numbers of hidden layers produce different results; the experimental results given later present some of the network structures designed by the present invention.
Further, when the g2p model is generated, the output layer of the above BLSTM network is a connectionist temporal classification (Connectionist Temporal Classification, CTC) output layer. A CTC output layer means that CTC technology is adopted at the output layer. CTC technology frees the training of the g2p model from the preprocessing step of corpus alignment and from any postprocessing step, thereby achieving end-to-end training and prediction of the g2p model.
In traditional g2p model training based on statistical language models, the training data must first be aligned, and only the aligned corpus can be trained by the statistical language model method; the final result often also requires some postprocessing before the optimal phoneme sequence can be obtained. CTC technology enables the neural network to predict a phoneme at any position of the input sequence, so the model no longer needs the corpus-alignment preprocessing step. More importantly, CTC directly outputs the probability of the whole sequence, so no cumbersome postprocessing is needed, because the sequence obtained is already the optimal sequence.
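The following sketch illustrates this property with a CTC loss in PyTorch (an assumed implementation choice, not named in the patent): the loss is computed directly from the per-step output distributions and the unaligned target phoneme sequence, with no alignment supplied:

```python
import torch
import torch.nn as nn

# Sketch of one CTC training step. CTC sums over all alignments of the
# target phoneme sequence to the input positions, so neither a
# pre-computed alignment nor any postprocessing is required.
ctc = nn.CTCLoss(blank=0)

T, N, C = 6, 1, 70                                  # time steps, batch, output units
logits = torch.randn(T, N, C, requires_grad=True)   # stand-in for network outputs
log_probs = logits.log_softmax(2)
targets = torch.tensor([[5, 23, 41, 17, 30, 44]])   # illustrative phoneme ids
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([6])

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow end to end, without any alignment step
```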
Preferably, the hidden layers of the above BLSTM network comprise three layers, which are, from the input layer toward the output layer: a BLSTM layer of 128 dimensions, a BLSTM layer of 128 dimensions, and a BLSTM layer of 64 dimensions. The structure of the BLSTM network is then as shown in Fig. 3.
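A minimal reconstruction of this preferred structure, again assuming PyTorch, might look as follows; the linear projection to the 70 CTC output units is implied by the CTC output layer described above rather than stated explicitly:

```python
import torch
import torch.nn as nn

# Illustrative reconstruction of the Fig. 3 structure: three BLSTM hidden
# layers of 128, 128, and 64 dimensions, followed by a projection to the
# 70 CTC output units. Only the layer sizes come from the patent.
class G2PModel(nn.Module):
    def __init__(self, num_inputs=38, num_outputs=70):
        super().__init__()
        self.blstm1 = nn.LSTM(num_inputs, 128, bidirectional=True, batch_first=True)
        self.blstm2 = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
        self.blstm3 = nn.LSTM(256, 64, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(128, num_outputs)    # 64 forward + 64 backward

    def forward(self, x):                          # x: (batch, time, 38)
        x, _ = self.blstm1(x)                      # -> (batch, time, 256)
        x, _ = self.blstm2(x)                      # -> (batch, time, 256)
        x, _ = self.blstm3(x)                      # -> (batch, time, 128)
        return self.proj(x).log_softmax(-1)        # per-step log-probs for CTC
```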
To better understand the advantages of the present invention, Table 1 gives experimental results for generating the g2p model with a neural network and for the traditional generation of the g2p model with a statistical language model.
Table 1
The data used in this experiment come from the CMU pronouncing dictionary. This dictionary has 133,853 words in total; after simple corpus cleaning, 121,864 words are obtained, of which the training set has 104,194 words, the validation set has 5,484 words, and the test set has 12,186 words. The error rate tests in the above table all come from the test set.
In the above table, the n-gram models of the first 3 rows represent g2p models obtained based on statistical language models, namely 4-gram, 5-gram, and 7-gram. It should be noted that when a g2p model is trained based on a statistical language model, there is a parameter L in addition to the parameter M. M represents the model order, and L represents the maximum length of each state: the larger L is, the more states the model has, hence the larger the model is and, correspondingly, the better its performance can be. In the experiments of the present invention, the value of L is 2. If L were increased for training, the performance of the corresponding n-gram model would improve further, but the corresponding n-gram model would also become larger.
The models after the first 3 rows are the g2p models trained based on BLSTM-CTC; for these models the table gives only the description of the network hidden layers. For example, 128-BLSTM+64-BLSTM is a neural network model whose hidden layers are two BLSTM layers of 128 and 64 dimensions, respectively.
As can be seen from the table, at almost identical performance, the model size of the g2p model based on BLSTM-CTC is much smaller than that of the g2p model obtained based on the statistical language model. In other words, for the g2p model based on the statistical language model to reach the performance of the g2p model based on BLSTM-CTC, the parameter L would have to be increased or the order of the model expanded further, with the consequence that the model size would become even larger. For example, the 4-gram model in the above table performs comparably to the 128-BLSTM model, but the 128-BLSTM model is smaller than the 4-gram model; similarly, the 7-gram model in the above table is 43 MB in size, while the 128-BLSTM+64-BLSTM+64-BLSTM model outperforms the 7-gram model and is nonetheless nearly 6 times smaller. To improve the performance of the 7-gram model further, L would have to be increased and the model retrained, which would make the 7-gram model much larger than 43 MB.
From the above analysis, the g2p model based on BLSTM-CTC can significantly improve the performance of an English speech synthesis system: for an online English speech synthesis system it reduces the error rate and improves the performance of the system, and for an embedded English speech synthesis system (which may also be called an offline English speech synthesis system) it not only improves the performance of the system but also significantly reduces the model size.
In the present embodiment, the g2p model is generated through a neural network, which can reduce the size of the g2p model while improving its performance. Specifically, as shown above, when the g2p model is generated by training the BLSTM network with the above CTC output layer, the model size is greatly reduced at an error rate comparable to that of the traditional g2p model, making the g2p model better suited to embedded English speech synthesis systems; and at a size comparable to that of the traditional g2p model, the error rate is reduced, effectively improving the performance of offline and online English speech synthesis systems.
Fig. 4 is a schematic flowchart of the English speech synthesis method based on artificial intelligence proposed by another embodiment of the present invention. Referring to Fig. 4, the method comprises:
S41: obtaining a g2p model.
Here, the g2p model is generated by neural network training.
For example, the g2p model is generated in advance by training a BLSTM network with a CTC output layer; during English speech synthesis, this g2p model generated in advance can then be obtained. For the specific generation flow of the g2p model, reference may be made to the above embodiment, which will not be detailed here.
S42: performing English speech synthesis on the text to be synthesized, wherein the English speech synthesis comprises:
adopting the g2p model to perform letter-to-phoneme conversion.
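At synthesis time, the conversion step can be sketched as follows, reusing the hypothetical G2PModel and letter_to_id table from the earlier sketches; greedy CTC decoding (collapsing repeats, then dropping blanks) is assumed here, whereas the patent only states that CTC outputs the optimal sequence directly:

```python
import torch

# Sketch of letter-to-phoneme conversion at synthesis time, reusing the
# hypothetical G2PModel and symbol tables from the earlier sketches.
# Greedy CTC decoding is assumed: collapse repeats, then drop blanks.
def letters_to_phonemes(model, word, letter_to_id, id_to_phoneme, blank_id=0):
    x = torch.zeros(1, len(word), len(letter_to_id))
    for t, ch in enumerate(word.upper()):
        x[0, t, letter_to_id[ch]] = 1.0            # one-hot encode each letter
    with torch.no_grad():
        ids = model(x).argmax(-1)[0].tolist()      # best output unit per step
    phonemes, prev = [], blank_id
    for i in ids:
        if i != blank_id and i != prev:            # collapse repeats, drop blanks
            phonemes.append(id_to_phoneme[i])
        prev = i
    return phonemes
```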
Of course, it can be understood that the English speech synthesis may also comprise other steps, and those other steps may be performed in the usual manner.
In the present embodiment, English speech synthesis is performed with the g2p model generated by the above neural network training, which can improve the English speech synthesis effect.
Fig. 5 is a schematic structural diagram of the device for generating a g2p model based on artificial intelligence proposed by another embodiment of the present invention. Referring to Fig. 5, the device 50 comprises an acquisition module 51 and a training module 52.
The acquisition module 51 is configured to obtain a corpus for training the g2p model.
For example, a large number of English words and their corresponding phoneme sequences may be collected in advance as the corpus.
Specifically, a common corpus collection method may be adopted.
The training module 52 is configured to train on the corpus with a neural network to obtain the g2p model.
In the related art, a statistical language model, i.e., n-gram training, is adopted to obtain the g2p model.
In the present embodiment, when the g2p model is generated by training, a neural network is adopted for the training.
In some embodiments, the neural network adopted by the training module is a BLSTM network.
In some embodiments, the output layer of the neural network adopted by the training module is a CTC output layer.
In some embodiments, the hidden layers of the BLSTM network comprise three layers, which are, from the input layer toward the output layer: a BLSTM layer of 128 dimensions, a BLSTM layer of 128 dimensions, and a BLSTM layer of 64 dimensions.
Optionally, the g2p model is used for online or offline English speech synthesis.
It should be understood that the device of the present embodiment corresponds to the method embodiment for generating the g2p model; for specific content, reference may be made to the related content in the method embodiment for generating the g2p model, which will not be detailed here.
In the present embodiment, the g2p model is generated through a neural network, which can reduce the size of the g2p model while ensuring its performance. Specifically, as shown above, when the g2p model is generated by training the BLSTM network with the above CTC output layer, the model size is greatly reduced at an error rate comparable to that of the traditional g2p model, making the g2p model better suited to embedded English speech synthesis systems; and at a size comparable to that of the traditional g2p model, the error rate is reduced, effectively improving the performance of offline and online English speech synthesis systems.
Fig. 6 is a schematic structural diagram of the English speech synthesis device based on artificial intelligence proposed by another embodiment of the present invention. Referring to Fig. 6, the device 60 comprises an acquisition module 61 and a synthesis module 62.
The acquisition module 61 is configured to obtain a g2p model.
Here, the g2p model is generated by neural network training.
For example, the g2p model is generated in advance by training a BLSTM network with a CTC output layer; during English speech synthesis, this g2p model generated in advance can then be obtained. For the specific generation flow of the g2p model, reference may be made to the above embodiment, which will not be detailed here.
The synthesis module 62 is configured to perform English speech synthesis on the text to be synthesized, wherein the English speech synthesis comprises: adopting the g2p model to perform letter-to-phoneme conversion.
Of course, it can be understood that the English speech synthesis may also comprise other steps, and those other steps may be performed in the usual manner.
It should be understood that the device of the present embodiment corresponds to the English speech synthesis method embodiment; for specific content, reference may be made to the related content in the English speech synthesis method embodiment, which will not be detailed here.
In the present embodiment, English speech synthesis is performed with the g2p model generated by the above neural network training, which can improve the English speech synthesis effect.
It should be noted that, in the description of the present invention, the terms "first", "second", etc. are used for descriptive purposes only and shall not be construed as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise specified, "a plurality of" means at least two.
Any process or method described in a flowchart or otherwise described herein may be understood as representing a module, fragment, or portion of code comprising one or more executable instructions for implementing the steps of a specific logical function or process, and the scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
It should be understood that the various parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, a plurality of steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following technologies known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those of ordinary skill in the art can understand that all or part of the steps carried by the method of the above embodiments may be completed by hardware instructed by a program, the program may be stored in a computer-readable storage medium, and when executed, the program performs one of or a combination of the steps of the method embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. When the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, reference to the terms "an embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention, and those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.
Claims (10)
1. A method for generating a g2p model based on artificial intelligence, characterized by comprising:
obtaining a corpus for training the g2p model; and
training on the corpus with a neural network to obtain the g2p model.
2. The method according to claim 1, characterized in that the neural network is a BLSTM network.
3. The method according to claim 2, characterized in that the output layer of the neural network is a CTC output layer.
4. The method according to claim 2 or 3, characterized in that the hidden layers of the BLSTM network comprise three layers, which are, from the input layer toward the output layer: a BLSTM layer of 128 dimensions, a BLSTM layer of 128 dimensions, and a BLSTM layer of 64 dimensions.
5. The method according to claim 1, characterized in that the g2p model is used for online or offline English speech synthesis.
6. An English speech synthesis method based on artificial intelligence, characterized by comprising:
obtaining a g2p model; and
performing English speech synthesis on text to be synthesized, wherein the English speech synthesis comprises:
adopting the g2p model to perform letter-to-phoneme conversion;
wherein the g2p model is generated by the method according to any one of claims 1-5.
7. A device for generating a g2p model based on artificial intelligence, characterized by comprising:
an acquisition module configured to obtain a corpus for training the g2p model; and
a training module configured to train on the corpus with a neural network to obtain the g2p model.
8. The device according to claim 7, characterized in that the neural network adopted by the training module is a BLSTM network.
9. The device according to claim 8, characterized in that the output layer of the neural network adopted by the training module is a CTC output layer.
10. An English speech synthesis device based on artificial intelligence, characterized by comprising:
an acquisition module configured to obtain a g2p model; and
a synthesis module configured to perform English speech synthesis on text to be synthesized, wherein the English speech synthesis comprises: adopting the g2p model to perform letter-to-phoneme conversion;
wherein the g2p model is generated by the method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610122171.0A CN105679308A (en) | 2016-03-03 | 2016-03-03 | Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610122171.0A CN105679308A (en) | 2016-03-03 | 2016-03-03 | Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105679308A (en) | 2016-06-15 |
Family ID: 56306702
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610122171.0A Pending CN105679308A (en) | 2016-03-03 | 2016-03-03 | Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105679308A (en) |
-
2016
- 2016-03-03 CN CN201610122171.0A patent/CN105679308A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080004878A1 (en) * | 2006-06-30 | 2008-01-03 | Robert Bosch Corporation | Method and apparatus for generating features through logical and functional operations |
CN103632663A (en) * | 2013-11-25 | 2014-03-12 | 飞龙 | HMM-based method of Mongolian speech synthesis and front-end processing |
CN105095185A (en) * | 2015-07-21 | 2015-11-25 | 北京旷视科技有限公司 | Author analysis method and author analysis system |
CN105244020A (en) * | 2015-09-24 | 2016-01-13 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy model training method, text-to-speech method and text-to-speech device |
CN105355193A (en) * | 2015-10-30 | 2016-02-24 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and device |
CN105590623A (en) * | 2016-02-24 | 2016-05-18 | 百度在线网络技术(北京)有限公司 | Letter-to-phoneme conversion model generating method and letter-to-phoneme conversion generating device based on artificial intelligence |
Non-Patent Citations (4)
Title |
---|
KANISHKA RAO ET AL.: "Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks", ICASSP 2015 *
商俊蓓: "Online Handwritten Numeric Formula Character Recognition Based on Bidirectional Long Short-Term Memory Recurrent Neural Networks", China Master's Theses Full-text Database, Information Science and Technology *
王永生 et al.: "A DFGA-based Grapheme-to-Phoneme Conversion Algorithm for English Speech Synthesis", Computer Engineering and Applications *
银珠: "Baidu Chinese Speech Recognition Achieves a Major Breakthrough", Computer & Network *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107808660A (en) * | 2016-09-05 | 2018-03-16 | 株式会社东芝 | Method and apparatus for training a neural network language model, and speech recognition method and device |
CN109147766A (en) * | 2018-07-06 | 2019-01-04 | 北京爱医声科技有限公司 | Audio recognition method and system based on end-to-end deep learning model |
CN109147766B (en) * | 2018-07-06 | 2020-08-18 | 北京爱医声科技有限公司 | Speech recognition method and system based on end-to-end deep learning model |
CN109616102A (en) * | 2019-01-09 | 2019-04-12 | 百度在线网络技术(北京)有限公司 | Training method, device and the storage medium of acoustic model |
CN109616102B (en) * | 2019-01-09 | 2021-08-31 | 百度在线网络技术(北京)有限公司 | Acoustic model training method and device and storage medium |
CN110941427A (en) * | 2019-11-15 | 2020-03-31 | 珠海豹趣科技有限公司 | Code generation method and code generator |
CN110941427B (en) * | 2019-11-15 | 2023-10-20 | 珠海豹趣科技有限公司 | Code generation method and code generator |
CN110889987A (en) * | 2019-12-16 | 2020-03-17 | 安徽必果科技有限公司 | Intelligent comment method for correcting spoken English |
CN113160804A (en) * | 2021-02-26 | 2021-07-23 | 深圳市北科瑞讯信息技术有限公司 | Hybrid voice recognition method and device, storage medium and electronic device |
US12099904B2 (en) | 2021-03-10 | 2024-09-24 | International Business Machines Corporation | Uniform artificial intelligence model conversion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105679308A (en) | Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence | |
CN106919646B (en) | Chinese text abstract generating system and method | |
CN106228980B (en) | Data processing method and device | |
CN105957518B (en) | A kind of method of Mongol large vocabulary continuous speech recognition | |
CN106503805B (en) | Bimodal human-human dialogue sentiment analysis method based on machine learning | |
CN110765759B (en) | Intention recognition method and device | |
CN108986797B (en) | A method and system for speech subject recognition | |
CN107408384A (en) | Deployed end-to-end speech recognition | |
CN104217226B (en) | Dialogue act recognition method based on deep neural networks and conditional random fields | |
CN109376775B (en) | Online News Multimodal Sentiment Analysis Method | |
CN109829058A (en) | A kind of classifying identification method improving accent recognition accuracy rate based on multi-task learning | |
CN106652999A (en) | System and method for voice recognition | |
CN108427665A (en) | A kind of text automatic generation method based on LSTM type RNN models | |
CN108231066B (en) | Speech recognition system and method thereof and vocabulary establishing method | |
CN109949796B (en) | End-to-end architecture Lasa dialect voice recognition method based on Tibetan component | |
CN109086865B (en) | Sequence model establishing method based on segmented recurrent neural network | |
CN108538285A (en) | A kind of various keyword detection method based on multitask neural network | |
CN108268442A (en) | Sentence intention prediction method and system | |
CN113051887A (en) | Method, system and device for extracting announcement information elements | |
CN111161703B (en) | Speech synthesis method and device with language, computing equipment and storage medium | |
CN110263345A (en) | Keyword extracting method, device and storage medium | |
CN112163410A (en) | Ancient text pre-training system based on deep learning and training method thereof | |
CN111553157A (en) | Entity replacement-based dialog intention identification method | |
Zulfiqar et al. | Logical layout analysis using deep learning | |
CN110276081A (en) | Document creation method, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20160615 |