TW201828281A

TW201828281A - Method and device for constructing pronunciation dictionary capable of inputting a speech acoustic feature of the target vocabulary into a speech recognition decoder

Info

Publication number: TW201828281A
Application number: TW106102713A
Authority: TW
Inventors: 王志銘; 李曉輝; 李宏言
Original assignee: 阿里巴巴集團服務有限公司
Priority date: 2017-01-24
Filing date: 2017-01-24
Publication date: 2018-08-01

Abstract

The invention discloses a method for constructing a pronunciation dictionary, which is used to solve the problem of poor quality of a pronunciation dictionary constructed according to the prior technologies. The method includes: inputting a speech acoustic feature of the target vocabulary into a speech recognition decoder, wherein the pronunciation dictionary in the speech recognition decoder includes: a target vocabulary and a candidate pronunciation phoneme sequence of the target vocabulary; determining, according to the candidate pronunciation phoneme sequence output by the speech recognition decoder, a probability distribution of the target vocabulary corresponding to the output candidate pronunciation phoneme sequence; selecting, according to the probability distribution, a pronunciation phoneme sequence that is a correct pronunciation of the target vocabulary from the output candidate pronunciation phoneme sequence; and constructing a pronunciation dictionary based on the correctly pronounced phoneme sequence of the pronunciation. The invention also discloses a construction device for a pronunciation dictionary.

Description

Method and device for constructing pronunciation dictionary

本發明關於計算機技術領域，尤其關於一種發音詞典的建構方法及裝置。 The invention relates to the field of computer technology, and in particular to a method and a device for constructing a pronunciation dictionary.

語音互動技術早在二十世紀中期就已經開始出現，近幾年隨著智慧型手機的普及，大量的語音互動產品相繼出現，語音互動產品走進了普通使用者的日常生活之中。例如，語音輸入法就是透過接收並識別使用者發出的語音，然後將使用者的語音轉換成文字，省去了打字的繁瑣輸入；來電報號功能可以將文字以語音的形式輸出，在使用者不看螢幕的情況下，即可獲知來電方身份。 Voice interactive technology has already appeared in the middle of the twentieth century. In recent years, with the popularization of smart phones, a large number of voice interactive products have appeared one after another, and voice interactive products have entered the daily life of ordinary users. For example, the voice input method is to receive and recognize the voice sent by the user, and then convert the voice of the user into a text, thereby eliminating the cumbersome input of typing; the caller number function can output the text in the form of voice, in the user You can get the identity of the caller without looking at the screen.

在語音互動技術中，發音詞典是語音互動系統中重要的組成部分，是連接聲學模型和語言模型之間的橋樑，其覆蓋面和發音品質對系統的整體性能具有重大的影響。 In the voice interaction technology, the pronunciation dictionary is an important part of the voice interaction system. It is the bridge between the acoustic model and the language model. Its coverage and pronunciation quality have a significant impact on the overall performance of the system.

發音詞典中包含詞和發音音素序列之間的映射關係，通常可以採用詞轉換為音素(Grapheme-to-Phoneme，G2P)方法建立該映射關係。一般情況下，發音詞典經過語言學相關方面的專家審核校正，規模大小相對固定，因此其不可能覆蓋所有的詞彙，從而在實際應用中，有可能會根據需要，利用G2P方法確定新增詞彙所匹配的發音音素序列，即確定新增詞彙的正確發音，進而根據新增詞彙和與其匹配的發音音素序列，對現有的發音詞典進行擴充。 The pronunciation dictionary contains the mapping relationship between the word and the pronunciation phoneme sequence, and the mapping relationship can usually be established by using the word conversion to the phoneme (Graphme-to-Phoneme, G2P) method. Under normal circumstances, the pronunciation dictionary is corrected by experts in linguistics related aspects, and the scale is relatively fixed, so it is impossible to cover all vocabulary. Therefore, in practical applications, it is possible to use G2P method to determine new vocabulary according to needs. The matched pronunciation phoneme sequence determines the correct pronunciation of the newly added vocabulary, and then expands the existing pronunciation dictionary according to the newly added vocabulary and the pronunciation phoneme sequence matched with it.

目前，採用G2P方法，基本能夠準確確定常規詞彙的正確發音。但是，對於一些特別的詞彙，比如包含多音字的詞彙，採用該方法確定出的詞彙的正確發音的準確度往往較低，從而影響發音詞典的品質。 At present, using the G2P method, it is basically possible to accurately determine the correct pronunciation of conventional vocabulary. However, for some special words, such as vocabulary containing polyphonic words, the accuracy of the correct pronunciation of the vocabulary determined by this method is often low, thus affecting the quality of the pronunciation dictionary.

本發明實施例提供一種發音詞典的建構方法，用以解決按照現有技術建構的發音詞典的品質較差的問題。 The embodiment of the invention provides a method for constructing a pronunciation dictionary, which is used to solve the problem of poor quality of a pronunciation dictionary constructed according to the prior art.

本發明實施例還提供一種發音詞典的建構裝置，用以解決按照現有技術建構的發音詞典的品質較差的問題。 The embodiment of the invention further provides a device for constructing a pronunciation dictionary for solving the problem of poor quality of a pronunciation dictionary constructed according to the prior art.

本發明實施例採用下述技術方案：一種發音詞典的建構方法，包括：將目標詞彙的語音聲學特徵，輸入語音識別解碼器；其中，所述語音識別解碼器中的發音詞典包括：目標詞彙和目標詞彙的候選發音音素序列；根據所述語音識別解碼器以所述語音聲學特徵作為輸入而輸出的候選發音音素序列，確定所述目標詞彙對應於輸出的候選發音音素序列的機率分佈；根據所述機率分佈，從所述輸出的候選發音音素序列中，選擇作為所述目標詞彙的正確發音的發音音素序列；根據所述正確發音的發音音素序列，建構發音詞典。 The embodiment of the present invention adopts the following technical solution: a method for constructing a pronunciation dictionary, comprising: inputting a speech acoustic feature of a target vocabulary into a speech recognition decoder; wherein the pronunciation dictionary in the speech recognition decoder includes: a target vocabulary and a candidate pronunciation phoneme sequence of the target vocabulary; determining, according to the candidate pronunciation phoneme sequence output by the speech recognition decoder with the speech acoustic feature as an input, determining a probability distribution of the target vocabulary corresponding to the output candidate pronunciation phoneme sequence; Describe the probability distribution, select a pronunciation phoneme sequence that is the correct pronunciation of the target vocabulary from the output candidate phoneme sequence, and construct a pronunciation dictionary according to the correctly pronounced pronunciation phoneme sequence.

一種發音詞典的建構裝置，包括：解碼單元：用於將目標詞彙的語音聲學特徵，輸入語音識別解碼器中；其中，所述語音識別解碼器中的發音詞典包括：目標詞彙和目標詞彙的的候選發音音素序列；發音確定單元：用於根據所述語音識別解碼器以所述語音聲學特徵作為輸入而輸出的候選發音音素序列，確定所述目標詞彙對應於輸出的候選發音音素序列的機率分佈；根據所述機率分佈，從所述輸出的候選發音音素序列中，選擇作為所述目標詞彙的正確發音的發音音素序列；詞典建構單元：用於根據所述正確發音的發音音素序列，建構發音詞典。 A device for constructing a pronunciation dictionary, comprising: a decoding unit: configured to input a speech acoustic feature of a target vocabulary into a speech recognition decoder; wherein the pronunciation dictionary in the speech recognition decoder comprises: a target vocabulary and a target vocabulary a candidate pronunciation phoneme sequence; a pronunciation determining unit: configured to determine, according to the candidate pronunciation phoneme sequence output by the speech recognition decoder with the speech acoustic feature as an input, determining a probability distribution of the target vocabulary corresponding to the output candidate phoneme sequence And selecting, according to the probability distribution, a pronunciation phoneme sequence that is a correct pronunciation of the target vocabulary from the output candidate phoneme sequence; a dictionary constructing unit: configured to construct a pronunciation according to the correctly pronounced pronunciation phoneme sequence dictionary.

本發明實施例採用的上述至少一個技術方案能夠達到以下有益效果：由於引入了待預測發音的目標詞的語音聲學特徵，作為預測詞彙正確發音的依據之一，從而相對於僅依靠詞彙和發音音素序列的映射關係來作為預測詞彙正確發音依據的現有技術而言，可以更為準確地預測目標詞彙正確發音，提升了基於確定出的正確發音建構的發音詞典的品質。 The above at least one technical solution adopted by the embodiment of the present invention can achieve the following beneficial effects: since the speech acoustic feature of the target word to be predicted is introduced, as one of the basis for predicting the correct pronunciation of the vocabulary, thereby relying on only the vocabulary and the pronunciation phoneme As a prior art for predicting the correct pronunciation of vocabulary, the mapping relationship of sequences can more accurately predict the correct pronunciation of the target vocabulary and improve the quality of the pronunciation dictionary based on the determined correct pronunciation construction.

21‧‧‧解碼單元 21‧‧‧Decoding unit

22‧‧‧發音確定單元 22‧‧‧ pronunciation determination unit

23‧‧‧詞典建構單元 23‧‧‧ dictionary construction unit

此處所說明的附圖用來提供對本發明的進一步理解，構成本發明的一部分，本發明的示意性實施例及其說明用於解釋本發明，並不構成對本發明的不當限定。在附圖中：圖1為本發明實施例提供的一種發音詞典的建構方法的實現流程示意圖；圖2為本實施例提供的一種發音詞典的建構裝置的具體結構示意圖。 The drawings are intended to provide a further understanding of the invention, and are intended to be a part of the invention. In the accompanying drawings: FIG. 1 is a schematic diagram showing an implementation flow of a method for constructing a pronunciation dictionary according to an embodiment of the present invention; FIG. 2 is a schematic structural diagram of a device for constructing a pronunciation dictionary according to an embodiment of the present invention.

為使本發明的目的、技術方案和優點更加清楚，下面將結合本發明具體實施例及相應的附圖對本發明技術方案進行清楚、完整地描述。顯然，所描述的實施例僅是本發明一部分實施例，而不是全部的實施例。基於本發明中的實施例，本領域普通技術人員在沒有做出創造性勞動前提下所獲得的所有其他實施例，都屬□本發明保護的範圍。 The technical solutions of the present invention will be clearly and completely described in conjunction with the specific embodiments of the present invention and the accompanying drawings. It is apparent that the described embodiments are only a part of the embodiments of the invention, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

以下結合附圖，詳細說明本發明各實施例提供的技術方案。 The technical solutions provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Example 1

現有的發音預測方法通常是基於G2P轉換的方法，G2P方法透過建立詞彙和發音音素之間的映射關係，將詞彙轉換為發音音素序列。採用G2P方法，基本能夠準確得到與常規詞彙匹配的發音音素序列，但是由於該方法只利用了詞彙(字序列)和發音音素的映射關係，因此對於一些特別的詞彙，比如包含多音字的詞彙，採用該方法確定出的與詞彙匹配的發音音素序列的準確度往往較低，從而影響發音詞典的品質。 The existing pronunciation prediction method is usually based on the G2P conversion method, and the G2P method converts the vocabulary into a pronunciation phoneme sequence by establishing a mapping relationship between vocabulary and pronunciation phonemes. Using the G2P method, the pronunciation phoneme sequence matching the regular vocabulary can be basically obtained accurately, but since the method only utilizes the mapping relationship between the vocabulary (word sequence) and the pronunciation phoneme, for some special words, such as words containing polyphonic words, The accuracy of the pronunciation phoneme sequence determined by the method is often lower, which affects the quality of the pronunciation dictionary.

為解決由於現有技術不能準確預測詞彙的正確發音從而影響發音詞典的品質的問題，本發明實施例1提供了一種發音詞典的建構方法。 In order to solve the problem that the prior art cannot accurately predict the correct pronunciation of the vocabulary and thus affect the quality of the pronunciation dictionary, Embodiment 1 of the present invention provides a method for constructing a pronunciation dictionary.

本發明實施例提供的發音詞典的建構方法的執行主體可以是伺服器也可以是不同於伺服器的其他設備，等等。所述的執行主體並不構成對本發明的限定，為了便於描述，本發明實施例均以執行主體是伺服器為例進行說明。 The execution body of the pronunciation dictionary construction method provided by the embodiment of the present invention may be a server or other device different from the server, and the like. The executor of the present invention is not limited to the present invention. For the convenience of description, the embodiments of the present invention are described by taking an execution subject as a server as an example.

為便於描述，在本實施例中，存在對應關係的詞彙和語音聲學特徵可以用詞彙-語音聲學特徵來表示。 For convenience of description, in the present embodiment, vocabulary and phonetic acoustic features having a correspondence relationship may be represented by vocabulary-speech acoustic features.

類似的，存在對應關係的詞彙(字序列)和音素序列，以及存在對應關係的語音聲學特徵和語音音素序列，也可用上述表示方式表示。例如，存在對應關係的詞彙和音素序列，可以用詞彙-語音音素序列來表示。 Similarly, a vocabulary (word sequence) and a phoneme sequence in which a correspondence exists, and a phonetic acoustic feature and a phoneme sequence in which a correspondence exists, can also be expressed by the above representation. For example, there are vocabulary and phoneme sequences corresponding to each other, which can be represented by a lexical-speech phoneme sequence.

以下對本發明實施例提供該方法進行詳細介紹。 The method of the present invention is provided in detail below.

該方法的實現流程示意圖如圖1所示，包括下述步驟：步驟11：伺服器將目標詞彙的語音聲學特徵，輸入到嵌入有發音詞典、聲學模型和語言模型的語音識別解碼器；本發明實施例中，所述的目標詞彙，可以是任何詞彙，比如中文詞彙、英文詞彙或者其他語言的詞彙。若針對語音識別解碼器中已有的發音詞典而言，所述的目標詞彙，可以是指該發音詞典當前不包含的詞彙，即相對於該發音詞典的新增詞彙。 The schematic diagram of the implementation process of the method is as shown in FIG. 1 , and includes the following steps: Step 11: The server inputs the speech acoustic characteristics of the target vocabulary into a speech recognition decoder embedded with a pronunciation dictionary, an acoustic model and a language model; In the embodiment, the target vocabulary may be any vocabulary, such as a Chinese vocabulary, an English vocabulary or a vocabulary in other languages. For the pronunciation dictionary existing in the speech recognition decoder, the target vocabulary may refer to a vocabulary that is not currently included in the pronunciation dictionary, that is, a new vocabulary relative to the pronunciation dictionary.

本發明實施例中所述的目標詞彙的語音聲學特徵，可以但不限於包括從說出該目標詞彙所產生的語音信號中提取出的濾波器組(Filter Bank)特徵、梅爾頻率倒頻譜係數(Mel Frequency Cepstrum Coefficient；MFCC)特徵以及感知線性預測(Perceptual Linear Predictive；PLP)特徵等等中的至少一種。 The speech acoustic feature of the target vocabulary described in the embodiment of the present invention may include, but is not limited to, a Filter Bank feature extracted from a speech signal generated by speaking the target vocabulary, and a Mel frequency cepstral coefficient. (Mel Frequency Cepstrum Coefficient; MFCC) features and at least one of Perceptual Linear Predictive (PLP) features and the like.

本發明實施例中，所述的語音信號，比如可以是根據目標詞彙對應的音頻樣本。 In the embodiment of the present invention, the voice signal may be, for example, an audio sample corresponding to the target vocabulary.

目標詞彙對應的音頻樣本，可以但不限於是採用下述方式中的至少一種獲得的：一、委託專業的語音資料供應商進行人工錄音，從而獲得目標詞彙對應的音頻樣本；二、採用群眾外包的形式，以使用者的真實使用感受和切身體驗為出發點，將錄音任務以自由自願的形式委託給非特定的(而且通常是大型的)網路大眾，從而獲得目標詞彙對應的音頻樣本；三、分析使用者反饋的記錄日誌，從而獲得目標詞彙對應的音頻樣本。例如，在語音搜索任務中，使用者先透過語音輸入目標詞彙，如果語音識別系統識別錯誤，使用者繼續透過鍵盤輸入正確的目標詞彙，這一系列的行為可以透過日誌的形式記錄下來。 The audio sample corresponding to the target vocabulary can be obtained by using at least one of the following methods: 1. Entrusting a professional voice data provider to perform manual recording to obtain an audio sample corresponding to the target vocabulary; In the form of the user's real use experience and personal experience, the recording task is freely and voluntarily entrusted to the non-specific (and usually large) Internet mass to obtain the audio sample corresponding to the target vocabulary; The log of the user feedback is analyzed to obtain an audio sample corresponding to the target vocabulary. For example, in a voice search task, the user first inputs the target vocabulary through voice. If the voice recognition system recognizes an error and the user continues to input the correct target vocabulary through the keyboard, the series of behaviors can be recorded in the form of a log.

本發明實施例中，可以從目標詞彙對應的音頻樣本中分別獲得語音聲學特徵，進而將獲得的各語音聲學特徵作為所述目標詞彙的語音聲學特徵，分別輸入所述語音識別解碼器。 In the embodiment of the present invention, the speech acoustic features may be respectively obtained from the audio samples corresponding to the target vocabulary, and the obtained speech acoustic features are respectively input as the speech acoustic features of the target vocabulary into the speech recognition decoder.

以下進一步介紹步驟11中提及的語音識別解碼器的工作原理。 The operation of the speech recognition decoder mentioned in step 11 is further described below.

一般地，語音識別解碼器，是用於針對輸入的語音信號(或語音聲學特徵)，根據聲學模型、語言模型及發音詞典，尋找能夠以最大機率發出該語音信號(或與該語音聲學特徵相匹配的語音信號)的詞的虛擬或者實體設備。 Generally, a speech recognition decoder is used to search for a speech signal (or a phonetic acoustic feature) for an input, based on an acoustic model, a language model, and a pronunciation dictionary, to generate the speech signal with maximum probability (or with the acoustical characteristics of the speech) A virtual or physical device of the word that matches the voice signal).

在語音識別領域，對語音信號進行解碼的目標，就是尋找字序列W ^*(相當於上文所述的“詞”)，使得對應的語音聲學特徵X似然機率最大化，實質上就是一個基於貝葉斯準則的機器學習問題，即利用貝葉斯公式來計算最佳字序列W ^*，如公式[1.1]所示： In the field of speech recognition, the goal of decoding a speech signal is to find a word sequence W ^* (equivalent to the "word" described above), so that the probability of the corresponding acoustical feature X is maximized, which is essentially based on The machine learning problem of the Bayesian criterion is to use the Bayesian formula to calculate the optimal word sequence W ^* , as shown in the formula [1.1]:

其中P(X|W _i)為聲學模型，P(W _i)為語言模型。 Where P ( X | W _i ) is the acoustic model and P ( W _i ) is the language model.

聲學模型，是字序列W _i的語音聲學特徵為X的機率。一般可以利用大量的資料(包括語音聲學特徵以及對應的標簽序列)訓練得到聲學模型。 The acoustic model is the probability that the speech acoustic characteristic of the word sequence W _i is X. Acoustic models can generally be trained using a large amount of data, including speech acoustic features and corresponding tag sequences.

語言模型，是詞彙對應的字序列W _i的出現機率。該出現機率的含義一般為：構成詞彙的各個字依照所述各個字在該詞彙中的排列順序依次出現的機率。 The language model is the probability of occurrence of the word sequence W _i corresponding to the vocabulary. The probability of occurrence is generally: the probability that each word constituting a vocabulary appears in sequence according to the order in which the respective words are arranged in the vocabulary.

考慮到字序列一般會對應的不同的發音音素序列，比如用帶不同地方口音發出某個詞彙(可由字序列表示)的發音可能對應不同的發音音素，又或者包含多音字的詞彙也有可能對應不同的發音音素，因此，若假設是字序列W _i對應的各發音音素序列，那麼公式[1.1]可變為： Considering the different pronunciation phoneme sequences that the word sequence generally corresponds to, for example, the pronunciation of a certain vocabulary (which can be represented by a word sequence) with different local accents may correspond to different pronunciation phonemes, or the words containing multi-tone words may also correspond to different Pronunciation phoneme, therefore, if assumed Is the pronunciation phoneme sequence corresponding to the word sequence W _i , then the formula [1.1] can be changed to:

其中，W _i為字序列；P(X|)為聲學模型；P(W _i)為語言模型；P(|W _i)為發音詞典中的詞彙(由字序列W _i表示)的發音音素序列為的機率。 Where W _i is a sequence of words; P ( X | ) is an acoustic model; P ( W _i ) is a language model; P ( | W _i ) is the pronunciation phoneme sequence of the vocabulary in the pronunciation dictionary (represented by the word sequence W _i ) The chance.

對於發音學習的問題，進一步假定字序列W _i和對應的語音聲學特徵X是已知的，則公式[1.2]的計算目標，可以轉換是為了尋找字序列W _i對應的最佳發音音素序列Q ^*。這樣，公式[1.2]進一步可變為： For the problem of pronunciation learning, further assuming that the word sequence W _i and the corresponding speech acoustic feature X are known, the calculation target of the formula [1.2] can be converted to find the best pronunciation phoneme sequence Q corresponding to the word sequence W _i . ^* . Thus, the formula [1.2] can be further changed to:

公式[1.3]中：Q ^*為使得公式[1.3]中等號右側的值最大的發音音素序列，也即字序列W _i對應的候選發音音素序列的機率分佈的最大值；W _i為字序列，i為詞彙的編號； X表示W _i對應的語音聲學特徵；Q表示發音音素序列；j為發音音素序列的編號；表示編號為i的詞彙對應的語音音素序列中的、編號為j的發音音素序列。 In the formula [1.3]: Q ^* is the maximum value of the probability phoneme sequence of the candidate phoneme sequence corresponding to the value of the right side of the formula [1.3], that is, the maximum value of the candidate phoneme sequence corresponding to the word sequence W _i ; W _i is a word sequence, i is the number of the vocabulary; X is the phonetic acoustic feature corresponding to W _i ; Q is the phonetic sequence of the pronunciation; j is the number of the phoneme sequence of the pronunciation; The pronunciation phoneme sequence numbered j in the speech phoneme sequence corresponding to the vocabulary number i .

P(X|)為聲學模型，即發音音素序列對應的語音聲學特徵為X的機率。 P ( X | Acoustic model The corresponding speech acoustic feature is the probability of X.

目前，相關的語音識別技術中用到的聲學模型一般是對隱馬爾科夫模型-深度神經網路(Hidden Markov Model-Deep Neural Network，HMM-DNN)的混合模型訓練得到的，或者也可以是對DNN模型訓練得到的。本發明實施例中，可以預先透過巨量語音聲學特徵，對HMM-DNN的混合模型或DNN模型進行訓練得到聲學模型，並設置在本發明實施例所述的該語音識別解碼器中。 At present, the acoustic model used in the related speech recognition technology is generally obtained by training a hybrid model of the Hidden Markov Model-Deep Neural Network (HMM-DNN), or it may be Trained on the DNN model. In the embodiment of the present invention, the hybrid model or the DNN model of the HMM-DNN can be trained to obtain an acoustic model through a huge amount of speech acoustic features, and is set in the speech recognition decoder described in the embodiment of the present invention.

P(W _i)為語言模型--本實施例中的語言模型可以是N-Gram模型，該模型基於這樣一種假設，第N個詞的出現只與前面N-1個詞相關，而與其它任何詞都不相關，整句的機率就是各個詞出現機率的乘積，各個詞出現的機率可以透過直接從語料中統計N個詞同時出現的次數得到。本實施例中的語言模型也可以是基於條件隨機域或者基於深度神經網路策略的語言模型。該語言模型可以預先產生並設置在本發明實施例所述的該語音識別解碼器中。 P ( W _i ) is a language model--the language model in this embodiment may be an N-Gram model based on the assumption that the appearance of the Nth word is only related to the previous N-1 words, and other No word is relevant. The probability of a whole sentence is the product of the probability of occurrence of each word. The probability of occurrence of each word can be obtained by counting the number of simultaneous occurrences of N words directly from the corpus. The language model in this embodiment may also be a language model based on a conditional random domain or a deep neural network strategy. The language model can be pre-generated and arranged in the speech recognition decoder described in the embodiment of the present invention.

P(|W _i)為基於給定的發音詞典中的詞彙(由字序列W _i表示)的發音音素序列為的機率。 P ( | W _i ) is the pronunciation phoneme sequence based on the vocabulary in the given pronunciation dictionary (represented by the word sequence W _i ) The chance.

這裡所說的發音詞典，比如可以為加入了目標詞彙對應的各個候選發音音素序列的發音詞典。 The pronunciation dictionary mentioned here may be, for example, a pronunciation dictionary in which each candidate phoneme sequence corresponding to the target vocabulary is added.

目標詞彙的候選發音音素序列，是指可能作為目標詞彙正確發音的發音音素序列。本發明實施例中，可以但不限於採用G2P方法，為目標詞彙產生發音音素序列(本發明實施例中稱“候選發音音素序列”)，並將所述目標詞彙和產生的各候選發音音素序列，加入到發音詞典中。 The candidate pronunciation phoneme sequence of the target vocabulary refers to the pronunciation phoneme sequence that may be correctly pronounced as the target vocabulary. In the embodiment of the present invention, the G2P method may be used to generate a pronunciation phoneme sequence (referred to as "candidate phoneme sequence" in the embodiment of the present invention) for the target vocabulary, and the target vocabulary and the generated candidate phoneme sequence are generated. , added to the pronunciation dictionary.

其中，將所述目標詞彙和產生的各候選發音音素序列，加入到發音詞典中，可以是指，將包含目標詞彙-候選發音音素序列的詞條，添加到發音詞典中。 The adding the target vocabulary and the generated candidate phoneme sequence to the pronunciation dictionary may refer to adding the term including the target vocabulary-candidate phoneme sequence to the pronunciation dictionary.

需要說明的是，當當前不存在發音詞典時，將所述詞條添加到發音詞典中，可以是指根據所述詞條建構發音詞典；當當前已存在發音詞典時，將所述詞條添加到發音詞典中，可以是指根據所述詞條對該已有的發音詞典進行更新，得到更新後的發音詞典。 It should be noted that when the pronunciation dictionary is not currently present, adding the term to the pronunciation dictionary may refer to constructing a pronunciation dictionary according to the term; when the pronunciation dictionary currently exists, adding the term In the pronunciation dictionary, the existing pronunciation dictionary may be updated according to the term to obtain an updated pronunciation dictionary.

為便於描述，本發明實施例中假設當前已存在發音詞典。在這樣的場景下，所述目標詞彙為相對於當前已存在的發音詞典而言的新增詞彙。 For convenience of description, it is assumed in the embodiment of the present invention that a pronunciation dictionary currently exists. In such a scenario, the target vocabulary is a new vocabulary relative to the currently existing pronunciation dictionary.

本實施例中，為目標詞彙產生的對應的候選發音音素序列的個數視實際情況而定。 In this embodiment, the number of corresponding candidate phoneme sequences generated for the target vocabulary depends on the actual situation.

如，採用G2P方法，可以為目標詞彙“阿裡巴巴”產生十個以上候選的發音音素序列。以該些發音音素序列中的某一個發音音素序列為例，其可以表示為“a1/li3/ba1/ba1/”。該發音音素序列中，符號“/”用於區分不同發音音素，即“/”前後的符號表示不同的音素。比如，a1和li3為不同音素。音素中的數字代表聲調，即1代表聲調一聲，2代表聲調二聲，3代表聲調三聲，4代表聲調四聲。 For example, using the G2P method, more than ten candidate phoneme sequences can be generated for the target vocabulary "Alibaba." Taking one of the pronunciation phoneme sequences as an example, it can be expressed as "a1/li3/ba1/ba1/". In the pronunciation phoneme sequence, the symbol "/" is used to distinguish different phonemes, that is, the symbols before and after the "/" indicate different phonemes. For example, a1 and li3 are different phonemes. The number in the phoneme represents the tone, that is, 1 represents a tone, 2 represents a tone, 3 represents a tone, and 4 represents a tone of four.

基於嵌入有上述發音詞典、公式[1.3]中所示的聲學模型P(X|)和語言模型P(W _i)的語音識別解碼器，本發明實施例中，將目標詞彙的語音聲學特徵輸入到該語音識別解碼器中，可以觸發該語音識別解碼器透過對語音樣本聲學特徵的解碼，輸出該語音樣本聲學特徵對應的發音音素序列。 Based on the acoustic model P ( X | embedded in the above pronunciation dictionary, as shown in the formula [1.3] And a speech recognition decoder of the language model P ( W _i ), in the embodiment of the present invention, inputting a speech acoustic feature of the target vocabulary into the speech recognition decoder, which can trigger the speech recognition decoder to transmit acoustic characteristics to the speech sample Decoding, outputting the pronunciation phoneme sequence corresponding to the acoustic feature of the speech sample.

以下進一步介紹本發明實施例提供的該方法包含的後續步驟。 The subsequent steps included in the method provided by the embodiments of the present invention are further described below.

步驟12：確定語音識別解碼器以步驟11中所述的語音聲學特徵作為輸入而輸出的候選發音音素序列；並根據目標詞彙對應於所述輸出的候選發音音素序列的統計規律，確定目標詞彙對應於輸出的候選發音音素序列的機率分佈；根據所述機率分佈，從所述輸出的候選發音音素序列中，選擇作為目標詞彙的正確發音的發音音素序列；比如，若假定目標詞彙T對應的候選發音音素序列有2個，分別為A1 A2和B1 B2，且它們被添加到語音識別解碼器包含的發音詞典中。進一步地，若假設採集到的T的音頻樣本有100個，從而可以獲得這100個音頻樣本各自的語音聲學特徵(共100個語音聲學特徵)，透過執行步驟11，將這100個語音聲學特徵分別輸入到嵌入發音詞典、聲學模型和語言模型的語音識別解碼器中。 Step 12: determining a candidate pronunciation phoneme sequence output by the speech recognition decoder with the speech acoustic feature described in step 11 as an input; and determining a target vocabulary corresponding according to a statistical rule of the target vocabulary corresponding to the output candidate pronunciation phoneme sequence a probability distribution of the candidate pronunciation phoneme sequence outputted; selecting, according to the probability distribution, a pronunciation phoneme sequence that is the correct pronunciation of the target vocabulary from the output candidate pronunciation phoneme sequence; for example, if a candidate corresponding to the target vocabulary T is assumed There are two pronunciation phoneme sequences, A1 A2 and B1 B2, respectively, and they are added to the pronunciation dictionary included in the speech recognition decoder. Further, if it is assumed that there are 100 audio samples of the collected T, so that the respective acoustic acoustic features of the 100 audio samples (a total of 100 speech acoustic features) can be obtained, by performing step 11, the 100 speech acoustic features are obtained. They are input into a speech recognition decoder embedded in a pronunciation dictionary, an acoustic model, and a language model, respectively.

那麼，語音識別解碼器對這100個語音聲學特徵進行識別解碼，可以輸出候選發音音素序列，如輸出A1、A2、B1、B2的組合。 Then, the speech recognition decoder identifies and decodes the 100 speech acoustic features, and can output a candidate pronunciation phoneme sequence, such as a combination of outputs A1, A2, B1, B2.

進一步地，假設根據設置於該語音識別解碼器中的發音詞典，確定目標詞彙對應於所述輸出的候選發音音素序列的統計規律為：這100個語音聲學特徵中：有75個語音聲學特徵是透過發音詞典的詞條“T-A1 A2”映射到T，有25個語音聲學特徵是透過發音詞典的詞條“T-B1 B2”映射到T。 Further, it is assumed that according to a pronunciation dictionary set in the speech recognition decoder, a statistical rule of determining a target vocabulary corresponding to the output candidate phoneme sequence is: among the 100 speech acoustic features: 75 speech acoustic features are The term "T-A1 A2" through the pronunciation dictionary is mapped to T, and 25 speech acoustic features are mapped to T through the term "T-B1 B2" of the pronunciation dictionary.

那麼，根據該統計規律，可以得到如下機率分佈：T對應於A1 A2的機率為75/100=0.75 Then, according to the statistical law, the following probability distribution can be obtained: the probability that T corresponds to A1 A2 is 75/100=0.75

T對應於B1 B2的機率為25/100=0.25 The probability that T corresponds to B1 B2 is 25/100=0.25

一般地，伺服器可以將所述機率分佈中的最大機率值對應的候選發音音素序列，確定為所述目標詞彙正確的發音的發音音素序列。 Generally, the server may determine the candidate phoneme sequence corresponding to the maximum probability value in the probability distribution as the pronunciation phoneme sequence of the correct pronunciation of the target vocabulary.

沿用上例，則伺服器可以將所述機率分佈中的最大機率值0.75對應的候選發音音素序列A1 A2，確定為T正確發音的發音音素序列。 Following the above example, the server may determine the candidate phoneme sequence A1 A2 corresponding to the maximum probability value of 0.75 in the probability distribution as the pronunciation phoneme sequence of T correctly.

步驟13：根據作為目標詞彙正確發音的發音音素序列，建構發音詞典。 Step 13: Construct a pronunciation dictionary based on the pronunciation phoneme sequence that is correctly pronounced as the target vocabulary.

具體地，伺服器比如可以從加入了目標詞彙對應的各個候選發音音素序列的發音詞典中，刪除除作為目標詞彙正確發音的發音音素序列外的、與該目標詞彙對應的其他候選發音音素序列。或者，伺服器也可以根據作為目標詞彙正確發音的發音音素序列，重新建構新的發音詞典。 Specifically, the server may delete other candidate phoneme sequences corresponding to the target vocabulary other than the pronunciation phoneme sequence that is correctly pronounced as the target vocabulary, for example, from the pronunciation dictionary in which each candidate phoneme sequence corresponding to the target vocabulary is added. Alternatively, the server may reconstruct a new pronunciation dictionary based on the pronunciation phoneme sequence that is correctly pronounced as the target vocabulary.

採用本發明實施例1提供的上述方法，由於引入了待預測發音的目標詞的語音聲學特徵，作為預測詞彙正確發音的依據之一，從而相對於僅依靠詞彙和發音音素序列的映射關係來做為預測詞彙正確發音依據的現有技術而言，可以更為準確地預測目標詞彙正確發音，從而提升了語音詞典的品質。 According to the above method provided by Embodiment 1 of the present invention, since the speech acoustic characteristics of the target word to be predicted are introduced, as one of the basis for predicting the correct pronunciation of the vocabulary, the mapping relationship between the vocabulary and the pronunciation phoneme sequence is performed. In the prior art for predicting the correct pronunciation of vocabulary, the correct pronunciation of the target vocabulary can be predicted more accurately, thereby improving the quality of the speech dictionary.

Example 2

為解決採用現有技術會導致與詞彙匹配的發音音素序列的準確性較低的問題，本發明實施例提供一種發音詞典的建構裝置。該詞彙發音預測裝置的結構示意圖如圖3所示，主要包括下述功能單元：解碼單元21，用於將目標詞彙的語音聲學特徵，輸入語音識別解碼器中；其中，語音識別解碼器中的發音詞典包括：目標詞彙和目標詞彙的的候選發音音素序列；發音確定單元22，用於根據所述語音識別解碼器以所述語音聲學特徵作為輸入而輸出的候選發音音素序列，確定所述目標詞彙對應於輸出的候選發音音素序列的機率分佈；根據所述機率分佈，從所述輸出的候選發音音素序列中，選擇作為所述目標詞彙的正確發音的發音音素序列；詞典建構單元23，用於根據所述正確發音的發音音素序列，建構發音詞典。 In order to solve the problem that the accuracy of the pronunciation phoneme sequence matching the vocabulary is low by using the prior art, the embodiment of the present invention provides a device for constructing a pronunciation dictionary. The schematic diagram of the vocabulary pronunciation prediction apparatus is shown in FIG. 3, and mainly includes the following functional unit: a decoding unit 21, configured to input a speech acoustic feature of a target vocabulary into a speech recognition decoder; wherein, in the speech recognition decoder The pronunciation dictionary includes: a candidate pronunciation phoneme sequence of the target vocabulary and the target vocabulary; the pronunciation determining unit 22 is configured to determine the target according to the candidate pronunciation phoneme sequence output by the speech recognition decoder with the speech acoustic feature as an input The vocabulary corresponds to a probability distribution of the output candidate phoneme sequence; according to the probability distribution, a pronunciation phoneme sequence that is the correct pronunciation of the target vocabulary is selected from the output candidate phoneme sequence; the dictionary constructing unit 23 A pronunciation dictionary is constructed based on the pronunciation phoneme sequence of the correct pronunciation.

在一種實施方式中，本發明實施例提供的該裝置還可以包括音素序列處理單元。該單元用於在於將目標詞彙的語音聲學特徵，輸入語音識別解碼器中前，獲得目標詞彙的候選發音音素序列；並將目標詞彙和獲得的候選發音音素序列，加入到所述語音識別解碼器中的發音詞典中。 In an embodiment, the apparatus provided by the embodiment of the present invention may further include a phoneme sequence processing unit. The unit is configured to obtain a candidate phoneme sequence of the target vocabulary before inputting the phonetic acoustic feature of the target vocabulary into the speech recognition decoder; and adding the target vocabulary and the obtained candidate phoneme sequence to the speech recognition decoder In the pronunciation dictionary.

在一種實施方式中，音素序列處理單元，具體可以用於利用G2P方法，獲得目標詞彙的候選發音音素序列。 In an embodiment, the phoneme sequence processing unit may be specifically configured to obtain a candidate phoneme sequence of the target vocabulary by using the G2P method.

在一種實施方式中，所述解碼單元21，具體可以用於採集目標詞彙對應的音頻樣本；根據所述音頻樣本，獲得所述語音聲學特徵；將獲得的所述語音聲學特徵，輸入所述語音識別解碼器中。 In an embodiment, the decoding unit 21 may be specifically configured to collect an audio sample corresponding to the target vocabulary; obtain the voice acoustic feature according to the audio sample; and input the obtained voice acoustic feature into the voice Identify the decoder.

在一種實施方式中，所述發音確定單元22，具體可以用於確定所述機率分佈中的最大機率值；從所述輸出的候選發音音素序列中，選擇所述最大機率值對應的候選發音音素序列，作為所述目標詞彙的正確發音的發音音素序列。 In an embodiment, the pronunciation determining unit 22 may be specifically configured to determine a maximum probability value in the probability distribution; and select, from the output candidate pronunciation phoneme sequence, a candidate pronunciation phoneme corresponding to the maximum probability value. A sequence, a phoneme sequence that is the correct pronunciation of the target vocabulary.

在一種實施方式中，所述詞典建構單元23，具體可以用於根據作為所述目標詞彙正確發音的發音音素序列，從加入了目標詞彙和獲得的候選發音音素序列的發音詞典中，刪除目標詞彙對應的、除所述正確發音的發音音素序列外的其他候選發音音素序列。 In an embodiment, the dictionary constructing unit 23 may be specifically configured to delete the target vocabulary from the pronunciation dictionary that adds the target vocabulary and the obtained candidate phoneme sequence according to the pronunciation phoneme sequence that is correctly pronounced as the target vocabulary. Corresponding to other candidate phoneme sequences other than the correctly pronounced pronunciation phoneme sequence.

採用本發明實施例2提供的上述裝置，由於引入了待預測發音的目標詞的語音聲學特徵，作為預測詞彙正確發音的依據之一，從而相對於僅依靠詞彙和發音音素序列的映射關係來做為預測詞彙正確發音依據的現有技術而言，可以更為準確地預測目標詞彙正確發音。 According to the above apparatus provided in Embodiment 2 of the present invention, since the speech acoustic characteristics of the target word to be predicted are introduced, as one of the basis for predicting the correct pronunciation of the vocabulary, the mapping relationship between the vocabulary and the pronunciation phoneme sequence is performed. In the prior art for predicting the correct pronunciation of vocabulary, the correct pronunciation of the target vocabulary can be predicted more accurately.

本領域內的技術人員應明白，本發明的實施例可提供為方法、系統、或計算機程式產品。因此，本發明可採用完全硬體實施例、完全軟體實施例、或結合軟體和硬體方面的實施例的形式。而且，本發明可採用在一個或多個其中包含有計算機可用程式代碼的計算機可用儲存媒體(包括但不限於磁碟記憶體、CD-ROM、光學記憶體等)上實施的計算機程式產品的形式。 Those skilled in the art will appreciate that embodiments of the present invention can be provided as a method, system, or computer program product. Thus, the present invention can take the form of a fully hardware embodiment, a fully software embodiment, or an embodiment combining soft and hardware aspects. Moreover, the present invention can take the form of a computer program product embodied on one or more computer usable storage media (including but not limited to disk memory, CD-ROM, optical memory, etc.) containing computer usable code therein. .

本發明是參照根據本發明實施例的方法、設備(系統)、和計算機程式產品的流程圖和/或方塊圖來描述的。應理解可由計算機程式指令實現流程圖和/或方塊圖中的每一流程和/或方塊、以及流程圖和/或方塊圖中的流程和/或方塊的結合。可提供這些計算機程式指令到通用計算機、專用計算機、嵌入式處理機或其他可程式化資料處理設備的處理器以產生一個機器，使得透過計算機或其他可程式化資料處理設備的處理器執行的指令產生用於實現在流程圖一個流程或多個流程和/或方塊圖一個方塊或多個方塊中指定的功能的裝置。 The present invention has been described with reference to flowchart illustrations and/or block diagrams of a method, a device (system), and a computer program product according to an embodiment of the invention. It will be understood that each flow and/or block of the flowcharts and/or <RTIgt; The computer program instructions can be provided to a processor of a general purpose computer, a special purpose computer, an embedded processor or other programmable data processing device to produce a machine for executing instructions by a processor of a computer or other programmable data processing device Means are generated for implementing the functions specified in one or more flows of the flowchart or in a block or blocks of the block diagram.

這些計算機程式指令也可儲存在能引導計算機或其他可程式化資料處理設備以特定方式工作的計算機可讀記憶體中，使得儲存在該計算機可讀記憶體中的指令產生包括指令裝置的製造品，該指令裝置實現在流程圖一個流程或多個流程和/或方塊圖一個方塊或多個方塊中指定的功能。 The computer program instructions can also be stored in a computer readable memory that can boot a computer or other programmable data processing device to operate in a particular manner, such that instructions stored in the computer readable memory produce an article of manufacture including the instruction device. The instruction means implements the functions specified in one or more blocks of a flow or a flow and/or block diagram of the flowchart.

這些計算機程式指令也可裝載到計算機或其他可程式化資料處理設備上，使得在計算機或其他可程式化設備上執行一系列操作步驟以產生計算機實現的處理，從而在計算機或其他可程式化設備上執行的指令提供用於實現在流程圖一個流程或多個流程和/或方塊圖一個方塊或多個方塊中指定的功能的步驟。 These computer program instructions can also be loaded onto a computer or other programmable data processing device to perform a series of operational steps on a computer or other programmable device to produce computer-implemented processing for use in a computer or other programmable device The instructions executed on the steps provide steps for implementing the functions specified in one or more flows of the flowchart or in a block or blocks of the flowchart.

以上所述僅為本發明的實施例而已，並不用於限制本發明。對於本領域技術人員來說，本發明可以有各種更改和變化。凡在本發明的精神和原理之內所作的任何修改、等同替換、改進等，均應包含在本發明的專利範圍之內。 The above description is only an embodiment of the present invention and is not intended to limit the present invention. It will be apparent to those skilled in the art that various modifications and changes can be made in the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

A method for constructing a pronunciation dictionary, comprising: inputting a speech acoustic feature of a target vocabulary into a speech recognition decoder; wherein the pronunciation dictionary in the speech recognition decoder comprises: a candidate for a target vocabulary and a target vocabulary a phoneme sequence; a probability pronunciation phoneme sequence output by the speech recognition decoder with the speech acoustic feature as an input, determining a probability distribution of the target vocabulary corresponding to the output candidate phoneme sequence; according to the probability distribution, the output is In the candidate pronunciation phoneme sequence, the pronunciation phoneme sequence which is the correct pronunciation of the target vocabulary is selected; and the pronunciation dictionary is constructed based on the correctly pronounced pronunciation phoneme sequence.

The method of claim 1, wherein the voice acoustic feature is input to the speech recognition decoder, the method further comprising: obtaining a candidate phoneme sequence of the target vocabulary; and selecting the target vocabulary and the obtained candidate The pronunciation phoneme sequence is added to the pronunciation dictionary in the speech recognition decoder.

The method of claim 2, wherein obtaining the candidate phoneme sequence of the target vocabulary comprises: converting the word into a phoneme G2P method to obtain a candidate phoneme sequence of the target vocabulary.

The method of claim 1, wherein the acoustic model embedded in the speech recognition decoder is obtained by training a deep neural network.

The method of claim 1, wherein the inputting the speech acoustic feature of the target vocabulary into the speech recognition decoder comprises: collecting an audio sample corresponding to the target vocabulary; and obtaining the acoustical acoustic feature according to the audio sample The obtained speech acoustic feature is input to the speech recognition decoder.

The method of claim 1, wherein, according to the probability distribution, selecting a pronunciation phoneme sequence that is the correct pronunciation of the target vocabulary from the output candidate phoneme sequence includes: determining the probability distribution The maximum probability value; from the output candidate phoneme sequence, the candidate phoneme sequence corresponding to the maximum probability value is selected as the pronunciation phoneme sequence of the correct pronunciation of the target vocabulary.

The method according to any one of claims 1 to 6, wherein the pronunciation pronunciation dictionary is constructed according to the correctly pronounced pronunciation phoneme sequence, comprising: joining according to the pronunciation phoneme sequence which is correctly pronounced as the target vocabulary In the pronunciation dictionary of the target vocabulary and the obtained candidate phoneme sequence, the candidate phoneme sequence corresponding to the correctly pronounced pronunciation phoneme sequence corresponding to the target vocabulary is deleted.

A device for constructing a pronunciation dictionary, comprising: a decoding unit: configured to input a speech acoustic feature of a target vocabulary into a speech recognition decoder; wherein the pronunciation dictionary in the speech recognition decoder comprises: a target vocabulary And a candidate pronunciation phoneme sequence of the target vocabulary; a pronunciation determining unit: configured to determine, according to the candidate pronunciation phoneme sequence output by the speech recognition decoder with the speech acoustic feature as an input, the target vocabulary corresponding to the output candidate pronunciation phoneme sequence a probability distribution; according to the probability distribution, a pronunciation phoneme sequence that is a correct pronunciation of the target vocabulary is selected from the output candidate phoneme sequence; a dictionary construction unit: configured to construct a pronunciation dictionary according to the correctly pronounced pronunciation phoneme sequence.

The device of claim 8, wherein the device further comprises: a phoneme sequence processing unit, configured to obtain a candidate phoneme of the target vocabulary before inputting the phonetic acoustic feature of the target vocabulary into the speech recognition decoder a sequence; and the target vocabulary and the obtained candidate phoneme sequence are added to the pronunciation dictionary in the speech recognition decoder.

The device of claim 9, wherein the phoneme sequence processing unit is specifically configured to: use a word conversion to a phoneme G2P method to obtain a candidate phoneme sequence of the target vocabulary.

The apparatus of claim 8, wherein the acoustic model embedded in the speech recognition decoder is obtained by training a deep neural network.

The device of claim 8, wherein: the decoding unit is configured to collect an audio sample corresponding to the target vocabulary; obtain the speech acoustic feature according to the audio sample; and input the obtained acoustic acoustic feature into the In the speech recognition decoder.

The device of claim 8, wherein the pronunciation determining unit is specifically configured to: determine a maximum probability value in the probability distribution; and select, from the output candidate phoneme sequence, the maximum probability value. The candidate pronunciation phoneme sequence, which is the pronunciation phoneme sequence of the correct pronunciation of the target vocabulary.

The device according to any one of the preceding claims, wherein the dictionary construction unit is specifically configured to: add a target vocabulary and obtain a candidate pronunciation according to a pronunciation phoneme sequence that is correctly pronounced as the target vocabulary In the pronunciation dictionary of the phoneme sequence, other candidate phoneme sequences corresponding to the correctly pronounced pronunciation phoneme sequence corresponding to the target vocabulary are deleted.