JP2006126730A

JP2006126730A - Method and system for optimizing phoneme unit set

Info

Publication number: JP2006126730A
Application number: JP2004318208A
Authority: JP
Inventors: Jinsong Zhang; 勁松張; Soong Frank; フランク・スーン; Satoru Nakamura; 哲中村
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2004-11-01
Filing date: 2004-11-01
Publication date: 2006-05-18
Anticipated expiration: 2024-11-01
Also published as: JP4631076B2

Abstract

PROBLEM TO BE SOLVED: To make it possible to optimize a phoneme basic unit set for a specific ASR (automatic speech recognition) task.

SOLUTION: A method for minimizing the phoneme basic unit set for the specific ASR task includes the steps of: preparing (100) a basic unit set in a machine-readable format; creating (102) a plurality of basic unit subsets by applying a leave-one-out method to the basic unit set; computing (104) a prescribed measure of linguistic discrimination power for each of the basic unit subsets; replacing (106, 108, 112) the basic unit set with the basic unit subset that has the highest linguistic discrimination power; and repeating (110) the creating, computing, and replacing steps until a prescribed criterion is satisfied.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to automatic speech recognition (ASR), and more particularly to the optimization of phoneme unit sets, such as phoneme sets, used in ASR.

ASR is an essential tool for man-machine interaction. With ASR, a computer can understand an operator's commands given in natural language, so the operator does not need to learn a complex command system for the computer.

FIG. 6 shows the basic ASR mechanism. Referring to FIG. 6, an ASR system 162 decodes input speech X 160 and outputs the recognized (decoded) words ^W 164 (the symbol "^" properly belongs above the letter W) using the following equation 166.
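The equation image did not survive in this text. A form consistent with the definitions given for P(X|W) and P(W), and with the standard statistical formulation of ASR, would be the following; it is a reconstruction, not a copy of equation 166 as filed.

```latex
\hat{W} = \arg\max_{W} P(X \mid W)\, P(W)
```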

Here, P(X|W) denotes the acoustic model probability and P(W) denotes the language model probability. These models are built using a lexicon that lists the words of the target language together with their phoneme sequences. The phonemes are selected from a predetermined basic phoneme set.

Large vocabulary continuous speech recognition (LVCSR) systems use widely accepted phoneme sets.

This raises the question of whether a simple LVCSR task and a more complex LVCSR task should use the same phoneme set. In small-vocabulary tasks such as digit recognition, words such as the digits themselves are used as the basic units. Similarly, for a simple LVCSR task it may be advantageous to use a simple phoneme set.

In many ASR studies, phoneme sets determined by several heuristic methods are tried, and one set is selected on the basis of ASR recognition performance.

A phoneme set that contains more units provides more phonetically discriminative information. However, it also means relying on finer acoustic differences. In speech recognition, the robustness of the acoustic model (AM) tends to decrease when finer or smaller acoustic differences must be modeled.

When a phoneme set contains fewer units, each phoneme AM usually has more training data than it would with a larger phoneme set. Moreover, when there are few phonemes, the differences between them tend to be larger than the differences among many phonemes. As a result, the AM can become more robust as the phoneme set gets smaller. Reducing the phoneme set size, however, raises another problem: discriminative power in the linguistic space is lost. For example, if the Japanese long vowel "A" and short vowel "a" are merged into a single vowel, confusion between words will increase.
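The loss of discriminative power can be made concrete with a small sketch. The toy lexicon, the phoneme symbols, and the function name `homophone_groups` are all hypothetical and chosen only for illustration; they do not come from the patent.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> phoneme sequence.
# "A" stands for the long vowel and "a" for the short vowel.
LEXICON = {
    "obasan": ["o", "b", "a", "s", "a", "N"],   # "aunt"
    "obaasan": ["o", "b", "A", "s", "a", "N"],  # "grandmother"
    "kado": ["k", "a", "d", "o"],               # "corner"
    "kaado": ["k", "A", "d", "o"],              # "card"
    "neko": ["n", "e", "k", "o"],               # "cat"
}


def homophone_groups(lexicon, merge=None):
    """Return groups of words whose pronunciations coincide after merging phonemes."""
    merge = merge or {}
    groups = defaultdict(list)
    for word, phones in lexicon.items():
        # Map each phoneme through the merge table; unmapped phonemes stay as-is.
        key = tuple(merge.get(p, p) for p in phones)
        groups[key].append(word)
    return sorted(sorted(words) for words in groups.values() if len(words) > 1)


before = homophone_groups(LEXICON)             # no merging
after = homophone_groups(LEXICON, {"A": "a"})  # long vowel merged into short vowel
```

With the full phoneme set every word in this toy lexicon is distinct; once "A" is merged into "a", two pairs of words become indistinguishable by pronunciation alone.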

In this regard, state-of-the-art ASR optimization proceeds from the following idea. The equation above can be rewritten in the following form.
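The rewritten equation is also missing from this text. Assuming that the speech X depends on the words W only through the unit sequence F, the decomposition named in the following paragraph (P(X|F), P(F|W), P(W)) would take the form sketched here. Whether the original sums or maximizes over F cannot be recovered from the text, so the summation is an assumption.

```latex
\hat{W} = \arg\max_{W} \sum_{F} P(X \mid F)\, P(F \mid W)\, P(W)
```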

Here, F denotes the basic unit sequence. P(X|F) is the subject of robust acoustic modeling, a dominant research topic; P(F|W) is the subject of pronunciation modeling, a topic attracting attention; and P(W) is the subject of language modeling, a prominent topic. In many cases, F is built from a phoneme set.

In the prior art, however, although there have been some heuristic attempts to compare the use of various basic unit sets, almost no work has optimized the basic unit set in consideration of the ASR framework as a whole, in particular its probabilistic formulation.

Accordingly, one object of the present invention is to provide a method and apparatus for optimizing the basic unit set for a specific ASR task.

Another object of the present invention is to provide a method and apparatus for optimizing the phoneme set for a specific ASR task.

According to one aspect of the present invention, a method for optimizing a phoneme unit set of a predetermined language causes a computer to execute the steps of: preparing a basic unit set in a computer-readable format; generating a plurality of basic unit subsets by applying a leave-one-out method to the basic unit set; calculating a predetermined measure of linguistic discriminatory power for each of the basic unit subsets; replacing the basic unit set with the basic unit subset that has the highest linguistic discriminatory power; and repeating the generating, calculating, and replacing steps until a predetermined criterion is satisfied.

Preferably, the calculating step includes calculating the mutual information between the basic unit set and each of the basic unit subsets.

More preferably, the replacing step includes replacing the basic unit set with the basic unit subset that has the highest value of the mutual information calculated in the calculating step.

Still more preferably, the basic unit set is a basic phoneme set for the predetermined language.

According to another aspect of the present invention, a system for optimizing a unit set of a predetermined language includes: storage means for storing a basic unit set in a computer-readable format; generating means for generating a plurality of basic unit subsets by applying a leave-one-out method to the basic unit set; calculating means for calculating a predetermined measure of linguistic discriminatory power for each of the basic unit subsets; replacing means for replacing the basic unit set stored in the storage means with the basic unit subset that has the highest linguistic discriminatory power; and control means for controlling the storage means, the generating means, the calculating means, and the replacing means so that they operate repeatedly until a predetermined criterion is satisfied.

In ASR, there are two means of distinguishing between two words. One is pronunciation; the other is word context, that is, the language model (LM). When a pair of words is hard to distinguish by acoustic score, as with homophones or near-homophones, contextual word information makes the distinction easier. For example, 「橋」 ("bridge") and 「箸」 ("chopsticks") clearly occur in different contexts.

Based on the above discussion, this embodiment proposes an optimal design of the phoneme set for a specific ASR task, that is, task-based phoneme design. The basic idea is that if deleting one phoneme from a large phoneme set does not greatly reduce linguistic discriminatory power, that phoneme may be deleted to reduce the phoneme set size.

This embodiment adopts phoneme set design based on the maximum mutual information (MI) criterion. That is, MI is used as the measure of the linguistic discriminatory power of a basic unit subset. The embodiment concerns the design of an optimized phoneme set for Chinese.
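The text does not spell out how the mutual information is estimated, so the following is only one plausible sketch under stated assumptions: words follow a unigram distribution estimated from corpus counts, and the pronunciation F' under a reduced set is a deterministic function of the pronunciation F under the full set, in which case I(F; F') equals the entropy H(F'). The function name, the toy lexicon, and the tonal symbols such as `a1` and `a3` are hypothetical.

```python
import math
from collections import Counter


def pronunciation_entropy_bits(word_counts, lexicon, merge=None):
    """Entropy in bits of word pronunciations after applying a phoneme merge map.

    ASSUMPTION (not from the patent): unigram word probabilities, and a reduced
    pronunciation F' that is a deterministic function of the full pronunciation F,
    so that I(F; F') = H(F').
    """
    merge = merge or {}
    total = sum(word_counts.values())
    reduced = Counter()
    for word, count in word_counts.items():
        # Words whose merged pronunciations coincide fall into the same class.
        key = tuple(merge.get(p, p) for p in lexicon[word])
        reduced[key] += count
    return -sum((c / total) * math.log2(c / total) for c in reduced.values())


# Hypothetical toy data: four tonal syllables, each seen once.
LEXICON = {"ma1": ["m", "a1"], "ma3": ["m", "a3"],
           "mi1": ["m", "i1"], "mi3": ["m", "i3"]}
COUNTS = {"ma1": 1, "ma3": 1, "mi1": 1, "mi3": 1}

full_bits = pronunciation_entropy_bits(COUNTS, LEXICON)
merged_bits = pronunciation_entropy_bits(COUNTS, LEXICON, {"a3": "a1"})
```

Merging `a3` into `a1` makes two of the four words indistinguishable, so the retained information drops. A real estimate would presumably use the language model and the decoding setup of FIG. 2 rather than unigram counts.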

The basic unit set Φ matters in two specific respects: it defines the primary partition of the entire acoustic space, and it provides important cues for partitioning the linguistic space.

FIG. 1 illustrates the intuitive effect of two different basic unit sets, Φ_1 = {f_1, f_2, …, f_N} and Φ_2 = {p_1, p_2, …, p_M}. Referring to FIG. 1, assume that the number of phonemes M in Φ_2 is much larger than the number of phonemes N in Φ_1 (that is, N << M). Φ_1 divides acoustic space 20 into N subspaces f_1, f_2, …, f_N, while Φ_2 divides the same acoustic space 22 into smaller subspaces p_1, p_2, …, p_M. Φ_1 can therefore provide a robust acoustic model, but its discriminatory power is weaker than that of Φ_2.

FIG. 2 shows the overall configuration for unit set training in this embodiment. Referring to FIG. 2, the training system includes a state-of-the-art ASR system 40 for training, a storage unit 42 for the language model, and a lexicon-based decoding system 44.

The training ASR system 40 includes a speech generation and ASR module 50 for converting input text W into a phoneme sequence F, and a word lattice scoring module 52 for selecting the most probable word text ^W among the decoded word texts in the word lattice formed from the phoneme sequence F, by scoring each path of the lattice with reference to language model 42.

The lexicon-based decoding system 44 includes: dictionaries 62 and 64, which describe each headword using phoneme sets Φ_1 and Φ_2, respectively; a lexicon-based conversion module 60 for converting the input text W into phoneme sequences F_1 and F_2 using dictionaries 62 and 64, respectively; and a word lattice scoring module 66 for selecting the most probable word texts W_1 and W_2 among the word texts in the word lattices formed from phoneme sequences F_1 and F_2, by scoring each path of the lattice with reference to language model 42. For brevity, FIG. 2 shows only two dictionaries. This embodiment concerns a Chinese ASR system; phoneme set Φ_1 includes tone information, whereas phoneme set Φ_2 does not.

The training ASR system 40 receives a corpus of training text W and outputs the decoded words ^W according to the following maximization formula.

The phoneme set that maximizes the probability P(W|F) is selected as the optimal phoneme set ^Φ. That is,
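The selection formula itself is missing from this text. Restating the preceding sentence in symbols, with F understood as the unit sequence produced under a candidate set Φ, gives the following; this is a hedged reconstruction of the sentence, not the equation as filed.

```latex
\hat{\Phi} = \arg\max_{\Phi} P(W \mid F)
```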

Through the operation of the training ASR system 40 and the lexicon-based decoding system 44, the elements of P(W|F) can be computed according to the above equation, and the optimal phoneme set ^Φ can be selected.

FIG. 3 shows the overall structure of the phoneme set optimization system 80 of this embodiment. Referring to FIG. 3, the phoneme set optimization system 80 includes a storage device for a basic unit set 90, a storage device for training text 92, and a phoneme set optimization module 96 that optimizes the phoneme set using basic unit set 90 and training text 92 and outputs an optimized phoneme set 94.

The phoneme set optimization module 96 can be implemented as software executed on a computer. FIG. 4 shows the control flow of the software as a flowchart. Referring to FIG. 4, the phoneme set optimization module 96 executes the following steps. The working phoneme set Φ is initialized with the initial phoneme set Φ_0 (that is, basic unit set 90) (step 100). Phoneme subsets Φ_i are generated, where i runs from 1 to the number of elements in Φ, Φ_i = Φ − {e_i}, and e_i is the i-th phoneme in Φ (step 102). The mutual information MI_i between the working set Φ and each subset Φ_i is calculated (step 104). The index M that satisfies the following equation is identified (step 106).
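The equation for step 106 is missing from this text. The later description of step 106 states that M is the index whose mutual information is the largest among the MI_i, which corresponds to the following reconstruction:

```latex
M = \arg\max_{1 \le i \le |\Phi|} \mathrm{MI}_i
```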

The M-th phoneme subset Φ_M is then selected, and the lexicon and text corpus are rebuilt using the phonemes in the selected subset Φ_M (step 108). In this rebuilding, the lexicon and text corpus are updated so that each occurrence of the deleted phoneme is merged with its nearest phoneme.
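Step 108 can be sketched as a simple relabeling. How the "nearest" phoneme is determined is not stated in the text, so this sketch takes it as a given input; the function names, the `nearest` table, and the tonal symbols are hypothetical.

```python
def merge_phoneme(sequence, removed, target):
    """Replace every occurrence of the removed phoneme with its merge target."""
    return [target if p == removed else p for p in sequence]


def rebuild_lexicon(lexicon, removed, nearest):
    """Rewrite all pronunciations after deleting one phoneme (sketch of step 108).

    `nearest` maps a deleted phoneme to the phoneme it is merged into; it is an
    assumed input, since the text does not say how nearness is measured.
    """
    target = nearest[removed]
    return {word: merge_phoneme(phones, removed, target)
            for word, phones in lexicon.items()}


# Hypothetical example: the third-tone vowel "a3" is deleted and merged into "a1".
lexicon = {"ma1": ["m", "a1"], "ma3": ["m", "a3"], "mi3": ["m", "i3"]}
rebuilt = rebuild_lexicon(lexicon, "a3", {"a3": "a1"})
```

The same `merge_phoneme` helper could be applied to each phoneme-level sentence of the text corpus. Building a new dictionary instead of modifying the old one keeps the previous lexicon available for comparing mutual information before and after the merge.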

The phoneme set optimization module 96 further executes a step of determining whether a predetermined stop condition is satisfied (step 110). If the condition is satisfied, the module stops. Otherwise, control proceeds to step 112, where the working set Φ is replaced with the selected subset Φ_M, and control then returns to step 102.

The operation stops after a predetermined number of iterations. Alternatively, it can stop when the decrease in mutual information exceeds a predetermined threshold.
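The two stop conditions can be sketched as a single predicate. The text does not say whether the decrease in mutual information is measured against the initial set or against the previous iteration; this sketch assumes the initial set. The function and parameter names are hypothetical.

```python
def should_stop(iteration, mi_start, mi_current,
                max_iterations=None, max_mi_drop_bits=None):
    """Return True when either configured stop condition holds (sketch of step 110).

    ASSUMPTION: the MI decrease is measured relative to the initial unit set.
    """
    if max_iterations is not None and iteration >= max_iterations:
        return True  # predetermined number of iterations reached
    if max_mi_drop_bits is not None and (mi_start - mi_current) > max_mi_drop_bits:
        return True  # mutual information has dropped by more than the threshold
    return False
```

For example, `should_stop(2, 10.0, 9.0, max_mi_drop_bits=0.5)` returns `True` because the drop of 1.0 bit exceeds the 0.5-bit threshold.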

The phoneme set optimization module 96 operates as follows. First, in step 100, basic unit set 90 is selected as the working set Φ. In step 102, phoneme subsets Φ_1 through Φ_N are generated. Each subset Φ_i is generated by removing phoneme e_i from the working set Φ. In other words, Φ_i is generated by applying the leave-one-out method to the working set Φ.
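Leave-one-out subset generation (step 102) is short enough to show directly. The function name and the example symbols are hypothetical.

```python
def leave_one_out_subsets(units):
    """Return one subset per unit, each with exactly that unit removed."""
    units = list(units)
    return [units[:i] + units[i + 1:] for i in range(len(units))]


subsets = leave_one_out_subsets(["a", "i", "u"])
```

A set of N units yields N subsets of N − 1 units each, so every iteration of the reduction evaluates as many candidates as there are units left.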

In step 104, the mutual information MI_i between the working set Φ and each of the subsets Φ_1 through Φ_N is calculated. In step 106, the index M whose mutual information MI_M is the largest among the MI_i is selected.

In step 108, the M-th phoneme subset Φ_M is selected, and the lexicon and text corpus are rebuilt using the selected phoneme subset Φ_M.

In step 110, it is determined whether the stop condition is satisfied. If it is not, control proceeds to step 112, where Φ is replaced with Φ_M. Control then returns to step 102, and steps 102 through 108 are repeated. When the stop condition is satisfied, the operation stops.
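Taken together, steps 100 through 112 amount to a greedy backward elimination. The following is a minimal sketch of that control flow, not the patented implementation: the scoring function is passed in as a parameter, the stop criterion is assumed to be a target set size, the lexicon rebuilding of step 108 is omitted, and the toy weights are invented for illustration.

```python
from typing import Callable, List, Sequence


def optimize_unit_set(
    units: Sequence[str],
    score: Callable[[List[str], List[str]], float],
    target_size: int,
) -> List[str]:
    """Greedily drop the unit whose removal keeps the score highest."""
    working = list(units)  # step 100: start from the basic unit set
    # Step 110 (ASSUMED criterion: target size), checked at the top of the loop.
    while len(working) > max(target_size, 1):
        # Step 102: leave-one-out subsets of the working set.
        subsets = [working[:i] + working[i + 1:] for i in range(len(working))]
        # Step 104: score each subset against the working set.
        scores = [score(working, subset) for subset in subsets]
        # Step 106: index of the best-scoring subset (first one wins on ties).
        best = max(range(len(subsets)), key=scores.__getitem__)
        # Step 108 (rebuilding the lexicon and corpus) is omitted in this sketch.
        working = subsets[best]  # step 112: replace the working set
    return working


# Hypothetical stand-in for mutual information: each unit carries a fixed weight.
WEIGHTS = {"a": 3.0, "b": 1.0, "c": 4.0, "d": 2.0}


def toy_score(working: List[str], subset: List[str]) -> float:
    return sum(WEIGHTS[u] for u in subset)


result = optimize_unit_set(["a", "b", "c", "d"], toy_score, target_size=2)
```

With the toy weights, the loop first drops `b` and then `d`, the two units that contribute least. In the embodiment the score would be the mutual information between the working set and each subset, which changes after every merge, so it has to be recomputed on each pass as this loop does.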

Thus, starting from a large initial unit set 90 based on a detailed phoneme classification, the phoneme set optimization module 96 can iteratively reduce the phoneme set according to some criterion.

FIG. 5 shows the results of a verification experiment for this embodiment. In this experiment, an original set of 203 units containing tone information is reduced. A set of 59 units without tone information was used for comparison. Both sets are widely used in state-of-the-art Chinese ASR systems. The verification text corpus contains 1,614 short sentences totaling 9,484 words.

Referring to FIG. 5, compared with the 59-unit toneless set C (indicated by box 132), the original 203-unit tonal set has higher mutual information, measured in bits. In the reduction process indicated by line 130, the generated unit set with the same number of units, 59, retained higher mutual information than the toneless set, as shown by point A in FIG. 5. In other words, generated set A has better linguistic discriminatory power than the conventional 59-unit toneless set despite having the same number of units. The unit set at point B in FIG. 5 retains almost the same mutual information as toneless set C with far fewer units (47 versus 59), and is therefore more efficient than set C.

As described above, the system and method of this embodiment can successfully reduce the number of phonemes in a phoneme set without reducing mutual information. If task-specific text is used for training, the phoneme set can be optimized for that task, and with that phoneme set a robust acoustic model with sufficient discriminatory power for the task can be obtained. A language model with sufficiently detailed discriminatory power can also be provided.

Although the above embodiment optimizes a phoneme set, the present invention is not limited to the optimization of phoneme sets. It is applicable to the optimization of any basic unit set that can take the place of a phoneme set in ASR. For example, when the vocabulary is relatively small, the unit set may be the set of words (word pronunciations) in the vocabulary.

The embodiment disclosed herein is merely illustrative, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each claim, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a diagram showing the intuitive effect of different basic unit sets.
FIG. 2 is a diagram showing the overall configuration of unit set training in this embodiment.
FIG. 3 is a diagram illustrating the overall structure of the phoneme set optimization system 80 of this embodiment.
FIG. 4 is a diagram showing the control flow of the software that implements the phoneme set optimization module 96 of this embodiment.
FIG. 5 is a graph showing the results of the verification experiment of this embodiment.
FIG. 6 is a diagram showing a basic ASR scheme according to the prior art.

Explanation of symbols

40 Training ASR system
42 Language model
44 Lexicon-based decoding system
50 ASR module
52 Word lattice scoring module
60 Lexicon-based conversion module
62, 64 Dictionaries
66 Word lattice scoring module
80 Phoneme set optimization system
90 Basic unit set
92 Training text
94 Optimized phoneme set
96 Phoneme set optimization module

Claims

1. A method for optimizing a phoneme unit set of a predetermined language, comprising the steps of:
preparing a basic unit set in a computer-readable format;
generating a plurality of basic unit subsets by applying a leave-one-out method to the basic unit set;
calculating a predetermined measure of linguistic discriminatory power for each of the basic unit subsets;
replacing the basic unit set with the basic unit subset having the highest linguistic discriminatory power; and
repeating the generating step, the calculating step, and the replacing step until a predetermined criterion is satisfied.

2. The method according to claim 1, wherein the calculating step includes calculating mutual information between the basic unit set and each of the basic unit subsets.

3. The method according to claim 2, wherein the replacing step includes replacing the basic unit set with the basic unit subset having the highest value of the mutual information calculated in the calculating step.

4. The method according to claim 1, wherein the basic unit set is a basic phoneme set for the predetermined language.

5. A system for optimizing a unit set of a predetermined language, comprising:
storage means for storing a basic unit set in a computer-readable format;
generating means for generating a plurality of basic unit subsets by applying a leave-one-out method to the basic unit set;
calculating means for calculating a predetermined measure of linguistic discriminatory power for each of the basic unit subsets;
replacing means for replacing the basic unit set stored in the storage means with the basic unit subset having the highest linguistic discriminatory power; and
control means for controlling the storage means, the generating means, the calculating means, and the replacing means so that they operate repeatedly until a predetermined criterion is satisfied.