JP2014164261A

JP2014164261A - Information processor and information processing method

Info

Publication number: JP2014164261A
Application number: JP2013037690A
Authority: JP
Inventors: Shunsuke Sato; 俊介佐藤
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2013-02-27
Filing date: 2013-02-27
Publication date: 2014-09-08

Abstract

PROBLEM TO BE SOLVED: To appropriately reject, in pronunciation correction learning, utterances that differ only in the pronunciation correction part and match or are similar in the other parts. SOLUTION: A grammar creation unit 107 creates first to third phoneme sequence lists, each including at least one phoneme sequence and including mutually different phoneme sequences. A speech recognition unit 103 recognizes which of the first to third phoneme sequence lists contains the phoneme sequence to which input voice data corresponds. The second phoneme sequence list includes a phoneme sequence in which a predetermined subsequence of a phoneme sequence included in the first phoneme sequence list is replaced with a different phoneme subsequence. The third phoneme sequence list includes a phoneme sequence in which the replaced subsequence in the second phoneme sequence list is replaced with a further different phoneme subsequence.

Description

The present invention relates to information processing for evaluating the accuracy of a user's pronunciation.

In foreign language learning, pronunciation correction learning, in which the learner practices speaking with accurate pronunciation, is an important exercise. However, it is difficult for beginners to evaluate their own pronunciation, so they must rely on a specialized language instructor. Patent Document 1, for example, proposes a method that uses speech recognition technology to replace this with mechanical evaluation.

The method disclosed in Patent Document 1 performs two-word speech recognition between a word and a variant of that word in which the pronunciation requiring attention is replaced with a target phoneme, and evaluates the pronunciation according to whether the utterance is recognized as the correct one.

A speech recognition apparatus does not necessarily receive only valid utterances; fillers, irrelevant remarks by the user, and sudden noises may also be input. Presenting an evaluation result for such irrelevant sounds confuses the user, so irrelevant sounds must be rejected appropriately.

As a technique for rejecting irrelevant sounds other than the words to be recognized, a technique using a garbage model that represents all phonemes evenly is known (see Non-Patent Document 1). However, simply using a garbage model has the following problem.

First, since it is not known what the user will say, it is necessary to determine whether the speech itself matches garbage. However, when correcting words that differ greatly only in the pronunciation correction part and match or are similar in the other parts (for example, "late" and "rate"), the user may utter "date", "gate", or a filler such as "eeto" (a Japanese hesitation sound). Because the parts of such utterances other than the correction part match during speech recognition, they are likely to match "late" or "rate" better than garbage.

In pronunciation correction learning for "late" and "rate", displaying a pronunciation evaluation for "late" or "rate" without rejecting "date", "gate", or "eeto", which differ from the words whose pronunciation is to be corrected, makes the user distrust the system. In other words, when evaluating the accuracy of the pronunciation of a specific part of a word, it is desirable to reject an utterance in which only the parts not under evaluation match.

Patent Document 1: JP 2004-53652 A

Non-Patent Document 1: "Unnecessary Word Processing Method in Spoken Sentences Using Garbage HMM", IEICE Transactions A (Fundamentals / Boundaries), J77-A(2), pp. 215-222, February 25, 1994

An object of the present invention is to appropriately reject, in pronunciation correction learning, utterances that differ only in the pronunciation correction part and match or are similar in the other parts.

As one means for achieving the above object, the present invention has the following configuration.

Information processing according to the present invention creates first to third phoneme sequence lists, each including at least one phoneme sequence and including mutually different phoneme sequences, and performs speech recognition to determine which of the first to third phoneme sequence lists contains the phoneme sequence to which input voice data corresponds. The second phoneme sequence list includes a phoneme sequence in which a predetermined subsequence of a phoneme sequence included in the first phoneme sequence list is replaced with a subsequence of different phonemes. The third phoneme sequence list includes a phoneme sequence in which the replaced subsequence in the second phoneme sequence list is replaced with a subsequence of further different phonemes.

According to the present invention, in pronunciation correction learning, utterances that differ only in the pronunciation correction part and match or are similar in the other parts can be rejected appropriately. This rejection avoids displaying an inappropriate evaluation result and allows the user to be prompted to input the speech again.

FIG. 1 is a block diagram illustrating a configuration example of the information processing apparatus of the first embodiment.
FIG. 2 is a diagram showing transitions of the display screen of the display unit when the pronunciation correction learning function is used.
FIG. 3 is a diagram explaining configuration examples of the dictionary item DB and the problem DB.
FIG. 4 is a flowchart explaining the pronunciation learning correction process of the first embodiment.
FIG. 5 is a flowchart explaining the pronunciation learning correction process of the first embodiment.
FIG. 6 is a block diagram illustrating a configuration example of the information processing apparatus of the second embodiment.
FIG. 7 is a flowchart explaining the pronunciation learning correction process of the second embodiment.

Hereinafter, information processing according to embodiments of the present invention will be described in detail with reference to the drawings. The information processing apparatus described below has an electronic dictionary function and is, for example, a portable electronic dictionary device, or a portable terminal such as a smartphone or a tablet terminal on which an electronic dictionary application (AP) is installed.

[Device configuration]
FIG. 1 is a block diagram showing a configuration example of the information processing apparatus of the first embodiment.

The voice input unit 101 of the information processing apparatus includes a microphone, an analog-to-digital converter (ADC), a microprocessing unit (MPU), and the like. The voice input unit 101 inputs to the information processing apparatus voice data obtained by converting the sound collected by the microphone into an electrical signal with the ADC.

The voice recording unit 102 includes a memory such as a RAM, an MPU, and the like, and records the voice data input from the voice input unit 101. The speech recognition unit 103 includes an MPU and the like, and recognizes voice data as a predetermined word. The words that the speech recognition unit 103 can recognize are determined by a given grammar, and the unit has a function of selecting, as candidate words, a predetermined number of words in the vocabulary list that are judged to have high confidence.

The speech recognition unit 103 performs speech recognition using hidden Markov models (HMMs) with mel-frequency cepstral coefficients (MFCCs) as features. One HMM corresponds to each combination of a phoneme and the information on its preceding and following phoneme context. This combination is called a triphone, and the triphone of a phoneme P preceded by a phoneme L and followed by a phoneme R is written "L-P+R". At the beginning and end of a word, triphones are formed on the assumption that the silence phoneme "sil" is adjacent.
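The "L-P+R" notation and the "sil" padding at word boundaries can be illustrated with a minimal sketch. The function name `to_triphones` and the phoneme symbols "l", "ey", "t" are assumptions made for illustration only; the document itself records pronunciations in IPA.

```python
def to_triphones(phonemes, boundary="sil"):
    """Expand a phoneme sequence into 'L-P+R' triphone labels.

    The word is padded with the silence phoneme on both sides so that the
    first and last phonemes also get a left and right context.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Hypothetical phoneme symbols for the word "late".
print(to_triphones(["l", "ey", "t"]))
```

Each phoneme of the word yields exactly one triphone, so a word of n phonemes maps to n context-dependent HMMs.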

The speech recognition unit 103 also holds one garbage model (hereinafter, GBG), which is an HMM trained on all phonemes. The GBG is a model that represents unspecified phonemes and noise, and does not represent any specific phoneme. The GBG need not be held statically; it may be updated to adapt to the usage environment using, for example, ambient noise collected by the voice input unit 101.

The display unit 104 includes a liquid crystal panel, an MPU, and the like, and displays information to be presented to the user. The displayed information includes the word candidates selected by the speech recognition unit 103 and dictionary items (hereinafter, contents) such as the meaning of the word corresponding to a selected word. The display screen is not limited to a liquid crystal panel and may be, for example, a cathode ray tube or an organic electroluminescence (EL) panel.

The operation unit 105 includes a touch panel arranged on the display screen of the display unit 104, an MPU, and the like. The user inputs user instructions to the information processing apparatus by operating the touch panel of the operation unit 105 while referring to the display on the display unit 104. That is, the combination of the display unit 104 and the operation unit 105 provides a user interface (UI).

User instructions include an instruction to start uttering a word to be recognized and an instruction to select one of a plurality of word candidates. The device for inputting user instructions is not limited to a touch panel and may be, for example, a switch or a button.

The problem selection unit 106 includes an MPU and the like, and selects a problem for the user's pronunciation correction in response to an operation on the operation unit 105. A problem consists of one word that the user should pronounce (the correct word, described later) and one or more words whose pronunciation is easily confused with (hard to distinguish from) that of the correct word (contrast words, described later). The problem is selected from the problems stored in a problem database (DB) 111, described later.

The grammar creation unit 107 includes an MPU and the like, and creates a speech recognition grammar based on the problem selected by the problem selection unit 106. The created grammar is used by the speech recognition unit 103.

The phoneme clustering unit 108 includes an MPU and the like, and clusters similar phonemes. The model integration unit 109 includes an MPU and the like, and integrates the acoustic models of a plurality of phonemes to create a new acoustic model.

The dictionary item DB 110 is held in a nonvolatile memory such as a ROM or a flash memory, and stores the dictionary items, such as word meanings, that the information processing apparatus provides. Each dictionary item is associated with a headword, and the user can select a headword to look up and browse the corresponding dictionary item. The problem DB 111 is held in a nonvolatile memory such as a ROM or a flash memory, and stores the problems from which the problem selection unit 106 selects.

[Usage form]
First, the way in which the user uses the pronunciation correction learning function of the information processing apparatus will be described. FIG. 2 shows the transitions of the display screen of the display unit 104 when the pronunciation correction learning function is used.

In FIG. 2(A), reference numeral 201 denotes the housing of the information processing apparatus. Reference numeral 202 denotes the touch panel; the user inputs user instructions to the operation unit 105 by referring to the information displayed on the display unit 104 and, for example, touching a displayed button. Reference numeral 203 denotes the microphone of the voice input unit 101.

When the electronic dictionary function is enabled, for example by turning on the apparatus or starting the electronic dictionary AP, and the user further instructs execution of the pronunciation correction learning function, the display unit 104 displays a list 204 of problems (FIG. 2(A)). When the user touches the problem to be practiced, the display unit 104 displays the word to be uttered and a message prompting the utterance (FIG. 2(B)).

After touching the "start utterance" button 205, the user starts speaking. When the operation unit 105 detects a touch at the position corresponding to the button 205, the voice recording unit 102 records the voice data input from the voice input unit 101 in the memory.

The speech recognition unit 103 determines the end of the voice input using, for example, voice activity detection (VAD). When it determines that the voice input has ended, it performs speech recognition on the voice data recorded in the memory of the voice recording unit 102 and evaluates the utterance. The display unit 104 displays content according to the evaluation.

FIG. 2(C) is the display notifying the user of a correct pronunciation, shown when the user's utterance is evaluated as closest to the word to be pronounced ("late" in FIG. 2). FIG. 2(D) is the display notifying the user that the pronunciation is incorrect, shown when the utterance is evaluated as closest to a similar word (for example, "rate"). FIG. 2(E) is the display prompting the user to speak again (re-input voice data), shown when the utterance is evaluated as close to neither the word to be pronounced nor any similar word.
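The three outcomes of FIG. 2(C) to FIG. 2(E) amount to a three-way decision on the recognition result. The following is a minimal sketch under the assumption that the recognizer returns a single best-matching label; the function name `evaluate` and the result strings are hypothetical.

```python
def evaluate(recognized, correct_words, contrast_words):
    """Map the best-matching recognition result to one of three screens.

    Anything that is neither a correct word nor a contrast word (for
    example a rejection sequence or the garbage model) leads to a retry.
    """
    if recognized in correct_words:
        return "correct"      # screen of FIG. 2(C)
    if recognized in contrast_words:
        return "incorrect"    # screen of FIG. 2(D)
    return "retry"            # screen of FIG. 2(E)


print(evaluate("date", {"late"}, {"rate"}))
```

Treating "everything else" as a retry keeps the feedback conservative: the user only sees an evaluation when the utterance actually matched one of the words under study.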

[DB configuration]
A configuration example of the dictionary item DB 110 and the problem DB 111 will be described with reference to FIG. 3. Here, an example is shown in which the dictionary item DB 110 and the problem DB 111 constitute an English-Japanese dictionary, but they may constitute a dictionary of another language, an encyclopedia, or the like.

As shown in FIG. 3(A), the dictionary item DB 110 has the fields "item ID", "item heading", "phonetic notation", and "content", and holds a plurality of records (item records) in which the information of each field is recorded.

The "item ID" field records an identification number (item ID) uniquely assigned to each record of the dictionary item DB 110. The "item heading" field records the word (headword) that represents the item of the item record. In an English-Japanese dictionary, for example, the headword is an English word.

The "phonetic notation" field records a phoneme sequence indicating how to read the headword recorded in the "item heading" field. In an English-Japanese dictionary, for example, this is the phonetic symbol string of the English headword. As the item record with item ID = 0 in FIG. 3(A) shows, the "phonetic notation" field may hold, for example, four phonetic notations separated by the delimiter "/". That is, a single item record may hold a plurality of phonetic notations. The International Phonetic Alphabet (IPA) is used for the phonetic notation, but another system may be used. The speech recognition unit 103 performs speech recognition with reference to the phonetic notation.

The "content" field records information related to the headword of the item record. This is the information that the user browses; in an English-Japanese dictionary, for example, it is text, mainly in Japanese, that explains the meanings and usage examples of the English headword.
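The four fields of an item record, and the "/"-delimited phonetic notation field, might be modeled as follows. This is only a sketch: the class name `ItemRecord`, the attribute names, and the sample values are assumptions for illustration and are not taken from FIG. 3(A).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ItemRecord:
    item_id: int             # "item ID" field
    headword: str            # "item heading" field
    phonetic_notation: str   # raw field; variants separated by "/"
    content: str             # "content" field shown to the user

    def pronunciations(self) -> List[str]:
        """Split the raw field into its individual phonetic notations."""
        parts = self.phonetic_notation.split("/")
        return [p.strip() for p in parts if p.strip()]


# Hypothetical record with two pronunciation variants.
record = ItemRecord(0, "late", "leɪt/leːt", "(made-up content text)")
print(record.pronunciations())
```

Keeping the raw field and splitting on demand mirrors the stored format, so a record with one notation and a record with several are handled by the same code path.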

As shown in FIG. 3(B), the problem DB 111 consists of a plurality of problem records, each given a unique identification number (vocabulary list ID). A problem record is a word list whose entries each consist of a correct answer flag, a word notation (spelling), a phoneme sequence (phonetic symbol string), and correction location information.

The correct answer flag indicates which word in the word list should be pronounced correctly. A word whose correct answer flag is TRUE is called a "correct word", and a word whose flag is FALSE is called a "contrast word". A problem record contains at least one correct word and at least one contrast word.

The correction location information indicates the position, within the word, of the pronunciation to be corrected. For the word notation and for the phoneme sequence, the start and end of the range to be corrected are held as a pair of numbers giving the character position or phoneme position counted from the beginning. The phoneme corresponding to the phonetic symbols indicated by the correction location information in the phoneme sequence is called the "correction phoneme". The correction phoneme of a correct word is called the "correct phoneme", and the correction phoneme of a contrast word is called the "contrast phoneme". Note that the correct phoneme and the contrast phoneme are subsequences of a phoneme sequence consisting of one or more phonemes, and are not necessarily single phonemes.
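A problem record as described above could be represented by a small data structure. In this sketch the class and attribute names are hypothetical, the phoneme symbols are illustrative, and the spans are assumed to be 1-based and inclusive, which the text suggests ("counted from the beginning") but does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordEntry:
    is_correct: bool               # correct answer flag
    spelling: str                  # word notation
    phonemes: List[str]            # phoneme sequence
    spelling_span: Tuple[int, int] # assumed 1-based inclusive (start, end)
    phoneme_span: Tuple[int, int]  # assumed 1-based inclusive (start, end)

    def correction_phonemes(self) -> List[str]:
        """Return the subsequence marked as the correction phoneme."""
        start, end = self.phoneme_span
        return self.phonemes[start - 1:end]


@dataclass
class ProblemRecord:
    list_id: int                   # vocabulary list ID
    words: List[WordEntry]

    def correct_words(self) -> List[WordEntry]:
        return [w for w in self.words if w.is_correct]

    def contrast_words(self) -> List[WordEntry]:
        return [w for w in self.words if not w.is_correct]


late = WordEntry(True, "late", ["l", "ey", "t"], (1, 1), (1, 1))
rate = WordEntry(False, "rate", ["r", "ey", "t"], (1, 1), (1, 1))
problem = ProblemRecord(0, [late, rate])
print(problem.correct_words()[0].correction_phonemes())
```

Storing a span rather than a single index accommodates the case noted in the text where the correct or contrast phoneme is a subsequence of more than one phoneme.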

The dictionary item DB 110 and the problem DB 111 are created in advance and either stored in the nonvolatile memory of the information processing apparatus or incorporated into the data used to install the electronic dictionary AP. However, instead of creating problem records in advance, item records of the dictionary item DB 110 that satisfy given conditions may be collected as needed to create problem records, which are then stored in a memory such as a RAM. Creating problem records on demand reduces the required storage capacity of the nonvolatile memory, but increases the processing load when the electronic dictionary function is executed.

[Information processing]
The pronunciation learning correction process of the first embodiment will be described with reference to the flowcharts of FIGS. 4 and 5. The process is realized by the MPU of the information processing apparatus executing either software that implements the electronic dictionary function and is stored in the nonvolatile memory of the information processing apparatus shown in FIG. 1, or the electronic dictionary AP installed in that nonvolatile memory. In other words, the MPU functions as a control unit that controls the operation and processing of the configuration shown in FIG. 1.

When the pronunciation learning correction function is enabled, the MPU displays the problem records stored in the problem DB 111 on the display unit 104 as the list 204 of problems, as shown in FIG. 2(A) (S401), and waits for the position corresponding to a problem to be touched (S402). Hereinafter, for brevity, "the position on the touch panel corresponding to X is touched" is written as "X is touched".

When a problem is touched, the MPU regards that problem as selected and causes the phoneme clustering unit 108 to cluster phonemes according to the selected problem (S403).

● Clustering (S403)
Clustering is performed on the following triphones:
・ the triphone formed by the correct phoneme of the selected problem together with its preceding and following phoneme context in the correct word (correct phoneme triphone),
・ the triphone formed by the contrast phoneme of the selected problem together with its preceding and following phoneme context in the contrast word (contrast phoneme triphone),
・ the triphones obtained by replacing the central phoneme of each of the above triphones with every phoneme used by the speech recognition unit 103.

Here, each phoneme is mapped to a point in Euclidean space on the basis of the HMM corresponding to each triphone used by the speech recognition unit 103, and clustering is performed in that Euclidean space. Each HMM has s states, and each state is an N-dimensional single Gaussian distribution without covariance (that is, with a diagonal covariance matrix). For example, s = 3 and N = 26.

The mean m and variance σ of the Gaussian distribution of the i-th state of the HMM of phoneme X are expressed as follows.

$$ m_{Xi} = \begin{pmatrix} m_{Xi1} \\ \vdots \\ m_{Xij} \\ \vdots \\ m_{XiN} \end{pmatrix} \tag{1} $$

$$ \sigma_{Xi} = \begin{pmatrix} \sigma_{Xi1} \\ \vdots \\ \sigma_{Xij} \\ \vdots \\ \sigma_{XiN} \end{pmatrix} \tag{2} $$

The phoneme X is then mapped to the point in s×N-dimensional Euclidean space given by the following expression. The dimensionality may be reduced at this stage using a technique such as principal component analysis (PCA).

$$ p_X = \begin{pmatrix} m_{X11}/\sigma_{X11} \\ \vdots \\ m_{X1N}/\sigma_{X1N} \\ m_{X21}/\sigma_{X21} \\ \vdots \\ m_{X2N}/\sigma_{X2N} \\ \vdots \\ m_{Xs1}/\sigma_{Xs1} \\ \vdots \\ m_{XsN}/\sigma_{XsN} \end{pmatrix} \tag{3} $$
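The mapping of equation (3) concatenates, state by state, the ratio of each mean component to the corresponding variance component. A minimal sketch follows, assuming the HMM parameters are available as s lists of N floats each; the function name `hmm_to_point` and this input layout are assumptions for illustration.

```python
def hmm_to_point(means, variances):
    """Map an HMM to a point in s*N-dimensional Euclidean space.

    means and variances each hold s state vectors of N components; the
    result lists m/sigma for state 1, then state 2, and so on.
    """
    if len(means) != len(variances):
        raise ValueError("means and variances must have the same number of states")
    point = []
    for state_mean, state_var in zip(means, variances):
        if len(state_mean) != len(state_var):
            raise ValueError("state vectors must have the same dimension")
        point.extend(m / v for m, v in zip(state_mean, state_var))
    return point


# Toy example with s = 2 states and N = 2 dimensions.
print(hmm_to_point([[2.0, 4.0], [6.0, 9.0]], [[1.0, 2.0], [3.0, 3.0]]))
```

With s = 3 and N = 26, as in the example given in the text, each triphone HMM becomes a 78-dimensional point.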

The following distance is now introduced into the Euclidean space. The distance between a point P and a point Q is defined as

$$ d_C(P, Q) = \min\Bigl[ \{\, w(c_1)\,d(P, c_1) + w(c_2)\,d(Q, c_2) \mid c_1, c_2 \in C \,\} \cup \{\, d(P, Q) \,\} \Bigr] \tag{4} $$

where C is the set of points to which the correct phoneme triphone and the contrast phoneme triphones are mapped, d is the ordinary Euclidean distance, and w is a function from C to the non-negative real numbers.

For example, w is a function that gives 0.5 for the correct phoneme triphone and 1 for a contrast phoneme triphone. The function w acts as a weight that makes phonemes close to the correct phoneme more likely to fall into the same cluster than phonemes close to a contrast phoneme. The weighting is not limited to this; rather than being uniform for the correct phoneme and for the contrast phonemes, a different weight may be given to each element.

In other words, the distance d_C identifies all elements of the set C with one another and allows paths that pass through elements of C. Clustering is performed by the K-means method using the distance d_C.

Since the distance d_C between any two elements of the set C is 0, all elements of C belong to the same cluster. The cluster containing the set C is called the "correct cluster", and the elements of the correct cluster other than those of C are called "excluded phonemes". There may be no excluded phonemes.
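The distance of equation (4), and the property that any two elements of C are at distance 0, can be checked with a short sketch. The function name `d_c`, the representation of C as a list of points with a parallel list of weights, and the 2-dimensional toy points are assumptions for illustration. Because c1 and c2 are chosen independently, the minimum over pairs equals the sum of the two separate minima.

```python
import math


def d_c(p, q, anchors, weights):
    """Distance of equation (4).

    anchors: points of the set C; weights: w(c) for each anchor, same order.
    Returns the smaller of the direct Euclidean distance and the cheapest
    weighted route that enters C near p and leaves C near q.
    """
    direct = math.dist(p, q)
    if not anchors:
        return direct
    via_p = min(w * math.dist(p, c) for c, w in zip(anchors, weights))
    via_q = min(w * math.dist(q, c) for c, w in zip(anchors, weights))
    return min(direct, via_p + via_q)


# Toy set C: one correct-phoneme point (weight 0.5) and one contrast point (weight 1).
C = [(0.0, 0.0), (10.0, 0.0)]
W = [0.5, 1.0]
print(d_c((0.0, 0.0), (10.0, 0.0), C, W))  # two elements of C
print(d_c((1.0, 0.0), (9.0, 0.0), C, W))   # far apart, but each near an element of C
```

In the second call the direct distance is 8, yet the route through C costs only 0.5 × 1 + 1 × 1, which is how points near the correct or contrast phonemes are drawn into the correct cluster.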

This clustering serves as a means of judging the similarity between phonemes. That is, the HMMs of triphones classified into the same cluster are judged to represent similar sounds.

The number of HMM states s and the state dimensionality N are not limited to s = 3 and N = 26, and the method of mapping to a metric space is not limited to the one above. Each HMM state is also not limited to an N-dimensional single Gaussian distribution without covariance; it may be a Gaussian distribution with covariance, a Gaussian mixture distribution, or another type of distribution such as a Poisson distribution. Furthermore, the distance d_C used here is a proper distance that satisfies the axioms of a metric, but this is not required; a pseudo-distance such as the symmetrized Kullback-Leibler divergence may also be used.

Also, for the distance to the correct phoneme and the contrast phonemes, instead of correcting the distance by identifying the elements of the set C with one another, one may use, for example, the shortest distance to the figure enclosed by the elements of C as vertices (when C has only two elements, the line segment connecting them).

When the phoneme clustering is completed, the MPU causes the model integration unit 109 to integrate the acoustic models of each cluster obtained by the clustering in step S403, other than the correct cluster (S404).

● Acoustic model integration (S404)
The acoustic models are integrated simply by letting the HMM corresponding to one triphone in the cluster represent the cluster. The HMM of the triphone closest (in the distance d_C) to the centroid of the cluster determined by the K-means method in step S403 is regarded as the acoustic model of the cluster.
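Choosing the representative triphone of a cluster can be sketched as picking the member nearest to the cluster centroid. The names `centroid` and `representative` and the toy data are hypothetical, and the sketch defaults to the plain Euclidean distance as a simplification; the distance d_C described in the text could be passed in through the `distance` parameter instead.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def representative(cluster, distance=math.dist):
    """Return the triphone label whose point is closest to the centroid.

    cluster: dict mapping a triphone label to its point.
    distance: two-argument distance function (Euclidean by default).
    """
    center = centroid(list(cluster.values()))
    return min(cluster, key=lambda label: distance(cluster[label], center))


# Toy cluster of three triphones on a line; the centroid is at (2, 0).
toy = {"sil-d+ey": (0.0, 0.0), "sil-g+ey": (1.0, 0.0), "sil-b+ey": (5.0, 0.0)}
print(representative(toy))
```

Reusing an existing HMM as the cluster model avoids any retraining, at the cost of a coarser model than the parallel or mixture alternatives.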

A symbol that uniquely identifies the cluster is assigned to the integrated HMM. This symbol is treated as a single phoneme and is called the "integrated phoneme" of the cluster. As a triphone, the integrated phoneme takes the same preceding and following phoneme context as the representative triphone.

The aim is to reduce the number of HMMs that the speech recognition unit 103 must evaluate, and thereby the amount of computation for the clusters. Alternatively, the acoustic models may be integrated by building an HMM in which the phoneme HMMs are connected in parallel, or by building an HMM whose states are Gaussian mixtures formed by weighting each state of each phoneme HMM according to distance.

When integration of the acoustic models is complete, the MPU uses the grammar creation unit 107 to create a list of the phonemes that remain after the phonemes of the correct cluster (correct phonemes), the contrast phonemes, and the excluded phonemes are removed from the set of all phonemes (S405). The phonemes in this list are called "rejection phonemes". The GBG (garbage model) is also treated as a phoneme and added to the rejection phonemes.

Because excluded phonemes are close to the correct and contrast phonemes, a user who is still correcting their pronunciation may utter an excluded phoneme while intending to utter the correct or contrast phoneme. They are excluded here so that sounds close to the phonemes being corrected are not rejected.
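Step S405 amounts to a set difference followed by appending the garbage model. The sketch below shows this under the assumption that phonemes are represented as strings; the function name and the phoneme symbols in the usage note are hypothetical.

```python
GBG = "GBG"  # garbage model, treated as one more phoneme


def rejection_phonemes(all_phonemes, correct, contrast, excluded):
    """Return the rejection phonemes in the order they appear in all_phonemes."""
    keep_out = set(correct) | set(contrast) | set(excluded)
    result = [p for p in all_phonemes if p not in keep_out]
    result.append(GBG)
    return result
```

With all phonemes ["r", "l", "w", "s", "t"], correct phoneme "r", contrast phoneme "l" and excluded phoneme "w", the rejection phonemes are ["s", "t", "GBG"]: "w" is near enough to the target that uttering it should not trigger a rejection.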

Next, the MPU uses the grammar creation unit 107 to create a list of phoneme sequences in which the correct phoneme of the problem is replaced with a rejection phoneme and every other phoneme is replaced with the integrated phoneme of the cluster to which it belongs (S406). It then creates a grammar that contains, as words, the phoneme sequences in this list together with the correct word, the contrast word, and the GBG (S407). Phonemes belonging to the correct cluster are not replaced.

The words in the phoneme sequence list created first, together with the GBG, are used to determine that an utterance matches neither the correct word nor the contrast word; they are called "rejection words" here. In other words, words in which the phoneme under evaluation is replaced with every other possible phoneme are added to the speech recognition grammar as rejection words.
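The construction of rejection words in step S406 can be sketched as follows. This is one reading of the step, not a definitive implementation: it assumes a single target position per word, leaves the substituted rejection phoneme itself unmapped, and treats any phoneme missing from the cluster table as belonging to the correct cluster (and therefore not replaced). All names and phoneme symbols are hypothetical.

```python
def rejection_sequences(word, target_index, rejection_phonemes, cluster_symbol):
    """Build one rejection phoneme sequence per rejection phoneme.

    word: list of phonemes of the correct word.
    target_index: position of the phoneme under evaluation.
    cluster_symbol: maps a phoneme to the integrated phoneme of its cluster;
        phonemes absent from the map are kept as they are.
    """
    sequences = []
    for rej in rejection_phonemes:
        seq = []
        for i, phoneme in enumerate(word):
            if i == target_index:
                seq.append(rej)  # replace the evaluated phoneme
            else:
                seq.append(cluster_symbol.get(phoneme, phoneme))
        sequences.append(seq)
    return sequences
```

For a word ["r", "ay", "t"] with target index 0, rejection phonemes ["s", "GBG"] and cluster table {"ay": "C1", "t": "C2"}, the result is [["s", "C1", "C2"], ["GBG", "C1", "C2"]].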

Each word in the grammar is expressed as a sequence of triphones, and speech recognition is performed with the corresponding HMMs. For the GBG and the integrated phonemes, a triphone for which no corresponding HMM has been determined may appear in the grammar. In that case, the model integration unit 109 creates a substitute HMM by integrating, from the phonemes in the integrated phoneme's cluster, the triphones that have the same preceding and following context as the required triphone. For the GBG, a substitute HMM is created in the same way by integrating the triphones of all phonemes.

Next, the MPU waits for the user to touch the "voice input" button 205 on the touch panel (operation unit 105) (S408). When the "voice input" button 205 is touched, the MPU starts voice input by the voice input unit 101 and recording of the voice data by the voice recording unit 102 (S409), and determines the end of the voice input by voice activity detection (VAD) or the like (S410). For example, speech and non-speech HMMs are prepared and their likelihoods are compared in the VAD: the input is judged to be speech if the speech HMM has the higher likelihood and non-speech if the non-speech HMM has the higher likelihood, and the end of the voice input is determined from this. The VAD is not limited to this method; for example, the input may be judged to be speech when the energy of the audio signal exceeds a specific threshold.
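The energy-threshold variant of the VAD can be sketched as below. The framing, the mean-square energy measure, and the rule that the input ends after a run of consecutive non-speech frames following speech are assumptions of this example; the specification only states that a frame is judged to be speech when its energy exceeds a threshold.

```python
def frame_energies(samples, frame_len):
    """Mean-square energy of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def input_ended(samples, frame_len, threshold, hangover_frames):
    """True once speech has been seen and is followed by enough silent frames."""
    seen_speech = False
    silent_run = 0
    for energy in frame_energies(samples, frame_len):
        if energy > threshold:
            seen_speech = True
            silent_run = 0
        elif seen_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return True
    return False
```

In a loop corresponding to steps S409 and S410, recording would continue while input_ended returns False for the samples recorded so far.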

Step S409 is repeated until the voice input ends. When the voice input ends, the MPU uses the speech recognition unit 103 to recognize the recorded voice data (S411) and determines whether the recognition result is closest to the correct word, the contrast word, or a rejection word described in the grammar created in step S407 (S412).

Next, if the determined word is the correct word, the MPU displays on the display unit 104 a notification that the user pronounced the word correctly (FIG. 2(C)) (S413). If it is the contrast word, the MPU displays on the display unit 104 a notification that the user's utterance was recognized as the contrast word, together with a hint for pronouncing the word correctly (FIG. 2(D)) (S414), and ends the process.

If the MPU determines that the word is a rejection word, it displays on the display unit 104 a message prompting the user to speak again (FIG. 2(E)) (S415) and returns the process to step S408. That is, a pronunciation that differs greatly from both the correct word and the contrast word is rejected, which avoids displaying an inappropriate result, and the user is prompted to input the utterance again.
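The branching in steps S412 to S415 can be sketched as a three-way classification followed by a lookup of the feedback to display. The type, function names and message strings below are illustrative assumptions; only the step labels and the rule that a rejection leads back to step S408 come from the description.

```python
from enum import Enum


class Verdict(Enum):
    CORRECT = "correct"
    CONTRAST = "contrast"
    REJECT = "reject"


def classify(recognized_word, correct_words, contrast_words):
    """Map the recognized grammar word to one of the three outcomes (S412)."""
    if recognized_word in correct_words:
        return Verdict.CORRECT
    if recognized_word in contrast_words:
        return Verdict.CONTRAST
    return Verdict.REJECT  # any rejection word, including GBG


def feedback(verdict):
    """Return (step label, message, retry flag); retry means return to S408."""
    if verdict is Verdict.CORRECT:
        return ("S413", "You pronounced the word correctly.", False)
    if verdict is Verdict.CONTRAST:
        return ("S414", "Your utterance was recognized as the contrast word.", False)
    return ("S415", "Please speak again.", True)
```

Treating every grammar word that is neither correct nor contrast as a rejection keeps the displayed evaluation limited to cases where the utterance was close to one of the two words being compared.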

In this way, a first phoneme sequence list consisting of the phoneme sequences of the correct word is created. A second phoneme sequence list, for the contrast word, is created; it contains phoneme sequences in which a predetermined subsequence of a phoneme sequence in the first list is replaced with a different phoneme subsequence. A third phoneme sequence list, for the rejection words, is created; it contains phoneme sequences in which the replaced subsequence in the second list is replaced with yet another phoneme subsequence. It is then determined which phoneme sequence in the first to third lists the recognition result of the voice data corresponds to. As a result, utterances that differ only in the part targeted for pronunciation correction, and that match or resemble the correct word elsewhere, are appropriately rejected; display of an inappropriate result is avoided, and the user can be prompted to input the voice again.

Information processing according to Embodiment 2 of the present invention is described below. In Embodiment 2, components substantially the same as those in Embodiment 1 are given the same reference numerals, and their detailed description is omitted.

Embodiment 1 described an example in which the grammar is created dynamically when the pronunciation correction learning function is executed. However, when the number of phonemes is large, or when a problem with many correct words and contrast words is selected, the processing load increases and processing may slow down. With this in mind, Embodiment 2 creates a grammar for each problem in advance and stores it.

The block diagram of FIG. 6 shows a configuration example of the information processing apparatus of Embodiment 2. It differs from the configuration of FIG. 1 in that the grammar creation unit 107, the phoneme clustering unit 108, and the model integration unit 109 are replaced with a grammar DB 112 and a grammar selection unit 113.

The grammar DB 112 stores the grammars that the speech recognition unit 103 uses for recognition. The grammars are created in advance for each problem by the same procedure as described in Embodiment 1. The grammar selection unit 113, implemented by the MPU or the like, searches the grammars stored in the grammar DB 112 and retrieves the one corresponding to the problem selected by the user.

The pronunciation correction learning process of Embodiment 2 is described with reference to the flowchart of FIG. 7. It differs from the process shown in FIG. 4 in that steps S403 to S407 are replaced with a process (S421) in which the grammar selection unit 113 selects and retrieves the grammar corresponding to the selected problem from the grammar DB 112. The other processing is substantially the same as in FIGS. 4 and 5. As before, the retrieved grammar is used by the speech recognition unit 103 in step S412.

Creating the grammars in advance in this way reduces the processing load when the pronunciation correction learning function is executed.
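The relationship between the grammar DB 112 and the grammar selection unit 113 can be sketched as a keyed store that is filled offline and queried at run time (S421). The class, the method names, and the use of a problem identifier as the key are assumptions of this example; the specification does not describe the storage format.

```python
class GrammarDB:
    """In-memory stand-in for the grammar DB: one prebuilt grammar per problem."""

    def __init__(self):
        self._grammars = {}

    def store(self, problem_id, grammar):
        """Register a grammar created in advance for a problem."""
        self._grammars[problem_id] = grammar

    def select(self, problem_id):
        """Return the stored grammar for the selected problem.

        Raises KeyError if no grammar was prepared for this problem.
        """
        return self._grammars[problem_id]
```

At run time only the lookup is performed, so the clustering, model integration and grammar creation of steps S403 to S407 are not repeated for each session.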

[Other Embodiments]
The present invention can also be realized by executing the following process: software (a program) that implements the functions of the embodiments described above is supplied to a system or apparatus via a network or various storage media, and a computer (or a CPU, MPU, or the like) of the system or apparatus reads and executes the program.

Claims

1. An information processing apparatus comprising:
creating means for creating first to third phoneme sequence lists, each including at least one phoneme sequence, the lists including mutually different phoneme sequences; and
recognizing means for recognizing, by speech recognition, which of the first to third phoneme sequence lists includes the phoneme sequence to which input voice data corresponds,
wherein the second phoneme sequence list includes a phoneme sequence in which a predetermined subsequence of a phoneme sequence included in the first phoneme sequence list is replaced with a different phoneme subsequence, and
the third phoneme sequence list includes a phoneme sequence in which the replaced subsequence in the second phoneme sequence list is replaced with a further different phoneme subsequence.

2. The information processing apparatus according to claim 1, further comprising user interface means for allowing a user to select a word for which the voice data is to be input and for displaying an evaluation based on the result of the speech recognition.

3. The information processing apparatus according to claim 2, wherein the user interface means displays a prompt to re-input the voice data when the result of the speech recognition indicates a phoneme sequence included in the third phoneme sequence list.

4. The information processing apparatus according to claim 1, wherein the first phoneme sequence list includes a phoneme sequence corresponding to the word selected by the user,
the second phoneme sequence list includes a phoneme sequence corresponding to a word whose pronunciation is difficult to distinguish from that of the selected word, and
the third phoneme sequence list includes a phoneme sequence different from both the phoneme sequence corresponding to the selected word and the phoneme sequence corresponding to the word whose pronunciation is difficult to distinguish.

5. The information processing apparatus according to any one of claims 1 to 3, wherein the third phoneme sequence list includes at least all phoneme sequences obtained by replacing the subsequence with each single phoneme, taken from all phonemes recognized by the recognizing means.

6. The information processing apparatus according to claim 1, wherein the recognizing means recognizes noise, and
the creating means includes in the third phoneme sequence list a phoneme sequence in which the subsequence is replaced with the recognized noise regarded as a phoneme.

7. The information processing apparatus according to claim 6, wherein the recognizing means uses, for each phoneme, an acoustic model for each combination of preceding and following phonemes,
the apparatus further comprising acoustic model creating means for creating, from the acoustic models of the phonemes, an acoustic model for the case where the phoneme regarded as the noise is preceded and followed by other phonemes.

8. The information processing apparatus according to claim 1, further comprising: determination means for determining similarity between phonemes; and
exclusion means for excluding from the third phoneme sequence list a phoneme sequence whose subsequences before and after the replacement are similar.

9. The information processing apparatus according to claim 8, wherein the determination means clusters all phonemes recognized by the recognizing means and determines that phonemes included in the same cluster are similar.

10. The information processing apparatus according to claim 9, wherein the determination means performs the clustering on points obtained by mapping the phonemes to a predetermined metric space.

11. The information processing apparatus according to claim 10, wherein the determination means uses, for the clustering, a distance obtained by regarding as identical the points to which phonemes belonging to the replaced subsequence in the first phoneme sequence list are mapped and the points to which phonemes belonging to the replaced subsequence in the second phoneme sequence list are mapped.

12. The information processing apparatus according to claim 10, wherein the determination means uses, for the clustering, distances obtained by applying different weights to the points to which phonemes belonging to the replaced subsequence in the first phoneme sequence list are mapped and to the points to which phonemes belonging to the replaced subsequence in the second phoneme sequence list are mapped.

13. The information processing apparatus according to claim 10, wherein the metric space is a Euclidean space, and the determination means uses, for the clustering, the distance to a line segment connecting the centroid of at least one point to which a phoneme belonging to the replaced subsequence in the first phoneme sequence list is mapped and the centroid of at least one point to which a phoneme belonging to the replaced subsequence in the second phoneme sequence list is mapped.

14. The information processing apparatus according to claim 8, further comprising integration means for creating, from the acoustic models of a plurality of phonemes, a new acoustic model representing the plurality of phonemes,
wherein the integration means creates the new acoustic model from the acoustic models of the phonemes included in a cluster formed by the determination means, and
the recognizing means uses the new acoustic model in place of the acoustic models of the phonemes included in the cluster.

15. An information processing method for an information processing apparatus having creating means and recognizing means, the method comprising:
creating, by the creating means, first to third phoneme sequence lists, each including at least one phoneme sequence, the lists including mutually different phoneme sequences; and
recognizing, by the recognizing means using speech recognition, which of the first to third phoneme sequence lists includes the phoneme sequence to which input voice data corresponds,
wherein the second phoneme sequence list includes a phoneme sequence in which a predetermined subsequence of a phoneme sequence included in the first phoneme sequence list is replaced with a different phoneme subsequence, and
the third phoneme sequence list includes a phoneme sequence in which the replaced subsequence in the second phoneme sequence list is replaced with a further different phoneme subsequence.

16. A program for causing a computer to function as each means of the information processing apparatus according to any one of claims 1 to 14.

17. A computer-readable recording medium storing the program according to claim 16.