JP2010266716A

JP2010266716A - Voice recognition device, and method and program of the same

Info

Publication number: JP2010266716A
Application number: JP2009118361A
Authority: JP
Inventors: Akio Jin; 明夫神; Hirokazu Masataki; 浩和政瀧; Satoshi Takahashi; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-05-15
Filing date: 2009-05-15
Publication date: 2010-11-25

Abstract

PROBLEM TO BE SOLVED: To convert input speech into a recognition result that is easy for the user to understand, and to output it.

SOLUTION: In this speech recognition device, phoneme information, which is the acoustic feature of each phoneme, is stored in advance in an acoustic model storage section. A recognition dictionary storage section stores in advance, in association with one another, a word, the reading of the word, the difficulty level of the word, and words that have the same meaning as the word but a different difficulty level. Linguistic features are stored in advance in a language model storage section. A speech analysis section determines the acoustic features of the input speech information; similar phoneme information is retrieved from the acoustic model storage section using those acoustic features; and words are retrieved from the recognition dictionary storage section using the one or more retrieved pieces of phoneme information. A pre-conversion recognition result is estimated from the retrieved words and the linguistic features in the language model storage section. The recognition dictionary storage section is then consulted and, from among the words having the same meaning as each word of the pre-conversion recognition result, the word whose difficulty level is closest to a target difficulty level is selected.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a speech recognition apparatus, a speech recognition method, and a speech recognition program that recognize the content of speech information using an acoustic model and a language model and obtain that content as text data.

Apparatuses that perform speech recognition using an acoustic model and a language model are known in the prior art. FIG. 1 shows a configuration example of a conventional speech recognition apparatus 20. The speech recognition apparatus 20 receives speech via a speech input unit 21. A speech analysis unit 23 analyzes the input speech, and a search unit 25 uses an acoustic model storage unit 27 and a language model storage unit 31 to estimate which word sequence appears, and outputs the word sequence with the highest probability as the speech recognition result. In the word sequence estimation using the language model storage unit 31, word sequences are formed by arranging words registered in a recognition dictionary storage unit 29.

Patent Document 1 is known as a method that associates the same output characters with a variety of recognized phrases in a list, so that the varied recognized phrases are all displayed as the same output characters.

Japanese Patent Laid-Open No. 2005-309065

In the prior art, when every word in the spoken sentence is registered in the recognition dictionary storage unit and speech recognition works ideally, uttering a difficult word (for example, 「遺憾の意」, "an expression of regret") yields recognition text that is equally difficult. Because the purpose of conventional speech recognition is to transcribe what was spoken accurately and exactly as spoken, it did not matter if the recognition result was difficult text. However, when the purpose is to transcribe what was spoken in a form the user can easily understand, the problem arises of converting the input speech into a recognition result that is easy for the user to understand and outputting it.

To solve the above problem, the speech recognition technology of the present invention assumes the following. Phoneme information, which is the acoustic feature of each phoneme, is stored in advance in an acoustic model storage unit. A recognition dictionary storage unit stores in advance, in association with one another, a word, the reading of the word, the difficulty level of the word, and words that have the same meaning as the word but a different difficulty level. Linguistic features are stored in advance in a language model storage unit. The acoustic features of the speech information are obtained from the speech information; similar phoneme information is searched for in the acoustic model storage unit using the acoustic features obtained by the speech analysis unit; words are searched for in the recognition dictionary storage unit using the one or more pieces of phoneme information found; a pre-conversion recognition result is estimated using the one or more words found and the linguistic features in the language model storage unit; and, with reference to the recognition dictionary storage unit, the word closest to a target difficulty level is selected from among the words having the same meaning as each word of the pre-conversion recognition result.

By providing a conversion unit, the present invention has the effect that a recognition result that is easy for the user to understand can be obtained.

A diagram showing a configuration example of the conventional speech recognition apparatus 20.
A diagram showing a configuration example of the speech recognition apparatus 100.
A diagram showing an example of the processing flow of the speech recognition apparatus 100.
A diagram showing an example of data stored in the recognition dictionary storage unit 129.
A diagram showing an example of the processing flow of the conversion unit 110.
A diagram in which (A) shows an example of a pre-conversion recognition result containing words of difficulty level 3; (B) shows the post-conversion recognition result when (A) is the pre-conversion recognition result and difficulty level 2 is set as the difficulty level selection information; (C) shows the post-conversion recognition result when (A) is the pre-conversion recognition result and difficulty level 1 is set as the difficulty level selection information; (D) shows the post-conversion recognition result when (A) is the pre-conversion recognition result, difficulty level 1 is set as the difficulty level selection information, and "hiding allowed" is set as the display selection information; (E) shows an example of a pre-conversion recognition result containing words of difficulty level 1; and (F) shows the post-conversion recognition result when (E) is the pre-conversion recognition result and difficulty level 2 is set as the difficulty level selection information.
A block diagram illustrating the hardware configuration of the speech recognition apparatus 100 in this embodiment.
A diagram showing a configuration example of the conversion unit 210 according to the second embodiment.
A diagram showing an example of the processing flow of the conversion unit 210.

以下、本発明の実施の形態について、詳細に説明する。 Hereinafter, embodiments of the present invention will be described in detail.

<Speech recognition apparatus 100>
FIG. 2 shows a configuration example of the speech recognition apparatus 100, and FIG. 3 shows an example of its processing flow. The speech recognition apparatus 100 according to the first embodiment is described with reference to FIGS. 2 and 3. The speech recognition apparatus 100 recognizes the content of speech information and obtains that content as text data. The speech recognition apparatus 100 includes, for example, a speech input unit 21, a speech analysis unit 23, a search unit 25, an acoustic model storage unit 27, a language model storage unit 131 that includes a recognition dictionary storage unit 129, a conversion unit 110, a storage unit 103, and a control unit 105. The obtained text data may be output to an output device (not shown; for example, a display or a printer), an external recording medium, or the like, or may be stored in the storage unit 103 described later.

<Storage unit 103 and control unit 105>
The storage unit 103 stores and reads out, one at a time, each piece of input and output data and each piece of intermediate data produced during computation. Each computation proceeds in this way. However, data need not always be stored in the storage unit 103; it may be passed directly between units.
The control unit 105 controls each process.

<Speech input unit 21>
The speech input unit 21 is, for example, a microphone and an input interface. It converts speech into an electrical signal and, using an A/D converter or the like, further converts it into digital data. In this specification, speech, speech converted into an electrical signal, and digital speech data produced by the A/D converter are collectively referred to as speech information. The speech recognition apparatus 100 receives speech via the speech input unit 21 (s21). However, when recognizing the content of speech information stored on an external recording medium or in the storage unit 103, the speech input unit 21 need not be provided.

<Speech analysis unit 23>
The speech analysis unit 23 uses the speech information to obtain its acoustic features (s23). The acoustic features are, for example, MFCCs (mel-frequency cepstral coefficients).

<Acoustic model storage unit 27>
The acoustic model storage unit 27 stores in advance phoneme information, which is the acoustic feature of each phoneme. For example, the MFCCs of standard patterns (standard models) are stored in advance in the acoustic model storage unit 27.

<Recognition dictionary storage unit 129>
FIG. 4 shows an example of data stored in the recognition dictionary storage unit 129. The recognition dictionary storage unit 129 stores in advance, in association with one another, a word, the reading of the word, the difficulty level of the word, and words that have the same meaning as the word but a different difficulty level. The recognition dictionary storage unit 129 may also store in advance, in association with each word, a display flag that determines whether the word is displayed after conversion. The data stored in the recognition dictionary storage unit 129 is registered before use by, for example, the manufacturer or the user.
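The associations described above can be pictured as one record per word. The following is a minimal sketch, not the layout of FIG. 4: the class name `DictionaryEntry`, its field names, and the dictionary `recognition_dictionary` are hypothetical, and the sample entries are taken from the examples given later in this description.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    """One hypothetical record of the recognition dictionary."""
    word: str
    reading: str
    difficulty: int
    # Same-meaning words at other difficulty levels: word -> difficulty.
    synonyms: dict = field(default_factory=dict)
    # Display flag: True means the word may be hidden after conversion.
    hideable: bool = False


# Illustrative entries only; the real contents of FIG. 4 are not reproduced here.
recognition_dictionary = {
    "今日": DictionaryEntry("今日", "きょう", 2, {"きょう": 1}),
    "私": DictionaryEntry("私", "わたし", 2, {"わたし": 1}),
    "えーと": DictionaryEntry("えーと", "えーと", 1, hideable=True),
}

print(recognition_dictionary["今日"].synonyms)
```

Keeping the same-meaning words inside each record means a single lookup is enough to list every candidate for the conversion step described later.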

<Language model storage unit 131>
The language model storage unit 131 stores linguistic features in advance. For example, the language model storage unit 131 includes the recognition dictionary storage unit 129 and additionally stores in advance the occurrence probability of each word string and the like.
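As an illustration of how stored occurrence probabilities can rank candidate word strings, here is a minimal sketch. It assumes a bigram model with a fixed floor for unseen pairs; the description only says that occurrence probabilities of word strings are stored, so the model type, the probability values, and the names `bigram` and `sequence_log_prob` are all hypothetical.

```python
import math

# Hypothetical bigram probabilities P(word | previous word); "<s>" marks the sentence start.
bigram = {
    ("<s>", "今日"): 0.2,
    ("今日", "は"): 0.5,
    ("<s>", "京"): 0.01,
    ("京", "は"): 0.3,
}


def sequence_log_prob(words, bigram, floor=1e-6):
    """Sum of log bigram probabilities; unseen pairs get a small floor value."""
    total = 0.0
    previous = "<s>"
    for word in words:
        total += math.log(bigram.get((previous, word), floor))
        previous = word
    return total


# Two candidate word strings with the same reading "きょう は".
candidates = [["今日", "は"], ["京", "は"]]
best = max(candidates, key=lambda ws: sequence_log_prob(ws, bigram))
print(best)
```

Working in log probabilities avoids numerical underflow when the word strings become long.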

<Search unit 25>
The search unit 25 searches the acoustic model storage unit 27 for similar phoneme information using the acoustic features of the speech information obtained by the speech analysis unit 23. The search unit 25 then searches the recognition dictionary storage unit 129 for words using the one or more pieces of phoneme information found. Finally, the search unit 25 estimates the pre-conversion recognition result using the one or more words found and the linguistic features in the language model storage unit 131 (s23).

For example, the search unit 25 calculates the Euclidean distance between the acoustic features (MFCCs) obtained by the speech analysis unit 23 and the MFCCs of the standard patterns stored in the acoustic model storage unit 27, and searches for similar phoneme information. It then estimates the reading from the phoneme information found, using a hidden Markov model, and searches the recognition dictionary storage unit 129 for the corresponding words. From the word strings made up of the words found and the word string occurrence probabilities stored in the language model storage unit 131, it searches for a word string with high probability and estimates the pre-conversion recognition result.
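The first step of this example, ranking the stored standard patterns by Euclidean distance to an observed feature vector, can be sketched as follows. The three-dimensional vectors, the phoneme labels, and the function name `nearest_phonemes` are illustrative assumptions; real MFCC vectors have more dimensions, and the hidden Markov model step is not shown.

```python
import math


def nearest_phonemes(feature, templates, k=1):
    """Return the k phoneme labels whose template vector is closest to `feature`."""
    ranked = sorted(templates, key=lambda label: math.dist(feature, templates[label]))
    return ranked[:k]


# Hypothetical standard patterns: phoneme label -> feature vector.
templates = {
    "a": [1.0, 0.0, 0.5],
    "i": [0.0, 1.0, 0.2],
    "u": [0.3, 0.3, 0.9],
}

print(nearest_phonemes([0.9, 0.1, 0.4], templates, k=2))
```

Returning the k closest labels instead of a single one matches the description's "one or more pieces of phoneme information".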

In general, speech recognition means obtaining the left-hand side P(ŵ|y) of the following equation, where w is the text data to be estimated, y is the time series of acoustic features obtained by the speech analysis unit 23, P(y|w) is the information held by the acoustic model storage unit 27, and P(w) is the information held by the language model storage unit 131 (reference: Seiichi Nakagawa, "Speech Recognition by Probability Models", Corona Publishing, July 1, 1988, pp. 33-34).
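The equation referred to above is not reproduced in this text. Assuming the standard Bayes-rule formulation, which is consistent with the symbols defined in the preceding paragraph (this is a reconstruction, not the patent's own rendering), it would read:

```latex
P(\hat{w} \mid y) = \max_{w} P(w \mid y) = \max_{w} \frac{P(y \mid w)\, P(w)}{P(y)}
```

Since P(y) does not depend on w, maximizing the product P(y|w)P(w) gives the same estimate ŵ.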

However, the present invention is not limited to this embodiment. For example, the speech input unit 21, the speech analysis unit 23, the search unit 25, and the acoustic model storage unit 27 may use other conventional techniques to compute the pre-conversion recognition result. In addition to the information described above, the recognition dictionary storage unit 129 and the language model storage unit 131 may store and use information employed in the prior art (for example, the class of each word).

<Conversion unit 110>
FIG. 5 shows an example of the processing flow of the conversion unit 110. The conversion unit 110 refers to the recognition dictionary storage unit 129 and selects, from among the words having the same meaning as each word of the pre-conversion recognition result, the word closest to the target difficulty level (s110). Here, the words having the same meaning are the word itself and the words associated with it in the recognition dictionary storage unit 129.

For example, a target difficulty level for the converted output is selected in advance for the conversion unit 110, which refers to the recognition dictionary storage unit 129 and selects the word that has the same meaning as the word in the pre-conversion recognition result and is closest to the selected difficulty level (s116). The conversion unit 110 also refers to the recognition dictionary storage unit 129 to determine whether the word in the pre-conversion recognition result is to be displayed after conversion (s115). If the display flag indicates that the word is to be displayed, the conversion unit 110 selects, from among the words having the same meaning, the word closest to the target difficulty level (s116). If the display flag indicates that the word is not to be displayed, the conversion unit 110 converts the result so that no word with that meaning appears in the post-conversion recognition result (s117).
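The selection in s116 can be sketched as a minimum over the absolute difference in difficulty level. The function name `select_closest` and the candidate dictionaries are hypothetical. The description does not say how ties are resolved, so preferring the original word on a tie is an assumption of this sketch.

```python
def select_closest(word, candidates, target):
    """Pick the same-meaning word whose difficulty is closest to `target`.

    `candidates` maps each same-meaning word (including `word` itself) to its
    difficulty level. Tie-breaking in favour of the original word is an
    assumption; the source does not specify it.
    """
    return min(candidates, key=lambda w: (abs(candidates[w] - target), w != word))


# Candidates taken from the examples given later in this description.
regret = {"遺憾の意": 3, "申し訳ない気持ち": 2}
print(select_closest("遺憾の意", regret, target=1))       # no level-1 word is registered
print(select_closest("えーと", {"えーと": 1}, target=2))  # only the word itself exists
```

When no word is registered at exactly the target level, the nearest level is used, which is why a word can pass through unchanged.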

<Example of processing flow>
For example, the user uses an input device (not shown; for example, a mouse or a keyboard) to send setting information, such as difficulty level selection information and display selection information, to the conversion unit 110 in advance. The difficulty level selection information determines the target word difficulty level of the post-conversion recognition result and is set to, for example, difficulty level 1. The display selection information determines whether the corresponding words are displayed in the post-conversion recognition result and is set to, for example, "hiding allowed" or "hiding not allowed". However, if the target difficulty level of the speech recognition apparatus 100 is fixed in advance (for example, "difficulty level 2"), or if it is fixed in advance whether words with the same meaning are displayed in the post-conversion recognition result (for example, "hiding not allowed"), the difficulty level selection information and the display selection information need not be input.

The conversion unit 110 receives the pre-conversion recognition result and outputs the post-conversion recognition result. The conversion unit 110 stores the pre-conversion recognition result input from the search unit 25 in a buffer or the like and takes out one word from it (s112). When the display selection information is set to "hiding allowed", the conversion unit 110 refers to the recognition dictionary storage unit 129 and determines whether the display flag of the extracted word permits hiding (s115). If the display flag is "hiding not allowed", it selects, from among the words having the same meaning as this word, the word closest to the target difficulty level (s116). In s116, if the word itself is selected, no conversion is performed and the word is recorded as is in an output buffer or the like; if a word of a different difficulty level is selected, the original word is converted into the selected word and the converted word is recorded in the output buffer or the like. If the display flag is "hiding allowed", the word is converted so that it does not appear in the post-conversion recognition result (s117), for example by deleting it. When the display selection information is set to "hiding not allowed", the display flag check (s115) and the hiding conversion (s117) are skipped, and the conversion process (s116) is performed on every word.

Each converted word is stored in an output buffer or the like (not shown). The conversion unit 110 determines whether the word is the last word of the pre-conversion recognition result (s118). If it is not, the next word is taken from the pre-conversion recognition result and the above processing is repeated. If it is the last word, the converted recognition result is taken from the output buffer or the like and output as the post-conversion recognition result (s119). Alternatively, the post-conversion recognition result may be output after each conversion; in that case, the check for the last word (s118) is made after the recognition result is output (s119).
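The loop over s112 to s119 can be sketched as below. The names `Entry` and `convert`, the dictionary contents, and the tie-breaking rule are assumptions made for illustration. Every word of the input is assumed to be registered in the dictionary, in line with the statement that recognition results are built from registered words.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    difficulty: int
    synonyms: dict = field(default_factory=dict)  # same-meaning word -> difficulty
    hideable: bool = False                        # display flag: hiding allowed


def convert(words, dictionary, target, allow_hiding=False):
    """Convert a pre-conversion recognition result word by word."""
    output = []                                   # stands in for the output buffer
    for word in words:                            # s112, repeated until s118 says "last"
        entry = dictionary[word]
        if allow_hiding and entry.hideable:       # s115: display flag check
            continue                              # s117: drop the word
        candidates = {word: entry.difficulty, **entry.synonyms}
        best = min(candidates,
                   key=lambda w: (abs(candidates[w] - target), w != word))  # s116
        output.append(best)
    return output                                 # s119


# Illustrative dictionary; entries and levels follow the examples in this description.
dictionary = {
    "えーと": Entry(1, hideable=True),
    "我輩": Entry(3, {"私": 2, "わたし": 1}),
    "は": Entry(1),
    "遺憾の意": Entry(3, {"申し訳ない気持ち": 2}),
}
sentence = ["えーと", "我輩", "は", "遺憾の意"]

print(convert(sentence, dictionary, target=2))
print(convert(sentence, dictionary, target=1, allow_hiding=True))
```

With hiding switched off the flag check is skipped entirely, which mirrors the "hiding not allowed" branch described above.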

<Specific example>
FIG. 6(A) shows an example of a pre-conversion recognition result containing words of difficulty level 3. (B) shows the post-conversion recognition result when (A) is the pre-conversion recognition result and difficulty level 2 is set as the difficulty level selection information. (C) shows the post-conversion recognition result when (A) is the pre-conversion recognition result and difficulty level 1 is set as the difficulty level selection information. (D) shows the post-conversion recognition result when (A) is the pre-conversion recognition result, difficulty level 1 is set as the difficulty level selection information, and "hiding allowed" is set as the display selection information. (E) shows an example of a pre-conversion recognition result containing words of difficulty level 1. (F) shows the post-conversion recognition result when (E) is the pre-conversion recognition result and difficulty level 2 is set as the difficulty level selection information. In the conversions for (B), (C), and (F), the display selection information is assumed to be set to "hiding not allowed". In (A), the difficulty level is shown below each word to make the explanation easier, but it need not be displayed in actual use. Likewise, the underlines marking the words to be converted need not be displayed.

For example, suppose information is registered in advance in the recognition dictionary storage unit 129 as in the data example of FIG. 4. When the difficulty level selection information is difficulty level 1 and words of difficulty level 2 such as 「今日」 ("today") and 「私」 ("I") are input as the pre-conversion recognition result, they are converted into the difficulty level 1 words 「きょう」 and 「わたし」, respectively, and output. When the difficulty level selection information is difficulty level 2 and words of difficulty level 3 such as 「我輩」 (an archaic "I") and 「痛恨の極み」 ("the height of regret") are input as the pre-conversion recognition result, they are converted into the difficulty level 2 words 「私」 ("I") and 「非常に残念」 ("very regrettable"), respectively, and output. In this way, when the pre-conversion recognition result of FIG. 6(A) is input to the conversion unit 110, the post-conversion recognition result of FIG. 6(B) is output if difficulty level 2 is set, and that of FIG. 6(C) is output if difficulty level 1 is set. In FIG. 6(A), 「えーと」 ("um") has difficulty level 1. Because FIG. 4 contains no word with the same meaning at a different difficulty level, the same-meaning word closest to the selected difficulty level 2 is 「えーと」 itself, so it is written to the output buffer or the like without conversion. Also in FIG. 6(A), 「遺憾の意」 ("an expression of regret") has difficulty level 3. When the difficulty level selection information indicates difficulty level 1, no difficulty level 1 word corresponding to 「遺憾の意」 is registered in FIG. 4, so the corresponding difficulty level 2 word 「申し訳ない気持ち」 ("feeling sorry") is the same-meaning word closest to the selected difficulty level 1.

Displaying every spoken word, whatever it is, can also make the result hard to read. For example, an adverb of degree such as 「非常に」 ("very") can be left out without losing the gist. The recognition result may be easier to understand if filler words such as 「えーと」 ("um") and 「あのー」 ("er") and adverbs of degree such as 「とても」 ("very") and 「少しだけ」 ("just a little") are left out of the post-conversion result. In this embodiment, a display flag is registered in advance for each word in the recognition dictionary storage unit 129, and when the display flag indicates that the word is not to be displayed ("hiding allowed"), the result is converted so that no word with that meaning appears in the post-conversion recognition result. In this case, nothing needs to be written to the output buffer or the like. When the pre-conversion recognition result of FIG. 6(A) is input to the conversion unit 110 with difficulty level 1 selected as the difficulty level selection information and "hiding allowed" selected as the display selection information, the post-conversion recognition result of FIG. 6(D) is output. In this case, the result has been converted so that the word 「えーと」 is not displayed.

When the difficulty level selection information is difficulty level 2 and words of difficulty level 1 such as 「あんよ」 (baby talk for "foot"), 「なっちゃって」 (a colloquial form of "became"), and 「ママ」 ("mommy") are input as the pre-conversion recognition result, they can be converted into the difficulty level 2 words 「足」 ("foot"), 「なって」 ("became"), and 「お母さん」 ("mother"), respectively, and output. For example, when FIG. 6(E) is input and difficulty level 2 is selected as the difficulty level selection information, the post-conversion recognition result of FIG. 6(F) is output. With this configuration, low-difficulty expressions such as baby talk can be converted into a recognition result of higher difficulty, and a recognition result that is easy for the user to understand can be obtained.

Furthermore, when a single utterance (pre-conversion recognition result) contains words with the same meaning, such as 「我輩」 and 「私」, they are converted into the same word, so that a recognition result that is easy for the user to understand can be obtained.

In this embodiment there are three difficulty levels, 1 to 3, but the number of levels need not be three; there may be two levels, or four or more.

The conversion unit 110 may also be part of the search unit 25. In that case, the search unit 25 performs the conversion process each time a word of the pre-conversion recognition result is estimated, and the step of taking one word from the pre-conversion recognition result in FIG. 5 (s112) and the check for the last word (s118) may be omitted.

<Effect>
This configuration has the effect that a recognition result that is easy for the user to understand can be obtained. Expressing the words in the recognized text at a uniform difficulty level makes the text easier for the user to understand. For example, for children or for foreigners who are not proficient in Japanese, the difficulty level can be lowered and the recognition result presented in plain expressions. Conversely, when the user has higher language ability than the speaker, a recognition result of higher difficulty can be presented. By inputting difficulty level selection information to the conversion unit 110, text data of different difficulty levels can be presented according to the user's level of understanding. By providing a display flag in the data registered in the recognition dictionary storage unit 129 and inputting display selection information to the conversion unit 110, words of low importance can be hidden as needed, and the text data can be presented more clearly and concisely. In addition, when preparing minutes or similar documents in which specific terms (words) are required, the terms can be presented in a unified form.

<Hardware configuration>
FIG. 7 is a block diagram illustrating the hardware configuration of the speech recognition apparatus 100 according to the present embodiment. As illustrated in FIG. 7, the speech recognition apparatus 100 of this example includes a CPU (Central Processing Unit) 11, an input unit 12, an output unit 13, an auxiliary storage device 14, a ROM (Read Only Memory) 15, a RAM (Random Access Memory) 16, and a bus 17.

The CPU 11 in this example includes a control unit 11a, an arithmetic unit 11b, and a register 11c, and executes various arithmetic operations according to the programs read into the register 11c. The input unit 12 is an input interface through which data is input, a keyboard, a mouse, or the like, and the output unit 13 is an output interface or the like through which data is output. The auxiliary storage device 14 is, for example, a hard disk, an MO (Magneto-Optical disc), or a semiconductor memory, and has a program area 14a, which stores the program that causes a computer to function as the speech recognition apparatus 100, and a data area 14b, which stores various data. The RAM 16 is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 16a, in which the above program is stored, and a data area 16b, in which various data are stored. The bus 17 connects the CPU 11, the input unit 12, the output unit 13, the auxiliary storage device 14, the ROM 15, and the RAM 16 so that they can communicate with one another. Specific examples of such hardware include personal computers, server apparatuses, and workstations.

<Program structure>
As described above, the program areas 14a and 16a store the programs for executing each process of the speech recognition apparatus 100 of the present embodiment. The programs constituting the speech recognition program may be written as a single program sequence, or at least some of them may be stored in a library as separate modules. Each program may realize its function on its own, or may realize its function by calling further libraries.

<Cooperation between hardware and program>
In accordance with the loaded OS (Operating System) program, the CPU 11 (FIG. 7) writes the above program, stored in the program area 14a of the auxiliary storage device 14, into the program area 16a of the RAM 16. Similarly, the CPU 11 writes the various data stored in the data area 14b of the auxiliary storage device 14 into the data area 16b of the RAM 16. The addresses on the RAM 16 at which the program and data were written are stored in the register 11c of the CPU 11. The control unit 11a of the CPU 11 sequentially reads these addresses from the register 11c, reads the program and data from the areas of the RAM 16 indicated by those addresses, causes the arithmetic unit 11b to execute the operations indicated by the program in order, and stores the results in the register 11c.

FIG. 2 is a block diagram illustrating the functional configuration of the speech recognition apparatus 100 that is formed when the above program is loaded into and executed by the CPU 11 in this way.

Here, the storage unit 103 corresponds to any of the auxiliary storage device 14, the RAM 16, the register 11c, or other buffer or cache memory, or to a storage area that combines them. The storage unit 103, the control unit 105, the speech analysis unit 23, the acoustic model storage unit 27, the recognition dictionary storage unit 129, the language model storage unit 131, the search unit 25, and the conversion unit 110 are formed by causing the CPU 11 to execute the speech recognition program. The speech recognition apparatus 100 of this embodiment executes each process under the control of the control unit 105.

Only the parts that differ from the first embodiment are described. The second embodiment differs from the first embodiment in the configuration of the conversion unit.
<Conversion unit 210>
FIG. 8 shows a configuration example of the conversion unit 210 according to the second embodiment, and FIG. 9 shows an example of its processing flow. The conversion unit 210 includes a processing unit 211 and a representative value calculation unit 217.

<Representative value calculation unit 217>
The representative value calculation unit 217 obtains a representative value of the difficulty level within a predetermined section of the pre-conversion recognition result (s213).

<Processing unit 211>
The processing unit 211 uses the obtained representative value as the target difficulty level, refers to the recognition dictionary storage unit 129, and selects the word that has the same meaning as the word of the pre-conversion recognition result and whose difficulty level is closest to the representative value (s216).
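The selection in s216 can be sketched as a nearest-value search over the same-meaning words. This is only an illustration: the `Entry` type, the English example words, their difficulty levels, and the tie-breaking rule (prefer the easier word) are assumptions that do not appear in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical dictionary record: a written form and its difficulty level."""
    surface: str
    difficulty: int


def select_closest(candidates: list[Entry], target: float) -> Entry:
    """Return the same-meaning candidate whose difficulty is nearest to target.

    Ties are broken toward the easier word. The source does not specify
    tie-breaking, so this rule is an assumption.
    """
    return min(candidates, key=lambda e: (abs(e.difficulty - target), e.difficulty))


# Hypothetical group of words registered as having the same meaning.
synonyms = [Entry("buy", 1), Entry("purchase", 2), Entry("procure", 3)]
chosen = select_closest(synonyms, target=2)
```

With a target of 2 the sketch picks "purchase"; with a target of 1.5 the two nearest words are equally distant and the assumed tie-break picks "buy".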

<Example of processing flow>
The conversion unit 210 receives the pre-conversion recognition result and outputs the post-conversion recognition result. The conversion unit 210 stores the pre-conversion recognition result input from the search unit 25 in a buffer or the like. It then takes out the words of one predetermined section, refers to the recognition dictionary 129 to obtain the difficulty level of each word, and calculates a representative value of the difficulty levels of the words within that section (s213). This is repeated for each section, so that a representative value is calculated for every section. The predetermined section may be defined by a number of words to be converted or by a processing time, or it may be all of the words of the pre-conversion recognition result. Possible representative values include the mean, the mode, and the median.
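The per-section calculation in s213 can be sketched as follows, assuming sections are defined by a fixed word count, which is only one of the section definitions the text allows. The function names and the `difficulty` mapping are hypothetical stand-ins for a lookup in the recognition dictionary.

```python
import statistics
from typing import Callable, Mapping, Sequence


def split_sections(words: Sequence[str], size: int) -> list[list[str]]:
    """Split the pre-conversion result into sections of `size` words each."""
    return [list(words[i:i + size]) for i in range(0, len(words), size)]


def representative_values(
    words: Sequence[str],
    difficulty: Mapping[str, int],
    size: int,
    reducer: Callable[[list[int]], float] = statistics.mean,
) -> list[float]:
    """Compute one representative difficulty per section (s213).

    `reducer` may be statistics.mean, statistics.median, or statistics.mode,
    matching the mean, median, and mode named in the text.
    """
    return [
        reducer([difficulty[w] for w in section])
        for section in split_sections(words, size)
    ]


# Hypothetical words and difficulty levels, for illustration only.
words = ["a", "b", "c", "d"]
difficulty = {"a": 1, "b": 3, "c": 2, "d": 2}
per_section = representative_values(words, difficulty, size=2)
whole_result = representative_values(words, difficulty, size=len(words))
```

Passing `size=len(words)` treats the whole pre-conversion recognition result as a single section.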

Next, one word of the pre-conversion recognition result is taken out of the buffer or the like (s112). When the display selection information is set so that hiding is permitted, the recognition dictionary 129 is referenced to determine whether the display flag of the extracted word indicates that the word may be hidden (s115). If the display flag indicates that the word may not be hidden, the word closest to the representative difficulty level is selected from the words having the same meaning as this word (s216). If, in s216, the word itself is selected from among the words having the same meaning, the conversion unit 110 performs no conversion and records the word as it is in an output buffer or the like. If a word of a different difficulty level is selected, the original word is converted to the selected word, and the converted word is recorded in the output buffer or the like. If the display flag indicates that the word may be hidden, the word is converted so that it does not appear in the post-conversion recognition result (s117). Each word is converted to the word closest to the representative difficulty level of the section to which it belongs. When the display selection information is set so that hiding is not permitted, the display flag determination (s115) and the hiding conversion (s117) are skipped, and the conversion process (s216) is performed for every word.

Each converted word is stored in an output buffer or the like (not shown). It is then determined whether the word is the last word of the pre-conversion recognition result (s118). If it is not, the next word is taken from the pre-conversion recognition result and the above process is repeated. If it is the last word, the converted recognition result is read from the output buffer or the like and output as the post-conversion recognition result (s119). Alternatively, the post-conversion recognition result may be output after each individual conversion.
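The word-by-word loop (s112, s115, s216, s117, s118, s119) can be sketched as below. For brevity the sketch uses one target difficulty for the whole result, that is, it assumes a single section. The `DictEntry` layout, the example words, and the parameter names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DictEntry:
    """Hypothetical dictionary record for one word."""
    difficulty: int
    hideable: bool = False  # display flag: True means the word may be hidden
    synonyms: dict[str, int] = field(default_factory=dict)  # same-meaning word -> difficulty


def convert(words: list[str], dictionary: dict[str, DictEntry],
            target: float, allow_hiding: bool) -> list[str]:
    """Convert a pre-conversion result into a post-conversion result."""
    output: list[str] = []  # stands in for the output buffer
    for word in words:  # s112 and s118: take words one at a time until the last
        entry = dictionary[word]
        if allow_hiding and entry.hideable:  # s115: check the display flag
            continue  # s117: leave the word out of the result
        # s216: the word itself competes with its same-meaning words.
        candidates = {word: entry.difficulty, **entry.synonyms}
        best = min(candidates, key=lambda w: abs(candidates[w] - target))
        output.append(best)
    return output  # s119: output the converted result


# Hypothetical dictionary contents, for illustration only.
dictionary = {
    "um": DictEntry(1, hideable=True),
    "procure": DictEntry(3, synonyms={"buy": 1, "purchase": 2}),
    "food": DictEntry(1, synonyms={"provisions": 3}),
}
hidden = convert(["um", "procure", "food"], dictionary, target=2, allow_hiding=True)
shown = convert(["um", "procure", "food"], dictionary, target=2, allow_hiding=False)
```

Because the original word is listed first among the candidates, a tie leaves the word unchanged, which corresponds to the case in which no conversion is performed.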

For example, when the pre-conversion recognition result of FIG. 6(A) is input, and the predetermined section is taken to be all of the words of the pre-conversion recognition result with the mean as the representative value, the representative difficulty level of the section is 1.5. If this value is rounded up so that the representative difficulty level is 2, the post-conversion recognition result is as shown in FIG. 6(B).

<Effect>
This configuration makes it possible to obtain a recognition result that is easy for the user to understand. For example, by obtaining a representative value (for example, the mean) of the difficulty level of a predetermined section (for example, the entire pre-conversion recognition result) from the difficulty levels of the individual words, the overall difficulty tendency of the section containing those words is obtained. Using this representative value as the difficulty level of the post-conversion recognition result prevents a word whose difficulty differs sharply from that of the surrounding words from being displayed as it is, so that words whose difficulty suits the context are displayed.

Because the present invention converts the pre-conversion recognition result using difficulty levels, rather than simply replacing words with other words of the same meaning, it can produce text data that is easy for the user to understand.

The first and second embodiments can also be used in combination, by inputting to the conversion unit, as setting information, conversion setting information that selects either the method of the first embodiment (manual) or the method of the second embodiment (automatic). In this case, the user can choose which text data is output. When manual is selected, difficulty level selection information is also input to the conversion unit.
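One way to picture this combination is a small dispatcher that returns the target difficulty for either setting. This is a sketch under stated assumptions: the mode strings `"manual"` and `"automatic"`, the function name, and the use of the mean as the automatic representative value are illustrative choices, not taken from the source.

```python
import statistics
from typing import Optional, Sequence


def target_difficulty(mode: str, difficulties: Sequence[int],
                      selected: Optional[int] = None) -> float:
    """Return the target difficulty for the chosen conversion setting.

    "manual" uses the difficulty level selection information supplied by the
    user (first embodiment). "automatic" derives the target from the words
    themselves (second embodiment); the mean is assumed here.
    """
    if mode == "manual":
        if selected is None:
            raise ValueError("manual mode requires difficulty level selection information")
        return selected
    if mode == "automatic":
        return statistics.mean(difficulties)
    raise ValueError(f"unknown conversion setting: {mode!r}")


manual_target = target_difficulty("manual", [1, 2], selected=3)
automatic_target = target_difficulty("automatic", [1, 2])
```

Rejecting the manual setting when no selection is supplied mirrors the requirement that difficulty level selection information be input in that case.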

Description of symbols
100 Speech recognition apparatus
21 Speech input unit
23 Speech analysis unit
25 Search unit
27 Acoustic model storage unit
129 Recognition dictionary storage unit
131 Language model storage unit
110, 210 Conversion unit
211 Processing unit
217 Representative value calculation unit

Claims

A speech recognition apparatus that recognizes the content of speech information and obtains the content as text data, comprising:
a speech analysis unit that obtains acoustic features of the speech information using the speech information;
an acoustic model storage unit in which phoneme information, which is the acoustic feature of each phoneme, is stored in advance;
a recognition dictionary storage unit in which a word, a reading of the word, a difficulty level of the word, and a word having the same meaning as the word and a different difficulty level are stored in advance in association with one another;
a language model storage unit in which linguistic features are stored in advance;
a search unit that retrieves similar phoneme information from the acoustic model storage unit using the acoustic features of the speech information obtained by the speech analysis unit, retrieves words from the recognition dictionary storage unit using the one or more retrieved pieces of phoneme information, and estimates a pre-conversion recognition result using the one or more retrieved words and the linguistic features of the language model storage unit; and
a conversion unit that refers to the recognition dictionary storage unit and selects, from words having the same meaning as a word of the pre-conversion recognition result, the word closest to a target difficulty level.

The speech recognition apparatus according to claim 1,
wherein the target difficulty level after conversion is selected in advance, and the conversion unit refers to the recognition dictionary storage unit and selects the word that has the same meaning as the word of the pre-conversion recognition result and is closest to the selected difficulty level.

The speech recognition apparatus according to claim 1,
wherein the conversion unit includes a representative value calculation unit that obtains a representative value of the difficulty level in a predetermined section of the pre-conversion recognition result, and
the conversion unit uses the representative value as the target difficulty level, refers to the recognition dictionary storage unit, and selects the word that has the same meaning as the word of the pre-conversion recognition result and is closest to the representative value.

The speech recognition apparatus according to any one of claims 1 to 3,
wherein the recognition dictionary storage unit stores in advance, in association with one another, a word, a reading of the word, a difficulty level of the word, a word having the same meaning as the word and a different difficulty level, and a display flag for determining whether to display the word after conversion, and
the conversion unit refers to the recognition dictionary storage unit and determines whether to display the word of the pre-conversion recognition result after conversion; if the display flag of the word means that the word is to be displayed, the conversion unit selects, from the words having the same meaning as the word, the word closest to the target difficulty level, and if the display flag means that the word is not to be displayed, the conversion unit converts the result so that the word and words having the same meaning are not displayed in the post-conversion recognition result.

A speech recognition method for recognizing the content of speech information using an acoustic model storage unit, a recognition dictionary storage unit, and a language model storage unit, and obtaining the content as text data,
wherein phoneme information, which is the acoustic feature of each phoneme, is stored in advance in the acoustic model storage unit; a word, a reading of the word, a difficulty level of the word, and a word having the same meaning as the word and a different difficulty level are stored in advance in the recognition dictionary storage unit in association with one another; and linguistic features are stored in advance in the language model storage unit,
the method comprising:
a speech analysis step of obtaining acoustic features of the speech information using the speech information;
a search step of retrieving similar phoneme information from the acoustic model storage unit using the acoustic features of the speech information obtained in the speech analysis step, retrieving words from the recognition dictionary storage unit using the one or more retrieved pieces of phoneme information, and estimating a pre-conversion recognition result using the one or more retrieved words and the linguistic features of the language model storage unit; and
a conversion step of referring to the recognition dictionary storage unit and selecting, from words having the same meaning as a word of the pre-conversion recognition result, the word closest to a target difficulty level.

The speech recognition method according to claim 5,
wherein the target difficulty level after conversion is selected in advance, and the conversion step refers to the recognition dictionary storage unit and selects the word that has the same meaning as the word of the pre-conversion recognition result and is closest to the selected difficulty level.

The speech recognition method according to claim 5,
wherein the conversion step includes a representative value calculation step of obtaining a representative value of the difficulty level in a predetermined section of the pre-conversion recognition result, and
the conversion step uses the representative value as the target difficulty level, refers to the recognition dictionary storage unit, and selects the word that has the same meaning as the word of the pre-conversion recognition result and is closest to the representative value.

The speech recognition method according to any one of claims 5 to 7,
wherein the recognition dictionary storage unit stores in advance, in association with one another, a word, a reading of the word, a difficulty level of the word, a word having the same meaning as the word and a different difficulty level, and a display flag for determining whether to display the word after conversion, and
the conversion step refers to the recognition dictionary storage unit and determines whether to display the word of the pre-conversion recognition result after conversion; if the display flag of the word means that the word is to be displayed, the conversion step selects, from the words having the same meaning as the word, the word closest to the target difficulty level, and if the display flag means that the word is not to be displayed, the conversion step converts the result so that the word and words having the same meaning are not displayed in the post-conversion recognition result.

A speech recognition program for causing a computer to function as the speech recognition apparatus according to claim 1.