JP5396530B2

JP5396530B2 - Speech recognition apparatus and speech recognition method

Info

Publication number: JP5396530B2
Application number: JP2012270688A
Authority: JP
Inventors: 真也飯塚; 孝輔辻野
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2012-12-11
Filing date: 2012-12-11
Publication date: 2014-01-22
Anticipated expiration: 2030-06-17
Also published as: JP2013050742A

Description

The present invention relates to a speech recognition apparatus and a speech recognition method.

Speech recognition techniques are known that generate and output, from speech input through a microphone, a character string representing the content of that speech. Specifically, the input speech is matched against character strings with reference to an acoustic model and a language model, and a statistically plausible character string is output as the recognition result. The acoustic model indicates the correspondence between speech feature values and characters, together with statistical information on that correspondence. The language model indicates the connection relationships between character strings, together with statistical information on those relationships.

With such speech recognition techniques, when the speech is distorted by ambient noise, or when a word that is not registered in the language model is spoken, the speech is misrecognized and an incorrect character string is output as the recognition result. In that case, the user had to correct the incorrect character string in the output recognition result manually by operating an input device such as a keyboard.

To reduce this operating burden on the user, Patent Document 1 discloses a technique that presents, for a word in the recognition result, words whose competition probability is close to that of the word as correction candidate words, and lets the user select whether to convert the recognized word into one of these correction candidate words. According to this technique, the user can correct a word in the recognition result simply by selecting a correction candidate word, which is said to reduce the user's operating burden.

Patent Document 1: JP 2006-146008 A

However, the technique described in Patent Document 1 performs both the speech recognition processing and the correction processing on the basis of a single language model. Consequently, if the correct word is not registered in that language model, the correct word still cannot be presented to the user. In the end, the user had to correct the incorrect character string in the output recognition result manually by operating an input device such as a keyboard.

An object of the present invention is to determine conversion candidates for a character string misrecognized in speech recognition processing on a basis different from that of the speech recognition processing.

To solve the above problem, a speech recognition apparatus according to the present invention comprises: a speech data acquisition unit that acquires speech data; a speech recognition unit that performs speech recognition processing on the speech data with reference to an acoustic model indicating the correspondence between speech feature values and characters and a first language model indicating connection relationships between character strings, and generates a recognized character string indicating the recognition result; a conversion target character string determination unit that determines, as a conversion target character string, a character string in the recognized character string that is designated by the user, a character string whose reliability as a recognition result is lower than a predetermined threshold, or a character string formed by combining character strings whose reliability as a recognition result is lower than a predetermined threshold; a reference character string determination unit that determines, as a reference character string, character strings of the unit or number designated by the user from among the character strings connected immediately before or immediately after the determined conversion target character string in the recognized character string; a conversion candidate character string determination unit that, with reference to a second language model indicating connection relationships between character strings extracted from character strings the user entered in the past, determines, as a conversion candidate character string that is a candidate for converting the conversion target character string, a character string that does not depend on the recognition result of the speech recognition processing on the speech data and whose connection relationship with the determined reference character string is indicated; and an output unit that outputs the determined conversion candidate character string.
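The reference character string and conversion candidate determination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the dictionary layout of the second language model, and the choice to use only the preceding morphemes as the reference are hypothetical. The sample strings are taken from the example used later in the first embodiment.

```python
def determine_reference(morphemes, target_index, count=1):
    """Return the `count` morphemes immediately before the conversion target.

    Assumption: only preceding morphemes are used; the text also allows
    the morphemes immediately after the target.
    """
    start = max(0, target_index - count)
    return tuple(morphemes[start:target_index])


def determine_candidates(second_model, reference):
    """Look up strings whose connection to `reference` is recorded.

    The lookup never consults the recognizer's own hypotheses, so the
    candidates do not depend on the recognition result.
    """
    return list(second_model.get(reference, []))


# Hypothetical second language model: reference tuple -> following strings.
second_model = {("京都",): ["府", "市", "駅", "の"]}

recognized = ["今", "から", "京都", "的", "に", "行き", "ます"]
target_index = 3  # "的" is the conversion target

reference = determine_reference(recognized, target_index)
candidates = determine_candidates(second_model, reference)
print(reference, candidates)
```

Keying the model by tuples keeps the same lookup usable when the reference consists of more than one morpheme.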

Preferably, when the number of conversion candidate character strings obtained from the second language model for the reference character string falls outside a predetermined range, the reference character string determination unit increases or decreases the number of character strings connected immediately before or immediately after the conversion target character string in the recognized character string and determines the resulting number of character strings as a new reference character string, and the conversion candidate character string determination unit determines, as the conversion candidate character string, a character string whose connection relationship with the new reference character string is indicated in the second language model.
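One way to picture this adjustment is the loop below. It is only a sketch: the text does not state in which direction the number of reference strings changes, so the policy here (lengthen the reference when there are too many candidates, shorten it when there are too few) is an assumption, as are the function name, the model layout, and the sample entries.

```python
def adapt_reference(model, preceding, min_count, max_count):
    """Vary the reference length until the candidate count is in range.

    `preceding` holds the morphemes before the conversion target and
    `model` maps reference tuples to candidate lists. Returns the last
    reference tried together with its candidates.
    """
    n = 1
    tried = set()
    reference, candidates = (), []
    while 1 <= n <= len(preceding) and n not in tried:
        tried.add(n)
        reference = tuple(preceding[-n:])
        candidates = model.get(reference, [])
        if len(candidates) > max_count:
            n += 1  # too many candidates: use more context
        elif len(candidates) < min_count:
            n -= 1  # too few candidates: use less context
        else:
            break
    return reference, candidates


# Hypothetical model entries for illustration only.
model = {
    ("京都",): ["府", "市", "駅", "の"],
    ("から", "京都"): ["駅", "市"],
}
preceding = ["今", "から", "京都"]
reference, candidates = adapt_reference(model, preceding, min_count=1, max_count=3)
print(reference, candidates)
```

The `tried` set stops the loop if the count oscillates between too many and too few, in which case the last result is returned as a best effort.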

Preferably, the output unit converts the conversion target character string in the recognized character string into the conversion candidate character string and outputs the recognized character string after conversion.

Preferably, the conversion candidate character string determination unit calculates, for each character string whose connection relationship with the reference character string is indicated in the second language model, a degree of correlation with the conversion target character string, and determines at least the character string with the highest degree of correlation, or the character strings whose degree of correlation is higher than a threshold, as conversion candidate character strings that are candidates for converting the conversion target character string.
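The selection rule can be sketched as below. How the degree of correlation is computed is not defined at this point in the text, so the sketch substitutes a stand-in: the similarity ratio of romanized readings from Python's `difflib.SequenceMatcher`. The function names, the readings, and the threshold value are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def correlation(reading_a, reading_b):
    """Stand-in correlation: similarity ratio of two romanized readings."""
    return SequenceMatcher(None, reading_a, reading_b).ratio()


def select_candidates(target_reading, candidate_readings, threshold=None):
    """Return candidates above `threshold`, else the single best candidate."""
    scored = {
        cand: correlation(target_reading, reading)
        for cand, reading in candidate_readings.items()
    }
    if not scored:
        return []
    if threshold is not None:
        above = [cand for cand, score in scored.items() if score > threshold]
        if above:
            return sorted(above, key=scored.get, reverse=True)
    return [max(scored, key=scored.get)]


# Strings connected to "京都" in the second language model, with readings.
candidate_readings = {"府": "fu", "市": "shi", "駅": "eki", "の": "no"}

# The conversion target "的" is read "teki".
best = select_candidates("teki", candidate_readings)
above = select_candidates("teki", candidate_readings, threshold=0.5)
print(best, above)
```

With these readings, "駅" (eki) shares the most characters with "teki" and is the candidate selected in both calls.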

Preferably, the output unit outputs a conversion candidate character string from which those phonemes of the conversion candidate character string that do not match the phonemes of the conversion target character string have been deleted, or a conversion candidate character string from which characters that need not be output, identified by performing a grammar check on the connection between the conversion candidate character string and the reference character string, have been deleted.

Preferably, the apparatus comprises a communication unit that communicates with a server device storing the acoustic model and the first language model, and the speech recognition unit performs the speech recognition processing by referring, through communication with the server device via the communication unit, to the acoustic model and the first language model stored in the server device.

A speech recognition method according to the present invention is a speech recognition method performed by a speech recognition apparatus, and comprises: a speech data acquisition step of acquiring speech data; a speech recognition step of performing speech recognition processing on the speech data with reference to an acoustic model indicating the correspondence between speech feature values and characters and a first language model indicating connection relationships between character strings, and generating a recognized character string indicating the recognition result; a conversion target character string determination step of determining, as a conversion target character string, a character string in the recognized character string that is designated by the user, a character string whose reliability as a recognition result is lower than a predetermined threshold, or a character string formed by combining character strings whose reliability as a recognition result is lower than a predetermined threshold; a reference character string determination step of determining, as a reference character string, character strings of the unit or number designated by the user from among the character strings connected immediately before or immediately after the determined conversion target character string in the recognized character string; a conversion candidate character string determination step of determining, with reference to a second language model indicating connection relationships between character strings extracted from character strings the user entered in the past, as a conversion candidate character string that is a candidate for converting the conversion target character string, a character string that does not depend on the recognition result of the speech recognition processing on the speech data and whose connection relationship with the determined reference character string is indicated; and an output step of outputting the determined conversion candidate character string.

According to the present invention, conversion candidates for a character string misrecognized in speech recognition processing can be determined on a basis different from that of the speech recognition processing.

FIG. 1 shows the configuration of the speech recognition apparatus 100 according to the first embodiment.
FIG. 2 shows an example of the hardware configuration of the speech recognition apparatus 100.
FIG. 3 shows the functional configuration of the speech recognition apparatus 100.
FIG. 4 shows an example of the second language model.
FIG. 5 shows the procedure of processing by the speech recognition apparatus 100.
FIG. 6 shows an example of speech recognition processing by the speech recognition unit 314.
FIG. 7 shows an example of the reliability calculated by the speech recognition unit 314.
FIG. 8 shows an example of the degree of correlation of conversion candidate character strings.
FIG. 9 shows an output example of conversion candidate character strings.
FIG. 10 shows a conversion example of the conversion target character string.
FIG. 11 shows an example of the second language model according to the second embodiment.
FIG. 12 shows an example of the second language model according to the third embodiment.
FIG. 13 shows the functional configuration of the speech recognition apparatus 100 according to the fourth embodiment.

The present invention can be readily understood by considering the following detailed description with reference to the accompanying drawings, which are shown for one embodiment. Embodiments of the present invention are described below with reference to the accompanying drawings. Where possible, the same parts are given the same reference numerals and duplicate description is omitted.

(First embodiment)
First, the first embodiment will be described. FIG. 1 shows the configuration of the speech recognition apparatus 100 according to the first embodiment. The speech recognition apparatus 100 recognizes input speech and outputs characters corresponding to the recognized speech. In the first embodiment, a personal computer with a speech recognition function is used as the speech recognition apparatus 100.

The speech recognition apparatus 100 includes a main body 110, a microphone 120, a display 130, a speaker 140, a keyboard 150, and a mouse 160. For example, when the user speaks, the speech is input to the main body 110 through the microphone 120. The main body 110 performs speech recognition processing on the input speech, thereby recognizing it, and outputs characters corresponding to the recognized speech. For example, the main body 110 causes the display 130 to show an image of the recognized characters, or causes the speaker 140 to emit speech corresponding to the recognized characters. The keyboard 150 and the mouse 160 are used as input devices when an instruction from the user is required during speech recognition processing.

FIG. 2 shows an example of the hardware configuration of the speech recognition apparatus 100. In addition to the main body 110, the microphone 120, the display 130, the speaker 140, the keyboard 150, and the mouse 160 already described, the speech recognition apparatus 100 includes, inside the main body 110, a CPU 1505, a ROM 1510, a RAM 1520, an external memory drive 1540, an external memory 1542, a communication interface 1550, and an input/output device interface 1560.

The ROM 1510, the RAM 1520, and the external memory 1542 store various data and programs. The CPU 1505 performs various kinds of data processing and hardware control by executing programs stored in the ROM 1510, the RAM 1520, or the external memory 1542.

The communication interface 1550 controls communication with external devices. The external memory drive 1540 connects to the external memory 1542 and reads data from and writes data to the external memory 1542. Examples of the external memory 1542 include a CD (Compact Disc), a DVD (Digital Versatile Disc), and a memory card.

The input/output device interface 1560 controls the input and output of data by the various input/output devices connected to the main body 110. As already described, the microphone 120, the display 130, the speaker 140, the keyboard 150, and the mouse 160 are connected to the main body 110 as input/output devices. The input/output device interface 1560 therefore controls the input and output of data by these devices.

For example, in the speech recognition apparatus 100, the ROM 1510, the RAM 1520, or the external memory 1542 functions as the acoustic model storage unit 302, the first language model storage unit 304, and the second language model storage unit 306, which are described with reference to FIG. 3 and subsequent figures. The CPU 1505 functions as the speech data acquisition unit 312, described with reference to FIG. 3 and subsequent figures, by executing a speech recognition program stored in the ROM 1510, the RAM 1520, or the external memory 1542 and controlling the microphone 120. By executing the speech recognition program, the CPU 1505 also functions as the speech recognition unit 314, the conversion target character string determination unit 316, the reference character string determination unit 318, and the conversion candidate character string determination unit 320, which are described with reference to FIG. 3 and subsequent figures. By executing the speech recognition program and controlling the display 130 or the speaker 140, the CPU 1505 functions as the output unit 322, described with reference to FIG. 3 and subsequent figures. By executing the speech recognition program and controlling the communication interface 1550, the CPU 1505 functions as the communication unit 330, described with reference to FIG. 13.

The speech recognition program is provided to the user, for example, already installed in the speech recognition apparatus 100. As another example, the speech recognition program may be stored on a computer-readable recording medium, provided to the user, and then installed in the speech recognition apparatus 100. Alternatively, the speech recognition program executed by the CPU 1505 may be provided to the user from an external device via a communication network and installed in the speech recognition apparatus 100.

FIG. 3 shows the functional configuration of the speech recognition apparatus 100. The description here focuses on those functions of the speech recognition apparatus 100 that relate to speech recognition processing. The speech recognition apparatus 100 includes an acoustic model storage unit 302, a first language model storage unit 304, a second language model storage unit 306, a speech data acquisition unit 312, a speech recognition unit 314, a conversion target character string determination unit 316, a reference character string determination unit 318, a conversion candidate character string determination unit 320, and an output unit 322.

The acoustic model storage unit 302 stores a so-called acoustic model, which indicates the correspondence between speech feature values and characters together with statistical information on that correspondence. An example of a speech feature value is the MFCC (Mel Frequency Cepstrum Coefficient). The first language model storage unit 304 stores a so-called language model, which indicates the connection relationships between character strings together with their statistical information. The second language model storage unit 306 stores a language model that indicates at least the connection relationships between character strings.

The connection relationships and the statistical information are explained as follows. For example, the character string "私に" (watashi ni, "to me") can be regarded as the character string "私" ("I") connected to the character string "に" (a particle). Likewise, the character string "私が" (watashi ga) can be regarded as the character string "私" connected to the character string "が" (a particle). The first language model and the second language model therefore indicate the connection relationships between character strings, such as the connection between "私" and "に" or between "私" and "が". The statistical information indicates how probable it is, across a large body of text data, that character strings in units such as morphemes or words are connected. In the above example, this corresponds to the probability that "私" is followed by "に" and the probability that "私" is followed by "が" in a large body of text data. However, the language model is not limited to connection relationships between two morphemes as in this example. It may hold connection relationships that treat "私に" or "私が" as a single character string, and the number of connected strings may be 1, 2, 3, and so on, as long as the speech recognition processing can be carried out.
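As a rough illustration of such statistical information, the sketch below estimates the probability that one morpheme follows another from bigram counts. The function name and the four-sentence corpus are hypothetical and chosen only to keep the arithmetic easy to follow. A real language model would be estimated from a large body of text and would normally apply smoothing.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(second | first) from pre-segmented sentences."""
    pair_counts = Counter()
    first_counts = Counter()
    for morphemes in sentences:
        for first, second in zip(morphemes, morphemes[1:]):
            pair_counts[(first, second)] += 1
            first_counts[first] += 1
    return {
        pair: count / first_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Hypothetical toy corpus: "私" is followed by "が" three times and by "に" once.
corpus = [["私", "に"], ["私", "が"], ["私", "が"], ["私", "が"]]
probs = bigram_probabilities(corpus)
print(probs[("私", "が")], probs[("私", "に")])  # 0.75 0.25
```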

The language model stored in the second language model storage unit 306 is provided separately from the language model stored in the first language model storage unit 304 and supplements it. Its content may be different from or the same as that of the first language model. To distinguish the two, the language model stored in the first language model storage unit 304 is hereinafter referred to as the "first language model", and the language model stored in the second language model storage unit 306 as the "second language model". Saying that the first and second language models are provided separately means that they are logically separate. Physically, they may be provided in the same device or in different devices.

FIG. 4 shows an example of the second language model stored in the second language model storage unit 306. As already described, the second language model indicates at least the connection relationships between character strings. The second language model is, for example, generated in advance by analyzing a database of text the user entered in the past, or updated successively as the user enters text. In addition to the connection relationships between character strings, the second language model may also indicate statistical information. To limit the amount of information it holds, connection relationships with a low statistical probability, or connection relationships that are old in time, may be deleted from the second language model. In the example shown in FIG. 4, the second language model indicates connection relationships between character strings in units of morphemes. For example, the first character string "京都" (Kyoto) is associated with each of "府" (prefecture), "市" (city), "駅" (station), and "の" (a particle) as second character strings that follow it. These connection relationships were added to the second language model by analyzing, for example, the character strings "京都府の県庁所在地は京都市です" ("The prefectural capital of Kyoto Prefecture is Kyoto City") and "京都駅は京都の中心ですか" ("Is Kyoto Station the center of Kyoto?"), which the user entered in the past, and extracting the connection relationships between the morpheme-unit character strings they contain. By forming a second language model specific to the user from the character strings the user entered in the past in this way, character strings that the user is likely to use again can be presented as conversion candidate character strings.
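The construction of such a model from the two example sentences can be sketched as follows. This is an illustration only: the sentences are segmented by hand here, whereas an actual system would need a morphological analyzer, and the function name and dictionary layout are assumptions.

```python
from collections import defaultdict


def build_second_language_model(sentences):
    """Map each morpheme to the distinct morphemes observed right after it."""
    model = defaultdict(list)
    for morphemes in sentences:
        for first, second in zip(morphemes, morphemes[1:]):
            if second not in model[first]:
                model[first].append(second)
    return dict(model)


# The two past inputs from the example, segmented by hand (assumption).
past_inputs = [
    ["京都", "府", "の", "県庁", "所在地", "は", "京都", "市", "です"],
    ["京都", "駅", "は", "京都", "の", "中心", "です", "か"],
]
model = build_second_language_model(past_inputs)
print(model["京都"])  # ['府', '市', '駅', 'の']
```

Storing a count or a timestamp next to each successor would allow the pruning of rare or old connection relationships mentioned above.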

FIG. 5 shows the procedure of processing by the speech recognition apparatus 100. The functional units shown in FIG. 3 are described in detail below, following the processing procedure shown in FIG. 5.

(Step S502)
The speech data acquisition unit 312 acquires the speech data to be subjected to speech recognition processing. Specifically, the speech data acquisition unit 312 acquires the speech signal input from the microphone 120 as the speech data to be processed. If speech data is already stored in a storage medium such as a memory, the speech data acquisition unit 312 may instead acquire the speech data stored in that storage medium as the speech data to be processed.

(Step S504)
The speech recognition unit 314 performs speech recognition processing based on the acoustic model stored in the acoustic model storage unit 302 and the first language model stored in the first language model storage unit 304, thereby recognizing the speech represented by the speech data acquired in step S502, and generates a plausible character string as the recognized character string indicating the recognition result. An example of speech recognition processing by the speech recognition unit 314 is shown in FIG. 6. In this example, the speech recognition unit 314 generates the recognized character string "今から京都的に行きます" from speech data of the utterance "今から京都駅に行きます" ("I am going to Kyoto Station now"). That is, the speech recognition unit 314 has misrecognized "駅" (eki, "station") as "的" (teki). Various reasons are possible, for example that the word "駅" is not included in the first language model, or that noise in the speech data distorted the speech so that "駅" was not chosen as the recognition result or as one of its candidates. The configuration of the acoustic model, the configuration of the first language model, and the specific method of speech recognition processing by the speech recognition unit 314 can take various forms. In this embodiment it is sufficient that a recognized character string can be generated from speech data, so any of these may be used as long as that is achieved.

(Step S506)
The speech recognition unit 314 calculates the reliability of speech recognition for each of the character strings that make up the recognized character string generated in step S504. FIG. 7 shows an example: the reliability calculated by the speech recognition unit 314 for each of the morphemes that make up the recognized character string "今から京都的に行きます". The reliability here indicates how likely it is that the character string was recognized correctly from the speech data. The higher the reliability, the more likely it is that the character string was recognized correctly. In the example shown in FIG. 7, a reliability is shown for each of the morphemes "今", "から", "京都", "的", "に", "行き", and "ます" that make up the recognized character string. The reliability of "今" is 95, that of "から" is 92, that of "京都" is 80, that of "的" is 50, that of "に" is 70, that of "行き" is 93, and that of "ます" is 93.

たとえば、音声認識部３１４は、音声データの特徴量と音響モデルの特徴量との一致度が高いほど、その文字列の信頼度を高く算出する。また、音声認識部３１４は、第１言語モデルに示されている統計情報に基づいて、この文字列の出現確率や出現頻度が高いほど、この文字列の信頼度を高く算出する。さらに、音響モデルと言語モデルの双方を参照し複合的に信頼度を決定することも考えられる。たとえば、「京都」、「的」、「に」、「行き」、「ます」からなる認識文字列について、「京都」と「に」と「行き」と「ます」とが組み合わせてよく利用される場合は、これらの文字列の出現確率が高いということであるから、これらの信頼度を高める。一方、「的」については、「京都」、「行き」、「ます」との組み合わせはほとんど使用されないとすると、この文字列「的」の出現確率が低いということであるから、この文字列「的」の信頼度を低める。 For example, the speech recognition unit 314 calculates a higher reliability for a character string as the degree of coincidence between the feature amount of the speech data and the feature amount of the acoustic model is higher. In addition, based on the statistical information indicated in the first language model, the speech recognition unit 314 calculates a higher reliability for a character string as its appearance probability or appearance frequency is higher. The reliability may also be determined in a combined manner with reference to both the acoustic model and the language model. For example, for a recognized character string consisting of “京都”, “的”, “に”, “行き”, and “ます”, if “京都”, “に”, “行き”, and “ます” are often used in combination, the appearance probability of these character strings is high, so their reliability is raised. On the other hand, if “的” is rarely used in combination with “京都”, “行き”, and “ます”, the appearance probability of the character string “的” is low, so its reliability is lowered.

上記以外にも、文法上の不自然さが少ないものほど信頼性を高く算出したり、同音異義語が多いほど、信頼性を低く算出したりするようにしてもよい。要するに、間違っている可能性が低いものであるほど、その文字列の信頼度を高く算出すればよく、その方法はどのようなものであってもよい。 In addition to the above, the reliability may be calculated higher as the grammatical unnaturalness is smaller, or the reliability may be calculated lower as the number of homonyms is higher. In short, the lower the possibility of being wrong, the higher the reliability of the character string may be calculated, and any method may be used.

（ステップＳ５０８）
変換対象文字列決定部３１６が、ステップＳ５０４で生成された認識文字列のうちの、変換対象とする変換対象文字列を決定する。たとえば、変換対象文字列決定部３１６は、音声認識部３１４が生成した認識文字列のうちの、認識結果としての信頼度が所定の閾値よりも低い文字列、あるいはその組み合わせからなる文字列を、変換対象文字列として決定する。この方法を適用した場合、間違っている可能性の高い文字列を、変換対象文字列として決定することができる。また、変換対象文字列をユーザに選択させるようなことがないので、ユーザの操作負担を軽減することができる。たとえば、所定の閾値として「６０」が設定されており、図７に示したとおり、「今から京都的に行きます」という認識文字列を構成する、「今」、「から」、「京都」、「的」、「に」、「行き」、「ます」の各形態素のそれぞれについての信頼度が算出されている場合、変換対象文字列決定部３１６は、これら形態素のうち、信頼度として「６０」よりも低い「５０」が算出されている「的」を、変換対象文字列として決定する。 (Step S508)
The conversion target character string determination unit 316 determines, from the recognized character string generated in step S504, a conversion target character string to be converted. For example, the conversion target character string determination unit 316 determines, as the conversion target character string, a character string in the recognized character string generated by the speech recognition unit 314 whose reliability as a recognition result is lower than a predetermined threshold, or a character string formed by combining such character strings. With this method, a character string that is highly likely to be wrong can be determined as the conversion target character string. Moreover, because the user is not required to select the conversion target character string, the operation burden on the user can be reduced. For example, suppose that “60” is set as the predetermined threshold and that, as shown in FIG. 7, a reliability has been calculated for each of the morphemes “今”, “から”, “京都”, “的”, “に”, “行き”, and “ます” constituting the recognized character string “今から京都的に行きます”. In this case, the conversion target character string determination unit 316 determines, as the conversion target character string, the morpheme “的”, whose calculated reliability “50” is lower than “60”.
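The threshold rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_conversion_targets` and the list-of-pairs data layout are hypothetical and do not appear in the specification; the reliability values and the threshold of 60 are those of the FIG. 7 example.

```python
# Illustrative sketch only: the function name and data layout are assumptions,
# not part of the described apparatus.
def select_conversion_targets(morphemes, threshold=60):
    """Return (index, surface) for each morpheme whose reliability is below threshold."""
    return [
        (index, surface)
        for index, (surface, reliability) in enumerate(morphemes)
        if reliability < threshold
    ]


# Morphemes and reliabilities from the FIG. 7 example.
recognized = [
    ("今", 95), ("から", 92), ("京都", 80), ("的", 50),
    ("に", 70), ("行き", 93), ("ます", 93),
]

targets = select_conversion_targets(recognized)
print(targets)  # [(3, '的')]
```

Returning the index together with the surface string is a design choice of this sketch; it lets later steps locate the neighbouring strings of the conversion target.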

他の例として、変換対象文字列決定部３１６は、音声認識部３１４が生成した認識文字列のうちの、任意の文字列をユーザに指定させ、ユーザによって指定された文字列を、変換対象文字列として決定するようにしてもよい。たとえば、変換対象文字列決定部３１６は、音声認識部３１４が生成した認識文字列をディスプレイ１３０に表示させる。ユーザは、ディスプレイ１３０に表示された認識文字列に対して、キーボード１５０またはマウス１６０を用いて、誤認識されたと判断した任意の文字列を指定する。そして、変換対象文字列決定部３１６は、このようにしてユーザによって指定された任意の文字列を、変換対象文字列として決定する。この方法を適用した場合、変換対象文字列をユーザが選択するので、間違っていることが確実な文字列を、変換対象文字列として決定することができる。すなわち、高い精度で変換対象文字列を決定することができる。なお、ユーザによる任意の文字列の指定方法は、上記したものに限らない。たとえば、ディスプレイ１３０の表面にタッチパネルが設けられている場合、このタッチパネルによってユーザが任意の文字列を指定するようにしてもよい。 As another example, the conversion target character string determination unit 316 may have the user designate an arbitrary character string in the recognized character string generated by the speech recognition unit 314, and determine the character string designated by the user as the conversion target character string. For example, the conversion target character string determination unit 316 causes the display 130 to display the recognized character string generated by the speech recognition unit 314. Using the keyboard 150 or the mouse 160, the user designates, in the recognized character string displayed on the display 130, any character string that the user judges to have been erroneously recognized. The conversion target character string determination unit 316 then determines the character string designated by the user in this way as the conversion target character string. With this method, because the user selects the conversion target character string, a character string that is certain to be wrong can be determined as the conversion target character string. That is, the conversion target character string can be determined with high accuracy. Note that the method by which the user designates a character string is not limited to the above. For example, when a touch panel is provided on the surface of the display 130, the user may designate a character string using the touch panel.

（ステップＳ５１０）
参照文字列決定部３１８が、ステップＳ５０４で生成された認識文字列のうちの、ステップＳ５０８で決定された変換対象文字列の前または後ろに接続された一部の文字列を参照文字列として決定する。具体的には、参照文字列決定部３１８は、音声認識部３１４が生成した認識文字列の前または後ろに接続された文字列のうちの、予め定められた条件に合致する文字列を、参照文字列として決定する。この条件には、変換対象文字列を基準とした参照文字列の方向、参照文字列を構成する文字列の単位、および参照文字列を構成する文字列の数が含まれる。たとえば、認識文字列を基準とした参照文字列の方向には、「前方」、「後方」、または「前方と後方の双方」のいずれかが設定される。また、参照文字列を構成する文字列の単位には、「文字」、「単語」、「形態素」、「文節」等が設定される。また、参照文字列を構成する文字列の数としては、「１」等の任意の整数が設定される。これらが組み合わされて、たとえば「変換対象文字列の前方の１形態素を参照文字列とする」といった条件が予めメモリ等の記憶媒体に格納されているのである。この「変換対象文字列の前方の１形態素を参照文字列とする」という条件によれば、たとえば、上記したとおり、「今から京都的に行きます」という認識文字列のうちの、「的」が変換対象文字列として決定された場合、この「的」の前方にある１つの形態素である「京都」が、参照文字列として決定される。また、「変換対象文字列の前方および後方の１文字を参照文字列とする」という条件によれば、この「的」の前方の１文字である「都」と、後方の１文字である「に」とが、参照文字列として決定される。 (Step S510)
The reference character string determination unit 318 determines, as a reference character string, a part of the recognized character string generated in step S504 that is connected before or after the conversion target character string determined in step S508. Specifically, the reference character string determination unit 318 determines, as the reference character string, a character string that satisfies a predetermined condition from among the character strings connected before or after the conversion target character string in the recognized character string generated by the speech recognition unit 314. This condition includes the direction of the reference character string relative to the conversion target character string, the unit of the character strings constituting the reference character string, and the number of character strings constituting the reference character string. For example, the direction of the reference character string is set to “front”, “rear”, or “both front and rear”. The unit of the character strings constituting the reference character string is set to “character”, “word”, “morpheme”, “phrase (bunsetsu)”, or the like. The number of character strings constituting the reference character string is set to an arbitrary integer such as “1”. These are combined into a condition such as “use the one morpheme in front of the conversion target character string as the reference character string”, which is stored in advance in a storage medium such as a memory. Under this condition, for example, when “的” in the recognized character string “今から京都的に行きます” has been determined as the conversion target character string as described above, “京都”, the one morpheme in front of “的”, is determined as the reference character string. Under the condition “use the one character in front of and the one character behind the conversion target character string as the reference character strings”, “都”, the one character in front of “的”, and “に”, the one character behind it, are determined as the reference character strings.
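As a rough illustration of how a condition made up of a direction, a unit, and a count might be applied, consider the following Python sketch. The function `select_reference` and the direction labels are hypothetical names chosen for this example; the unit is represented simply by how the recognized character string has already been segmented before it is passed in.

```python
# Illustrative sketch only: names are assumptions. `units` is the recognized
# character string already segmented in the configured unit (morphemes,
# characters, and so on).
def select_reference(units, target_index, direction="front", count=1):
    """Return the reference units adjacent to the conversion target."""
    front = units[max(0, target_index - count):target_index]
    rear = units[target_index + 1:target_index + 1 + count]
    if direction == "front":
        return front
    if direction == "rear":
        return rear
    if direction == "both":
        return front + rear
    raise ValueError(f"unknown direction: {direction}")


# Condition "one morpheme in front of the conversion target".
morphemes = ["今", "から", "京都", "的", "に", "行き", "ます"]
by_morpheme = select_reference(morphemes, morphemes.index("的"), "front", 1)

# Condition "one character in front of and one character behind the target".
characters = list("今から京都的に行きます")
by_character = select_reference(characters, characters.index("的"), "both", 1)

print(by_morpheme)   # ['京都']
print(by_character)  # ['都', 'に']
```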

（ステップＳ５１２）
変換候補文字列決定部３２０が、第２言語モデルを参照して、当該第２言語モデルにおいてステップＳ５１０で決定された参照文字列との接続関係が示されている文字列を、ステップＳ５０８で決定された変換対象文字列を変換する候補の変換候補文字列として決定する。たとえば、上記したとおり、変換対象文字列として「的」が決定され、参照文字列として「京都」が決定されたとする。そして、図４に示すとおり、第２言語モデルにおいて、文字列「京都」と、「府」、「市」、「駅」、「の」のそれぞれとの対応関係が示されているとする。この場合、変換候補文字列決定部３２０は、これら「府」、「市」、「駅」、「の」のそれぞれを変換候補文字列として決定する。変換候補文字列には、変換対象文字列と同じ文字列が含まれていてもよい。たとえば、上記例では、変換対象文字列「的」に対し、変換候補文字列「的」が含まれていてもよい。 (Step S512)
The conversion candidate character string determination unit 320 refers to the second language model and determines, as conversion candidate character strings that are candidates for converting the conversion target character string determined in step S508, the character strings whose connection relationship with the reference character string determined in step S510 is indicated in the second language model. For example, assume that, as described above, “的” has been determined as the conversion target character string and “京都” as the reference character string. Assume also that, as shown in FIG. 4, the second language model indicates a correspondence between the character string “京都” and each of “府” (prefecture), “市” (city), “駅” (station), and “の” (a particle). In this case, the conversion candidate character string determination unit 320 determines each of “府”, “市”, “駅”, and “の” as a conversion candidate character string. The conversion candidate character strings may include the same character string as the conversion target character string. For example, in the above example, the conversion candidate character strings for the conversion target character string “的” may include “的”.
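One simple way to picture this lookup is a mapping from a reference character string to the character strings that follow it. The sketch below assumes such a dictionary representation; the specification does not state how the second language model is stored, so the structure and the name `conversion_candidates` are assumptions made for illustration.

```python
# Illustrative sketch only: the dictionary layout is an assumed representation
# of the second language model (reference string -> strings that follow it).
second_language_model = {
    "京都": ["府", "市", "駅", "の"],
}


def conversion_candidates(model, reference):
    """Return the strings whose connection to `reference` is recorded in the model."""
    return list(model.get(reference, []))


candidates = conversion_candidates(second_language_model, "京都")
print(candidates)  # ['府', '市', '駅', 'の']
```

A reference character string that is absent from the model yields an empty list in this sketch, in which case no conversion candidate would be offered.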

ここで、変換候補文字列決定部３２０は、これら複数の変換候補文字列を、変換対象文字列を変換する候補の変換候補文字列として決定するようにしてもよい。また、変換候補文字列決定部３２０は、変換候補文字列の出力数を抑えるべく、これら複数の変換候補文字列のうちの一部を、変換対象文字列を変換する候補の変換候補文字列として決定するようにしてもよい。 Here, the conversion candidate character string determination unit 320 may determine the plurality of conversion candidate character strings as candidate conversion candidate character strings for converting the conversion target character string. Further, the conversion candidate character string determination unit 320 sets a part of the plurality of conversion candidate character strings as candidate conversion candidate character strings for converting the conversion target character strings in order to suppress the number of output conversion candidate character strings. It may be determined.

たとえば、変換候補文字列決定部３２０は、参照文字列に対応付けられている複数の文字列のそれぞれについて、変換対象文字列との相関度を算出し、算出した相関度に基づいて、複数の文字列の中から、変換候補文字列を決定してもよい。たとえば、変換候補文字列決定部３２０は、変換対象文字列との相関度の最も高い文字列を変換候補文字列として決定してもよい。また、変換候補文字列決定部３２０は、変換対象文字列との相関度が閾値よりも高い文字列を変換候補文字列として決定してもよい。また、変換対象文字列との相関度の高い順に所定数の文字列を変換候補文字列として決定してもよい。また、変換対象文字列との相関度が閾値以上の文字列を変換候補文字列として決定してもよい。 For example, the conversion candidate character string determination unit 320 calculates the degree of correlation with the conversion target character string for each of the plurality of character strings associated with the reference character string, and based on the calculated degree of correlation, A conversion candidate character string may be determined from the character string. For example, the conversion candidate character string determination unit 320 may determine the character string having the highest degree of correlation with the conversion target character string as the conversion candidate character string. Further, the conversion candidate character string determination unit 320 may determine a character string having a degree of correlation with the conversion target character string higher than a threshold as a conversion candidate character string. Alternatively, a predetermined number of character strings may be determined as conversion candidate character strings in descending order of correlation with the conversion target character string. Moreover, you may determine the character string whose correlation degree with a conversion object character string is more than a threshold value as a conversion candidate character string.

図８は、変換候補文字列の相関度の一例を示す。図８は、変換対象文字列「的」に対して、変換候補文字列決定部３２０が決定した「府」、「市」、「駅」、「の」という変換候補文字列のそれぞれについての、変換候補文字列決定部３２０が算出した相関度を示す。ここでいう相関度とは、変換対象文字列の類似度を示すものである。たとえば、変換候補文字列決定部３２０は、変換対象文字列との発音の類似度がより高い変換候補文字列の相関度をより高く算出する。この例では、変換候補文字列決定部３２０は、変換対象文字列との発音の類似度を、変換対象文字列と一致する音素（よみがなを示すローマ字）の数に基づいて算出する。すなわち、変換対象文字列との音素の一致度を、変換対象文字列との発音の類似度として算出する。たとえば、「的」の音素は「ｔ」、「ｅ」、「ｋ」、「ｉ」である。これに対し、「府」の音素は「ｈ」、「ｕ」である。このように、「的」と「府」とでは、一致する音素の数が「０」であるから、変換候補文字列決定部３２０は、この「０」を、変換候補文字列「府」の相関度として決定する。一方、「駅」の音素は「ｅ」、「ｋ」、「ｉ」である。このように、「的」と「駅」とでは、一致する音素の数が「３」であるから、変換候補文字列決定部３２０は、この「３」を、変換候補文字列「駅」の相関度として決定する。このようにして、変換候補文字列決定部３２０は、複数の変換候補文字列のそれぞれの相関度を算出するのである。そして、たとえば、変換候補文字列決定部３２０は、複数の変換候補文字列のうち、相関度が最も高い文字列、閾値よりも高い文字列、あるいは相関度の高い順に所定数の文字列を、変換対象文字列を変換する候補の変換候補文字列として決定したりするのである。なお、変換候補文字列決定部３２０は、変換対象文字列との発音の類似度を、変換対象文字列と一致するよみがなの数に基づいて算出するなど、上記した変換対象文字列との音素の一致度による方法以外の方法によって算出してもよい。 FIG. 8 shows an example of the degrees of correlation of conversion candidate character strings. Specifically, FIG. 8 shows the degree of correlation calculated by the conversion candidate character string determination unit 320 for each of the conversion candidate character strings “府”, “市”, “駅”, and “の” that it determined for the conversion target character string “的”. The degree of correlation here indicates the degree of similarity to the conversion target character string. For example, the conversion candidate character string determination unit 320 calculates a higher degree of correlation for a conversion candidate character string whose pronunciation is more similar to that of the conversion target character string. In this example, the conversion candidate character string determination unit 320 calculates the pronunciation similarity to the conversion target character string based on the number of phonemes (romanized letters of the reading) that match those of the conversion target character string. That is, the degree of phoneme coincidence with the conversion target character string is calculated as the pronunciation similarity to the conversion target character string. For example, the phonemes of “的” are “t”, “e”, “k”, and “i”. In contrast, the phonemes of “府” are “h” and “u”. Since the number of matching phonemes between “的” and “府” is “0”, the conversion candidate character string determination unit 320 determines “0” as the degree of correlation of the conversion candidate character string “府”. On the other hand, the phonemes of “駅” are “e”, “k”, and “i”. Since the number of matching phonemes between “的” and “駅” is “3”, the conversion candidate character string determination unit 320 determines “3” as the degree of correlation of the conversion candidate character string “駅”. In this way, the conversion candidate character string determination unit 320 calculates the degree of correlation of each of the plurality of conversion candidate character strings. Then, for example, the conversion candidate character string determination unit 320 determines, as the conversion candidate character strings for converting the conversion target character string, the character string with the highest degree of correlation, the character strings whose degree of correlation is higher than a threshold, or a predetermined number of character strings in descending order of degree of correlation. Note that the conversion candidate character string determination unit 320 may calculate the pronunciation similarity to the conversion target character string by a method other than the phoneme coincidence described above, for example based on the number of matching kana characters of the reading (yomigana).
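The phoneme-matching count can be sketched as follows. The text does not say exactly how matches are counted (for instance, whether position matters), so this example assumes a multiset intersection over the romanized letters; that choice, and the name `phoneme_match_count`, are assumptions. It does reproduce the two values given above, 0 for “府” and 3 for “駅”.

```python
from collections import Counter


# Illustrative sketch only: counting matches as a multiset intersection of
# romanized letters is an assumption; the text only gives the resulting counts.
def phoneme_match_count(target_phonemes, candidate_phonemes):
    """Count phonemes shared by the target and the candidate, ignoring position."""
    shared = Counter(target_phonemes) & Counter(candidate_phonemes)
    return sum(shared.values())


target = "teki"                      # reading of "的"
readings = {"府": "hu", "駅": "eki"}  # readings as given in the text

scores = {surface: phoneme_match_count(target, reading)
          for surface, reading in readings.items()}
best = max(scores, key=scores.get)

print(scores)  # {'府': 0, '駅': 3}
print(best)    # 駅
```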

変換候補文字列決定部３２０は、算出した相関度をそのまま用いて変換候補文字列を決定するのではなく、算出した相関度に対して、所定の係数を乗じたり、加算するなどしてから、変換候補文字列を決定するようにしてもよい。たとえば、子音の一致数に対して、この子音に応じた係数を乗じたり、加算したりし、母音の一致数に対して、この母音に応じた係数を乗じたり、加算したりして、最終的に相関度を決定してもよい。たとえば、変換対象文字列が「的」（ｔｅｋｉ）で、変換候補文字列が「劇」（ｇｅｋｉ）であれば、音素の一致数のみで判断すると、変換候補文字列が「劇」の相関度は「３」となる。このうち、子音の一致数は「１」である。また、母音の一致数は、「２」である。たとえば、子音に応じた係数が「１」であり、母音に応じた係数が「２」であるとする。子音の一致数である「１」に対して、この子音に応じた係数「１」を乗じると、子音の一致数に基づく相関度は「１」となる。また、母音の一致数である「２」に対して、この母音に応じた係数「２」を乗じると、母音の一致数に基づく相関度は「４」となる。これらの相関度を合計することで、最終的な変換候補文字列の相関度を「５」とすることができる。 Instead of using the calculated degree of correlation as is, the conversion candidate character string determination unit 320 may determine the conversion candidate character strings after multiplying the calculated degree of correlation by a predetermined coefficient or adding a predetermined coefficient to it. For example, the final degree of correlation may be determined by multiplying the number of matching consonants by, or adding to it, a coefficient for consonants, and by multiplying the number of matching vowels by, or adding to it, a coefficient for vowels. For example, if the conversion target character string is “的” (teki) and the conversion candidate character string is “劇” (geki), then judging by the number of matching phonemes alone, the degree of correlation of the conversion candidate character string “劇” is “3”. Of these, the number of matching consonants is “1” and the number of matching vowels is “2”. Suppose that the coefficient for consonants is “1” and the coefficient for vowels is “2”. Multiplying the number of matching consonants, “1”, by the consonant coefficient “1” gives a consonant-based degree of correlation of “1”. Multiplying the number of matching vowels, “2”, by the vowel coefficient “2” gives a vowel-based degree of correlation of “4”. Summing these gives a final degree of correlation of “5” for the conversion candidate character string.
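The weighted calculation in the “的” (teki) and “劇” (geki) example can be checked with a short Python sketch. The function name `weighted_correlation` and the multiset-based match counting are assumptions made for this illustration; the coefficients 1 (consonants) and 2 (vowels) are those of the example.

```python
from collections import Counter

VOWELS = set("aiueo")


# Illustrative sketch only: names and the match-counting rule are assumptions.
def weighted_correlation(target, candidate, consonant_weight=1, vowel_weight=2):
    """Weight matching consonants and vowels separately, then sum the two parts."""
    shared = Counter(target) & Counter(candidate)
    vowel_matches = sum(n for phoneme, n in shared.items() if phoneme in VOWELS)
    consonant_matches = sum(n for phoneme, n in shared.items() if phoneme not in VOWELS)
    return consonant_weight * consonant_matches + vowel_weight * vowel_matches


# "teki" vs "geki": 1 matching consonant (k), 2 matching vowels (e, i).
score = weighted_correlation("teki", "geki")
print(score)  # 1 * 1 + 2 * 2 = 5
```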

（ステップＳ５１４）
出力部３２２が、ステップＳ５１２で決定された変換候補文字列を出力する。たとえば、出力部３２２は、変換候補文字列を、認識文字列とともに、ユーザが視認できるよう、ディスプレイ１３０に表示させる。ここで、変換候補文字列が一つの場合、出力部３２２は、認識文字列における変換対象文字列を変換候補文字列に変換するか否かをユーザが選択可能な形態で出力する。また、変換候補文字列が複数の場合、出力部３２２は、これら複数の変換候補文字列を、いずれの変換候補文字列で変換対象文字列を変換するかをユーザが選択可能な形態で出力する。 (Step S514)
The output unit 322 outputs the conversion candidate character string determined in step S512. For example, the output unit 322 displays the conversion candidate character string together with the recognized character string on the display 130 so that the user can view it. Here, when there is one conversion candidate character string, the output unit 322 outputs it in a form in which the user can select whether or not to convert the conversion target character string in the recognized character string into the conversion candidate character string. When there are a plurality of conversion candidate character strings, the output unit 322 outputs them in a form in which the user can select which conversion candidate character string is used to convert the conversion target character string.

ここで、変換候補文字列決定部３２０が決定した変換候補文字列が一つであれば、出力部３２２は、認識文字列における変換対象文字列をこの変換候補文字列に変換して、変換後の認識文字列を出力してもよい。また、変換候補文字列が複数であれば、出力部３２２は、これら複数の変換候補文字列を、相関度の高い順に出力するようにしてもよい。この場合、出力部３２２は、相関度の最も高い変換候補文字列だけを出力するようにしてもよく、相関度の高い順に所定数の変換候補文字列を出力するようにしてもよく、相関度の高い順に相関度が閾値以上の変換候補文字列を出力するようにしてもよい。また、出力部３２２は、認識文字列における変換対象文字列を相関度の最も高い変換候補文字列に変換して、変換後の認識文字列を出力してもよい。 Here, if there is one conversion candidate character string determined by the conversion candidate character string determination unit 320, the output unit 322 converts the conversion target character string in the recognized character string into this conversion candidate character string, and after conversion The recognized character string may be output. If there are a plurality of conversion candidate character strings, the output unit 322 may output the plurality of conversion candidate character strings in descending order of correlation. In this case, the output unit 322 may output only the conversion candidate character string having the highest degree of correlation, or may output a predetermined number of conversion candidate character strings in order of the degree of correlation. Conversion candidate character strings having a correlation degree equal to or higher than a threshold may be output in descending order. The output unit 322 may convert the conversion target character string in the recognized character string into a conversion candidate character string having the highest degree of correlation, and output the converted recognized character string.
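The output options described above (descending order of correlation, only the top candidates, only candidates at or above a threshold, or automatic replacement with the best candidate) might be sketched as follows. The function names are hypothetical. Of the scores used here, only those for “府” (0) and “駅” (3) come from the text; the values for “市” and “の” are placeholders for illustration.

```python
# Illustrative sketch only: function names are assumptions, and the scores for
# "市" and "の" are placeholder values, not taken from FIG. 8.
def rank_candidates(scores, top_n=None, min_score=None):
    """Order candidates by descending correlation, optionally filtering."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if min_score is not None:
        ranked = [item for item in ranked if item[1] >= min_score]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [surface for surface, _ in ranked]


def auto_replace(units, target_index, scores):
    """Replace the conversion target with the highest-correlation candidate."""
    best = rank_candidates(scores, top_n=1)
    if not best:
        return list(units)
    return units[:target_index] + best + units[target_index + 1:]


scores = {"府": 0, "市": 1, "駅": 3, "の": 0}
morphemes = ["今", "から", "京都", "的", "に", "行き", "ます"]

print(rank_candidates(scores))                       # ['駅', '市', '府', 'の']
print(rank_candidates(scores, min_score=1))          # ['駅', '市']
print("".join(auto_replace(morphemes, 3, scores)))   # 今から京都駅に行きます
```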

図９は、変換候補文字列の出力例を示す。図９に示す画面９００は、出力部３２２による出力処理によって、ディスプレイ１３０に表示された画面である。この画面９００は、認識文字列に含まれる変換対象文字列を、変換候補文字列に変換するための画面である。この画面９００には、認識文字列として「今から京都的に行きます」が表示されている。このうち、変換対象文字列である「的」については、これが変換対象文字列であることをユーザが認識できるように、太字および下線によって、強調表示されている。また、画面９００には、変換候補文字列として「府」、「市」、「駅」、「の」が表示されている。ここで、図８に示したように、これらの変換候補文字列に対して、予め相関度が求められているから、画面９００において、これらの変換候補文字列は、相関度の高い順に表示されている。ユーザは、複数の変換候補文字列の中から任意の変換候補文字列を選択することで、変換対象文字列を、選択した変換候補文字列に変換して、認識文字列を確定することができる。また、ユーザは、任意の変換候補文字列を選択せずに、「確定」ボタン９１０を選択することで、変換対象文字列を変換せずに、認識文字列「今から京都的に行きます」を確定することができる。 FIG. 9 shows an output example of the conversion candidate character string. A screen 900 illustrated in FIG. 9 is a screen displayed on the display 130 by the output processing by the output unit 322. This screen 900 is a screen for converting a conversion target character string included in the recognized character string into a conversion candidate character string. In this screen 900, “I will go to Kyoto now” is displayed as a recognized character string. Among these, the “target” that is the conversion target character string is highlighted by bold and underline so that the user can recognize that it is the conversion target character string. The screen 900 displays “fu”, “city”, “station”, and “no” as conversion candidate character strings. Here, as shown in FIG. 8, since the correlation degrees are obtained in advance for these conversion candidate character strings, these conversion candidate character strings are displayed on the screen 900 in descending order of the correlation degrees. ing. The user can select an arbitrary conversion candidate character string from a plurality of conversion candidate character strings, thereby converting the conversion target character string into the selected conversion candidate character string and confirming the recognized character string. . 
In addition, by selecting the “Confirm” button 910 without selecting any conversion candidate character string, the user can confirm the recognized character string “今から京都的に行きます” as is, without converting the conversion target character string.

（ステップＳ５１６）
変換対象文字列が変更され、もしくは変更されずに、認識文字列が確定すると、音声認識装置１００は、この認識文字列をユーザが利用できるように、メモリ等の記憶媒体に格納したり、他のアプリケーションの入力文字列としたりして、一連の音声認識処理を終了する。 (Step S516)
When the recognized character string is confirmed, with or without the conversion target character string having been changed, the speech recognition apparatus 100 stores the recognized character string in a storage medium such as a memory, or passes it to another application as an input character string, so that the user can use it, and then ends the series of speech recognition processing.

図１０は、変換対象文字列の変換例を示す。図１０に示す画面９００においては、認識文字列として「今から京都駅に行きます」が表示されている。これは、図９に示した画面９００において、変換候補文字列「駅」をユーザが選択したことにより、変換対象文字列「的」が、変換候補文字列「駅」に変換され、この変換後の認識文字列が表示されたからである。ユーザは、「確定」ボタン９１０を選択することで、この認識文字列「今から京都駅に行きます」を、認識文字列として確定することができる。なお、音声認識装置１００は、すでに説明したとおり、このようにユーザの選択によって変換対象文字列「的」を変換するのではなく、変換対象文字列「的」を、最も相関度の高い変換候補文字列「駅」に自動的に変換し、変換後の認識文字列「今から京都駅に行きます」を最初から表示するようにしてもよい。 FIG. 10 shows a conversion example of the conversion target character string. On the screen 900 shown in FIG. 10, “I will go to Kyoto station now” is displayed as the recognition character string. This is because the conversion target character string “target” is converted into the conversion candidate character string “station” by the user selecting the conversion candidate character string “station” on the screen 900 shown in FIG. This is because the recognized character string is displayed. The user can confirm the recognized character string “I will go to Kyoto Station now” as the recognized character string by selecting the “confirm” button 910. Note that the speech recognition apparatus 100 does not convert the conversion target character string “target” by the user's selection as described above, but converts the conversion target character string “target” to the conversion candidate having the highest correlation. The character string “station” may be automatically converted, and the converted character string “I will go to Kyoto station now” after the conversion may be displayed from the beginning.

以上説明したように、第１実施形態の音声認識装置１００は、変換対象文字列に接続された参照文字列を決定し、第２言語モデルにおいて参照文字列との接続関係が示されている文字列を、変換候補文字列として出力することとした。すなわち、第１実施形態の音声認識装置１００は、音声認識処理で間違って認識された文字列に対する変換候補を、上記音声認識処理とは異なる根拠に従って決定することができるものである。これにより、音声認識処理で参照した第１言語モデルにおいて、変換対象文字列を変換するための正しい文字列が登録されていない場合であっても、この正しい文字列が第２言語モデルに登録されていれば、これを変換候補文字列として出力することができる。よって、第１実施形態の音声認識装置１００は、音声認識処理で間違って認識された文字列に対する訂正候補として、より適切な文字列を出力することができる。 As described above, the speech recognition apparatus 100 according to the first embodiment determines the reference character string connected to the conversion target character string, and the character whose connection relationship with the reference character string is indicated in the second language model. The sequence is output as a conversion candidate character string. That is, the speech recognition apparatus 100 according to the first embodiment can determine conversion candidates for a character string that is erroneously recognized in the speech recognition process according to a different basis from the speech recognition process. Thereby, even if the correct character string for converting the character string to be converted is not registered in the first language model referred to in the speech recognition process, the correct character string is registered in the second language model. If so, it can be output as a conversion candidate character string. Therefore, the speech recognition apparatus 100 according to the first embodiment can output a more appropriate character string as a correction candidate for a character string that is erroneously recognized in the speech recognition process.

（第２実施形態）
次に、第２実施形態を説明する。第１実施形態では、第２言語モデルにおいて文字列間の接続関係が形態素単位で示されていた。これに対し、第２実施形態では、第２言語モデルにおいて文字列間の接続関係がユーザ入力単位で示されている。ユーザがある文字列を入力する場合に、この文字列を複数の部分的な文字列に区切って段階的に入力する場合がある。たとえば、「京都府の県庁所在地は京都市です」という文字列を入力する場合に、この文字列を「京都」、「府の」、「県庁所在地」、「は」、「京都」、「市です」、という複数の部分的な文字列に区切って段階的に入力するといった具合である。 (Second Embodiment)
Next, a second embodiment will be described. In the first embodiment, the second language model indicates the connection relationships between character strings in morpheme units. In contrast, in the second embodiment, the second language model indicates the connection relationships between character strings in user input units. When a user inputs a character string, the user may divide it into a plurality of partial character strings and input them step by step. For example, when inputting the character string “京都府の県庁所在地は京都市です” (“The prefectural capital of Kyoto Prefecture is Kyoto City”), the user may divide it into the partial character strings “京都”, “府の”, “県庁所在地”, “は”, “京都”, and “市です” and input them one at a time.

図１１は、第２実施形態にかかる第２言語モデルの一例を示す。図１１は、第２言語モデル格納部３０６に格納されている第２言語モデルの一例を示すものであり、この例では、第２言語モデルとして、ユーザ入力単位の文字列間の接続関係が示されている。たとえば、図１１に示す例では、第１の文字列「京都」の後方に接続する第２の文字列として、「府の」、「市です」、「駅は」、「の」のそれぞれが対応付けられている。これらの接続関係は、たとえば、ユーザが過去に入力した「京都府の県庁所在地は京都市です」という文字列と、「京都駅は京都の中心ですか」という文字列とが分析されて、これらの文字列に含まれるユーザ入力単位の文字列の接続関係として、第２言語モデルに加えられたものである。 FIG. 11 shows an example of the second language model according to the second embodiment, as stored in the second language model storage unit 306. In this example, the second language model indicates the connection relationships between character strings in user input units. In the example shown in FIG. 11, “府の”, “市です”, “駅は”, and “の” are each associated, as second character strings, with the first character string “京都” that they follow. These connection relationships were added to the second language model by, for example, analyzing the character strings “京都府の県庁所在地は京都市です” (“The prefectural capital of Kyoto Prefecture is Kyoto City”) and “京都駅は京都の中心ですか” (“Is Kyoto Station the center of Kyoto?”) that the user input in the past, and extracting the connection relationships between the user-input-unit character strings they contain.

部分的な文字列の区切りは、たとえば、ユーザがＥＮＴＥＲキーを押したタイミングや、文字列を変換した単位などによって決定される。たとえば、「京都」、ＥＮＴＥＲキーを押下、「府の」、ＥＮＴＥＲキーを押下という順番で入力がなされば、「京都」、「府の」といった単位で部分的な文字列が決定される。また、「きょうとふの」と入力された後に、「京都ふの」、「京都府の」といった順番で部分的な変換がなされた場合も同様に、「京都」、「府の」といった単位で部分的な文字列が決定される。 The boundaries between partial character strings are determined by, for example, the timing at which the user presses the ENTER key or the units in which the character string is converted. For example, if input is made in the order “京都”, ENTER key, “府の”, ENTER key, the partial character strings are determined in the units “京都” and “府の”. Similarly, if “きょうとふの” is input and then partially converted in the order “京都ふの”, “京都府の”, the partial character strings are again determined in the units “京都” and “府の”.
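To illustrate how connection relationships in user input units could accumulate, the following Python sketch records, for each partial character string, the partial character strings that were entered immediately after it. The dictionary layout and the name `add_user_input` are assumptions, as is the segmentation of the second sentence (the text gives the segmentation only for the first sentence).

```python
# Illustrative sketch only: the dictionary layout and function name are
# assumptions, not the stored format of the second language model.
def add_user_input(model, units):
    """Record each adjacent pair of user-input units as (first -> second)."""
    for first, second in zip(units, units[1:]):
        followers = model.setdefault(first, [])
        if second not in followers:
            followers.append(second)


model = {}
# Segmentation given in the text for "京都府の県庁所在地は京都市です".
add_user_input(model, ["京都", "府の", "県庁所在地", "は", "京都", "市です"])
# Assumed segmentation for "京都駅は京都の中心ですか".
add_user_input(model, ["京都", "駅は", "京都", "の", "中心ですか"])

print(model["京都"])  # ['府の', '市です', '駅は', 'の']
```

Under these assumptions the entries recorded for “京都” coincide with the four character strings listed for FIG. 11.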

ここで、「今から京都的に行きます」という認識文字列から、変換対象文字列決定部３１６によって変換対象文字列として「的」が決定され、参照文字列決定部３１８によって参照文字列として「京都」が決定されたとする。そして、図１１に示すとおり、第２言語モデルにおいて、文字列「京都」に後続する文字列として、「府の」、「市です」、「駅は」、「の」のそれぞれが対応付けられているとする。そして、変換候補文字列決定部３２０が、これらの文字列の中から、「駅は」を変換候補文字列として決定したとする。この場合、「今から京都的に行きます」という認識文字列に対し、変換対象文字列「的」が、変換候補文字列「駅は」に変換されてしまうと、変換後の認識文字列は「今から京都駅はに行きます」となり、「は」が余分に含まれたものとなってしまう。この余分な文字を、ユーザが手作業で削除するようにしてもよいが、この場合、変換候補文字列の前後関係を考慮して手作業で不要な文字を削除する必要があり、ユーザの手間となってしまう。 Here, assume that from the recognized character string “今から京都的に行きます”, the conversion target character string determination unit 316 has determined “的” as the conversion target character string and the reference character string determination unit 318 has determined “京都” as the reference character string. Assume also that, as shown in FIG. 11, the second language model associates “府の”, “市です”, “駅は”, and “の” as character strings following the character string “京都”, and that the conversion candidate character string determination unit 320 has determined “駅は” from among them as the conversion candidate character string. In this case, if the conversion target character string “的” in the recognized character string “今から京都的に行きます” is converted into the conversion candidate character string “駅は”, the converted recognized character string becomes “今から京都駅はに行きます”, which contains an extra “は”. The user could delete this extra character manually, but would then have to do so while taking into account the context around the conversion candidate character string, which is burdensome for the user.

そこで、この第２実施形態では、このように変換候補文字列に含まれている余分な文字列を、出力部３２２が自動的に削除してから、変換後の変換候補文字列を出力することとした。以下、その具体的な方法を説明する。なお、変換候補文字列決定部３２０が変換候補文字列を決定するまでの処理は、これまで説明したとおりである。 Therefore, in the second embodiment, after the output unit 322 automatically deletes the extra character string included in the conversion candidate character string in this way, the converted conversion candidate character string is output. It was. Hereinafter, the specific method will be described. The processing until the conversion candidate character string determining unit 320 determines the conversion candidate character string is as described above.

First, a first method and a second method are described, each of which automatically recognizes and deletes unnecessary characters in the conversion candidate character string based on the correlation between the conversion candidate character string and the conversion target character string.

The first method deletes unnecessary characters from the conversion candidate character string so that its reading (yomigana) matches the number of characters in the reading of the conversion target character string. In the above example, the conversion target character string 「的」 has a reading of two characters (てき), whereas the conversion candidate character string 「駅は」 has a reading of three characters (えきは). Accordingly, the one trailing character 「は」 is deleted from the conversion candidate character string 「駅は」.
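The first method can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `(surface, reading)` pair representation and the function name `trim_by_reading_length` are assumptions introduced here, and the sketch assumes the readings of each unit are already available.

```python
from typing import List, Tuple

# Hypothetical representation: each unit is (surface form, kana reading).
Unit = Tuple[str, str]


def trim_by_reading_length(candidate: List[Unit], target_reading: str) -> str:
    """Drop trailing units until the candidate's reading is no longer
    than the reading of the conversion target character string."""
    units = list(candidate)
    # Always keep at least one unit so the candidate never becomes empty.
    while len(units) > 1 and sum(len(r) for _, r in units) > len(target_reading):
        units.pop()
    return "".join(surface for surface, _ in units)


# 「的」 is read てき (2 kana); 「駅は」 is read えきは (3 kana).
print(trim_by_reading_length([("駅", "えき"), ("は", "は")], "てき"))  # 駅
```

Trimming whole units rather than single kana keeps the surface form and the reading aligned; that choice is a simplification made for this sketch.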

The second method keeps the portion of the conversion candidate character string that is highly correlated with the conversion target character string and deletes the rest. In the above example, the phonemes (the romanized reading) of the conversion target character string 「的」 are "t", "e", "k", "i", and the phonemes of the conversion candidate character string 「駅は」 are "e", "k", "i", "w", "a". Of the phonemes of 「駅は」, "e", "k", "i" are kept as the phonemes highly correlated with those of 「的」, and the remaining phonemes "w", "a" are deleted. As a result, 「は」 is deleted from the conversion candidate character string 「駅は」.
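One way to illustrate the second method is to treat "highly correlated" as "belongs to the longest run of phonemes shared with the conversion target". That interpretation is an assumption made for this sketch; the description does not specify how correlation is measured. The `(surface, phonemes)` representation and the function name are likewise hypothetical. The matching itself uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical representation: each unit is (surface form, phoneme list).
PhonemeUnit = Tuple[str, List[str]]


def keep_correlated_part(candidate: List[PhonemeUnit], target: List[str]) -> str:
    """Keep only the units of the candidate that overlap the longest
    phoneme run shared with the conversion target."""
    cand = [p for _, phonemes in candidate for p in phonemes]
    matcher = SequenceMatcher(None, cand, target, autojunk=False)
    match = matcher.find_longest_match(0, len(cand), 0, len(target))
    if match.size == 0:
        # Nothing in common: leave the candidate unchanged.
        return "".join(surface for surface, _ in candidate)
    kept, pos = [], 0
    for surface, phonemes in candidate:
        start, end = pos, pos + len(phonemes)
        # Keep the unit if its phoneme span intersects the matched run.
        if start < match.a + match.size and end > match.a:
            kept.append(surface)
        pos = end
    return "".join(kept)


candidate = [("駅", ["e", "k", "i"]), ("は", ["w", "a"])]
print(keep_correlated_part(candidate, ["t", "e", "k", "i"]))  # 駅
```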

Next, a third method is described, which automatically recognizes and deletes unnecessary characters in the conversion candidate character string based on the connection relationship between the conversion candidate character string and the reference character string.

The third method tentatively generates the converted recognized character string and performs a grammar check on it (in particular, on the connection relationship between the conversion candidate character string and the reference character string) to identify and delete unnecessary characters in the conversion candidate character string. In the above example, the conversion target character string 「的」 in the recognized character string 「今から京都的に行きます」 is replaced with the conversion candidate character string 「駅は」 to tentatively generate 「今から京都駅はに行きます」 as the converted recognized character string. A grammar check on this string shows that 「は」 is unnecessary, so 「は」 is deleted from the conversion candidate character string 「駅は」.
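The description does not say how the grammar check is performed. As a deliberately simplified stand-in, the sketch below checks only the junction between the end of the conversion candidate and the morpheme that follows it, using a hand-written set of disallowed morpheme pairs. The pair set, the function name, and the morpheme segmentation are all assumptions for illustration; a real system would presumably use a proper grammar or morphological analyzer.

```python
from typing import List, Set, Tuple


def strip_ungrammatical_tail(
    candidate: List[str],
    following: str,
    invalid_pairs: Set[Tuple[str, str]],
) -> List[str]:
    """Drop trailing morphemes of the candidate while the junction with
    the following morpheme is listed as ungrammatical."""
    kept = list(candidate)
    # Always keep at least one morpheme of the candidate.
    while len(kept) > 1 and (kept[-1], following) in invalid_pairs:
        kept.pop()
    return kept


# Hypothetical rule: the particle sequence 「は」+「に」 is not allowed.
rules = {("は", "に")}
# Tentative string 「今から京都駅はに行きます」: candidate 駅/は is followed by に.
print("".join(strip_ungrammatical_tail(["駅", "は"], "に", rules)))  # 駅
```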

As described above, the speech recognition apparatus of the second embodiment automatically recognizes unnecessary characters in the conversion candidate character string based on the correlation between the conversion candidate character string and the conversion target character string, or on the connection relationship between the conversion candidate character string and the reference character string, deletes them, and outputs the conversion candidate character string after deletion. This saves the user the trouble of, for example, manually deleting unnecessary characters while considering the context around the conversion candidate character string.

(Third embodiment)
Next, a third embodiment is described. In the first embodiment, when the second language model indicates connection relationships between the reference character string and a plurality of character strings, some or all of those character strings are output as conversion candidate character strings. However, when an enormous number of character strings have a connection relationship with the reference character string, outputting all of them as conversion candidate character strings would confuse the user. Outputting only some of them is also problematic: determining the conversion candidate character strings from among an enormous number of character strings takes time, and an appropriate conversion candidate character string cannot be determined.

FIG. 12 shows an example of the second language model according to the third embodiment, as stored in the second language model storage unit 306. In the example shown in FIG. 12, the second character string 「に」 (ni) has connection relationships with an enormous number of first character strings, such as 「東京」 (Tokyo), 「京都」 (Kyoto), 「教頭」 (vice principal), 「犬」 (dog), 「音声」 (speech), 「学校」 (school), and 「どこか」 (somewhere). Consequently, if this second character string 「に」 is used as the reference character string and all of the first character strings connected to it are output as conversion candidate character strings, the user would be confused. If only some of those first character strings are output, determining the conversion candidate character strings takes time, and an appropriate conversion candidate character string cannot be determined.

Thus, an appropriate conversion candidate character string sometimes cannot be determined from a single reference character string. In the third embodiment, therefore, a plurality of reference character strings covering different ranges of the character string are determined, and the conversion candidate character string for the conversion target character string is determined using whichever of those reference character strings yields appropriate conversion candidate character strings. As one specific example, the following describes a case in which, when an enormous number of character strings have a connection relationship with the reference character string as described above, the range of the reference character string is expanded to form a new reference character string, and some or all of the character strings connected to the new reference character string are then output as conversion candidate character strings.

In the third embodiment, when the reference character string determined by the reference character string determination unit 318 has connection relationships with a predetermined number or more of character strings in the second language model, the reference character string determination unit 318 expands the range of the recognized character string that is used as the reference character string and determines the expanded character string as a new reference character string. The conversion candidate character string determination unit 320 then refers to the second language model and determines, as conversion candidate character strings, the character strings whose connection relationship with the new reference character string is indicated in the second language model.

For example, suppose that the reference character string determination unit 318, following the condition "use the one morpheme following the conversion target character string as the reference character string", determines 「に」, which follows the conversion target character string 「的」, as the reference character string from the recognized character string 「今から京都的に行きます」.

In this case, the conversion candidate character string determination unit 320 refers to the second language model and determines, as conversion candidate character strings, the first character strings whose connection relationship with the reference character string 「に」 is indicated, that is, the character strings that are followed by 「に」 in the second language model. According to the second language model shown in FIG. 12, an enormous number of character strings, such as 「東京」, 「京都」, 「教頭」, 「犬」, 「音声」, 「学校」, and 「どこか」, are tentatively determined as conversion candidate character strings.

If the number of tentatively determined conversion candidate character strings is smaller than a predetermined number, some or all of them are determined as the final conversion candidate character strings. If the number is larger than the predetermined number, the reference character string determination unit 318 expands the range of the recognized character string 「今から京都的に行きます」 that is used as the reference character string and determines the expanded character string as a new reference character string. Specifically, the reference character string determination unit 318 increases the number of morphemes in the condition "use the one morpheme following the conversion target character string as the reference character string". For example, it increases the number of morphemes by one, changing the condition to "use the two morphemes following the conversion target character string as the reference character string", and takes the matching string 「に行き」 (ni iki) as the new reference character string.

In this case, the conversion candidate character string determination unit 320 refers to the second language model and determines, as conversion candidate character strings, the first character strings whose connection relationship with the reference character string 「に行き」 is indicated, that is, the character strings that are followed by 「に行き」 in the second language model. According to the second language model shown in FIG. 12, the four character strings 「東京」, 「京都」, 「学校」, and 「どこか」 are tentatively determined as conversion candidate character strings.

Again, if the number of tentatively determined conversion candidate character strings is smaller than the predetermined number, some or all of them are determined as the final conversion candidate character strings. If the number is still larger than the predetermined number, the reference character string determination unit 318 further expands the range of the recognized character string 「今から京都的に行きます」 that is used as the reference character string and determines the expanded character string as a new reference character string. For example, it increases the number of morphemes in the changed condition "use the two morphemes following the conversion target character string as the reference character string" by one more, changing the condition to "use the three morphemes following the conversion target character string as the reference character string", and takes the matching string 「に行きます」 (ni ikimasu) as the new reference character string.

In this case, the conversion candidate character string determination unit 320 refers to the second language model and determines, as conversion candidate character strings, the first character strings whose connection relationship with the reference character string 「に行きます」 is indicated, that is, the character strings that are followed by 「に行きます」 in the second language model. According to the second language model shown in FIG. 12, the two character strings 「東京」 and 「京都」 are tentatively determined as conversion candidate character strings.

As described above, the speech recognition apparatus of the third embodiment can narrow an enormous number of conversion candidate character strings down to an appropriate number of conversion candidate character strings with appropriate content through the simple process of expanding the range of the reference character string until the number of character strings associated with the reference character string falls below a predetermined number.
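The narrowing procedure of the third embodiment can be sketched as a short loop. The dictionary-based model, the function name `narrow_candidates`, and the threshold value are assumptions for illustration only; the toy entries loosely follow the FIG. 12 example described above and are not the actual model contents.

```python
from typing import Dict, List, Tuple


def narrow_candidates(
    following_morphemes: List[str],
    model: Dict[str, List[str]],
    max_candidates: int,
) -> Tuple[str, List[str]]:
    """Expand the reference string one morpheme at a time until fewer than
    max_candidates strings are associated with it (or morphemes run out)."""
    reference = ""
    candidates: List[str] = []
    for n in range(1, len(following_morphemes) + 1):
        reference = "".join(following_morphemes[:n])
        candidates = model.get(reference, [])
        if len(candidates) < max_candidates:
            break
    return reference, candidates


# Hypothetical model: reference string -> strings that precede it.
toy_model = {
    "に": ["東京", "京都", "教頭", "犬", "音声", "学校", "どこか"],
    "に行き": ["東京", "京都", "学校", "どこか"],
    "に行きます": ["東京", "京都"],
}

# Morphemes following the conversion target 「的」 in 「今から京都的に行きます」.
reference, candidates = narrow_candidates(["に", "行き", "ます"], toy_model, 5)
print(reference, candidates)  # に行き ['東京', '京都', '学校', 'どこか']
```

With a threshold of 3 instead of 5, the same call would continue to the three-morpheme reference string 「に行きます」 and return the two remaining candidates.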

(Fourth embodiment)
Next, a fourth embodiment is described. Some of the functional units described in the embodiments may be provided in an external information processing apparatus, so that the speech recognition apparatus 100 refers to some data held by the external information processing apparatus or has the external information processing apparatus perform some of the processing. That is, the speech recognition apparatus 100 is not limited to one that performs speech recognition processing on its own as in the first embodiment; it may perform speech recognition processing by communicating with another information processing apparatus. The fourth embodiment describes an example of the speech recognition apparatus 100 configured in this way.

FIG. 13 shows the functional configuration of the speech recognition apparatus 100 according to the fourth embodiment. In the fourth embodiment, the speech recognition apparatus 100 serves as a client device 11 and, together with a server device 12, constitutes a speech recognition system 10. In response to a request from the speech recognition apparatus 100, the server device 12 performs speech recognition processing and transmits the result to the speech recognition apparatus 100.

In the fourth embodiment, the speech recognition apparatus 100 includes a communication unit 330 that communicates with the server device 12. The acoustic model storage unit 302, the first language model storage unit 304, and the speech recognition unit 314 are provided in the server device 12. Accordingly, through communication with the server device 12 via the communication unit 330, the speech recognition apparatus 100 passes speech data to the speech recognition unit 314 of the server device 12, has it perform speech recognition processing, and receives the recognition result (the recognized character string) from the server device 12.

According to the speech recognition system 10 configured in this way, because the acoustic model storage unit 302, the first language model storage unit 304, and the speech recognition unit 314 are provided in the server device 12, a plurality of client devices 11 in the speech recognition system can, for example, share the speech recognition processing. As a result, speech recognition results do not vary from one client device 11 to another, and uniform speech recognition results can be provided to the plurality of client devices 11. In addition, because the acoustic model and the first language model are centralized, they are easy to maintain.

Furthermore, because the second language model storage unit 306 is provided in the speech recognition apparatus 100 rather than in the server device 12, when a plurality of client devices 11 are provided in the speech recognition system, the content of the second language model of each client device 11 can, for example, differ according to the tendencies of that client device's user. Thus, when a client device 11 requests speech recognition processing, the speech recognition system 10 can apply the second language model that reflects the tendencies of that client device's user and determine conversion candidate character strings from among the character strings contained in text that the user entered in the past. This provides the user of the client device 11 with more appropriate results.

(Modifications)
The present invention is not limited to the embodiments described above and may be modified as follows. The following modifications may also be combined.

(Modification 1)
In each embodiment, a personal computer having a speech recognition function is used as an example of the speech recognition apparatus. However, the speech recognition apparatus is not limited to this and may be any device, such as a mobile phone, a PDA (Personal Digital Assistant), a navigation device, a portable music player, a notebook PC (Personal Computer), or a home appliance, as long as it can implement a speech recognition function equivalent to that of the speech recognition apparatus 100 described in the embodiments.

(Modification 2)
The third embodiment determines a plurality of reference character strings covering different ranges and uses the one that yields appropriate conversion candidate character strings. As one specific example of such a configuration, it narrows down the number of conversion candidate character strings by expanding the range of the reference character string rearward (toward the text that follows) until the number of character strings associated with the reference character string falls below a predetermined number. However, the configuration is not limited to this. For example, the number of conversion candidate character strings may be narrowed down by expanding the range of the reference character string frontward (toward the text that precedes) or in both directions. When the number of character strings associated with the reference character string is smaller than a lower limit, the number of conversion candidate character strings may be increased by reducing the range of the reference character string until the number exceeds the lower limit. Alternatively, counting only those character strings associated with the reference character string whose degree of correlation is at or above a predetermined degree, the range of the reference character string may be expanded to narrow down the candidates until that count falls below an upper limit, or reduced to increase the candidates until that count exceeds a lower limit. Further, if the degrees of correlation of all the character strings are lower than a certain threshold, the range of the reference character string may be reduced to increase the number of conversion candidate character strings until a character string with a degree of correlation higher than the threshold appears. Conversely, if the degrees of correlation of all the character strings are higher than a certain threshold, the range of the reference character string may be expanded to narrow down the number of conversion candidate character strings until a character string with a degree of correlation lower than the threshold appears.

(Modification 3)
In the fourth embodiment, the speech recognition system 10 may have any device configuration, and any functions may be assigned to each device, as long as the plurality of client devices 11 share the speech recognition processing by the speech recognition unit 314 so that uniform results can be provided to each of them. For example, the speech recognition unit 314 may be provided in the speech recognition apparatus 100 and configured to refer to the acoustic model and the first language model in the acoustic model storage unit 302 and the first language model storage unit 304 provided in the server device 12, through communication with the server device 12 via the communication unit 330. In addition to the speech recognition unit 314, the conversion target character string determination unit 316 and the reference character string determination unit 318 may also be provided in the server device 12. Alternatively, the speech recognition apparatus 100 may be provided separately from the server device 12 and the client devices 11. In that case, the speech recognition apparatus 100 acquires speech data from a client device 11, performs the speech recognition processing described above with reference to the second language model provided in that client device 11, and outputs the recognized character string and the conversion candidate character strings to that client device 11 as the result.

(Modification 4)
In each embodiment, the speech recognition unit 314 may generate a plurality of recognized character strings. For example, from speech data for 「今から京都駅に行きます」 ("I am going to Kyoto Station now"), 「今から京都的に行きます」 is generated as the first-candidate recognized character string and 「今から京都劇に行きます」 as the second-candidate recognized character string.

Here, suppose that the conversion target character string determination unit 316 determines 「的」 (teki) from the first-candidate recognized character string as the first-candidate conversion target character string and 「劇」 (geki) from the second-candidate recognized character string as the second-candidate conversion target character string, and that the reference character string determination unit 318 determines 「京都」, which is common to both, as the reference character string. In this case, the conversion candidate character string determination unit 320 may determine conversion candidate character strings from among the character strings connected to the reference character string 「京都」 based on both their degree of correlation with the first-candidate conversion target character string 「的」 and their degree of correlation with the second-candidate conversion target character string 「劇」.

In the above case, if, for example, 「駅」 (eki), 「市」 (shi), 「府」 (fu), and 「の」 (no) are determined as conversion candidate character strings, the output unit 322 may output the first-candidate recognized character string 「今から京都的に行きます」 as the recognized character string and, as conversion candidate character strings, the second-candidate conversion target character string 「劇」 in addition to 「駅」, 「市」, 「府」, and 「の」. If the second-candidate conversion target character string is 「駅」, the same as one of the conversion candidate character strings, the priority of the conversion candidate character string 「駅」 may be raised before 「駅」, 「市」, 「府」, and 「の」 are output as conversion candidate character strings. In this way, even when 「駅」 was spoken but, because of distortion in the speech, ended up only as the second-candidate conversion target character string, raising its priority increases the likelihood that the text is correctly corrected to 「駅」.
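The handling of the second-candidate conversion target can be illustrated as follows. The description only says that the priority "may be raised"; moving the matching candidate to the front of the list is one possible reading, assumed here for illustration, and the function name is hypothetical.

```python
from typing import List


def merge_second_candidate(candidates: List[str], second_target: str) -> List[str]:
    """Combine the conversion candidates with the second-candidate
    conversion target character string."""
    if second_target in candidates:
        # Already a candidate: raise its priority by moving it to the front.
        return [second_target] + [c for c in candidates if c != second_target]
    # Otherwise offer it as an additional candidate at the end.
    return candidates + [second_target]


print(merge_second_candidate(["市", "駅", "府", "の"], "駅"))  # ['駅', '市', '府', 'の']
print(merge_second_candidate(["駅", "市", "府", "の"], "劇"))  # ['駅', '市', '府', 'の', '劇']
```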

Description of symbols: 100: speech recognition apparatus, 110: main body, 120: microphone, 130: display, 140: speaker, 150: keyboard, 160: mouse, 302: acoustic model storage unit, 304: first language model storage unit, 306: second language model storage unit, 312: speech data acquisition unit, 314: speech recognition unit, 316: conversion target character string determination unit, 318: reference character string determination unit, 320: conversion candidate character string determination unit, 322: output unit, 330: communication unit

Claims

A speech recognition apparatus comprising:
a speech data acquisition unit that acquires speech data;
a speech recognition unit that generates a recognized character string indicating a recognition result by performing speech recognition processing on the speech data with reference to an acoustic model indicating a correspondence between speech features and characters and a first language model indicating connection relationships between character strings;
a conversion target character string determination unit that determines, as a conversion target character string, one of the following within the recognized character string: a character string specified by the user, a character string whose reliability as a recognition result is lower than a predetermined threshold, or a character string formed by combining character strings whose reliability as a recognition result is lower than a predetermined threshold;
a reference character string determination unit that determines, as a reference character string, a character string of a unit or number specified by the user from among the character strings connected immediately before or immediately after the determined conversion target character string in the recognized character string;
a conversion candidate character string determination unit that refers to a second language model, which indicates connection relationships between character strings extracted from character strings previously input by the user, and determines, as a conversion candidate character string that is a candidate for converting the conversion target character string, a character string that does not depend on the recognition result of the speech recognition processing on the speech data and for which a connection relationship with the determined reference character string is indicated; and
an output unit that outputs the determined conversion candidate character string.

The speech recognition apparatus according to claim 1, wherein, when the number of conversion candidate character strings obtained from the second language model for the reference character string is outside a predetermined range,
the reference character string determination unit increases or decreases the number of character strings connected immediately before or immediately after the conversion target character string in the recognized character string, and determines the increased or decreased number of character strings as a new reference character string, and
the conversion candidate character string determination unit determines, as the conversion candidate character string, a character string for which a connection relationship with the new reference character string is indicated in the second language model.
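For illustration only, the adjustment of the reference character string recited in this claim might be sketched as below. Everything concrete here is an assumption rather than part of the claim: the second language model is modeled as a dictionary keyed by tuples of preceding tokens, the range is given by the hypothetical parameters `lo` and `hi`, only the preceding context is used, and the tokenization of the example sentence is invented.

```python
def candidates_with_adaptive_context(words, target_index, model,
                                     n=1, lo=1, hi=5, max_n=3):
    """Look up conversion candidates, lengthening or shortening the
    reference string while the candidate count is outside [lo, hi].

    model: dict mapping a tuple of preceding tokens to candidate strings
    (a hypothetical stand-in for the second language model).
    """
    tried = set()
    reference, result = (), []
    while 1 <= n <= max_n and n not in tried:
        tried.add(n)
        start = max(0, target_index - n)
        reference = tuple(words[start:target_index])
        result = model.get(reference, [])
        if len(result) > hi:
            n += 1  # too many candidates: use a longer reference string
        elif len(result) < lo:
            n -= 1  # too few candidates: use a shorter reference string
        else:
            break   # candidate count is within the predetermined range
    return reference, result


words = ["今", "から", "京都", "的", "に", "行き", "ます"]
model = {
    ("京都",): ["駅", "市", "府", "の", "に", "へ", "で"],
    ("から", "京都"): ["駅", "市", "府", "の"],
}
print(candidates_with_adaptive_context(words, 3, model))
```

In this example the one-token reference "京都" yields seven candidates, which exceeds `hi`, so the reference is extended to "から", "京都" and the four candidates for that longer reference are returned. The loop stops rather than revisiting a length it has already tried.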

The speech recognition apparatus according to claim 1, wherein the output unit converts the conversion target character string in the recognized character string into the conversion candidate character string and outputs the converted recognized character string.

The speech recognition apparatus according to any one of claims 1 to 3, wherein the conversion candidate character string determination unit calculates, for each character string for which a connection relationship with the reference character string is indicated in the second language model, a degree of correlation with the conversion target character string, and determines at least the character string having the highest degree of correlation, or a character string whose degree of correlation is higher than a threshold value, as the conversion candidate character string that is a candidate for converting the conversion target character string.
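As a rough sketch of the selection by degree of correlation, the example below scores each candidate against the conversion target character string. The claim does not define how the degree of correlation is computed; here it is assumed, purely for illustration, to be the similarity ratio of romanized readings given by Python's `difflib.SequenceMatcher`. The function name and the readings are hypothetical.

```python
from difflib import SequenceMatcher


def select_by_correlation(target_reading, candidate_readings, threshold=None):
    """Return the candidate with the highest degree of correlation, or,
    if a threshold is given, every candidate scoring above it.

    candidate_readings: dict mapping a candidate string to its reading.
    Assumption: correlation = SequenceMatcher similarity of the readings.
    """
    scored = [
        (SequenceMatcher(None, target_reading, reading).ratio(), string)
        for string, reading in candidate_readings.items()
    ]
    # Stable sort: ties keep the original candidate order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if threshold is None:
        return [scored[0][1]] if scored else []
    return [string for score, string in scored if score > threshold]


readings = {"駅": "eki", "市": "shi", "府": "fu", "の": "no"}
print(select_by_correlation("teki", readings))                 # highest only
print(select_by_correlation("teki", readings, threshold=0.5))  # above threshold
```

For the conversion target "的" (reading assumed to be "teki"), "駅" ("eki") shares the most phonemes and is selected in both modes.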

The speech recognition apparatus according to any one of claims 1 to 4, wherein the output unit outputs a conversion candidate character string from which phonemes that do not match the phonemes of the conversion target character string have been deleted, or a conversion candidate character string from which unnecessary characters, identified by performing a grammatical check on the connection relationship between the conversion candidate character string and the reference character string, have been deleted.

The speech recognition apparatus according to any one of the preceding claims, further comprising a communication unit that communicates with a server device storing the acoustic model and the first language model,
wherein the speech recognition unit performs the speech recognition processing by referring, through communication with the server device by the communication unit, to the acoustic model and the first language model stored in the server device.

A speech recognition method performed by a speech recognition apparatus, the method comprising:
a speech data acquisition step of acquiring speech data;
a speech recognition step of generating a recognized character string indicating a recognition result by performing speech recognition processing on the speech data with reference to an acoustic model indicating a correspondence between speech features and characters and a first language model indicating connection relationships between character strings;
a conversion target character string determination step of determining, as a conversion target character string, one of the following within the recognized character string: a character string specified by the user, a character string whose reliability as a recognition result is lower than a predetermined threshold, or a character string formed by combining character strings whose reliability as a recognition result is lower than a predetermined threshold;
a reference character string determination step of determining, as a reference character string, a character string of a unit or number specified by the user from among the character strings connected immediately before or immediately after the determined conversion target character string in the recognized character string;
a conversion candidate character string determination step of referring to a second language model, which indicates connection relationships between character strings extracted from character strings previously input by the user, and determining, as a conversion candidate character string that is a candidate for converting the conversion target character string, a character string that does not depend on the recognition result of the speech recognition processing on the speech data and for which a connection relationship with the determined reference character string is indicated; and
an output step of outputting the determined conversion candidate character string.
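The flow of the claimed method, from the recognized character string to the conversion candidates, can be illustrated with a minimal sketch. Several points are assumptions made only for this example: the second language model is built as a simple table of adjacent-token pairs from the user's past input, the reference character string is the single token immediately before the conversion target, and the tokens, confidence values, and function names are hypothetical.

```python
from collections import defaultdict


def build_second_model(previous_inputs):
    """Map each token to the tokens that followed it in text the user
    entered in the past (assumed form of the second language model)."""
    model = defaultdict(list)
    for tokens in previous_inputs:
        for prev, nxt in zip(tokens, tokens[1:]):
            if nxt not in model[prev]:
                model[prev].append(nxt)
    return model


def conversion_candidates(recognized, threshold, model):
    """recognized: list of (token, reliability) pairs from recognition.

    Returns (conversion target, reference string, candidates) for every
    token whose reliability is below the threshold.
    """
    results = []
    for i, (token, reliability) in enumerate(recognized):
        if reliability < threshold and i > 0:
            reference = recognized[i - 1][0]  # token immediately before
            # Candidates come only from the user's past input, not from
            # the recognition result for the current speech data.
            results.append((token, reference, list(model.get(reference, []))))
    return results


history = [["京都", "駅", "に", "行き", "ます"], ["京都", "市", "の", "天気"]]
recognized = [("今", 0.9), ("から", 0.9), ("京都", 0.95), ("的", 0.3),
              ("に", 0.9), ("行き", 0.9), ("ます", 0.9)]
print(conversion_candidates(recognized, 0.5, build_second_model(history)))
```

Here "的" falls below the reliability threshold, "京都" becomes the reference character string, and "駅" and "市" are offered as conversion candidates because the user previously typed them after "京都".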