JP2010055044A

JP2010055044A - Device, method and system for correcting voice recognition result

Info

Publication number: JP2010055044A
Application number: JP2008285550A
Authority: JP
Inventors: Shi Cho; 志鵬張; Nobuhiko Naka; 信彦仲; Yusuke Nakajima; 悠輔中島
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2008-04-22
Filing date: 2008-11-06
Publication date: 2010-03-11
Anticipated expiration: 2028-11-06
Also published as: JP4709887B2; CN101567189B; TW200951940A; TWI427620B; CN101567189A

Abstract

PROBLEM TO BE SOLVED: To provide a device and method for correcting a voice recognition result that can correct a recognition error without time and effort on the user's part when an error arises in the recognition result.

SOLUTION: Feature amount data of the voice are transmitted to a server device 120. Voice recognition is performed in the server device 120, and a receiving section 235 receives the recognition result from the server device 120. An error section specifying section 240 specifies an error section in which a recognition error has arisen, on the basis of a degree of confidence or the like. An error section feature amount extracting section 260 extracts the feature amount data of the error section, and a correction section 270 performs correction processing by performing recognition processing again on the extracted error section.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a speech recognition result correction apparatus, a speech recognition result correction method, and a speech recognition result correction system for correcting speech-recognized data.

As described in Patent Document 1, a technique is known in which speech input at a mobile terminal is output to a server, the server recognizes the speech, and the recognition result is transmitted to the mobile terminal, so that the mobile terminal can obtain the speech recognition result.
Patent Document 1: JP 2003-295893 A

However, no consideration is given to correcting the recognition result when the result produced by the server contains an error. In general, when the recognition result contains an error, the user could correct it by manual input, but this takes considerable effort. For example, the user has to read the recognized sentence, notice the error, specify the erroneous part, and then correct it.

Accordingly, an object of the present invention is to provide a speech recognition result correction apparatus, a speech recognition result correction method, and a speech recognition result correction system that can correct a recognition error without effort on the user's part when the recognition result contains an error.

To solve the above problem, the speech recognition result correction apparatus of the present invention comprises: input means for inputting speech; calculation means for calculating feature amount data based on the speech input by the input means; storage means for storing the feature amount data calculated by the calculation means; acquisition means for acquiring a recognition result for the speech input by the input means; specifying means for specifying, in the recognition result acquired by the acquisition means, an error section in which a recognition error has occurred; and correction means for extracting, from the feature amount data stored in the storage means, the feature amount data corresponding to the error section specified by the specifying means, and correcting the recognition result obtained by the acquisition means by performing re-recognition using the extracted feature amount data.

The speech recognition result correction method of the present invention comprises: an input step of inputting speech; a calculation step of calculating feature amount data based on the speech input in the input step; a storage step of storing the feature amount data calculated in the calculation step; an acquisition step of acquiring a recognition result for the speech input in the input step; a specifying step of specifying, in the recognition result acquired in the acquisition step, an error section in which a recognition error has occurred; and a correction step of extracting, from the feature amount data stored in the storage step, the feature amount data corresponding to the error section specified in the specifying step, and correcting the recognition result obtained in the acquisition step by performing re-recognition using the extracted feature amount data.

According to this invention, the feature amount data of the input speech is stored, and an error section in which a recognition error has occurred is specified in the recognition result for that speech. The recognition result is then corrected by re-recognizing the feature amount data of the specified error section. Because only the necessary part of the recognition result is corrected, correction can be performed simply, without burdening the user, and a correct speech recognition result can be obtained.
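The flow described above (store the feature data, mark an error section, re-recognize only that section) depends on mapping an error section back to the stored feature frames. The following Python sketch illustrates one way this lookup might work. The `Word` record, the per-word start and end times, and the 10 ms frame shift are assumptions made for illustration; the text above does not specify them.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Assumed frame shift in milliseconds; the source does not state one.
FRAME_SHIFT_MS = 10


@dataclass
class Word:
    """Hypothetical record for one recognized word with its time span."""
    text: str
    start_ms: int
    end_ms: int


def extract_error_features(
    features: Sequence[List[float]],
    words: Sequence[Word],
    first: int,
    last: int,
    frame_shift_ms: int = FRAME_SHIFT_MS,
) -> Sequence[List[float]]:
    """Return the stored feature frames covering words[first..last] inclusive."""
    start_frame = words[first].start_ms // frame_shift_ms
    # Ceiling division so a partially covered final frame is included.
    end_frame = -(-words[last].end_ms // frame_shift_ms)
    return features[start_frame:end_frame]
```

The extracted slice would then be handed to a local recognizer in place of the whole utterance, which is what keeps the correction step small.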

In the speech recognition result correction apparatus of the present invention, the acquisition means preferably comprises transmission means for transmitting the speech input by the input means to a speech recognition apparatus and reception means for receiving the recognition result produced by the speech recognition apparatus, and the specifying means preferably specifies an error section in which a recognition error has occurred in the recognition result received by the reception means.

According to this invention, the input speech is transmitted to a speech recognition apparatus, and the recognition result produced by that apparatus is received. An error section in which a recognition error has occurred is then specified in the received recognition result, and the recognition result in the specified error section is corrected. Because only the necessary part of the recognition result is corrected, speech recognition errors can be corrected easily and a correct recognition result can be obtained.

In the speech recognition result correction apparatus of the present invention, the specifying means preferably specifies the error section by accepting a user operation.

According to this invention, the error section can be specified by accepting a user operation, so the error section can be specified more easily and a correct speech recognition result can be obtained.

In the speech recognition result correction apparatus of the present invention, the specifying means preferably determines the error section based on the reliability attached to the recognition result, and specifies the error section so determined.

According to this invention, the error section is determined based on the reliability attached to the recognition result and the determined error section is specified, so the error section can be specified automatically and therefore more easily.
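As an illustration of reliability-based specification, the sketch below groups consecutive low-reliability words into error sections. It assumes that each recognized word carries a reliability score between 0 and 1 and that a fixed threshold separates reliable from unreliable words; the function name and the threshold value of 0.5 are hypothetical and are not taken from the source.

```python
from typing import List, Sequence, Tuple


def find_error_sections(
    confidences: Sequence[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (first, last) word-index pairs for runs of low-confidence words.

    The threshold is an assumed tuning parameter.
    """
    sections: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(confidences):
        if score < threshold:
            if start is None:
                start = i  # a new low-confidence run begins here
        elif start is not None:
            sections.append((start, i - 1))  # the run ended at the previous word
            start = None
    if start is not None:
        sections.append((start, len(confidences) - 1))
    return sections
```

Merging adjacent low-reliability words into one section, rather than treating each word separately, means each contiguous region is re-recognized once.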

In the speech recognition result correction apparatus of the present invention, the specifying means preferably calculates the reliability of the recognition result, determines the error section based on that reliability, and specifies the error section so determined.

According to this invention, the reliability of the recognition result is calculated, the error section is determined based on that reliability, and the determined error section is specified, so the error section can be specified more easily. Furthermore, even when speech recognition is delegated to a server device or the like, the server device does not need to calculate the reliability, so a more convenient apparatus can be provided.

The speech recognition result correction apparatus of the present invention preferably further comprises identifying means for identifying the recognition result that forms at least one word immediately before the error section specified by the specifying means, at least one word immediately after it, or both the immediately preceding and immediately following words. The correction means preferably uses the recognition result identified by the identifying means as a constraint condition and, in accordance with this constraint condition, extracts from the storage means the feature amount data corresponding to a section that includes the words immediately before and after the error section, and performs recognition processing on the extracted feature amount data.

According to this invention, the recognition result forming at least one word immediately before the specified error section, at least one word immediately after it, or both is identified, and recognition processing is performed on the previously stored feature amount data with the identified recognition result as a constraint condition. This enables more accurate recognition processing, and hence a correct speech recognition result.
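One simple reading of using the neighbouring words as a constraint condition is to discard re-recognition hypotheses that do not begin and end with the already trusted words. The sketch below assumes that re-recognition of the widened section yields a scored list of word sequences; the function name, the hypothesis format, and the scores are hypothetical. A real decoder would more likely apply the constraint during the search rather than by filtering afterwards.

```python
from typing import List, Optional, Sequence, Tuple

Hypothesis = Tuple[List[str], float]  # (word sequence, score; higher is better)


def pick_constrained(
    hypotheses: Sequence[Hypothesis],
    left: Optional[str] = None,
    right: Optional[str] = None,
) -> Optional[List[str]]:
    """Return the best-scoring hypothesis consistent with the boundary words.

    `left` is the trusted word just before the error section and `right`
    the trusted word just after it; either may be omitted.
    """

    def satisfies(words: List[str]) -> bool:
        if left is not None and (not words or words[0] != left):
            return False
        if right is not None and (not words or words[-1] != right):
            return False
        return True

    allowed = [h for h in hypotheses if satisfies(h[0])]
    if not allowed:
        return None  # no hypothesis agrees with the constraint
    return max(allowed, key=lambda h: h[1])[0]
```

For example, with `left="to"` and `right="in"`, a higher-scoring hypothesis that starts with a different word is rejected in favour of the best one that keeps both boundary words.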

The speech recognition result correction apparatus of the present invention preferably further comprises identifying means for identifying the recognition result that forms at least one word immediately before the error section specified by the specifying means, at least one word immediately after it, or both the immediately preceding and immediately following words. The correction means preferably uses the recognition result identified by the identifying means as a constraint condition and, in accordance with this constraint condition, extracts from the storage means the feature amount data corresponding to the error section, and performs recognition processing on the extracted feature amount data.

According to this invention, the recognition result forming at least one word immediately before the specified error section, at least one word immediately after it, or both is identified, and recognition processing is performed on the previously stored feature amount data with the identified recognition result as a constraint condition. That is, in this invention, recognition processing can be performed using the feature amount data of the error section alone. This enables more accurate recognition processing, and hence a correct speech recognition result.

The speech recognition result correction apparatus of the present invention preferably further comprises word information identifying means for identifying word information, that is, information for identifying a word, for the words in the recognition result that form at least one word immediately before the error section specified by the specifying means, at least one word immediately after it, or both. The correction means preferably uses the word information identified by the word information identifying means as a constraint condition and, in accordance with this constraint condition, extracts from the storage means the feature amount data corresponding to a section that includes the words immediately before and after the error section, and performs recognition processing on the extracted feature amount data.

According to this invention, more accurate recognition processing can be performed by carrying out the correction with the word information for identifying a word as a constraint condition.

For example, the word information preferably includes one or more of part-of-speech information indicating the part of speech of the word and reading information indicating how the word is read.

The speech recognition result correction apparatus of the present invention preferably further comprises unknown word determination means for determining, based on the word information, whether a word in the recognition result that forms at least one word immediately before the error section specified by the specifying means, at least one word immediately after it, or both, is an unknown word. When the unknown word determination means determines that the word in the recognition result is an unknown word, the correction means preferably corrects the recognition result based on the word information.

According to this invention, when the word is an unknown word, a more accurate speech recognition result can be obtained by performing recognition processing with the word information as a constraint condition.

The speech recognition result correction apparatus of the present invention preferably further comprises connection probability storage means for storing connection probabilities between words. As a result of the correction processing, the correction means preferably generates connection probabilities between the word in the error section and the words before and after it, or either of them, and uses these to update the connection probabilities stored in the connection probability storage means.

According to this invention, connection probabilities between words are stored. Because these probabilities change each time correction processing is performed, calculating and updating them yields a more accurate speech recognition result.
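The update of stored connection probabilities can be pictured with a small bigram model. The sketch below assumes that a connection probability is estimated as a relative frequency of word pairs; the source does not state the estimation formula, and the class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


class ConnectionProbabilities:
    """Stores word-pair counts and derives connection probabilities from them."""

    def __init__(self) -> None:
        self.pair_counts: Dict[Tuple[str, str], int] = defaultdict(int)
        self.first_counts: Dict[str, int] = defaultdict(int)

    def observe(self, words: Sequence[str]) -> None:
        """Update counts with a corrected span (error words plus neighbours)."""
        for a, b in zip(words, words[1:]):
            self.pair_counts[(a, b)] += 1
            self.first_counts[a] += 1

    def probability(self, a: str, b: str) -> float:
        """Assumed estimate: count(a, b) / count(a followed by anything)."""
        total = self.first_counts.get(a, 0)
        if total == 0:
            return 0.0
        return self.pair_counts.get((a, b), 0) / total
```

After each correction, calling `observe` with the corrected word and its neighbours shifts the stored probabilities toward the word sequences this user actually produces.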

The speech recognition result correction apparatus of the present invention preferably further comprises constraint condition storage means for storing, as a constraint condition, the word information identified by the word information identifying means or the word identified by the identifying means, and the correction means preferably performs the correction processing in accordance with the constraint condition stored in the constraint condition storage means.

In this way, the word or word information serving as a constraint condition is stored, and correction processing can be performed in accordance with the stored constraint condition as needed. The constraint condition no longer has to be generated every time correction processing is performed, so correction processing (speech recognition processing) can be carried out quickly.

The speech recognition result correction apparatus of the present invention preferably further comprises accepting means for accepting character information from the user, and the correction means preferably corrects the recognition result in the error section using the character information accepted by the accepting means as a constraint condition.

According to this invention, the user can directly specify the characters that serve as the constraint condition, so more accurate recognition processing can be performed and a correct speech recognition result can be obtained.

The speech recognition result correction apparatus of the present invention preferably further comprises time information calculation means for calculating the elapsed time in the recognition result based on the recognition result received by the reception means and the feature amount data stored in the storage means, and the specifying means preferably specifies the error section based on the time information calculated by the time information calculation means.

According to this invention, the elapsed time in the recognition result is calculated based on the received recognition result and the stored feature amount data, and the error section can be specified based on this time information. As a result, appropriate feature amount data corresponding to the error section can be extracted even when the recognition result does not contain time information.

The speech recognition result correction apparatus of the present invention preferably further comprises display means for displaying the recognition result corrected by the correction means, and the display means preferably does not display the recognition result acquired by the acquisition means. Because a recognition result that may contain a recognition error is not displayed, the user is not misled.

In the speech recognition result correction apparatus of the present invention, when the recognition result obtained through re-recognition by the correction means is the same as the recognition result acquired by the acquisition means, or when the time information contained in the two recognition results differs, this is preferably judged to be a recognition error and the display means preferably does not display the recognition result. This prevents an incorrect recognition result from being displayed.

In the speech recognition result correction apparatus of the present invention, the specifying means preferably specifies the start point of the error section through a user operation and specifies the end point of the error section based on the reliability attached to the recognition result acquired by the acquisition means. This realizes a correction method suited to the user's input habits and provides an easy-to-use apparatus.

In the speech recognition result correction apparatus of the present invention, the specifying means preferably specifies the start point of the error section through a user operation and specifies the end point of the error section a predetermined number of recognition units away from that start point. This realizes a correction method suited to the user's input habits and provides an easy-to-use apparatus.

In the speech recognition result correction apparatus of the present invention, the specifying means preferably specifies the start point of the error section through a user operation and specifies the end point of the error section based on a predetermined phonetic symbol in the recognition result acquired by the acquisition means. This realizes a correction method suited to the user's input habits and provides an easy-to-use apparatus.

In the speech recognition result correction apparatus of the present invention, the acquisition means preferably acquires a plurality of recognition candidates as the recognition result, and the specifying means preferably specifies the start point of the error section through a user operation and specifies the end point based on the number of recognition candidates acquired by the acquisition means. This allows the end point to be specified based on the reliability of the recognition result, so that correction processing can be performed efficiently.
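The four ways of choosing the end point once the user has marked the start point (reliability, a fixed number of recognition units, a predetermined phonetic symbol, and the number of recognition candidates) can be compared side by side. The Python sketch below is illustrative only: the threshold of 0.5, the span of 3 units, the use of `"sp"` as a pause symbol, and the rule that many competing candidates indicate an unreliable unit are all assumptions, since the text above does not give concrete rules or values.

```python
from typing import Sequence


def end_by_confidence(confidences: Sequence[float], start: int,
                      threshold: float = 0.5) -> int:
    """Extend the section while the following units stay below the threshold."""
    end = start
    while end + 1 < len(confidences) and confidences[end + 1] < threshold:
        end += 1
    return end


def end_by_unit_count(n_units: int, start: int, span: int = 3) -> int:
    """End the section `span` units from the start, clipped to the utterance."""
    return min(start + span - 1, n_units - 1)


def end_by_symbol(units: Sequence[str], start: int, symbol: str = "sp") -> int:
    """End just before the next occurrence of `symbol` (assumed pause marker)."""
    for i in range(start + 1, len(units)):
        if units[i] == symbol:
            return i - 1
    return len(units) - 1  # no symbol found: run to the end of the utterance


def end_by_candidate_count(candidate_counts: Sequence[int], start: int,
                           min_candidates: int = 3) -> int:
    """Extend while following units still have many competing candidates."""
    end = start
    while (end + 1 < len(candidate_counts)
           and candidate_counts[end + 1] >= min_candidates):
        end += 1
    return end
```

Each function returns the index of the last recognition unit in the error section, so the caller can treat all four strategies interchangeably.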

The speech recognition result correction apparatus of the present invention preferably further comprises calculation means for calculating the average value of the feature amount data over a section that includes the error section, and the correction means preferably subtracts this average value from the extracted feature amount data and performs re-recognition using the resulting data as the feature amount data. This allows correction processing to be performed on sound from which the characteristics of the sound-collecting device, such as a microphone, have been removed, enabling more accurate correction (speech recognition).
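The averaging and subtraction step resembles mean normalization of feature vectors. The sketch below assumes that each frame is a fixed-length list of floats and that the average is taken per dimension over a context section containing the error section; the function name and data layout are hypothetical.

```python
from typing import List, Sequence


def subtract_section_mean(
    extracted: Sequence[Sequence[float]],
    context: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Subtract the per-dimension mean of `context` from each extracted frame.

    `context` is the section (including the error section) used for the
    average; `extracted` holds the frames that will be re-recognized.
    """
    if not context:
        raise ValueError("context section must contain at least one frame")
    dim = len(context[0])
    mean = [sum(frame[d] for frame in context) / len(context) for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in extracted]
```

Because a stationary microphone or channel characteristic appears as a roughly constant offset in such features, removing the mean is one plausible way to obtain the effect described above.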

In another aspect, the speech recognition result correction apparatus of the present invention comprises: input means for inputting speech; acquisition means for acquiring a recognition result for the speech input by the input means; specifying means for specifying, in the recognition result acquired by the acquisition means, an error section in which a recognition error has occurred; notification means for requesting an external server to re-recognize the error section by notifying the external server of the error section specified by the specifying means; and reception means for receiving the recognition result of the error section re-recognized by the external server in response to the request from the notification means.

Likewise, in another aspect, the speech recognition result correction method of the present invention comprises: an input step of inputting speech; an acquisition step of acquiring a recognition result for the speech input in the input step; a specifying step of specifying, in the recognition result acquired in the acquisition step, an error section in which a recognition error has occurred; a notification step of requesting an external server to re-recognize the error section by notifying the external server of the error section specified in the specifying step; and a reception step of receiving the recognition result of the error section re-recognized by the external server in response to the request made in the notification step.

The speech recognition result correction apparatus of the present invention preferably comprises subword section specifying means for specifying a subword section in the recognition result acquired by the acquisition means. The correction means preferably extracts from the storage means the feature amount data corresponding to the subword section specified by the subword section specifying means within the error section specified by the specifying means, and corrects the recognition result obtained by the acquisition means by performing re-recognition using the extracted feature amount data.

In this way, the recognition result can be corrected using the feature amount data corresponding to the subword section, and more accurate correction processing can be performed. That is, re-recognition can be performed in accordance with an unknown-word section such as a subword section.

The speech recognition result correction apparatus of the present invention preferably further comprises dividing means for dividing the recognition result acquired by the acquisition means into a plurality of sections in accordance with the subword section specified by the subword section specifying means, and the correction means preferably corrects the recognition result for each divided section produced by the dividing means.

Dividing the recognition result into a plurality of sections shortens the span to be recognized, so more accurate recognition processing can be performed.

The dividing means in the speech recognition result correction apparatus of the present invention preferably divides the recognition result so that the end point of the subword section becomes the end point of one divided section and the start point of the subword section becomes the start point of the next divided section.

As a result, the subword section is included in each of the divided sections. Because the subword section is therefore always present during recognition processing, recognition can be performed with the subword character string as a constraint condition.
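The overlapping division described above can be made concrete with a few lines of Python. The sketch assumes that sections are expressed as half-open `(start, end)` positions on a common axis such as frame indices, and that a single subword section is present; the function name is hypothetical.

```python
from typing import List, Tuple

Section = Tuple[int, int]  # (start, end) positions, e.g. frame indices


def split_at_subword(total: Section, subword: Section) -> List[Section]:
    """Split `total` into two overlapping sections that both contain `subword`.

    The first section ends at the subword's end point and the second section
    starts at the subword's start point, as described in the text.
    """
    total_start, total_end = total
    sub_start, sub_end = subword
    if not (total_start <= sub_start < sub_end <= total_end):
        raise ValueError("subword section must lie inside the total section")
    return [(total_start, sub_end), (sub_start, total_end)]
```

For a total section of `(0, 100)` and a subword section of `(40, 60)`, the result is `(0, 60)` and `(40, 100)`, so the subword frames appear in both halves and can anchor the recognition of each.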

The correction means in the speech recognition result correction apparatus of the present invention preferably corrects the recognition result for each divided section produced by the dividing means and uses the subword section as a constraint condition when correcting each divided section.

As a result, the subword section is always included during recognition processing, and recognition can be performed with the subword character string as a constraint condition.

In the speech recognition result correction apparatus of the present invention, the correction means preferably retains, during the recognition search, hypotheses that contain the subword character string described in the subword section specified by the subword section specifying means, and performs the correction by selecting the final recognition result from those hypotheses.

This ensures that the recognition process always uses the subword character string.

The speech recognition result correcting apparatus of the present invention preferably further comprises dictionary adding means for adding the subword character string in the subword section designated by the subword section designating means to a dictionary database used for recognition processing.

As a result, subword character strings accumulate and can be used effectively in later recognition processing, making that processing more accurate.

The speech recognition result correcting apparatus of the present invention preferably further comprises a dictionary database generated by the user, and the correcting means preferably performs the correction using a character string obtained by converting the subword character string according to that dictionary database.

The speech recognition result correcting system of the present invention comprises the speech recognition result correcting apparatus described above and a server device that performs speech recognition based on the speech transmitted from the speech recognition result correcting apparatus and transmits the recognition result to the speech recognition result correcting apparatus. This system differs from the apparatus described above only in category and provides the same operation and effects.

According to the present invention, only the necessary part of the recognition result is corrected, so the correction can be performed easily and a correct recognition result can be obtained.

Embodiments of the present invention will be described with reference to the accompanying drawings. Where possible, the same parts are denoted by the same reference numerals, and redundant description is omitted.

<First Embodiment>
FIG. 1 is a system configuration diagram of a communication system that includes a client device 110, which is the speech recognition result correcting apparatus of the present embodiment, and a server device 120, which recognizes speech transmitted from the client device 110 and returns the result to the client device 110. In the present embodiment, the client device 110 is a mobile terminal such as a mobile phone. It takes in speech uttered by the user, transmits the speech to the server device 120 by wireless communication, and receives the recognition result returned by the server device 120.

The server device 120 includes a speech recognition unit, performs speech recognition on the input speech using databases such as an acoustic model and a language model, and returns the recognition result to the client device 110.

Next, the configuration of the client device 110 will be described. FIG. 2 is a block diagram showing the functions of the client device 110. The client device 110 includes a feature amount calculation unit 210 (input means, calculation means), a feature amount compression unit 220, a transmission unit 225 (acquisition means, transmission means), a feature amount storage unit 230 (storage means), a receiving unit 235 (acquisition means, reception means), an error section specifying unit 240 (designating means), an error section context specifying unit 250 (specifying means), an error section feature amount extraction unit 260, a correction unit 270 (correcting means), an acoustic model holding unit 281, a language model holding unit 282, a dictionary holding unit 283, an integration unit 280, and a display unit 290.

FIG. 3 is a hardware configuration diagram of the client device 110. As shown in FIG. 3, the client device 110 of FIG. 2 is physically configured as a computer system that includes a CPU 11, a RAM 12 and a ROM 13 as main storage devices, an input device 14 such as a keyboard and a mouse, an output device 15 such as a display, a communication module 16 that is a data transmission and reception device such as a network card, and an auxiliary storage device 17 such as a hard disk. Each function described with reference to FIG. 2 is realized by loading predetermined computer software onto hardware such as the CPU 11 and the RAM 12 shown in FIG. 3, operating the input device 14, the output device 15, and the communication module 16 under the control of the CPU 11, and reading and writing data in the RAM 12 and the auxiliary storage device 17. Each functional block is described below on the basis of the functional blocks shown in FIG. 2.

The feature amount calculation unit 210 takes in the user's voice from a microphone (not shown) and calculates, from that voice, feature amount data, which is a spectrum for speech recognition that represents acoustic features. For example, the feature amount calculation unit 210 calculates feature amount data representing acoustic features expressed in terms of frequency, such as MFCC (Mel Frequency Cepstrum Coefficient).

The feature amount compression unit 220 compresses the feature amount data calculated by the feature amount calculation unit 210.

The transmission unit 225 transmits the compressed feature amount data produced by the feature amount compression unit 220 to the server device 120. The transmission unit 225 performs transmission using HTTP (Hyper Text Transfer Protocol), MRCP (Media Resource Control Protocol), SIP (Session Initiation Protocol), or the like. The server device 120 uses the same protocols for reception and for its reply. The server device 120 can also decompress the compressed feature amount data and perform speech recognition using the feature amount data. Because the feature amount compression unit 220 compresses data only to reduce communication traffic, the transmission unit 225 may also transmit the feature amount data as it is, without compression.

The feature amount storage unit 230 temporarily stores the feature amount data calculated by the feature amount calculation unit 210.

The receiving unit 235 receives the speech recognition result returned from the server device 120. The speech recognition result contains text data, time information, and reliability information. The time information indicates the elapsed time for each recognition unit of the text data, and the reliability information indicates how likely the recognition result is to be correct.

For example, the information shown in FIG. 4 is received as the recognition result. FIG. 4 lists the utterance content, recognition content, speech section, and reliability in association with one another, although the utterance content is not actually included in the received data. The number given for each speech section is a frame index, namely the index of the first frame of that recognition unit. One frame is about 10 msec. The reliability is given for each recognition unit of the speech recognition result produced by the server device 120 and is a numerical value indicating how likely that unit is to be correct. It is generated from the recognition result using probabilities or the like, and the server device 120 attaches it to each recognized word. One method of generating the reliability is described in the following reference.
Reference: Akinobu Lee, Tatsuya Kawahara, Kiyohiro Shikano, "A fast reliability calculation method based on word posterior probabilities in a two-pass search algorithm" (in Japanese), IPSJ SIG Technical Report, 2003-SLP-49-48, 2003-12.
In FIG. 4, for example, the recognition result "売れて" (urete) spans frames 33 to 57, and its reliability is 0.86.

The error section specifying unit 240 specifies an error section based on the speech recognition result received by the receiving unit 235. For example, it can specify the error section based on the reliability information included in the speech recognition result transmitted from the server device 120.

For example, FIG. 4 shows a recognition result whose text data is "905" (kyu-maru-go), whose time information is 9 frames (90 msec), and whose reliability is 0.59; elsewhere, the recognition result "どこ" (doko) has a reliability of 0.04. The error section specifying unit 240 judges any unit whose reliability is at or below a predetermined threshold to be erroneous and specifies that section as an error section. For example, if units with a reliability of 0.2 or less are treated as erroneous, the portions "どこ" (doko), "で" (de), and "豆腐" (tofu) are judged to be erroneous and are specified as the error section. The threshold is a value set in advance on the client device 110 side. It may instead be set variably according to individual differences in voice, the amount of noise, or the method used to calculate the reliability. That is, because the reliability drops further when there is much noise, the threshold may be set lower in that case; and when the reliability values attached to the speech recognition result are low overall, or conversely high overall, the threshold may be changed to match. For example, the threshold may be set based on the median or the mean of the reliability values.
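The threshold test described above can be sketched in a few lines. This is a minimal illustration, assuming each recognition unit is a dictionary with `text`, `start_frame`, and `reliability` keys; the function names, the `scale` parameter, and the sample values are hypothetical and are not the actual contents of FIG. 4.

```python
from statistics import median


def find_error_units(units, threshold=0.2):
    """Return the indices of recognition units judged erroneous.

    A unit is treated as an error when its reliability is at or below
    the threshold, matching the "0.2 or less" example in the text.
    """
    return [i for i, unit in enumerate(units) if unit["reliability"] <= threshold]


def median_based_threshold(units, scale=0.5):
    """Hypothetical adaptive rule: a fixed fraction of the median reliability."""
    return scale * median(unit["reliability"] for unit in units)


# Illustrative values only; not the actual contents of FIG. 4.
units = [
    {"text": "905", "start_frame": 0, "reliability": 0.59},
    {"text": "doko", "start_frame": 10, "reliability": 0.04},
    {"text": "de", "start_frame": 18, "reliability": 0.12},
    {"text": "tofu", "start_frame": 22, "reliability": 0.18},
    {"text": "urete", "start_frame": 33, "reliability": 0.86},
]

error_indices = find_error_units(units)  # fixed threshold of 0.2
adaptive_indices = find_error_units(units, median_based_threshold(units))
```

With the fixed threshold, the three low-reliability units in the middle are flagged; the median-based rule is one possible way to let the threshold follow the overall level of the reliability values, as the text suggests.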

The client device 110 may instead include a reliability calculation unit (not shown) that calculates the reliability information of the recognition result, and the error section specifying unit 240 may specify the error section based on the reliability information calculated within the client device 110.

The error section context specifying unit 250 specifies, based on the error section specified by the error section specifying unit 240, the words (at least one recognition unit) recognized immediately before and after that error section. The following description uses the case where only one word before and one word after are used. FIG. 5(a) is a conceptual diagram of specifying one recognition unit recognized before and after the error section (the context before and after the error section). As shown in FIG. 5(a), the speech section of the word before the error section and the speech section of the word after the error section are specified on either side of the error section of the recognition result.

The error section feature amount extraction unit 260 extracts, from the feature amount storage unit 230, the feature amount data of the error section specified by the error section context specifying unit 250 (which may include at least one recognition unit before and after it).

The correction unit 270 performs speech recognition again on the feature amount data extracted by the error section feature amount extraction unit 260. The correction unit 270 performs speech recognition using the acoustic model holding unit 281, the language model holding unit 282, and the dictionary holding unit 283. In addition, the correction unit 270 performs speech recognition using, as constraint conditions, the words indicated by the preceding and following speech sections specified by the error section context specifying unit 250 (the preceding and following context). FIG. 5(b) is a conceptual diagram of recognition based on the words specified by the error section context specifying unit 250. As shown in FIG. 5(b), when the word W1 in the section before the error section and the word W2 in the section after it are used as constraint conditions, the recognition candidates are limited, which improves recognition accuracy. In the example of FIG. 5(b), the recognition candidates are narrowed down to A to Z, and an appropriate candidate can be selected from among these narrowed-down candidates, so the recognition process is efficient.

The correction unit 270 may also perform the correction based on the dependency relationship with the preceding and following words, on conjugated forms, and the like. For example, the correction unit 270 may extract a plurality of recognition candidates A to Z for the word in the error section, calculate a score for each correction candidate based on its dependency relationship with the preceding and following words W1 and W2, and adopt a high-scoring correction candidate as the recognition result.
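Candidate scoring against the surrounding words W1 and W2 can be sketched as below. The specification does not define the scoring formula, so this sketch substitutes a simple word-pair connection probability for the dependency relationship; the function name `pick_candidate`, the `floor` value, and all numbers are assumptions made purely for illustration.

```python
import math


def pick_candidate(candidates, w1, w2, connection_prob, floor=1e-6):
    """Return the candidate with the best combined score, plus all scores.

    candidates:      dict mapping candidate word -> acoustic log-likelihood
    connection_prob: dict mapping (left_word, right_word) -> probability
    floor:           probability assumed for word pairs not in the table
    """
    def log_conn(left, right):
        return math.log(connection_prob.get((left, right), floor))

    scores = {
        word: acoustic + log_conn(w1, word) + log_conn(word, w2)
        for word, acoustic in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical candidates and probabilities.
candidates = {"A": -10.0, "B": -9.5}
connection_prob = {("W1", "A"): 0.4, ("A", "W2"): 0.5, ("W1", "B"): 0.01}
best, scores = pick_candidate(candidates, "W1", "W2", connection_prob)
```

Here candidate B has the better acoustic score, but A connects far more plausibly to W1 and W2, so A is selected. This is the sense in which the surrounding context narrows the candidates.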

Even when the word W1 in the preceding section or the word W2 in the following section is not contained in the language model holding unit 282 or the dictionary holding unit 283, the correction unit 270 can perform the correction (re-recognition) using, as a constraint condition, word information that identifies that word or word information that identifies the preceding and following words.

For example, the client device 110 receives from the server device 120, as word information, part-of-speech information indicating the part of speech of each of the words W1 and W2, and the correction unit 270 performs the correction using the part-of-speech information of W1 and W2 as constraint conditions. This allows a more accurate correction, that is, a more accurate speech recognition process. Specifically, from the word information attached to the speech recognition result received by the receiving unit 235, the error section specifying unit 240 extracts the word information of the words before and after the error section (or of either one) and outputs it to the correction unit 270. The correction unit 270 corrects the specified portion using this word information as a constraint condition. FIG. 24 shows the concept. As shown in FIG. 24, part-of-speech information A (for example, a particle) is set as a constraint condition for the word W1, and part-of-speech information B (for example, a verb) is set as a constraint condition for the word W2. By performing the correction so that part-of-speech information A and part-of-speech information B are both satisfied, the correction unit 270 can perform more accurate speech recognition.
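One way to picture the part-of-speech constraint is as a filter over recognition hypotheses. The sketch below assumes each hypothesis is a list of `(word, part_of_speech)` pairs covering the span from W1 to W2; the function names and the sample hypotheses are hypothetical, and a real recognizer would more likely apply the constraint during the search rather than afterwards.

```python
def satisfies_pos_constraint(hypothesis, pos_w1, pos_w2):
    """Check that the first and last words carry the required parts of speech.

    hypothesis: list of (word, part_of_speech) pairs spanning W1 .. W2.
    """
    if len(hypothesis) < 2:
        return False
    return hypothesis[0][1] == pos_w1 and hypothesis[-1][1] == pos_w2


def filter_hypotheses(hypotheses, pos_w1, pos_w2):
    """Keep only the hypotheses that satisfy both part-of-speech constraints."""
    return [h for h in hypotheses if satisfies_pos_constraint(h, pos_w1, pos_w2)]


# Hypothetical hypotheses; words are placeholders.
hypotheses = [
    [("w_a", "particle"), ("x", "noun"), ("w_b", "verb")],
    [("w_c", "noun"), ("y", "noun"), ("w_d", "verb")],
    [("w_e", "particle"), ("z", "noun"), ("w_f", "noun")],
]
kept = filter_hypotheses(hypotheses, pos_w1="particle", pos_w2="verb")
```

Only the first hypothesis survives, because it is the only one that starts with a particle and ends with a verb.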

The word information is not limited to part-of-speech information. It may be any information, other than the word itself, that identifies the word, such as its reading.

When the necessary word information is not included in the speech recognition result, it can be generated by analyzing the sentence to be recognized with a well-known morphological analysis system (for example, ChaSen or MeCab), a Japanese dependency analysis tool (for example, CaboCha), or the like. In the modified example of the client device 110 shown in FIG. 25, a word information analysis unit 251 is added. The word information analysis unit 251 consists of a well-known morphological analysis system, Japanese dependency analysis tool, or the like, as described above, and can analyze the speech recognition result. It outputs the analysis result to the error section context specifying unit 250, which extracts the word information of the words before and after the error section from that result and outputs it to the correction unit 270.

The word information may be generated in either the client device 110 or the server device 120. However, instructing the server device 120 to generate it and receiving the result reduces the processing load on the client device 110.

The processing described above is particularly effective when the words W1 and W2 are unknown words. An unknown word is a word that is not contained in the language model holding unit 282 or the dictionary holding unit 283. For example, the correction unit 270 (unknown word determination means) determines whether the words W1 and W2 are unknown words and, if they are, performs the correction using the word information included in the recognition result sent from the server device 120 as a constraint condition.

The constraint conditions may also be registered in the client device 110. That is, in the modified example of the client device 110 shown in FIG. 25, the word in the specified error section together with the words before and after it (or at least one of them), or a set of their word information, may be stored as a constraint condition in a constraint condition storage unit 285 (constraint condition storage means). Then, when the word in an error section specified by the error section specifying unit 240 and the words before and after it are the same as a stored set, the correction unit 270 can perform the correction according to the constraint condition stored in the constraint condition storage unit 285, which speeds up the process. In other words, even if an unknown word is detected on a later occasion, the constraint condition can be applied simply by reading out the one already registered. Because no new constraint condition has to be created, the constraint condition can be set with less processing.
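The registration and reuse of constraint conditions can be sketched as a small lookup table. This is an illustrative assumption about how the constraint condition storage unit 285 might be organized; the class name `ConstraintStore`, its methods, and the representation of a constraint as a dictionary are all hypothetical.

```python
class ConstraintStore:
    """Minimal in-memory stand-in for a constraint condition store."""

    def __init__(self):
        self._store = {}

    def register(self, error_word, w1, w2, constraint):
        """Save a constraint keyed by the error word and its neighbours."""
        self._store[(w1, error_word, w2)] = constraint

    def lookup(self, error_word, w1, w2):
        """Return the saved constraint, or None if this context is new."""
        return self._store.get((w1, error_word, w2))


store = ConstraintStore()
store.register("err", "W1", "W2", {"pos_w1": "particle", "pos_w2": "verb"})

hit = store.lookup("err", "W1", "W2")    # same context seen before
miss = store.lookup("err", "W1", "W3")   # different following word
```

On a hit, the stored constraint can be applied directly; on a miss, the constraint would have to be built again from the word information.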

The connection probabilities between the word in the error section and the words before and after it may also be updated according to the result of the correction by the correction unit 270. That is, the connection probabilities are stored in the language model holding unit 282 and the dictionary holding unit 283, which function as connection probability storage means, and the connection probabilities calculated by the correction unit 270 at each correction may be used to update the language model holding unit 282 and the dictionary holding unit 283 as appropriate.

The correction unit 270 preferably determines whether the re-recognized result is the same as the result recognized by the server device 120 for the error section and, if they are the same, does not output the recognition result to the integration unit 280, so that the recognition result is not displayed on the display unit 290.

Likewise, when a recognition unit is shifted between the recognition result obtained by the correction unit 270 and the recognition result obtained by the server device 120 for the error section, this is preferably judged to be a recognition error, and the recognition result is preferably not output to the integration unit 280, so that it is not displayed on the display unit 290.

For example, suppose the correspondence between speech section and recognition result in FIG. 4 differs. More specifically, suppose the server device 120 recognized frame indices 0 to 9 as "905" (kyu-maru-go), whereas re-recognition by the correction unit 270 produced frame indices 0 to 15 as "90555" (kyu-maru-go-go-go). The correspondence between speech section and recognition result is then shifted between the original result and the re-recognized result, so a recognition error can be assumed. In that case, the correction unit 270 suppresses its output or takes similar action so that the recognition result is not displayed on the display unit 290.
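The shift check in this example can be sketched as a comparison of frame spans. The sketch assumes each recognition unit is a `(start_frame, end_frame, text)` tuple and that the two results are compared unit by unit; the function name `is_misaligned` is hypothetical, and the specification does not state exactly how the comparison is implemented.

```python
def is_misaligned(server_units, rerecognized_units):
    """Return True when the frame spans of the two results do not match.

    Each unit is a (start_frame, end_frame, text) tuple. Only the spans
    are compared; the recognized text itself is allowed to differ.
    """
    server_spans = [(start, end) for start, end, _ in server_units]
    new_spans = [(start, end) for start, end, _ in rerecognized_units]
    return server_spans != new_spans


# The "905" versus "90555" case from the text.
server_result = [(0, 9, "905")]
rerecognized = [(0, 15, "90555")]
suppress_output = is_misaligned(server_result, rerecognized)
```

When the check returns `True`, the re-recognized result would be withheld from integration and display.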

Furthermore, when the recognition error described above has been determined and a character is entered through an accepting unit (not shown) that accepts character information from the user, the correction unit 270 may correct the recognition result in the error section using the accepted character (for example, kana) as a constraint condition. That is, when some character is entered for the recognition result of the error section, recognition of the remaining portion may be performed on the premise of that character. In this case, character input at the accepting unit is enabled once the recognition error has been determined.

The correction unit 270 avoids repeating the same misrecognition by performing a speech recognition process different from the one performed by the server device 120. For example, it performs recognition with a different acoustic model, language model, or dictionary.

The acoustic model holding unit 281 is a database that stores phonemes in association with their spectra. The language model holding unit 282 stores statistical information indicating the chain probabilities of words, characters, and the like. The dictionary holding unit 283 holds a database of phonemes and text and stores, for example, an HMM (Hidden Markov Model).

The integration unit 280 integrates the text data outside the error section in the speech recognition result received by the receiving unit 235 with the text data re-recognized by the correction unit 270. The integration unit 280 performs the integration according to the error section (time information), which indicates the position at which the re-recognized text data is to be inserted.
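The integration step can be sketched as replacing the units inside the error section with the re-recognized text. This assumes the server result is a time-ordered list of `(start_frame, text)` pairs and that the error section is a half-open frame range; the function name `integrate`, the sample frame numbers, and the placeholder corrected text are illustrative only.

```python
def integrate(units, error_start, error_end, corrected_text):
    """Merge server text outside the error section with the corrected text.

    units: time-ordered list of (start_frame, text) pairs from the server.
    Units whose start frame falls in [error_start, error_end) are replaced,
    as a group, by corrected_text at the position of the first such unit.
    """
    merged = []
    inserted = False
    for start, text in units:
        if error_start <= start < error_end:
            if not inserted:
                merged.append(corrected_text)
                inserted = True
        else:
            merged.append(text)
    return "".join(merged)


# Illustrative frame numbers; "<corrected>" stands in for the re-recognized text.
server_units = [(0, "905"), (10, "doko"), (18, "de"), (22, "tofu"), (33, "urete")]
result = integrate(server_units, 10, 33, "<corrected>")
```

The time information decides where the corrected text goes, so the text outside the error section is passed through unchanged.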

The display unit 290 displays the text data obtained through integration by the integration unit 280. The display unit 290 is preferably configured not to display the result recognized by the server device 120 as it is. When the result re-recognized by the correction unit 270 is the same as the result recognized by the server device 120 for the error section, that recognition result is preferably not displayed; in that case, a message indicating that recognition was not possible may be displayed. Likewise, when the time information differs between the recognition result obtained by re-recognition in the correction unit 270 and the recognition result obtained by the server device 120, the result may be wrong, so it is preferably not displayed and a message indicating that recognition was not possible is preferably displayed instead.

The re-recognition process need not always be executed; whether to execute it may be decided according to the length of the error section. For example, when the error section is a single character, re-recognition is skipped and the correction is made by another method, such as character input.

The operation of the client device 110 configured as described above will now be described. FIG. 6 is a flowchart showing the operation of the client device 110. The feature amount calculation unit 210 extracts feature amount data from the speech input through the microphone (S101). The feature amount data is stored in the feature amount storage unit 230 (S102). Next, the feature amount compression unit 220 compresses the feature amount data (S103). The transmission unit 225 transmits the compressed feature amount data to the server device 120 (S104).

Next, the server device 120 performs speech recognition and transmits the recognition result, which is received by the receiving unit 235 (S105). The error section specifying unit 240 specifies an error section from the speech recognition result, and the preceding and following context is specified based on that error section (S106). Based on the error section including the preceding and following context, the error section feature amount extraction unit 260 extracts feature amount data from the feature amount storage unit 230 (S107). The correction unit 270 performs speech recognition again on the extracted feature amount data and generates text data for the error section (S108). The text data for the error section is then integrated with the text data received by the receiving unit 235, and the correctly recognized text data is displayed on the display unit 290 (S109).

Next, the processing in S106 to S108 described above will be described in more detail. FIG. 7 is a flowchart showing this detailed processing. The description refers to FIG. 5(a) as appropriate.

The error section specifying unit 240 specifies an error section based on the recognition result (S201 (S106)). Based on this error section, the error section context specifying unit 250 specifies and stores the word W1 immediately before the error section (FIG. 5(a)) (S202). The error section context specifying unit 250 also specifies and stores the word W2 immediately after the error section (FIG. 5(a)) (S203). Next, the error section context specifying unit 250 specifies the start time T1 of the word W1 (FIG. 5(a)) (S204) and the end time T2 of the word W2 (FIG. 5(a)), and stores each of them (S205).

The error section feature amount extraction unit 260 extracts the feature amount data for the section from the start time T1 to the end time T2, that is, the error section widened by one word (one recognition unit) on each side (S206 (S107)). The correction unit 270 sets a constraint condition with the word W1 as the start point and the word W2 as the end point (S207). The correction unit 270 then performs recognition processing on the feature amount data in accordance with this constraint condition, thereby executing the correction processing (S208).
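Steps S201 to S206 can be illustrated with a minimal Python sketch. It is not taken from the patent: the `Word` record, the confidence threshold of 0.5, and the representation of feature amount data as a list indexed by frame are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: int         # first frame of the word (inclusive); hypothetical field
    end: int           # frame just past the word (exclusive); hypothetical field
    confidence: float  # reliability attached to the recognized word


def find_error_section(words, threshold=0.5):
    """Return (first, last) indices of the first run of low-confidence words (S201)."""
    first = last = None
    for i, w in enumerate(words):
        if w.confidence < threshold:
            if first is None:
                first = i
            last = i
        elif first is not None:
            break
    return None if first is None else (first, last)


def widen_with_context(words, section):
    """Pick W1/W2 around the error section and their times T1/T2 (S202 to S205)."""
    first, last = section
    w1 = words[first - 1] if first > 0 else None
    w2 = words[last + 1] if last + 1 < len(words) else None
    t1 = w1.start if w1 else words[first].start
    t2 = w2.end if w2 else words[last].end
    return w1, w2, t1, t2


def extract_features(features, t1, t2):
    """Slice the stored per-frame feature amount data from T1 to T2 (S206)."""
    return features[t1:t2]
```

When the error section touches the start or end of the utterance, this sketch simply falls back to the boundary of the error section itself; the patent text does not say how that case is handled.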

The operation and effects of the client device 110 of this embodiment are as follows. In the client device 110, the feature amount calculation unit 210 calculates feature amount data of the input voice, and the feature amount compression unit 220 transmits the feature amount data to the server device 120, which is a speech recognition device. Meanwhile, the feature amount storage unit 230 stores the feature amount data.

The server device 120 then performs recognition processing, and the receiving unit 235 receives the recognition result from the server device 120. The error section specifying unit 240 specifies, in the received recognition result, an error section in which a recognition error has occurred. The error section specifying unit 240 can make this determination based on the reliability. The error section feature amount extraction unit 260 extracts the feature amount data of the error section, and the correction unit 270 corrects the recognition result in the extracted error section by performing re-recognition processing. Specifically, the integration unit 280 integrates the re-recognized result with the recognition result received by the receiving unit 235, which completes the correction, and the display unit 290 can display the corrected recognition result. Because only the necessary part of the recognition result is corrected, speech recognition errors can be corrected easily and a correct recognition result can be obtained. For example, erroneous words can be reduced by up to 70%, and 60% or more of the errors caused by unknown words can be corrected. The reliability may be received from the server device 120 or calculated in the client device 110.
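The integration performed by the integration unit 280 can be pictured as a splice of the re-recognized words into the received word sequence. The following Python sketch assumes, for illustration only, that both results are plain lists of words and that the widened section is given by inclusive word indices; the function name `integrate` is hypothetical.

```python
def integrate(received_words, span_first, span_last, rerecognized_words):
    """Replace received_words[span_first..span_last] (inclusive) with the re-recognized words."""
    return (
        received_words[:span_first]
        + rerecognized_words
        + received_words[span_last + 1:]
    )


# Usage: "male" (index 2) was judged erroneous; the widened section covers
# indices 1 to 3, so the re-recognized span starts with W1 and ends with W2.
received = ["please", "send", "male", "to", "Sato"]
corrected = integrate(received, 1, 3, ["send", "mail", "to"])
```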

Furthermore, by using the error section context specifying unit 250, the client device 110 can perform correction processing (re-recognition processing) in accordance with a constraint condition. That is, the words before and after the error section are fixed, and recognition processing is performed in accordance with these fixed words, which yields a more accurate recognition result.

In this embodiment and the other embodiments described below, the first recognition processing is performed by the server device 120, but this is not a limitation: the first recognition processing may be performed by the client device 110 and the second by the server device 120. In that case, the error section specification processing and the like are naturally performed in the server device 120. For example, the client device 110 then includes a recognition processing unit that performs recognition processing based on the feature amount data calculated by the feature amount calculation unit 210, and the transmission unit 225 transmits the resulting recognition result and the feature amount data to the server device 120.

The server device 120 includes units corresponding to the error section specifying unit 240, the error section context specifying unit 250, the feature amount storage unit 230, the error section feature amount extraction unit 260, and the correction unit 270 of the client device 110. The feature amount data transmitted from the client device 110 is stored in the feature amount storage unit; the error section and its preceding and following contexts are specified based on the recognition result; and, based on these, correction processing (recognition processing) is performed on the previously stored feature amount data. The recognition result processed in this way is transmitted to the client device 110.

In this embodiment and the other embodiments described below, re-recognition (correction processing) is performed using the constraint condition determined by the error section context specifying unit 250; in this case, only the feature amount data of the error section is used. The re-recognition processing may also be performed without using such a constraint condition.

It is also preferable that the recognition method used in the server device 120 differ from the recognition method used in this embodiment (or the other embodiments described below). The server device 120 must recognize the voices of an unspecified number of users and therefore needs to be general-purpose. For example, the acoustic model holding unit, language model holding unit, and dictionary holding unit used in the server device 120 hold a large number of models and dictionaries, with more phonemes in the acoustic model and more words in the language model, so that any user can be handled.

The correction unit 270 in the client device 110, on the other hand, does not need to handle every user; it uses an acoustic model, a language model, and a dictionary matched to the voice of the user of that client device 110. The client device 110 therefore needs to update each model and dictionary as appropriate, with reference to the correction processing, the recognition processing, and the character input processing performed when composing e-mail.

The client device 110 further includes the display unit 290, which displays the recognition result corrected by the correction unit 270; the display unit 290 does not display the recognition result as recognized by the server device 120. Because a recognition result that may contain recognition errors is not displayed, the user is not misled.

In the client device 110, when the recognition result obtained by re-recognition in the correction unit 270 is the same as the recognition result received by the receiving unit 235, or when the time information contained in these recognition results differs, the correction unit 270 determines that a recognition error has occurred and the display unit 290 does not display the recognition result. This prevents an incorrect recognition result from being displayed. Specifically, erroneous words can be reduced by up to 70%, and 60% or more of the errors caused by unknown words can be corrected.
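The display rule in the preceding paragraph can be written as a small predicate. The Python sketch below follows the text literally. It assumes that each result is given as a text string plus a tuple of boundary times such as (T1, T2), and that a `tolerance` parameter decides when time information counts as shifted; neither the data layout nor the tolerance appears in the patent.

```python
def is_recognition_error(received_text, rerecognized_text,
                         received_times, rerecognized_times, tolerance=0):
    """Return True when the result should be treated as erroneous and not displayed.

    Follows the stated rule: the two results are identical, or their
    time information differs by more than `tolerance` frames.
    """
    same_result = received_text == rerecognized_text
    time_shifted = any(
        abs(a - b) > tolerance
        for a, b in zip(received_times, rerecognized_times)
    )
    return same_result or time_shifted
```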

<Second Embodiment>
Next, a client device 110a will be described in which the error section is determined manually by the user rather than automatically based on the reliability. FIG. 8 is a block diagram showing the functions of the client device 110a, which accepts an error section through user input. As shown in FIG. 8, the client device 110a includes a feature amount calculation unit 210, a feature amount compression unit 220, a feature amount storage unit 230, a transmission unit 225, a receiving unit 235, an operation unit 236, a result storage unit 237, a user input detection unit 238, an error section specifying unit 240a, an error section context specifying unit 250, an error section feature amount extraction unit 260, a correction unit 270, an integration unit 280, an acoustic model holding unit 281, a language model holding unit 282, a dictionary holding unit 283, and a display unit 290. Like the client device 110, the client device 110a is realized by the hardware shown in FIG. 3.

The client device 110a differs from the client device 110 in that it includes the operation unit 236, the result storage unit 237, the user input detection unit 238, and the error section specifying unit 240a. The description below focuses on these differences.

The operation unit 236 accepts user input. The user can specify an error section while checking the recognition result displayed on the display unit 290, and the operation unit 236 accepts that specification.

The result storage unit 237 stores the speech recognition result received by the receiving unit 235. The stored speech recognition result is displayed on the display unit 290 so that the user can view it.

The user input detection unit 238 detects the user input accepted by the operation unit 236 and outputs the input error section to the error section specifying unit 240a.

The error section specifying unit 240a specifies the error section in accordance with the error section input from the user input detection unit 238.

Next, the processing of the client device 110a configured as described above will be described. FIG. 9 is a flowchart showing the processing of the client device 110a. The feature amount calculation unit 210 extracts feature amount data from the voice input through the microphone (S101). The feature amount data is stored in the feature amount storage unit 230 (S102). Next, the feature amount compression unit 220 compresses the feature amount data (S103). The transmission unit 225 transmits the compressed feature amount data to the server device 120 (S104).

Next, the server device 120 performs speech recognition and transmits the recognition result, which is received by the receiving unit 235, temporarily stored, and displayed on the display unit 290 (S105a). The user judges the error section from the recognition result displayed on the display unit 290 and inputs that error section. The user input detection unit 238 detects the input, and the error section specifying unit 240 specifies the error section. The preceding and following contexts are then specified based on the specified error section (S106a). Based on the error section including the preceding and following contexts, the error section feature amount extraction unit 260 extracts the feature amount data (S107), and the correction unit 270 performs speech recognition again and generates text data for the error section (S108). The text data for the error section and the text data received by the receiving unit 235 are then integrated, and the correct text data is displayed on the display unit 290 (S109).

Next, the processing in S105a to S108 described above will be described in more detail. FIG. 10 is a flowchart showing the detailed processing performed when an error section is specified by user input in the client device 110a.

The receiving unit 235 receives the recognition result, which is displayed on the display unit 290 (S301). While checking the recognition result displayed on the display unit 290, the user specifies an error section; the user input detection unit 238 detects the start point of the error section, which is temporarily stored (S302). The error section context specifying unit 250 then specifies and stores the word W1 immediately before the error section (S303), and specifies and stores the start time T1 of the stored word W1 (S304).

The user input detection unit 238 also detects the end point of the error section specified by the user, and this end point is temporarily stored (S305). The error section context specifying unit 250 then specifies and stores the word W2 immediately after the error section (S306), and specifies and stores the end time T2 of the stored word W2 (S307).

After these steps, the error section feature amount extraction unit 260 extracts the feature amount data from the start time T1 to the end time T2 (S308). The correction unit 270 sets a constraint condition with the word W1 as the start point and the word W2 as the end point (S309). The correction unit 270 then performs recognition processing on the feature amount data in accordance with this constraint condition, thereby executing the correction processing (S310).
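How a user selection might be turned into an error section (S302 and S305) can be sketched as follows. The patent does not describe the input format, so this Python sketch assumes that the recognition result is displayed as words joined by a single separator and that the user selects a character range in that text; `words_in_selection` is a hypothetical helper.

```python
def words_in_selection(words, char_start, char_end, sep=" "):
    """Map a selected character range [char_start, char_end) to (first, last) word indices.

    Returns None when the selection does not overlap any word.
    """
    hits = []
    pos = 0
    for i, word in enumerate(words):
        w_start, w_end = pos, pos + len(word)
        # A word is part of the error section if its span overlaps the selection.
        if w_start < char_end and char_start < w_end:
            hits.append(i)
        pos = w_end + len(sep)
    return (hits[0], hits[-1]) if hits else None


# Usage: the displayed text is "please send male to Sato";
# selecting characters 12 to 16 marks the word "male".
displayed = ["please", "send", "male", "to", "Sato"]
selected = words_in_selection(displayed, 12, 16)
```

The words immediately before and after the returned indices then serve as W1 and W2 for steps S303 to S307.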

Through this processing, an error section can be specified by user input, and the recognition result can then be corrected by re-recognition.

In the client device 110a, the display unit 290 displays the recognition result, and the user, while viewing it, can specify the error section, that is, the portion to be corrected, by operating the operation unit 236. Because only the necessary part of the recognition result is corrected, the correction processing can be performed easily and a correct recognition result can be obtained.

<Third Embodiment>
Next, a client device 110b will be described that can correctly specify an error section even when the recognition result transmitted from the server device 120 does not include time information. FIG. 11 is a block diagram showing the functions of the client device 110b. The client device 110b includes a feature amount calculation unit 210, a feature amount compression unit 220, a transmission unit 225, a feature amount storage unit 230, a receiving unit 235, a time information calculation unit 239, an error section specifying unit 240, an error section feature amount extraction unit 260, an error section context specifying unit 250, a correction unit 270, an acoustic model holding unit 281, a language model holding unit 282, and a dictionary holding unit 283. Like the client device 110 of the first embodiment, the client device 110b is realized by the hardware shown in FIG. 3.

The difference from the client device 110 of the first embodiment is that the client device 110b receives from the server device 120 a recognition result that does not include elapsed-time information, and the time information calculation unit 239 automatically calculates the elapsed time (frame index) based on the text data of the recognition result. The client device 110b is described below with a focus on this difference.

The time information calculation unit 239 calculates the elapsed time in the text data, using the text data in the recognition result received by the receiving unit 235 and the feature amount data stored in the feature amount storage unit 230. More specifically, the time information calculation unit 239 compares the input text data with the feature amount data stored in the feature amount storage unit 230 and determines how far into the feature amount data each word (or recognition unit) of the text data matches when that word is converted to frequency data; from this it can calculate the elapsed time in the text data. For example, if one word of the text data matches the feature amount data up to the tenth frame, that word has an elapsed time of 10 frames.
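Once the number of frames matched by each word is known, turning those counts into frame indices is a running sum. The Python sketch below shows only that bookkeeping. How the match between a word and the feature amount data is found is not detailed in the text, so the per-word frame counts are taken as given input, and the function name is hypothetical.

```python
def assign_frame_indices(words, frames_per_word):
    """Turn per-word matched frame counts into (word, start_frame, end_frame) triples.

    start_frame is inclusive and end_frame is exclusive.
    """
    timeline = []
    start = 0
    for word, n_frames in zip(words, frames_per_word):
        timeline.append((word, start, start + n_frames))
        start += n_frames
    return timeline


# Usage: the first word matches 10 frames, the second 14 frames.
timeline = assign_frame_indices(["send", "mail"], [10, 14])
```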

The error section specifying unit 240b can specify an error section using the text data and the elapsed time calculated by the time information calculation unit 239. The error section specifying unit 240b determines the error section based on the reliability information included in the recognition result. As in the second embodiment, the error section may instead be specified by user input.

Based on the error section specified by the error section specifying unit 240b, the error section context specifying unit 250 specifies an error section that includes the preceding and following contexts, the error section feature amount extraction unit 260 extracts the voice data of that error section, and the correction unit 270 performs the correction by carrying out recognition processing again.

Next, the processing of the client device 110b will be described. FIG. 12 is a flowchart showing the processing of the client device 110b. The feature amount calculation unit 210 extracts feature amount data from the voice input through the microphone (S101). The feature amount data is stored in the feature amount storage unit 230 (S102). Next, the feature amount compression unit 220 compresses the feature amount data (S103). The transmission unit 225 transmits the compressed feature amount data to the server device 120 (S104).

Next, the server device 120 performs speech recognition and transmits the recognition result (which does not include the elapsed time), and the receiving unit 235 receives it (S105). The time information calculation unit 239 then calculates the elapsed time from the speech recognition result and the feature amount data in the feature amount storage unit 230, and the error section specifying unit 240 specifies the error section using this elapsed time and the speech recognition result. The error section context specifying unit 250 specifies the preceding and following contexts based on the specified error section (S106b). Based on the error section including the preceding and following contexts, the error section feature amount extraction unit 260 extracts the feature amount data (S107), and the correction unit 270 performs speech recognition again and generates text data for the error section (S108). The text data for the error section and the text data received by the receiving unit 235 are then integrated, and the correct text data is displayed on the display unit 290 (S109).

Next, the processing including S106b will be described in more detail. FIG. 13 is a flowchart showing the detailed processing from S105 to S108.

The receiving unit 235 receives a recognition result that does not include the elapsed time (S401), and the time information calculation unit 239 calculates the elapsed time in the text data (S402). The error section specifying unit 240 specifies an error section from the recognition result (S403). Based on this error section, the error section context specifying unit 250 specifies and stores the word W1 immediately before the error section (FIG. 5(a)) (S404). The error section context specifying unit 250 also specifies and stores the word W2 immediately after the error section (FIG. 5(a)) (S405). Next, the error section context specifying unit 250 specifies the start time T1 of the word W1 (FIG. 5(a)) (S406) and the end time T2 of the word W2 (FIG. 5(a)) (S407).

The error section feature amount extraction unit 260 extracts the feature amount data for the section from the start time T1 to the end time T2, that is, the error section widened by one word on each side (S408). The correction unit 270 sets a constraint condition with the word W1 as the start point and the word W2 as the end point (S409). The correction unit 270 then performs recognition processing on the feature amount data in accordance with this constraint condition, thereby executing the correction processing (S410).

According to the client device 110b, the time information calculation unit 239 calculates the elapsed time in the recognition result based on the recognition result received by the receiving unit 235 and the feature amount data stored in the feature amount storage unit 230. The error section specifying unit 240 can then specify the error section based on this time information. The preceding and following contexts are specified based on the specified error section, and correction processing can be performed based on the corresponding feature amount data. An appropriate error section can therefore be specified even when the recognition result does not include time information.

<Fourth Embodiment>
Next, a client device 110c will be described that performs correction processing using only the recognition result obtained by speech recognition in the server device 120. FIG. 14 is a block diagram showing the functions of the client device 110c. The client device 110c includes a feature amount calculation unit 210, a feature amount compression unit 220, an error section specifying unit 240, an error section context specifying unit 250, a correction unit 270a, and a language DB holding unit 284. Like the client device 110, the client device 110c is realized by the hardware shown in FIG. 3.

Compared with the client device 110, the client device 110c differs in that it does not store the feature amount data obtained from the voice input and does not reuse this feature amount data during correction processing. Specifically, it differs in that it does not include the feature amount storage unit 230, the error section feature amount extraction unit 260, the acoustic model holding unit 281, the language model holding unit 282, or the dictionary holding unit 283. The description below is based on these differences.

The feature amount calculation unit 210 calculates feature amount data from the voice input, and the feature amount compression unit 220 compresses the feature amount data and transmits it to the server device 120. The receiving unit 235 then receives the recognition result from the server device 120. The error section specifying unit 240 specifies an error section based on reliability information or a user operation, and the error section context specifying unit 250 specifies the preceding and following contexts, thereby specifying the widened error section.

The correction unit 270a converts the text data specified by the error section including the preceding and following contexts, based on the database stored in the language DB holding unit 284. The language DB holding unit 284 stores almost the same information as the language model holding unit 282, namely the chain probability for each syllable.

The correction unit 270a further lists the word strings w (Wi, Wi+1, …, Wj) that may appear in the error section. The number of words in a word string w may be limited to K. The limit K is set equal to the number of erroneous words P, or to a fixed range around P (K from P-c to P+c).
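The listing step can be sketched in Python as a brute-force enumeration. This sketch reads K as the length of a candidate word string, which is an interpretation of the text, and it draws candidates from a small hypothetical `vocabulary`; a real system would presumably prune candidates with the language DB rather than enumerate every combination.

```python
from itertools import product


def candidate_word_strings(vocabulary, p, c=1):
    """List word strings whose length K satisfies P-c <= K <= P+c (and K >= 1)."""
    candidates = []
    for k in range(max(1, p - c), p + c + 1):
        # product(..., repeat=k) yields every k-word tuple over the vocabulary.
        candidates.extend(product(vocabulary, repeat=k))
    return candidates


# Usage: two erroneous words (P = 2) and c = 1 give candidate lengths 1, 2 and 3.
candidates = candidate_word_strings(["mail", "male"], p=2, c=1)
```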

Further, the correction unit 270a calculates, for every listed word string, the likelihood when the string is constrained by the preceding and following words W1 and W2. That is, for every candidate word sequence, the likelihood is obtained from the following equation (1) using the language DB stored in the terminal.

$$P(W_1\, w\, W_2) = P(W_1, W_i, W_{i+1}, \ldots, W_j, W_2) = P(W_1)\, P(W_i/W_1) \cdots P(W_2/W_j) \tag{1}$$

In addition, the distance between the word string in the error section and a candidate may be calculated and incorporated. In that case, the following equation (2) is used.

$$P(W_1\, w\, W_2) = P(W_1, W_i, W_{i+1}, \ldots, W_j, W_2)\, P(W_i, W_{i+1}, \ldots, W_j, W_{\mathrm{error}}) \tag{2}$$

Here P(Wi, Wi+1, …, Wj, Werror) denotes the distance between the erroneous word string Werror and the candidate string Wi, Wi+1, …, Wj.

P(Wn/Wm) in this equation is the bi-gram case of the N-gram model and represents the probability that Wn appears immediately after Wm. A bi-gram is used here as an example, but other N-gram models may be used.
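The bi-gram scoring of equation (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the correction unit 270a: the probability tables, the `floor` value used for unseen words and word pairs, and the function names are all hypothetical, and the product in equation (1) is computed as a sum of logarithms to avoid numerical underflow.

```python
import math

def log_likelihood(words, unigram, bigram, floor=1e-6):
    """Log of equation (1): P(first word) times P(next/prev) along the sequence."""
    # `floor` is an assumed fallback probability for entries missing from the tables.
    logp = math.log(unigram.get(words[0], floor))
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigram.get((prev, cur), floor))
    return logp

def rank_candidates(w1, candidates, w2, unigram, bigram):
    """Score each candidate string placed between W1 and W2; best candidate first."""
    scored = [(log_likelihood([w1, *cand, w2], unigram, bigram), cand)
              for cand in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

Sorting the scored list in descending order gives the ordering that would be shown to the user, and taking its first element corresponds to choosing the most likely candidate automatically.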

The integration unit 280 integrates the text data converted by the correction unit 270a with the text data in the received recognition result, and the display unit 290 displays the integrated, corrected text data. Prior to the integration, the candidates may be listed in order of the likelihood calculated by the correction unit 270a so that the user can select one, or the candidate with the highest likelihood may be chosen automatically.
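The integration step can be pictured as splicing the corrected words back into the received word sequence. The sketch below assumes that the recognition result is held as a list of words and that the error section is given by inclusive word indices; the function name and this representation are hypothetical.

```python
def integrate(recognized, start, end, replacement):
    """Return a copy of `recognized` with words start..end (inclusive) replaced."""
    return recognized[:start] + list(replacement) + recognized[end + 1:]
```

For instance, replacing words 3 to 4 of a result with a two-word candidate leaves the words before and after the error section untouched.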

Next, the processing of the client device 110c configured as described above will be described. FIG. 15 is a flowchart showing the processing of the client device 110c. The feature amount calculation unit 210 calculates feature amount data from the input voice data, and the feature amount data compressed by the feature amount compression unit 220 is transmitted to the server device 120 (S502).

The result of the speech recognition performed in the server device 120 is received by the receiving unit 235 (S502), and an error section is specified by the error section specifying unit 240 (S503). The error section may be specified based on reliability or by user input.

Thereafter, the error section pre- and post-context specifying unit 250 specifies the contexts (words) before and after the error section (S504). The correction unit 270a then performs re-conversion processing, during which candidates for the error section are listed (S505). The correction unit 270a calculates the likelihood of each candidate (S506) and sorts the candidates by likelihood (S507), and the sorted candidates are displayed on the display unit 290 (S508).

In the client device 110c, the feature amount calculation unit 210 calculates feature amount data from the input voice, the feature amount compression unit 220 compresses it, and the transmission unit 225 transmits it to the server device 120. The server device 120 performs speech recognition, and the receiving unit 235 receives the recognition result. The correction unit 270a then performs correction processing based on the error section specified by the error section specifying unit 240 and the error section pre- and post-context specifying unit 250. After the integration processing by the integration unit 280, the display unit 290 displays the corrected recognition result. Because only the necessary part of the recognition result is corrected, speech recognition errors can be corrected easily and a correct recognition result can be obtained. Compared with the first embodiment, this embodiment neither stores the feature amount data nor uses it in re-recognition processing, so its configuration can be simplified.

<Fifth Embodiment>
Next, a mode will be described in which the client device 110d performs both the first speech recognition and the second speech recognition, instead of distributed processing in which the server device 120 performs speech recognition.

FIG. 16 is a block diagram showing the functional configuration of the client device 110d. The client device 110d includes a feature amount calculation unit 210, a first recognition unit 226 (acquisition means), a language model holding unit 227, a dictionary holding unit 228, an acoustic model holding unit 229, a feature amount storage unit 230, an error section specifying unit 240, an error section pre- and post-context specifying unit 250, an error section feature amount extraction unit 260, a correction unit 270, an acoustic model holding unit 281, a language model holding unit 282, a dictionary holding unit 283, an integration unit 280, and a display unit 290. Like the client device 110, the client device 110d is realized by the hardware shown in FIG. 3.

The client device 110d differs from the client device 110 of the first embodiment in that it has no components for communicating with the server device 120 and in that it includes the first recognition unit 226, the language model holding unit 227, the dictionary holding unit 228, and the acoustic model holding unit 229. The description below focuses on these differences.

The first recognition unit 226 performs speech recognition on the feature amount data calculated by the feature amount calculation unit 210, using the language model holding unit 227, the dictionary holding unit 228, and the acoustic model holding unit 229.

The language model holding unit 227 stores statistical information indicating the chain probabilities of words, characters, and the like. The dictionary holding unit 228 holds a database of phonemes and text and stores, for example, an HMM (Hidden Markov Model). The acoustic model holding unit 229 is a database that stores phonemes in association with their spectra.

The error section specifying unit 240 receives the recognition result produced by the first recognition unit 226 and specifies an error section. The error section pre- and post-context specifying unit 250 specifies the contexts before and after the error section, and the error section feature amount extraction unit 260 extracts the feature amount data of the error section including those contexts. The correction unit 270 then performs recognition processing again based on that feature amount data. The correction unit 270 thus functions as a second recognition unit.

Once the integration processing by the integration unit 280 has been performed, the display unit 290 can display the corrected recognition result.

Next, the operation of the client device 110d will be described. FIG. 17 is a flowchart showing the processing of the client device 110d. The feature amount calculation unit 210 calculates the feature amount data of the input voice (S601), and the calculated feature amount data is stored in the feature amount storage unit 230 (S602). In parallel with this storage processing, the first recognition unit 226 performs speech recognition (S603).

The error section in the recognition result produced by the first recognition unit 226 is specified by the error section specifying unit 240 and the error section pre- and post-context specifying unit 250 (S604). The feature amount data of the specified error section (including the preceding and following contexts) is extracted from the feature amount storage unit 230 by the error section feature amount extraction unit 260 (S605). The correction unit 270 then recognizes the speech in the error section again (S606). The result of this recognition is integrated by the integration unit 280, and the recognition result is displayed on the display unit 290 (S607).
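The extraction in S605 can be sketched as a slice over the stored feature frames. This is only an illustration: it assumes that the first recognition pass returns a start frame and an exclusive end frame for each word and that one word of context is taken on each side. The `Word` type, its field names, and the `context` parameter are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start_frame: int  # index of the first feature frame of the word
    end_frame: int    # index one past the last feature frame of the word

def extract_error_features(features, words, first, last, context=1):
    """Return the stored frames covering words first..last plus `context` words on each side."""
    lo = max(0, first - context)
    hi = min(len(words) - 1, last + context)
    return features[words[lo].start_frame:words[hi].end_frame]
```

Only the returned frames would then be passed to the second recognition pass, which is what keeps the re-recognition limited to the error section and its context.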

Because recognition processing is performed within the client device 110d by both the first recognition unit 226 and the second recognition unit (correction unit) 270, more accurate speech recognition can be performed. The first recognition unit 226 and the second recognition unit 270 desirably use different recognition methods. In that case, speech that the first recognition unit 226 failed to recognize can be complemented by the second recognition unit 270, and an accurate speech recognition result can be expected overall.

According to the client device 110d, the feature amount calculation unit 210 calculates feature amount data from the input voice, and the feature amount storage unit 230 stores it. Meanwhile, the first recognition unit 226 performs speech recognition processing based on the feature amount data, and the error section specifying unit 240 and the error section pre- and post-context specifying unit 250 specify the error section of the recognition result in which a recognition error has occurred. The correction unit 270 (second recognition unit) then corrects the recognition result in the specified error section. Because only the necessary part of the recognition result is corrected, correction can be performed easily and a correct recognition result can be obtained. In addition, performing the recognition processing twice within the client device 110d eliminates the need for the server device 120.

<Sixth Embodiment>
Next, a sixth embodiment, which is a modification of the second embodiment, will be described. This embodiment is characterized in that the end point of the error section is determined automatically.

FIG. 18 is a block diagram showing the functional configuration of the client device 110f according to the sixth embodiment. The client device 110f includes a feature amount calculation unit 210, a feature amount compression unit 220, a feature amount storage unit 230, a transmission unit 225, a receiving unit 235, an operation unit 236, a result storage unit 237, a user input detection unit 238, an error section specifying unit 240c, an end point determination unit 241, an error section pre- and post-context specifying unit 250, an error section feature amount extraction unit 260, a correction unit 270, an integration unit 280, an acoustic model holding unit 281, a language model holding unit 282, a dictionary holding unit 283, and a display unit 290. Like the client device 110, the client device 110f is realized by the hardware shown in FIG. 3.

The client device 110f differs from the second embodiment in that the error section specifying unit 240c accepts only the start point of the error section and the end point determination unit 241 determines the end point of the error section based on a predetermined condition. The description below follows the block diagram of FIG. 18 and focuses on the differences from the second embodiment.

As in the configuration shown in the second embodiment, in the client device 110f the receiving unit 235 receives the recognition result produced by the server device 120, and the result storage unit 237 stores it. The display unit 290 displays the recognition result, and the user, while viewing it, operates the operation unit 236 to designate the start point of the error section. The user input detection unit 238 detects the start point and outputs it to the error section specifying unit 240c.

The error section specifying unit 240c specifies the error section according to the start point designated by the user and the end point determined by the end point determination unit 241. To have the end point determined, the error section specifying unit 240c, on detecting that the user has designated a start point, notifies the end point determination unit 241 and instructs it to determine the end point.

The end point determination unit 241 automatically determines the end point of the error section in accordance with the instruction from the error section specifying unit 240c. For example, the end point determination unit 241 compares the reliability information included in the speech recognition result, which was received by the receiving unit 235 and stored in the result storage unit 237, with a preset threshold, and determines a word whose reliability exceeds the threshold (or the word with the highest reliability) to be the end point of the error. The end point determination unit 241 then outputs the determined end point to the error section specifying unit 240c, which can thereby specify the error section.

For example, consider the following utterance. For convenience, assume that "活性化" (activation) has been designated as the start point of the error section.
<Utterance>
「この目標を達成するためには、皆さんの協力が必要です。」 ("Your cooperation is needed to achieve this goal.")
<Speech recognition result>
「この目標を活性化のためには、皆さんの協力が必要です。」 ("Your cooperation is needed for the activation of this goal.")
The speech recognition result is divided into words as follows, where "／" indicates a word boundary.
「この／目標／を／活性化／の／ため／に／は、／皆／さん／の／協力／が／必要／です。」
If, in this speech recognition result, the reliability of "活性化" is 0.1, that of "の" is 0.01, that of "ため" is 0.4, and that of "に" is 0.6, and the threshold is 0.5, then "に" in "活性化／の／ため／に" is determined to be the end point.
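The threshold rule in this example can be sketched as a scan over per-word reliabilities. The sketch is illustrative only and assumes the reliabilities are available as a list aligned with the words. The function name, the `include_hit` flag (which selects the word just before the confident word instead), and the fallback to the last word when no word reaches the threshold are assumptions.

```python
def end_by_confidence(confidences, start, threshold=0.5, include_hit=True):
    """Return the index of the last word of the error section.

    Scans the words after `start` and stops at the first one whose reliability
    is at or above `threshold`. With include_hit=False, the word just before
    that one is returned instead.
    """
    for i in range(start + 1, len(confidences)):
        if confidences[i] >= threshold:
            return i if include_hit else i - 1
    return len(confidences) - 1  # assumed fallback: run to the end of the result
```

With the reliabilities of the example (0.1, 0.01, 0.4, 0.6 from "活性化" onward) and a threshold of 0.5, the scan stops at "に".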

The end point determination unit 241 may instead determine the word immediately before the one whose reliability is at or above the threshold ("ため" in the example above) to be the end point. Either method may be used, because it suffices that the specified error section ends up containing the erroneous part.

This way of specifying the error section is easy to use because it matches users' everyday correction habits. For example, when pointing out an error in kana-kanji conversion, a user typically first marks the start point, then deletes the error, and then types the correct word string. In the method described above, the end point is likewise fixed automatically once the start point has been entered, so it fits that pattern and the user can operate it without any sense of incongruity.

The end point determination unit 241 is not limited to the method described above when determining the end point. For example, it may determine the end point according to specific pronunciation symbols, or take the M-th word counted from the error start point as the end point. Here, the method based on specific pronunciation symbols makes the determination from pauses in the utterance: it may rely on short pauses (commas) that appear at phrase boundaries and long pauses (periods) that appear at the end of an utterance. The determination then falls on sentence breaks, and more accurate speech recognition can be expected.

A specific example follows, using the same utterance as above.
<Utterance>
「この目標を達成するためには、皆さんの協力が必要です。」 ("Your cooperation is needed to achieve this goal.")
<Speech recognition result>
「この目標を活性化のためには、皆さんの協力が必要です。」 ("Your cooperation is needed for the activation of this goal.")

When the user operates the operation unit 236 to set the position just after "この目標を" as the start point of the error section, the end point determination unit 241 determines the pause (comma) nearest to that position to be the end point. The error section specifying unit 240c can specify the error section based on this end point. In the example above, the "、" in "ためには、" is designated as the end point of the error section. Note that the "、" corresponds to no actual sound; it is a momentary silence.

Besides commas and periods, the specific pronunciations may also be fillers such as "え〜" ("er") and "あの〜" ("um"), or words such as "ます" and "です".
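The pause-based rule can be sketched in the same style. The sketch assumes that the recognition result is a token list in which pauses appear as separate "、" and "。" tokens, which is a simplification of the word division used in the example. The set of end markers, the function name, and the fallback to the last token are hypothetical.

```python
END_MARKS = {"、", "。"}  # short pause (comma) and long pause (period)

def end_by_pause(tokens, start, marks=END_MARKS):
    """Return the index of the first end marker at or after `start`."""
    for i in range(start, len(tokens)):
        if tokens[i] in marks:
            return i
    return len(tokens) - 1  # assumed fallback when no marker follows
```

Fillers or sentence-final words such as "です" could be handled by passing a different `marks` set.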

Next, an example of the method that takes the M-th word from the error start point as the end point is shown. The sentence below is divided into words, where "／" indicates a word boundary.
「この／目標／を／活性化／の／ため／に／は、／皆／さん／の／協力／が／必要／です。」

For example, when the start point is "活性化" and M = 3, "ため" in "活性化／の／ため" is the end-point word. The error section specifying unit 240c can therefore specify "活性化／の／ため" as the error section. Naturally, M may take values other than 3.
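Counting the start word as the first word, the M-th word rule reduces to simple index arithmetic. The helper below is a hypothetical sketch; clamping to the last word of the result is an assumption that the text does not state.

```python
def end_by_count(num_words, start, m=3):
    """Return the index of the M-th word counted from `start` (start itself is the 1st)."""
    return min(start + m - 1, num_words - 1)
```

With the start point at index 3 ("活性化") and M = 3, the end point is index 5 ("ため").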

Next, an example of the method that takes a word with few recognition result candidates (competing hypotheses) as the end point is shown. Suppose that the following candidates are given for "この／目標／を／活性化／の／ため".
「活性化」: "だれ", "沢山", "お勧め"
「の」: "か", "ある"
「ため」: - (no candidates)

The number of candidates reflects the ambiguity of the section: the lower the reliability, the more candidates are transmitted from the server device 120. In this example, instead of transmitting reliability information, the server device 120 is configured to transmit the other candidates obtained based on the reliability information to the client device 110 as they are.

In this case, "ため" has no candidates, so its reliability can be considered correspondingly high. In this example, therefore, the preceding "の" can be determined to be the end point of the error section. The end point need not be exactly the preceding word; some margin may be allowed.
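The candidate-count rule can be sketched as follows, assuming that the client holds, for each word, the list of alternative candidates sent by the server. The function name, the `max_alts` parameter (how few alternatives count as unambiguous), and the fallback to the last word are hypothetical.

```python
def end_by_candidates(alternatives, start, max_alts=0):
    """Return the index of the word just before the first unambiguous word after `start`."""
    for i in range(start + 1, len(alternatives)):
        if len(alternatives[i]) <= max_alts:
            return i - 1
    return len(alternatives) - 1  # assumed fallback: every later word is ambiguous
```

For the example, "ため" has no alternatives, so the section ends at the preceding "の".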

As described above, the end point may be determined by a method based on reliability, a method using specific pronunciation symbols (or pronunciations), or a method that treats the span from the start point to the M-th word as the error section. These methods may also be combined: the correction results of the several methods may be presented in N-best form, or one may be selected from the recognition results of the several methods. In that case, the recognition results may be listed in order of their scores so that the user can select any of them from the list.

Based on the error section specified in this way by the error section specifying unit 240c, the error section pre- and post-context specifying unit 250 specifies a section that includes the words before and after it, the error section feature amount extraction unit 260 extracts the corresponding feature amount data from the feature amount storage unit 230, and the correction unit 270 performs correction by carrying out re-recognition processing on that feature amount data.

Next, the operation of the client device 110f configured as described above will be described. FIG. 19 is a flowchart showing the processing of the client device 110f.

The feature amount calculation unit 210 extracts feature amount data from the voice input through the microphone (S101). The feature amount data is stored in the feature amount storage unit 230 (S102). Next, the feature amount compression unit 220 compresses the feature amount data (S103). The compressed feature amount data is transmitted to the server device 120 by the transmission unit 225 (S104).

Next, speech recognition is performed in the server device 120, and the recognition result transmitted from the server device 120 is received by the receiving unit 235, stored temporarily, and displayed on the display unit 290 (S105a). The user determines the start point of the error section from the recognition result shown on the display unit 290 and designates it by operating the operation unit 236. When the user input detection unit 238 detects that the start point has been designated, the end point determination unit 241 automatically determines the end point of the error section. For example, the end point may be determined based on the reliability included in the speech recognition result, it may be the position where a predetermined pronunciation symbol appears, or it may be the M-th word from the start point (M being an arbitrary predetermined value).

The start point and the end point are thus specified by the error section specifying unit 240c. The preceding and following contexts are then specified based on the specified error section (S106c). Based on the error section including these contexts, the error section feature amount extraction unit 260 extracts the feature amount data (S107), and the correction unit 270 performs speech recognition again to generate text data for the error section (S108). The text data for the error section is then integrated with the text data received by the receiving unit 235, and the correct text data is displayed on the display unit 290 (S109).

The processing of S105a to S108, including S106c, is almost the same as in the flowchart shown in FIG. 10, except that in the processing of S305 the end point determination unit 241 automatically determines the end point of the error section and stores it.

As described above, according to this embodiment the error section can be specified in a way that matches the user's everyday correction habits, so that a very easy-to-use device can be provided.

<Seventh Embodiment>
Next, a seventh embodiment will be described. In this embodiment, the user specifies the first character of the error section, and that character is used as a constraint condition so that speech recognition is performed more accurately.

FIG. 20 is a block diagram showing the functional configuration of the client device 110g according to the seventh embodiment. The client device 110g includes a feature amount calculation unit 210, a feature amount compression unit 220, a feature amount storage unit 230, a transmission unit 225, a receiving unit 235, an operation unit 236, a result storage unit 237, a user input detection unit 238, an error section specifying unit 240a, an error section pre- and post-context specifying unit 250a, an error section feature amount extraction unit 260, a correction unit 270, an integration unit 280, an acoustic model holding unit 281, a language model holding unit 282, a dictionary holding unit 283, and a display unit 290. Like the client device 110, the client device 110g is realized by the hardware shown in FIG. 3.

The client device 110g is characterized in that the operation unit 236 accepts from the user a corrected character for the error section as a constraint condition, the error section pre- and post-context specifying unit 250a specifies both the contexts before and after the error section and the corrected character accepted by the operation unit 236, and the correction unit 270 performs correction by carrying out re-recognition processing with these contexts and the corrected character as constraint conditions.

That is, the operation unit 236 accepts input from the user that designates the error section, and then accepts the input of a corrected character for the error section.

The error section pre- and post-context specifying unit 250a performs substantially the same processing as the error section pre- and post-context specifying unit 250 of the first embodiment: it specifies the words (one recognition unit each) recognized before and after the error section, and it also specifies the corrected character accepted by the operation unit 236.

The correction unit 270 performs re-recognition based on the feature amount data extracted by the error section feature amount extraction unit 260 and the constraint conditions specified by the error section context specifying unit 250a, and thereby executes the correction processing.

For example, the above processing is described using the following example.
<Utterance>
“この目標を達成するためには、皆さんの協力が必要です。” (“To achieve this goal, everyone's cooperation is needed.”)
<Speech recognition result>
“この目標を活性化のためには、皆さんの協力が必要です。” (“For the activation of this goal, everyone's cooperation is needed.”)
In this case, the user operates the operation unit 236 to input the correct characters at the start point of the error section (in the example above, the position immediately after “この目標を”). The kana string to be input is “たっせいするために”. The following description takes as an example the case where “た”, the first part of that input, is entered. The start and end points of the error section are assumed to have been determined, or to be determined, by the same method as described above.

When the user inputs “た” via the operation unit 236, the error section context specifying unit 250a takes “この目標を” as the context and “た” as the input character, and sets their concatenation, “この目標をた”, as the constraint condition for recognizing the feature amount data.
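The effect of the typed character as a constraint can be sketched as follows. This is only an illustration under stated assumptions: the `Hypothesis` type, the `apply_typed_prefix` helper, the candidate list, and all scores are hypothetical, and the candidates are filtered after the fact by their kana reading, whereas an actual decoder would more likely apply the constraint during the search itself.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    """One re-recognition candidate for the error section (hypothetical type)."""
    surface: str   # text as it would be displayed
    reading: str   # kana reading of the text
    score: float   # recognizer score, higher is better (illustrative values)


def apply_typed_prefix(hyps: List[Hypothesis], typed_kana: str) -> List[Hypothesis]:
    """Keep candidates whose reading starts with the typed kana, best score first."""
    kept = [h for h in hyps if h.reading.startswith(typed_kana)]
    return sorted(kept, key=lambda h: h.score, reverse=True)


# Candidates for the span that follows the context "この目標を".
hyps = [
    Hypothesis("活性化のために", "かっせいかのために", 0.42),
    Hypothesis("達成するために", "たっせいするために", 0.38),
    Hypothesis("立てるために", "たてるために", 0.11),
]

# The user has typed only the first kana, "た".
result = apply_typed_prefix(hyps, "た")
print([h.surface for h in result])
```

With the single character “た” as the constraint, the earlier misrecognition “活性化のために” is no longer eligible, because its reading begins with “か”.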

Presenting the user with the result of running speech recognition again, with the user's character input as a constraint condition, yields a more accurate recognition result. The correction method may combine speech recognition with a key-based character input method, for example kana-kanji conversion. Kana-kanji conversion compares the input characters against a dictionary and predicts the conversion result. For example, when “た” is input, words beginning with “た” are listed in order from the database and presented to the user.

Here, this function may be used to display a list containing both candidates from the kana-kanji conversion database and candidates obtained by speech recognition, from which the user selects any candidate. The list may be ordered by the score attached to each conversion result or recognition result. Alternatively, the kana-kanji conversion candidates may be compared with the speech recognition candidates and, for candidates that match exactly or partially, the respective scores may be summed and the list ordered by the summed score. For example, suppose kana-kanji conversion candidate A1 “達成” has a score of 50 and speech recognition candidate B1 “達成する” has a score of 80. Because A1 and B1 match partially, each score may be multiplied by a predetermined coefficient and the results summed, and the list displayed according to that sum. For an exact match, no adjustment such as multiplication by a predetermined coefficient is needed. Also, once the user selects kana-kanji conversion candidate A1 “達成”, “この目標を達成” may be set as the constraint condition, the feature amount data corresponding to the remaining, not yet confirmed “する” recognized again, and the candidate list redisplayed.
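The score combination described above might look like the following sketch. The text does not give the value of the predetermined coefficient or define partial matching precisely, so the coefficient `0.5`, the substring test used for "partial match", and the function name `pair_score` are all assumptions made for illustration.

```python
from typing import Optional


def pair_score(kk_text: str, kk_score: float,
               asr_text: str, asr_score: float,
               coeff: float = 0.5) -> Optional[float]:
    """Combine the scores of a kana-kanji candidate and a speech recognition candidate.

    Exact match: the two scores are summed as they are.
    Partial match (assumed here to mean one string contains the other):
    each score is multiplied by `coeff` before summing.
    No match: None, meaning each candidate keeps its own score.
    """
    if kk_text == asr_text:
        return kk_score + asr_score
    if kk_text in asr_text or asr_text in kk_text:
        return coeff * kk_score + coeff * asr_score
    return None


# Candidate A1 "達成" (score 50) and candidate B1 "達成する" (score 80) match partially.
print(pair_score("達成", 50, "達成する", 80))
```

A single coefficient is used for both scores only to keep the sketch short; separate coefficients per source would fit the description equally well.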

Next, the operation of the client device 110g configured as above is described. FIG. 21 is a flowchart showing the processing of the client device 110g.

Speech recognition is performed in the server device 120, and the recognition result is transmitted from the server device 120, received by the reception unit 235, temporarily stored, and displayed on the display unit 290 (S105a). The user then designates the error section based on the recognition result displayed on the display unit 290 (S106d). The user further inputs characters on the operation unit 236 to correct the recognition result in the error section. On accepting the character input, the operation unit 236 outputs it to the error section context specifying unit 250a, which specifies the input characters together with the context before and after the designated error section. Based on the error section including this context, the error section feature amount extraction unit 260 extracts the feature amount data (S107), and the correction unit 270 performs speech recognition again to generate the text data for the error section (S108). The text data for the error section is then integrated with the text data received by the reception unit 235, and the corrected text data is displayed on the display unit 290 (S109).

The processing of S105a to S108, including S106d, is substantially the same as that of the flowchart shown in FIG. 10. In this embodiment, in addition to the steps of the flowchart of FIG. 10, S309 must also set the characters accepted by the operation unit 236 as a constraint condition. Input of the characters that serve as the constraint condition must therefore be completed before S309.

As described above, according to this embodiment, more accurate speech recognition can be performed by setting, as constraint conditions, the characters specified by the user in addition to the context before and after the error section.

<Eighth Embodiment>
Next, the eighth embodiment is described. In this embodiment, re-recognition by the correction unit 270 is prevented from producing the same recognition result as the one obtained before re-recognition.

FIG. 22 is a block diagram showing the functional configuration of the client device 110h of the eighth embodiment. The client device 110h comprises a feature amount calculation unit 210, a feature amount compression unit 220, a feature amount storage unit 230, a transmission unit 225, a reception unit 235, an operation unit 236, a result storage unit 237, a user input detection unit 238, an error section specifying unit 240a, an error section context specifying unit 250, an error section feature amount extraction unit 260, a correction unit 270b, an integration unit 280, an acoustic model holding unit 281, a language model holding unit 282, a dictionary holding unit 283, and a display unit 290. Like the client device 110, the client device 110h is realized by the hardware shown in FIG. 3. The following description focuses on the differences from the client device 110 in FIG. 2.

The correction unit 270b performs re-recognition and related processing in the same way as the correction unit 270 in FIG. 3. In addition, based on the recognition result stored in the result storage unit 237, the correction unit 270b performs re-recognition so as not to repeat the same recognition error. That is, so as not to obtain the same result as the recognition result in the error section specified by the error section specifying unit 240a, the correction unit 270b excludes from the candidates, during the re-recognition search, any path that contains that recognition result. To do this, the correction unit 270b multiplies the probability of the corresponding hypothesis for the feature amount data of the error section by a predetermined coefficient that makes it extremely small, so that the minimized candidate is not selected. In the method above, a candidate that may be erroneous (for example, “活性化”) is excluded from the recognition candidates during re-recognition. The approach is not limited to this, however: such a candidate (for example, “活性化”) may instead simply not be displayed when the re-recognition result is presented.
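The exclusion by coefficient can be illustrated with a small sketch. It operates on a finished list of hypotheses rather than on paths inside a decoder's search, and the penalty value `1e-6`, the probabilities, and the function name are assumptions chosen only for the example.

```python
from typing import List, Tuple


def penalize_previous_result(hypotheses: List[Tuple[str, float]],
                             previous_text: str,
                             penalty: float = 1e-6) -> List[Tuple[str, float]]:
    """Scale down the probability of every hypothesis containing the rejected text."""
    return [
        (text, prob * penalty if previous_text in text else prob)
        for text, prob in hypotheses
    ]


# Hypotheses for the error section; "活性化" was the earlier, rejected result.
hypotheses = [("活性化のために", 0.5), ("達成するために", 0.3), ("立てるために", 0.1)]

rescored = penalize_previous_result(hypotheses, "活性化")
best_text, best_prob = max(rescored, key=lambda pair: pair[1])
print(best_text)
```

Multiplying by a very small coefficient, rather than deleting the hypothesis outright, leaves the candidate in the list, which matches the description that the minimized candidate is simply never selected.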

The client device 110h executes substantially the same processing as the flowchart shown in FIG. 8. It differs in that the recognition processing of the error section in S108 excludes the previous recognition result from the candidates so that the same result is not displayed.

As described above, a word selected for correction was wrong, so it should not appear again in the result after re-recognition. This embodiment prevents such a result from being displayed.

<Ninth Embodiment>
Next, the ninth embodiment is described. In this embodiment, an average value is calculated over the error section of the feature amount data extracted by the error section feature amount extraction unit 260, and re-recognition is performed on the data obtained by subtracting that average value from the feature amount data.

The specific configuration is as follows. FIG. 23 is a block diagram showing the functions of the client device 110i of the ninth embodiment. The client device 110i comprises a feature amount calculation unit 210, a feature amount compression unit 220, a feature amount storage unit 230, a transmission unit 225, a reception unit 235, an error section specifying unit 240, an error section context specifying unit 250, an error section feature amount extraction unit 260, an average value calculation unit 261 (calculation means), a feature normalization unit 262 (correction means), a correction unit 270 (correction means), an integration unit 280, an acoustic model holding unit 281, a language model holding unit 282, a dictionary holding unit 283, and a display unit 290. Like the client device 110, the client device 110i is realized by the hardware shown in FIG. 3. The following description focuses on the average value calculation unit 261 and the feature normalization unit 262, which are the differences from the client device 110 in FIG. 2.

The average value calculation unit 261 calculates the average of the feature amount data extracted by the error section feature amount extraction unit 260 over the error section (or over the error section together with the parts before and after it). More specifically, it cumulatively adds the output value (magnitude) of each frequency across the recognition units in the error section, and divides each accumulated value by the number of recognition units to obtain the average. For example, in the error section “活性化／の／ため”, the recognition units are the parts separated by the slash “／”. If recognition frame n, corresponding to one recognition unit, consists of frequencies f_{n,1} to f_{n,12} with output values g_{n,1} to g_{n,12}, then the average value for frequency f_1 is g_1 = (Σ_n g_{n,1}) / N, where n runs from 1 to the number of recognition units N (N = 3 in the example above).

That is, if “活性化” consists of frequencies f_{1,1} to f_{1,12} (output values g_{1,1} to g_{1,12}), “の” consists of frequencies f_{2,1} to f_{2,12} (output values g_{2,1} to g_{2,12}), and “ため” consists of frequencies f_{3,1} to f_{3,12} (output values g_{3,1} to g_{3,12}), then the average value for frequency f_1 is calculated as (g_{1,1} + g_{2,1} + g_{3,1}) / 3.

The feature normalization unit 262 subtracts the per-frequency average values calculated by the average value calculation unit 261 from the feature amount data, which is composed of those frequencies. The correction unit 270 then performs correction by running re-recognition on the data obtained by the subtraction.
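The averaging and subtraction steps can be sketched in a few lines. The example uses two frequency components per recognition unit instead of twelve, and the output values are made up for illustration; the function names are likewise hypothetical.

```python
from typing import List


def mean_per_frequency(frames: List[List[float]]) -> List[float]:
    """Average each frequency component over all recognition units in the section."""
    count = len(frames)
    dims = len(frames[0])
    return [sum(frame[k] for frame in frames) / count for k in range(dims)]


def subtract_mean(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-frequency average from every recognition unit."""
    means = mean_per_frequency(frames)
    return [[value - mean for value, mean in zip(frame, means)] for frame in frames]


# Three recognition units ("活性化", "の", "ため"), two components each (illustrative values).
frames = [
    [1.0, 4.0],
    [2.0, 5.0],
    [3.0, 9.0],
]

means = mean_per_frequency(frames)   # first component: (1.0 + 2.0 + 3.0) / 3
normalized = subtract_mean(frames)
print(means, normalized)
```

After the subtraction, each component averages to zero over the section, so any constant offset shared by all recognition units is removed before re-recognition.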

In this embodiment, correcting the feature amount data with the average value calculated by the average value calculation unit 261 yields data from which the characteristics of the sound collecting device, such as the microphone used to input speech to the feature amount calculation unit 210, have been removed. That is, noise introduced when the microphone collects sound can be removed, and the speech can be corrected (recognized) more accurately. In the embodiment above, the processing is applied to the error section extracted by the error section feature amount extraction unit 260, but the feature amount data of a section of a certain length that includes the error section may be used instead.

The average value calculation unit 261 and the feature normalization unit 262 described above can also be applied to each of the second to eighth embodiments.

<Tenth Embodiment>
In the client devices 110 to 110i, which are the speech recognition result correcting devices described in the first to ninth embodiments, the correction unit 270 performs the correction processing (re-recognition processing), but the invention is not limited to this. That is, the error section specified by the error section specifying unit 240 may be notified to the server device 120, so that the server device 120 performs the correction processing again and the reception unit 235 receives the correction result. The re-correction in the server device 120 uses the same correction processing as the correction unit 270 of the client device 110 described above. As a specific example of the notification by the client device 110, the error section specifying unit 240 may calculate the time information of the error section it has specified, or time information that also covers the words before and after it, and the transmission unit 225 may notify the server device 120 of that time information. The server device 120 avoids repeating the erroneous recognition by performing speech recognition processing that differs from the processing performed the first time, for example by changing the acoustic model, the language model, or the dictionary.

<Eleventh Embodiment>
Next, the client device 110k of the eleventh embodiment is described. The client device 110k recognizes a subword section and performs correction processing using the subword character string described in that subword section. FIG. 26 is a block diagram showing the functions of the client device 110k.

The client device 110k comprises a feature amount calculation unit 210, a feature amount compression unit 220, a transmission unit 225, a feature amount storage unit 230, a reception unit 235, an error section specifying unit 240, a subword section specifying unit 242, a division unit 243, an error section feature amount extraction unit 260, a dictionary addition unit 265, a correction unit 270, an integration unit 280, an acoustic model holding unit 281, a language model holding unit 282, a dictionary holding unit 283, and a display unit 290.

This embodiment differs from the first embodiment in that it includes the subword section specifying unit 242, the division unit 243, and the dictionary addition unit 265. The configuration is described below with a focus on these differences.

The subword section specifying unit 242 specifies, within the error section specified by the error section specifying unit 240, the section that contains a subword character string. A subword character string carries the attribute information “subword”, which indicates that it is an unknown word, and the subword section specifying unit 242 can specify the subword section based on that attribute information.

For example, FIG. 28 shows a recognition result obtained in the server device 120 from the utterance. In FIG. 28, “サンヨウムセン” carries the attribute information “subword”. Based on that attribute information, the subword section specifying unit 242 recognizes “サンヨウムセン” as a subword character string and specifies that part of the string as the subword section.

In FIG. 28, a frame index is attached to each recognition unit of the recognition result obtained from the utterance. As above, one frame is about 10 msec. Also in FIG. 28, the error section specifying unit 240 can specify the error section by the same processing as described above, and the span from “では” (the second recognition unit) to “が” (the eighth recognition unit) can be specified as the error section.

The division unit 243 divides the error section specified by the error section specifying unit 240, using as the boundary the subword character string contained in the subword section specified by the subword section specifying unit 242. In the example of FIG. 28, the error section is divided into section 1 and section 2 around the subword character string “サンヨウムセン”. Section 1 runs from the second recognition unit “では” through the fifth recognition unit “サンヨウムセン”, that is, from 100 msec to 500 msec in terms of the frame index. Section 2 runs from the fifth recognition unit “サンヨウムセン” through the eighth recognition unit “が”, that is, from 300 msec to 660 msec.
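The division into two overlapping sections can be sketched directly from the times given for FIG. 28. The function names are hypothetical, and the conversion to frame indices assumes exactly 10 msec per frame, whereas the text says only "about 10 msec".

```python
from typing import Tuple

Span = Tuple[int, int]  # (start_msec, end_msec)


def split_error_section(error: Span, subword: Span) -> Tuple[Span, Span]:
    """Split an error section into two sections that both contain the subword.

    Section 1 runs from the start of the error section to the end of the subword;
    section 2 runs from the start of the subword to the end of the error section.
    """
    if not (error[0] <= subword[0] < subword[1] <= error[1]):
        raise ValueError("subword section must lie inside the error section")
    return (error[0], subword[1]), (subword[0], error[1])


def to_frames(span: Span, frame_msec: int = 10) -> Span:
    """Convert a millisecond span to frame indices (assumes 10 msec frames)."""
    return span[0] // frame_msec, span[1] // frame_msec


# Error section 100-660 msec; the subword "サンヨウムセン" occupies 300-500 msec.
section1, section2 = split_error_section((100, 660), (300, 500))
print(section1, section2, to_frames(section1), to_frames(section2))
```

Both sections include the subword span, which is what later allows their re-recognition results to be joined at the shared subword.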

The dictionary addition unit 265 adds the subword character string specified by the subword section specifying unit 242 to the dictionary holding unit 283. In the example of FIG. 28, “サンヨウムセン” is newly added to the dictionary holding unit 283 as a single word. The reading of the subword is also added to the dictionary holding unit 283, and the connection probabilities between the subword and other words are added to the language model holding unit 282. For the connection probability values in the language model holding unit 282, a class dedicated to subwords prepared in advance may be used. Because the character strings of the subword model are mostly proper nouns, the values of the noun (proper noun) class may be used instead.

With this configuration, the error section feature amount extraction unit 260 extracts the feature amount data held in the feature amount storage unit 230 for section 1 and section 2 obtained by the division unit 243. The correction unit 270 then performs correction by running re-recognition on the feature amount data of each section. Taking FIG. 28 as an example, the correction result for section 1 is “では電気メーカのサンヨウムセン”, and the correction result for section 2 is “サンヨウムセンの製品は評判が”.

The integration unit 280 integrates the recognition results corrected by the correction unit 270 (section 1 and section 2) at the subword character string that forms their boundary, integrates the outcome with the recognition result received by the reception unit 235, and displays it on the display unit 290. Taking FIG. 28 as an example, the final integrated text of the error section is “では電気メーカのサンヨウムセンの製品は評判が”.
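Joining the two section results at the shared subword can be sketched as a simple string operation. The function name is hypothetical, and the sketch assumes that the subword appears unchanged at the end of section 1 and at the start of section 2, as in the FIG. 28 example.

```python
def join_on_boundary(section1_text: str, section2_text: str, boundary: str) -> str:
    """Concatenate two section results, keeping the shared boundary word only once."""
    if not section1_text.endswith(boundary):
        raise ValueError("section 1 does not end with the boundary word")
    if not section2_text.startswith(boundary):
        raise ValueError("section 2 does not start with the boundary word")
    return section1_text + section2_text[len(boundary):]


merged = join_on_boundary(
    "では電気メーカのサンヨウムセン",
    "サンヨウムセンの製品は評判が",
    "サンヨウムセン",
)
print(merged)
```

If re-recognition changed the subword in only one of the sections, the checks above would fail, and some additional alignment rule, which the text does not specify, would be needed.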

Next, the operation of the client device 110k configured as above is described. FIG. 27 is a flowchart showing the operation of the client device 110k.

In S101 to S105, the same processing as that of the client device 110 shown in FIG. 6 is performed. That is, the feature amount calculation unit 210 extracts the feature amount data of the speech input through the microphone (S101). The feature amount data is stored in the feature amount storage unit 230 (S102). Next, the feature amount compression unit 220 compresses the feature amount data (S103). The transmission unit 225 transmits the compressed feature amount data to the server device 120 (S104). Speech recognition is then performed in the server device 120, and the recognition result is transmitted from the server device 120 and received by the reception unit 235 (S105). The error section specifying unit 240 then specifies the error section from the speech recognition result (S106). The context before and after may also be specified based on this error section.

Next, the subword section is specified and fixed by the subword section specifying unit 242 (S701). At this point, if the subword character string in the subword section is found in a user dictionary held by the client device 110k (for example, words the user has registered in the kana-kanji conversion dictionary, or names registered in the address book or phone book), it may be replaced with that word. The division unit 243 then divides the error section with the subword section as the boundary (S702). Along with this division, the dictionary addition unit 265 stores the specified subword character string in the dictionary holding unit 283 (S703).

After that, the error section feature amount extraction unit 260 extracts the feature amount data of the error section and of the subword section (S107a), and the correction unit 270 performs correction by re-recognizing the feature amount data of the error section and the subword section (S108a). The text data for the error section is then integrated with the text data received by the reception unit 235, and the correctly recognized text data is displayed on the display unit 290 (S109). In the integration, the results of section 1 and section 2 are joined using the boundary word as a guide. If the subword character string has been converted based on the user dictionary, the correction unit 270 may perform the correction by running speech recognition with the converted character string as a constraint condition.

This embodiment has been described on the premise that the subword character string is contained in the server's recognition result, but the subword character string may instead be generated by the client device 110k. In that case, the subword character string is generated after the error section specifying processing in S106 of FIG. 27, and the subword section is then fixed. The processing of FIG. 27 performed by the client device 110k may also be performed by a server or another device. Furthermore, although correction by recognition has been described, other approaches may be used, for example a method based on the similarity between character strings. In that case, the feature amount storage unit 230 and the processing that stores the acoustic feature amount data (S102), the error section feature amount extraction unit 260, the correction unit 270, and the recognition using acoustic features (S108a) are not needed.

Furthermore, when the subword character string is already in the dictionary holding unit 283, the information in the dictionary holding unit 283 may be used. For example, if the dictionary holding unit 283 already contains a word corresponding to “サンヨウムセン”, such as “三洋無線”, the string need not be added to the subword dictionary.

In the example above, section 1 and section 2 each include the subword section when the error section is divided, but this is not essential, and the divided sections need not include the subword. That is, section 1 may run from the second word “では” to the start of the fifth unit, the subword character string, and section 2 may run from the end of that subword character string to the end of the eighth word “が”. In this case, the subword character string need not be added to the dictionary.

Next, the operation and effects of the client device 110k of this embodiment will be described. In the client device 110k, the receiving unit 235 receives the recognition result from the server device 120, and the error section designating unit 240 designates an error section. The subword section designating unit 242 then designates a subword section within the error section. The subword section can be identified from the attribute information attached to the recognition result transmitted from the server device 120. The correction unit 270 extracts the feature amount data corresponding to the subword section designated by the subword section designating unit 242 from the feature amount storage unit 230, and corrects the recognition result by performing re-recognition using the extracted feature amount data. This makes it possible to correct unknown words such as subwords. In other words, re-recognition can be performed in accordance with the section of an unknown word, such as a subword section.
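As an illustration of how the feature amount data for a designated section might be taken from the feature amount storage unit 230, the following Python sketch converts a time span into frame indices. The 10 ms frame shift, the list-based storage, and the function name `extract_section` are assumptions made for this example and are not specified in this description.

```python
from typing import List, Sequence

FRAME_SHIFT_MS = 10  # assumed frame shift; not stated in the description


def extract_section(features: Sequence[List[float]],
                    start_ms: int, end_ms: int,
                    frame_shift_ms: int = FRAME_SHIFT_MS) -> List[List[float]]:
    """Return the stored feature frames covering the span [start_ms, end_ms)."""
    if start_ms < 0 or end_ms < start_ms:
        raise ValueError("invalid section")
    first = start_ms // frame_shift_ms
    last = -(-end_ms // frame_shift_ms)  # ceiling division
    return list(features[first:last])


# Hypothetical stored utterance: 100 frames of 3-dimensional features (1 second)
stored = [[float(i), 0.0, 0.0] for i in range(100)]
section = extract_section(stored, 300, 500)
print(len(section), section[0][0], section[-1][0])
```

Only the extracted frames would then be passed to the recognizer, so the re-recognition target is limited to the designated section.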

In the client device 110k of this embodiment, the dividing unit 243 divides the recognition result into a plurality of sections according to the subword section designated by the subword section designating unit 242. The correction unit 270 then corrects the recognition result for each of the divided sections produced by the dividing unit 243. This shortens the recognition target and allows more accurate recognition processing.

In the client device 110k, the dividing unit 243 divides the recognition result so that the end point of the subword section is the end point of one divided section and the start point of the subword section is the start point of the divided section that follows it. The correction unit 270 corrects the recognition result for each of the divided sections produced by the dividing unit 243, and uses the subword section as a constraint condition in the correction of each divided section. As a result, the subword section is included in every divided section. Because the subword section is always included when recognition processing is performed, recognition processing can be carried out with the subword character string as a constraint condition.
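The division rule above can be sketched in Python as follows. This is a minimal illustration that assumes a single subword section per recognition result. The `Unit` record, the placeholder words `w1` to `w5`, and the time stamps are hypothetical and are not taken from this description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Unit:
    text: str
    start_ms: int
    end_ms: int
    is_subword: bool = False


def split_at_subword(units: List[Unit]) -> Tuple[List[Unit], List[Unit]]:
    """Split so that the subword closes section 1 and opens section 2."""
    idx = next((i for i, u in enumerate(units) if u.is_subword), None)
    if idx is None:
        raise ValueError("no subword section in recognition result")
    # The subword unit is deliberately included in both sections
    return units[: idx + 1], units[idx:]


units = [
    Unit("w1", 0, 200), Unit("w2", 200, 300),
    Unit("サンヨウムセン", 300, 500, is_subword=True),
    Unit("w4", 500, 700), Unit("w5", 700, 900),
]
first, second = split_at_subword(units)
print(first[-1].end_ms, second[0].start_ms)
```

Sharing the subword unit between the two sections is what lets it act as a constraint when each section is re-recognized.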

In the client device 110k, the dictionary adding unit 265 adds the subword character string in the subword section designated by the subword section designating unit 242 to the dictionary holding unit 283 used for recognition processing. Subword character strings can thus be accumulated and used effectively in later recognition processing, which allows more accurate recognition.

<Twelfth Embodiment>
The eleventh embodiment described a method of dividing the recognition result with the subword character string as the boundary. This embodiment describes a method in which the subword character string is always used in re-recognition, even when no division is performed. This embodiment is assumed to have the same apparatus configuration as the eleventh embodiment.

FIG. 29 is a conceptual diagram of the search process in speech recognition. FIG. 29(a) is a conceptual diagram showing a search process that includes the subword character string "サンヨウムセン" (Sanyomusen), and FIG. 29(b) is a conceptual diagram showing the search process over a plurality of sections with the subword character string as a constraint condition.

In general, the speech recognition search process calculates the likelihood of the hypotheses on all paths, stores the intermediate results, and finally generates results in descending order of likelihood. In practice, to limit cost, the search is narrowed to within a certain range partway through. In this embodiment, when the subword section designated by the subword section designating unit 242 lies in a predetermined section (for example, between 2 and 3 seconds), the correction unit 270 uses the subword character string described in this subword section to rank paths on which the subword character string appears during the search above the other paths, and performs recognition processing so that a recognition result containing the subword character string is ultimately output preferentially. For example, the following search paths are obtained and held by the correction unit 270.
Path 1: 最近 / では / 玄関 / で / 待ち合わせ (recently / as for / entrance / at / meeting up)
Path 2: 昨日 / の / 会議 / は / 世界 / 中 / (yesterday / of / meeting / as for / world / middle)
Path 3: 最近 / では / 単価 / 高い / サンヨウムセン (recently / as for / unit price / high / Sanyomusen)
Path 4: 最近 / では / 電気メーカ / の / サンヨウムセン (recently / as for / electrical manufacturer / of / Sanyomusen)

Because "サンヨウムセン" appears on paths 3 and 4, the correction unit 270 ranks these two paths above paths 1 and 2. If the search range is narrowed at this point, paths 3 and 4 are kept and paths 1 and 2 are discarded. In addition, the position at which "サンヨウムセン" appears may be checked, and the paths may be narrowed to those in which it appears within a certain range close to its position in the original recognition result (300 ms to 500 ms). Candidates in which "サンヨウムセン" appears may also be output in preference to candidates in which it does not appear in the final recognition result.
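The prioritization and narrowing described above can be sketched in Python using the four example paths. The log-likelihood values, the function names `prioritize` and `prune`, and the fallback used when no path contains the subword are assumptions made for this illustration. The actual search in the correction unit 270 is not specified at this level of detail.

```python
from typing import List, Tuple

Hypothesis = Tuple[List[str], float]  # (word sequence, log-likelihood)


def prioritize(hyps: List[Hypothesis], subword: str) -> List[Hypothesis]:
    """Rank hypotheses containing `subword` first, then by likelihood."""
    return sorted(hyps, key=lambda h: (subword in h[0], h[1]), reverse=True)


def prune(hyps: List[Hypothesis], subword: str) -> List[Hypothesis]:
    """Keep only hypotheses containing `subword`.

    Falling back to all hypotheses when none contain it is an assumption.
    """
    kept = [h for h in hyps if subword in h[0]]
    return prioritize(kept or hyps, subword)


SUBWORD = "サンヨウムセン"
paths = [  # likelihood values are hypothetical
    (["最近", "では", "玄関", "で", "待ち合わせ"], -10.0),        # path 1
    (["昨日", "の", "会議", "は", "世界", "中"], -11.0),          # path 2
    (["最近", "では", "単価", "高い", SUBWORD], -12.5),           # path 3
    (["最近", "では", "電気メーカ", "の", SUBWORD], -12.0),       # path 4
]
ranked = prioritize(paths, SUBWORD)
pruned = prune(paths, SUBWORD)
print([score for _, score in ranked], len(pruned))
```

With these values, paths 4 and 3 are ranked ahead of paths 1 and 2 even though their likelihoods are lower, and pruning keeps only paths 4 and 3.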

As described above, in the client device 110k, the correction unit 270 raises the priority of hypotheses containing the subword character string described in the subword section designated by the subword section designating unit 242, holds them during the recognition search process, and performs correction by selecting the final recognition result from those hypotheses. This ensures that recognition processing always uses the subword character string.

FIG. 1 is a system configuration diagram of a communication system including the client device 110 (including 110a to 110k), which is the speech recognition result correction apparatus of this embodiment.
FIG. 2 is a block diagram showing the functions of the client device 110.
FIG. 3 is a hardware configuration diagram of the client device 110.
FIG. 4 is a conceptual diagram showing the various kinds of information contained in a speech recognition result.
FIG. 5(a) is a conceptual diagram for the case where the context before and after an error section is designated, and FIG. 5(b) is a conceptual diagram of recognition processing based on a constraint condition.
FIG. 6 is a flowchart showing the operation of the client device 110.
FIG. 7 is a flowchart showing the details of the correction processing, including designation of an error section.
FIG. 8 is a block diagram showing the functions of the client device 110a, which accepts an error section through user input.
FIG. 9 is a flowchart showing the processing of the client device 110a.
FIG. 10 is a flowchart showing the detailed processing when an error section is designated by user input in the client device 110a.
FIG. 11 is a block diagram showing the functions of the client device 110b.
FIG. 12 is a flowchart showing the processing of the client device 110b.
FIG. 13 is a flowchart showing the detailed processing when an error section is designated in the client device 110b.
FIG. 14 is a block diagram showing the functions of the client device 110c.
FIG. 15 is a flowchart showing the processing of the client device 110c.
FIG. 16 is a block diagram showing the functions of the client device 110d.
FIG. 17 is a flowchart showing the processing of the client device 110d.
FIG. 18 is a block diagram showing the functions of the client device 110f.
FIG. 19 is a flowchart showing the processing of the client device 110f.
FIG. 20 is a block diagram showing the functions of the client device 110g.
FIG. 21 is a flowchart showing the processing of the client device 110g.
FIG. 22 is a block diagram showing the functions of the client device 110h.
FIG. 23 is a block diagram showing the functions of the client device 110i.
FIG. 24 is a conceptual diagram showing how a designated portion is corrected using word information as a constraint condition.
FIG. 25 is a block diagram showing a modification of the client device 110.
FIG. 26 is a block diagram showing the functions of the client device 110k.
FIG. 27 is a flowchart showing the operation of the client device 110k.
FIG. 28 is an explanatory diagram showing the correspondence between utterance content, recognition result, and divided sections.
FIG. 29 is a conceptual diagram of the search process in speech recognition.

Explanation of symbols

110, 110a, 110b, 110c, 110d, 110f, 110g, 110h ... client device, 120 ... server device, 210 ... feature amount calculation unit, 220 ... feature amount compression unit, 225 ... transmission unit, 226 ... first recognition unit, 227 ... language model holding unit, 228 ... dictionary holding unit, 229 ... acoustic model holding unit, 230 ... feature amount storage unit, 235 ... receiving unit, 236 ... operation unit, 237 ... result storage unit, 238 ... user input detection unit, 239 ... time information calculation unit, 240, 240a, 240b, 240c ... error section designating unit, 241 ... end point determination unit, 242 ... subword section designating unit, 243 ... dividing unit, 250, 250a ... unit for designating the context before and after the error section, 251 ... word information analysis unit, 260 ... error section feature amount extraction unit, 265 ... dictionary adding unit, 270, 270a, 270b ... correction unit, 280 ... integration unit, 281 ... acoustic model holding unit, 282 ... language model holding unit, 283 ... dictionary holding unit, 284 ... language DB holding unit, 285 ... constraint condition storage unit, 290 ... display unit.

Claims

1. A speech recognition result correction apparatus comprising:
input means for inputting voice;
calculating means for calculating feature amount data based on the voice input by the input means;
storage means for storing the feature amount data calculated by the calculating means;
acquisition means for acquiring a recognition result for the voice input by the input means;
designating means for designating, in the recognition result acquired by the acquisition means, an error section in which a recognition error has occurred; and
correction means for correcting the recognition result acquired by the acquisition means by extracting, from the feature amount data stored in the storage means, the feature amount data corresponding to the error section designated by the designating means, and performing re-recognition using the extracted feature amount data.

2. The speech recognition result correction apparatus according to claim 1, wherein the acquisition means comprises:
transmitting means for transmitting the voice input by the input means to a speech recognition device; and
receiving means for receiving a recognition result recognized by the speech recognition device,
and wherein the designating means designates an error section in which a recognition error has occurred in the recognition result received by the receiving means.

3. The speech recognition result correction apparatus according to claim 1, wherein the designating means designates the error section by accepting a user operation.

4. The speech recognition result correction apparatus according to any one of claims 1 to 3, wherein the designating means determines an error section based on the reliability attached to the recognition result, and designates the determined error section.

5. The speech recognition result correction apparatus according to any one of claims 1 to 3, wherein the designating means calculates the reliability of the recognition result, determines an error section based on the reliability, and designates the determined error section.

6. The speech recognition result correction apparatus according to claim 1, further comprising specifying means for specifying a recognition result forming at least one word immediately before the error section designated by the designating means, at least one word immediately after it, or both the immediately preceding word and the immediately following word,
wherein the correction means uses the recognition result specified by the specifying means as a constraint condition, extracts from the storage means, according to the constraint condition, the feature amount data corresponding to a section including the error section and the words immediately before and after it, and performs recognition processing on the extracted feature amount data.

7. The speech recognition result correction apparatus according to claim 1, further comprising specifying means for specifying a recognition result forming at least one word immediately before the error section designated by the designating means, at least one word immediately after it, or both the immediately preceding word and the immediately following word,
wherein the correction means uses the recognition result specified by the specifying means as a constraint condition, extracts from the storage means, according to the constraint condition, the feature amount data corresponding to the error section, and performs recognition processing on the extracted feature amount data.

8. The speech recognition result correction apparatus according to claim 1, further comprising word information specifying means for specifying word information of a word in the recognition result, the word information being information for identifying the word and being any of: word information of at least one word immediately before the error section designated by the designating means, word information of at least one word immediately after it, or word information of both the immediately preceding word and the immediately following word,
wherein the correction means uses the word information specified by the word information specifying means as a constraint condition, extracts from the storage means, according to the constraint condition, the feature amount data corresponding to a section including the word immediately before the error section and the word immediately after it, and performs recognition processing on the extracted feature amount data.

9. The speech recognition result correction apparatus according to claim 8, wherein the word information includes one or more of part-of-speech information indicating the part of speech of a word and reading information indicating how the word is read.

10. The speech recognition result correction apparatus according to claim 8 or 9, further comprising unknown word determination means for determining, based on the word information, whether the word of the recognition result forming at least one word immediately before the error section designated by the designating means, at least one word immediately after it, or both the immediately preceding word and the immediately following word is an unknown word,
wherein, when the unknown word determination means determines that the word of the recognition result is an unknown word, the correction means corrects the recognition result based on the word information.

11. The speech recognition result correction apparatus according to claim 1, further comprising connection probability storage means for storing connection probabilities between words,
wherein the correction means, by performing the correction processing, creates a connection probability between a word in the error section and the word before or after it, and uses the created connection probability to update the connection probability stored in the connection probability storage means.

12. The speech recognition result correction apparatus according to claim 6, further comprising constraint condition storage means for storing, as a constraint condition, the word information specified by the word information specifying means or the word specified by the specifying means,
wherein the correction means performs the correction processing according to the constraint condition stored in the constraint condition storage means.

13. The speech recognition result correction apparatus according to any one of claims 1 to 12, further comprising accepting means for accepting character information from a user,
wherein the correction means performs the correction processing on the recognition result in the error section using the character information accepted by the accepting means as a constraint condition.

14. The speech recognition result correction apparatus according to claim 1, further comprising time information calculating means for calculating an elapsed time in the recognition result based on the recognition result received by the receiving means and the feature amount data stored in the storage means,
wherein the designating means designates the error section based on the time information calculated by the time information calculating means.

15. The speech recognition result correction apparatus according to claim 1, further comprising display means for displaying the recognition result corrected by the correction means,
wherein the display means does not display the recognition result acquired by the acquisition means.

16. The speech recognition result correction apparatus according to claim 15, wherein, when the recognition result obtained by re-recognition by the correction means is the same as the recognition result acquired by the acquisition means, or when there is a discrepancy between the time information included in these recognition results, a recognition error is determined and the display means does not display the recognition result.

17. The speech recognition result correction apparatus according to claim 3, wherein the designating means designates the start point of the error section by a user operation, and designates the end point of the error section based on the reliability attached to the recognition result acquired by the acquisition means.

18. The speech recognition result correction apparatus according to claim 3, wherein the designating means designates the start point of the error section by a user operation, and designates as the end point of the error section a point a predetermined number of recognition units away from the start point.

19. The speech recognition result correction apparatus wherein the designating means designates the start point of the error section by a user operation, and designates the end point of the error section based on a predetermined phonetic symbol in the recognition result acquired by the acquisition means.

20. The speech recognition result correction apparatus according to claim 3, wherein the acquisition means acquires a plurality of recognition candidates as the recognition result when acquiring the recognition result,
and wherein the designating means designates the start point of the error section by a user operation and designates the end point based on the number of recognition candidates acquired by the acquisition means.

21. The speech recognition result correction apparatus according to any one of claims 1 to 20, further comprising average value calculating means for calculating an average value, over a section including the error section, of the feature amount data calculated by the calculating means,
wherein the correction means subtracts the average value calculated by the average value calculating means from the extracted feature amount data, and performs the re-recognition processing using the data obtained by the subtraction as the feature amount data.

22. A speech recognition result correction apparatus comprising:
input means for inputting voice;
acquisition means for acquiring a recognition result for the voice input by the input means;
designating means for designating, in the recognition result acquired by the acquisition means, an error section in which a recognition error has occurred;
notification means for requesting an external server to re-recognize the error section by notifying the external server of the error section designated by the designating means; and
receiving means for receiving the recognition result of the error section re-recognized by the external server in response to the request from the notification means.

23. A speech recognition result correction method comprising:
an input step of inputting voice;
a calculation step of calculating feature amount data based on the voice input in the input step;
a storage step of storing the feature amount data calculated in the calculation step;
an acquisition step of acquiring a recognition result for the voice input in the input step;
a designation step of designating, in the recognition result acquired in the acquisition step, an error section in which a recognition error has occurred; and
a correction step of correcting the recognition result acquired in the acquisition step by extracting, from the feature amount data stored in the storage step, the feature amount data corresponding to the error section designated in the designation step, and performing re-recognition using the extracted feature amount data.

24. A speech recognition result correction method comprising:
an input step of inputting voice;
an acquisition step of acquiring a recognition result for the voice input in the input step;
a designation step of designating, in the recognition result acquired in the acquisition step, an error section in which a recognition error has occurred;
a notification step of requesting an external server to re-recognize the error section by notifying the external server of the error section designated in the designation step; and
a receiving step of receiving the recognition result of the error section re-recognized by the external server in response to the request made in the notification step.

25. The speech recognition result correction apparatus according to claim 1, further comprising subword section designating means for designating a subword section in the recognition result acquired by the acquisition means,
wherein the correction means further extracts from the storage means, within the error section designated by the designating means, the feature amount data corresponding to the subword section designated by the subword section designating means, and corrects the recognition result acquired by the acquisition means by performing re-recognition using the extracted feature amount data.

26. The speech recognition result correction apparatus according to claim 25, further comprising dividing means for dividing the recognition result acquired by the acquisition means into a plurality of sections according to the subword section designated by the subword section designating means,
wherein the correction means corrects the recognition result for each of the divided sections produced by the dividing means.

27. The speech recognition result correction apparatus according to claim 26, wherein the dividing means divides the recognition result so that the end point of the subword section is the end point of one divided section and the start point of the subword section is the start point of the divided section that follows it.

28. The speech recognition result correction apparatus according to claim 27, wherein the correction means corrects the recognition result for each of the divided sections produced by the dividing means, and uses the subword section as a constraint condition in the correction of each divided section.

29. The speech recognition result correction apparatus according to claim 25, wherein the correction means holds, during the recognition search process, hypotheses that include the subword character string described in the subword section designated by the subword section designating means, and performs correction by selecting the final recognition result from those hypotheses.

30. The speech recognition result correction apparatus according to any one of claims 25 to 29, further comprising dictionary adding means for adding the subword character string in the subword section designated by the subword section designating means to a dictionary database for recognition processing.

31. The speech recognition result correction apparatus according to claim 25, further comprising a dictionary database generated by a user,
wherein the correction means performs the correction processing using a character string obtained by converting the subword character string according to the dictionary database.

32. A speech recognition result correction system comprising:
the speech recognition result correction apparatus according to any one of claims 1 to 22 or 25 to 31; and
a server device that performs speech recognition based on the voice transmitted from the speech recognition result correction apparatus and transmits the recognition result to the speech recognition result correction apparatus.