JP5701327B2

JP5701327B2 - Speech recognition apparatus, speech recognition method, and program

Info

Publication number: JP5701327B2
Application number: JP2013053290A
Authority: JP
Inventors: 洋平磯部; 鈴木　雄太; 雄太鈴木
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2013-03-15
Filing date: 2013-03-15
Publication date: 2015-04-15
Anticipated expiration: 2033-03-15
Also published as: JP2014178567A

Description

The present invention relates to a speech recognition device and the like that recognize speech.

In conventional speech recognition devices, when an inappropriate recognition result is output because of misrecognition, the user has to redo the speech input or correct the misrecognized portion (see, for example, Patent Document 1).

Patent Document 1: JP 2008-90625 A

However, redoing the speech input or correcting the misrecognized portion increases the burden on the user after speech recognition.

In view of the above problem, an object of the present invention is to reduce the burden on the user after speech recognition.

The speech recognition device of the first invention comprises: speech data receiving means that receives speech data, which is data of speech uttered by a user; speech recognition result information acquisition means that performs speech recognition processing on the speech data received by the speech data receiving means and acquires speech recognition result information including two or more sequences of element candidates, an element candidate being a candidate for an element that is part of the speech recognition result corresponding to the speech data; element candidate display means that displays two or more element candidates included in the speech recognition result information; element candidate selection receiving means that receives, in response to the display of the element candidates by the element candidate display means, a selection of a sequence of element candidates; and output means that outputs output information, which is the sequence of element candidates whose selection was received by the element candidate selection receiving means.

In the speech recognition device of the second invention, relative to the first invention, the element candidate display means displays all or some of the element candidates in the speech recognition result information according to the size of the display area or the amount of information in the speech recognition result information.

In the speech recognition device of the third invention, relative to the first or second invention, the speech recognition result information acquisition means acquires speech recognition result information including likelihood information, which is a likelihood relating to a sequence of element candidates, and the element candidate display means displays the element candidates according to the likelihood information.

In the speech recognition device of the fourth invention, relative to the third invention, the element candidate display means displays the element candidates so that the sequence with the highest likelihood forms a straight line.

The speech recognition device of the fifth invention, relative to any one of the first to fourth inventions, further comprises homophone term acquisition means that acquires one or more homophone terms, each being a term that has the same pronunciation as at least some term included in an element candidate but differs from that term. The element candidate display means also displays, using the one or more homophone terms acquired by the homophone term acquisition means, element candidates in which a term included in an element candidate is replaced with a homophone term.

In the speech recognition device of the sixth invention, relative to any one of the first to fifth inventions, the element candidate selection receiving means receives a selection of a sequence of element candidates that follows the order of element candidates specified by the user.

In the speech recognition device of the seventh invention, relative to any one of the first to fifth inventions, the element candidate selection receiving means receives a selection of any one of the sequences of element candidates included in the speech recognition result information.

According to the speech recognition device and the like of the present invention, the burden on the user after speech recognition can be reduced.

FIG. 1 is a block diagram of the speech recognition device in Embodiment 1.
FIG. 2 is a flowchart showing the operation of the speech recognition device in the embodiment.
FIG. 3 shows an example of the homophone terms stored in the homophone term storage means in the embodiment.
FIG. 4 shows an example in which the element candidate display means in the embodiment displays the sequences of element candidates included in the speech recognition result information.
FIG. 5 shows an example in which the element candidate display means in the embodiment displays homophone terms.
FIG. 6 shows an example of the display of information output by the output means in the embodiment.
FIG. 7 shows an example of a display change by the element candidate display means in the embodiment.
FIG. 8 shows an example of a selection received by the element candidate selection receiving means in the embodiment.
FIG. 9 shows an example in which the element candidate display means in the embodiment displays the sequences of element candidates included in the speech recognition result information.
FIG. 10 shows an example in which the element candidate display means in the embodiment displays the sequences of element candidates included in the speech recognition result information.
FIG. 11 shows an example of a selection received by the element candidate selection receiving means in the embodiment.
FIG. 12 shows an example of the configuration of a computer system in the embodiment.

Hereinafter, embodiments of the speech recognition device are described with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated description of them may be omitted.

(Embodiment 1)
This embodiment describes a speech recognition device 1 that displays the sequences of element candidates obtained by speech recognition and lets the user select a sequence of element candidates from the displayed element candidates.

FIG. 1 is a block diagram of the speech recognition device 1 in this embodiment. The speech recognition device 1 includes speech data receiving means 101, speech recognition result information acquisition means 102, homophone term storage means 103, homophone term acquisition means 104, element candidate display means 105, display change receiving means 106, element candidate selection receiving means 107, output means 108, a microphone 1001, and a touch panel 1002. The microphone 1001 obtains speech data, that is, data of the speech, from the speech uttered by the user. The touch panel 1002 displays the information output by the element candidate display means 105 and the output means 108, and also obtains information indicating operations performed by the user. The microphone 1001 and the touch panel 1002 are known technology, so detailed descriptions of them are omitted.

The speech data receiving means 101 receives speech data. This speech data is data of speech uttered by the user. The speech data receiving means 101 may receive speech data produced when the microphone 1001 picks up the uttered speech and converts it into an audio signal, or may receive speech data obtained by something other than the microphone 1001. For example, it may receive speech data transmitted over a wired or wireless communication line, or speech data read from a given recording medium such as an optical disk, a magnetic disk, or a semiconductor memory. The speech may be the speech of one or more words, of one or more phrases (bunsetsu), or of one or more sentences. The speech data receiving means 101 may or may not include a device for receiving, such as an interface card, a modem, or a network card. It may be implemented in hardware, or in software such as a driver that drives a given device.

The speech recognition result information acquisition means 102 acquires speech recognition result information including two or more sequences of element candidates, an element candidate being a candidate for an element that is part of the speech recognition result corresponding to the speech data. Speech recognition is processing that obtains the content of the utterance represented by the speech data as text. This text is a set of elements. In general, speech recognition detects utterance sections, which are the sections of the speech data that contain human speech, and then matches the speech data in each utterance section against an acoustic model and a vocabulary dictionary to extract text. The speech recognition processing performed by the speech recognition result information acquisition means 102 may be any processing; because speech recognition is known technology, its details are omitted. The speech recognition result information is the information obtained by performing speech recognition processing on the speech data received by the speech data receiving means 101.

The unit of an element may be a word, a morpheme, a phrase (bunsetsu), or a mixture of these, and may be determined by the speech recognition algorithm. For example, when the user utters "きしゅうのかきをかいたい" (kishū no kaki o kaitai, "I want to buy kaki from Kishu"), the element candidates may be "紀州の", "柿を", "買い", and "たい", or they may be "紀州", "の", "柿", "を", "買い", and "たい". The following description mainly assumes the former unit. The element candidates are obtained as a result of speech recognition on the speech data received by the speech data receiving means 101. They are called "candidates" because this embodiment presupposes that speech recognition yields two or more sequences of element candidates.

As described above, the speech recognition result information is information that includes two or more sequences of element candidates, each obtained by converting the speech data into text element by element through speech recognition processing. For example, when the user utters "きしゅうのかきをかいたい" (kishū no kaki o kaitai), the speech recognition result information may contain sequences such as "紀州の柿を買いたい" ("I want to buy persimmons from Kishu"), "紀州の牡蠣を買いたい" ("I want to buy oysters from Kishu"), and "紀州の花器を買いたい" ("I want to buy a flower vase from Kishu"). Here the speech recognition processing could not determine which "kaki" among "柿", "牡蠣", and "花器" matches what the user said, so the speech recognition result information contains three or more sequences of element candidates. The speech recognition result information may hold the sequences of element candidates as character strings, as above, or as a graph structure in which element candidates are nodes and the connections between element candidates are edges.
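The two representations mentioned above, plain sequences and a graph of nodes and edges, can be illustrated with a minimal Python sketch. The function name `build_candidate_graph` and the `(position, element)` node key are assumptions made for illustration and do not come from the patent. Keying nodes by position also assumes that every sequence segments the utterance the same way, which a real recognizer does not guarantee.

```python
from collections import defaultdict


def build_candidate_graph(sequences):
    """Merge element-candidate sequences into a graph.

    Nodes are (position, element) pairs; an edge links each element
    candidate to the one that follows it in some sequence.
    """
    nodes = set()
    edges = defaultdict(set)
    for seq in sequences:
        for i, element in enumerate(seq):
            nodes.add((i, element))
            if i + 1 < len(seq):
                edges[(i, element)].add((i + 1, seq[i + 1]))
    return nodes, dict(edges)


# The three sequences from the "kishū no kaki o kaitai" example.
sequences = [
    ["紀州の", "柿を", "買い", "たい"],
    ["紀州の", "牡蠣を", "買い", "たい"],
    ["紀州の", "花器を", "買い", "たい"],
]
nodes, edges = build_candidate_graph(sequences)
```

In this sketch the shared elements "紀州の", "買い", and "たい" each become a single node, and only the ambiguous second element branches into three nodes.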

The speech recognition result information acquisition means 102 may acquire speech recognition result information that includes likelihood information. Likelihood information indicates the likelihood relating to a sequence of element candidates, where likelihood is a value expressing how plausible something is. Likelihood information is a value calculated during speech recognition processing. It may be given per sequence of element candidates or per part of a sequence. A part of a sequence may be, for example, a single element candidate, two consecutive element candidates, or three or more consecutive element candidates. The sequences of element candidates in the speech recognition result information may be those whose likelihood exceeds a predetermined threshold in the speech recognition processing, or those whose likelihood ranks within the top N among the sequences obtained by the speech recognition processing, where N is a natural number of 2 or more. The speech recognition result information acquisition means 102 can usually be implemented with an MPU, memory, and the like. Its processing procedure is usually implemented in software recorded on a recording medium such as a ROM, but it may also be implemented in hardware (a dedicated circuit).
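The two ways of limiting the sequences, by likelihood threshold and by top-N rank, might look like the following sketch. The likelihood values, the threshold, and the function names are hypothetical; the patent does not specify how likelihoods are scaled.

```python
def filter_by_threshold(scored, threshold):
    """Keep sequences whose likelihood is higher than the threshold."""
    return [seq for seq, likelihood in scored if likelihood > threshold]


def top_n(scored, n):
    """Keep the n most likely sequences (the text requires n >= 2)."""
    if n < 2:
        raise ValueError("n must be a natural number of 2 or more")
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [seq for seq, _ in ranked[:n]]


# Hypothetical (sequence, likelihood) pairs for illustration only.
scored = [
    (["紀州の", "花器を", "買い", "たい"], 0.1),
    (["紀州の", "柿を", "買い", "たい"], 0.6),
    (["紀州の", "牡蠣を", "買い", "たい"], 0.3),
]
above = filter_by_threshold(scored, 0.2)
best_two = top_n(scored, 2)
```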

The homophone term storage means 103 stores homophone terms. A homophone term is a term that has the same pronunciation as a given term but differs from it. "Same pronunciation" means that the terms are pronounced alike, as with "柿" (kaki, persimmon) and "牡蠣" (kaki, oyster). Pronunciation here may or may not take intonation into account. Homophone terms may include homonyms with different meanings and homophonous synonyms. A homophonous synonym is a term with the same pronunciation and the same meaning, for example "十分" and "充分" (both jūbun, "sufficient"), which share sound and meaning and differ only in how they are written. Homophone terms may be represented in any way that expresses that two different terms sound the same. For example, they may be represented as one-to-one pairs such as "柿, 牡蠣" and "柿, 花器", or as groups of several homophones such as "柿, 牡蠣, 花器, ...". The homophone term storage means 103 may store homophone terms consisting only of words, or homophone terms that include symbols. For example, symbols such as "☆" and "★" may be assigned the reading "ほし" (hoshi) and associated with "星", "ほし", and the like.
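As a rough illustration of the grouped representation, the sketch below builds a lookup table from groups of homophones. The name `index_groups` and the dictionary-of-sets layout are assumptions for illustration; the patent leaves the storage format open. One-to-one pairs can be handled by passing each pair as a two-element group.

```python
def index_groups(groups):
    """Map each term to the set of other terms in its homophone group."""
    index = {}
    for group in groups:
        for term in group:
            others = {t for t in group if t != term}
            index.setdefault(term, set()).update(others)
    return index


# Groups taken from the examples in the text, including symbols read "hoshi".
groups = [
    ["柿", "牡蠣", "花器"],
    ["十分", "充分"],
    ["☆", "★", "星", "ほし"],
]
homophone_index = index_groups(groups)

# The same function accepts one-to-one pairs such as "柿, 牡蠣" and "柿, 花器".
pair_index = index_groups([["柿", "牡蠣"], ["柿", "花器"]])
```

Note that with pairs alone, "牡蠣" maps only to "柿" in this sketch; relating "牡蠣" to "花器" would need an extra merging step.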

The homophone term storage means 103 is preferably a non-volatile recording medium, but it can also be implemented with a volatile recording medium. How homophone terms come to be stored in the homophone term storage means 103 does not matter. For example, they may be stored via a recording medium, they may be received over a communication line or the like and then stored, or they may be entered through an input device and then stored.

The homophone term acquisition means 104 acquires homophone terms for at least some of the terms included in the element candidates. The element candidates in question are mainly those included in the speech recognition result information acquired by the speech recognition result information acquisition means 102. The homophone term acquisition means 104 may acquire homophone terms from the homophone term storage means 103 or from elsewhere. When acquiring them from elsewhere, it may obtain them over a network (not shown) from an external device that can look up homophone terms. Such an external device may be, for example, a device that publishes a conversion dictionary used by an IME (Input Method Editor) or the like, or a device with storage means holding the same kind of content as the homophone term storage means 103.

"At least some of the terms included in the element candidates" may be words or morphemes. The homophone term acquisition means 104 obtains these terms by morphological analysis or the like, then uses each term as a search key against the homophone term storage means 103 to acquire homophone terms. It may acquire homophone terms only for independent words (content words); for example, it may use nouns, pronouns, verbs, adjectives, adjectival nouns, and the like as search keys. Any method may be used to extract words or morphemes from a character string. Morphological analysis is possible with known technology, so a detailed description is omitted. If the speech recognition result information already contains information identifying such terms, for example the result of morphological analysis of each element candidate, the terms it identifies may be used as search keys against the homophone term storage means 103. The homophone term acquisition means 104 can usually be implemented with an MPU, memory, and the like. Its processing procedure is usually implemented in software recorded on a recording medium such as a ROM, but it may also be implemented in hardware (a dedicated circuit).
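The lookup described above, using only content words as search keys and producing replaced element candidates, could be sketched as follows. To stay self-contained, the sketch assumes the morphological analysis has already been done and each element candidate arrives as a list of `(surface, part_of_speech)` pairs. The part-of-speech labels, the `CONTENT_POS` set, and the function name are all hypothetical.

```python
# Hypothetical part-of-speech labels for the independent words used as search keys.
CONTENT_POS = {"noun", "pronoun", "verb", "adjective", "adjectival noun"}


def homophone_variants(element, homophones):
    """Return element candidates with one content word replaced by a homophone.

    element: list of (surface, part_of_speech) morphemes for one element candidate.
    homophones: dict mapping a term to the set of its homophone terms.
    """
    variants = []
    for i, (surface, pos) in enumerate(element):
        if pos not in CONTENT_POS:
            continue  # skip particles and other dependent words
        for alternative in sorted(homophones.get(surface, ())):
            replaced = element[:i] + [(alternative, pos)] + element[i + 1:]
            variants.append("".join(s for s, _ in replaced))
    return variants


homophones = {"柿": {"牡蠣", "花器"}}
element = [("柿", "noun"), ("を", "particle")]
variants = homophone_variants(element, homophones)
```

Here the particle "を" is never used as a search key, so only "柿" is replaced.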

The element candidate display means 105 displays two or more element candidates included in the speech recognition result information. It may display them so that the two or more sequences of element candidates can be seen, or it may display them grouped by the two or more element candidates that are recognition results for the same portion of the speech data.

When displaying so that the two or more sequences can be seen, the element candidate display means 105 may lay out the sequences vertically or horizontally, and may display either part of each sequence or all of it.

When displaying grouped by the element candidates that are recognition results for the same portion of the speech data, the element candidate display means 105 may line up those element candidates vertically or horizontally. In this case, each time the element candidate selection receiving means 107 receives a selection of an element candidate, the element candidate display means 105 may display the element candidate that follows the selected one in the sequence containing it. If the selected element candidate belongs to several sequences, the element candidate display means 105 may display the element candidate that follows the selected one in each of those sequences, or may display the element candidate that follows the selected one together with the element candidates that are recognition results for the same portion of the speech data as that following candidate. When displaying the element candidate that follows the selected one, it may display that candidate so that it is easy to select, for example at the top or at the center of the display area.

Among the element candidates in the speech recognition result information, an element candidate that appears in several sequences as the recognition result for the same portion of the speech data may be displayed only once or may be displayed repeatedly. For example, when the speech data represents "きしゅうのかき" (kishū no kaki) and the speech recognition result information includes "紀州の柿" and "紀州の牡蠣", the element candidate display means 105 need not display one of the two instances of "紀州の", the element shared by both sequences and obtained from the "きしゅうの" part of the speech data. That is, "紀州の", "柿", and "牡蠣" may each appear once in a single display. Recognition results for "the same portion of the speech data" may also be results whose portions of the speech data only partly coincide. For example, when the speech data is "〜は、かわらない" (... wa kawaranai) and the two sequences ["〜は", "瓦", "無い"] and ["〜は", "変わらない"] are obtained, "瓦" and "変わらない" may be treated as recognition results for the same portion of the speech data. The following description mainly covers the case of displaying so that the two or more sequences of element candidates can be seen.
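Two behaviors from this passage, showing a shared element candidate only once and offering the next element candidates after a selection, can be sketched in Python as below. Both function names are hypothetical, and the sketch assumes that element candidates at the same list position correspond to the same portion of the speech data, which does not hold for cases like "瓦" / "無い" versus "変わらない".

```python
def columns_without_duplicates(sequences):
    """Group element candidates by position, listing each one once per position."""
    columns = []
    for seq in sequences:
        for i, element in enumerate(seq):
            if i == len(columns):
                columns.append([])
            if element not in columns[i]:
                columns[i].append(element)
    return columns


def next_candidates(sequences, selected):
    """Return the element candidates that follow the selections made so far.

    selected: list of element candidates already chosen, in order.
    """
    n = len(selected)
    following = []
    for seq in sequences:
        if seq[:n] == selected and len(seq) > n and seq[n] not in following:
            following.append(seq[n])
    return following


sequences = [["紀州の", "柿"], ["紀州の", "牡蠣"]]
columns = columns_without_duplicates(sequences)
after_kishu = next_candidates(sequences, ["紀州の"])
```

With these inputs, "紀州の" appears once in the first column, and selecting it offers "柿" and "牡蠣" as the next choices.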

The element candidate display means 105 may display all of the element candidates included in the speech recognition result information, or only some of them. For example, FIG. 4 shows all of the element candidates in the speech recognition result information being displayed when the user utters "きしゅうのかきをかいたい" (kishū no kaki o kaitai). The left side of FIG. 7 shows the element candidates in the first half of the same speech recognition result information being displayed. The element candidate display means 105 may display all or some of the element candidates according to the size of the display area or the amount of information in the speech recognition result information.

The element candidate display means 105 may determine, from the size of the display area or the amount of information in the speech recognition result information, whether all of the element candidates can be displayed, and then display all of them if they can be and some of them if they cannot. Alternatively, it may simply draw each element candidate at a predetermined size in the display area, so that all of the element candidates are displayed if they fit and only some are displayed if they do not. In other words, displaying all or some of the element candidates may be the outcome rather than an explicit decision. When determining whether all of the element candidates can be displayed, the element candidate display means 105 may check whether the element candidates, laid out at a predetermined size, fit within the display area, or it may calculate how many element candidates are placed vertically and horizontally and check whether that number fits within the display area. The element candidate display means 105 may also change the character size of the element candidates so that all of them can be displayed. The size of the display area may be, for example, the size of the screen or the size of the working window.
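The fit check based on counting how many element candidates can be placed horizontally and vertically might be sketched like this. The fixed cell size, the pixel values, and the name `layout` are assumptions for illustration; the patent does not give concrete dimensions.

```python
def layout(columns, cell_w, cell_h, area_w, area_h):
    """Return the element candidates that fit and whether that is all of them.

    columns: element candidates grouped by position (one list per column).
    cell_w, cell_h: predetermined size of one element candidate.
    area_w, area_h: size of the display area (screen or working window).
    """
    max_cols = area_w // cell_w  # how many candidates fit horizontally
    max_rows = area_h // cell_h  # how many candidates fit vertically
    shown = [column[:max_rows] for column in columns[:max_cols]]
    return shown, shown == columns


columns = [["紀州の"], ["柿を", "牡蠣を", "花器を"], ["買い"], ["たい"]]

# Hypothetical sizes in pixels: a wide area shows everything, a narrow one only the first half.
wide, wide_shows_all = layout(columns, 100, 40, 480, 320)
narrow, narrow_shows_all = layout(columns, 100, 40, 240, 320)
```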

When the speech recognition result information includes likelihood information, the element candidate display means 105 may display the element candidates according to the likelihood information. In this case, the element candidate display means 105 may, for example, display sequences of element candidates with high likelihood so that they are easy to select, or may highlight them. To make a sequence easy to select, the element candidate display means 105 may, for example, display the sequence with the highest likelihood in a straight line, or display sequences so that the higher the likelihood, the straighter the line. To highlight, the element candidate display means 105 may, for example, enclose a high-likelihood element candidate or sequence in a double line, or display it in a different color; it may also vary the display style in multiple levels according to the likelihood value. When displaying a sequence in a straight line, the element candidate display means 105 may place the sequence with the highest likelihood at the center of the display area. FIG. 4 shows an example in which 「紀州の柿を買いたい」 ("I want to buy Kishū persimmons") has the highest likelihood among the sequences of element candidates included in the speech recognition result information; in FIG. 4, the element candidate display means 105 displays the element candidates so that 「紀州の柿を買いたい」 forms a straight line.
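One possible way to place the most likely sequence at the center of the display area is sketched below in Python. The center-out ordering (alternating the remaining sequences above and below the center row) is an assumption chosen for illustration; the embodiment only requires that the most likely sequence be displayed in a straight line, for example at the center. The function name is hypothetical, and the sequences and likelihoods are those of the specific example in this description.

```python
def center_out_rows(sequences):
    """Order (sequence, likelihood) pairs so the most likely one is on the middle row."""
    ranked = sorted(sequences, key=lambda s: s[1], reverse=True)
    rows = []
    for rank, seq in enumerate(ranked):
        if rank % 2 == 0:
            rows.append(seq)     # ranks 0, 2, 4, ... fill the center and below
        else:
            rows.insert(0, seq)  # ranks 1, 3, ... fill the rows above
    return rows


nbest = [
    (["紀州の", "柿を", "買い", "たい"], 0.88),
    (["紀州の", "牡蠣を", "買い", "たい"], 0.72),
    (["紀州の", "花器を", "買い", "たい"], 0.68),
    (["紀州の", "牡蠣を", "解体"], 0.55),
    (["紀州の", "花器を", "解体"], 0.52),
]
rows = center_out_rows(nbest)
center = rows[len(rows) // 2]
```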

As shown in FIG. 5, the element candidate display means 105 may also display element candidates in which a term has been replaced with one of the one or more homophone terms acquired by the homophone term acquisition means 104. The element candidate display means 105 may display a homophone element candidate in association with the element candidate that was used to acquire the homophone term, or without such an association. When displaying them in association, the element candidate display means 105 may display the homophone terms acquired from at least some of the terms in an element candidate so that their connection to that element candidate is apparent. For example, as with 「紀州の」 in FIG. 5, it may display an element candidate in which part of the terms is replaced with a homophone term and the unreplaced part is omitted; or, as with 「解体」, it may display an element candidate that is replaced entirely with a homophone term. Even in the former case, the omitted part 「の」 is obvious, so the display of 「奇襲」 can be regarded as the display of an element candidate. The element candidate display means 105 may also keep homophone terms hidden, as with 「柿」 and others in FIG. 5. In this case, the element candidate display means 105 may place an interface on the screen for switching homophone terms between shown and hidden. The interface may be, for example, a button. The element candidate display means 105 may switch homophone terms between shown and hidden when the display change accepting means 106 accepts a press of that button. In FIG. 5, the button for showing homophone terms is the "▽" button below an element candidate whose homophone terms are not shown, and the button for hiding homophone terms is the "△" button below an element candidate whose homophone terms are shown. When showing homophone terms, the element candidate display means 105 may or may not display a homophone term that is identical to a term included in another element candidate. If such duplicates are not displayed, the homophone term acquisition means 104 may simply not acquire duplicate homophone terms in the first place. As shown in FIG. 9, the element candidate display means 105 may also display the relationships between the element candidates in the sequences included in the speech recognition result information. As shown in FIG. 10, it may explicitly display the sequences of element candidates included in the speech recognition result information acquired by the speech recognition result information acquisition means 102. The element candidate display means 105 may also display the likelihood information.
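The handling of homophone terms that duplicate terms already displayed as other element candidates could be sketched as follows. This is a minimal illustration under assumed names (`homophones_to_show`, `hide_duplicates`); the embodiment does not prescribe a particular procedure.

```python
def homophones_to_show(term, homophones, displayed_terms, hide_duplicates=True):
    """Return the homophone terms to list under `term`, in table order."""
    result = []
    for h in homophones:
        if h == term:
            continue  # the term itself is already on screen
        if hide_duplicates and h in displayed_terms:
            continue  # already shown as part of another element candidate
        if h not in result:
            result.append(h)  # drop repeated table entries
    return result


registered = ["かき", "カキ", "柿", "火器", "牡蠣", "下記", "火器"]
on_screen = {"柿", "牡蠣", "花器"}
filtered = homophones_to_show("柿", registered, on_screen)
unfiltered = homophones_to_show("柿", registered, on_screen, hide_duplicates=False)
```

Setting `hide_duplicates` to `False` corresponds to the alternative in which identical homophone terms are displayed anyway.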

The element candidate display means 105 may change the displayed information according to the change instruction information accepted by the display change accepting means 106, which is described below. The element candidate display means 105 may or may not be regarded as including a display device. It can be implemented by driver software for a display device, or by such driver software together with the display device.

The display change accepting means 106 accepts change instruction information, which is information instructing a change to the display produced by the element candidate display means 105. For example, the change instruction information may indicate an instruction to enlarge the element candidates, to shrink them, to display all element candidates, to display some element candidates, to show homophone terms, to hide homophone terms, or to scroll the screen. Information indicating an instruction to scroll the screen may include a numerical value indicating the scroll amount. The display change accepting means 106 accepts change instruction information from the touch panel 1002, but may also accept it from other sources, for example a numeric keypad, a keyboard, a mouse, or a menu screen. The display change accepting means 106 can be implemented by a device driver for an input means such as the touch panel 1002, a numeric keypad, or a keyboard, or by control software for a menu screen.
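The kinds of change instruction information listed above could be represented, for example, as in the following Python sketch. All class names, field names, and the zoom factor are assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class ChangeKind(Enum):
    ENLARGE = auto()
    SHRINK = auto()
    SHOW_ALL = auto()
    SHOW_SOME = auto()
    SHOW_HOMOPHONES = auto()
    HIDE_HOMOPHONES = auto()
    SCROLL = auto()


@dataclass
class ChangeInstruction:
    kind: ChangeKind
    target: Optional[str] = None  # element candidate whose homophones are toggled
    amount: int = 0               # scroll amount, used only for SCROLL


@dataclass
class ViewState:
    scale: float = 1.0
    offset_x: int = 0
    show_all: bool = True
    expanded: set = field(default_factory=set)


def apply_change(state, instr):
    """Update the view state according to one change instruction."""
    if instr.kind is ChangeKind.ENLARGE:
        state.scale *= 1.25  # assumed zoom step
    elif instr.kind is ChangeKind.SHRINK:
        state.scale /= 1.25
    elif instr.kind is ChangeKind.SHOW_ALL:
        state.show_all = True
    elif instr.kind is ChangeKind.SHOW_SOME:
        state.show_all = False
    elif instr.kind is ChangeKind.SHOW_HOMOPHONES:
        state.expanded.add(instr.target)
    elif instr.kind is ChangeKind.HIDE_HOMOPHONES:
        state.expanded.discard(instr.target)
    elif instr.kind is ChangeKind.SCROLL:
        state.offset_x += instr.amount
    return state
```

For instance, pressing "▽" below 「紀州の」 would correspond to `ChangeInstruction(ChangeKind.SHOW_HOMOPHONES, target="紀州の")`, and a right-to-left flick to a `SCROLL` instruction with a negative `amount` under the sign convention assumed here.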

The element candidate selection accepting means 107 accepts a selection of a sequence of element candidates made by the user against the element candidates displayed by the element candidate display means 105. The selected sequence may be a sequence that follows the order in which the user designated the element candidates, or one of the sequences of element candidates included in the speech recognition result information. The user who selects the sequence may be the same person as the user who input the voice data to the voice data receiving means 101, or a different person. The element candidate selection accepting means 107 may also accept a selection of a sequence that includes element candidates that are homophone terms.

When accepting a selection of "a sequence of element candidates that follows the order designated by the user," the element candidate selection accepting means 107 accepts the element candidates in the order in which the user selected them. In this case, as shown in FIG. 8, for example, the user can select element candidates in the order in which they should be output. In the case of FIG. 8, the element candidate selection accepting means 107 accepts a selection of the sequence 「柿を」「買い」「たい」「紀州の」. When accepting a selection of "one of the sequences of element candidates included in the speech recognition result information," the element candidate display means 105 may, as shown in FIG. 11, display the candidates so that the user can choose from several predetermined sequences. In the case of FIG. 11, the element candidate selection accepting means 107 accepts a selection of the sequence 「紀州の」「柿を」「買い」「たい」. The element candidates and sequences accepted by the element candidate selection accepting means 107 may be the information itself, or information identifying the element candidates and sequences, that is, identifiers of the element candidates and sequences included in the speech recognition result information acquired by the speech recognition result information acquisition means 102.

The element candidate selection accepting means 107 accepts the selection of a sequence of element candidates from the touch panel 1002, but may also accept it from other sources, for example a numeric keypad, a keyboard, or a mouse. The element candidate selection accepting means 107 can be implemented by a device driver for an input means such as the touch panel 1002, a numeric keypad, or a keyboard.

The output means 108 outputs output information, which is the sequence of element candidates whose selection was accepted by the element candidate selection accepting means 107. The output information may be a single character string in which the element candidates are concatenated without delimiters. When the element candidate selection accepting means 107 has accepted a selection of "a sequence of element candidates that follows the order designated by the user," the output means 108 may compose the output information in that order. When it has accepted a selection of "one of the sequences of element candidates included in the speech recognition result information," the output means 108 may compose the output information in the order of the selected sequence. The output means 108 may output to another component or to another apparatus, for example an apparatus that accepts search keywords or an application for composing text. The output means 108 may or may not include an output device (for example, a display device or a printer). The output means 108 may be implemented by hardware, or by software such as a driver that drives such a device.
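The composition of the output information can be illustrated by the following minimal Python sketch. It assumes that a selection arrives either as the element candidates themselves, in the order designated by the user, or as an integer identifying one of the recognized sequences; the function name and the use of a list index as the identifier are assumptions for illustration.

```python
def compose_output(selection, recognized=None):
    """Concatenate the selected element candidates without delimiters.

    `selection` is either a list of element candidates in the user's order,
    or an int identifying one sequence in `recognized`.
    """
    if isinstance(selection, int):
        selection = recognized[selection]
    return "".join(selection)


recognized = [
    ["紀州の", "柿を", "買い", "たい"],
    ["紀州の", "牡蠣を", "買い", "たい"],
]
by_user_order = compose_output(["柿を", "買い", "たい", "紀州の"])
by_identifier = compose_output(0, recognized)
```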

FIG. 2 is a flowchart showing an example of the operation of the speech recognition apparatus 1 in the present embodiment. The operation is described below with reference to FIG. 2.

(Step S201) The voice data receiving means 101 determines whether voice data has been accepted. If so, the process proceeds to step S202; if not, step S201 is repeated until voice data is accepted.

(Step S202) The speech recognition result information acquisition means 102 acquires speech recognition result information, which is the result of performing speech recognition on the voice data accepted in step S201.

(Step S203) The homophone term acquisition means 104 acquires homophone terms for the terms contained in the element candidates included in the speech recognition result information acquired in step S202.

(Step S204) The element candidate display means 105 determines whether all of the element candidates included in the speech recognition result information acquired in step S202 can be displayed in the display area. If they cannot, the process proceeds to step S205; if they can, the process proceeds to step S206.

(Step S205) The element candidate display means 105 displays some of the element candidates included in the speech recognition result information acquired in step S202.

(Step S206) The element candidate display means 105 displays all of the element candidates included in the speech recognition result information acquired in step S202.

(Step S207) The display change accepting means 106 determines whether a change to the display has been accepted. If so, the process proceeds to step S208; if not, the process proceeds to step S209.

(Step S208) The element candidate display means 105 changes the display according to the change instruction information accepted by the display change accepting means 106. The process then returns to step S207.

(Step S209) The element candidate selection accepting means 107 determines whether a selection of a sequence of element candidates has been accepted. If so, the process proceeds to step S210; if not, the process returns to step S207.

(Step S210) The output means 108 outputs output information, which is the sequence of element candidates whose selection was accepted in step S209. The process then returns to step S201.
In the flowchart of FIG. 2, the process ends when the power is turned off or a process-termination interrupt occurs.
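The control flow of steps S202 to S210 can be summarized by the following Python sketch, which processes one utterance that has already been received in step S201. The recognizer, the homophone lookup, and the fit check are passed in as stand-in callables, and user operations are supplied as a list of events; all of these names and the event format are assumptions for illustration and are not part of the embodiment.

```python
def run_once(audio, recognize, get_homophones, can_show_all, events):
    """Run steps S202 to S210 once for an utterance already received (S201)."""
    result = recognize(audio)                                    # S202
    state = {
        "homophones": get_homophones(result),                    # S203
        "view": "all" if can_show_all(result) else "partial",    # S204 -> S206 or S205
        "changes": [],
    }
    for kind, payload in events:
        if kind == "change":                                     # S207 -> S208
            state["changes"].append(payload)
        elif kind == "select":                                   # S209 -> S210
            return "".join(payload), state
    return None, state                                           # no selection yet


# Usage with stand-in callables and a scripted sequence of user operations.
nbest = [(["紀州の", "柿を", "買い", "たい"], 0.88)]
events = [
    ("change", "show homophones of 紀州の"),
    ("select", ["紀州の", "柿を", "買い", "たい"]),
]
output, state = run_once(b"", lambda a: nbest, lambda r: {}, lambda r: True, events)
```

In an actual apparatus the loop would wait on input from the touch panel 1002 rather than iterate over a prepared list, and the outer loop back to step S201 is omitted here.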

A specific operation of the speech recognition apparatus 1 in the present embodiment is described below. The information in the drawings referred to in this specific example is prepared for convenience of explanation and does not represent actual data. In this specific example, when the speech recognition result information acquisition means 102 acquires speech recognition result information, it also acquires likelihood information for each sequence of element candidates included in that information. The element candidates included in the speech recognition result information are also assumed to include the results of morphological analysis.

In this specific example, the homophone term storage means 103 stores the table shown in FIG. 3. The table in FIG. 3 holds homophone terms. For example, the homophone terms 「かき，カキ，柿，火器，牡蠣，下記，火器」 are registered.
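A lookup against such a table might look like the following Python sketch. The layout of the table as a list of groups is an assumption, as is the inclusion of 「紀州」 in the second group; the actual contents of FIG. 3 are not reproduced here, and the function name is hypothetical.

```python
# Illustrative table: each inner list is one group of terms with the same reading.
HOMOPHONE_TABLE = [
    ["かき", "カキ", "柿", "火器", "牡蠣", "下記", "火器"],
    ["きしゅう", "キシュウ", "紀州", "奇襲", "既修", "貴酬"],
]


def lookup_homophones(term, table=HOMOPHONE_TABLE):
    """Return the other terms registered in the same group as `term`."""
    for group in table:
        if term in group:
            seen = []
            for t in group:
                if t != term and t not in seen:
                    seen.append(t)  # skip the term itself and repeated entries
            return seen
    return []  # term is not registered


kaki = lookup_homophones("柿")
kishu = lookup_homophones("紀州")
```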

Suppose the user starts a mailer, which is e-mail composition software linked with the speech recognition apparatus 1, and begins composing a message. The user then presses the "voice input button" to launch the speech recognition apparatus 1, which provides the voice input function, and utters 「きしゅうのかきをかいたい」 into the microphone 1001.

The voice data receiving means 101 accepts the voice data 「きしゅうのかきをかいたい」 acquired by the microphone 1001 (step S201). The speech recognition result information acquisition means 102 performs speech recognition processing on the accepted voice data and acquires the speech recognition result information {[「紀州の」「柿を」「買い」「たい」, 0.88], [「紀州の」「牡蠣を」「買い」「たい」, 0.72], [「紀州の」「花器を」「買い」「たい」, 0.68], [「紀州の」「牡蠣を」「解体」, 0.55], [「紀州の」「花器を」「解体」, 0.52]} (step S202). The number after each sequence of element candidates is the likelihood information for that sequence. When this speech recognition result information is passed to it, the homophone term acquisition means 104 acquires homophone terms for the morphemes contained in the element candidates, excluding particles and auxiliary verbs, namely 「紀州」, 「牡蠣」, 「柿」, 「花器」, 「買い」, and 「解体」. As a result, the homophone term acquisition means 104 acquires homophone terms for every such morpheme, for example 「きしゅう，キシュウ，奇襲，既修，貴酬」 for 「紀州」 (step S203). Once the homophone terms have been acquired, the element candidate display means 105 determines whether all of the sequences of element candidates included in the speech recognition result information can be displayed in the display area of the touch panel 1002 (step S204). Here, the element candidate display means 105 determines that all of them can be displayed. The element candidate display means 105 displays the element candidates so that the sequence 「紀州の」「柿を」「買い」「たい」, whose likelihood of 0.88 is the highest, lies in a straight line at the center of the display area. Here, the homophone terms are initially hidden, and an element candidate that appears in more than one sequence as the recognition result for the same portion of the voice data is displayed only once. The touch panel 1002 then shows the screen in FIG. 4 (step S206).
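Displaying each shared element candidate only once amounts to merging the recognized sequences into a graph. The Python sketch below does this for the speech recognition result information of this example. It treats the position of an element candidate within its sequence as a stand-in for "the same portion of the voice data", which is a simplifying assumption; an actual recognizer would more likely align candidates by time. The function name is hypothetical.

```python
def build_lattice(nbest):
    """Merge sequences into unique (position, text) nodes and directed edges."""
    nodes, edges = [], set()
    for seq, _likelihood in nbest:
        prev = None
        for pos, text in enumerate(seq):
            node = (pos, text)
            if node not in nodes:
                nodes.append(node)       # each shared candidate is kept once
            if prev is not None:
                edges.add((prev, node))  # connection between adjacent candidates
            prev = node
    return nodes, edges


nbest = [
    (["紀州の", "柿を", "買い", "たい"], 0.88),
    (["紀州の", "牡蠣を", "買い", "たい"], 0.72),
    (["紀州の", "花器を", "買い", "たい"], 0.68),
    (["紀州の", "牡蠣を", "解体"], 0.55),
    (["紀州の", "花器を", "解体"], 0.52),
]
nodes, edges = build_lattice(nbest)
```

Under this assumption the five sequences collapse into seven displayed element candidates, and the edges are the kind of connections that could be drawn as arrows between candidates.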

After checking the screen in FIG. 4, the user operates the touch panel and presses the "▽" button below 「紀州の」. The display change accepting means 106 accepts change instruction information for showing the homophone terms of 「紀州の」 (step S207). In response, the element candidate display means 105 displays the homophone terms of 「紀州の」 (step S208). The user then presses the "▽" button below 「解体」. The change instruction information for showing the homophone terms of 「解体」 is likewise accepted by the display change accepting means 106, and the element candidate display means 105 displays the homophone terms of 「解体」. The screen then appears as shown in FIG. 5.

Suppose the user operates the touch panel and selects 「紀州の」, 「柿を」, 「買い」, and 「たい」 in this order. The element candidate selection accepting means 107 accepts the sequence of element candidates [「紀州の」, 「柿を」, 「買い」, 「たい」] (step S209). The output means 108 then composes the output information 「紀州の柿を買いたい」 from the accepted sequence and outputs it to the mailer (step S210). The mailer then displays it as shown in FIG. 6.

This specific example has described the case in which the element candidate display means 105 determines that all of the sequences of element candidates included in the speech recognition result information can be displayed in the display area of the touch panel 1002. If the element candidate display means 105 instead determines that not all of them can be displayed, it displays only some of the element candidates in the sequences (step S205). The screen then appears as shown on the left of FIG. 7. Suppose the user flicks the touch panel 1002 from right to left. The display change accepting means 106 accepts change instruction information indicating an instruction to scroll the screen (step S207). According to the scroll amount included in the change instruction information, the element candidate display means 105 moves the displayed element candidates to the left so that all of the element candidates appear to move from right to left (step S208). The screen then appears as shown on the right of FIG. 7.

This specific example has also described the case in which the user selects 「紀州の」, 「柿を」, 「買い」, and 「たい」 in order. If the user instead selects 「柿を」, 「買い」, 「たい」, and 「紀州の」 in order, as in FIG. 8, the element candidate selection accepting means 107 accepts the sequence in exactly that order (step S209). The output means 108 then composes the output information 「柿を買いたい紀州の」 from the accepted sequence and outputs it to the mailer (step S210). Alternatively, the element candidate selection accepting means 107 may accept the selection so that it follows the same order as a sequence of element candidates included in the speech recognition result information, regardless of the order in which the user selected the candidates (step S209).
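The alternative in which the selection is accepted in the recognized order regardless of the order of the user's taps could be realized, for example, as in the following Python sketch. Matching a selection to a recognized sequence by comparing the sorted element candidates is an assumption made for illustration, as is the fallback of keeping the user's order when no recognized sequence matches.

```python
def normalize_order(selected, nbest):
    """Return the recognized sequence consisting of the same element candidates.

    Falls back to the user's own order if no recognized sequence matches.
    """
    for seq, _likelihood in nbest:
        if sorted(seq) == sorted(selected):
            return list(seq)
    return list(selected)


nbest = [
    (["紀州の", "柿を", "買い", "たい"], 0.88),
    (["紀州の", "牡蠣を", "解体"], 0.55),
]
tapped = ["柿を", "買い", "たい", "紀州の"]
normalized = normalize_order(tapped, nbest)
unmatched = normalize_order(["奇襲", "柿を"], nbest)
```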

In this specific example, the element candidate display means 105 produces the display shown in FIG. 4, but it may instead display the element candidates so that the sequences included in the speech recognition result information acquired by the speech recognition result information acquisition means 102 are apparent. For example, as in FIG. 9, it may show the relationships between the element candidates in those sequences by connecting them with arrows; or, as in FIG. 10, it may indicate each entire sequence of element candidates with an arrow. Suppose that, with the sequences displayed in this way, the user selects the arrow of a sequence as in FIG. 11. The element candidate selection accepting means 107 then accepts a selection of the sequence 「紀州の」「柿を」「買い」「たい」, in that order.

In the present embodiment, the element candidate display means 105 displays the element candidates, and the element candidate selection accepting means 107 accepts the user's selection of a sequence of element candidates. The user can thus pick the desired sequence from the displayed sequences of element candidates. As a result, for example, the correction of misrecognized portions that the user previously had to perform is replaced by simply selecting a sequence of element candidates, which reduces the burden on the user.

When the element candidate display means 105 does not display in duplicate those element candidates that appear in multiple sequences and are the recognition result for the same portion of the voice data, the display contains less redundant information than an enumeration of every sequence and is easier to take in at a glance. As a result, for example, the user can easily select a sequence of element candidates. When the element candidate display means 105 can change the display according to the amount of element candidate sequences contained in the speech recognition result information, different displays can be realized on, for example, a tablet terminal and a smartphone. Specifically, the display can suit the screen size: all element candidates are shown on a tablet terminal, and only some are shown on a smartphone. When the element candidate display means 105 can display the sequences of element candidates according to likelihood information, it can, for example, display sequences with high likelihood so that they are easy to select, which shortens the time the user spends looking for appropriate element candidates. When the element candidate display means 105 can display a sequence with high likelihood along a straight line, the user can in many cases select the appropriate sequence simply by tracing that line, which likewise shortens the time spent looking for appropriate element candidates.

When the element candidate display means 105 can display homophone terms corresponding to each displayed element candidate, homophone terms that the speech recognition did not recognize can also be included in the selected sequence of element candidates. When the element candidate selection accepting means 107 accepts the selection of a sequence according to the order in which the user designated the element candidates, output information can be composed from a sequence whose order is not contained in the speech recognition result information. For example, the user can have output information output in an order different from the order of utterance. When the element candidate selection accepting means 107 accepts the selection of a sequence contained in the speech recognition result information regardless of the order in which the user designated the element candidates, output information can be composed merely by selecting the sequence, without selecting every element candidate contained in it.
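The duplicate-free, likelihood-ordered display discussed above might be sketched as below. The sketch makes two simplifying assumptions that are not stated in the description: element candidates at the same index of different sequences correspond to the same portion of the voice data, and each sequence carries a single numeric likelihood. The function name and the likelihood values are hypothetical.

```python
from typing import List, Tuple

Sequence = List[str]


def columns_without_duplicates(scored: List[Tuple[Sequence, float]]) -> List[List[str]]:
    """Build one column per position, listing each element candidate once.

    Sequences are visited from highest to lowest likelihood, so the first
    entry of every column belongs to the most likely sequence and the top
    row can be laid out along a straight line.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    columns: List[List[str]] = []
    for seq, _likelihood in ranked:
        for pos, candidate in enumerate(seq):
            while len(columns) <= pos:
                columns.append([])
            if candidate not in columns[pos]:
                columns[pos].append(candidate)
    return columns


# Likelihood values and the second sequence are illustrative assumptions.
scored = [
    (["機種の", "柿を", "食べ", "たい"], 0.3),
    (["紀州の", "柿を", "食べ", "たい"], 0.6),
]
columns = columns_without_duplicates(scored)
top_row = [column[0] for column in columns]
```

Here the shared candidates 「柿を」, 「食べ」, and 「たい」 appear once each, and `top_row` is the sequence with the highest likelihood.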

Although the present embodiment has been described as including the microphone 1001, the speech recognition apparatus 1 need not include the microphone 1001. In that case, the voice data receiving means 101 may receive voice data stored in a storage means (not shown), may receive voice data via a network (not shown), or may receive voice data recorded with a microphone of an external device via a storage medium such as a memory card.

Likewise, although the present embodiment has been described as including the touch panel 1002, the speech recognition apparatus 1 need not include the touch panel 1002. In that case, the element candidate display means 105 may display on another display, and the output means 108 may output to another display, another device, or another component. Also in that case, the display change accepting means 106 may accept change instruction information that the user issues with a mouse, a keyboard, or the like, and the element candidate selection accepting means 107 may accept a selection of a sequence of element candidates that the user makes with a mouse, a keyboard, or the like.

Although the present embodiment has been described as including the display change accepting means 106, the speech recognition apparatus 1 need not include the display change accepting means 106. In that case, each time an element candidate is selected, the element candidate display means 105 may change the displayed element candidates so that the next element candidates are shown.

Although the present embodiment has been described as including the homophone term storage means 103 and the homophone term acquisition means 104, the speech recognition apparatus 1 need not include them. In that case, the element candidate display means 105 need not display homophone terms.

The software that realizes the speech recognition apparatus 1 of the present embodiment is the following program. That is, the program causes a computer to function as: voice data receiving means for receiving voice data, which is data of speech uttered by a user; speech recognition result information acquisition means for performing speech recognition processing on the voice data received by the voice data receiving means and acquiring speech recognition result information including two or more sequences of element candidates, which are candidates for elements that are units of speech recognition; element candidate display means for displaying two or more element candidates contained in the speech recognition result information; element candidate selection accepting means for accepting, in response to the display of element candidates by the element candidate display means, a selection of a sequence of element candidates; and output means for outputting output information that is the sequence of element candidates whose selection the element candidate selection accepting means accepted.
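The order in which the program's means cooperate can be illustrated with a short sketch. It is only an outline under stated assumptions: the recognizer, display, and selection are passed in as callables, `run_recognition` and `stub_recognize` are hypothetical names, and the stub returns fixed sequences instead of performing real speech recognition.

```python
from typing import Callable, List

Sequence = List[str]


def run_recognition(
    voice_data: bytes,
    recognize: Callable[[bytes], List[Sequence]],
    display: Callable[[List[Sequence]], None],
    choose: Callable[[List[Sequence]], int],
) -> str:
    """Receive voice data, acquire candidate sequences, display them,
    accept the selection of one sequence, and return it as output information."""
    sequences = recognize(voice_data)  # speech recognition result information
    if len(sequences) < 2:
        raise ValueError("expected two or more sequences of element candidates")
    display(sequences)                 # element candidate display
    index = choose(sequences)          # selection of a sequence
    return "".join(sequences[index])   # output information


def stub_recognize(voice_data: bytes) -> List[Sequence]:
    """Stand-in recognizer (assumption); the second sequence is hypothetical."""
    return [
        ["紀州の", "柿を", "食べ", "たい"],
        ["機種の", "柿を", "食べ", "たい"],
    ]


shown: List[List[Sequence]] = []
output = run_recognition(b"", stub_recognize, shown.append, lambda seqs: 0)
```

Passing the steps in as callables keeps the sketch independent of any particular microphone, display, or recognition engine, in line with the optional components described above.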

In the present embodiment, each process (each function) may be realized by centralized processing on a single device (system) or by distributed processing across a plurality of devices. It also goes without saying that two or more communication means present in one device may be physically realized by a single medium.

In the present embodiment, each component may be configured by dedicated hardware, or a component that can be realized by software may be realized by executing a program. For example, each component can be realized by a program execution unit such as a CPU reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory.

The functions realized by the above program do not include functions that can be realized only by hardware. For example, functions that can be realized only by hardware such as a modem or an interface card, in an acquisition unit that acquires information, an output unit that outputs information, or the like, are not included in the functions realized by the program.

FIG. 12 shows an example of the internal configuration of a computer that executes the above program and realizes the present invention according to the above embodiment. The embodiment can be realized by computer hardware and a computer program executed on it. In FIG. 12, the computer system 1000 comprises a microphone 1001, a touch panel 1002, an MPU 1003, a flash ROM 1004 for storing programs such as a boot-up program and data, a RAM 1005 that temporarily stores the instructions of application programs and provides temporary storage space, and a bus 1006 that interconnects the MPU 1003 and the other components.

The program need not include an operating system (OS), a third-party program, or the like that causes the computer system 1000 to execute the functions of the present invention according to the above embodiment. The program may include only those instructions that call the appropriate functions (modules) in a controlled manner so that the desired result is obtained. How the computer system 1000 operates is well known, and a detailed description is omitted.

The present invention is not limited to the above embodiment; various modifications are possible, and it goes without saying that they are also included within the scope of the present invention. The word "means" in each means of the present invention may be read as "unit" or "circuit".

As described above, the speech recognition apparatus and the like according to the present invention have the effect of reducing the burden on the user after speech recognition, and are useful as a speech recognition apparatus and the like.

DESCRIPTION OF SYMBOLS
1 Speech recognition apparatus
101 Voice data receiving means
102 Speech recognition result information acquisition means
103 Homophone term storage means
104 Homophone term acquisition means
105 Element candidate display means
106 Display change accepting means
107 Element candidate selection accepting means
108 Output means

Claims

A speech recognition apparatus comprising:
Voice data receiving means for receiving voice data, which is data of speech uttered by a user;
Speech recognition result information acquisition means for performing speech recognition processing on the voice data received by the voice data receiving means and acquiring speech recognition result information including two or more sequences of element candidates, which are candidates for elements that are part of the speech recognition result corresponding to the voice data;
Element candidate display means for displaying two or more element candidates included in the speech recognition result information;
Element candidate selection accepting means for accepting, in response to the display of element candidates by the element candidate display means, a selection of a sequence of element candidates;
Output means for outputting output information that is the sequence of element candidates whose selection the element candidate selection accepting means accepted; and
Homophone term acquisition means for acquiring one or more homophone terms, each of which has the same sound as a term that is part of at least one of the two or more element candidates included in the speech recognition result information and is different from that term,
Wherein the element candidate display means also displays an element candidate in which a term included in an element candidate is replaced with a homophone term, using the one or more homophone terms acquired by the homophone term acquisition means.

The element candidate display means includes
The speech recognition apparatus according to claim 1, wherein all or some element candidates of the speech recognition result information are displayed according to a size of a region to be displayed or an amount of information of the speech recognition result information.

The speech recognition apparatus according to claim 1, wherein
The speech recognition result information acquisition means acquires speech recognition result information including likelihood information, which is a likelihood relating to a sequence of element candidates, and
The element candidate display means displays element candidates according to the likelihood information.

The element candidate display means includes
The speech recognition apparatus according to claim 3, wherein the arrangement having the highest likelihood of element candidates is displayed so as to be linear.

One or more element candidates of the two or more element candidates included in the speech recognition result information include independent words and non-independent words,
The homophone term acquisition means includes:
The speech recognition apparatus according to any one of claims 1 to 4, wherein a homophone term is acquired only for an independent word included in the element candidate.

The element candidate selection receiving means
The speech recognition device according to any one of claims 1 to 5, which receives selection of an arrangement of element candidates according to an order of element candidates designated by a user.

The element candidate selection receiving means
The speech recognition apparatus according to any one of claims 1 to 5, which receives selection of any one of element candidates included in the speech recognition result information.

A speech recognition method processed using voice data receiving means, speech recognition result information acquisition means, element candidate display means, element candidate selection accepting means, and output means, the method comprising:
A voice data receiving step in which the voice data receiving means receives voice data, which is data of speech uttered by a user;
A speech recognition result information acquisition step in which the speech recognition result information acquisition means performs speech recognition processing on the voice data received in the voice data receiving step and acquires speech recognition result information including two or more sequences of element candidates, which are candidates for elements that are part of the speech recognition result corresponding to the voice data;
An element candidate display step in which the element candidate display means displays two or more element candidates included in the speech recognition result information;
An element candidate selection accepting step in which the element candidate selection accepting means accepts, in response to the display of element candidates in the element candidate display step, a selection of a sequence of element candidates;
An output step in which the output means outputs output information that is the sequence of element candidates whose selection was accepted in the element candidate selection accepting step; and
A homophone term acquisition step of acquiring one or more homophone terms, each of which has the same sound as a term that is part of at least one of the two or more element candidates included in the speech recognition result information and is different from that term,
Wherein, in the element candidate display step,
An element candidate in which a term included in an element candidate is replaced with a homophone term is also displayed, using the one or more homophone terms acquired in the homophone term acquisition step.

A program for causing a computer to function as:
Voice data receiving means for receiving voice data, which is data of speech uttered by a user;
Speech recognition result information acquisition means for performing speech recognition processing on the voice data received by the voice data receiving means and acquiring speech recognition result information including two or more sequences of element candidates, which are candidates for elements that are part of the speech recognition result corresponding to the voice data;
Element candidate display means for displaying two or more element candidates included in the speech recognition result information;
Element candidate selection accepting means for accepting, in response to the display of element candidates by the element candidate display means, a selection of a sequence of element candidates; and
Output means for outputting output information that is the sequence of element candidates whose selection the element candidate selection accepting means accepted,
The program further causing the computer to function as
Homophone term acquisition means for acquiring one or more homophone terms, each of which has the same sound as a term that is part of at least one of the two or more element candidates included in the speech recognition result information and is different from that term,
Wherein the element candidate display means
Also displays an element candidate in which a term included in an element candidate is replaced with a homophone term, using the one or more homophone terms acquired by the homophone term acquisition means.