JP3762300B2

JP3762300B2 - Text input processing apparatus and method, and program

Info

Publication number: JP3762300B2
Application number: JP2001401299A
Authority: JP
Inventors: 浩平桃崎
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2001-12-28
Filing date: 2001-12-28
Publication date: 2006-04-05
Anticipated expiration: 2021-12-28
Also published as: JP2003202886A

Abstract

PROBLEM TO BE SOLVED: To provide a text input device, method, and program equipped with a user interface that enables a user to properly and easily edit input text based on a speech recognition result.

SOLUTION: A text input processor that provides a dictation function, in which text is input by speech recognition, displays each headword (written form) side by side with its associated pronunciation string. When a headword is specified, the candidate list shows candidates having the same pronunciation. When a pronunciation string is specified, the candidate list shows candidates having different pronunciations.

COPYRIGHT: (C)2003,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、広くは自然言語処理に関し、特に、音声でテキスト（文章）の入力を行うディクテーション機能を提供する音声認識技術を利用したテキスト入力処理装置及び方法並びにプログラムに関する。
【０００２】
【従来の技術】
音声認識技術を利用したテキスト入力処理装置の従来例では、単純な漢字仮名混じり表記のテキスト形式で音声認識結果を表示するようにしている。このような音声認識に基づく入力テキストを修正する際、修正対象を選択して候補表示する操作を行うと、従来例では、表記が違う候補や発音が違う候補などが全て表示されるものとなっていた。このような従来例では、音声認識結果に基づいて表示された入力テキストがユーザの想定していたものと異なる場合に、それが同音語内の表記の違いなのか、それとも音の違いなのか、あるいは単語等の分割単位の違いなのか、といったことの判別が難しいという問題点がある。特に、ユーザが知らない単語や読めない単語が表示されてしまい、それがテキスト修正を困難にするということは、キーボード等によるテキスト入力とは違った音声認識に基づくテキスト入力に特有の問題点である。
【０００３】
また、修正候補の選択において、表記が違う候補や発音が違う候補など複数の要因による認識候補が全て表示されるので、目的の候補を見つけだすのに時間がかかり、操作が煩雑になるという問題点もある。また、候補選択状態に移行した後に、新たな操作ステップを経て初めて表示がなされるよう構成されている場合なども、ユーザが目的とする候補が得られるまでの操作が煩雑になる。
【０００４】
このように、音声認識技術を利用したテキスト入力処理装置の従来例には、入力音声テキストの修正（広義には編集）操作を容易に行えるようなユーザインターフェースが提供されることが望まれている。
【０００５】
【発明が解決しようとする課題】
本発明は、かかる事情を考慮してなされたものであり、音声認識結果に基づく入力テキストの編集をユーザが適切且つ容易に行えるユーザインタフェースを備えたテキスト入力装置、方法、及びプログラムを提供することを目的とする。
【０００６】
【課題を解決するための手段】
上記課題を解決し目的を達成するために本発明は次のように構成されている。
【０００７】
本発明に係る第１のテキスト入力処理装置は、音声認識を利用してテキストを入力処理するテキスト入力処理装置であって、前記テキストを構成する複数の文節のいずれか一つが、複数の異音語を含む発音文字列の候補を有する音声認識結果を記憶する記憶手段と、前記複数の文節のそれぞれの発音文字列を組み合わせて表示する表示手段と、前記いずれか一つの文節について、前記発音文字列の候補を一覧表示する候補表示手段と、一覧表示された前記候補のなかから、いずれか一つの異音語をユーザに選択させるための選択手段と、を具備することを特徴とするテキスト入力処理装置である。
【０００８】
また、本発明に係る第２のテキスト入力処理装置は、音声認識を利用してテキストを入力処理するテキスト入力処理装置であって、前記テキストを構成する複数の文節のいずれか一つが、複数の異音語を含む発音文字列の候補、および、表記が異なる複数の同音語を含む漢字仮名混じり文字列の候補を有する音声認識結果を記憶する記憶手段と、前記複数の文節のそれぞれの発音文字列の一つを組み合わせて表示する第１表示手段と、前記複数の文節のそれぞれの漢字仮名混じり文字列の一つを組み合わせて表示する第２表示手段と、前記いずれか一つの文節について、前記発音文字列の候補を一覧表示する第１候補表示手段と、前記いずれか一つの文節について、前記漢字仮名混じり文字列の候補を一覧表示する第２候補表示手段と、一覧表示された前記発音文字列の候補のなかから、いずれか一つの異音語をユーザに選択させるための第１選択手段と、一覧表示された前記漢字仮名混じり文字列の候補のなかから、いずれか一つの同音語をユーザに選択させるための第２選択手段と、を具備することを特徴とするテキスト入力処理装置である。
【０００９】
本発明に係る第１のテキスト入力処理方法は、音声認識を利用してテキストを入力処理するテキスト入力処理方法であって、前記テキストを構成する複数の文節のいずれか一つが、複数の異音語を含む発音文字列の候補を有する音声認識結果を記憶する記憶ステップと、前記複数の文節のそれぞれの発音文字列を組み合わせて表示する表示ステップと、前記いずれか一つの文節について、前記発音文字列の候補を一覧表示する候補表示ステップと、一覧表示された前記候補のなかから、いずれか一つの異音語をユーザに選択させるための選択ステップと、具備することを特徴とするテキスト入力処理方法である。
【００１０】
また、本発明に係る第２のテキスト入力処理方法は、音声認識を利用してテキストを入力処理するテキスト入力処理方法であって、前記テキストを構成する複数の文節のいずれか一つが、複数の異音語を含む発音文字列の候補、および、表記が異なる複数の同音語を含む漢字仮名混じり文字列の候補を有する音声認識結果を記憶する記憶ステップと、前記複数の文節のそれぞれの発音文字列の一つを組み合わせて表示する第１表示ステップと、前記複数の文節のそれぞれの漢字仮名混じり文字列の一つを組み合わせて表示する第２表示ステップと、前記いずれか一つの文節について、前記発音文字列の候補を一覧表示する第１候補表示ステップと、前記いずれか一つの文節について、前記漢字仮名混じり文字列の候補を一覧表示する第２候補表示ステップと、一覧表示された前記発音文字列の候補のなかから、いずれか一つの異音語をユーザに選択させるための第１選択ステップと、一覧表示された前記漢字仮名混じり文字列の候補のなかから、いずれか一つの同音語をユーザに選択させるための第２選択ステップと、を具備することを特徴とするテキスト入力処理方法である。
【００１１】
本発明に係る第１のプログラムは、音声認識を利用したテキストの入力を処理するプログラムであって、コンピュータを、前記テキストを構成する複数の文節のいずれか一つが、複数の異音語を含む発音文字列の候補を有する音声認識結果を記憶する記憶手段、前記複数の文節のそれぞれの発音文字列を組み合わせて表示する表示手段、前記いずれか一つの文節について、前記発音文字列の候補を一覧表示する候補表示手段、一覧表示された前記候補のなかから、いずれか一つの異音語をユーザに選択させるための選択手段、として機能させるためのプログラムである。
【００１２】
また、本発明に係る第２のプログラムは、音声認識を利用したテキストの入力を処理するプログラムであって、コンピュータを、前記テキストを構成する複数の文節のいずれか一つが、複数の異音語を含む発音文字列の候補、および、表記が異なる複数の同音語を含む漢字仮名混じり文字列の候補を有する音声認識結果を記憶する記憶手段、前記複数の文節のそれぞれの発音文字列の一つを組み合わせて表示する第１表示手段、前記複数の文節のそれぞれの漢字仮名混じり文字列の一つを組み合わせて表示する第２表示手段、前記いずれか一つの文節について、前記発音文字列の候補を一覧表示する第１候補表示手段、前記いずれか一つの文節について、前記漢字仮名混じり文字列の候補を一覧表示する第２候補表示手段、一覧表示された前記発音文字列の候補のなかから、いずれか一つの異音語をユーザに選択させるための第１選択手段、一覧表示された前記漢字仮名混じり文字列の候補のなかから、いずれか一つの同音語をユーザに選択させるための第２選択手段、として機能させるためのプログラムである。
【００１３】
【発明の実施の形態】
以下、図面を参照しながら本発明の実施形態を説明する。
【００１４】
図１は、本発明に係るテキスト入力装置の一実施形態の概略構成を示すブロック図である。本実施形態のテキスト入力装置は、例えば汎用のコンピュータに、音声認識に係るデバイスを設けたものをベースとして構成することができ、マイクロホン等の音声入力デバイスに結合され、ユーザが発した音声を入力する音声入力部１１と、音声入力部１１に入力された音声を認識する音声認識部１２と、音声認識部１２による音声認識結果を保持する候補情報保持部１３と、キーボードやマウス等の入力デバイスに結合され、ユーザが行った候補選択操作についての情報を入力する候補選択操作部１４と、選択的な候補表示の制御を司る本実施形態の主要部であって、候補選択操作部１４から入力される操作情報に従い、候補情報保持部１３が保持する認識結果の情報から、適切な候補表示情報を作成する候補表示制御部１５と、候補表示制御部１５において作成された候補表示情報をディスプレイ上に表示する表示部１６と、から構成されている。
【００１５】
本発明に対応する主要な構成要素は、候補情報保持部１３、候補表示制御部１５、および候補選択操作部１４であり、これらの構成要素は例えばコンピュータソフトウェアによって実現することができる。
【００１６】
図２は、図１における音声認識部１２の概略構成を示すブロック図である。図２に示すように、音声認識部１２は、音声データを入力する音声入力部２１と、音声入力部２１を介して入力された音声データに対して、信号処理及び分析を行い、発声部分を検出して切り出したり、特徴量を抽出してパラメータ化する等の音響的な処理を行う音響処理部２２と、音響処理部２２によってパラメータ化された音声情報を、単語辞書２６に登録されている単語で構成される単語列と照合する照合部２３と、照合部２３における照合処理において参照され、ＨＭＭ（隠れマルコフモデル）等から構成される音響辞書２４と、同じく照合部２３における照合処理において参照され、統計的言語モデル（ｎ−ｇｒａｍ）等から構成される言語辞書２５と、照合部２３における照合処理の結果として得られる単語列を解析し、これを文節単位に再構成したり、同音語の展開を行ったりする言語処理部２７と、言語処理部２７における言語処理において参照され、単語についての種々の情報を格納してなる単語辞書２８と、言語処理部２７における言語処理結果についての履歴を管理し、候補出力を行う候補出力部２９と、によって構成されている。
【００１７】
以上のように構成された本実施形態において、ユーザが発声した音声が音声入力部１１に入力され、発声終了直後に音声認識部１２においてその一回の発声が認識されたとする。ここで、音声認識部１２により図３又は図４のような候補情報が出力され、候補情報保持部１３に格納された場合を例に挙げて説明する。
【００１８】
認識結果の候補情報を候補情報保持部１３から受け取ると、候補表示制御部１５は、直ちに、最も適切と判定された読み（発音）と見出しの組を使用し、表示部１６に図５に示すような「読み（発音）」を併記した候補表示を行わせる。かかる「読み」は、いわゆる「ルビ（読み仮名）」と同様の情報である。
【００１９】
ここで、候補選択操作部１４は、表示部１６の表示に対し、ユーザがキーボードやマウス等で候補選択の対象とする部分を選択指定したり、候補表示を実行する操作を行ったり、表示された複数候補の中から別の候補を選択指定したりするための操作インターフェースを提供する。その詳細については後述する。
【００２０】
次に、音声認識部１２が出力する候補情報について説明する。
【００２１】
図３は、音声認識部１２が出力する候補情報の一例を示している。候補情報には、音声認識部１２により複数得られた音声認識結果について、最も適切と判定された（一位）系列のほか、文節の境界が同じになる複数の候補が格納される。各々の候補は、読み（発音文字列）と見出し（漢字仮名混じりの表記）の情報を有する。また、候補情報には、同一の読み（発音）で異なる表記の（同音語）候補や、読み（発音）の異なる（異音語、異なり語）候補も格納される。
【００２２】
図３に示すように、文節番号１として、音声中の位置０から４０までの区間で認識された「こころから」の発音の候補が同音語を含めて２つ格納されている。この中で最も適切と判定されている表記は「心から」である。
【００２３】
また、文節番号２として、文節番号１に続く位置４０から６０までの区間で認識された「あつく」「あつくも」の２つの発音の候補が同音語を含めて計１０個格納されている。この中では「熱く」が最も適切と判定されている。同音語としては「厚く」などがある。
【００２４】
さらに、文節番号３として、文節番号２に続く位置６０から８８までの区間で認識された「おれい」「おんれい」「おんで」「おれへ」の４つの発音の候補が同音語を含めて計１１個格納されている。この中で「お礼」が最も適切と判定されている。同じ区間の異音語の各々の発音の候補中では、「御礼（おんれい）」「恩で（おんで）」「俺へ（おれへ）」が最も適切と判定されている。
【００２５】
図４は、音声認識部１２が出力する候補情報の他の例を示している。この候補情報は、文節番号１乃至３は図３のものと同様である。そして、文節番号４として、音声中の位置０から３６までの区間で認識された「ここのか」が格納され、文節番号７として、これに続く位置３６から５２までの区間で認識された「だす」が格納され、文節番号９として、これに続く位置５２から８８までの区間で認識された「こんれい」が、各々、同音語を含めて格納されている。これらの候補中では「９日」「出す」「婚礼」が最も適切と判定されている。
【００２６】
さらにこの図４の候補情報では、文節番号５として、音声中の位置０から３２までの区間で認識された「ここも」、文節番号６として、これに続く位置３２から４６までの区間で認識された「ただ」、文節番号８として、これに続く位置４６から８８までの区間で認識された「すっとんで」が、各々、同音語を含めて格納されている。これらの候補中では「ここも」「ただ」「すっ飛んで」が最も適切と判定されている。
【００２７】
すなわち、文節番号１乃至３の「心から」「熱く」「お礼」が一位系列であり、文節の境界が異なる他の系列として、文節番号４及び７並びに９の「９日（ここのか）」「出す」「婚礼」や、文節番号５及び６並びに８の「ここも」「ただ」「すっ飛んで」が格納されている。
【００２８】
ここで、候補選択操作部１４及び候補表示制御部１５並びに表示部１６の動作について説明する。
【００２９】
表示部１６では、初期状態では図５に示すように、「こころから／心から」「あつく／熱く」「おれい／お礼」「もうしあげます／申し上げます」が表示されているとする。
【００３０】
先ず、ユーザにより、「見出し」に対する候補表示指示が行われた場合、例えば「熱く」を選択して候補表示指示が行われた場合について説明する。かかる操作が行われると、その操作情報が候補選択操作部１４を通じて候補表示制御部１５に与えられる。候補表示制御部１５は、候補情報保持部１３に保持されている候補情報中の「熱く」に対応する候補のうち、「熱く」の同音語である候補を図６のように表示部１６に表示させる。
【００３１】
ここで、本実施形態は、候補表示制御部１５に所定のモード切替操作が与えられると、候補表示制御部１５は、図６に示した表示情報に代えて、図７のように、読み（発音）の異なる候補（異音語、異なり語）までをも含めた候補表示を行うよう構成される。図６及び図７の候補表示は、上記モード切替操作に応じて相互に切り替え可能に構成されることが好ましい。
【００３２】
さらに、ここで、図６（又は図７）の表示候補のうち、「厚く」を選択指定する操作が行われると、表示部１６は選択された「厚く」を図８のように表示する。また、図７で表示された候補のうち、「厚くも」を選択指定する操作を行うと、表示部１６は選択された「厚くも」とその読み（発音）「あつくも」を図９のように表示する。
【００３３】
次に、ユーザにより「読み」に対する候補表示指示が行われた場合、例えば「おれい」選択して候補表示の指示が与えられた場合について説明する。かかる操作が行われた場合は、候補情報の中の「お礼」に対応する候補のうち、「おれい」と異なる読み（発音）を図１０のように表示する。このとき、図１１のように、読み（発音）の他に表記を合わせて表示するモードとの切り替えを可能にしておくことが好ましい。表記は、その読み（発音）に対応する候補の中で最も適切と判定された表記を表示するとよい。
【００３４】
ここで、図１０の表示候補のうち、「おんれい」を選択指定する操作が行われると、表示部１６は選択された「おんれい」と、それに対応する表記「御礼」を図１２のように表示する。なお、図１１で表示された候補のうち、「おんれい／御礼」を選択する操作を行った場合も同様である。
【００３５】
以上のような本実施形態によれば、ユーザは、「見出し」及び「読み」についての選択的な候補表示に基づき、読み（発音）及び表記の適切な組み合わせを容易に得て、所望のテキストを入力処理（修正など）することができる。
【００３６】
ここで、上述した構成に基づく他の候補表示処理について説明する。他の候補表示処理は、見出しの表示を行わず、「読み」のみの表示を行うというものである。
【００３７】
候補表示制御部１５は、最も適切と判定された読み（発音）のみを使用して、表示部１６に、図１３に示すような読み（発音）のみの候補表示を行わせる。
【００３８】
この場合、「おれい」の読み（発音）を選択して候補表示する指示が候補選択操作部１４を介してユーザから与えられた場合には、候補情報の中の「お礼」に対応する候補のうち、「おれい」と異なる読み（発音）を含めて図１４のように表示する。このとき、図１５のように、読み（発音）の他に、対応する最も適切な表記を合わせて表示するモードとの切り替えを可能にしてくことが好ましい。
【００３９】
さらにここで、図１４で表示された候補のうち、「おんれい」を選択指定する操作がユーザにより行われると、表示部１６は、選択された「おんれい」を図１６のように表示する。図１５で表示された候補のうち、「おんれい／御礼」を選択する操作を行った場合についても同様である。
【００４０】
次に、候補表示制御部１５における処理内容について、図１７のフローチャートを参照して説明する。
【００４１】
候補表示制御部１５では、音声認識部１２から候補情報が入力されると、候補情報保持部１３にその候補情報を保持する（ステップＳ３１）。
【００４２】
次に、候補情報の中で最も適切と判定されている一位系列の候補情報を候補情報保持部１３から取得し（ステップＳ３２）、見出しを表示するか否かの設定情報を判定する（ステップＳ３３）。この設定情報を、ユーザが設定できるよう構成してもよい。
【００４３】
見出しを表示する設定の場合は、読み（発音）と見出しの組を使用した表示情報を作成する（ステップＳ３４）。一方、見出しを表示しない設定の場合は、読み（発音）のみを使用した表示情報を作成し（ステップＳ３５）、表示部１６における表示を行わせる（ステップＳ３６）。なお、ステップＳ３６における表示は、一位系列の候補情報の表示である。
【００４４】
その後、候補表示制御部１５は、ユーザからの候補表示指示を受け付けるための待機状態に移行する（ステップＳ３７）。
【００４５】
ここで、候補表示の指示がユーザから与えられると、候補選択用候補の表示情報が作成（ステップＳ３８）され、表示部１６により表示が行われる。同ステップＳ３８の処理内容については後述する。この候補表示動作に続いて、ユーザからの候補選択操作を受け入れるための待機状態に移行する（ステップＳ３９）。
【００４６】
ここで、候補選択する操作が行われると、指定された候補の読み（発音）と見出しの組を使用して、表示部１６の表示を更新し（ステップＳ４０）、再びユーザからの候補表示操作を受け入れるための待機状態に入る（ステップＳ３７）。
【００４７】
次に、候補表示制御部１５における候補選択用の候補表示処理（ステップＳ３８）の詳細について、図１８のフローチャートを参照して説明する。
【００４８】
先ず、候補表示する旨のユーザからの指示操作（例えばマウスクリックなど）を検知すると、指定された箇所が見出しであるか、読み（発音）であるかを判定する（ステップＳ４１）。見出しが指定された場合は、全候補を表示するか否かについての所定の設定内容を参照する（ステップＳ４２）。全候補を表示しない設定の場合は、候補情報保持部１３から例えば同音語の候補のみを抽出する（ステップＳ４３）。全候補を表示する設定の場合は、同じ区間内の全ての候補を抽出する（ステップＳ４４）。これら設定に応じて抽出された候補は、ステップＳ４５において表示部１６に表示される。
【００４９】
一方、上記ステップＳ４１において、指定箇所が読み（発音）であった旨判定された場合は、候補情報保持部１３から、異なる読み（発音）の候補であって、読み（発音）ごとに最も適切と判定された表記の候補をステップＳ４６において抽出する。さらに、見出し表示を併用するか否かについての所定の設定内容を参照する（ステップＳ４７）。かかる設定内容に応じて、読み（発音）のみを候補表示する（ステップＳ４８）か、読み（発音）と表記（見出し）を合わせて候補表示する（ステップＳ４９）かについて、処理動作が選択される。かかる動作ののち、ステップＳ４５において、表示部１６に候補表示がなされる。
【００５０】
ここで、上記実施形態の変形例について説明する。
【００５１】
上記実施形態では、見出しと組み合わせて表示される発音文字列として平仮名の「読み」を使用したが、片仮名やローマ字を使用してもよい。また、「お礼」に対して「おれい」ではなく「おれー」というような実際の発音に近い表記を使用してもよい。さらにアクセント型を表す表示を付加してもよい。
【００５２】
また、上記実施形態では、日本語を対象としているが、他の言語でもよい。例えば中国語を対象とし、発音文字列としてピンインや注音符号を使用してもよい。また、声調の表示を付加してもよい。
【００５３】
また、上記実施形態では、候補表示を文節単位で行っているが、単語その他の単位で行ってもよい。
【００５４】
また、候補選択操作の方法については、キーボードやマウスのほか、ペン、音声操作等を利用して行ってもよく、選択対象を指定して実行を指示することのできる任意のデバイスについて、本発明は適用可能である。
【００５５】
また、上記実施形態は、いわゆるポップアップウィンドウによって候補表示しているが、画面の下端などの別領域に列挙表示するなどの方法としてもよい。
【００５６】
また、上記実施形態では、同一の読み（発音）で異なる表記の候補を、予め音声認識処理の中で生成しているが、音声認識処理の中では読み（発音）の異なるものを扱い、異なる表記の候補に展開する言語処理を別途行うように構成してもよい。異なる表記の候補展開は、例えば候補表示操作がなされたときに行えばよい。
【００５７】
また、上記実施形態では、音声認識部１２中に言語処理部２７が含まれる構成としているが、同処理部２７に代えて、主にキーボード入力を処理する仮名漢字変換等の言語処理部を使用することとし、音声認識部１２に外付けする構成としてもよい。
【００５８】
なお、本発明は上述した実施形態及び変形例に限定されず、さらに種々変形して実施可能である。本発明は、各種情報処理装置におけるテキスト入力のための手段の構成方法として有効であり、パーソナルコンピュータのソフトウェア、ワードプロセッサ装置、携帯情報機器等に幅広く利用可能である。
【００５９】
【発明の効果】
以上説明したように、本発明によれば、音声認識結果に基づく入力テキストの編集をユーザが適切且つ容易に行えるユーザインタフェースを備えたテキスト入力装置、方法、およびプログラムを提供できる。
【図面の簡単な説明】
【図１】本発明に係るテキスト入力装置の一実施形態の概略構成を示すブロック図
【図２】図１に示す音声認識部１２の概略構成を示すブロック図
【図３】音声認識部１２が出力する候補情報の一例を示す図
【図４】音声認識部１２が出力する候補情報の他の例を示す図
【図５】初期状態における音声入力テキストを示す図
【図６】「見出し」に対する候補表示指示が行われた場合を説明するための図
【図７】図６の表示内容に加え、読み（発音）の異なる候補（異音語、異なり語）までをも含めた候補表示を行う場合を示す図
【図８】図６の表示候補に対する選択操作後を示す図
【図９】図７の表示候補に対する選択操作後を示す図
【図１０】「読み」に対する候補表示指示が行われた場合を説明するための図
【図１１】読み（発音）の他に、表記を合わせて候補表示する場合を説明するための図
【図１２】図１０の表示候補に対する選択操作後を示す図
【図１３】見出しの表示を行わず、「読み」のみの表示を行う実施形態を説明するための図
【図１４】図１３の表示に対して、ある「読み」に対して候補表示する旨の指示がなされた場合を説明するための図
【図１５】読み（発音）の他に、対応する最も適切な表記を合わせて候補表示する場合を説明するための図
【図１６】図１４の表示候補に対する選択操作後を示す図
【図１７】候補表示制御部１５における処理内容の一例を示すフローチャート
【図１８】図１７のフローチャートにおける候補選択用表示処理（ステップＳ３８）の内容を示すフローチャート
【符号の説明】
１１…音声入力部
１２…音声認識部
１３…候補情報保持部
１４…候補選択操作部
１５…候補表示制御部
１６…表示部
２１…音声入力部
２２…音響処理部
２３…照合部
２４…音響辞書（ＨＭＭ）
２５…言語辞書（ｎ−ｇｒａｍ）
２６…単語辞書
２７…言語処理部
２８…単語辞書
２９…候補出力部
[0001]
[Technical field of the invention]
The present invention relates generally to natural language processing, and more particularly to a text input processing apparatus, method, and program that use speech recognition technology to provide a dictation function for inputting text (sentences) by voice.
[0002]
[Prior art]
In conventional text input processing devices that use speech recognition technology, the speech recognition result is displayed as plain text in mixed kanji-kana notation. When the user corrects input text obtained by such speech recognition and performs an operation to select a correction target and display candidates, the conventional devices display all candidates together, including candidates that differ in notation and candidates that differ in pronunciation. Consequently, when the input text displayed on the basis of the speech recognition result differs from what the user intended, it is difficult to tell whether the difference is one of notation among homophones, one of sound, or one of segmentation units such as words. In particular, words that the user does not know or cannot read may be displayed, which makes correcting the text difficult. This problem is specific to text input based on speech recognition and does not arise in text input using a keyboard or the like.
[0003]
In addition, when a correction candidate is selected, recognition candidates arising from multiple factors, such as candidates with different notations and candidates with different pronunciations, are all displayed, so finding the desired candidate takes time and the operation becomes cumbersome. The operation required to reach the desired candidate also becomes cumbersome when the device is configured so that the candidates are displayed only after an additional operation step following the transition to the candidate selection state.
[0004]
Thus, for conventional text input processing devices using speech recognition technology, it is desirable to provide a user interface that makes it easy to correct (in a broad sense, edit) text input by speech.
[0005]
[Problems to be solved by the invention]
The present invention has been made in view of these circumstances, and its object is to provide a text input device, method, and program equipped with a user interface that allows a user to appropriately and easily edit input text based on a speech recognition result.
[0006]
[Means for Solving the Problems]
In order to solve the above problems and achieve the object, the present invention is configured as follows.
[0007]
A first text input processing device according to the present invention is a text input processing device that processes text input using speech recognition, comprising: storage means for storing a speech recognition result in which any one of a plurality of phrases constituting the text has pronunciation string candidates including a plurality of differently pronounced words; display means for displaying a combination of the pronunciation strings of the respective phrases; candidate display means for displaying a list of the pronunciation string candidates for the one phrase; and selection means for allowing a user to select any one of the differently pronounced words from the listed candidates.
[0008]
A second text input processing device according to the present invention is a text input processing device that processes text input using speech recognition, comprising: storage means for storing a speech recognition result in which any one of a plurality of phrases constituting the text has pronunciation string candidates including a plurality of differently pronounced words, and kanji-kana string candidates including a plurality of homophones with different notations; first display means for displaying a combination of one pronunciation string for each of the phrases; second display means for displaying a combination of one kanji-kana string for each of the phrases; first candidate display means for displaying a list of the pronunciation string candidates for the one phrase; second candidate display means for displaying a list of the kanji-kana string candidates for the one phrase; first selection means for allowing a user to select any one of the differently pronounced words from the listed pronunciation string candidates; and second selection means for allowing the user to select any one of the homophones from the listed kanji-kana string candidates.
[0009]
A first text input processing method according to the present invention is a text input processing method that processes text input using speech recognition, comprising: a storage step of storing a speech recognition result in which any one of a plurality of phrases constituting the text has pronunciation string candidates including a plurality of differently pronounced words; a display step of displaying a combination of the pronunciation strings of the respective phrases; a candidate display step of displaying a list of the pronunciation string candidates for the one phrase; and a selection step of allowing a user to select any one of the differently pronounced words from the listed candidates.
[0010]
A second text input processing method according to the present invention is a text input processing method that processes text input using speech recognition, comprising: a storage step of storing a speech recognition result in which any one of a plurality of phrases constituting the text has pronunciation string candidates including a plurality of differently pronounced words, and kanji-kana string candidates including a plurality of homophones with different notations; a first display step of displaying a combination of one pronunciation string for each of the phrases; a second display step of displaying a combination of one kanji-kana string for each of the phrases; a first candidate display step of displaying a list of the pronunciation string candidates for the one phrase; a second candidate display step of displaying a list of the kanji-kana string candidates for the one phrase; a first selection step of allowing a user to select any one of the differently pronounced words from the listed pronunciation string candidates; and a second selection step of allowing the user to select any one of the homophones from the listed kanji-kana string candidates.
[0011]
A first program according to the present invention is a program for processing text input using speech recognition, the program causing a computer to function as: storage means for storing a speech recognition result in which any one of a plurality of phrases constituting the text has pronunciation string candidates including a plurality of differently pronounced words; display means for displaying a combination of the pronunciation strings of the respective phrases; candidate display means for displaying a list of the pronunciation string candidates for the one phrase; and selection means for allowing a user to select any one of the differently pronounced words from the listed candidates.
[0012]
A second program according to the present invention is a program for processing text input using speech recognition, the program causing a computer to function as: storage means for storing a speech recognition result in which any one of a plurality of phrases constituting the text has pronunciation string candidates including a plurality of differently pronounced words, and kanji-kana string candidates including a plurality of homophones with different notations; first display means for displaying a combination of one pronunciation string for each of the phrases; second display means for displaying a combination of one kanji-kana string for each of the phrases; first candidate display means for displaying a list of the pronunciation string candidates for the one phrase; second candidate display means for displaying a list of the kanji-kana string candidates for the one phrase; first selection means for allowing a user to select any one of the differently pronounced words from the listed pronunciation string candidates; and second selection means for allowing the user to select any one of the homophones from the listed kanji-kana string candidates.
[0013]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0014]
FIG. 1 is a block diagram showing a schematic configuration of an embodiment of a text input device according to the present invention. The text input device of this embodiment can be built on, for example, a general-purpose computer equipped with a device for speech recognition. It comprises: a voice input unit 11 that is coupled to a voice input device such as a microphone and receives speech uttered by the user; a speech recognition unit 12 that recognizes the speech input to the voice input unit 11; a candidate information holding unit 13 that holds the speech recognition result produced by the speech recognition unit 12; a candidate selection operation unit 14 that is coupled to an input device such as a keyboard or mouse and receives information on candidate selection operations performed by the user; a candidate display control unit 15, the main part of this embodiment, which controls selective candidate display and creates appropriate candidate display information from the recognition result held in the candidate information holding unit 13 in accordance with the operation information input from the candidate selection operation unit 14; and a display unit 16 that displays the candidate display information created by the candidate display control unit 15 on a display.
[0015]
The main components corresponding to the present invention are the candidate information holding unit 13, the candidate display control unit 15, and the candidate selection operation unit 14, and these components can be realized by, for example, computer software.
[0016]
FIG. 2 is a block diagram showing a schematic configuration of the speech recognition unit 12 in FIG. 1. As shown in FIG. 2, the speech recognition unit 12 comprises: a voice input unit 21 that receives voice data; an acoustic processing unit 22 that performs signal processing and analysis on the voice data input via the voice input unit 21, including acoustic processing such as detecting and cutting out uttered portions and extracting feature values to parameterize them; a collation unit 23 that collates the speech information parameterized by the acoustic processing unit 22 against word strings composed of words registered in a word dictionary 26; an acoustic dictionary 24 that is referred to in the collation process of the collation unit 23 and is composed of HMMs (hidden Markov models) or the like; a language dictionary 25 that is likewise referred to in the collation process of the collation unit 23 and is composed of a statistical language model (n-gram) or the like; a language processing unit 27 that analyzes the word string obtained as a result of the collation process in the collation unit 23, reorganizes it into phrase units, and expands homophones; a word dictionary 28 that is referred to in the language processing of the language processing unit 27 and stores various information about words; and a candidate output unit 29 that manages the history of the language processing results of the language processing unit 27 and outputs candidates.
[0017]
In this embodiment configured as described above, assume that speech uttered by the user is input to the voice input unit 11 and that the speech recognition unit 12 recognizes that single utterance immediately after it ends. The following description takes as an example the case where the speech recognition unit 12 outputs candidate information such as that shown in FIG. 3 or FIG. 4 and this information is stored in the candidate information holding unit 13.
[0018]
On receiving the candidate information of the recognition result from the candidate information holding unit 13, the candidate display control unit 15 immediately takes the pair of reading (pronunciation) and headword judged most appropriate and causes the display unit 16 to present a display annotated with the reading (pronunciation), as shown in FIG. 5. This "reading" is the same kind of information as so-called ruby (reading kana).
[0019]
The candidate selection operation unit 14 provides an operation interface with which the user, using a keyboard, mouse, or the like, can select and designate the part of the display on the display unit 16 that is to be the target of candidate selection, perform an operation to display candidates, and select and designate another candidate from among the displayed candidates. Details are given later.
[0020]
Next, candidate information output by the speech recognition unit 12 will be described.
[0021]
FIG. 3 shows an example of the candidate information output by the speech recognition unit 12. For the multiple speech recognition results obtained by the speech recognition unit 12, the candidate information stores the series judged most appropriate (the first-ranked series) as well as multiple candidates that share the same phrase boundaries. Each candidate has a reading (pronunciation string) and a headword (the notation in mixed kanji and kana). The candidate information also stores candidates with the same reading (pronunciation) but a different notation (homophones) and candidates with a different reading (pronunciation) (differently pronounced words).
[0022]
As shown in FIG. 3, phrase number 1 holds two candidates, including homophones, for the pronunciation 「こころから」 (kokorokara), recognized in the section from position 0 to 40 in the speech. Of these, the notation judged most appropriate is 「心から」 ("from the heart").
[0023]
Phrase number 2 holds a total of ten candidates, including homophones, for the two pronunciations 「あつく」 (atsuku) and 「あつくも」 (atsukumo), recognized in the section from position 40 to 60 that follows phrase number 1. Of these, 「熱く」 ("hot") is judged most appropriate. Its homophones include 「厚く」 ("thick").
[0024]
Phrase number 3 holds a total of eleven candidates, including homophones, for the four pronunciations 「おれい」 (orei), 「おんれい」 (onrei), 「おんで」 (onde), and 「おれへ」 (orehe), recognized in the section from position 60 to 88 that follows phrase number 2. Of these, 「お礼」 ("thanks") is judged most appropriate. For each of the other pronunciations in the same section, the candidates judged most appropriate are 「御礼」 (onrei), 「恩で」 (onde), and 「俺へ」 (orehe, "to me").
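The candidate information described for FIG. 3 can be pictured as a flat list of records. The following is a minimal sketch only: the class name `Candidate`, the field names, and the numeric `rank` field are hypothetical, and the list holds just the candidates named in the text rather than the full contents of FIG. 3. The relative order of the differently pronounced candidates is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    phrase_no: int  # phrase number
    start: int      # start position of the section in the speech
    end: int        # end position of the section in the speech
    reading: str    # pronunciation string
    headword: str   # kanji-kana notation
    rank: int       # hypothetical field: 0 = judged most appropriate


# Only the candidates named in the description; FIG. 3 holds more.
CANDIDATES = [
    Candidate(1, 0, 40, "こころから", "心から", 0),
    Candidate(2, 40, 60, "あつく", "熱く", 0),
    Candidate(2, 40, 60, "あつく", "厚く", 1),
    Candidate(2, 40, 60, "あつくも", "厚くも", 2),
    Candidate(3, 60, 88, "おれい", "お礼", 0),
    Candidate(3, 60, 88, "おんれい", "御礼", 1),
    Candidate(3, 60, 88, "おんで", "恩で", 2),
    Candidate(3, 60, 88, "おれへ", "俺へ", 3),
]


def top_series(candidates):
    """Return the best-ranked candidate of each phrase, in phrase order."""
    best = {}
    for c in candidates:
        if c.phrase_no not in best or c.rank < best[c.phrase_no].rank:
            best[c.phrase_no] = c
    return [best[n] for n in sorted(best)]


print([c.headword for c in top_series(CANDIDATES)])  # ['心から', '熱く', 'お礼']
```

Keeping the reading and the written form in the same record is what lets a later step filter by one while displaying the other.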
[0025]
FIG. 4 shows another example of the candidate information output by the speech recognition unit 12. Phrase numbers 1 to 3 are the same as in FIG. 3. In addition, 「ここのか」 (kokonoka), recognized in the section from position 0 to 36 in the speech, is stored as phrase number 4; 「だす」 (dasu), recognized in the following section from position 36 to 52, is stored as phrase number 7; and 「こんれい」 (konrei), recognized in the following section from position 52 to 88, is stored as phrase number 9, each together with its homophones. Among these candidates, 「９日」 ("the 9th"), 「出す」 ("put out"), and 「婚礼」 ("wedding") are judged most appropriate.
[0026]
The candidate information of FIG. 4 further stores 「ここも」 (kokomo), recognized in the section from position 0 to 32 in the speech, as phrase number 5; 「ただ」 (tada), recognized in the following section from position 32 to 46, as phrase number 6; and 「すっとんで」 (suttonde), recognized in the following section from position 46 to 88, as phrase number 8, each together with its homophones. Among these candidates, 「ここも」 ("here also"), 「ただ」 ("just"), and 「すっ飛んで」 are judged most appropriate.
[0027]
In other words, 「心から」, 「熱く」, and 「お礼」 of phrase numbers 1 to 3 form the first-ranked series, and the candidate information also stores other series with different phrase boundaries: 「９日」 (kokonoka), 「出す」, and 「婚礼」 of phrase numbers 4, 7, and 9, and 「ここも」, 「ただ」, and 「すっ飛んで」 of phrase numbers 5, 6, and 8.
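The three series of FIG. 4 follow from the section positions alone: a series is a chain of phrases in which each phrase starts where the previous one ends. The sketch below illustrates this under stated assumptions. The tuple layout and the function name `enumerate_series` are hypothetical, and the description does not say that the device recovers the series in this way.

```python
# (phrase number, start, end, best headword), as given for FIG. 4.
SEGMENTS = [
    (1, 0, 40, "心から"), (2, 40, 60, "熱く"), (3, 60, 88, "お礼"),
    (4, 0, 36, "９日"), (7, 36, 52, "出す"), (9, 52, 88, "婚礼"),
    (5, 0, 32, "ここも"), (6, 32, 46, "ただ"), (8, 46, 88, "すっ飛んで"),
]


def enumerate_series(segments, start, end):
    """Yield every chain of segments covering start..end without gaps."""
    if start == end:
        yield []
        return
    for seg in segments:
        if seg[1] == start:
            # Continue the chain from the position where this segment ends.
            for rest in enumerate_series(segments, seg[2], end):
                yield [seg] + rest


for series in enumerate_series(SEGMENTS, 0, 88):
    print([seg[0] for seg in series], " / ".join(seg[3] for seg in series))
```

With the positions given in the text, this yields exactly the phrase-number chains 1-2-3, 4-7-9, and 5-6-8, because no two series share an intermediate boundary.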
[0028]
Here, operations of the candidate selection operation unit 14, the candidate display control unit 15, and the display unit 16 will be described.
[0029]
Assume that, in the initial state, the display unit 16 shows 「こころから／心から」, 「あつく／熱く」, 「おれい／お礼」, and 「もうしあげます／申し上げます」, as in FIG. 5.
[0030]
First, consider the case where the user issues a candidate display instruction for a headword, for example by selecting 「熱く」 and instructing candidate display. When this operation is performed, the operation information is passed to the candidate display control unit 15 through the candidate selection operation unit 14. Of the candidates corresponding to 「熱く」 in the candidate information held in the candidate information holding unit 13, the candidate display control unit 15 causes the display unit 16 to display those that are homophones of 「熱く」, as shown in FIG. 6.
[0031]
In this embodiment, when a predetermined mode switching operation is given to the candidate display control unit 15, the unit replaces the display information of FIG. 6 with a candidate display that also includes candidates with different readings (pronunciations), as shown in FIG. 7. The candidate displays of FIG. 6 and FIG. 7 are preferably switchable back and forth in response to the mode switching operation.
[0032]
When 「厚く」 is then selected from the candidates displayed in FIG. 6 (or FIG. 7), the display unit 16 displays the selected 「厚く」 as shown in FIG. 8. When 「厚くも」 is selected from the candidates displayed in FIG. 7, the display unit 16 displays the selected 「厚くも」 together with its reading (pronunciation) 「あつくも」, as shown in FIG. 9.
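The two candidate lists for a selected written form (homophones only, or every candidate in the section after the mode switch) amount to a filter on the reading. The following is a rough sketch, not the patented implementation: the function name, the `show_all` flag, and the decision to include the currently selected form in the list are assumptions, and only the candidates named in the text are used.

```python
# Named candidates for phrase number 2, best first (subset; order assumed).
PHRASE_2 = [("あつく", "熱く"), ("あつく", "厚く"), ("あつくも", "厚くも")]


def headword_candidates(candidates, selected_headword, show_all=False):
    """Return the written forms to list when a written form is selected."""
    if show_all:
        # After the mode switch: every candidate in the same section.
        return [h for _, h in candidates]
    # Default mode: only candidates sharing the selected form's reading.
    reading = next(r for r, h in candidates if h == selected_headword)
    return [h for r, h in candidates if r == reading]


print(headword_candidates(PHRASE_2, "熱く"))                 # ['熱く', '厚く']
print(headword_candidates(PHRASE_2, "熱く", show_all=True))  # ['熱く', '厚く', '厚くも']
```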
[0033]
Next, consider the case where the user issues a candidate display instruction for a reading, for example by selecting 「おれい」 and instructing candidate display. In this case, of the candidates corresponding to 「お礼」 in the candidate information, the readings (pronunciations) that differ from 「おれい」 are displayed as shown in FIG. 10. It is preferable to allow switching to a mode in which the notation is displayed along with the reading (pronunciation), as shown in FIG. 11. The notation displayed is preferably the one judged most appropriate among the candidates for that reading (pronunciation).
[0034]
When 「おんれい」 is selected from the candidates displayed in FIG. 10, the display unit 16 displays the selected 「おんれい」 and its corresponding notation 「御礼」, as shown in FIG. 12. The same applies when 「おんれい／御礼」 is selected from the candidates displayed in FIG. 11.
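When a reading is selected instead, the list is built per pronunciation: one entry for each reading that differs from the current one, optionally paired with the best written form for that reading. This can be sketched as below. The function name and the `with_headword` flag are hypothetical, the input is assumed to be ordered best first, and only the candidates named in the text are included (homophones of each reading would simply be skipped after the first).

```python
# Named candidates for phrase number 3, best first (subset; order assumed).
PHRASE_3 = [("おれい", "お礼"), ("おんれい", "御礼"),
            ("おんで", "恩で"), ("おれへ", "俺へ")]


def reading_candidates(candidates, current_reading, with_headword=False):
    """List the readings that differ from the current one."""
    best = {}  # reading -> first (best) written form seen for it
    for reading, headword in candidates:
        if reading != current_reading and reading not in best:
            best[reading] = headword
    if with_headword:
        return [f"{r}／{h}" for r, h in best.items()]
    return list(best)


print(reading_candidates(PHRASE_3, "おれい"))
# ['おんれい', 'おんで', 'おれへ']
print(reading_candidates(PHRASE_3, "おれい", with_headword=True))
# ['おんれい／御礼', 'おんで／恩で', 'おれへ／俺へ']
```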
[0035]
According to this embodiment, the user can easily arrive at an appropriate combination of reading (pronunciation) and notation through the selective candidate displays for headwords and readings, and can thereby input (for example, correct) the desired text.
[0036]
Here, another candidate display process based on the above-described configuration is described. In this process, only the “reading” is displayed, without the heading.
[0037]
The candidate display control unit 15 uses only the readings (pronunciations) judged most appropriate and causes the display unit 16 to display readings only, as shown in FIG. 13.
[0038]
In this case, when the user selects the reading (pronunciation) “Ore” and gives a candidate display instruction via the candidate selection operation unit 14, the readings (pronunciations) that differ from “Ore” among the candidates corresponding to “thank you” in the candidate information are displayed as shown in FIG. 14. At this time, it is preferable to enable switching to a mode in which the most appropriate notation is displayed together with each reading (pronunciation), as shown in FIG. 15.
[0039]
Furthermore, when the user performs an operation of selecting and specifying “onrei” from the candidates displayed in FIG. 14, the display unit 16 displays the selected “onrei” as shown in FIG. 16. The same applies when “onrei / thank you” is selected from the candidates displayed in FIG. 15.
[0040]
Next, the processing performed by the candidate display control unit 15 is described with reference to the flowchart of FIG. 17.
[0041]
When candidate information is input from the speech recognition unit 12, the candidate display control unit 15 stores the candidate information in the candidate information holding unit 13 (step S31).
[0042]
Next, the candidate information of the top-ranked series, that is, the series judged most appropriate, is acquired from the candidate information holding unit 13 (step S32), and the setting information that determines whether or not to display headings is checked (step S33). This setting information may be made user-configurable.
[0043]
When the setting is to display headings, display information combining the reading (pronunciation) and the heading is created (step S34). When the setting is not to display headings, display information using only the reading (pronunciation) is created (step S35). In either case the result is displayed on the display unit 16 (step S36). What is displayed in step S36 is the candidate information of the top-ranked series.
[0044]
Thereafter, the candidate display control unit 15 shifts to a standby state for accepting a candidate display instruction from the user (step S37).
[0045]
When the user gives a candidate display instruction, display information for the candidates offered for selection is created (step S38) and displayed by the display unit 16. The processing of step S38 is described later. Following this candidate display, the unit shifts to a standby state for accepting a candidate selection operation from the user (step S39).
[0046]
When a candidate is selected, the display of the display unit 16 is updated using the reading (pronunciation) and heading pair of the designated candidate (step S40), and the unit returns to the standby state for a candidate display instruction from the user (step S37).
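The flow of steps S31 to S36 and S40 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual implementation: the class name, the method names, the `heading(reading)` rendering format, and the sample data are hypothetical, and the standby states (S37, S39) and the candidate list creation (S38) are left out.

```python
class CandidateDisplayController:
    """Toy stand-in for the candidate display control unit 15."""

    def __init__(self, show_heading=True):
        self.show_heading = show_heading  # setting checked in step S33
        self.store = []                   # stands in for holding unit 13
        self.current = []                 # pair displayed for each phrase

    def receive(self, candidate_info):
        """S31-S32: store the candidates, then take the top-ranked series."""
        # candidate_info holds one list per phrase of (reading, heading)
        # pairs, assumed to be sorted best-first by the recognizer.
        self.store = [list(phrase) for phrase in candidate_info]
        self.current = [phrase[0] for phrase in self.store]

    def render(self):
        """S33-S36: build the display text for the current series."""
        if self.show_heading:
            parts = [f"{h}({r})" for r, h in self.current]  # S34
        else:
            parts = [r for r, _ in self.current]            # S35
        return " ".join(parts)                              # shown in S36

    def select(self, phrase_index, pair):
        """S40: replace one phrase with the chosen candidate pair."""
        if pair not in self.store[phrase_index]:
            raise ValueError("pair is not a stored candidate")
        self.current[phrase_index] = pair
        return self.render()


# Illustrative data only; not taken from the figures.
INFO = [
    [("あつく", "厚く"), ("あつく", "熱く")],
    [("おれい", "お礼"), ("おんれい", "御礼")],
]

controller = CandidateDisplayController(show_heading=True)
controller.receive(INFO)
print(controller.render())
print(controller.select(1, ("おんれい", "御礼")))
```

Rendering from a single per-phrase list of pairs keeps the heading-on and heading-off settings in step with each other after every selection.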
[0047]
Next, details of the candidate display processing for candidate selection (step S38) in the candidate display control unit 15 are described with reference to the flowchart of FIG. 18.
[0048]
First, when an instruction operation from the user to display candidates (for example, a mouse click) is detected, it is determined whether the designated location is a heading or a reading (pronunciation) (step S41). When a heading is designated, a predetermined setting as to whether to display all candidates is referred to (step S42). If the setting is not to display all candidates, only the homophone candidates, for example, are extracted from the candidate information holding unit 13 (step S43). If the setting is to display all candidates, all candidates in the same section are extracted (step S44). The candidates extracted according to these settings are displayed on the display unit 16 in step S45.
[0049]
On the other hand, if it is determined in step S41 that the designated location is a reading (pronunciation), candidates with different readings (pronunciations), together with the notation candidate judged most appropriate for each of those readings, are extracted from the candidate information holding unit 13 (step S46). A predetermined setting as to whether the heading display is used together is then referred to (step S47). According to this setting, either only the readings (pronunciations) are offered as candidates (step S48), or each reading (pronunciation) is combined with its notation (heading) and offered as a candidate (step S49). The candidates are then displayed on the display unit 16 in step S45.
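The branching of steps S41 to S49 can be sketched as a single Python function. This is an illustrative sketch only: the function name, its parameters, the `reading / heading` list format, and the sample data are assumptions, and the best-first ordering of the stored pairs is assumed to come from the recognizer.

```python
def list_candidates(phrase, current, clicked, show_all=False, with_heading=False):
    """Return the strings to offer in the candidate list (steps S41-S49).

    phrase:  list of (reading, heading) pairs, assumed sorted best-first.
    current: the (reading, heading) pair currently displayed.
    clicked: "heading" or "reading", i.e. the outcome of step S41.
    """
    cur_reading = current[0]
    if clicked == "heading":
        if show_all:                                            # S42 -> S44
            pairs = phrase
        else:                                                   # S42 -> S43
            pairs = [p for p in phrase if p[0] == cur_reading]
        # Drop duplicate headings while keeping the ranking order.
        return list(dict.fromkeys(h for _, h in pairs))
    if clicked == "reading":
        # S46: for each reading other than the current one, keep the
        # first (best-ranked) notation found.
        best = {}
        for reading, heading in phrase:
            if reading != cur_reading and reading not in best:
                best[reading] = heading
        if with_heading:                                        # S47 -> S49
            return [f"{r} / {h}" for r, h in best.items()]
        return list(best)                                       # S47 -> S48
    raise ValueError(f"unknown click target: {clicked}")


# Illustrative data only; not taken from the figures.
PHRASE = [
    ("あつく", "厚く"),
    ("あつく", "熱く"),
    ("あつくも", "厚くも"),
    ("あつく", "暑く"),
    ("あつくも", "熱くも"),
]
CURRENT = ("あつく", "厚く")

print(list_candidates(PHRASE, CURRENT, "heading"))
print(list_candidates(PHRASE, CURRENT, "reading", with_heading=True))
```

Clicking a heading therefore narrows the list to homophones by default, while clicking a reading lists only the other pronunciations, which is the separation the embodiment relies on.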
[0050]
Modifications of the above embodiment are described next.
[0051]
In the above-described embodiment, hiragana readings are used as the pronunciation character string displayed in combination with a heading, but katakana or romaji may be used instead. Also, a notation similar to actual pronunciation such as “Ore” may be used for “thank you” instead of “Ore”. Further, a display indicating the accent type may be added.
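Switching the displayed reading from hiragana to katakana can be done with a simple code point shift, as in the Python sketch below. The function name is hypothetical. The sketch relies on the Unicode layout in which the hiragana letters U+3041 to U+3096 sit exactly 0x60 below their katakana counterparts; romaji conversion would need a mapping table and is not shown.

```python
def hiragana_to_katakana(text: str) -> str:
    """Convert hiragana letters to katakana; leave other characters as-is."""
    converted = []
    for ch in text:
        code = ord(ch)
        if 0x3041 <= code <= 0x3096:
            # Katakana letters are offset by 0x60 from hiragana in Unicode.
            converted.append(chr(code + 0x60))
        else:
            converted.append(ch)
    return "".join(converted)


print(hiragana_to_katakana("おれい"))
```

Because characters outside the hiragana range pass through unchanged, the same function can be applied safely to strings that already contain kanji or katakana.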
[0052]
In the above embodiment the target language is Japanese, but another language may be used. For example, for Chinese, Pinyin or phonograms may be used as the pronunciation character strings. A tone display may also be added.
[0053]
In the above embodiment, candidates are displayed per phrase, but they may instead be displayed per word or per some other unit.
[0054]
The candidate selection operation may be performed using a keyboard, a mouse, a pen, voice operation, or the like. The present invention is applicable to any device that can specify a selection target and instruct execution.
[0055]
In the above-described embodiment, candidates are displayed by a so-called pop-up window, but a method of enumerating and displaying in another area such as the lower end of the screen may be used.
[0056]
In the above embodiment, the different notation candidates for the same reading (pronunciation) are generated in advance in the speech recognition processing. However, the speech recognition processing may instead handle only the candidates with different readings (pronunciations), with the language processing that expands them into candidates of different notations performed separately. For example, the expansion into candidates of different notations may be performed when a candidate display operation is performed.
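Deferring the notation expansion until a candidate display operation can be sketched in Python as follows. This is a hypothetical illustration: the `NOTATIONS` table stands in for a separate language processing component, and the function names are assumptions.

```python
from functools import lru_cache

# Hypothetical reading -> notations table standing in for a separate
# language processing component such as a kana-kanji converter.
NOTATIONS = {
    "おれい": ("お礼", "御礼"),
    "おんれい": ("御礼",),
}


@lru_cache(maxsize=None)
def expand_notations(reading):
    """Expand one reading into notation candidates on first request."""
    # Fall back to the reading itself when no notation is known.
    return NOTATIONS.get(reading, (reading,))


def on_heading_clicked(reading):
    """Called on a candidate display operation; expands lazily."""
    return list(expand_notations(reading))


# The recognizer is assumed to output readings only.
recognized = ["おれい", "おんれい"]
print(on_heading_clicked(recognized[0]))
```

With this arrangement, notations are computed only for the phrases the user actually opens, and the cache avoids repeating the expansion for the same reading.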
[0057]
In the above embodiment, the speech recognition unit 12 includes the language processing unit 27. Instead of the language processing unit 27, however, a language processing unit intended mainly for keyboard input, such as kana-kanji conversion, may be attached externally to the speech recognition unit 12.
[0058]
The present invention is not limited to the above-described embodiments and modifications, and can be implemented with various modifications. The present invention is effective as a method for configuring means for inputting text in various information processing apparatuses, and can be widely used in software for personal computers, word processor devices, portable information devices, and the like.
[0059]
[Effect of the invention]
As described above, according to the present invention, it is possible to provide a text input device, method, and program provided with a user interface that allows a user to appropriately and easily edit input text based on a speech recognition result.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of an embodiment of a text input device according to the present invention.
FIG. 2 is a block diagram showing a schematic configuration of the speech recognition unit 12 shown in FIG. 1.
FIG. 3 is a diagram illustrating an example of candidate information output by the speech recognition unit 12.
FIG. 4 is a diagram showing another example of candidate information output by the speech recognition unit 12.
FIG. 5 is a diagram showing speech input text in an initial state.
FIG. 6 is a diagram for explaining a case where a candidate display instruction is given for a “heading”.
FIG. 7 is a diagram showing a case where a candidate display that includes, in addition to the display contents of FIG. 6, candidates with different readings (pronunciations) (allophone words and different words) is performed.
FIG. 8 is a diagram showing a selection operation for the display candidates shown in FIG. 6.
FIG. 9 is a diagram showing a selection operation for the display candidates shown in FIG. 7.
FIG. 10 is a diagram for explaining a case where a candidate display instruction is given for a “reading”.
FIG. 11 is a diagram for explaining a case where candidates are displayed with the notation in addition to the reading (pronunciation).
FIG. 12 is a diagram showing a selection operation for the display candidates shown in FIG. 10.
FIG. 13 is a diagram for explaining an embodiment in which only the “reading” is displayed, without the heading.
FIG. 14 is a diagram for explaining a case where a candidate display instruction is given for a certain “reading” in the display of FIG. 13.
FIG. 15 is a diagram for explaining a case where candidates are displayed with the most appropriate notation in addition to the reading (pronunciation).
FIG. 16 is a diagram showing a selection operation for the display candidates shown in FIG. 14.
FIG. 17 is a flowchart illustrating an example of the processing in the candidate display control unit 15.
FIG. 18 is a flowchart showing the candidate selection display processing (step S38) in the flowchart of FIG. 17.
[Explanation of symbols]
11 ... Voice input part
12 ... Speech recognition unit
13 ... Candidate information holding unit
14 ... Candidate selection operation unit
15 ... Candidate display control unit
16 ... Display unit
21 ... Voice input part
22 ... Sound processing section
23 ... Verification unit
24 ... Acoustic dictionary (HMM)
25 ... Language dictionary (n-gram)
26 ... Word dictionary
27 ... Language processing unit
28 ... Word dictionary
29 ... Candidate output section

Claims

A text input processing device that inputs and processes text using speech recognition, comprising:
Storage means for storing a speech recognition result comprising a plurality of pieces of phrase information, wherein the phrase information for the same phrase has a plurality of pronunciation character string candidates including a plurality of allophone words, and heading character string candidates that correspond to those candidates and include a plurality of homophones having different notations;
First display control means for performing control to display any one of the plurality of pronunciation character string candidates in the same phrase, side by side for each phrase;
First phrase selection means for performing control such that, when any one of the pronunciation character strings for each phrase displayed under the control of the first display control means is designated using the input means, that phrase is selected;
First candidate display control means for performing control to display a list of a plurality of pronunciation character string candidates based on the phrase information for the phrase selected by the first phrase selection means;
First selection means for performing control so that any one allophone word can be selected by the input means from among the pronunciation character string candidates displayed in the list;
Second display control means for performing control to display, together with the pronunciation character string candidate displayed under the control of the first display control means or selected by the first selection means, a heading character string candidate corresponding to that pronunciation character string candidate;
Second phrase selection means for performing control such that, when any one of the heading character strings for each phrase displayed under the control of the second display control means is designated using the input means, that phrase is selected;
Second candidate display control means for performing control to display, for the phrase selected by the second phrase selection means, a list of those heading character string candidates based on the phrase information that correspond to the pronunciation character string candidate displayed under the control of the first display control means or selected by the first selection means; and
Second selection means for performing control so that any one homophone can be selected by the input means from among the heading character string candidates displayed in the list.

A text input processing device, wherein the speech recognition targets Japanese, the pronunciation character string is written in hiragana, katakana, or romaji, and the heading character string is a character string containing kanji characters.

2. The text input processing device according to claim 1, wherein the first candidate display control unit performs control to display a character string mixed with kanji characters corresponding to a pronunciation character string candidate in a pronunciation character string candidate list.

A text input processing method for inputting text by using speech recognition, comprising:
A storage step of storing a speech recognition result comprising a plurality of pieces of phrase information, wherein the phrase information for the same phrase has a plurality of pronunciation character string candidates including a plurality of allophone words, and heading character string candidates that correspond to those candidates and include a plurality of homophones having different notations;
A first display control step of performing control to display any one of the plurality of pronunciation character string candidates in the same phrase, side by side for each phrase;
A first phrase selection step of performing control such that, when any one of the pronunciation character strings for each phrase displayed under the control of the first display control step is designated using the input means, that phrase is selected;
A first candidate display control step of performing control to display a list of a plurality of pronunciation character string candidates based on the phrase information for the phrase selected in the first phrase selection step;
A first selection step of performing control so that any one allophone word can be selected by the input means from among the pronunciation character string candidates displayed in the list;
A second display control step of performing control to display, together with the pronunciation character string candidate displayed under the control of the first display control step or selected in the first selection step, a heading character string candidate corresponding to that pronunciation character string candidate;
A second phrase selection step of performing control such that, when any one of the heading character strings for each phrase displayed under the control of the second display control step is designated using the input means, that phrase is selected;
A second candidate display control step of performing control to display, for the phrase selected in the second phrase selection step, a list of those heading character string candidates based on the phrase information that correspond to the pronunciation character string candidate displayed under the control of the first display control step or selected in the first selection step; and
A second selection step of performing control so that any one homophone can be selected by the input means from among the heading character string candidates displayed in the list.

A program that processes text input using speech recognition, the program causing a computer to function as:
Storage means for storing a speech recognition result comprising a plurality of pieces of phrase information, wherein the phrase information for the same phrase has a plurality of pronunciation character string candidates including a plurality of allophone words, and heading character string candidates that correspond to those candidates and include a plurality of homophones having different notations;
First display control means for performing control to display any one of the plurality of pronunciation character string candidates in the same phrase, side by side for each phrase;
First phrase selection means for performing control such that, when any one of the pronunciation character strings for each phrase displayed under the control of the first display control means is designated using the input means, that phrase is selected;
First candidate display control means for performing control to display a list of a plurality of pronunciation character string candidates based on the phrase information for the phrase selected by the first phrase selection means;
First selection means for performing control so that any one allophone word can be selected by the input means from among the pronunciation character string candidates displayed in the list;
Second display control means for performing control to display, together with the pronunciation character string candidate displayed under the control of the first display control means or selected by the first selection means, a heading character string candidate corresponding to that pronunciation character string candidate;
Second phrase selection means for performing control such that, when any one of the heading character strings for each phrase displayed under the control of the second display control means is designated using the input means, that phrase is selected;
Second candidate display control means for performing control to display, for the phrase selected by the second phrase selection means, a list of those heading character string candidates based on the phrase information that correspond to the pronunciation character string candidate displayed under the control of the first display control means or selected by the first selection means; and
Second selection means for performing control so that any one homophone can be selected by the input means from among the heading character string candidates displayed in the list.