JP2019144310A - Information processor, information processing system, control method and program

Info

Publication number: JP2019144310A
Application number: JP2018026120A
Authority: JP
Inventors: 下郡山　敬己; Itsuki Shimokooriyama; 敬己下郡山
Original assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Current assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Priority date: 2018-02-16
Filing date: 2018-02-16
Publication date: 2019-08-29
Anticipated expiration: 2038-02-16
Also published as: JP7231806B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique for determining the proofreading priority of character strings produced by speech recognition so that a proofreader can correct them in priority order. SOLUTION: An information processing apparatus includes storage means for storing per-utterance text, obtained by performing speech recognition on voice data containing a plurality of utterances, together with a first certainty factor indicating the likelihood of the speech recognition result for that text. The apparatus comprises: specifying means for specifying, on the basis of the first certainty factor, the per-utterance text to be proofread; display control means for displaying the specified per-utterance text so that it can be identified; and update means for accepting corrections to the per-utterance text and updating the text for which a correction was accepted. SELECTED DRAWING: Figure 3

Description

The present invention relates to a technique for assisting in the correction of errors in speech recognition results. When a character string containing recognition errors is proofread, the technique presents priorities so that the most important parts are corrected first, thereby improving the accuracy of information accessibility for readers who see the final result.

Research and development on speech recognition, which converts human speech into character strings, has been under way for many years. Practical applications include adding subtitles to television broadcasts and helping deaf people understand what others say.

In recent years in particular, advances in machine learning and related fields have brought recognition accuracy to a practical level. Even so, sufficient accuracy has not yet been achieved, and the speaker's manner of speaking still needs attention. For example, recognition accuracy varies greatly with the distance between the microphone and the mouth and with the clarity of the speech.

For this reason, software for correcting the character strings of speech recognition results, for example on a personal computer, has also been realized.

A service known as “PC summary transcription” has long existed, in which an operator listens to a speaker's utterances and types them into a personal computer for deaf people. With the advent of software that uses speech recognition technology, it has become possible to assist the person doing the typing.

Such software generally segments the speech at points such as pauses in the utterance and displays the speech recognition results (character strings) on the screen in chronological order, segment by segment. A proofreader trained in PC summary transcription then corrects those character strings.

However, speech is usually faster than the work of correcting recognition results, that is, keyboard input on an information processing apparatus, so the burden on the proofreader has not yet been sufficiently reduced.

Patent Document 1 provides a speech recognition result editing apparatus that supports a proofreader in correcting speech recognition errors.

In the technique of Patent Document 1, an utterance input from a microphone is converted by a speech recognition unit into a character string in which each word is assigned a reliability. The speech recognition result set includes not only the word with the highest reliability but also words that satisfy a predetermined condition, for example words whose certainty factor is at or above a certain value. The proofreader can therefore select the correct recognition result from multiple candidate words and make the correction (Patent Document 1, paragraph 0013 and FIG. 8).

Patent Document 1: JP 2017-040856 A

However, a single utterance contains multiple words, so with the technique of Patent Document 1 the number of recognition results becomes enormous when there are many words, and it becomes difficult to display them clearly on a display device. In PC summary transcription in particular, corrections are made by judging from context so that the text is easy to understand, so the proofreader may need to check the character strings of utterances preceding the one being corrected, yet there may be no screen area left to show them. In other words, simply displaying candidate recognition results may not be efficient support for the proofreader.

In addition, when there are not enough proofreaders, it may be impossible to correct every speech recognition error. In that case the parts to correct must be prioritized. The technique described in Patent Document 1, however, merely decides that a speech recognition result left uncorrected for a certain time will not be corrected, regardless of its importance (the timeout process described in paragraph 0022 and FIG. 5 of Patent Document 1). As a result, even important information is discarded or displayed uncorrected once that time has elapsed.

Ideally, the character strings corresponding to all utterances would be corrected by a proofreader, but in practice this is not always possible. When it is not, the proofreader must be guided to correct in an appropriate order of priority so that the people who view the corrected results (for example, deaf people) receive information that is as easy to understand as possible.

In view of the above problems, an object of the present invention is to provide a technique for determining the correction priority of character strings resulting from speech recognition and having a proofreader correct them accordingly.

The present invention is an information processing apparatus comprising storage means for storing per-utterance text, obtained by performing speech recognition on voice data containing a plurality of utterances, and a first certainty factor indicating the likelihood of the speech recognition result for that text. The apparatus is characterized by comprising: specifying means for specifying, on the basis of the first certainty factor, the per-utterance text to be corrected; display control means for displaying the specified per-utterance text so that it can be identified; and update means for accepting a correction to the per-utterance text and updating the text for which the correction was accepted.
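
As an illustration only, the four means named above might be sketched as follows. The class and method names, the 0.0 to 1.0 certainty scale, and the threshold of 0.7 are assumptions introduced for this sketch; the specification does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Per-utterance text with its first certainty factor (assumed 0.0 to 1.0)."""
    utterance_id: str
    text: str
    certainty: float


class RecognitionStore:
    """Hypothetical storage means holding utterances keyed by ID."""

    def __init__(self, utterances):
        self._items = {u.utterance_id: u for u in utterances}

    def to_correct(self, threshold=0.7):
        """Specifying means: utterances whose certainty is below the assumed threshold."""
        return [u for u in self._items.values() if u.certainty < threshold]

    def render(self, threshold=0.7):
        """Display control means: prefix utterances to be corrected with '*'."""
        flagged = {u.utterance_id for u in self.to_correct(threshold)}
        return [
            f"{'*' if u.utterance_id in flagged else ' '} {u.utterance_id}: {u.text}"
            for u in self._items.values()
        ]

    def update(self, utterance_id, corrected_text):
        """Update means: accept a correction and replace the stored text."""
        self._items[utterance_id].text = corrected_text
```

A fixed threshold is only one way to specify text on the basis of the certainty factor; a ranking over all utterances would serve equally well in this sketch.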

The present invention makes it possible to provide a technique for determining the correction priority of character strings resulting from speech recognition and having a proofreader correct them accordingly.

FIG. 1 is a diagram showing an example of a system configuration according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an example of the hardware configuration of an information processing apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of a functional configuration according to the embodiment of the present invention.
FIG. 4 is a diagram showing an example of a screen that displays speech recognition results according to the embodiment of the present invention.
FIG. 5 is a diagram showing an example of processing from voice input to distribution of proofreading results according to the embodiment of the present invention.
FIG. 6 is a diagram showing an example of the data format of recognition results and their certainty factors according to the embodiment of the present invention.
FIG. 7 is a diagram showing an example of a flowchart describing processing from analysis of speech recognition results to prioritization for proofreading according to the embodiment of the present invention.
FIG. 8 is a diagram showing an example of a flowchart describing prioritization processing for proofreading according to the embodiment of the present invention.
FIG. 9 is a diagram showing an example of a storage unit that stores information used in priority processing according to the embodiment of the present invention.
FIG. 10 is a diagram showing an example of the result of recalculating the certainty factors of speech recognition results according to the embodiment of the present invention.
FIG. 11 is a diagram showing an example of a user interface that displays speech recognition results according to the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a diagram illustrating an example of a system configuration according to an embodiment of the present invention.
<System configuration example 1>

The system according to the embodiment of the present invention comprises a speech recognition server 101 and information processing terminals 102 (102a for the speaker, 102b for the proofreader, and 102c for the reader). A user inputs speech through a microphone 104 connected to the information processing terminal 102a. The information processing terminal 102a transmits the speech to the speech recognition server 101, which converts it into a character string and sends it to the information processing terminals 102a to 102c, where it is displayed and presented to the users. That is, each of the information processing terminals 102a to 102c may handle both speech input and character string output. Among the information processing terminals 102 used for output, a single terminal may serve as both the reader terminal 102c and the proofreader terminal 102b described later, or each may be a dedicated terminal. Output is made to a display device connected to the information processing terminal 102, but a configuration using a projector or the like is also a system configuration according to the embodiment of the present invention. When a projector is used, there may be only one information processing terminal 102, the one for the speaker, and all readers may read the speech recognition result projected onto a screen from a projector connected to that information processing terminal 102a. In that case, the speaker or another user acting as proofreader may correct misrecognitions directly on the speaker's information processing terminal 102a.

Furthermore, the speech recognition server 101 may reside in the cloud, in which case users of the system may use the functions of the speech recognition server 101 described later as a cloud service. A form that uses such services is also a system configuration according to the embodiment of the present invention.
<System configuration example 2>

The information processing terminals 102a to 102c described in configuration example 1 handle both input and output, but they may instead be divided into input-only and output-only terminals.
<System configuration example 3>

The speech recognition server 101 and the information processing terminals 102a to 102c may be housed in the same enclosure. That is, speech recognition software may be installed on one of the information processing terminals 102a to 102c in FIG. 1 so that it also serves as the speech recognition server 101.

FIG. 2 is a block diagram showing an example of a hardware configuration applicable to the speech recognition server 101 and the information processing terminals 102a to 102c according to the embodiment of the present invention.

As shown in FIG. 2, the speech recognition server 101 and the information processing terminals 102a to 102c each have a configuration in which a CPU (Central Processing Unit) 201, a RAM (Random Access Memory) 202, a ROM (Read Only Memory) 203, an input controller 205, a video controller 206, a memory controller 207, a communication I/F controller 208, and the like are connected via a system bus 204. The CPU 201 comprehensively controls the devices and controllers connected to the system bus 204.

The ROM 203 or the external memory 211 stores the BIOS (Basic Input/Output System) and OS (Operating System), which are control programs for the CPU 201, and the various programs described later that are needed to realize the functions executed by each server or PC. It also stores the information needed to carry out the present invention. The external memory may be a database.

The RAM 202 functions as the main memory, work area, and the like for the CPU 201. When executing processing, the CPU 201 loads the necessary programs and the like from the ROM 203 or the external memory 211 into the RAM 202 and executes them to realize various operations.

The input controller 205 controls input from a keyboard (KB) 209 and from a pointing device such as a mouse (not shown).

The video controller 206 controls display on a display device such as the display 210. The display device may be, for example, a liquid crystal display. These are used by an administrator as needed.

The memory controller 207 controls access to the external memory 211, which stores a boot program, various applications, font data, user files, editing files, various data, and the like. The external memory 211 may be an external storage device such as a hard disk (HD) or a flexible disk (FD), or a CompactFlash (registered trademark) memory connected via an adapter to a PCMCIA (Personal Computer Memory Card International Association) card slot.

The communication I/F controller 208 connects to and communicates with external devices via a network and executes communication control processing on the network. For example, communication using TCP/IP (Transmission Control Protocol/Internet Protocol) is possible.

The CPU 201 enables display on the display 210 by, for example, rasterizing outline fonts into a display information area in the RAM 202. The CPU 201 also enables user instructions through a mouse cursor (not shown) or the like on the display 210.

The various programs described later for realizing the present invention are recorded in the external memory 211 and are executed by the CPU 201 after being loaded into the RAM 202 as needed.
FIG. 3 is a diagram illustrating an example of a functional configuration according to the embodiment of the present invention.

The functions of the information processing terminal 102 for the speaker (102a), the proofreader (102b), and the reader (102c) may be provided on separate terminals or on a shared terminal, so the terminals are not distinguished in the following description.

The voice acquisition unit 311 receives the speaker's utterances as voice data from a microphone built into or connected to the information processing terminal 102, and the voice data transmission unit 312 transmits the voice data to the speech recognition server 101.

The speech recognition server 101 passes the voice data received by the voice data receiving unit 321 to the speech recognition unit 322, which converts it into a character string, and the recognition result transmission unit 323 sends that character string back to the information processing terminal 102 as the recognition result. The recognition result management unit 324 also stores the recognition result in the recognition result storage unit 320.

The information processing terminal 102 receives the character string at the recognition result receiving unit 313 and displays it on the display unit 314, thereby presenting it to the reader (the user of the information processing terminal 102).

The priority determination unit 325 assigns priorities to the recognition results stored in the recognition result storage unit 320 so that a proofreader who corrects speech recognition errors using the information processing terminal 102 can identify which character strings should be corrected first.
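
A minimal sketch of such a priority assignment is given below. It assumes that a lower certainty factor means a higher proofreading priority; the actual criteria used by the priority determination unit 325 are described with reference to later figures, and the function name is hypothetical.

```python
def assign_priorities(certainties):
    """Return {utterance_id: rank}, where rank 1 is proofread first.

    `certainties` maps utterance IDs to certainty factors. Ranking the
    lowest certainty first is an assumption made for this sketch; ties
    are broken by utterance ID so the result is deterministic.
    """
    ordered = sorted(certainties, key=lambda uid: (certainties[uid], uid))
    return {uid: rank for rank, uid in enumerate(ordered, start=1)}
```

Returning ranks rather than a sorted list lets a display keep utterances in chronological order while still marking which one to correct first.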

The prioritized character strings are transmitted to the information processing terminal 102, whose display unit 314 displays them so that the proofreader can identify the correction priority, as described above. The recognition result proofreading unit 315 provides a function that lets the proofreader correct recognition errors by editing the character strings.

The proofreading result is transmitted to the speech recognition server 101 by the proofreading result transmission unit 316 of the information processing terminal 102. The proofreading result receiving unit 326 of the speech recognition server 101 receives it and updates the recognition result stored in the recognition result storage unit 320.

The updated recognition result is also distributed by the proofreading result distribution unit 327 to information processing terminals 102 other than the one the proofreader used, and is presented so that readers can see the proofreading result.
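
The round trip between the terminals and the server described above can be sketched with a toy in-memory model. Everything here is hypothetical: the class name, the method names, the use of plain lists as terminal inboxes, and the injected `recognize` callable standing in for the speech recognition unit 322.

```python
class SpeechRecognitionServer:
    """Toy in-memory stand-in for the speech recognition server 101."""

    def __init__(self, recognize, terminals):
        self.recognize = recognize  # callable: voice data -> text (assumed)
        self.terminals = terminals  # terminal ID -> list of (utterance ID, text)
        self.storage = {}           # recognition result storage: ID -> text

    def on_voice_data(self, utterance_id, voice_data):
        """Recognize the voice data, store the result, and send it to every terminal."""
        text = self.recognize(voice_data)
        self.storage[utterance_id] = text
        for inbox in self.terminals.values():
            inbox.append((utterance_id, text))
        return text

    def on_correction(self, sender_id, utterance_id, corrected_text):
        """Update the stored result and distribute it to terminals other than the sender."""
        self.storage[utterance_id] = corrected_text
        for terminal_id, inbox in self.terminals.items():
            if terminal_id != sender_id:
                inbox.append((utterance_id, corrected_text))
```

A real deployment would replace the inbox lists with network transmission, but the ordering of store, send, update, and distribute follows the description.
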
FIG. 4 is a diagram showing an example of a screen that displays a speech recognition result according to the embodiment of the present invention.

The utterance example 400 is an example of utterances by speakers at a conference, lecture, or similar event. The speaker need not be a single person. In a conference, for example, participants other than the chairperson may speak, and at a lecture there may be utterances by a moderator or questioners in addition to the lecturer.

The utterance example 400 is divided into segments A to K, which are the breaks in the speaker's utterances. A break occurs, for example, where the speech contains a blank (silence) of a certain duration.

Correspondingly, the speech recognition result display screen 401 is also divided into segments corresponding to A to K (display frames 404A to 404K). The speech recognition unit 322 of the speech recognition server 101 delimits the recognized character strings, for example by detecting the silence described above. The character strings are stored in the recognition result storage unit 320 in this delimited form, and the display unit 314 of the information processing terminal 102 displays them separately so that they are easy for the reader to follow. This is only an example; the display frames 404 need not be separated, which is merely a design matter. It suffices that the proofreading priority described later is displayed in a recognizable manner.
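
The silence-based delimiting described above could look like the following sketch. It assumes the recognizer supplies a start and end time for each word and that a gap of at least 0.8 seconds counts as a break; both the input format and the threshold are assumptions, not values from the specification.

```python
def split_on_silence(words, min_gap=0.8):
    """Group (text, start_sec, end_sec) tuples into utterances.

    A new utterance starts when the silence between the end of one word
    and the start of the next is at least `min_gap` seconds (assumed value).
    """
    utterances = []
    current = []
    prev_end = None
    for text, start, end in words:
        if prev_end is not None and start - prev_end >= min_gap:
            utterances.append(current)
            current = []
        current.append(text)
        prev_end = end
    if current:
        utterances.append(current)
    return utterances
```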

The start button 402 is pressed when an utterance is to be recognized by the speech recognition server 101. The system configuration diagram (FIG. 1) shows a plurality of information processing terminals 102 with microphones connected to them, and the button is used to designate which terminal's microphone is being spoken into. Whether utterances can be input to only one information processing terminal 102 or to several terminals simultaneously depends on the system design. Corresponding to the start button 402, there may also be an end button 403 for notifying the information processing terminal 102 that no utterance is being input.

Of the display frames 404A to 404K, 404A to 404J show utterances for which the “blank of a certain duration (silence)” described above has already elapsed. For 404K, on the other hand, output of the recognition result is still in progress: the speech recognition unit 322 has determined that the speaker's utterance is still continuing. In the figure, the part of the utterance that has already been recognized is displayed, but the recognition result for the utterance may instead be displayed all at once after the break appears.

FIG. 5 is a diagram showing an example of processing from voice input to distribution of proofreading results according to the embodiment of the present invention. The steps of the flowchart in FIG. 5 are executed by the CPU 201 of the speech recognition server 101 and the CPUs 201 of the information processing terminals 102a to 102c.

In step S501, the speaker's utterance is received through a microphone or the like connected to the information processing terminal 102a and converted into voice data.

In step S502, the information processing terminal 102a transmits the voice data to the speech recognition server 101, which receives it in step S503.

In step S504, the speech recognition server 101 converts the speaker's utterance in the voice data into a character string by speech recognition. The recognized character string is delimited by utterance as described above and is further delimited into identifiable linguistic units such as morphemes. The speech recognition result includes not only the character string but also a certainty factor indicating how probable the speech recognition unit 322 estimates the result to be correct. When the result is divided into linguistic units such as morphemes, each morpheme may be tagged with a certainty factor and a detailed part of speech. Part-of-speech tagging by morphological analysis is explained with an example in FIG. 10. The grammar taught in schools uses coarse categories such as “proper noun”, whereas in information processing proper nouns may be classified more finely, for example into “person name” and “place name”. Morphological analysis and speech recognition are well-known techniques, so a detailed description is omitted.

In step S505, the speech recognition server 101 transmits the character string resulting from the conversion in step S504 to the information processing terminals 102. When a plurality of information processing terminals 102 are connected in the system, the character string is transmitted to all of them, not only to the information processing terminal 102a that input the utterance. It may also be sent to the information processing terminal 102a that the speaker used to input the voice data so that the speaker can check the recognition result. Each information processing terminal 102 receives the character string in step S506.

In step S507, the speech recognition server 101 stores the speech recognition result in the recognition result storage unit 320. The format in which the recognition result is stored is described in detail with reference to FIG. 6.

FIG. 6 is a diagram showing an example of the data format of recognition results and their certainty factors according to the embodiment of the present invention. In this example, the recognition results are assumed to be stored in the structure of the recognition result information 600.

601A to 601J are data corresponding to utterances A to J in FIG. 4. For each utterance segment described above, the character string output by the speech recognition unit 322 is stored in the recognized character string 603. 602A to 602J are the certainty factors corresponding to the utterances 601A to 601J. In addition to the recognized character string 603, each recognition result consists of the morpheme notation 604 described later and the certainty factor 605 of the recognition result for each morpheme.
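
One way to model the recognition result information 600 in code is sketched below. The field names and the helper function are hypothetical and only mirror the items named above (601, 602, 603, 604, 605); the actual storage layout is the one shown in FIG. 6.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    surface: str       # morpheme notation (604)
    certainty: float   # certainty factor of this morpheme's recognition (605)


@dataclass
class RecognitionResult:
    utterance_id: str      # e.g. "A" for entry 601A
    recognized_text: str   # recognized character string (603)
    certainty: float       # certainty factor for the utterance (602)
    morphemes: List[Morpheme] = field(default_factory=list)


def least_certain_morpheme(result):
    """Return the morpheme with the lowest certainty factor, or None if there are none."""
    return min(result.morphemes, key=lambda m: m.certainty, default=None)
```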

The morpheme notation 604 cells with a dark background (such as 606) are described later with reference to FIGS. 8 to 10. Misrecognition of these parts makes the text especially hard for the reader to understand, so they are used to determine which parts to proofread first.

In step S508, the speech recognition server 101 performs management tasks such as managing the one or more data items stored in the recognition result storage unit 320 by the processing up to step S507 when a new utterance is input, and determining the proofreading priority. That is, it manages the recognition result information 600 in FIG. 6. This processing is described in detail with reference to FIGS. 7 and 8.

Asynchronously with the processing in the speech recognition server 101, the proofreader's information processing terminal 102b presents the character string received in step S506 to the proofreader on its display device and, in step S509, accepts the proofreader's proofreading work. Proofreading work means editing the character strings corresponding to utterances in accordance with the identifiable priority order displayed on the display device of the information processing terminal 102b. The screen during proofreading is described later with reference to FIG. 11. When proofreading work starts in step S509, the terminal notifies the speech recognition server 101, and the correction state of the data stored in the recognition result storage unit 320 is changed to “in progress”.

In step S510, the information processing terminal 102b transmits the character string resulting from the completed proofreading. In step S511, the speech recognition server 101 receives it and updates the data stored in the recognition result storage unit 320. At that time, the correction state is changed to “completed” and the correction necessity to “not required”.

In step S512, the proofreading result distribution unit 327 of the speech recognition server 101 transmits the proofread character string, that is, the character string in which the misrecognized portions have been corrected, to the information processing terminals 102.

The proofreader's information processing terminal 102b that corrected the error already displays the correct character string at the time of correction, but as a design choice the correct character string may also be transmitted to that terminal 102b itself. In the flowchart of FIG. 5, the proofread character string is distributed to the information processing terminals 102 via the speech recognition server 101, but it may instead be distributed directly from the proofreading terminal 102b to the other information processing terminals 102. This difference is merely a design choice, and direct distribution is also within the scope of the claims of the present invention.

In step S513, the information processing terminal 102 receives the proofread character string and replaces the “character string containing misrecognition” already displayed on its display device with the “proofread character string”.

The display frames 404A to 404K in FIG. 4 may be separate editing targets for each utterance, or may together form a single editing target. To prevent multiple proofreaders from proofreading the same display frame 404 at the same time, a display frame 404 being proofread on one information processing terminal 102b may be made uneditable on the other information processing terminals 102b. The bottom display frame 404 in FIG. 4 may also be made uneditable, because speech recognition has not yet reached a segment boundary and its character string is still being updated. These are merely design choices.
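One way to keep two proofreaders from editing the same display frame is a per-frame ownership table on the server side. The sketch below is an assumption about how such exclusion could be implemented, not the disclosed mechanism; `FrameLocks`, the frame identifiers, and the terminal identifiers are hypothetical, and the sketch is single-threaded.

```python
from typing import Dict


class FrameLocks:
    """Tracks which proofreader terminal currently owns each display frame."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}

    def try_acquire(self, frame_id: str, terminal_id: str) -> bool:
        # setdefault stores terminal_id only if the frame has no owner yet,
        # and returns whoever owns the frame after the call.
        owner = self._owner.setdefault(frame_id, terminal_id)
        return owner == terminal_id

    def release(self, frame_id: str, terminal_id: str) -> None:
        # Only the current owner may release the frame.
        if self._owner.get(frame_id) == terminal_id:
            del self._owner[frame_id]


locks = FrameLocks()
first = locks.try_acquire("404E", "terminal-1")   # first proofreader gets the frame
second = locks.try_acquire("404E", "terminal-2")  # second proofreader is refused
```

A real server handling concurrent requests would additionally need to guard the table with a mutex or delegate to a transactional store.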

FIG. 7 is a diagram showing an example of a flowchart explaining the processing from analysis of the speech recognition result to prioritization for proofreading (step S508 in FIG. 5) according to the embodiment of the present invention. Each step of the flowchart of FIG. 7 is executed by the CPU 201 of the speech recognition server 101.

In step S701, it is checked whether the recognition result for the speech data of a new utterance has been registered in the recognition result storage unit 320. Specifically, assuming that entries up to 601J in FIG. 6 existed at the previous check, it is checked whether the next entry, 601K, has been newly added. If one has been registered (“Yes”), the process proceeds to step S702. If not (“No”), the process proceeds to step S704.

In step S702, morphological analysis is performed on the character string of the newly added speech recognition result. As shown in the example of FIG. 10, this step splits the character string into morphemes and assigns a part of speech to each, producing a morpheme string. If the speech recognition result itself already carries parts of speech from morphological analysis, step S702 is unnecessary and is omitted.

In step S703, individual names are extracted from the morpheme string. Individual name extraction is a well-known technique, disclosed for example in Japanese Patent Application Laid-Open No. 2002-288190, so a detailed description is omitted.

In step S704, a proofreading priority is set for those recognized character strings (for example, 601A to 601J in FIG. 6) that have not yet been proofread. Details are described later with reference to FIGS. 8 and 9.

In step S705, if the speech recognition system is still running (“Yes”), the process returns to step S701. If it has finished (“No”), the processing of the flowchart of FIG. 7 is completed and control returns to the flowchart of FIG. 5, that is, to the state in which step S508 of FIG. 5 has finished.

FIG. 8 is a diagram showing an example of a flowchart explaining the prioritization processing (step S704 in FIG. 7) according to the embodiment of the present invention. Each step of the flowchart of FIG. 8 is executed by the CPU 201 of the speech recognition server 101.

Steps S801 to S808 form a loop over the results stored in the recognition result storage unit, that is, over the recognition results for all utterance speech data (for example, ten utterances in the case of 601A to 601J in FIG. 6).

In step S802, one speech recognition result is selected. Specifically, 601A to 601J are taken one at a time, in order from the first.

In step S803, it is determined whether the priority of the selected speech recognition result still needs to be determined. The process branches on two checks: whether the result has already been proofread, and whether it satisfies the post-utterance elapsed condition 901 of FIG. 9. The two checks are combined with OR, so if either holds the result is “Yes” (no priority is needed) and the process proceeds to step S804. If neither holds, the result is “No” and the process proceeds to step S805.

Of the two conditions, whether the result has already been proofread is explained in detail with reference to FIG. 10 (which adds status information to some of the recognition results of FIG. 6 as an example). When the flowchart of FIG. 8 (that is, step S704 of FIG. 7) is first executed after a segment of speech has been recognized, the “correction necessity” field of that utterance in FIG. 10 is still blank because nothing has been decided yet, so the condition is not satisfied (“No”). For a recognition result that has already been proofread, the “correction necessity” field of FIG. 10 was set to “not required” when proofreading finished, as explained for S510 above, so the condition is satisfied (“Yes”). However, this is a design choice, and a recognition result that has been proofread once may also be re-prioritized. In that case, the field is not set to “not required” in S510.

Next, the post-utterance elapsed condition 901 is explained. The intent of this condition is to judge that going back to proofread an utterance is no longer useful once a certain period is considered to have passed since the utterance was completed. This is explained using the three examples listed in 901 of FIG. 9.

The post-utterance elapsed condition 901 describes how to determine that a certain time has passed since an utterance was made. The conditions shown in FIG. 9 are only examples; any other method of determining the passage of time is also included in the present invention. The examples are explained one at a time.

Example 1 measures the actual time elapsed since the boundary at which each utterance, such as A to J in the utterance example 400 of FIG. 4, is considered to have been completed. In this example, an utterance needs no proofreading once 180 seconds or more have passed since it ended. In the example of FIG. 10, the elapsed time is stored in the “elapsed time” field.

Example 2 counts characters rather than time. After an utterance is considered complete and is segmented, it is marked as not requiring proofreading once 500 or more characters of subsequent utterances have been presented as recognition results. In terms of 603 in FIG. 6, once the total number of characters in 603B and later reaches 500, no priority is calculated for utterance 601A and it no longer requires proofreading.

Example 3 judges by what the reader can see. On the reader's information processing terminal 102, speech recognition result strings usually stop being displayed as time passes. For example, on the speech recognition result display screen 401 of FIGS. 4 and 11, utterance segments are displayed in chronological order from the top; when the screen is full, the newest is added at the bottom row, and the top row (the text of the oldest utterance) scrolls up and off the screen (for example, the portion inside the dotted line 1101 in FIG. 11).

Even with a different display method in which items are not shown in chronological order, so that what stays on the screen and what disappears does not depend on age, correcting errors in an item that has disappeared is pointless because no reader can read it. Such items can therefore also be treated as not requiring proofreading.

Three examples are given here, but other methods may be used, as may combinations of these conditions (AND or OR).
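The three example conditions of 901 can be sketched as a single predicate. The 180-second and 500-character thresholds come from the examples above; the OR combination is just one of the permitted combinations, and the names `UtteranceAge` and `is_expired` are hypothetical.

```python
from dataclasses import dataclass

MAX_ELAPSED_SECONDS = 180  # Example 1 threshold
MAX_FOLLOWING_CHARS = 500  # Example 2 threshold


@dataclass
class UtteranceAge:
    elapsed_seconds: float  # time since the utterance was segmented
    following_chars: int    # characters recognized in later utterances
    visible: bool           # still shown on the reader's screen (Example 3)


def is_expired(age: UtteranceAge) -> bool:
    """Return True when proofreading the utterance is no longer useful.

    Assumption: the three example conditions are combined with OR.
    """
    return (
        age.elapsed_seconds >= MAX_ELAPSED_SECONDS
        or age.following_chars >= MAX_FOLLOWING_CHARS
        or not age.visible
    )
```

Switching to an AND combination would only require replacing the `or` operators, which is why the thresholds are kept as separate constants.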

In step S804, the “correction necessity” field of the information shown in FIG. 10 is set to “not required” so that the result is not proofread.

In step S805, in the proofreading status, “correction necessity” is set to “required” to indicate that proofreading is needed, and “correction state” is set to “not yet” to indicate that proofreading has not been done.

As already explained, when proofreading starts in step S509 of FIG. 5, the “correction state” is changed to “in progress”; when proofreading finishes and the result is sent to the speech recognition server 101, step S511 changes the correction state to “completed” and the correction necessity to “not required”.
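The status updates made in steps S804, S805, S509, and S511 can be summarized as a small set of transitions on the two fields of FIG. 10. This is a sketch only: the `Status` class, the function names, and the English field values are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Status:
    necessity: Optional[str] = None  # blank, "required", or "not required"
    state: Optional[str] = None      # blank, "not yet", "in progress", "completed"


def mark_expired(s: Status) -> None:
    """S804: the result is already proofread or too old to be worth correcting."""
    s.necessity = "not required"


def mark_pending(s: Status) -> None:
    """S805: the result still needs proofreading."""
    s.necessity = "required"
    s.state = "not yet"


def start_proofreading(s: Status) -> None:
    """S509: a proofreader has started editing the result."""
    s.state = "in progress"


def finish_proofreading(s: Status) -> None:
    """S511: the server has received the proofread text."""
    s.state = "completed"
    s.necessity = "not required"


status = Status()          # blank right after recognition
mark_pending(status)
start_proofreading(status)
finish_proofreading(status)
```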

Next, in step S806, whether to recalculate the certainty factor is determined, for example according to 902 in FIG. 9. Three conditions are listed in 902 as examples, but the determination is not limited to them.

For example, Example 1 of 902 checks whether the selected speech recognition result contains a morpheme with a part of speech requiring confirmation, or an individual name. In B of FIG. 10, for instance, a “numeral” is included, and numerals are registered as a part of speech requiring confirmation in 903 of FIG. 9. In general, numerals and specific patterns containing numbers represent things such as company sales, contract amounts, and dates, so an error means that information important to the reader is not conveyed reliably. E of FIG. 10 contains a numerical expression (1002) obtained by individual name extraction. Expressions composed of multiple morphemes that name a specific person or organization, or that express a quantity, are likewise individual names that must be confirmed to be free of errors (904 in FIG. 9).

The second example in 902 is the case where the speech recognition result contains many morphemes with particularly low certainty factors, and the third is the case where the certainty factor of the recognition result for the whole utterance is low. When recognition certainty is low, the result is likely to contain many misrecognized morphemes, so its proofreading priority is raised for a reason different from Example 1, where individually important information is present.

The certainty factor is recalculated according to processing such as morphological analysis and individual name extraction and the rules described in 902 and elsewhere. Recalculation methods are given as examples in the certainty factor recalculation method 905. That is, if the above processing finds important information, the certainty factor of the recognition result is changed, which changes its proofreading priority. For example, 905 describes how to recalculate the certainty factor when the result contains a word registered in the parts of speech requiring confirmation 903 or information specified by the individual name extraction condition 904 (Examples 1 and 2 of 905).
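The recalculation can be illustrated with a short sketch. The actual formulas of 905 appear only in FIG. 9 and are not reproduced in the text, so the multiplicative penalties below are purely hypothetical stand-ins; the only detail taken from the description is that “numeral” is a part of speech requiring confirmation.

```python
from dataclasses import dataclass
from typing import List

CONFIRM_POS = {"numeral"}  # stands in for 903; "numeral" is from the description
POS_PENALTY = 0.5          # hypothetical factor, not taken from FIG. 9
NAME_PENALTY = 0.5         # hypothetical factor, not taken from FIG. 9


@dataclass
class Morpheme:
    surface: str
    pos: str
    certainty: float
    is_individual_name: bool = False  # result of individual name extraction (904)


def recalculate(certainty: float, morphemes: List[Morpheme]) -> float:
    """Lower the utterance certainty when important information is present."""
    adjusted = certainty
    if any(m.pos in CONFIRM_POS for m in morphemes):
        adjusted *= POS_PENALTY
    if any(m.is_individual_name for m in morphemes):
        adjusted *= NAME_PENALTY
    return adjusted


plain = [Morpheme("today", "noun", 0.9)]
with_number = [Morpheme("300", "numeral", 0.9)]
```

Because a lower certainty factor means a higher proofreading priority, lowering the value for utterances that carry numbers or names moves them toward the front of the proofreader's queue.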

Morphological analysis is not shown in the flowchart of FIG. 8, but speech recognition results are often already divided into morphemes and may include parts of speech. If they do not, parts of speech may be assigned separately by morphological analysis or another method (such as a dictionary).

The same applies to individual name extraction: it may be performed as part of the embodiment of the present invention, or the speech recognition result may already include individual names extracted on the speech recognition side.

In step S807, a calculation is performed to change the proofreading priority according to the time elapsed since the utterance ended. In the determination of step S803 and Example 1 of 901, results for which a certain time had passed were treated as not requiring proofreading; this step handles recognition results for which that time has not yet passed. For a recognition result that has not yet timed out (Example 1 of 901) or is still displayed on the screen (Example 3 of 901), the closer it is to becoming “not required”, the less time remains for proofreading, so its priority needs to be raised. The formula in Example 3 of 905 lowers the certainty factor of a recognition result as more time passes.
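A time-based adjustment of this kind might look like the following. The formula of Example 3 of 905 is not given in the text, so the linear decay here is an assumption made only to show the idea; the 180-second limit is borrowed from Example 1 of 901, and `decay_certainty` is a hypothetical name.

```python
def decay_certainty(
    certainty: float,
    elapsed_seconds: float,
    limit_seconds: float = 180.0,
) -> float:
    """Scale the certainty factor down as the proofreading deadline approaches.

    Assumption: linear decay that reaches zero at limit_seconds.
    """
    remaining = max(0.0, 1.0 - elapsed_seconds / limit_seconds)
    return certainty * remaining


fresh = decay_certainty(0.8, 0)      # just segmented: certainty unchanged
halfway = decay_certainty(0.8, 90)   # half of the time limit has passed
```

Any monotonically decreasing function of the elapsed time would serve the same purpose; the linear form is simply the easiest to reason about.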

In step S809, based on the recalculated certainty factors, the results whose proofreading is “required” are sorted by certainty factor and presented on the display device of the information processing terminal 102 so that results with lower certainty factors are proofread first.
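The sorting step can be sketched as follows. The tuple layout, the function name `assign_priorities`, and the sample certainty values are assumptions for illustration; the only behavior taken from the description is that results marked “required” are ordered by ascending certainty factor.

```python
from typing import Dict, List, Tuple

# Each entry: (utterance id, recalculated certainty factor, correction necessity)
Entry = Tuple[str, float, str]


def assign_priorities(results: List[Entry]) -> Dict[str, int]:
    """Rank results that still require proofreading, lowest certainty first."""
    pending = [(uid, c) for uid, c, necessity in results if necessity == "required"]
    pending.sort(key=lambda item: item[1])  # stable sort, ascending certainty
    return {uid: rank for rank, (uid, _) in enumerate(pending, start=1)}


# Hypothetical values, chosen so that E and I come out as priorities 1 and 2.
sample = [
    ("A", 0.90, "not required"),
    ("E", 0.20, "required"),
    ("G", 0.60, "required"),
    ("I", 0.35, "required"),
]
priorities = assign_priorities(sample)
```

Results marked “not required” receive no priority at all, which matches the idea that expired or already proofread utterances drop out of the queue.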

This completes the description of the processing of the flowchart of FIG. 8. Here the certainty factor itself was changed according to certain rules, but this is not essential. For example, a separate value recording the amount of “deduction” (such as a negative score) may be used instead. Changing the certainty factor is only an example and a design choice.

The processing of FIG. 8 determines the proofreading priorities. FIG. 10 shows an example in which recognition result E has priority 1 and recognition result I has priority 2. The proofreader decides what to proofread first based on this identifiable information. Alternatively, editing may be restricted so that results can be edited only in order from the highest priority.

FIG. 11 is a diagram showing an example of a user interface that displays speech recognition results according to the embodiment of the present invention. It is essentially the same as FIG. 4, except for the following points.

1103e displays the mark “中” (proofreading in progress), indicating that one of the proofreaders is proofreading this recognition result. 1103g to 1103j display priorities 1 to 4. This lets the proofreader identify the order in which results should be proofread.

1103k is marked “現” (current) because the speech recognition result of the utterance currently being spoken has been recognized only partway and is displayed as such. This display frame may be editable, or it may be controlled so that it cannot be proofread until the utterance is segmented and the next frame, 1103l, is displayed.

This completes the description of the present invention with reference to the drawings.

The configurations and contents of the various data described above are not limited to those shown, and it goes without saying that they may take various forms depending on the application and purpose.

Several embodiments have been described above. The present invention can be embodied as, for example, a system, an apparatus, a method, a computer program, or a recording medium. Specifically, it may be applied to a system composed of multiple devices or to an apparatus consisting of a single device.

The computer program of the present invention is a computer program that enables a computer to execute the processing methods of the flowcharts shown in FIGS. 5, 7, and 8, and the storage medium of the present invention stores such a computer program. The computer program of the present invention may be a separate computer program for each processing method of each apparatus in FIGS. 5, 7, and 8.

As described above, it goes without saying that the object of the present invention is also achieved by supplying a recording medium on which a computer program realizing the functions of the above embodiments is recorded to a system or apparatus, and having the computer (or CPU or MPU) of that system or apparatus read and execute the computer program stored on the recording medium.

In this case, the computer program itself read from the recording medium realizes the novel functions of the present invention, and the recording medium storing that computer program constitutes the present invention.

Recording media that can be used to supply the computer program include, for example, a flexible disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, DVD-ROM, magnetic tape, nonvolatile memory card, ROM, EEPROM, silicon disk, and solid state drive.

It also goes without saying that the present invention covers not only the case where the functions of the above embodiments are realized by the computer executing the computer program it has read, but also the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the computer program, and the functions of the above embodiments are realized by that processing.

Furthermore, it goes without saying that the present invention covers the case where the computer program read from the recording medium is written to a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on that board or unit performs part or all of the actual processing based on the instructions of the computer program code, and the functions of the above embodiments are realized by that processing.

The present invention may be applied to a system composed of multiple devices or to an apparatus consisting of a single device. It goes without saying that the present invention can also be applied where it is achieved by supplying a computer program to a system or apparatus. In this case, the system or apparatus can obtain the effects of the present invention by reading a recording medium that stores the computer program for achieving the present invention.

Furthermore, the system or apparatus can obtain the effects of the present invention by downloading and reading the computer program for achieving the present invention from a server, database, or the like on a network by means of a communication program.

All configurations that combine the above embodiments and their modifications are also included in the present invention.

101 speech recognition server
102 information processing terminal
320 recognition result storage unit
321 speech data reception unit
322 speech recognition unit
323 recognition result transmission unit
324 recognition result management unit
325 priority determination unit
326 proofreading result reception unit
327 proofreading result distribution unit

Claims

An information processing apparatus comprising storage means for storing a text for each utterance obtained by speech recognition of speech data of a plurality of utterances and a first certainty factor that is the likelihood of the speech recognition result for the text, the apparatus comprising:
specifying means for specifying the text for each utterance to be corrected based on the first certainty factor;
display control means for displaying the specified text for each utterance to be corrected in an identifiable manner; and
update means for accepting a correction to the text for each utterance and updating the text for each utterance for which the correction has been accepted.

The information processing apparatus according to claim 1, wherein the storage means stores morphemes obtained by analyzing the text for each utterance and a second certainty factor that is the likelihood of the speech recognition result for each morpheme, and
the specifying means specifies the text for each utterance to be corrected based on the second certainty factor of the morphemes included in the text for each utterance.

The information processing apparatus according to claim 2, wherein the specifying unit specifies a text for each utterance to be corrected based on whether or not a morpheme included in the text for each utterance has a predetermined part of speech.

The information processing apparatus according to claim 2, further comprising determination means for determining whether an obtained morpheme is an individual name,
wherein the specifying means specifies the text for each utterance to be corrected depending on whether a morpheme included in the text for each utterance is an individual name.

The display control means highlights the morpheme when the morpheme included in the text for each utterance has a predetermined part of speech or an individual name. The information processing apparatus described in 1.

The information processing apparatus according to claim 1, wherein the specifying unit specifies a text for each utterance to be corrected according to an elapsed time from the utterance.

An information processing system including a display device and an information processing apparatus comprising storage means for storing a text for each utterance obtained by speech recognition of speech data of a plurality of utterances and a first certainty factor that is the likelihood of the speech recognition result for the text, the system comprising:
specifying means for specifying the text for each utterance to be corrected based on the first certainty factor;
display control means for causing the display device to display the specified text for each utterance to be corrected in an identifiable manner; and
correction accepting means for accepting, at the display device, a correction to the text for each utterance.

A control method for an information processing apparatus comprising storage means for storing a text for each utterance obtained by speech recognition of speech data of a plurality of utterances and a first certainty factor that is the likelihood of the speech recognition result for the text, the method comprising:
a specifying step in which specifying means specifies the text for each utterance to be corrected based on the first certainty factor;
a display control step in which display control means displays the specified text for each utterance to be corrected in an identifiable manner; and
an update step in which update means accepts a correction to the text for each utterance and updates the text for each utterance for which the correction has been accepted.

A program executable in an information processing apparatus comprising storage means for storing a text for each utterance obtained by speech recognition of speech data of a plurality of utterances and a first certainty factor that is the likelihood of the speech recognition result for the text, the program causing the information processing apparatus to function as:
specifying means for specifying the text for each utterance to be corrected based on the first certainty factor;
display control means for displaying the specified text for each utterance to be corrected in an identifiable manner; and
update means for accepting a correction to the text for each utterance and updating the text for each utterance for which the correction has been accepted.