JP7231806B2

JP7231806B2 - Information processing device, information processing system, control method, and program

Info

Publication number: JP7231806B2
Application number: JP2018026120A
Authority: JP
Inventors: 敬己下郡山
Original assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Current assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Priority date: 2018-02-16
Filing date: 2018-02-16
Publication date: 2023-03-02
Anticipated expiration: 2038-02-16
Also published as: JP2019144310A

Description

The present invention relates to a technique for assisting the proofreading of errors in speech recognition results. More specifically, when character strings containing recognition errors are proofread, the technique presents priorities so that the important portions are proofread first, thereby improving the accuracy of the information made accessible to readers who see the final result.

Research and development of speech recognition, which converts human utterances into character strings, has been carried out for many years. In practice it has been used for purposes such as subtitling television broadcasts and helping deaf people understand what others say.

In recent years in particular, progress in machine learning and related fields has brought recognition accuracy to a practical level. Even so, accuracy is not yet sufficient, and attention must be paid to how the speaker speaks. For example, recognition accuracy varies greatly with the distance between the microphone and the mouth and with the clarity of the speech.

For this reason, software has also been developed for correcting the character strings of speech recognition results on a personal computer or the like.

A service called "PC summary transcription" has long existed, in which an operator listens to a speaker and types what is said into a personal computer for deaf people to read. With the advent of software that uses speech recognition technology, it has become possible to assist the person doing the typing.

Such software generally segments the speech at points such as pauses in the utterance and displays the speech recognition results (character strings) on the screen in chronological order, one segment at a time. A proofreader trained in PC summary transcription then corrects those character strings.

However, speech is usually faster than the work of correcting recognition results, that is, faster than keyboard input on an information processing apparatus, so the burden on the proofreader who performs the corrections has not yet been sufficiently reduced.

Patent Literature 1 provides a speech recognition result editing device that assists a proofreader in correcting misrecognitions in speech recognition.

In the technique of Patent Literature 1, an utterance input from a microphone is converted by a speech recognition unit into a character string in which a reliability is assigned to each word. Not only the word with the highest reliability but also words that satisfy a predetermined condition, for example words whose certainty factor is at or above a certain value, are converted into character strings and included in the set of speech recognition results. The proofreader can therefore select the correct recognition result from several candidate words and correct the text (Patent Literature 1, paragraph 0013 and FIG. 8).

Patent Literature 1: Japanese Unexamined Patent Application Publication No. 2017-040856 (JP 2017-040856 A)

However, a single utterance contains multiple words, so with the technique of Patent Literature 1 the number of recognition results becomes enormous when there are many words, and it becomes difficult to display them clearly on a display device. In PC summary transcription in particular, corrections are made by judging from context so that the text is easy to understand, so the proofreader may need to check the character strings of utterances preceding the one being corrected, yet there may be no screen area left to show them. In other words, simply displaying candidate recognition results does not always give the proofreader efficient assistance.

In addition, when there are not enough proofreaders, it may be impossible to correct every speech recognition error. In that case, the portions to correct must be chosen by priority. However, the technique described in Patent Literature 1 merely decides that a speech recognition result left uncorrected for a certain period will not be corrected, regardless of its importance (the timeout processing described in paragraph 0022 and FIG. 5 of Patent Literature 1). As a result, even important information is discarded or displayed uncorrected once that period has elapsed.

Ideally, the character strings corresponding to all utterances would be corrected by a proofreader, but in practice this is sometimes impossible. When it is impossible, the proofreader must be made to correct in an appropriate order of priority, so that the people who view the corrected result (for example, deaf people) receive information that is as easy to understand as possible.

In view of the above problems, an object of the present invention is to provide a technique that displays, for a series of text data that is the recognition result of speech data segmented from continuous speech data, information indicating the priority with which the text should be proofread, so that a user can make corrections efficiently.

An information processing apparatus that acquires a series of text data based on recognition of speech data segmented from continuous speech data under a predetermined condition, and a first certainty factor indicating the likelihood of the recognition of the series of text data, the apparatus comprising: output control means for causing a display device to display, for the series of text data of each segmented piece of speech data, information indicating the priority with which the series of text data should be proofread, based on the first certainty factor of the series of text data; and receiving means for receiving a correction of the text data for which the information is displayed, in order to update that text data, wherein the output control means causes the display device to display the information indicating the priority with which the series of text data should be proofread based on the first certainty factor as adjusted according to whether character data included in the series of text data is of a predetermined part of speech.
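As a rough illustration of the idea in this paragraph, the following Python sketch recomputes a per-utterance certainty in which morphemes of certain parts of speech count more heavily, then orders utterances so that the least certain one is proofread first. The weight table, the weighted-mean formula, and all names (`POS_WEIGHTS`, `Morpheme`, `adjusted_certainty`, `proofreading_order`) are assumptions made for this example; the text above does not specify how the first certainty factor is adjusted.

```python
from dataclasses import dataclass

# Assumed weights: parts of speech listed here count more heavily when an
# utterance's certainty is recomputed. The values are illustrative only.
POS_WEIGHTS = {"proper_noun": 2.0, "noun": 1.5}


@dataclass
class Morpheme:
    surface: str      # text of the morpheme
    pos: str          # part-of-speech tag
    certainty: float  # recognition certainty in [0, 1]


def adjusted_certainty(morphemes):
    """Weighted mean of morpheme certainties (expects a non-empty list)."""
    weights = [POS_WEIGHTS.get(m.pos, 1.0) for m in morphemes]
    total = sum(w * m.certainty for w, m in zip(weights, morphemes))
    return total / sum(weights)


def proofreading_order(utterances):
    """Return utterance ids sorted so the lowest adjusted certainty comes first."""
    scores = {uid: adjusted_certainty(ms) for uid, ms in utterances.items()}
    return sorted(scores, key=scores.get)
```

With this weighting, a poorly recognized proper noun pulls an utterance's score down more than a poorly recognized particle would, so that utterance moves toward the front of the proofreading queue.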

According to the present invention, it becomes possible to provide a technique that displays, for a series of text data that is the recognition result of speech data segmented from continuous speech data, information indicating the priority with which the text should be proofread, so that a user can make corrections efficiently.

FIG. 1 is a diagram showing an example of a system configuration according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an example of the hardware configuration of an information processing apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of a functional configuration according to the embodiment of the present invention.
FIG. 4 is a diagram showing an example of a screen displaying speech recognition results according to the embodiment of the present invention.
FIG. 5 is a diagram showing an example of processing from speech input to distribution of the proofreading result according to the embodiment of the present invention.
FIG. 6 is a diagram showing an example of the data format of recognition results and their certainty factors according to the embodiment of the present invention.
FIG. 7 is a diagram showing an example of a flowchart describing processing from analysis of speech recognition results to prioritization for proofreading according to the embodiment of the present invention.
FIG. 8 is a diagram showing an example of a flowchart describing the prioritization processing for proofreading according to the embodiment of the present invention.
FIG. 9 is a diagram showing an example of a storage unit that stores information used in the prioritization processing according to the embodiment of the present invention.
FIG. 10 is a diagram showing an example of the result of recalculating the certainty factors of speech recognition results according to the embodiment of the present invention.
FIG. 11 is a diagram showing an example of a user interface that displays speech recognition results according to the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a diagram showing an example of a system configuration according to an embodiment of the present invention.
<System configuration example 1>

The system according to the embodiment of the present invention comprises a speech recognition server 101 and information processing terminals 102 (102a for the speaker, 102b for the proofreader, and 102c for the reader). A user inputs speech through a microphone 104 connected to the information processing terminal 102a. The information processing terminal 102a transmits the speech to the speech recognition server 101, which converts it into a character string and sends it to the information processing terminals 102a to 102c, where it is displayed and presented to the users. That is, each of the information processing terminals 102a to 102c may handle both speech input and character string output. Among the information processing terminals 102 that produce output, the reader terminal 102c and the proofreader terminal 102b, described later, may be one and the same terminal, or each may be a dedicated terminal. Output is made to a display device connected to the information processing terminal 102, but a configuration using a projector or the like is also a system configuration according to the embodiment of the present invention. When a projector is used, there may be only a single information processing terminal 102 for the speaker, and all readers may read the character string of the speech recognition result projected onto a screen by a projector connected to the information processing terminal 102a. In that case, the speaker or another user may act as proofreader and correct misrecognitions directly on the speaker's information processing terminal 102a.

Furthermore, the speech recognition server 101 may reside in the cloud, in which case users of this system may use the functions of the speech recognition server 101, described later, as a cloud service. Configurations that use such services are also system configurations according to the embodiment of the present invention.
<System configuration example 2>

The information processing terminals 102a to 102c described in configuration example 1 handle both input and output, but they may instead be divided into input-only and output-only terminals.
<System configuration example 3>

The speech recognition server 101 and the information processing terminals 102a to 102c may be housed in the same enclosure. That is, one of the information processing terminals 102a to 102c in FIG. 1 may have speech recognition software installed and may double as the speech recognition server 101.

FIG. 2 is a block diagram showing an example of a hardware configuration applicable to the speech recognition server 101 and the information processing terminals 102a to 102c according to the embodiment of the present invention.

As shown in FIG. 2, the speech recognition server 101 and the information processing terminals 102a to 102c each have a configuration in which a CPU (Central Processing Unit) 201, a RAM (Random Access Memory) 202, a ROM (Read Only Memory) 203, an input controller 205, a video controller 206, a memory controller 207, a communication I/F controller 208, and the like are connected via a system bus 204. The CPU 201 comprehensively controls the devices and controllers connected to the system bus 204.

The ROM 203 or the external memory 211 stores a BIOS (Basic Input/Output System) and an OS (Operating System), which are control programs for the CPU 201, as well as the various programs, described later, needed to realize the functions executed by each server or PC. It also stores the information needed to carry out the present invention. The external memory may be a database.

The RAM 202 functions as the main memory, work area, and the like of the CPU 201. When executing processing, the CPU 201 loads the necessary programs and the like from the ROM 203 or the external memory 211 into the RAM 202 and realizes various operations by executing the loaded programs.

The input controller 205 controls input from a keyboard (KB) 209 and from a pointing device such as a mouse (not shown).

The video controller 206 controls display on a display device such as the display 210. The display device may be a liquid crystal display or the like. These are used by an administrator as needed.

The memory controller 207 controls access to the external memory 211, such as an external storage device (hard disk (HD)) that stores a boot program, various applications, font data, user files, edit files, various data, and the like, a flexible disk (FD), or a CompactFlash (registered trademark) memory connected to a PCMCIA (Personal Computer Memory Card International Association) card slot via an adapter.

The communication I/F controller 208 connects to and communicates with external devices via a network and executes communication control processing on the network. For example, communication using TCP/IP (Transmission Control Protocol/Internet Protocol) is possible.

The CPU 201 enables display on the display 210 by, for example, rasterizing outline fonts into a display information area in the RAM 202. The CPU 201 also enables user instructions via a mouse cursor (not shown) or the like on the display 210.

The various programs, described later, for realizing the present invention are recorded in the external memory 211 and are executed by the CPU 201 after being loaded into the RAM 202 as needed.
FIG. 3 is a diagram showing an example of a functional configuration according to the embodiment of the present invention.

The information processing terminal 102 may provide the functions of the speaker terminal 102a, the proofreader terminal 102b, and the reader terminal 102c on separate terminals or on a shared terminal, so they are described here without distinction.

The speech acquisition unit 311 takes in the speaker's utterance as speech data from a microphone built into or connected to the information processing terminal 102, and the speech data transmission unit 312 transmits it to the speech recognition server 101.

The speech recognition server 101 passes the speech data received by the speech data reception unit 321 to the speech recognition unit 322, which converts it into a character string; the recognition result transmission unit 323 sends that character string back to the information processing terminal 102 as the recognition result. The recognition result management unit 324 also stores the recognition result in the recognition result storage unit 320.

The information processing terminal 102 receives the character string at the recognition result reception unit 313 and presents it to the reader (the user of the information processing terminal 102) by displaying it on the display unit 314.

The priority determination unit 325 assigns priorities to the recognition results stored in the recognition result storage unit 320 so that a proofreader who uses the information processing terminal 102 to correct speech recognition errors can identify the character strings that should be proofread first.

The prioritized character strings are transmitted to the information processing terminal 102, whose display unit 314 displays them so that the proofreader can identify the priority for proofreading, as described above. The recognition result proofreading unit 315 provides a function that lets the proofreader correct errors in the recognition result by editing the character string.

The proofreading result is transmitted to the speech recognition server 101 by the proofreading result transmission unit 316 of the information processing terminal 102 and received by the proofreading result reception unit 326 of the speech recognition server 101, which updates the recognition result stored in the recognition result storage unit 320.

The updated recognition result is also distributed by the proofreading result distribution unit 327 to information processing terminals 102 other than the one the proofreader used for proofreading, and is presented so that readers can see the proofreading result.
FIG. 4 is a diagram showing an example of a screen displaying speech recognition results according to the embodiment of the present invention.

Utterance example 400 is an example of what a speaker says at a meeting, lecture, or the like. The speaker need not be a single person. In a meeting, for example, people other than the chairperson may speak, and at a lecture there may be utterances from a moderator or questioners in addition to the lecturer.

Utterance example 400 is divided into segments A to K, which are the breaks in the speaker's utterances. A break occurs, for example, when there is a blank of a certain length (a period of silence) in the speech.

Correspondingly, the speech recognition result display screen 401 is also divided into parts corresponding to A to K (the display frames 404A to 404K). The speech recognition unit 322 of the speech recognition server 101 separates the recognized character strings, for example by detecting the silence described above. The results are stored in the recognition result storage unit 320 in this separated form, and the display unit 314 of the information processing terminal 102 displays them separately so that they are easy for readers to follow. This is only an example; the frames 404 need not be separated, which is merely a design matter. It suffices that the priority for proofreading, described later, is displayed recognizably.
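The segmentation by silence described here can be pictured with a small Python sketch. It assumes the audio has already been reduced to one level value per frame, and the names and defaults (`split_on_silence`, `threshold`, `min_silence`) are hypothetical; the text does not say how the speech recognition unit 322 actually detects silence.

```python
def split_on_silence(frames, threshold=0.1, min_silence=3):
    """Split a sequence of per-frame levels into segments of frame indices.

    A run of at least `min_silence` consecutive frames below `threshold`
    closes the current segment. Returns a list of (start, end) index pairs,
    with `end` exclusive.
    """
    segments = []
    start = None      # index where the current segment began
    quiet = 0         # length of the current run of quiet frames
    for i, level in enumerate(frames):
        if level >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:
                # Close the segment just before the quiet run started.
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        # Input ended while a segment was still open.
        segments.append((start, len(frames) - quiet))
    return segments
```

In a live system, the segment that is still open when the input ends would correspond to an utterance whose recognition is still in progress, rather than a finished one.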

The start button 402 is pressed when an utterance is to be recognized by the speech recognition server 101. The system configuration diagram (FIG. 1) shows multiple information processing terminals 102 with microphones connected to them, and the button specifies which terminal's microphone is being spoken into. Depending on the system design, speech input may be allowed on only one information processing terminal 102, or speech may be input on multiple information processing terminals 102 at the same time. There may also be an end button 403, the counterpart of the start button 402, for notifying the information processing terminal 102 that speech is no longer being input.

Of 404A to 404K, frames 404A to 404J show utterances for which the "blank of a certain length (silence)" has already elapsed. Frame 404K, on the other hand, shows a recognition result whose output is still in progress: the speech recognition unit 322 judges that the speaker's utterance is still continuing. In the figure, the part of the utterance that has already been recognized is displayed, but the recognition result for the utterance may instead be displayed all at once after the break appears.

FIG. 5 is a diagram showing an example of processing from speech input to distribution of the proofreading result according to the embodiment of the present invention. Each step of the flowchart of FIG. 5 is executed by the CPU 201 of the speech recognition server 101 and the CPUs 201 of the information processing terminals 102a to 102c.

In step S501, the speaker's utterance is received through a microphone or the like connected to the information processing terminal 102a and converted into speech data.

In step S502, the information processing terminal 102a transmits the speech data to the speech recognition server 101, which receives it in step S503.

In step S504, the speech recognition server 101 converts the speaker's utterance in the speech data into a character string by speech recognition. The recognized character string is separated into utterance units as described above, and is further separated identifiably into linguistic units such as morphemes. The speech recognition result carries not only the character string but also a certainty factor indicating how probable the speech recognition unit 322 estimates the recognition result to be. When the result is divided into linguistic units such as morphemes, each morpheme may be tagged with a certainty factor and a detailed part of speech. Part-of-speech tagging by morphological analysis is described with an example in FIG. 10. The school grammar taught in schools uses coarse categories such as "proper noun", whereas in information processing a proper noun may be classified more finely, for example as "person name" or "place name". Morphological analysis and speech recognition are well-known techniques, so a detailed description is omitted.

In step S505, the speech recognition server 101 transmits the character string resulting from the conversion in step S504 to the information processing terminal 102. When multiple information processing terminals 102 are connected in the system, the character string is transmitted to all of them, not only to the information processing terminal 102a on which the utterance was input. It may also be transmitted to the information processing terminal 102a that the speaker used to input the speech data, so that the speaker can check the speech recognition result. The information processing terminal 102 receives the character string in step S506.

In step S507, the speech recognition server 101 stores the speech recognition result in the recognition result storage unit 320. The format in which recognition results are stored is described in detail with reference to FIG. 6.

FIG. 6 is a diagram showing an example of the data format of recognition results and their certainty factors according to the embodiment of the present invention. As an example, the recognition results are assumed to be stored in the structure of the recognition result information 600.

601A to 601J are the data corresponding to utterances A to J in FIG. 4. Following the utterance breaks described above, the character string output by the speech recognition unit 322 is stored in the recognized character string 603. 602A to 602J are the certainty factors corresponding to utterances 601A to 601J. In addition to the recognized character string 603, each recognition result consists of the morpheme notation 604, described later, and the certainty factor 605 of the recognition result of each morpheme.
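One way to picture the recognition result information 600 is as a pair of small record types. The Python sketch below mirrors the reference numerals in the text (602, 603, 604, 605); the class names, field names, and the helper `least_certain_morpheme` are assumptions for illustration, and the actual storage layout in the recognition result storage unit 320 is not specified here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MorphemeResult:
    notation: str     # morpheme notation (604)
    certainty: float  # certainty of this morpheme's recognition (605)


@dataclass
class UtteranceResult:
    utterance_id: str     # "A" to "J", matching 601A to 601J
    certainty: float      # utterance-level certainty (602)
    recognized_text: str  # recognized character string (603)
    morphemes: List[MorphemeResult] = field(default_factory=list)


def least_certain_morpheme(result):
    """Return the morpheme with the lowest certainty in one utterance."""
    return min(result.morphemes, key=lambda m: m.certainty)
```

A helper such as `least_certain_morpheme` is only a convenience for inspecting a record; the prioritization itself is described with FIGS. 7 and 8.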

The entries of the morpheme notation 604 shown with a dark background (such as 606) are described later with reference to FIGS. 8 to 10. Because misrecognition of these entries in particular makes the text hard for readers to understand, they are used to decide which portions to proofread preferentially.

In step S508, the speech recognition server 101 performs management such as managing the one or more data items stored in the recognition result storage unit 320 by the processing up to step S507 as new utterances are input, and determining the priority of proofreading. That is, it manages the recognition result information 600 of FIG. 6. This processing is described in detail with reference to FIGS. 7 and 8.

Asynchronously with the processing in the speech recognition server 101, the proofreader's information processing terminal 102b presents the character string received in step S506 to the proofreader on its display device and, in step S509, accepts the proofreader's proofreading work. Proofreading work means editing the character strings corresponding to the utterances in accordance with the identifiable priority displayed on the display device of the information processing terminal 102b. The screen during proofreading is described later with reference to FIG. 11. When proofreading work starts in step S509, the speech recognition server 101 is notified, and the correction state of the data stored in the recognition result storage unit 320 is changed to "under proofreading".

ステップＳ５１０においては、前述の校正が終了した結果の文字列を情報処理端末１０２ｂから送信し、ステップＳ５１１においては音声認識サーバ１０１がその結果を受信して、認識結果記憶部３２０に格納されているデータを更新する。その際に修正状態は“完了”、修正要否は“不要”に変更する。 In step S510, the information processing terminal 102b transmits the character string resulting from the proofreading. In step S511, the speech recognition server 101 receives that result and updates the data stored in the recognition result storage unit 320. At that time, the correction state is changed to "completed" and the correction necessity to "unnecessary".
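
The status changes driven by the proofreader in steps S509 to S511 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name, field names, and English status strings are hypothetical stand-ins for the "correction state" and "correction necessity" fields.

```python
from dataclasses import dataclass


@dataclass
class RecognitionRecord:
    """One stored recognition result (all field names are hypothetical)."""
    text: str
    correction_needed: str = ""             # "", "required", or "unnecessary"
    correction_state: str = "not_started"   # "not_started", "in_progress", "completed"


def start_proofreading(record: RecognitionRecord) -> None:
    # Step S509: the terminal reports that editing has begun.
    record.correction_state = "in_progress"


def finish_proofreading(record: RecognitionRecord, corrected_text: str) -> None:
    # Step S511: the server stores the corrected string and closes the record.
    record.text = corrected_text
    record.correction_state = "completed"
    record.correction_needed = "unnecessary"
```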

ステップＳ５１２において音声認識サーバ１０１の校正結果配布部３２７は、校正が完了した文字列、すなわち音声認識での誤認識部分が校正された文字列を、情報処理端末１０２に送信する。 In step S512, the proofreading result distribution unit 327 of the speech recognition server 101 transmits the proofread character string, that is, the character string in which the misrecognized portions of the speech recognition result have been corrected, to the information processing terminals 102.

前記誤りを校正した校正者用の情報処理端末１０２ｂは、校正した時点ですでに正しい文字列が表示されているが、設計事項として当該情報処理端末１０２ｂ、すなわち自分自身にも正しい文字列を送信してもよい。また、図５のフローチャートでは校正が終了された文字列は、いったん音声認識サーバ１０１を経由して情報処理端末１０２に配布されているが、校正用の情報処理端末１０２ｂから直接、他の情報処理端末１０２に配布してもよい。この違いは設計事項に過ぎず、直接配布する場合も本願発明の請求項の範囲に含むものとする。 The proofreader's information processing terminal 102b that corrected the error already displays the correct character string at the time of proofreading, but as a design choice the correct character string may also be sent to that terminal 102b itself. In the flowchart of FIG. 5, the proofread character string is distributed to the information processing terminals 102 via the speech recognition server 101, but it may instead be distributed directly from the proofreader's terminal 102b to the other information processing terminals 102. This difference is merely a design choice, and direct distribution also falls within the scope of the claims of the present invention.

ステップＳ５１３においては、情報処理端末１０２は、校正された文字列を受信し、情報処理端末１０２の表示装置に既に表示されている“誤認識を含む文字列”を“校正された文字列”に置き換える。 In step S513, the information processing terminal 102 receives the proofread character string and replaces the "character string containing misrecognition" already shown on its display device with the "proofread character string".

なお図４の表示枠４０４Ａ～Ｋが発言ごとに別々の編集対象となっていてもよいし、合わせて一つの編集対象であってもよい。また同時に１つの表示枠４０４を複数の校正者が同時に校正しないように、１つの情報処理端末１０２ｂで校正中の表示枠４０４は、他の情報処理端末１０２ｂでは校正できないようになっていてもよい。また図４の一番下の表示枠４０４は、音声認識が区切れていない文字列の表示が継続しているため、校正できないようになっていてもよい。これらはあくまで設計事項である。 The display frames 404A to 404K in FIG. 4 may each be a separate editing target per utterance, or may together form a single editing target. To prevent several proofreaders from proofreading one display frame 404 at the same time, a display frame 404 that is being proofread on one information processing terminal 102b may be made uneditable on the other information processing terminals 102b. The bottom display frame 404 in FIG. 4 may also be made uneditable, because it still shows a character string whose speech recognition has not yet reached a segment boundary. These are merely design choices.
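
One way to keep two proofreaders out of the same display frame is a simple ownership table. The sketch below is only an assumption about how such exclusion could be done; the class `FrameLocks`, its methods, and the frame and terminal identifiers are hypothetical and do not appear in the specification.

```python
class FrameLocks:
    """Tracks which proofreader terminal currently edits each display frame."""

    def __init__(self):
        self._owner = {}  # frame id -> terminal id

    def try_acquire(self, frame_id: str, terminal_id: str) -> bool:
        # Grant the frame only if nobody else already holds it.
        owner = self._owner.setdefault(frame_id, terminal_id)
        return owner == terminal_id

    def release(self, frame_id: str, terminal_id: str) -> None:
        # Only the current holder may release the frame.
        if self._owner.get(frame_id) == terminal_id:
            del self._owner[frame_id]
```

A server handling several terminals concurrently would also need to guard the table itself, which this single-threaded sketch leaves out.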

図７は、本発明の実施形態にかかわる音声認識結果の解析と校正のための優先順位付けまでの処理（図５のステップＳ５０８）を説明するフローチャートの一例を示す図である。図７のフローチャートの各ステップは、音声認識サーバ１０１上のＣＰＵ２０１で実行される。 FIG. 7 shows an example of a flowchart explaining the processing from analysis of speech recognition results through prioritization for proofreading (step S508 in FIG. 5) according to the embodiment of the present invention. Each step of the flowchart of FIG. 7 is executed by the CPU 201 of the speech recognition server 101.

ステップＳ７０１においては、新しい発話の音声データの認識結果が認識結果記憶部３２０に登録されたか否かをチェックする。具体的には図６の６０１Ｊまでが前回のチェックで存在したとして、次の６０１Ｋが新たに追加されたか否かをチェックする。登録された場合（“Ｙｅｓ”の場合）には、ステップＳ７０２に進む。登録されていない場合（“Ｎｏ”の場合）には、ステップＳ７０４に進む。 In step S701, it is checked whether the recognition result for the voice data of a new utterance has been registered in the recognition result storage unit 320. Specifically, assuming that entries up to 601J in FIG. 6 existed at the previous check, it is checked whether the next entry 601K has been newly added. If it has been registered ("Yes"), the process proceeds to step S702. If not ("No"), the process proceeds to step S704.

ステップＳ７０２においては、新たに追加された音声認識結果の文字列に対して形態素解析を行う。ステップＳ７０２の処理により図１０の例に示されているように文字列を区分して品詞が付与されることになる。これにより形態素列を生成する。ただし音声認識結果自体に形態素解析による品詞が付与されている場合にはステップＳ７０２は不要であり省略する。 In step S702, morphological analysis is performed on the newly added character string of the speech recognition result. By the processing in step S702, the character string is classified and the part of speech is given as shown in the example of FIG. This generates a morpheme string. However, if the speech recognition result itself is given a part of speech by morphological analysis, step S702 is unnecessary and omitted.

ステップＳ７０３においては、前記形態素列から個体名を抽出する。個体名抽出の技術については、特開２００２－２８８１９０などにより周知の技術であるため詳細の説明は割愛する。 In step S703, an individual name is extracted from the morpheme string. The technology for extracting the individual name is well-known technology such as Japanese Patent Laid-Open No. 2002-288190, so detailed description is omitted.

ステップＳ７０４においては、認識結果である文字列（たとえば図６の６０１Ａ～Ｊ）のうち、校正が未処理であるものに対して、校正すべき優先順位を設定する。詳細は図８、図９を用いて後述する。 In step S704, a proofreading priority is set for those recognition-result character strings (for example, 601A to 601J in FIG. 6) that have not yet been proofread. Details are described later with reference to FIGS. 8 and 9.

ステップＳ７０５においては、音声認識システムの実行が継続している場合（“Ｙｅｓ”の場合）には、ステップＳ７０１に戻る。音声認識システムの実行が終了している（“Ｎｏ”の場合）には図７のフローチャートの処理を完了し、図５のフローチャートの処理に戻る。すなわち図５のステップＳ５０８を終わった状態に戻る。 In step S705, if execution of the speech recognition system continues ("Yes"), the process returns to step S701. If execution of the speech recognition system has ended ("No"), the process of the flowchart of FIG. 7 is completed, and the process of the flowchart of FIG. 5 is returned to. In other words, the process returns to the state where step S508 in FIG. 5 has ended.
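
The control flow of FIG. 7 (steps S701 to S705) can be summarized in a short loop. This is a sketch under the assumption that each step is available as a callable; the function and parameter names are hypothetical.

```python
def run_management_loop(fetch_new_result, analyze, extract_names,
                        prioritize, is_running):
    """Control flow of FIG. 7; every callable is a hypothetical stand-in."""
    while True:
        result = fetch_new_result()        # S701: None when nothing new was registered
        if result is not None:
            morphemes = analyze(result)    # S702: morphological analysis
            extract_names(morphemes)       # S703: individual-name extraction
        prioritize()                       # S704: reached on both branches
        if not is_running():               # S705: stop when recognition has ended
            break
```

Note that prioritization runs even when no new utterance arrived, which matches the "No" branch of S701 going straight to S704.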

図８は、本発明の実施形態にかかわる優先順位付けの処理（図７のステップＳ７０４）を説明するフローチャートの一例を示す図である。図８のフローチャートの各ステップは、音声認識サーバ１０１上のＣＰＵ２０１で実行される。 FIG. 8 is a diagram showing an example of a flowchart for explaining the prioritization process (step S704 in FIG. 7) according to the embodiment of the present invention. Each step of the flow chart of FIG. 8 is executed by the CPU 201 on the speech recognition server 101 .

ステップＳ８０１からステップＳ８０８は、認識結果記憶部に格納されている結果、すなわち全発話音声データに基づき音声認識された結果（例えば図６の６０１Ａ～Ｊなら１０の発話データ）に対する繰り返し処理である。 Steps S801 to S808 are repeated processes for the results stored in the recognition result storage unit, that is, the results of speech recognition based on all speech data (for example, 10 speech data for 601A to 601J in FIG. 6).

ステップＳ８０２においては、１つの音声認識結果に着目する。具体的には前記６０１Ａ～Ｊの先頭から順にそのうちの１つに着目する。 In step S802, attention is paid to one speech recognition result. Specifically, focus on one of the 601A to 601J in order from the beginning.

ステップＳ８０３においては、着目中の音声認識結果の優先順位を判定する必要があるか否かを判定する。既に校正済みであるか否か、または図９の発話後経過条件９０１に記載されている条件を満たすか否か、により分岐する。この判定は、２種類の判定のＯＲ条件となっているため、いずれかの条件が満たされていれば“Ｙｅｓ”となり、ステップＳ８０４にすすむ。何れの条件も満たされていない場合には“Ｎｏ”となり、ステップＳ８０５に進む。 In step S803, it is determined whether or not it is necessary to determine the priority of the speech recognition result of interest. It branches depending on whether or not it has already been proofread, or whether or not the condition described in the post-utterance condition 901 in FIG. 9 is satisfied. Since this determination is an OR condition of two types of determination, if either condition is satisfied, the result is "Yes", and the process proceeds to step S804. If none of the conditions are satisfied, the result is "No", and the process proceeds to step S805.

前記２つの条件のうち校正済みであるか否かについて、具体的に図１０（図６の一部の認識結果を例として認識状態を付与している）を用いて詳細に説明する。ある一区切りの発話を音声認識した後に最初に図８のフローチャート（即ち図７のステップＳ７０４）を実行する際には、当該発話の図１０の“修正要否”はまだ何も判断していないため記載がない空白状態であるため条件を満たさない（“Ｎｏ”）。既に校正済みの認識結果については、前記Ｓ５１０の説明にて、校正終了後に図１０の“修正要否”を“不要”としているため条件を満たす（“Ｙｅｓ”）。ただしこの部分は設計事項であり、一度校正終了した認識結果も優先順位をつけ直す対象としてもよい。その場合には、Ｓ５１０において“不要”とはしない。 Of the two conditions, whether proofreading has already been completed is explained in detail using FIG. 10 (which attaches recognition status to some of the recognition results of FIG. 6 as an example). When the flowchart of FIG. 8 (that is, step S704 of FIG. 7) is first executed after an utterance segment has been recognized, the "correction necessity" field of FIG. 10 for that utterance is still blank because nothing has been decided, so the condition is not satisfied ("No"). For recognition results that have already been proofread, the condition is satisfied ("Yes") because, as explained for S510, "correction necessity" in FIG. 10 is set to "unnecessary" once proofreading ends. This is a design choice, however, and recognition results that have been proofread once may also be re-prioritized. In that case, the field is not set to "unnecessary" in S510.

また前記２つのうち発話後経過条件９０１を条件とする場合を説明する。この条件の意図は、発話が完了した後、時系列的に一定の期間が経過してしまっていると思われるものは、遡って校正しても有用ではないという判断をするためのものである。具体的に図９の９０１に記載している３つの例を用いて説明する。 Next, the case where the post-utterance elapsed condition 901 is used is explained. The intent of this condition is to judge that it is not useful to go back and proofread utterances for which a certain period appears to have passed since the utterance was completed. This is explained concretely using the three examples listed in 901 of FIG. 9.

発話後経過条件９０１は、発話されてから一定時間が経過した、ということをどのように判定するかという条件が記載されている。図９に記載の条件はあくまで例であり、これら３つの方法以外であっても時間経過を判定するいかなる方法であれば本願発明に含むものとする。例を1つずつ説明する。 The post-utterance condition 901 describes how to determine that a certain period of time has passed since the utterance. The conditions shown in FIG. 9 are merely examples, and any method for determining the passage of time other than these three methods is included in the present invention. Let's take an example one by one.

例１は、図４の発話例４００におけるＡ～Ｊなど各発話において、その発話が完了した、と見なされる区切りからの実際の時間を測定するものである。例では、終了してから１８０秒以上経過したものは、校正を不要とする条件になっている。経過時間は図１０の例では“経過時間”フィールドに格納されている。 Example 1 measures, for each utterance such as A to J in the utterance example 400 of FIG. 4, the actual time elapsed since the boundary at which the utterance was regarded as complete. In the example, utterances for which 180 seconds or more have passed since completion no longer require proofreading. In the example of FIG. 10, the elapsed time is stored in the "elapsed time" field.

例２は、時間ではないが文字数でカウントするものであり、発話が完了した、と見なされ区切られた後、続く発話の文字が５００文字以上認識結果として提示されれば、その時点で校正不要とする。図６の６０３を用いて説明すると、６０３Ａの後に６０３Ｂ以降の文字数を合計して５００文字に達すれば、６０１Ａの発話の優先順位を計算せず校正不要となる。 Example 2 counts characters rather than time: once an utterance has been regarded as complete and segmented, it is marked as not requiring proofreading at the point where 500 or more characters of subsequent utterances have been presented as recognition results. Using 603 of FIG. 6, if the characters from 603B onward that follow 603A total 500, the priority of utterance 601A is not calculated and proofreading becomes unnecessary.

例３は、読者からの見え方により判断するものである。音声認識結果の文字列は、読者の情報処理端末１０２の上では時間が経過するに従って、表示されなくなることが通常である。例えば図４、図１１の音声認識結果表示画面４０１は発話の区切りで上から時系列順に表示され、画面が一杯になると最新のものが最下行に追加され、そのため最上行のもの（最も古い発話を文字列化したもの）は、スクロールされて上方に消えていく、というユーザインタフェースが考えられる（例えば図１１の１１０１点線内の部分）。 Example 3 judges by what the reader can see. The character strings of speech recognition results usually stop being displayed on the reader's information processing terminal 102 as time passes. For example, one conceivable user interface is that the speech recognition result display screen 401 of FIGS. 4 and 11 shows results per utterance segment in chronological order from the top; when the screen is full, the newest result is added at the bottom row, so the top row (the character string of the oldest utterance) scrolls up and disappears (for example, the portion inside dotted line 1101 in FIG. 11).

異なる方法であって、時系列順ではなく、即ち新旧に拘わらず画面に残るもの／画面から消えていくものがある場合であっても、消えてしまったものの誤りを校正しても何れの読者も読むことが出来ないため無意味である。従って校正を不要としていくことが考えられる。 Even with a different method, where results are not kept in chronological order, that is, where some results remain on the screen and others disappear regardless of age, correcting errors in results that have disappeared is pointless because no reader can read them. Such results can therefore be treated as not requiring proofreading.

ここでは３つの例を挙げたが、これら以外の方法であってもよい。またこれらの組み合わせ条件（ＡＮＤ条件、ＯＲ条件）であってもよい。 Although three examples are given here, other methods may be used. Also, these conditions may be combined (AND condition, OR condition).
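
The three example conditions and their AND/OR combination can be sketched as small predicates. The thresholds of 180 seconds and 500 characters come from the examples above; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UtteranceView:
    elapsed_seconds: float    # time since the utterance was judged complete
    chars_since: int          # characters recognized in later utterances
    visible_on_screen: bool   # still shown on a reader's display


def elapsed_by_time(u, limit_seconds=180):
    return u.elapsed_seconds >= limit_seconds    # Example 1


def elapsed_by_chars(u, limit_chars=500):
    return u.chars_since >= limit_chars          # Example 2


def elapsed_by_visibility(u):
    return not u.visible_on_screen               # Example 3


def is_expired(u, checks=(elapsed_by_time, elapsed_by_chars, elapsed_by_visibility),
               combine=any):
    # combine=any gives an OR condition; combine=all gives an AND condition.
    return combine(check(u) for check in checks)
```

Passing a different `checks` tuple is one way to accommodate methods other than these three.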

ステップＳ８０４においては、校正を不要とするため図１０に格納されている情報の“修正要否”を“不要”とする。 In step S804, "correction necessity" in the information stored in FIG. 10 is set to "unnecessary" so that proofreading is not required.

ステップＳ８０５においては、校正のステータスにおける“修正要否”を校正する必要がある場合として“要”、“修正状態”をまだ校正されていないとして“未”とする。 In step S805, in the proofreading status, "correction necessity" is set to "required" to indicate that proofreading is needed, and "correction state" is set to "not yet" to indicate that the result has not been proofread.

ちなみに既に説明している通り、図５のステップＳ５０９において校正を開始した段階で、“修正状態”を“校正中”、校正が終了し校正結果が音声認識サーバ１０１に送信された段階でステップＳ５１１にて修正状態は“完了”、修正要否は“不要”に変更される。 As already explained, when proofreading starts in step S509 of FIG. 5 the "correction state" is changed to "under proofreading"; when proofreading ends and the result has been sent to the speech recognition server 101, step S511 changes the correction state to "completed" and the correction necessity to "unnecessary".
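
The branch in steps S803 to S805 can be sketched for a single recognition result as follows. The dictionary keys and status strings are hypothetical. Keeping an existing state such as "in_progress" is an assumption of this sketch; the text only says the state is set to "not yet".

```python
def update_correction_status(record: dict, expired: bool) -> bool:
    """Steps S803 to S805 for one recognition result.

    Returns True when the result still needs a priority calculation.
    """
    already_proofread = record.get("correction_needed") == "unnecessary"
    if already_proofread or expired:                 # S803: OR of the two conditions
        record["correction_needed"] = "unnecessary"  # S804
        return False
    record["correction_needed"] = "required"         # S805
    # Assumption: do not overwrite a state that was set earlier (e.g. in S509).
    record.setdefault("correction_state", "not_started")
    return True
```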

次にステップＳ８０６においては、例えば図９の９０２に従って、確信度を再計算するか否かを判定する。９０２には例として３つの条件を記載しているがこの条件に限定されるものではない。 Next, in step S806, it is determined whether or not to recalculate the confidence factor, for example, according to 902 in FIG. Three conditions are described in 902 as examples, but the conditions are not limited to these.

例えば９０２の例１では、着目中の音声認識結果に要確認品詞の形態素や個体名が含まれるかを判定する。例えば図１０のＢにおいては、“数詞”が含まれており、これが図９の９０３において要確認品詞として登録されている。一般に数詞あるいは数値を含む特定のパターンは、会社の売上げや契約上の金額、日付などになるため、誤りがあった場合に読者にとって重要な情報が保障されないことになる。また図１０のＥには個体名抽出の結果である数的表現（１００２）が含まれている。複数の形態素から構成される、特定の人物、組織、数的な表現を含む場合も誤りがないことを確認必要な個体名である（図９の９０４）。 For instance, example 1 of 902 determines whether the speech recognition result of interest contains a morpheme with a part of speech requiring confirmation, or an individual name. In B of FIG. 10, for example, a "numeral" is included, and this is registered in 903 of FIG. 9 as a part of speech requiring confirmation. Numerals, or specific patterns containing numeric values, typically express company sales, contractual amounts, dates, and the like, so an error means that information important to the reader is not conveyed reliably. E of FIG. 10 contains a numerical expression (1002) obtained by individual-name extraction. Expressions composed of multiple morphemes that denote a specific person, organization, or numerical expression are likewise individual names that must be confirmed to be free of errors (904 in FIG. 9).

９０２の２つめの例としては、音声認識結果の中に特に確信度が低い形態素が多く含まれている場合、３つめの例としては、発話全体の認識結果の確信度が低い場合を挙げている。認識の確信度が低い場合には、誤認識された形態素が多く含まれている可能性が高く、従って個別に重要な情報がある例１とは異なる意味で校正の優先順位が高くなる。 The second example in 902 is the case where the speech recognition result contains many morphemes with particularly low confidence, and the third example is the case where the confidence of the recognition result for the whole utterance is low. When recognition confidence is low, many misrecognized morphemes are likely to be included, so the proofreading priority becomes high for a reason different from example 1, where individually important information is present.

形態素解析／個体名抽出などの処理と、９０２などに記載されている規則に従って、確信度を再計算するものである。確信度の再計算方法は、例として確信度再計算方法９０５に記載されている。すなわち前述の処理で重要な情報が含まれていれば認識結果の確信度を変更することで校正の優先順位を変更するものである。例えば、要確認品詞９０３に登録されている単語、個体名抽出条件９０４で指定された情報がある場合に、どのように確信度を再計算するかが記載されている（９０５の例１，例２）。 Confidence is recalculated according to processing such as morphological analysis/individual name extraction and rules described in 902 and the like. The confidence recalculation method is described in the confidence recalculation method 905 as an example. That is, if important information is included in the above-described processing, the priority of proofreading is changed by changing the degree of certainty of the recognition result. For example, it describes how to recalculate the certainty when there are words registered in the part of speech to be confirmed 903 and information specified in the individual name extraction condition 904 (example 1 of 905, example 2).
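
The idea of lowering confidence when important content is present can be illustrated as below. The actual formulas of 905 (examples 1 and 2) are given only in FIG. 9 and are not reproduced in this text, so the penalty values, the watched part-of-speech set, and the function name are all hypothetical.

```python
WATCHED_POS = {"numeral"}   # stands in for table 903; contents are assumed
POS_PENALTY = 0.25          # hypothetical amount, not taken from 905
NAME_PENALTY = 0.25         # hypothetical amount, not taken from 905


def recalc_confidence(confidence, morphemes, entities):
    """Lower an utterance's confidence when important content is present.

    morphemes: list of (surface, part_of_speech) pairs
    entities:  list of extracted individual-name strings
    """
    adjusted = confidence
    if any(pos in WATCHED_POS for _, pos in morphemes):
        adjusted -= POS_PENALTY      # a part of speech requiring confirmation
    if entities:
        adjusted -= NAME_PENALTY     # an individual name was extracted
    return max(adjusted, 0.0)        # never drop below zero
```

Because lower confidence means higher proofreading priority, subtracting a penalty moves utterances with numerals or individual names toward the front of the queue.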

なお、ここに図８のフローチャート形態素解析の処理は記載していないが、音声認識結果自体が、形態素単位に分割されていることが多く、また品詞を音声認識結果の情報として含んでいてもよい。含んでいない場合には、形態素解析や他の方式（辞書を用いるなど）による品詞付けを別途行ってもよい。 Although the processing of morphological analysis in the flow chart of FIG. 8 is not described here, the speech recognition result itself is often divided into morpheme units, and the part of speech may be included as information of the speech recognition result. . If not included, morphological analysis or other methods (using a dictionary, etc.) may be used to assign parts of speech separately.

個体名抽出についても同様である。本発明の実施の形態の一部として含んでいてもよいし、音声認識側で個体名抽出した結果を音声認識結果として含んでいるものの何れであってもよい。 The same applies to individual name extraction. It may be included as a part of the embodiment of the present invention, or may include the result of individual name extraction on the voice recognition side as the voice recognition result.

ステップＳ８０７においては、発話が終わってからの時間によって校正の優先順位を変更するための計算を行う。ステップＳ８０３の判定および９０１の例１において、一定時間経過したものは校正不要としたが、ここではその一定時間が経過する前の認識結果に対する対応である。すなわち、例えば一定時間が経過していない（９０１の例１）、まだ画面内に表示されている（９０１の例３）認識結果であれば、校正が“不要”となる状態に近づいているものほど、校正のために残されたタイムリミットが少ないため優先順位を上げて校正させる必要がある。９０５の例３の式は時間が経過しているほどその認識結果の確信度を下げるものである。 In step S807, a calculation is performed to change the proofreading priority according to the time since the utterance ended. In the determination of step S803 and example 1 of 901, results for which a certain time has passed are treated as not requiring proofreading; this step instead handles recognition results for which that time has not yet passed. That is, for a recognition result for which the time has not yet elapsed (example 1 of 901) or which is still displayed on the screen (example 3 of 901), the closer it is to the state in which proofreading becomes "unnecessary", the less time remains for proofreading, so its priority needs to be raised. The formula in example 3 of 905 lowers the confidence of a recognition result the more time has elapsed.
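
A time-dependent adjustment of this kind might look like the following. The formula in example 3 of 905 is not reproduced in this text, so the linear scaling here is an assumed stand-in; only the 180-second limit is borrowed from example 1 of 901, and the function name is hypothetical.

```python
def time_adjusted_confidence(confidence, elapsed_seconds, limit_seconds=180.0):
    """Scale confidence down linearly as the deadline approaches (assumed formula)."""
    # Fraction of the proofreading window that is still left, clamped at zero.
    remaining = max(limit_seconds - elapsed_seconds, 0.0) / limit_seconds
    return confidence * remaining
```

With this choice, two utterances with the same recognition confidence are ordered so that the older one is proofread first.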

ステップＳ８０９においては、前述で確信度を再計算した結果を受けて、校正が“要”であるものに対して、確信度でソートを行い、確信度が低いものほど優先的に校正するよう情報処理端末１０２の表示装置に提示するものである。 In step S809, upon receiving the result of recalculation of the confidence as described above, the items for which proofreading is "required" are sorted according to the confidence, and information is provided to preferentially proofread items with a lower confidence. It is presented on the display device of the processing terminal 102 .
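
The ordering in step S809 amounts to filtering and sorting. In this sketch the record layout and the confidence values are hypothetical; only the rule "required items, lowest confidence first" is taken from the description.

```python
def proofreading_order(records):
    """Return ids of results that still need proofreading, lowest confidence first."""
    pending = [r for r in records if r["correction_needed"] == "required"]
    # sorted() is stable, so ties keep their original utterance order.
    return [r["id"] for r in sorted(pending, key=lambda r: r["confidence"])]


rows = [
    {"id": "E", "confidence": 0.25, "correction_needed": "required"},
    {"id": "A", "confidence": 0.10, "correction_needed": "unnecessary"},
    {"id": "I", "confidence": 0.50, "correction_needed": "required"},
    {"id": "G", "confidence": 0.75, "correction_needed": "required"},
]
order = proofreading_order(rows)
```

Here `order` lists E before I, and A is skipped because it no longer requires proofreading.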

以上で図８のフローチャートによる処理の説明を完了する。ここでは確信度を一定のルールに応じて変更したが、必ずしも確信度を変更する必要はない。例えばどの程度“減点”したかを記憶する別の数値（マイナス・スコアなど）を用いてもよい。確信度を変更したのはあくまで例であり、設計事項である。 This completes the explanation of the processing according to the flowchart of FIG. Although the degree of certainty is changed according to certain rules here, it is not always necessary to change the degree of certainty. For example, another numerical value (such as a negative score) may be used that stores how much "scores" have been "deducted". The change in confidence is only an example and is a matter of design.

図８の処理をしたことによって、校正すべき優先順位が決定し、図１０においては、例えば認識結果のＥが優先順位１、認識結果のＩが優先順位２となった例を記載している。校正者はこの識別可能な情報に基づき、校正する優先順位を判断する。あるいは、優先順位が高いものからしか編集できないように制御してもよい。 By performing the processing in FIG. 8, the priority order for proofreading is determined, and FIG. 10 shows an example in which the recognition result E has priority 1 and the recognition result I has priority 2, for example. . The proofreader determines the priority of proofreading based on this identifiable information. Alternatively, control may be performed so that only the items with the highest priority can be edited.

図１１は、本発明の実施形態に係る音声認識結果を表示するユーザインタフェースの一例を示すための図である。本質的には図４と同じ図であるが、次の点が異なる。 FIG. 11 is a diagram for showing an example of a user interface that displays speech recognition results according to the embodiment of the present invention. Although it is essentially the same diagram as FIG. 4, the following points are different.

１１０３ｅは、校正者のいずれかが、この認識結果を校正している旨を表す“中”（校正中）を表示している。また１１０３ｇ～１１０３ｊには優先順位１～４を表示している。これにより校正者は校正すべき優先順位を識別可能となる。 1103e displays "中" (under proofreading), indicating that one of the proofreaders is proofreading this recognition result. 1103g to 1103j display priorities 1 to 4. This lets proofreaders identify the order in which to proofread.

また１１０３ｋは現在発話中の音声認識結果が途中まで認識されその結果が表示されているため“現”と表示されている。この表示枠は校正可能であっても、発話が区切れ次の１１０３ｌが表示されるまでは校正できないように制御されていてもよい。 1103k is labeled "現" (current) because the speech recognition result of the utterance currently in progress has been recognized only partway and that partial result is displayed. Even if this display frame is otherwise editable, it may be controlled so that it cannot be proofread until the utterance is segmented and the next frame 1103l is displayed.
以上で、図面を用いた本願発明に関する説明を完了する。 This concludes the description of the present invention with reference to the drawings.

なお、上述した各種データの構成及びその内容はこれに限定されるものではなく、用途や目的に応じて、様々な構成や内容で構成されることは言うまでもない。 It goes without saying that the configuration and content of the various data described above are not limited to this, and may be configured in various configurations and content according to the application and purpose.

以上、いくつかの実施形態について示したが、本発明は、例えば、システム、装置、方法、コンピュータプログラムもしくは記録媒体等としての実施態様をとることが可能であり、具体的には、複数の機器から構成されるシステムに適用しても良いし、また、一つの機器からなる装置に適用しても良い。 Although several embodiments have been described above, the present invention can be embodied as, for example, systems, devices, methods, computer programs or recording media. It may be applied to a system composed of, or may be applied to an apparatus composed of one device.

また、本発明におけるコンピュータプログラムは、図５、図７、図８に示すフローチャートの処理方法をコンピュータが実行可能なコンピュータプログラムであり、本発明の記憶媒体は図５、図７、図８の処理方法をコンピュータが実行可能なコンピュータプログラムが記憶されている。なお、本発明におけるコンピュータプログラムは図５、図７、図８の各装置の処理方法ごとのコンピュータプログラムであってもよい。 The computer program of the present invention is a computer program that enables a computer to execute the processing methods of the flowcharts shown in FIGS. 5, 7, and 8, and the storage medium of the present invention stores a computer program that enables a computer to execute the processing methods of FIGS. 5, 7, and 8. The computer program of the present invention may be a separate computer program for each processing method of each device in FIGS. 5, 7, and 8.

以上のように、前述した実施形態の機能を実現するコンピュータプログラムを記録した記録媒体を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記録媒体に格納されたコンピュータプログラムを読出し実行することによっても、本発明の目的が達成されることは言うまでもない。 As described above, a recording medium recording a computer program that realizes the functions of the above-described embodiments is supplied to a system or apparatus, and a computer (or CPU or MPU) of the system or apparatus is stored in the recording medium. Needless to say, the object of the present invention can also be achieved by reading and executing the program.

この場合、記録媒体から読み出されたコンピュータプログラム自体が本発明の新規な機能を実現することになり、そのコンピュータプログラムを記憶した記録媒体は本発明を構成することになる。 In this case, the computer program itself read from the recording medium implements the novel functions of the present invention, and the recording medium storing the computer program constitutes the present invention.

コンピュータプログラムを供給するための記録媒体としては、例えば、フレキシブルディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ－ＲＯＭ、ＣＤ－Ｒ、ＤＶＤ－ＲＯＭ、磁気テープ、不揮発性のメモリカード、ＲＯＭ、ＥＥＰＲＯＭ、シリコンディスク、ソリッドステートドライブ等を用いることができる。 Examples of recording media for supplying computer programs include flexible disks, hard disks, optical disks, magneto-optical disks, CD-ROMs, CD-Rs, DVD-ROMs, magnetic tapes, non-volatile memory cards, ROMs, EEPROMs, A silicon disk, a solid state drive, or the like can be used.

また、コンピュータが読み出したコンピュータプログラムを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのコンピュータプログラムの指示に基づき、コンピュータ上で稼働しているＯＳ（オペレーティングシステム）等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 In addition, by executing a computer program read by a computer, not only the functions of the above-described embodiments are realized, but also the OS (operating system) etc. running on the computer based on the instructions of the computer program. Needless to say, a case where part or all of the actual processing is performed and the functions of the above-described embodiments are realized by the processing are included.

さらに、記録媒体から読み出されたコンピュータプログラムが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのコンピュータプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵ等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Furthermore, after the computer program read from the recording medium is written in the memory provided in the function expansion board inserted into the computer or the function expansion unit connected to the computer, the function is executed based on the instructions of the computer program code. Needless to say, a case where a CPU or the like provided in an expansion board or function expansion unit performs part or all of the actual processing and the processing implements the functions of the above-described embodiments.

また、本発明は、複数の機器から構成されるシステムに適用しても、１つの機器からなる装置に適用してもよい。また、本発明は、システムあるいは装置にコンピュータプログラムを供給することによって達成される場合にも適応できることは言うまでもない。この場合、本発明を達成するためのコンピュータプログラムを格納した記録媒体を該システムあるいは装置に読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。 Moreover, the present invention may be applied to a system composed of a plurality of devices or to an apparatus composed of a single device. Moreover, it goes without saying that the present invention can be applied to a case where it is achieved by supplying a computer program to a system or apparatus. In this case, by loading a recording medium storing a computer program for achieving the present invention into the system or apparatus, the system or apparatus can enjoy the effects of the present invention.

さらに、本発明を達成するためのコンピュータプログラムをネットワーク上のサーバ、データベース等から通信プログラムによりダウンロードして読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。 Furthermore, by downloading and reading out the computer program for achieving the present invention from a server, database, or the like on a network using a communication program, the system or apparatus can enjoy the effects of the present invention.
なお、上述した各実施形態およびその変形例を組み合わせた構成も全て本発明に含まれるものである。 It should be noted that all configurations obtained by combining the above-described embodiments and their modifications are also included in the present invention.

１０１音声認識サーバ
１０２情報処理端末
３２０認識結果記憶部
３２１音声データ受信部
３２２音声認識部
３２３認識結果送信部
３２４認識結果管理部
３２５優先順位決定部
３２６校正結果受信部
３２７校正結果配布部
101 speech recognition server
102 information processing terminal
320 recognition result storage unit
321 speech data reception unit
322 speech recognition unit
323 recognition result transmission unit
324 recognition result management unit
325 priority determination unit
326 proofreading result reception unit
327 proofreading result distribution unit

Claims

An information processing device for acquiring a series of text data based on recognition of speech data segmented from continuous speech data according to a predetermined condition and a first certainty factor indicating the likelihood of recognition of the series of text data,
Output for causing a display device to display information indicating a priority order in which the series of text data for each of the segmented voice data should be corrected based on the first certainty of the series of text data. a control means;
Receiving means for receiving correction of the text data in order to update the text data for which the information is displayed,
wherein the output control means causes the display device to display information indicating the priority order for proofreading the series of text data, based on the first certainty factor adjusted according to whether character data included in the series of text data is a predetermined part of speech.

An information processing device for acquiring a series of text data based on recognition of speech data segmented from continuous speech data according to a predetermined condition and a first certainty factor indicating the likelihood of recognition of the series of text data,
Output for causing a display device to display information indicating a priority order in which the series of text data for each of the segmented voice data should be corrected based on the first certainty of the series of text data. a control means;
Receiving means for receiving correction of the text data in order to update the text data for which the information is displayed,
wherein the output control means causes the display device to display information indicating the priority order for proofreading the series of text data, based on the first certainty factor adjusted according to a determination as to whether character data included in the series of text data indicates an individual name.

An information processing device for acquiring a series of text data based on recognition of speech data segmented from continuous speech data according to a predetermined condition and a first certainty factor indicating the likelihood of recognition of the series of text data,
Output for causing a display device to display information indicating a priority order in which the series of text data for each of the segmented voice data should be corrected based on the first certainty of the series of text data. a control means;
Receiving means for receiving correction of the text data in order to update the text data for which the information is displayed,
The information processing apparatus, wherein the output control means causes the display device to display information to the effect that correction of the series of text data is unnecessary based on the time elapsed since the utterance.

4. The information processing device according to any one of claims 1 to 3, wherein the series of text data is text data including at least one piece of character data obtained by recognizing speech data segmented at utterance breaks.

An information processing device and a display device for obtaining text data based on recognition of a series of speech data segmented from continuous speech data according to a predetermined condition, and a first certainty factor indicating the likelihood of recognition of the series of text data. An information processing system comprising
causing the display device to display information indicating a priority order for correcting the series of text data based on a first certainty factor of the series of text data, with respect to the series of text data for each of the divided voice data; an output control means;
Receiving means for receiving correction of the text data in order to update the text data for which the information is displayed,
wherein the output control means causes the display device to display information indicating the priority order for proofreading the series of text data, based on the first certainty factor adjusted according to whether character data included in the series of text data is a predetermined part of speech.

An information processing device and a display device for obtaining text data based on recognition of a series of speech data segmented from continuous speech data according to a predetermined condition, and a first certainty factor indicating the likelihood of recognition of the series of text data. An information processing system comprising
causing the display device to display information indicating a priority order for correcting the series of text data based on a first certainty factor of the series of text data, with respect to the series of text data for each of the divided voice data; an output control means;
Receiving means for receiving correction of the text data in order to update the text data for which the information is displayed,
wherein the output control means causes the display device to display information indicating the priority order for proofreading the series of text data, based on the first certainty factor adjusted according to a determination as to whether character data included in the series of text data indicates an individual name.

An information processing system comprising a display device and an information processing device that acquires text data based on recognition of a series of speech data segmented from continuous speech data according to a predetermined condition, together with a first certainty factor indicating the likelihood of recognition of the series of text data, the information processing system comprising:
output control means for causing the display device to display, for the series of text data corresponding to each piece of the segmented speech data, information indicating a priority order in which the series of text data should be proofread, based on the first certainty factor of the series of text data; and
receiving means for receiving a correction of the text data for which the information is displayed, in order to update that text data,
wherein the output control means causes the display device to display information indicating that proofreading of the series of text data is unnecessary, based on the time elapsed since the utterance.
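As a rough sketch of the elapsed-time condition in the claim above, a segment might be marked as no longer needing proofreading once a time limit has passed since it was uttered. The function name `proofreading_status`, the label strings, and the 30-second limit are hypothetical; the claim states only that the decision is based on the time elapsed since the utterance.

```python
from datetime import datetime, timedelta

# Hypothetical limit; the claim does not give a concrete value.
MAX_AGE = timedelta(seconds=30)


def proofreading_status(uttered_at: datetime, now: datetime,
                        max_age: timedelta = MAX_AGE) -> str:
    """Return the label to display for a segment uttered at `uttered_at`."""
    if now - uttered_at > max_age:
        # Too much time has passed for a correction to help the reader.
        return "proofreading unnecessary"
    return "proofread"


start = datetime(2018, 2, 16, 10, 0, 0)
recent = proofreading_status(start, start + timedelta(seconds=10))
stale = proofreading_status(start, start + timedelta(seconds=45))
```

Here `recent` is still shown as a proofreading candidate, while `stale` is flagged as not needing proofreading.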

A control method for an information processing device that acquires a series of text data based on recognition of speech data segmented from continuous speech data according to a predetermined condition, together with a first certainty factor indicating the likelihood of recognition of the series of text data, the control method comprising:
an output control step in which output control means causes a display device to display, for the series of text data corresponding to each piece of the segmented speech data, information indicating a priority order in which the series of text data should be proofread, based on the first certainty factor of the series of text data; and
a receiving step in which receiving means receives a correction of the text data for which the information is displayed, in order to update that text data,
wherein, in the output control step, the display device is caused to display the information indicating the priority order in which the series of text data should be proofread, based on the first certainty factor as adjusted according to whether character data included in the series of text data is a predetermined part of speech.

A program executable in an information processing device that acquires a series of text data based on recognition of speech data segmented from continuous speech data according to a predetermined condition, together with a first certainty factor indicating the likelihood of recognition of the series of text data, the program causing the information processing device to function as:
output control means for causing a display device to display, for the series of text data corresponding to each piece of the segmented speech data, information indicating a priority order in which the series of text data should be proofread, based on the first certainty factor of the series of text data; and
receiving means for receiving a correction of the text data for which the information is displayed, in order to update that text data,
wherein the output control means causes the display device to display the information indicating the priority order in which the series of text data should be proofread, based on the first certainty factor as adjusted according to whether character data included in the series of text data is a predetermined part of speech.